Results 1 - 20 of 27
1.
Theor Popul Biol ; 155: 67-76, 2024 02.
Article in English | MEDLINE | ID: mdl-38092137

ABSTRACT

Consider the diffusion process defined by the forward equation $u_t(t,x)=\tfrac{1}{2}\{x\,u(t,x)\}_{xx}-\alpha\{x\,u(t,x)\}_x$ for $t,x\ge 0$ and $-\infty<\alpha<\infty$, with an initial condition $u(0,x)=\delta(x-x_0)$. This equation was introduced and solved by Feller to model the growth of a population of independently reproducing individuals. We explore important coalescent processes related to Feller's solution. For any $\alpha$ and $x_0>0$ we calculate the distribution of the random variable $A_n(s;t)$, defined as the finite number of ancestors at a time $s$ in the past of a sample of size $n$ taken from the infinite population of a Feller diffusion at a time $t$ since its initiation. In a subcritical diffusion we find the distribution of population and sample coalescent trees from time $t$ back, conditional on non-extinction as $t\to\infty$. In a supercritical diffusion we construct a coalescent tree which has a single founder and derive the distribution of coalescent times.
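The forward equation in this abstract has drift $\alpha x$ and diffusion coefficient $x$, which corresponds to the stochastic differential equation $dX=\alpha X\,dt+\sqrt{X}\,dW$. The following is a minimal simulation sketch of that process using an Euler-Maruyama discretisation; it is not the authors' method, and the function names, step count and the clipping at zero are assumptions made for illustration.

```python
import math
import random


def simulate_feller(x0, alpha, t_end, n_steps=1000, rng=None):
    """Euler-Maruyama path of dX = alpha*X dt + sqrt(X) dW; returns X(t_end).

    Assumption: zero is treated as an absorbing state (extinction), and
    negative overshoots of the discretised path are clipped to zero.
    """
    rng = rng or random.Random()
    dt = t_end / n_steps
    x = float(x0)
    for _ in range(n_steps):
        if x <= 0.0:
            return 0.0  # population is extinct and stays extinct
        noise = rng.gauss(0.0, 1.0)
        x += alpha * x * dt + math.sqrt(x * dt) * noise
        x = max(x, 0.0)
    return x


def mean_population(x0, alpha, t):
    """Exact mean of the diffusion, E[X(t)] = x0 * exp(alpha * t)."""
    return x0 * math.exp(alpha * t)
```

Averaging `simulate_feller` over many paths and comparing with `mean_population` gives a quick sanity check of the discretisation; the clipping introduces a small bias when the population is close to zero.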

2.
Syst Biol ; 71(6): 1362-1377, 2022 10 12.
Article in English | MEDLINE | ID: mdl-35699529

ABSTRACT

How long does speciation take? The answer to this important question in evolutionary biology lies in the genetic difference not only among species, but also among lineages within each species. With the advance of genome sequencing in non-model organisms and the statistical tools to improve accuracy in inferring evolutionary histories among recently diverged lineages, we now have the lineage-level trees to answer these questions. However, we do not yet have an analytical tool for inferring speciation processes from these trees. What is needed is a model of speciation processes that generates both the trees and species identities of extant lineages. The model should allow calculation of the probability that certain lineages belong to certain species and have an evolutionary history consistent with the tree. Here, we propose such a model and test the model performance on both simulated data and real data. We show that maximum-likelihood estimates of the model are highly accurate and give estimates from real data that generate patterns consistent with observations. We discuss how to extend the model to account for different rates and types of speciation processes across lineages in a species group. By linking evolutionary processes on lineage level to species level, the model provides a new phylogenetic approach to study not just when speciation happened, but how speciation happened. [Micro-macro evolution; Protracted birth-death process; speciation completion rate; SSE approach.].


Subject(s)
Extinction, Biological , Genetic Speciation , Likelihood Functions , Phylogeny
3.
Theor Popul Biol ; 134: 106-118, 2020 08.
Article in English | MEDLINE | ID: mdl-32562610

ABSTRACT

The stationary sampling distribution of a neutral decoupled Moran or Wright-Fisher diffusion with neutral mutations is known to first order for a general rate matrix with small but otherwise unconstrained mutation rates. Using this distribution as a starting point we derive results for maximum likelihood estimates of scaled mutation rates from site frequency data under three model assumptions: a twelve-parameter general rate matrix, a nine-parameter reversible rate matrix, and a six-parameter strand-symmetric rate matrix. The site frequency spectrum is assumed to be sampled from a fixed size population in equilibrium, and to consist of allele frequency data at a large number of unlinked sites evolving with a common mutation rate matrix without selective bias. We correct an error in a previous treatment of the same problem (Burden and Tang, 2017) affecting the estimators for the general and strand-symmetric rate matrices. The method is applied to a biological dataset consisting of a site frequency spectrum extracted from short autosomal introns in a sample of Drosophila melanogaster individuals.


Subject(s)
Genetics, Population , Mutation Rate , Animals , Drosophila melanogaster/genetics , Gene Frequency , Models, Genetic , Mutation
4.
Theor Popul Biol ; 130: 50-59, 2019 12.
Article in English | MEDLINE | ID: mdl-31585138

ABSTRACT

We consider the problem of estimating the elapsed time since the most recent common ancestor of a finite random sample drawn from a population which has evolved through a Bienaymé-Galton-Watson branching process. More specifically, we are interested in the diffusion limit appropriate to a supercritical process in the near-critical limit evolving over a large number of time steps. Our approach differs from earlier analyses in that we assume the only known information is the mean and variance of the number of offspring per parent, the observed total population size at the time of sampling, and the size of the sample. We obtain a formula for the probability that a finite random sample of the population is descended from a single ancestor in the initial population, and derive a confidence interval for the initial population size in terms of the final population size and the time since initiating the process. We also determine a joint likelihood surface from which confidence regions can be determined for simultaneously estimating two parameters, (1) the population size at the time of the most recent common ancestor, and (2) the time elapsed since the existence of the most recent common ancestor.


Subject(s)
Population Density , Biological Evolution , Likelihood Functions , Models, Genetic , Probability
5.
J Math Biol ; 79(6-7): 2315-2342, 2019 12.
Article in English | MEDLINE | ID: mdl-31531705

ABSTRACT

The transition distribution of a sample taken from a Wright-Fisher diffusion with general small mutation rates is found using a coalescent approach. The approximation is equivalent to having at most one mutation in the coalescent tree of the sample up to the most recent common ancestor with additional mutations occurring on the lineage from the most recent common ancestor to the time origin if complete coalescence occurs before the origin. The sampling distribution leads to an approximation for the transition density in the diffusion with small mutation rates. This new solution has interest because the transition density in a Wright-Fisher diffusion with general mutation rates is not known.


Subject(s)
Genetic Drift , Genetics, Population/methods , Models, Genetic , Mutation Rate , Gene Frequency
6.
J Math Biol ; 78(4): 1211-1224, 2019 03.
Article in English | MEDLINE | ID: mdl-30426201

ABSTRACT

The stationary distribution of a sample taken from a Wright-Fisher diffusion with general small mutation rates is found using a coalescent approach. The approximation is equivalent to having at most one mutation in the coalescent tree to the first order in the rates. The sample probabilities characterize an approximation for the stationary distribution from the Wright-Fisher diffusion. The approach is different from Burden and Tang (Theor Popul Biol 112:22-32, 2016; Theor Popul Biol 113:23-33, 2017) who use a probability flux argument to obtain the same results from a forward diffusion generator equation. The solution has interest because the solution is not known when rates are not small. An analogous solution is found for the configuration of alleles in a general exchangeable binary coalescent tree. In particular an explicit solution is found for a pure birth process tree when individuals reproduce at rate [Formula: see text].


Subject(s)
Models, Genetic , Mutation Rate , Alleles , Animals , Computational Biology , Gene Frequency , Genetics, Population , Markov Chains , Mathematical Concepts , Probability
7.
Theor Popul Biol ; 124: 70-80, 2018 12.
Article in English | MEDLINE | ID: mdl-30308179

ABSTRACT

The stationary distribution of the diffusion limit of the 2-island, 2-allele Wright-Fisher model with small but otherwise arbitrary mutation and migration rates is investigated. Following a method developed by Burden and Tang (2016, 2017) for approximating the forward Kolmogorov equation, the stationary distribution is obtained to leading order as a set of line densities on the edges of the sample space, corresponding to states for which one island is bi-allelic and the other island is non-segregating, and a set of point masses at the corners of the sample space, corresponding to states for which both islands are simultaneously non-segregating. Analytic results for the corner probabilities and line densities are verified independently using the backward generator and for the corner probabilities using the coalescent.


Subject(s)
Gene Frequency , Genetics, Population , Models, Genetic , Alleles , Computer Simulation , Genetic Drift , Mutation , Probability
8.
Theor Popul Biol ; 120: 52-61, 2018 03.
Article in English | MEDLINE | ID: mdl-29233675

ABSTRACT

A population genetics model based on a multitype branching process, or equivalently a Galton-Watson branching process for multiple alleles, is presented. The diffusion limit forward Kolmogorov equation is derived for the case of neutral mutations. The asymptotic stationary solution is obtained and has the property that the extant population partitions into subpopulations whose relative sizes are determined by mutation rates. An approximate time-dependent solution is obtained in the limit of low mutation rates. This solution has the property that the system undergoes a rapid transition from a drift-dominated phase to a mutation-dominated phase in which the distribution collapses onto the asymptotic stationary distribution. The changeover point of the transition is determined by the per-generation growth factor and mutation rate. The approximate solution is confirmed using numerical simulations.


Subject(s)
Models, Genetic , Mutation Rate , Alleles , Animals , Computer Simulation , Genetics, Population , Humans , Mutation
9.
Theor Popul Biol ; 113: 23-33, 2017 02.
Article in English | MEDLINE | ID: mdl-27825765

ABSTRACT

A procedure is described for estimating evolutionary rate matrices from observed site frequency data. The procedure assumes (1) that the data are obtained from a constant size population evolving according to a stationary Wright-Fisher or decoupled Moran model; (2) that the data consist of a multiple alignment of a moderate number of sequenced genomes drawn randomly from the population; and (3) that within the genome a large number of independent, neutral sites evolving with a common mutation rate matrix can be identified. No restrictions are imposed on the scaled rate matrix other than that the off-diagonal elements are positive, their sum is ≪1, and that the rows of the matrix sum to zero. In particular the rate matrix is not assumed to be reversible. The key to the method is an approximate stationary solution to the diffusion limit, forward Kolmogorov equation for neutral evolution in the limit of low mutation rates.


Subject(s)
Biological Evolution , Evolution, Molecular , Gene Frequency , Models, Genetic , Genetic Drift , Genetics, Population
10.
Theor Popul Biol ; 112: 22-32, 2016 12.
Article in English | MEDLINE | ID: mdl-27495379

ABSTRACT

We address the problem of determining the stationary distribution of the multi-allelic, neutral-evolution Wright-Fisher model in the diffusion limit. A full solution to this problem for an arbitrary $K\times K$ mutation rate matrix involves solving for the stationary solution of a forward Kolmogorov equation over a $(K-1)$-dimensional simplex, and remains intractable. In most practical situations mutation rates are slow on the scale of the diffusion limit and the solution is heavily concentrated on the corners and edges of the simplex. In this paper we present a practical approximate solution for slow mutation rates in the form of a set of line densities along the edges of the simplex. The method of solution relies on parameterising the general non-reversible rate matrix as the sum of a reversible part and a set of $(K-1)(K-2)/2$ independent terms corresponding to fluxes of probability along closed paths around faces of the simplex. The solution is potentially a first step in estimating non-reversible evolutionary rate matrices from observed allele frequency spectra.
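The parameter counting behind this decomposition can be checked with a few lines of arithmetic. The sketch below assumes that a general rate matrix has $K(K-1)$ free off-diagonal elements and that the reversible part carries whatever remains after removing the $(K-1)(K-2)/2$ flux terms; the function name is hypothetical. For $K=4$ nucleotides this reproduces the twelve-parameter general and nine-parameter reversible matrices mentioned in record 3.

```python
def rate_matrix_parameter_counts(k):
    """Count free parameters of a K x K rate matrix whose rows sum to zero."""
    if k < 2:
        raise ValueError("need at least two allele types")
    general = k * (k - 1)                # all off-diagonal elements
    flux_terms = (k - 1) * (k - 2) // 2  # probability fluxes, from the abstract
    reversible = general - flux_terms    # equals (k - 1) * (k + 2) / 2
    return {"general": general, "reversible": reversible, "flux": flux_terms}
```

For example, `rate_matrix_parameter_counts(4)` returns `{"general": 12, "reversible": 9, "flux": 3}`.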


Subject(s)
Alleles , Genetic Drift , Models, Genetic , Mutation Rate , Gene Frequency , Genetics, Population , Mutation
11.
Theor Popul Biol ; 109: 63-74, 2016 06.
Article in English | MEDLINE | ID: mdl-27018000

ABSTRACT

Most population genetics studies have their origins in a Wright-Fisher or some closely related fixed-population model in which each individual randomly chooses its ancestor. Populations which vary in size with time are typically modelled via a coalescent derived from Wright-Fisher, but using a nonlinear time-scaling driven by a deterministically imposed population growth. An alternate, arguably more realistic approach, and one which we take here, is to allow the population size to vary stochastically via a Galton-Watson branching process. We study genetic drift in a population consisting of a number of distinct allele types in which each allele type evolves as an independent Galton-Watson branching process. We find the dynamics of the population are determined by a single parameter $\kappa_0=(2m_0/\sigma^2)\log\lambda$, where $m_0$ is the initial population size, $\lambda$ is the mean number of offspring per individual, and $\sigma^2$ is the variance of the number of offspring. For $0\lesssim\kappa_0\ll 1$, the dynamics are close to those of Wright-Fisher, with the added property that the population is prone to extinction. For $\kappa_0\gg 1$ allele frequencies and ancestral lineages are stable and individual alleles do not fix throughout the population. The existence of a rapid changeover regime at $\kappa_0\approx 1$ enables estimates to be made, together with confidence intervals, of the time and population size of the era of mitochondrial Eve.
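As a small worked illustration of the controlling parameter in this abstract, the sketch below evaluates it from the three quantities named in the text. It assumes that "log" denotes the natural logarithm; the function and argument names are hypothetical.

```python
import math


def kappa0(m0, offspring_mean, offspring_variance):
    """Evaluate kappa_0 = (2 * m0 / sigma^2) * log(lambda).

    Assumption: log is the natural logarithm.
    """
    if offspring_mean <= 0 or offspring_variance <= 0:
        raise ValueError("offspring mean and variance must be positive")
    return (2.0 * m0 / offspring_variance) * math.log(offspring_mean)
```

A critical process with a mean of exactly one offspring per individual gives a value of zero, which is the Wright-Fisher-like end of the range described in the abstract.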


Subject(s)
Genetic Drift , Genetics, Population , Models, Genetic , Alleles , Population Density , Population Growth
12.
BMC Bioinformatics ; 16: 145, 2015 May 06.
Article in English | MEDLINE | ID: mdl-25943746

ABSTRACT

BACKGROUND: Bisulphite sequencing enables the detection of cytosine methylation. The sequence of the methylation states of cytosines on any given read forms a methylation pattern that carries substantially more information than merely studying the average methylation level at individual positions. In order to understand better the complexity of DNA methylation landscapes in biological samples, it is important to study the diversity of these methylation patterns. However, the accurate quantification of methylation patterns is subject to sequencing errors and spurious signals due to incomplete bisulphite conversion of cytosines. RESULTS: A statistical model is developed which accounts for the distribution of DNA methylation patterns at any given locus. The model incorporates the effects of sequencing errors and spurious reads, and enables estimation of the true underlying distribution of methylation patterns. CONCLUSIONS: Calculation of the estimated distribution over methylation patterns is implemented in the R Bioconductor package MPFE. Source code and documentation of the package are also available for download at http://bioconductor.org/packages/3.0/bioc/html/MPFE.html .


Subject(s)
Algorithms , Bees/physiology , Brain/metabolism , DNA Methylation , High-Throughput Nucleotide Sequencing/methods , Models, Statistical , Animals , Cytosine/chemistry , Documentation , Programming Languages , Sulfites/chemistry
13.
PeerJ ; 2: e576, 2014.
Article in English | MEDLINE | ID: mdl-25337456

ABSTRACT

Background. A number of algorithms exist for analysing RNA-sequencing data to infer profiles of differential gene expression. Problems inherent in building algorithms around statistical models of overdispersed count data are formidable and frequently lead to non-uniform p-value distributions for null-hypothesis data and to inaccurate estimates of false discovery rates (FDRs). This can lead to an inaccurate measure of significance and loss of power to detect differential expression. Results. We use synthetic and real biological data to assess the ability of several available R packages to accurately estimate FDRs. The packages surveyed are based on statistical models of overdispersed Poisson data and include edgeR, DESeq, DESeq2, PoissonSeq and QuasiSeq. Also tested is an add-on package to edgeR and DESeq which we introduce called Polyfit. Polyfit aims to address the problem of a non-uniform null p-value distribution for two-class datasets by adapting the Storey-Tibshirani procedure. Conclusions. We find the best performing package in the sense that it achieves a low FDR which is accurately estimated over the full range of p-values, albeit with a very slow run time, is the QLSpline implementation of QuasiSeq. This finding holds provided the number of biological replicates in each condition is at least 4. The next best performing packages are edgeR and DESeq2. When the number of biological replicates is sufficiently high, and within a range accessible to multiplexed experimental designs, the Polyfit extension improves the performance of DESeq (for approximately 6 or more replicates per condition), making its performance comparable with that of edgeR and DESeq2 in our tests with synthetic data.
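For orientation, the sketch below shows the basic fixed-threshold form of the Storey-Tibshirani idea that the abstract refers to: the proportion of true nulls is estimated from the flat right-hand part of the p-value histogram, and the FDR at a significance threshold follows from it. This is not the Polyfit adaptation, which is not specified in the abstract; the function names and the default `lam=0.5` are assumptions.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the fraction of true null hypotheses from p-values above lam."""
    m = len(p_values)
    if m == 0 or not 0.0 <= lam < 1.0:
        raise ValueError("need p-values and 0 <= lam < 1")
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def estimate_fdr(p_values, threshold, lam=0.5):
    """Estimated FDR when every p-value <= threshold is called significant."""
    m = len(p_values)
    called = sum(1 for p in p_values if p <= threshold)
    if called == 0:
        return 0.0  # nothing called, so no false discoveries
    expected_false = estimate_pi0(p_values, lam) * m * threshold
    return min(1.0, expected_false / called)
```

The estimate relies on null p-values being uniform on the unit interval, which is exactly the assumption the abstract reports as failing for several packages.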

14.
J Comput Biol ; 21(1): 41-63, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24160839

ABSTRACT

Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The $D_2$ statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the $D_2$ statistic under the null hypothesis of identically and independently distributed letters have been studied extensively, but no comprehensive study of the $D_2$ distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the $D_2$ statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate $D_2$ distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical $D_2$ distribution from the human genome.
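The word-match count described in this abstract can be computed directly from k-word frequency tables. The sketch below is a minimal illustration for ordinary linear sequences; it does not use the periodic boundary conditions on which the paper's exact formulas rely, and the function name is hypothetical.

```python
from collections import Counter


def d2(seq_a, seq_b, k):
    """Number of exact k-word matches between two sequences.

    Every pair of positions (i, j) with seq_a[i:i+k] == seq_b[j:j+k]
    counts once, so the total is the sum over words of the product of
    their occurrence counts in the two sequences.
    """
    if k < 1:
        raise ValueError("word length k must be at least 1")
    words_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    words_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    return sum(count * words_b[word] for word, count in words_a.items())
```

For example, `d2("ACGT", "ACGT", 2)` returns 3, one match for each of the words AC, CG and GT.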


Subject(s)
Markov Chains , Sequence Alignment/statistics & numerical data , Algorithms , Computational Biology , Computer Simulation , Genome, Human , Humans , Models, Genetic , Models, Statistical , Monte Carlo Method , Sequence Analysis, DNA/statistics & numerical data
15.
Nucleic Acids Res ; 41(5): 2779-96, 2013 Mar 01.
Article in English | MEDLINE | ID: mdl-23307556

ABSTRACT

Hybridization of nucleic acids on solid surfaces is a key process involved in high-throughput technologies such as microarrays and, in some cases, next-generation sequencing (NGS). A physical understanding of the hybridization process helps to determine the accuracy of these technologies. The goal of a widespread research program is to develop reliable transformations between the raw signals reported by the technologies and individual molecular concentrations from an ensemble of nucleic acids. This research has inputs from many areas, from bioinformatics and biostatistics, to theoretical and experimental biochemistry and biophysics, to computer simulations. A group of leading researchers met in Ploen, Germany, in 2011 to discuss present knowledge and limitations of our physico-chemical understanding of high-throughput nucleic acid technologies. This meeting inspired us to write this summary, which provides an overview of state-of-the-art approaches, based on physico-chemical foundations, to modeling the hybridization of nucleic acids on solid surfaces. In addition, the practical application of current knowledge is emphasized.


Subject(s)
High-Throughput Nucleotide Sequencing , Oligonucleotide Array Sequence Analysis , Algorithms , Artifacts , Base Pairing , Calibration , DNA/chemistry , DNA/genetics , DNA Probes/chemistry , DNA Probes/genetics , Humans , Image Processing, Computer-Assisted , Models, Biological , Nucleic Acid Hybridization/methods , Surface Properties , Thermodynamics
16.
BMC Genomics ; 13: 484, 2012 Sep 17.
Article in English | MEDLINE | ID: mdl-22985019

ABSTRACT

BACKGROUND: RNA sequencing (RNA-Seq) has emerged as a powerful approach for the detection of differential gene expression with both high-throughput and high resolution capabilities possible depending upon the experimental design chosen. Multiplex experimental designs are now readily available; these can be utilised to increase the numbers of samples or replicates profiled at the cost of decreased sequencing depth generated per sample. These strategies impact on the power of the approach to accurately identify differential expression. This study presents a detailed analysis of the power to detect differential expression in a range of scenarios including simulated null and differential expression distributions with varying numbers of biological or technical replicates, sequencing depths and analysis methods. RESULTS: Differential and non-differential expression datasets were simulated using a combination of negative binomial and exponential distributions derived from real RNA-Seq data. These datasets were used to evaluate the performance of three commonly used differential expression analysis algorithms and to quantify the changes in power with respect to true and false positive rates when simulating variations in sequencing depth, biological replication and multiplex experimental design choices. CONCLUSIONS: This work quantitatively explores comparisons between contemporary analysis tools and experimental design choices for the detection of differential expression using RNA-Seq. We found that the DESeq algorithm performs more conservatively than edgeR and NBPSeq. With regard to testing of various experimental designs, this work strongly suggests that greater power is gained through the use of biological replicates relative to library (technical) replicates and sequencing depth. Strikingly, sequencing depth could be reduced to as low as 15% without substantial impacts on false positive or true positive rates.


Subject(s)
Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Statistics as Topic/methods , Algorithms
17.
Stat Appl Genet Mol Biol ; 11(1): Article 3, 2012.
Article in English | MEDLINE | ID: mdl-22624182

ABSTRACT

The $D_2$ statistic, defined as the number of matches of words of some pre-specified length $k$, is a computationally fast alignment-free measure of biological sequence similarity. However, there is some debate about its suitability for this purpose as the variability in $D_2$ may be dominated by the terms that reflect the noise in each of the single sequences only. We examine the extent of the problem and the effectiveness of overcoming it by using two mean-centred variants of this statistic, $D_2^*$ and $D_2^c$. We conclude that all three statistics are potentially useful measures of sequence similarity, for which reasonably accurate p-values can be estimated under a null hypothesis of sequences composed of identically and independently distributed letters. We show that $D_2$ and $D_2^c$, and to a somewhat lesser extent $D_2^*$, perform well in tests to classify moderate length query sequences as putative cis-regulatory modules.


Subject(s)
Sequence Alignment , Sequence Analysis, DNA/methods , Base Sequence , Databases, Factual , Sequence Analysis, DNA/statistics & numerical data
18.
Bioinformatics ; 26(18): 2281-8, 2010 Sep 15.
Article in English | MEDLINE | ID: mdl-20639411

ABSTRACT

MOTIVATION: Clustering gene expression data given in terms of time-series is a challenging problem that imposes its own particular constraints. Traditional clustering methods based on conventional similarity measures are not always suitable for clustering time-series data. A few methods have been proposed recently for clustering microarray time-series, which take the temporal dimension of the data into account. The inherent principle behind these methods is to either define a similarity measure appropriate for temporal expression data, or pre-process the data in such a way that the temporal relationships between and within the time-series are considered during the subsequent clustering phase. RESULTS: We introduce pairwise gene expression profile alignment, which vertically shifts two profiles in such a way that the area between their corresponding curves is minimal. Based on the pairwise alignment operation, we define a new distance function that is appropriate for time-series profiles. We also introduce a new clustering method that involves multiple expression profile alignment, which generalizes pairwise alignment to a set of profiles. Extensive experiments on well-known datasets yield encouraging results of at least 80% classification accuracy.
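The pairwise alignment step described in this abstract can be illustrated on profiles sampled at common time points. The sketch below is a simplification and not the authors' algorithm: it assumes that closeness is measured by the sum of squared differences at the sample points rather than by the area between continuous curves, in which case the best vertical shift is simply the mean difference between the two profiles. The function name is hypothetical.

```python
def align_pair(profile_x, profile_y):
    """Vertically shift profile_y to lie as close as possible to profile_x.

    Returns (shift, shifted_y, distance), where distance is the residual
    sum of squared differences after the shift has been applied.
    """
    if len(profile_x) != len(profile_y) or not profile_x:
        raise ValueError("profiles must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(profile_x, profile_y)]
    shift = sum(diffs) / len(diffs)  # least-squares optimal constant shift
    shifted_y = [b + shift for b in profile_y]
    distance = sum((a - b) ** 2 for a, b in zip(profile_x, shifted_y))
    return shift, shifted_y, distance
```

Two profiles with the same shape but different baselines obtain a distance of zero under this measure, which is the behaviour a shape-based distance for time-series clustering is meant to have.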


Subject(s)
Algorithms , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Cluster Analysis , Gene Expression
19.
BMC Bioinformatics ; 11: 291, 2010 May 28.
Article in English | MEDLINE | ID: mdl-20509934

ABSTRACT

BACKGROUND: Post-hybridization washing is an essential part of microarray experiments. Both the quality of the experimental washing protocol and adequate consideration of washing in intensity calibration ultimately affect the quality of the expression estimates extracted from the microarray intensities. RESULTS: We conducted experiments on GeneChip microarrays with altered protocols for washing, scanning and staining to study the probe-level intensity changes as a function of the number of washing cycles. For calibration and analysis of the intensity data we make use of the 'hook' method which allows intensity contributions due to non-specific and specific hybridization of perfect match (PM) and mismatch (MM) probes to be disentangled in a sequence specific manner. On average, washing according to the standard protocol removes about 90% of the non-specific background and about 30-50% and less than 10% of the specific targets from the MM and PM, respectively. Analysis of the washing kinetics shows that the signal-to-noise ratio doubles roughly every ten stringent washing cycles. Washing can be characterized by time-dependent rate constants which reflect the heterogeneous character of target binding to microarray probes. We propose an empirical washing function which estimates the survival of probe bound targets. It depends on the intensity contribution due to specific and non-specific hybridization per probe which can be estimated for each probe using existing methods. The washing function allows probe intensities to be calibrated for the effect of washing. On a relative scale, proper calibration for washing markedly increases expression measures, especially in the limit of small and large values. CONCLUSIONS: Washing is among the factors which potentially distort expression measures. The proposed first-order correction method allows direct implementation in existing calibration algorithms for microarray data. 
We provide an experimental 'washing data set' which might be used by the community for developing amendments of the washing correction.


Subject(s)
Algorithms , Oligonucleotide Array Sequence Analysis/methods , Gene Expression Profiling/methods , Kinetics , Nucleic Acid Hybridization
20.
Phys Biol ; 7(1): 016004, 2009 Dec 21.
Article in English | MEDLINE | ID: mdl-20026877

ABSTRACT

The effect of target molecule depletion from the supernatant solution is incorporated into a physico-chemical model of hybridization on oligonucleotide microarrays. Two possible regimes are identified: local depletion, in which depletion by a given probe feature only affects that particular probe, and global depletion, in which all features responding to a given target species are affected. Examples are given of two existing spike-in data sets experiencing measurable effects of target depletion. The first of these, from an experiment by Suzuki et al. using custom-built arrays with a broad range of probe lengths and mismatch positions, is verified to exhibit local and not global depletion. The second data set, the well-known Affymetrix HGU133a Latin square experiment, is shown to be very well explained by a global depletion model. It is shown that microarray calibrations relying on Langmuir isotherm models which ignore depletion effects will significantly underestimate specific target concentrations. It is also shown that a combined analysis of perfect match and mismatch probe signals in terms of a simple graphical summary, namely the hook curve method, can discriminate between cases of local and global depletion.
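The underestimation claim in this abstract can be illustrated with a textbook single-probe equilibrium. The sketch below is not the paper's model: it assumes one probe feature binding one target species, a binding constant `k`, a total target concentration `c_total`, and a hypothetical quantity `p` giving the probe-site concentration expressed per unit volume of supernatant. With depletion, the free target concentration is `c_total - p * theta`, and the bound fraction `theta` solves a quadratic.

```python
import math


def langmuir_fraction(k, c):
    """Bound fraction of probe sites when depletion is ignored."""
    return k * c / (1.0 + k * c)


def depleted_fraction(k, c_total, p):
    """Bound fraction when bound targets are removed from solution.

    Solves theta = k*(c_total - p*theta) / (1 + k*(c_total - p*theta)),
    taking the root that lies between 0 and 1.
    """
    if p == 0:
        return langmuir_fraction(k, c_total)
    b = 1.0 + k * c_total + k * p
    disc = b * b - 4.0 * k * k * p * c_total
    return (b - math.sqrt(disc)) / (2.0 * k * p)


def naive_concentration(k, theta):
    """Concentration inferred by inverting the depletion-free isotherm."""
    return theta / (k * (1.0 - theta))
```

Inverting the depletion-free isotherm on a depleted signal recovers only the free concentration, `c_total - p * theta`, which is always below the true total when `p` is positive.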


Subject(s)
Models, Chemical , Nucleic Acid Hybridization , Oligonucleotide Array Sequence Analysis/methods , Chemical Phenomena