Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 20
Filter
1.
Nat Hum Behav ; 8(6): 1057-1064, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38649461

ABSTRACT

In widely used models of biological contagion, interventions that randomly rewire edges (generally making them 'longer') accelerate spread. However, recent work has argued that highly clustered, rather than random, networks facilitate the spread of threshold-based contagions, such as those motivated by myopic best response for adoption of new innovations, norms and products in games of strategic complement. Here we show that minor modifications to this model reverse this result, thereby harmonizing qualitative facts about how network structure affects contagion. We analyse the rate of spread over circular lattices with rewired edges and show that having a small probability of adoption below the threshold probability is enough to ensure that random rewiring accelerates the spread of a noisy threshold-based contagion. This conclusion is verified in simulations of empirical networks and remains valid with partial but frequent enough rewiring and when adoption decisions are reversible but infrequently so, as well as in high-dimensional lattice structures.


Subject(s)
Models, Theoretical , Humans
2.
Pac Symp Biocomput ; 28: 133-144, 2023.
Article in English | MEDLINE | ID: mdl-36540971

ABSTRACT

In an extant population, how much information do extant individuals provide on the pedigree of their ancestors? Recent work by Kim, Mossel, Ramnarayan and Turner (2020) studied this question under a number of simplifying assumptions, including random mating, fixed length inheritance blocks and sufficiently large founding population. They showed that under these conditions if the average number of offspring is a sufficiently large constant, then it is possible to recover a large fraction of the pedigree structure and genetic content by an algorithm they named REC-GEN.We are interested in studying the performance of REC-GEN on simulated data generated according to the model. As a first step, we improve the running time of the algorithm. However, we observe that even the faster version of the algorithm does not do well in any simulations in recovering the pedigree beyond 2 generations. We claim that this is due to the inbreeding present in any setting where the algorithm can be run, even on simulated data. To support the claim we show that a main step of the algorithm, called ancestral reconstruction, performs accurately in an idealized setting with no inbreeding but performs poorly in random mating populations.To overcome the poor behavior of REC-GEN we introduce a Belief-Propagation based heuristic that accounts for the inbreeding and performs much better in our simulations.


Subject(s)
Computational Biology , Inbreeding , Humans , Pedigree , Computer Simulation , Models, Genetic
3.
J Math Biol ; 84(5): 36, 2022 04 08.
Article in English | MEDLINE | ID: mdl-35394192

ABSTRACT

Species tree estimation faces many significant hurdles. Chief among them is that the trees describing the ancestral lineages of each individual gene-the gene trees-often differ from the species tree. The multispecies coalescent is commonly used to model this gene tree discordance, at least when it is believed to arise from incomplete lineage sorting, a population-genetic effect. Another significant challenge in this area is that molecular sequences associated to each gene typically provide limited information about the gene trees themselves. While the modeling of sequence evolution by single-site substitutions is well-studied, few species tree reconstruction methods with theoretical guarantees actually address this latter issue. Instead, a standard-but unsatisfactory-assumption is that gene trees are perfectly reconstructed before being fed into a so-called summary method. Hence much remains to be done in the development of inference methodologies that rigorously account for gene tree estimation error-or completely avoid gene tree estimation in the first place. In previous work, a data requirement trade-off was derived between the number of loci m needed for an accurate reconstruction and the length of the locus sequences k. It was shown that to reconstruct an internal branch of length f, one needs m to be of the order of [Formula: see text]. That previous result was obtained under the restrictive assumption that mutation rates as well as population sizes are constant across the species phylogeny. Here we further generalize this result beyond this assumption. Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent, which we refer to as a stochastic Farris transform. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with [Formula: see text] species, the rooted topology of the species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.


Subject(s)
Genetic Speciation , Models, Genetic , Phylogeny
4.
J Comput Biol ; 27(4): 613-625, 2020 04.
Article in English | MEDLINE | ID: mdl-31794679

ABSTRACT

Reconstruction of population histories is a central problem in population genetics. Existing coalescent-based methods, such as the seminal work of Li and Durbin, attempt to solve this problem using sequence data but have no rigorous guarantees. Determining the amount of data needed to correctly reconstruct population histories is a major challenge. Using a variety of tools from information theory, the theory of extremal polynomials, and approximation theory, we prove new sharp information-theoretic lower bounds on the problem of reconstructing population structure-the history of multiple subpopulations that merge, split, and change sizes over time. Our lower bounds are exponential in the number of subpopulations, even when reconstructing recent histories. We demonstrate the sharpness of our lower bounds by providing algorithms for distinguishing and learning population histories with matching dependence on the number of subpopulations. Along the way and of independent interest, we essentially determine the optimal number of samples needed to learn an exponential mixture distribution information-theoretically, proving the upper bound by analyzing natural (and efficient) algorithms for this problem.


Subject(s)
Computational Biology , Genetics, Population , Information Theory , Models, Genetic , Algorithms , Computer Simulation , Polymorphism, Single Nucleotide/genetics
5.
Entropy (Basel) ; 21(1)2019 Jan 02.
Article in English | MEDLINE | ID: mdl-33266744

ABSTRACT

This paper shows that one cannot learn the probability of rare events without imposing further structural assumptions. The event of interest is that of obtaining an outcome outside the coverage of an i.i.d. sample from a discrete distribution. The probability of this event is referred to as the "missing mass". The impossibility result can then be stated as: the missing mass is not distribution-free learnable in relative error. The proof is semi-constructive and relies on a coupling argument using a dithered geometric distribution. Via a reduction, this impossibility also extends to both discrete and continuous tail estimation. These results formalize the folklore that in order to predict rare events without restrictive modeling, one necessarily needs distributions with "heavy tails".

6.
Theor Popul Biol ; 100C: 26-38, 2015 03.
Article in English | MEDLINE | ID: mdl-25498195

ABSTRACT

Reconstructing past population size from present day genetic data is a major goal of population genetics. Recent empirical studies infer population size history using coalescent-based models applied to a small number of individuals. Here we provide tight bounds on the amount of exact coalescence time data needed to recover the population size history of a single, panmictic population at a certain level of accuracy. In practice, coalescence times are estimated from sequence data and so our lower bounds should be taken as rather conservative.

7.
J Theor Biol ; 360: 315-318, 2014 Nov 07.
Article in English | MEDLINE | ID: mdl-25108194

ABSTRACT

Inferring the ancestral state at the root of a phylogenetic tree from states observed at the leaves is a problem arising in evolutionary biology. The simplest technique - majority rule - estimates the root state by the most frequently occurring state at the leaves. Alternative methods - such as maximum parsimony - explicitly take the tree structure into account. Since either method can outperform the other on particular trees, it is useful to consider the accuracy of the methods on trees generated under some evolutionary null model, such as a Yule pure-birth model. In this short note, we answer a recently posed question concerning the performance of majority rule on Yule trees under a symmetric 2-state Markovian substitution model of character state change. We show that majority rule is accurate precisely when the ratio of the birth (speciation) rate of the Yule process to the substitution rate exceeds the value 4. By contrast, maximum parsimony has been shown to be accurate only when this ratio is at least 6. Our proof relies on a second moment calculation, coupling, and a novel application of a reflection principle.


Subject(s)
Biological Evolution , Classification/methods , Genetic Speciation , Inheritance Patterns/genetics , Models, Genetic , Phylogeny , Information Theory
8.
Proc Natl Acad Sci U S A ; 110(52): 20935-40, 2013 Dec 24.
Article in English | MEDLINE | ID: mdl-24277835

ABSTRACT

Spectral algorithms are classic approaches to clustering and community detection in networks. However, for sparse networks the standard versions of these algorithms are suboptimal, in some cases completely failing to detect communities even when other algorithms such as belief propagation can do so. Here, we present a class of spectral algorithms based on a nonbacktracking walk on the directed edges of the graph. The spectrum of this operator is much better-behaved than that of the adjacency matrix or other commonly used matrices, maintaining a strong separation between the bulk eigenvalues and the eigenvalues relevant to community structure even in the sparse case. We show that our algorithm is optimal for graphs generated by the stochastic block model, detecting communities all of the way down to the theoretical limit. We also show the spectrum of the nonbacktracking operator for some real-world networks, illustrating its advantages over traditional spectral clustering.


Subject(s)
Algorithms , Cluster Analysis , Data Interpretation, Statistical , Models, Statistical
9.
J Math Biol ; 67(4): 767-97, 2013 Oct.
Article in English | MEDLINE | ID: mdl-22875145

ABSTRACT

Mutation rate variation across loci is well known to cause difficulties, notably identifiability issues, in the reconstruction of evolutionary trees from molecular sequences. Here we introduce a new approach for estimating general rates-across-sites models. Our results imply, in particular, that large phylogenies are typically identifiable under rate variation. We also derive sequence-length requirements for high-probability reconstruction. Our main contribution is a novel algorithm that clusters sites according to their mutation rate. Following this site clustering step, standard reconstruction techniques can be used to recover the phylogeny. Our results rely on a basic insight: that, for large trees, certain site statistics experience concentration-of-measure phenomena.


Subject(s)
Data Interpretation, Statistical , Evolution, Molecular , Models, Genetic , Mutation , Phylogeny , Algorithms , Base Sequence/genetics , Cluster Analysis
10.
Bull Math Biol ; 73(7): 1627-44, 2011 Jul.
Article in English | MEDLINE | ID: mdl-20931293

ABSTRACT

The accurate reconstruction of phylogenies from short molecular sequences is an important problem in computational biology. Recent work has highlighted deep connections between sequence-length requirements for high-probability phylogeny reconstruction and the related problem of the estimation of ancestral sequences. In Daskalakis et al. (in Probab. Theory Relat. Fields 2010), building on the work of Mossel (Trans. Am. Math. Soc. 356(6):2379-2404, 2004), a tight sequence-length requirement was obtained for the simple CFN model of substitution, that is, the case of a two-state symmetric rate matrix Q. In particular the required sequence length for high-probability reconstruction was shown to undergo a sharp transition (from O(log n) to poly(n), where n is the number of leaves) at the "critical" branch length g (ML)(Q) (if it exists) of the ancestral reconstruction problem defined roughly as follows: below g (ML)(Q) the sequence at the root can be accurately estimated from sequences at the leaves on deep trees, whereas above g (ML)(Q) information decays exponentially quickly down the tree.Here, we consider a more general evolutionary model, the GTR model, where the q×q rate matrix Q is reversible with q≥2. For this model, recent results of Roch (Preprint, 2009) show that the tree can be accurately reconstructed with sequences of length O(log (n)) when the branch lengths are below g (Lin)(Q), known as the Kesten-Stigum (KS) bound, up to which ancestral sequences can be accurately estimated using simple linear estimators. Although for the CFN model g (ML)(Q)=g (Lin)(Q) (in other words, linear ancestral estimators are in some sense best possible), it is known that for the more general GTR models one has g (ML)(Q)≥g (Lin)(Q) with a strict inequality in many cases. Here, we show that this phenomenon also holds for phylogenetic reconstruction by exhibiting a family of symmetric models Q and a phylogenetic reconstruction algorithm which recovers the tree from O(log n)-length sequences for some branch lengths in the range (g (Lin)(Q),g (ML)(Q)). Second, we prove that phylogenetic reconstruction under GTR models requires a polynomial sequence-length for branch lengths above g (ML)(Q).


Subject(s)
Base Sequence , Evolution, Molecular , Models, Genetic , Phylogeny , DNA/chemistry , DNA/genetics , Markov Chains
11.
Article in English | MEDLINE | ID: mdl-21116038

ABSTRACT

Markov models are extensively used in the analysis of molecular evolution. A recent line of research suggests that pairs of proteins with functional and physical interactions co-evolve with each other. Here, by analyzing hundreds of orthologous sets of three fungi and their co-evolutionary relations, we demonstrate that co-evolutionary assumption may violate the Markov assumption. Our results encourage developing alternative probabilistic models for the cases of extreme co-evolution.


Subject(s)
Evolution, Molecular , Markov Chains , Phylogeny , Animals , Humans , Proteins/genetics
12.
Article in English | MEDLINE | ID: mdl-20150678

ABSTRACT

We introduce a simple computationally efficient algorithm for reconstructing phylogenies from multiple gene trees in the presence of incomplete lineage sorting, that is, when the topology of the gene trees may differ from that of the species tree. We show that our technique is statistically consistent under standard stochastic assumptions, that is, it returns the correct tree given sufficiently many unlinked loci. We also show that it can tolerate moderate estimation errors.


Subject(s)
Algorithms , Biological Evolution , Chromosome Mapping/methods , Linkage Disequilibrium/genetics , Models, Genetic , Phylogeny , Animals , Computer Simulation , Humans
13.
J Theor Biol ; 258(1): 95-102, 2009 May 07.
Article in English | MEDLINE | ID: mdl-19490881

ABSTRACT

Phylogenetic trees describe the evolutionary history of a group of present-day species from a common ancestor. These trees are typically reconstructed from aligned DNA sequence data. In this paper we analytically address the following question: Is the amount of sequence data required to accurately reconstruct a tree significantly more than the amount required to test whether or not a candidate tree was the 'true' tree? By 'significantly', we mean that the two quantities do not behave the same way as a function of the number of species being considered. We prove that, for a certain type of model, the amount of information required is not significantly different; while for another type of model, the information required to test a tree is independent of the number of leaves, while that required to reconstruct it grows with this number. Our results combine probabilistic and combinatorial arguments.


Subject(s)
Algorithms , Biological Evolution , Computer Simulation , Models, Genetic , Phylogeny , Animals , Probability , Sequence Analysis, DNA
14.
Article in English | MEDLINE | ID: mdl-19179706

ABSTRACT

Ancestral maximum likelihood (AML) is a method that simultaneously reconstructs a phylogenetic tree and ancestral sequences from extant data (sequences at the leaves). The tree and ancestral sequences maximize the probability of observing the given data under a Markov model of sequence evolution, in which branch lengths are also optimized but constrained to take the same value on any edge across all sequence sites. AML differs from the more usual form of maximum likelihood (ML) in phylogenetics because ML averages over all possible ancestral sequences. ML has long been know to be statistically consistent--that is, it converges on the correct tree with probability approaching 1 as the sequence length grows. However, the statistical consistency of AML has not been formally determined, despite informal remarks in a literature that dates back 20 years. In this short note we prove a general result that implies that AML is statistically inconsistent. In particular we show that AML can 'shrink' short edges in a tree, resulting in a tree that has no internal resolution as the sequence length grows. Our results apply to any number of taxa.


Subject(s)
Computational Biology/methods , Models, Statistical , Phylogeny , Cluster Analysis , Markov Chains
15.
Bull Math Biol ; 70(4): 1115-39, 2008 May.
Article in English | MEDLINE | ID: mdl-18175189

ABSTRACT

In this paper, we apply new geometric and combinatorial methods to the study of phylogenetic mixtures. The focus of the geometric approach is to describe the geometry of phylogenetic mixture distributions for the two state random cluster model, which is a generalization of the two state symmetric (CFN) model. In particular, we show that the set of mixture distributions forms a convex polytope and we calculate its dimension; corollaries include a simple criterion for when a mixture of branch lengths on the star tree can mimic the site pattern frequency vector of a resolved quartet tree. Furthermore, by computing volumes of polytopes we can clarify how "common" non-identifiable mixtures are under the CFN model. We also present a new combinatorial result which extends any identifiability result for a specific pair of trees of size six to arbitrary pairs of trees. Next we present a positive result showing identifiability of rates-across-sites models. Finally, we answer a question raised in a previous paper concerning "mixed branch repulsion" on trees larger than quartet trees under the CFN model.


Subject(s)
Models, Genetic , Phylogeny , Cluster Analysis , Evolution, Molecular , Mathematics
16.
Article in English | MEDLINE | ID: mdl-17277418

ABSTRACT

We study distorted metrics on binary trees in the context of phylogenetic reconstruction. Given a binary tree T on n leaves with a path metric d, consider the pairwise distances {d(u,v)} between leaves. It is well known that these determine the tree and the d length of all edges. Here, we consider distortions [symbol: see text] of d such that, for all leaves u and v, it holds that absolute value(d(u,v) - [symbol: see text](u,v)) < f/2 if either d(u,v) < M + f/2 or [symbol: see text](u,v) < M + f/2, where d satisfies f < or = d(e) < or = g for all edges e. Given such distortions, we show how to reconstruct in polynomial time a forest T1, ..., Talpha such that the true tree T may be obtained from that forest by adding alpha - 1 edges and alpha - 1 < or = 2(-omega(M/g)) n. Our distorted metric result implies a reconstruction algorithm of phylogenetic forests with a small number of trees from sequences of length logarithmic in the number of species. The reconstruction algorithm is applicable for the general Markov model. Both the distorted metric result and its applications to phylogeny are almost tight.


Subject(s)
Computational Biology/methods , Models, Genetic , Phylogeny , Algorithms , Animals , Genome/genetics , Humans , Mutation
17.
Science ; 309(5744): 2207-9, 2005 Sep 30.
Article in English | MEDLINE | ID: mdl-16195459

ABSTRACT

Markov chain Monte Carlo (MCMC) algorithms play a critical role in the Bayesian approach to phylogenetic inference. We present a theoretical analysis of the rate of convergence of many of the widely used Markov chains. For N characters generated from a uniform mixture of two trees, we prove that the Markov chains take an exponentially long (in N) number of iterations to converge to the posterior distribution. Nevertheless, the likelihood plots for sample runs of the Markov chains deceivingly suggest that the chains converge rapidly to a unique tree. Our results rely on novel mathematical understanding of the log-likelihood function on the space of phylogenetic trees. The practical implications of our work are that Bayesian MCMC methods can be misleading when the data are generated from a mixture of trees. Thus, in cases of data containing potentially conflicting phylogenetic signals, phylogenetic reconstruction should be performed separately on each signal.


Subject(s)
Algorithms , Bayes Theorem , Markov Chains , Monte Carlo Method , Phylogeny , Likelihood Functions , Mathematics
18.
J Theor Biol ; 233(3): 327-36, 2005 Apr 07.
Article in English | MEDLINE | ID: mdl-15652142

ABSTRACT

We determine conditions under which a random biochemical system is likely to contain a subsystem that is both autocatalytic and able to survive on some ambient 'food' source. Such systems have previously been investigated for their relevance to origin-of-life models. In this paper we extend earlier work, by finding precisely the order of catalysation required for the emergence of such self-sustaining autocatalytic networks. This answers questions raised in earlier papers, yet also allows for a more general class of models. We also show that a recently described polynomial-time algorithm for determining whether a catalytic reaction system contains an autocatalytic, self-sustaining subsystem is unlikely to adapt to allow inhibitory catalysation--in this case we show that the associated decision problem is NP-complete.


Subject(s)
Biochemical Phenomena , Models, Chemical , Catalysis , Homeostasis , Probability
19.
Math Biosci ; 187(2): 189-203, 2004 Feb.
Article in English | MEDLINE | ID: mdl-14739084

ABSTRACT

We investigate a simple model that generates random partitions of the leaf set of a tree. Of particular interest is the reconstruction question: what number k of independent samples (partitions) are required to correctly reconstruct the underlying tree (with high probability)? We demonstrate a phase transition for k as a function of the mutation rate, from logarithmic to polynomial dependence on the size of the tree. We also describe a simple polynomial-time tree reconstruction algorithm that applies in the logarithmic region. This model and the associated reconstruction questions are motivated by a Markov model for genomic evolution in molecular biology.


Subject(s)
Evolution, Molecular , Models, Genetic , Phylogeny , Cluster Analysis , Markov Chains
20.
J Comput Biol ; 10(5): 669-76, 2003.
Article in English | MEDLINE | ID: mdl-14633391

ABSTRACT

We prove that it is impossible to reconstruct ancestral data at the root of "deep" phylogenetic trees with high mutation rates. Moreover, we prove that it is impossible to reconstruct the topology of "deep" trees with high mutation rates from a number of characters smaller than a low-degree polynomial in the number of leaves. Our impossibility results hold for all reconstruction methods. The proofs apply tools from information theory and percolation theory.


Subject(s)
Mutation , Phylogeny , Decision Trees , Models, Theoretical , Probability , Reproducibility of Results
SELECTION OF CITATIONS
SEARCH DETAIL
...