Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 13 de 13
Filter
Add more filters










Publication year range
1.
Bioinformatics ; 30(17): i541-8, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-25161245

ABSTRACT

MOTIVATION: Species trees provide insight into basic biology, including the mechanisms of evolution and how it modifies biomolecular function and structure, biodiversity and co-evolution between genes and species. Yet, gene trees often differ from species trees, creating challenges to species tree estimation. One of the most frequent causes for conflicting topologies between gene trees and species trees is incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent. While many methods have been developed to estimate species trees from multiple genes, some which have statistical guarantees under the multi-species coalescent model, existing methods are too computationally intensive for use with genome-scale analyses or have been shown to have poor accuracy under some realistic conditions. RESULTS: We present ASTRAL, a fast method for estimating species trees from multiple genes. ASTRAL is statistically consistent, can run on datasets with thousands of genes and has outstanding accuracy-improving on MP-EST and the population tree from BUCKy, two statistically consistent leading coalescent-based methods. ASTRAL is often more accurate than concatenation using maximum likelihood, except when ILS levels are low or there are too few gene trees. AVAILABILITY AND IMPLEMENTATION: ASTRAL is available in open source form at https://github.com/smirarab/ASTRAL/. Datasets studied in this article are available at http://www.cs.utexas.edu/users/phylo/datasets/astral. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Phylogeny , Algorithms , Animals , Data Interpretation, Statistical , Genes , Genetic Speciation , Genomics , Mammals/classification
2.
Pac Symp Biocomput ; : 250-61, 2013.
Article in English | MEDLINE | ID: mdl-23424130

ABSTRACT

Species tree estimation from multiple markers is complicated by the fact that gene trees can differ from each other (and from the true species tree) due to several biological processes, one of which is gene duplication and loss. Local search heuristics for two NP-hard optimization problems - minimize gene duplications (MGD) and minimize gene duplications and losses (MGDL) - are popular techniques for estimating species trees in the presence of gene duplication and loss. In this paper, we present an alternative approach to solving MGD and MGDL from rooted gene trees. First, we characterize each tree in terms of its "subtree-bipartitions" (a concept we introduce). Then we show that the MGD species tree is defined by a maximum weight clique in a vertex-weighted graph that can be computed from the subtree-bipartitions of the input gene trees, and the MGDL species tree is defined by a minimum weight clique in a similarly constructed graph. We also show that these optimal cliques can be found in polynomial time in the number of vertices of the graph using a dynamic programming algorithm (similar to that of Hallett and Lagergren(1)), because of the special structure of the graphs. Finally, we show that a constrained version of these problems, where the subtree-bipartitions of the species tree are drawn from the subtree-bipartitions of the input gene trees, can be solved in time that is polynomial in the number of gene trees and taxa. We have implemented our dynamic programming algorithm in a publicly available software tool, available at http://www.cs.utexas.edu/users/phylo/software/dynadup/.


Subject(s)
Gene Duplication , Phylogeny , Algorithms , Computational Biology , Gene Deletion , Genetics, Population/statistics & numerical data , Models, Genetic , Multigene Family , Software
3.
Pac Symp Biocomput ; : 247-58, 2012.
Article in English | MEDLINE | ID: mdl-22174280

ABSTRACT

We address the problem of Phylogenetic Placement, in which the objective is to insert short molecular sequences (called query sequences) into an existing phylogenetic tree and alignment on full-length sequences for the same gene. Phylogenetic placement has the potential to provide information beyond pure "species identification" (i.e., the association of metagenomic reads to existing species), because it can also give information about the evolutionary relationships between these query sequences and to known species. Approaches for phylogenetic placement have been developed that operate in two steps: first, an alignment is estimated for each query sequence to the alignment of the full-length sequences, and then that alignment is used to find the optimal location in the phylogenetic tree for the query sequence. Recent methods of this type include HMMALIGN+EPA, HMMALIGN+pplacer, and PaPaRa+EPA.We report on a study evaluating phylogenetic placement methods on biological and simulated data. This study shows that these methods have extremely good accuracy and computational tractability under conditions where the input contains a highly accurate alignment and tree for the full-length sequences, and the set of full-length sequences is sufficiently small and not too evolutionarily diverse; however, we also show that under other conditions accuracy declines and the computational requirements for memory and time exceed acceptable limits. We present SEPP, a general "boosting" technique to improve the accuracy and/or speed of phylogenetic placement techniques. The key algorithmic aspect of this booster is a dataset decomposition technique in SATé, a method that utilizes an iterative divide-and-conquer technique to co-estimate alignments and trees on large molecular sequence datasets. We show that SATé-boosting improves HMMALIGN+pplacer, placing short sequences more accurately when the set of input sequences has a large evolutionary diameter and produces placements of comparable accuracy in a fraction of the time for easier cases. SEPP software and the datasets used in this study are all available for free at http://www.cs.utexas.edu/users/phylo/software/sepp/submission.


Subject(s)
Metagenomics/statistics & numerical data , Phylogeny , Software , Algorithms , Bacteria/classification , Bacteria/genetics , Computational Biology , Databases, Nucleic Acid/statistics & numerical data , Evolution, Molecular , Metagenome/genetics , RNA, Bacterial/genetics , RNA, Ribosomal, 16S/genetics , Sequence Alignment/statistics & numerical data
4.
Pac Symp Biocomput ; : 25-36, 2008.
Article in English | MEDLINE | ID: mdl-18229674

ABSTRACT

Many multiple sequence alignment methods (MSAs) use guide trees in conjunction with a progressive alignment technique to generate a multiple sequence alignment but use differing techniques to produce the guide tree and to perform the progressive alignment. In this paper we explore the consequences of changing the guide tree used for the alignment routine. We evaluate four leading MSA methods (ProbCons, MAFFT, Muscle, and ClustalW) as well as a new MSA method (FTA, for "Fixed Tree Alignment") which we have developed, on a wide range of simulated datasets. Although improvements in alignment accuracy can be obtained by providing better guide trees, in general there is little effect on the "accuracy" (measured using the SP-score) of the alignment by improving the guide tree. However, RAxML-based phylogenetic analyses of alignments based upon better guide trees tend to be much more accurate. This impact is particularly significant for ProbCons, one of the best MSA methods currently available, and our method, FTA. Finally, for very good guide trees, phylogenies based upon FTA alignments are more accurate than phylogenies based upon ProbCons alignments, suggesting that further improvements in phylogenetic accuracy may be obtained through algorithms of this type.


Subject(s)
Phylogeny , Sequence Alignment/statistics & numerical data , Computational Biology , Computer Simulation , DNA/genetics , Databases, Genetic , Evolution, Molecular , Likelihood Functions , Software
5.
Bioinformatics ; 17 Suppl 1: S165-73, 2001.
Article in English | MEDLINE | ID: mdl-11473006

ABSTRACT

We report on new techniques we have developed for reconstructing phylogenies on whole genomes. Our mathematical techniques include new polynomial-time methods for bounding the inversion length of a candidate tree and new polynomial-time methods for estimating genomic distances which greatly improve the accuracy of neighbor-joining analyses. We demonstrate the power of these techniques through an extensive performance study based on simulating genome evolution under a wide range of model conditions. Combining these new tools with standard approaches (fast reconstruction with neighbor-joining, exploration of all possible refinements of strict consensus trees, etc.) has allowed us to analyze datasets that were previously considered computationally impractical. In particular, we have conducted a complete phylogenetic analysis of a subset of the Campanulaceae family, confirming various conjectures about the relationships among members of the subset and about the principal mechanism of evolution for their chloroplast genome. We give representative results of the extensive experimentation we conducted on both real and simulated datasets in order to validate and characterize our approaches. We find that our techniques provide very accurate reconstructions of the true tree topology even when the data are generated by processes that include a significant fraction of transpositions and when the data are close to saturation.


Subject(s)
Computational Biology , Genetic Techniques/statistics & numerical data , Phylogeny , Biological Evolution , Chloroplasts/genetics , Databases, Genetic , Genome, Plant , Magnoliopsida/genetics , Models, Genetic
6.
Bioinformatics ; 17 Suppl 1: S190-8, 2001.
Article in English | MEDLINE | ID: mdl-11473009

ABSTRACT

Absolute fast converging phylogenetic reconstruction methods are provably guaranteed to recover the true tree with high probability from sequences that grow only polynomially in the number of leaves, once the edge lengths are bounded arbitrarily from above and below. Only a few methods have been determined to be absolute fast converging; these have all been developed in just the last few years, and most are polynomial time. In this paper, we compare pre-existing fast converging methods as well as some new polynomial time methods that we have developed. Our study, based upon simulating evolution under a wide range of model conditions, establishes that our new methods outperform both neighbor joining and the previous fast converging methods, returning very accurate large trees, when these other methods do poorly.


Subject(s)
Computational Biology , Genetic Techniques/statistics & numerical data , Phylogeny , Computer Simulation , Databases, Genetic , Models, Genetic , Models, Statistical , Software
7.
Pac Symp Biocomput ; : 583-94, 2001.
Article in English | MEDLINE | ID: mdl-11262975

ABSTRACT

Phylogenies derived from gene order data may prove crucial in answering some fundamental open questions in biomolecular evolution. Yet very few techniques are available for such phylogenetic reconstructions. One method is breakpoint analysis, developed by Blanchette and Sankoff for solving the "breakpoint phylogeny." Our earlier studies confirmed the usefulness of this approach, but also found that BPAnalysis, the implementation developed by Sankoff and Blanchette, was too slow to use on all but very small datasets. We report here on a reimplementation of BPAnalysis using the principles of algorithmic engineering. Our faster (by 2 to 3 orders of magnitude) and flexible implementation allowed us to conduct studies on the characteristics of breakpoint analysis, in terms of running time, quality, and robustness, as well as to analyze datasets that had so far been considered out of reach. We report on these findings and also discuss future directions for our new implementation.


Subject(s)
Algorithms , Phylogeny , Databases, Factual , Genome , Software
8.
Article in English | MEDLINE | ID: mdl-10977071

ABSTRACT

The breakpoint phylogeny is an optimization problem proposed by Blanchette et al. for reconstructing evolutionary trees from gene order data. These same authors also developed and implemented BPAnalysis [3], a heuristic method (based upon solving many instances of the travelling salesman problem) for estimating the breakpoint phylogeny. We present a new heuristic for this purpose; although not polynomial-time, our heuristic is much faster in practice than BPAnalysis. We present and discuss the results of experimentation on synthetic datasets and on the flowering plant family Campanulaceae with three methods: our new method, BPAnalysis, and the neighbor-joining method [25] using several distance estimation techniques. Our preliminary results indicate that, on datasets with slow evolutionary rates and large numbers of genes in comparison with the number of taxa (genomes), all methods recover quite accurate reconstructions of the true evolutionary history (although BPAnalysis is too slow to be practical), but that on datasets where the rate of evolution is high relative to the number of genes, the accuracy of all three methods is poor.


Subject(s)
Algorithms , Phylogeny , Plants/genetics , Statistics as Topic/methods
9.
J Comput Biol ; 6(3-4): 369-86, 1999.
Article in English | MEDLINE | ID: mdl-10582573

ABSTRACT

The evolutionary history of a set of species is represented by a phylogenetic tree, which is a rooted, leaf-labeled tree, where internal nodes represent ancestral species and the leaves represent modern day species. Accurate (or even boundedly inaccurate) topology reconstructions of large and divergent trees from realistic length sequences have long been considered one of the major challenges in systematic biology. In this paper, we present a simple method, the Disk-Covering Method (DCM), which boosts the performance of base phylogenetic methods under various Markov models of evolution. We analyze the performance of DCM-boosted distance methods under the Jukes-Cantor Markov model of biomolecular sequence evolution, and prove that for almost all trees, polylogarithmic length sequences suffice for complete accuracy with high probability, while polynomial length sequences always suffice. We also provide an experimental study based upon simulating sequence evolution on model trees. This study confirms substantial reductions in error rates at realistic sequence lengths.


Subject(s)
Biometry/methods , Phylogeny , Algorithms , Evaluation Studies as Topic , Markov Chains , Models, Statistical
10.
Article in English | MEDLINE | ID: mdl-10786294

ABSTRACT

In an earlier paper, we described a new method for phylogenetic tree reconstruction called the Disk Covering Method, or DCM. This is a general method which can be used with any existing phylogenetic method in order to improve its performance. We showed analytically and experimentally that when DCM is used in conjunction with polynomial time distance-based methods, it improves the accuracy of the trees reconstructed. In this paper, we discuss a variant on DCM, that we call DCM2. DCM2 is designed to be used with phylogenetic methods whose objective is the solution of NP-hard optimization problems. We show that DCM2 can be used to accelerate searches for Maximum Parsimony trees. We also motivate the need for solutions to NP-hard optimization problems by showing that on some very large and important datasets, the most popular (and presumably best performing) polynomial time distance methods have poor accuracy.


Subject(s)
Phylogeny , Sequence Analysis, DNA/methods , Software , Databases, Factual , Markov Chains , Reproducibility of Results
11.
J Comput Biol ; 5(3): 391-407, 1998.
Article in English | MEDLINE | ID: mdl-9773340

ABSTRACT

Evolutionary tree reconstruction is a challenging problem with important applications in biology and linguistics. In biology, one of the most promising approaches to tree reconstruction is to find the "maximum parsimony" tree, while in linguistics, the use of the "maximum compatibility" method has been very useful. However, these problems are NP-hard, and current approaches to solving these problems amount to heuristic searches through the space of possible tree topologies (a search which can, on large trees, take months to complete). In this paper, we present a new technique, Optimal Tree Refinement, for reconstructing very large trees. Our technique is motivated by recent experimental studies which have shown that certain polynomial time methods often return contractions of the true tree. We study the use of this technique in solving maximum parsimony and maximum compatibility, and present both hardness results and polynomial time algorithms.


Subject(s)
Algorithms , Phylogeny , Computational Biology/methods
12.
Proc Natl Acad Sci U S A ; 94(13): 6585-90, 1997 Jun 24.
Article in English | MEDLINE | ID: mdl-11607730

ABSTRACT

The inference of the evolutionary history of a set of languages is a complex problem. Although some languages are known to be related through descent from common ancestral languages, for other languages determining whether such a relationship holds is itself a difficult problem. In this paper we report on new methods, developed by linguists Johanna Nichols (University of California, Berkeley), Donald Ringe and Ann Taylor (University of Pennsylvania, Philadelphia), and me, for answering some of the most difficult questions in this domain. These methods and the results of the analyses based on these methods were presented in November 1995 at the Symposium on the Frontiers of Science held by the National Academy of Sciences.

13.
J Comput Biol ; 2(4): 515-25, 1995.
Article in English | MEDLINE | ID: mdl-8634903

ABSTRACT

We propose a new model of computation for deriving phylogenetic trees based upon a generalization of qualitative characters. The model we propose is based upon recent experimental research in molecular biology. We show that the general case of determining perfect compatibility of generalized ordered characters is an NP-complete problem, but can be solved in polynomial time for a special case.


Subject(s)
Mathematics , Models, Genetic , Phylogeny , Algorithms , Anatomy, Comparative , Animals , Chickens , Evolution, Molecular , Foot/anatomy & histology , Genome , Humans , Species Specificity , Tooth/anatomy & histology , Whales
SELECTION OF CITATIONS
SEARCH DETAIL
...