Results 1 - 12 of 12
1.
Syst Biol ; 67(2): 320-327, 2018 Mar 01.
Article in English | MEDLINE | ID: mdl-29029295

ABSTRACT

Most existing measures of distance between phylogenetic trees are based on the geometry or topology of the trees. Instead, we consider distance measures which are based on the underlying probability distributions on genetic sequence data induced by trees. Monte Carlo schemes are necessary to calculate these distances approximately, and we describe efficient sampling procedures. Key features of the distances are the ability to include substitution model parameters and to handle trees with different taxon sets in a principled way. We demonstrate some of the properties of these new distance measures and compare them to existing distances, in particular by applying multidimensional scaling to data sets previously reported as containing phylogenetic islands. [Metric; probability distribution; multidimensional scaling; information geometry.]
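
As a toy illustration of estimating a distribution-based distance by Monte Carlo, the sketch below estimates the squared Hellinger distance between two small categorical distributions by sampling from one of them. The choice of the Hellinger distance, the distributions `p` and `q`, and the function names are assumptions made for illustration. The abstract does not say which distances or sampling scheme the authors use, and in a real analysis the site-pattern probabilities would come from a tree likelihood rather than a fixed table.

```python
import math
import random


def hellinger_sq_exact(p, q):
    """Exact squared Hellinger distance: 1 - sum_x sqrt(p(x) * q(x))."""
    return 1.0 - sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def hellinger_sq_monte_carlo(p, q, n_samples, rng):
    """Estimate 1 - E_p[sqrt(q(X) / p(X))] from draws X ~ p."""
    states = range(len(p))
    draws = rng.choices(states, weights=p, k=n_samples)
    mean_ratio = sum(math.sqrt(q[x] / p[x]) for x in draws) / n_samples
    return 1.0 - mean_ratio


# Hypothetical distributions over three site patterns.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]

rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = hellinger_sq_monte_carlo(p, q, 20000, rng)
exact = hellinger_sq_exact(p, q)
print(f"Monte Carlo: {estimate:.3f}, exact: {exact:.3f}")
```

With only three states the exact value is cheap to compute, which makes it a convenient check on the estimator. For sequence alignments the number of site patterns grows exponentially with the number of taxa, which is presumably why sampling is needed.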


Subject(s)
Classification/methods , Models, Genetic , Phylogeny , Monte Carlo Method , Probability
2.
Mol Biol Evol ; 35(4): 984-1002, 2018 04 01.
Article in English | MEDLINE | ID: mdl-29149300

ABSTRACT

Most phylogenetic models assume that the evolutionary process is stationary and reversible. In addition to being biologically improbable, these assumptions also impair inference by generating models under which the likelihood does not depend on the position of the root. Consequently, the root of the tree cannot be inferred as part of the analysis. Yet identifying the root position is a key component of phylogenetic inference because it provides a point of reference for polarizing ancestor-descendant relationships and therefore interpreting the tree. In this paper, we investigate the effect of relaxing the unrealistic reversibility assumption and allowing the position of the root to be another unknown. We propose two hierarchical models that are centered on a reversible model but perturbed to allow nonreversibility. The models differ in the degree of structure imposed on the perturbations. The analysis is performed in the Bayesian framework using Markov chain Monte Carlo methods for which software is provided. We illustrate the performance of the two nonreversible models in analyses of simulated data using two types of topological priors. We then apply the models to a real biological data set, the radiation of polyploid yeasts, for which there is robust biological opinion about the root position. Finally, we apply the models to a second biological alignment for which the rooted tree is controversial: the ribosomal tree of life. We compare the two nonreversible models and conclude that both are useful in inferring the position of the root from real biological data.
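
The statement that the likelihood does not depend on the root position under a stationary, reversible model can be checked numerically on the smallest possible case. The sketch below uses the Jukes-Cantor model on a two-leaf tree and slides the root along the path joining the leaves. The model choice, branch lengths, and function names are assumptions for illustration, not the models proposed in the paper.

```python
import math

STATES = "ACGT"


def jc69_prob(x, y, t):
    """Jukes-Cantor probability of state x changing to y over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def two_leaf_likelihood(a, b, t_left, t_right):
    """Likelihood of leaves a and b with the root t_left from a and t_right from b."""
    return sum(
        0.25 * jc69_prob(root, a, t_left) * jc69_prob(root, b, t_right)
        for root in STATES
    )


total = 0.6  # total path length between the two leaves
# Slide the root along the single path joining the two leaves.
likelihoods = [
    two_leaf_likelihood("A", "G", t, total - t) for t in (0.0, 0.1, 0.3, 0.6)
]
print(likelihoods)
```

All four values agree up to floating-point error, so the data carry no information about where the root sits. A nonreversible model breaks this symmetry, which is what makes root inference possible.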


Subject(s)
Models, Genetic , Phylogeny , Bayes Theorem , Markov Chains , Monte Carlo Method , Ribosomes , Saccharomyces cerevisiae
3.
Biometrika ; 104(4): 901-922, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29422694

ABSTRACT

Evolutionary relationships are represented by phylogenetic trees, and a phylogenetic analysis of gene sequences typically produces a collection of these trees, one for each gene in the analysis. Analysis of samples of trees is difficult due to the multi-dimensionality of the space of possible trees. In Euclidean spaces, principal component analysis is a popular method of reducing high-dimensional data to a low-dimensional representation that preserves much of the sample's structure. However, the space of all phylogenetic trees on a fixed set of species does not form a Euclidean vector space, and methods adapted to tree space are needed. Previous work introduced the notion of a principal geodesic in this space, analogous to the first principal component. Here we propose a geometric object for tree space similar to the kth principal component in Euclidean space: the locus of the weighted Fréchet mean of k + 1 vertex trees when the weights vary over the k-simplex. We establish some basic properties of these objects, in particular showing that they have dimension k, and propose algorithms for projection onto these surfaces and for finding the principal locus associated with a sample of trees. Simulation studies demonstrate that these algorithms perform well, and analyses of two datasets, containing Apicomplexa and African coelacanth genomes respectively, reveal important structure from the second principal components.

4.
Philos Trans R Soc Lond B Biol Sci ; 370(1678): 20140336, 2015 Sep 26.
Article in English | MEDLINE | ID: mdl-26323766

ABSTRACT

The root of a phylogenetic tree is fundamental to its biological interpretation, but standard substitution models do not provide any information on its position. Here, we describe two recently developed models that relax the usual assumptions of stationarity and reversibility, thereby facilitating root inference without the need for an outgroup. We compare the performance of these models on a classic test case for phylogenetic methods, before considering two highly topical questions in evolutionary biology: the deep structure of the tree of life and the root of the archaeal radiation. We show that all three alignments contain meaningful rooting information that can be harnessed by these new models, thus complementing and extending previous work based on outgroup rooting. In particular, our analyses exclude the root of the tree of life from the eukaryotes or Archaea, placing it on the bacterial stem or within the Bacteria. They also exclude the root of the archaeal radiation from several major clades, consistent with analyses using other rooting methods. Overall, our results demonstrate the utility of non-reversible and non-stationary models for rooting phylogenetic trees, and identify areas where further progress can be made.


Subject(s)
Computer Simulation , Models, Genetic , Phylogeny , Archaea/genetics , Bacteria/genetics , Genetic Variation
5.
Stat Appl Genet Mol Biol ; 13(5): 589-609, 2014 Oct.
Article in English | MEDLINE | ID: mdl-25153609

ABSTRACT

In molecular phylogenetics, standard models of sequence evolution generally assume that sequence composition remains constant over evolutionary time. However, this assumption is violated in many datasets which show substantial heterogeneity in sequence composition across taxa. We propose a model which allows compositional heterogeneity across branches, and formulate the model in a Bayesian framework. Specifically, the root and each branch of the tree are associated with their own composition vectors whilst a global matrix of exchangeability parameters applies everywhere on the tree. We encourage borrowing of strength between branches by developing two possible priors for the composition vectors: one in which information can be exchanged equally amongst all branches of the tree and another in which more information is exchanged between neighbouring branches than between distant branches. We also propose a Markov chain Monte Carlo (MCMC) algorithm for posterior inference which uses data augmentation of substitutional histories to yield a simple complete data likelihood function that factorises over branches and allows Gibbs updates for most parameters. Standard phylogenetic models are not informative about the root position. Therefore, a significant advantage of the proposed model is that it allows inference about rooted trees. The position of the root is fundamental to the biological interpretation of trees, both for polarising trait evolution and for establishing the order of divergence among lineages. Furthermore, unlike some other related models from the literature, inference in the model we propose can be carried out through a simple MCMC scheme which does not require problematic dimension-changing moves. We investigate the performance of the model and priors in analyses of two alignments for which there is strong biological opinion about the tree topology and root position.
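
One way to picture branch-specific composition vectors sharing a single exchangeability matrix is the usual general time-reversible construction, in which the off-diagonal rate from state i to state j is the exchangeability times the target composition. The sketch below assumes that construction. The numerical values, branch names, and the `rate_matrix` helper are hypothetical, and the abstract does not confirm that the paper uses exactly this parameterization.

```python
def rate_matrix(exchangeabilities, composition):
    """Build Q with q_ij = r_ij * pi_j for i != j and rows summing to zero."""
    n = len(composition)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchangeabilities[i][j] * composition[j]
        # The diagonal is still zero here, so this is minus the off-diagonal sum.
        q[i][i] = -sum(q[i])
    return q


# One symmetric exchangeability matrix shared by every branch (hypothetical values).
R = [
    [0.0, 1.0, 4.0, 1.0],
    [1.0, 0.0, 1.0, 4.0],
    [4.0, 1.0, 0.0, 1.0],
    [1.0, 4.0, 1.0, 0.0],
]

# Distribution of states (A, C, G, T) at the root (hypothetical values).
root_composition = [0.25, 0.25, 0.25, 0.25]

# Branch-specific composition vectors (hypothetical values).
branch_compositions = {
    "branch_1": [0.40, 0.10, 0.10, 0.40],  # AT-rich lineage
    "branch_2": [0.10, 0.40, 0.40, 0.10],  # GC-rich lineage
}

Q = {name: rate_matrix(R, pi) for name, pi in branch_compositions.items()}
```

Each branch's matrix has its own composition vector as its stationary distribution, so composition can drift across the tree even though the exchangeabilities are held fixed.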


Subject(s)
Bayes Theorem , Phylogeny , Markov Chains , Monte Carlo Method
6.
Article in English | MEDLINE | ID: mdl-26355778

ABSTRACT

Most phylogenetic analyses result in a sample of trees, but summarizing and visualizing these samples can be challenging. Consensus trees often provide limited information about a sample, and so methods such as consensus networks, clustering and multidimensional scaling have been developed and applied to tree samples. This paper describes a stochastic algorithm for constructing a principal geodesic or line through treespace which is analogous to the first principal component in standard principal components analysis. A principal geodesic summarizes the most variable features of a sample of trees, in terms of both tree topology and branch lengths, and it can be visualized as an animation of smoothly changing trees. The algorithm performs a stochastic search through parameter space for a geodesic which minimizes the sum of squared projected distances of the data points. This procedure aims to identify the globally optimal principal geodesic, though convergence to locally optimal geodesics is possible. The methodology is illustrated by constructing principal geodesics for experimental and simulated data sets, demonstrating the insight into samples of trees that can be gained and how the method improves on a previously published approach. A Java package called GeoPhytter for constructing and visualizing principal geodesics is freely available from www.ncl.ac.uk/ ntmwn/geophytter.
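
The idea of a stochastic search that minimizes the sum of squared projected distances can be sketched in the Euclidean plane, where a "geodesic" is just a straight line. This is a stand-in under stated assumptions, not the GeoPhytter algorithm: the points, the single-angle parameterization, the step size, and the greedy acceptance rule are all hypothetical, and treespace geodesics are considerably harder to work with.

```python
import math
import random


def squared_distance_to_line(point, angle):
    """Squared distance from a 2-D point to the line through the origin at `angle`."""
    x, y = point
    # Component of the point perpendicular to the line's direction.
    perpendicular = -x * math.sin(angle) + y * math.cos(angle)
    return perpendicular ** 2


def cost(points, angle):
    """Sum of squared distances from the points to their projections on the line."""
    return sum(squared_distance_to_line(p, angle) for p in points)


def stochastic_search(points, n_steps, step_size, rng):
    """Propose random perturbations of the angle and keep those that lower the cost."""
    angle = rng.uniform(0.0, math.pi)
    best = cost(points, angle)
    start = best
    for _ in range(n_steps):
        proposal = angle + rng.gauss(0.0, step_size)
        proposal_cost = cost(points, proposal)
        if proposal_cost < best:
            angle, best = proposal, proposal_cost
    return angle, best, start


# Hypothetical centred data lying close to the line y = x.
points = [(-2.0, -2.1), (-1.0, -0.9), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]

rng = random.Random(1)  # fixed seed so the run is reproducible
angle, best, start = stochastic_search(points, 2000, 0.2, rng)
print(f"cost went from {start:.3f} to {best:.3f}")
```

In this Euclidean toy the cost has a single minimum up to rotation by half a turn, so a greedy search is enough. In treespace the cost surface can have several local optima, which is the caveat the abstract raises about convergence.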


Subject(s)
Algorithms , Computational Biology/methods , Phylogeny , Chaperonins/classification , Chaperonins/genetics , Cluster Analysis , Models, Genetic , Principal Component Analysis
7.
Proc Biol Sci ; 279(1749): 4870-9, 2012 Dec 22.
Article in English | MEDLINE | ID: mdl-23097517

ABSTRACT

Determining the relationships among the major groups of cellular life is important for understanding the evolution of biological diversity, but is difficult given the enormous time spans involved. In the textbook 'three domains' tree based on informational genes, eukaryotes and Archaea share a common ancestor to the exclusion of Bacteria. However, some phylogenetic analyses of the same data have placed eukaryotes within the Archaea, as the nearest relatives of different archaeal lineages. We compared the support for these competing hypotheses using sophisticated phylogenetic methods and an improved sampling of archaeal biodiversity. We also employed both new and existing tests of phylogenetic congruence to explore the level of uncertainty and conflict in the data. Our analyses suggested that much of the observed incongruence is weakly supported or associated with poorly fitting evolutionary models. All of our phylogenetic analyses, whether on small subunit and large subunit ribosomal RNA or concatenated protein-coding genes, recovered a monophyletic group containing eukaryotes and the TACK archaeal superphylum comprising the Thaumarchaeota, Aigarchaeota, Crenarchaeota and Korarchaeota. Hence, while our results provide no support for the iconic three-domain tree of life, they are consistent with an extended eocyte hypothesis whereby vital components of the eukaryotic nuclear lineage originated from within the archaeal radiation.


Subject(s)
Archaea/classification , Archaea/genetics , Eukaryota/classification , Eukaryota/genetics , Evolution, Molecular , Genes, rRNA , Phylogeny , Proteins/genetics , Sequence Analysis, RNA , Sequence Homology
8.
Stat Methods Med Res ; 18(5): 487-504, 2009 Oct.
Article in English | MEDLINE | ID: mdl-19153166

ABSTRACT

A number of biological processes can lead to genes being copied within the genome of some given species. Duplicate genes of this form are called paralogs and such genes share a high degree of sequence similarity as well as often having closely related functions. Some genes have become widely duplicated to form multi-gene families in which the copies are distributed both within the genomes of individual species and across different species. Statistical modelling of gene duplication and the evolution of multi-gene families currently lags behind well-established models of DNA sequence evolution despite an increasing volume of available data, but the analysis of multi-gene families is important as part of a wider effort to understand evolution at the genomic level. This article reviews existing approaches to modelling multi-gene families and presents various challenges and possibilities for this exciting area of research.


Subject(s)
Evolution, Molecular , Genomics , Models, Genetic , Models, Statistical , Multigene Family , Sequence Analysis, DNA/methods , Gene Deletion , Gene Duplication , Humans
9.
Syst Biol ; 57(5): 785-94, 2008 Oct.
Article in English | MEDLINE | ID: mdl-18853364

ABSTRACT

Phylogenetic analysis very commonly produces several alternative trees for a given fixed set of taxa. For example, different sets of orthologous genes may be analyzed, or the analysis may sample from a distribution of probable trees. This article describes an approach to comparing and visualizing multiple alternative phylogenies via the idea of a "tree of trees" or "meta-tree." A meta-tree clusters phylogenies with similar topologies together in the same way that a phylogeny clusters species with similar DNA sequences. Leaf nodes on a meta-tree correspond to the original set of phylogenies given by some analysis, whereas interior nodes correspond to certain consensus topologies. The construction of meta-trees is motivated by analogy with construction of a most parsimonious tree for DNA data, but instead of using DNA letters, in a meta-tree the characters are partitions or splits of the set of taxa. An efficient algorithm for meta-tree construction is described that makes use of a known relationship between the majority consensus and parsimony in terms of gain and loss of splits. To illustrate these ideas, meta-trees are constructed for two datasets: a set of gene trees for species of yeast and trees from a bootstrap analysis of a set of gene trees in ray-finned fish. A software tool for constructing meta-trees and comparing alternative phylogenies is available online, and the source code can be obtained from the author.
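
Treating splits as characters can be made concrete with a small sketch: each tree is encoded as the set of its non-trivial splits, the majority consensus keeps the splits found in more than half of the trees, and the number of splits gained or lost between the consensus and a tree plays the role of a parsimony cost. The taxa, the three example trees, and the helper names are hypothetical, and this is not the meta-tree construction algorithm itself.

```python
from collections import Counter


def split(side, taxa):
    """Canonical form of a split: the side that omits the alphabetically first taxon."""
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side


def majority_consensus(trees):
    """Return the splits present in more than half of the trees."""
    counts = Counter(s for tree in trees for s in tree)
    return {s for s, c in counts.items() if c > len(trees) / 2}


TAXA = {"A", "B", "C", "D", "E"}

# Each tree is encoded as the set of its non-trivial splits (hypothetical data).
tree_1 = {split("AB", TAXA), split("ABC", TAXA)}  # ((A,B),C,(D,E))
tree_2 = {split("AB", TAXA), split("ABD", TAXA)}  # ((A,B),D,(C,E))
tree_3 = {split("AB", TAXA), split("DE", TAXA)}   # same topology as tree_1
trees = [tree_1, tree_2, tree_3]

consensus = majority_consensus(trees)

# Splits gained or lost when moving from the consensus to each tree.
changes = [len(tree ^ consensus) for tree in trees]
print(changes)
```

Because the canonical form always stores the side without the reference taxon, `split("ABC", TAXA)` and `split("DE", TAXA)` are the same object, so `tree_1` and `tree_3` compare equal even though they were written down differently.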


Subject(s)
Evolution, Molecular , Genetic Variation , Phylogeny , Animals , Classification/methods , DNA/genetics , Fishes/genetics , Plants/genetics
10.
Stat Appl Genet Mol Biol ; 5: Article5, 2006.
Article in English | MEDLINE | ID: mdl-16646869

ABSTRACT

Experiments to determine the complete 3-dimensional structures of protein complexes are difficult to perform and only a limited range of such structures is available. In contrast, large-scale screening experiments have identified thousands of pairwise interactions between proteins, but such experiments do not produce explicit structural information. In addition, the data produced by these high-throughput experiments contain large numbers of false positive results, and can be biased against detection of certain types of interaction. Several methods exist that analyse such pairwise interaction data in terms of the constituent domains within proteins, scoring pairs of domain superfamilies according to their propensity to interact. These scores can be used to predict the strongest domain-domain contact (the contact with the largest surface area) between interacting proteins for which the domain-level structures of the individual proteins are known. We test this predictive approach on a set of pairwise protein interactions taken from the Protein Quaternary Structure (PQS) database for which the true domain-domain contacts are known. While the overall prediction success rate across the whole test data set is poor, we show how interactions in the test data set for which the training data are not informative can be automatically excluded from the prediction process, giving improved prediction success rates at the expense of restricted coverage of the test data.


Subject(s)
Protein Interaction Mapping/methods , Protein Structure, Tertiary , Binding Sites , Data Interpretation, Statistical , Databases, Protein , Saccharomyces cerevisiae Proteins/chemistry , Saccharomyces cerevisiae Proteins/metabolism , Two-Hybrid System Techniques
11.
Bioinformatics ; 22(1): 117-9, 2006 Jan 01.
Article in English | MEDLINE | ID: mdl-16234319

ABSTRACT

SUMMARY: We describe an algorithm and software tool for comparing alternative phylogenetic trees. The main application of the software is to compare phylogenies obtained using different phylogenetic methods for some fixed set of species or obtained using different gene sequences from those species. The algorithm pairs up each branch in one phylogeny with a matching branch in the second phylogeny and finds the optimum 1-to-1 map between branches in the two trees in terms of a topological score. The software enables the user to explore the corresponding mapping between the phylogenies interactively, and clearly highlights those parts of the trees that differ, both in terms of topology and branch length. AVAILABILITY: The software is implemented as a Java applet at http://www.mrc-bsu.cam.ac.uk/personal/thomas/phylo_comparison/comparison_page.html. It is also available on request from the authors.
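
The branch-pairing step can be illustrated with a small brute-force sketch: each branch is represented by the set of taxa below it, pairs of branches are scored by Jaccard similarity, and every 1-to-1 pairing is tried to find the one with the highest total. The Jaccard score, the exhaustive search, the rooted-clade encoding, and the example trees are assumptions for illustration. The abstract does not define the topological score or the optimization method the software uses.

```python
from itertools import permutations


def jaccard(a, b):
    """Similarity of two branches, each given as the set of taxa below it."""
    return len(a & b) / len(a | b)


def best_matching(branches_1, branches_2):
    """Brute-force the 1-to-1 pairing of branches that maximizes total similarity.

    Assumes both trees have the same number of branches.
    """
    best_score, best_pairs = -1.0, None
    for order in permutations(range(len(branches_2))):
        score = sum(
            jaccard(branches_1[i], branches_2[j]) for i, j in enumerate(order)
        )
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(order))
    return best_score, best_pairs


# Internal branches of two hypothetical rooted trees on taxa A to E.
branches_a = [frozenset("AB"), frozenset("ABC"), frozenset("DE")]  # (((A,B),C),(D,E))
branches_b = [frozenset("DE"), frozenset("CDE"), frozenset("AB")]  # ((A,B),(C,(D,E)))

score, pairs = best_matching(branches_a, branches_b)

# Pairs whose similarity is below 1 mark where the two trees disagree.
differing = [(i, j) for i, j in pairs if jaccard(branches_a[i], branches_b[j]) < 1.0]
print(pairs, differing)
```

Exhaustive search is only workable for a handful of branches because the number of pairings grows factorially. A tool handling realistic trees would need an assignment algorithm instead.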


Subject(s)
Computational Biology/methods , Algorithms , Computer Graphics , HIV/genetics , Internet , Models, Genetic , Models, Statistical , Phylogeny , Programming Languages , Sequence Alignment , Software , User-Computer Interface
12.
Bioinformatics ; 21(7): 993-1001, 2005 Apr 01.
Article in English | MEDLINE | ID: mdl-15509600

ABSTRACT

MOTIVATION: Several methods have recently been developed to analyse large-scale sets of physical interactions between proteins in terms of physical contacts between the constituent domains, often with a view to predicting new pairwise interactions. Our aim is to combine genomic interaction data, in which domain-domain contacts are not explicitly reported, with the domain-level structure of individual proteins, in order to learn about the structure of interacting protein pairs. Our approach is driven by the need to assess the evidence for physical contacts between domains in a statistically rigorous way. RESULTS: We develop a statistical approach that assigns p-values to pairs of domain superfamilies, measuring the strength of evidence within a set of protein interactions that domains from these superfamilies form contacts. A set of p-values is calculated for SCOP superfamily pairs, based on a pooled data set of interactions from yeast. These p-values can be used to predict which domains come into contact in an interacting protein pair. This predictive scheme is tested against protein complexes in the Protein Quaternary Structure (PQS) database, and is used to predict domain-domain contacts within 705 interacting protein pairs taken from our pooled data set.


Subject(s)
Algorithms , Databases, Protein , Models, Chemical , Protein Interaction Mapping/methods , Saccharomyces cerevisiae Proteins/chemistry , Saccharomyces cerevisiae Proteins/metabolism , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Binding Sites , Computer Simulation , Models, Statistical , Protein Binding , Protein Structure, Tertiary , Saccharomyces cerevisiae Proteins/analysis , Saccharomyces cerevisiae Proteins/classification , Structure-Activity Relationship