1.
PeerJ ; 12: e16624, 2024.
Article in English | MEDLINE | ID: mdl-38188165

ABSTRACT

The Open Tree of Life (OToL) project produces a supertree that summarizes phylogenetic knowledge from tree estimates published in the primary literature. The supertree construction algorithm iteratively calls Aho's Build algorithm thousands of times in order to assess the compatibility of different phylogenetic groupings. We describe an incrementalized version of the Build algorithm that is able to share work between successive calls to Build. We provide details that allow a programmer to implement the incremental algorithm BuildInc, including pseudo-code and a description of data structures. We assess the effect of BuildInc on our supertree algorithm by analyzing simulated data and by analyzing a supertree problem taken from the OpenTree 13.4 synthesis tree. We find that BuildInc provides up to 550-fold speedup for our supertree algorithm.
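To make the compatibility check concrete, here is a minimal sketch of the classic, non-incremental Build algorithm of Aho et al. It is not the BuildInc algorithm described in the abstract. The rooted-triple encoding `(a, b, c)` (meaning a and b are closer to each other than to c), the nested-tuple tree representation, and the helper names `leaf_set` and `clades` are assumptions made for illustration.

```python
def build(leaves, triples):
    """Return a nested-tuple tree displaying every triple, or None if incompatible.

    Each triple (a, b, c) states that a and b are more closely related
    to each other than either is to c (often written ab|c).
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Keep only the triples whose three leaves are all present.
    relevant = [t for t in triples if leaves.issuperset(t)]
    # Union-find over the leaves: join a and b for each triple ab|c.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, _ in relevant:
        parent[find(a)] = find(b)
    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)
    if len(components) == 1:
        return None  # no tree can display all of the triples
    children = []
    for comp in components.values():
        sub = build(comp, relevant)
        if sub is None:
            return None
        children.append(sub)
    return tuple(children)


def leaf_set(tree):
    """Leaves below a node of a nested-tuple tree."""
    if isinstance(tree, tuple):
        return frozenset().union(*(leaf_set(c) for c in tree))
    return frozenset([tree])


def clades(tree):
    """Leaf sets of every internal node, for order-independent comparison."""
    if not isinstance(tree, tuple):
        return set()
    found = {leaf_set(tree)}
    for child in tree:
        found |= clades(child)
    return found
```

Each call rebuilds the connected components from scratch, which is the repeated work an incremental variant would aim to share between successive calls.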


Subject(s)
Algorithms , Knowledge , Phylogeny
2.
PLoS Comput Biol ; 17(5): e1008924, 2021 05.
Article in English | MEDLINE | ID: mdl-33983918

ABSTRACT

The "multispecies" coalescent (MSC) model that underlies many genomic species-delimitation approaches is problematic because it does not distinguish between genetic structure associated with species versus that of populations within species. Consequently, as both the genomic and spatial resolution of data increase, a proliferation of artifactual species results as within-species population lineages, detected due to restrictions in gene flow, are identified as distinct species. The toll of this extends beyond systematic studies, getting magnified across the many disciplines that rely upon an accurate framework of identified species. Here we present the first of a new class of approaches that addresses this issue by incorporating an extended speciation process for species delimitation. We model the formation of population lineages and their subsequent development into independent species as separate processes and provide for a way to incorporate current understanding of the species boundaries in the system through specification of species identities of a subset of population lineages. As a result, species boundaries and within-species lineage boundaries can be discriminated across the entire system, and species identities can be assigned to the remaining lineages of unknown affinities with quantified probabilities. In addition to the identification of species units in nature, the primary goal of species delimitation, the incorporation of a speciation model also allows us insights into the links between population and species-level processes. By explicitly accounting for restrictions in gene flow not only between, but also within, species, we also address the limits of genetic data for delimiting species. Specifically, while genetic data alone is not sufficient for accurate delimitation, when considered in conjunction with other information we are able to not only learn about species boundaries, but also about the tempo of the speciation process itself.


Subject(s)
Genetic Speciation , Models, Genetic , Algorithms , Animals , Computational Biology , Computer Simulation , Gene Flow , Genetics, Population , Models, Statistical , Phylogeny , Software , Species Specificity , Time Factors
3.
Syst Biol ; 70(6): 1295-1301, 2021 10 13.
Article in English | MEDLINE | ID: mdl-33970279

ABSTRACT

The Open Tree of Life project constructs a comprehensive, dynamic, and digitally available tree of life by synthesizing published phylogenetic trees along with taxonomic data. Open Tree of Life provides web-service application programming interfaces (APIs) to make the tree estimate, unified taxonomy, and input phylogenetic data available to anyone. Here, we describe the Python package opentree, which provides a user-friendly Python wrapper for these APIs and a set of scripts and tutorials for straightforward downstream data analyses. We demonstrate the utility of these tools by generating an estimate of the phylogenetic relationships of all bird families, and by capturing a phylogenetic estimate for all taxa observed at the University of California Merced Vernal Pools and Grassland Reserve. [Evolution; open science; phylogenetics; Python; taxonomy.]


Subject(s)
Data Analysis , Software , Humans , Phylogeny
4.
Am J Bot ; 107(8): 1189-1197, 2020 08.
Article in English | MEDLINE | ID: mdl-32864742

ABSTRACT

PREMISE: The mating system has profound consequences, not only for ecology and evolution, but also for the conservation of threatened or endangered species. Unfortunately, small populations are difficult to study owing to limits on sample size and genetic marker diversity. Here, we estimated mating system parameters in three small populations of an island plant using genomic genotyping. Although self-incompatible (SI) species are known to often set some self-seed, little is known about how "leaky SI" affects selfing rates in nature or the role that multiple paternity plays in small populations. METHODS: We generalized the BORICE mating system program to determine the siring pattern within maternal families. We applied this algorithm to maternal families from three populations of Tolpis succulenta from Madeira Island and genotyped the progeny using RADseq. We applied BORICE to estimate each individual offspring as outcrossed or selfed, the paternity of each outcrossed offspring, and the level of inbreeding of each maternal plant. RESULTS: Despite a functional self-incompatibility system, these data establish T. succulenta as a pseudo-self-compatible (PSC) species. Two of 75 offspring were strongly indicated as products of self-fertilization. Despite selfing, all adult maternal plants were fully outbred. There was high differentiation among and low variation within populations, consistent with a history of genetic isolation of these small populations. There were generally multiple sires per maternal family. Twenty-two percent of sib contrasts (between outcrossed offspring within maternal families) shared the same sire. CONCLUSIONS: Genome-wide genotyping, combined with appropriate analytical methods, enables estimation of mating system and multiple paternity in small populations. These data address questions about the evolution of reproductive traits and the conservation of threatened populations.


Subject(s)
Paternity , Self-Fertilization , Genotype , Islands , Portugal , Reproduction
5.
PeerJ ; 5: e3058, 2017.
Article in English | MEDLINE | ID: mdl-28265520

ABSTRACT

We present a new supertree method that enables rapid estimation of a summary tree on the scale of millions of leaves. This supertree method summarizes a collection of input phylogenies and an input taxonomy. We introduce formal goals and criteria for such a supertree to satisfy in order to transparently and justifiably represent the input trees. In addition to producing a supertree, our method computes annotations that describe which groupings in the input trees support and conflict with each group in the supertree. We compare our supertree construction method to a previously published supertree construction method by assessing their performance on input trees used to construct the Open Tree of Life version 4, and find that our method increases the number of displayed input splits from 35,518 to 39,639 and decreases the number of conflicting input splits from 2,760 to 1,357. The new supertree method also improves on the previous supertree construction method in that it produces no unsupported branches and avoids unnecessary polytomies. This pipeline is currently used by the Open Tree of Life project to produce all versions of the project's "synthetic tree" starting at version 5. This software pipeline is called "propinquity". It relies heavily on "otcetera", a set of C++ tools to perform most of the steps of the pipeline. All of the components are free software and are available on GitHub.

6.
Mol Phylogenet Evol ; 93: 289-95, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26256643

ABSTRACT

Statistically consistent estimation of phylogenetic trees or gene trees is possible if pairwise sequence dissimilarities can be converted to a set of distances that are proportional to the true evolutionary distances. Susko et al. (2004) reported some strikingly broad results about the forms of inconsistency in tree estimation that can arise if corrected distances are not proportional to the true distances. They showed that if the corrected distance is a concave function of the true distance, then inconsistency due to long branch attraction will occur. If these functions are convex, then two "long branch repulsion" trees will be preferred over the true tree - though these two incorrect trees are expected to be tied as the preferred tree. Here we extend their results, and demonstrate the existence of a tree shape (which we refer to as a "twisted Farris-zone" tree) for which a single incorrect tree topology will be guaranteed to be preferred if the corrected distance function is convex. We also report that the standard practice of treating gaps in sequence alignments as missing data is sufficient to produce non-linear corrected distance functions if the substitution process is not independent of the insertion/deletion process. Taken together, these results imply inconsistent tree inference under mild conditions. For example, if some positions in a sequence are constrained to be free of substitutions and insertion/deletion events while the remaining sites evolve with independent substitutions and insertion/deletion events, then the distances obtained by treating gaps as missing data can support an incorrect tree topology even given an unlimited amount of data.
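As a concrete instance of converting a pairwise dissimilarity into a corrected distance, the sketch below uses the Jukes-Cantor (JC69) correction. The choice of JC69 and the function names are assumptions for illustration; the abstract concerns corrected-distance functions in general. When the data really follow JC69, the corrected distance recovers the true distance exactly, so it is a linear function of it, which is the well-behaved case the abstract contrasts with concave and convex corrections.

```python
import math


def jc_dissimilarity(t):
    """Expected proportion of differing sites after t substitutions per site (JC69)."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))


def jc_corrected_distance(p):
    """Invert the JC69 curve: observed dissimilarity p -> estimated distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 correction is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```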


Subject(s)
Models, Genetic , Algorithms , Biological Evolution , INDEL Mutation , Phylogeny , Sequence Alignment
7.
Bioinformatics ; 31(17): 2794-800, 2015 Sep 01.
Article in English | MEDLINE | ID: mdl-25940563

ABSTRACT

MOTIVATION: Phylogenetic estimates from published studies can be archived using general platforms like Dryad (Vision, 2010) or TreeBASE (Sanderson et al., 1994). Such services fulfill a crucial role in ensuring transparency and reproducibility in phylogenetic research. However, digital tree data files often require some editing (e.g. rerooting) to improve the accuracy and reusability of the phylogenetic statements. Furthermore, establishing the mapping between tip labels used in a tree and taxa in a single common taxonomy dramatically improves the ability of other researchers to reuse phylogenetic estimates. As the process of curating a published phylogenetic estimate is not error-free, retaining a full record of the provenance of edits to a tree is crucial for openness, allowing editors to receive credit for their work and making errors introduced during curation easier to correct. RESULTS: Here, we report the development of software infrastructure to support the open curation of phylogenetic data by the community of biologists. The backend of the system provides an interface for the standard database operations of creating, reading, updating and deleting records by making commits to a git repository. The record of the history of edits to a tree is preserved by git's version control features. Hosting this data store on GitHub (http://github.com/) provides open access to the data store using tools familiar to many developers. We have deployed a server running the 'phylesystem-api', which wraps the interactions with git and GitHub. The Open Tree of Life project has also developed and deployed a JavaScript application that uses the phylesystem-api and other web services to enable input and curation of published phylogenetic statements. AVAILABILITY AND IMPLEMENTATION: Source code for the web service layer is available at https://github.com/OpenTreeOfLife/phylesystem-api. The data store can be cloned from: https://github.com/OpenTreeOfLife/phylesystem. 
A web application that uses the phylesystem web services is deployed at http://tree.opentreeoflife.org/curator. Code for that tool is available from https://github.com/OpenTreeOfLife/opentree. CONTACT: mtholder@gmail.com.


Subject(s)
Computational Biology/methods , Databases, Factual , Information Storage and Retrieval , Phylogeny , Software , Humans , Internet , Programming Languages , Reproducibility of Results , User-Computer Interface
8.
Syst Biol ; 64(3): 525-31, 2015 May.
Article in English | MEDLINE | ID: mdl-25577605

ABSTRACT

Phycas is open source, freely available Bayesian phylogenetics software written primarily in C++ but with a Python interface. Phycas specializes in Bayesian model selection for nucleotide sequence data, particularly the estimation of marginal likelihoods, central to computing Bayes Factors. Marginal likelihoods can be estimated using newer methods (Thermodynamic Integration and Generalized Steppingstone) that are more accurate than the widely used Harmonic Mean estimator. In addition, Phycas supports two posterior predictive approaches to model selection: Gelfand-Ghosh and Conditional Predictive Ordinates. The General Time Reversible family of substitution models, as well as a codon model, are available, and data can be partitioned with all parameters unlinked except tree topology and edge lengths. Phycas provides for analyses in which the prior on tree topologies allows polytomous trees as well as fully resolved trees, and provides for several choices for edge length priors, including a hierarchical model as well as the recently described compound Dirichlet prior, which helps avoid overly informative induced priors on tree length.


Subject(s)
Classification/methods , Phylogeny , Software , Algorithms , Bayes Theorem , Chlorophyta/classification , Chlorophyta/genetics
9.
Evolution ; 67(4): 991-1010, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23550751

ABSTRACT

Approximate Bayesian computation (ABC) is rapidly gaining popularity in population genetics. One example, msBayes, infers the distribution of divergence times among pairs of taxa, allowing phylogeographers to test hypotheses about historical causes of diversification in co-distributed groups of organisms. Using msBayes, we infer the distribution of divergence times among 22 pairs of populations of vertebrates distributed across the Philippine Archipelago. Our objective was to test whether sea-level oscillations during the Pleistocene caused diversification across the islands. To guide interpretation of our results, we perform a suite of simulation-based power analyses. Our empirical results strongly support a recent simultaneous divergence event for all 22 taxon pairs, consistent with the prediction of the Pleistocene-driven diversification hypothesis. However, our empirical estimates are sensitive to changes in prior distributions, and our simulations reveal low power of the method to detect random variation in divergence times and bias toward supporting clustered divergences. Our results demonstrate that analyses exploring power and prior sensitivity should accompany ABC model selection inferences. The problems we identify are potentially mitigable with uniform priors over divergence models (rather than classes of models) and more flexible prior distributions on demographic and divergence-time parameters.


Subject(s)
Biological Evolution , Climate , Models, Biological , Animals , Genetic Speciation , Geological Phenomena , Islands , Phylogeny
10.
Ecol Evol ; 2(8): 1826-33, 2012 Aug.
Article in English | MEDLINE | ID: mdl-22957185

ABSTRACT

Filoviruses have to date been considered as consisting of one diverse genus (Ebola viruses) and one undifferentiated genus (Marburg virus). We reconsider this idea by means of detailed phylogenetic analyses of sequence data available for the Filoviridae: using coalescent simulations, we ascertain that two Marburg isolates (termed the "RAVN" strain) represent a quite-distinct lineage that should be considered in studies of biogeography and host associations, and may merit recognition at the level of species. In contrast, filovirus isolates recently obtained from bat tissues are not distinct from previously known strains, and should be considered as drawn from the same population. Implications for understanding the transmission geography and host associations of these viruses are discussed.

11.
Protein Sci ; 21(6): 769-85, 2012 Jun.
Article in English | MEDLINE | ID: mdl-22528593

ABSTRACT

The interface of protein structural biology, protein biophysics, molecular evolution, and molecular population genetics forms the foundations for a mechanistic understanding of many aspects of protein biochemistry. Current efforts in interdisciplinary protein modeling are in their infancy and the state of the art of such models is described. Beyond the relationship between amino acid substitution and static protein structure, protein function, and corresponding organismal fitness, other considerations are also discussed. More complex mutational processes such as insertion and deletion and domain rearrangements and even circular permutations should be evaluated. The role of intrinsically disordered proteins is still controversial, but may be increasingly important to consider. Protein geometry and protein dynamics as a deviation from static considerations of protein structure are also important. Protein expression level is known to be a major determinant of evolutionary rate and several considerations including selection at the mRNA level and the role of interaction specificity are discussed. Lastly, the relationship between modeling and needed high-throughput experimental data as well as experimental examination of protein evolution using ancestral sequence resurrection and in vitro biochemistry are presented, towards an aim of ultimately generating better models for biological inference and prediction.


Subject(s)
Evolution, Molecular , Proteins/chemistry , Proteins/genetics , Amino Acid Sequence , Animals , Humans , Models, Molecular , Molecular Sequence Data , Protein Conformation , Protein Folding , RNA, Messenger/genetics , Sequence Alignment
12.
Syst Biol ; 61(4): 675-89, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22357728

ABSTRACT

In scientific research, integration and synthesis require a common understanding of where data come from, how much they can be trusted, and what they may be used for. To make such an understanding computer-accessible requires standards for exchanging richly annotated data. The challenges of conveying reusable data are particularly acute in regard to evolutionary comparative analysis, which comprises an ever-expanding list of data types, methods, research aims, and subdisciplines. To facilitate interoperability in evolutionary comparative analysis, we present NeXML, an XML standard (inspired by the current standard, NEXUS) that supports exchange of richly annotated comparative data. NeXML defines syntax for operational taxonomic units, character-state matrices, and phylogenetic trees and networks. Documents can be validated unambiguously. Importantly, any data element can be annotated, to an arbitrary degree of richness, using a system that is both flexible and rigorous. We describe how the use of NeXML by the TreeBASE and Phenoscape projects satisfies user needs that cannot be satisfied with other available file formats. By relying on XML Schema Definition, the design of NeXML facilitates the development and deployment of software for processing, transforming, and querying documents. The adoption of NeXML for practical use is facilitated by the availability of (1) an online manual with code samples and a reference to all defined elements and attributes, (2) programming toolkits in most of the languages used commonly in evolutionary informatics, and (3) input-output support in several widely used software applications. An active, open, community-based development process enables future revision and expansion of NeXML.


Subject(s)
Biological Evolution , Computational Biology/standards , Programming Languages , Biodiversity , Classification , Informatics , Models, Biological , Phylogeny , Software
13.
Syst Biol ; 61(1): 170-3, 2012 Jan.
Article in English | MEDLINE | ID: mdl-21963610

ABSTRACT

Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software.


Subject(s)
Computational Biology/methods , Phylogeny , Software , Algorithms , Computing Methodologies , Evolution, Molecular , Genome
14.
PLoS Curr ; 4: e4fd1286980c08, 2012 12 14.
Article in English | MEDLINE | ID: mdl-23868168

ABSTRACT

Felsenstein's pruning algorithm allows one to calculate the probability of any particular data pattern arising on a phylogeny given a model of character evolution. Here we present a similar dynamic programming algorithm. Our algorithm treats the tree and model as known. The algorithm makes it feasible to calculate the probability that a randomly selected character will be a member of a particular class of character patterns. Specifically, we are interested in binning patterns by the number of parsimony steps and the set of states observed at the tips of the tree. This algorithm was developed to expand the range of data set sizes that can be used with Waddell et al.'s marginal testing approach for assessing the adequacy of a model. The algorithms introduced can also be used in likelihood calculations which correct for ascertainment biases. For example, Lewis introduced an Mkv model which corrects for the lack of constant sites. The probability of a constant pattern arising can be calculated using the algorithm that we present, or by enumerating all possible constant patterns and calculating the probability of each one. Because the number of constant data patterns is small, both methods are efficient. However, elaborations of the Mkv model (such as those in Nylander et al.) require calculating the probability of parsimony-uninformative patterns arising. For large trees and characters with many possible character states, the number of possible parsimony-uninformative patterns is immense. In these cases, the algorithms introduced here will be more efficient. The algorithm has been implemented in open source software written in C++.
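The enumeration approach mentioned above for constant patterns can be sketched with a plain implementation of Felsenstein's pruning algorithm. This is not the pattern-class algorithm the abstract introduces. The JC69 model, the four-state alphabet, the tree encoding (a leaf is a name, an internal node is a list of `(child, branch_length)` pairs), and all function names are assumptions for illustration.

```python
import math
from itertools import product

STATES = "ACGT"


def jc_prob(i, j, t):
    """JC69 probability of going from state i to state j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partials(node, pattern):
    """Conditional likelihoods: P(tip data below node | node is in state s)."""
    if isinstance(node, str):  # leaf: only the observed state is possible
        return {s: 1.0 if pattern[node] == s else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, t in node:
        child_l = partials(child, pattern)
        for s in STATES:
            result[s] *= sum(jc_prob(s, x, t) * child_l[x] for x in STATES)
    return result


def pattern_probability(tree, pattern):
    """Probability of one tip pattern, with equal state frequencies at the root."""
    root = partials(tree, pattern)
    return sum(0.25 * root[s] for s in STATES)


def constant_pattern_probability(tree, tips):
    """Enumerate the constant patterns (all tips share one state) and add them up."""
    return sum(pattern_probability(tree, {tip: s for tip in tips}) for s in STATES)


def total_probability(tree, tips):
    """Sanity check: probabilities of all possible patterns should sum to 1."""
    return sum(
        pattern_probability(tree, dict(zip(tips, states)))
        for states in product(STATES, repeat=len(tips))
    )
```

Enumerating constant patterns costs one pruning pass per state, whereas enumerating every parsimony-uninformative pattern grows rapidly with tree size, which is the case the abstract's algorithm targets.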

15.
Syst Biol ; 61(1): 90-106, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22139466

ABSTRACT

Highly accurate estimation of phylogenetic trees for large data sets is difficult, in part because multiple sequence alignments must be accurate for phylogeny estimation methods to be accurate. Coestimation of alignments and trees has been attempted but currently only SATé estimates reasonably accurate trees and alignments for large data sets in practical time frames (Liu K., Raghavan S., Nelesen S., Linder C.R., Warnow T. 2009b. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 324:1561-1564). Here, we present a modification to the original SATé algorithm that improves upon SATé (which we now call SATé-I) in terms of speed and of phylogenetic and alignment accuracy. SATé-II uses a different divide-and-conquer strategy than SATé-I and so produces smaller more closely related subsets than SATé-I; as a result, SATé-II produces more accurate alignments and trees, can analyze larger data sets, and runs more efficiently than SATé-I. Generally, SATé is a metamethod that takes an existing multiple sequence alignment method as an input parameter and boosts the quality of that alignment method. SATé-II-boosted alignment methods are significantly more accurate than their unboosted versions, and trees based upon these improved alignments are more accurate than trees based upon the original alignments. Because SATé-I used maximum likelihood (ML) methods that treat gaps as missing data to estimate trees and because we found a correlation between the quality of tree/alignment pairs and ML scores, we explored the degree to which SATé's performance depends on using ML with gaps treated as missing data to determine the best tree/alignment pair. We present two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results. 
First, we show that the optimization problem where a set of unaligned DNA sequences is given and the output is the tree and alignment of those sequences that maximize likelihood under the Jukes-Cantor model is uninformative in the worst possible sense. For all inputs, all trees optimize the likelihood score. Second, we show that a greedy heuristic that uses GTR+Gamma ML to optimize the alignment and the tree can produce very poor alignments and trees. Therefore, the excellent performance of SATé-II and SATé-I is not because ML is used as an optimization criterion for choosing the best tree/alignment pair but rather due to the particular divide-and-conquer realignment techniques employed.


Subject(s)
Phylogeny , Sequence Alignment/methods , Software , Algorithms , Automation , Computer Simulation , DNA , Evolution, Molecular , Likelihood Functions
16.
Mol Biol Evol ; 29(3): 939-55, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22049064

ABSTRACT

We introduce a new model for relaxing the assumption of a strict molecular clock for use as a prior in Bayesian methods for divergence time estimation. Lineage-specific rates of substitution are modeled using a Dirichlet process prior (DPP), a type of stochastic process that assumes lineages of a phylogenetic tree are distributed into distinct rate classes. Under the Dirichlet process, the number of rate classes, assignment of branches to rate classes, and the rate value associated with each class are treated as random variables. The performance of this model was evaluated by conducting analyses on data sets simulated under a range of different models. We compared the Dirichlet process model with two alternative models for rate variation: the strict molecular clock and the independent rates model. Our results show that divergence time estimation under the DPP provides robust estimates of node ages and branch rates without significantly reducing power. Further analyses were conducted on a biological data set, and we provide examples of ways to summarize Markov chain Monte Carlo samples under this model.
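The prior described above, in which the number of rate classes, the assignment of branches to classes, and each class's rate are all random, can be illustrated by drawing one sample from a Dirichlet process through its Chinese restaurant process representation. This is a sketch only: the gamma base distribution, its parameters, and the function name are assumptions and are not taken from the paper.

```python
import random


def sample_branch_rates(n_branches, alpha, rng, shape=2.0, scale=0.5):
    """Draw one rate-class assignment and one rate per class from a DPP.

    alpha is the concentration parameter; larger values favor more classes.
    The gamma(shape, scale) base distribution for rates is an assumption.
    """
    assignments = []  # class index for each branch
    counts = []       # number of branches currently in each class
    for _ in range(n_branches):
        # Join existing class k with weight counts[k], or open a new class
        # with weight alpha (Chinese restaurant process).
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    rates = [rng.gammavariate(shape, scale) for _ in counts]
    return assignments, rates
```

With a small `alpha` most branches share a few rate classes, approaching a strict clock; with a large `alpha` nearly every branch gets its own rate, approaching an independent rates model.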


Subject(s)
Evolution, Molecular , Models, Genetic , Mutation Rate , Phylogeny , Bayes Theorem , Computer Simulation , Markov Chains , Monte Carlo Method , Stochastic Processes
17.
J Theor Biol ; 280(1): 159-66, 2011 Jul 07.
Article in English | MEDLINE | ID: mdl-21540039

ABSTRACT

The field of phylogenetic tree estimation has been dominated by three broad classes of methods: distance-based approaches, parsimony and likelihood-based methods (including maximum likelihood (ML) and Bayesian approaches). Here we introduce two new approaches to tree inference: pairwise likelihood estimation and a distance-based method that estimates the number of substitutions along the paths through the tree. Our results include the derivation of the formulae for the probability that two leaves will be identical at a site given a number of substitutions along the path connecting them. We also derive the posterior probability of the number of substitutions along a path between two sequences. The calculations for the posterior probabilities are exact for group-based, symmetric models of character evolution, but are only approximate for more general models.
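For intuition about the probability that two leaves are identical at a site given the number of substitutions on the path between them, the sketch below works the simplest case. It assumes a four-state Jukes-Cantor-style model in which every substitution moves to one of the three other states with equal probability; this is an illustrative assumption and not necessarily the form of the formulae derived in the paper. Under that assumption, the probability after k substitutions is 1/4 + (3/4)(-1/3)^k, which the code checks against a step-by-step recurrence.

```python
from fractions import Fraction


def p_identical_recursive(k):
    """P(two ends of a path match) after k substitutions, built one step at a time."""
    p = Fraction(1)  # with zero substitutions the two ends are identical
    for _ in range(k):
        # A match is always destroyed by a substitution; a mismatch is
        # repaired with probability 1/3.
        p = (1 - p) / 3
    return p


def p_identical_closed_form(k):
    """Closed form of the same probability."""
    return Fraction(1, 4) + Fraction(3, 4) * Fraction(-1, 3) ** k
```

The probability alternates around 1/4 and converges to it, reflecting that an odd number of substitutions can never restore the original state when only one change has occurred.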


Subject(s)
Evolution, Molecular , Models, Genetic , Phylogeny
18.
Mol Ecol Resour ; 11(2): 364-9, 2011 Mar.
Article in English | MEDLINE | ID: mdl-21429145

ABSTRACT

We present Ginkgo, a software package for agent-based, forward-time simulations of genealogies of multiple unlinked loci from diploid populations. Ginkgo simulates the evolution of one or more species on a spatially explicit landscape of cells. The user of the software can specify the geographical and environmental characteristics of the landscape, and these properties can change according to a prespecified schedule. The geographical elements modelled include the arrangement of cells and movement rates between particular cells. Each species has a function that can calculate a fitness score for any combination of an individual organism's phenotype and environmental characteristics. The user can control the number of fitness factors (the dimensionality of the cell-specific fitness factors and the individuals' phenotypic vectors) and the weighting of each of these dimensions in the fitness calculation. Cell-specific fitness trait optima can be specified across the landscape to mimic differences in habitat. In addition to their differing fitness functions, species can differ in terms of their vagility and fecundity. Genealogies and occurrence data can be produced at any time during the simulation in NEXUS and ESRI Ascii Grid formats, respectively.


Subject(s)
Computer Simulation , Phylogeography , Software , Biological Evolution , Ecosystem
19.
Syst Biol ; 60(2): 161-74, 2011 Mar.
Article in English | MEDLINE | ID: mdl-21233085

ABSTRACT

Most phylogenetic models of protein evolution assume that sites are independent and identically distributed. Interactions between sites are ignored, and the likelihood can be conveniently calculated as the product of the individual site likelihoods. The calculation considers all possible transition paths (also called substitution histories or mappings) that are consistent with the observed states at the terminals, and the probability density of any particular reconstruction depends on the substitution model. The likelihood is the integral of the probability density of each substitution history taken over all possible histories that are consistent with the observed data. We investigated the extent to which transition paths that are incompatible with a protein's three-dimensional structure contribute to the likelihood. Several empirical amino acid models were tested for sequence pairs of different degrees of divergence. When simulating substitutional histories starting from a real sequence, the structural integrity of the simulated sequences quickly disintegrated. This result indicates that simple models are clearly unable to capture the constraints on sequence evolution. However, when we sampled transition paths between real sequences from the posterior probability distribution according to these same models, we found that the sampled histories were largely consistent with the tertiary structure. This suggests that simple empirical substitution models may be adequate for interpolating changes between observed sequences during phylogenetic inference despite the fact that the models cannot predict the effects of structural constraints from first principles. This study is significant because it provides a quantitative assessment of the biological realism of substitution models from the perspective of protein structure, and it provides insight on the prospects for improving models of protein sequence evolution.


Subject(s)
Evolution, Molecular , Proteins/chemistry , Proteins/genetics , Animals , Humans , Likelihood Functions , Phylogeny , Probability
20.
Mol Biol Evol ; 27(12): 2733-46, 2010 Dec.
Article in English | MEDLINE | ID: mdl-20576761

ABSTRACT

Myxozoans are a diverse group of microscopic endoparasites that have been the focus of much controversy regarding their phylogenetic position. Two dramatically different hypotheses have been put forward regarding the placement of Myxozoa within Metazoa. One hypothesis, supported by ribosomal DNA (rDNA) data, places Myxozoa as a sister taxon to Bilateria. The alternative hypothesis, supported by phylogenomic data and morphology, places Myxozoa within Cnidaria. Here, we investigate these conflicting hypotheses and explore the effects of missing data, model choice, and inference methods, all of which can have an effect in placing highly divergent taxa. In addition, we identify subsets of the data that most influence the placement of Myxozoa and explore their effects by removing them from the data sets. Assembling the largest taxonomic sampling of myxozoans and cnidarians to date, with a comprehensive sampling of other metazoans for 18S and 28S nuclear rDNA sequences, we recover a well-supported placement of Myxozoa as an early diverging clade of Bilateria. By conducting parametric bootstrapping, we find that the bilaterian placement of Buddenbrockia could not alone be explained by long-branch attraction. After trimming a published phylogenomic data set, to circumvent problems of missing data, we recover the myxozoan Buddenbrockia plumatellae as a medusozoan cnidarian. In further explorations of these data sets, we find that removal of just a few identified sites under a maximum likelihood criterion employing the Whelan and Goldman amino acid substitution model changes the placement of Buddenbrockia from within Cnidaria to the alternative hypothesis at the base of Bilateria. Under a Bayesian criterion employing the CAT model, the cnidarian placement is more resilient to data removal, but under one test, a well-supported early diverging bilaterian position for Buddenbrockia is recovered.
Our results confirm the existence of two relatively stable placements for myxozoans and demonstrate that conflicting signal exists not only between the two types of data but also within the phylogenomic data set. These analyses underscore the importance of careful model selection, taxon and data sampling, and in-depth data exploration when investigating the phylogenetic placement of highly divergent taxa.


Subject(s)
Databases, Genetic , Myxozoa/classification , Phylogeny , RNA, Ribosomal, 18S/genetics , RNA, Ribosomal, 28S/genetics , Animals , Base Sequence , Cnidaria/classification , Cnidaria/genetics , DNA, Ribosomal/genetics , Myxozoa/genetics , Ribosomes/genetics