1.
Proc Natl Acad Sci U S A ; 119(34): e2204435119, 2022 08 23.
Article in English | MEDLINE | ID: mdl-35972964

ABSTRACT

To assess the conventional treatment in evolutionary inference of alignment gaps as missing data, we propose a simple nonparametric test of the null hypothesis that the locations of alignment gaps are independent of the nucleotide substitution or amino acid replacement process. When we apply the test to 1,390 protein alignments that are informed by protein tertiary structure and use a 5% significance level, the null hypothesis of independence between amino acid replacement and gap location is rejected for ∼65% of datasets. Via simulations that include substitution and insertion-deletion, we show that the test performs well with true alignments. When we simulate according to the null hypothesis and then apply the test to optimal alignments that are inferred by each of four widely used software packages, the null hypothesis is rejected too frequently. Via further simulations and analyses, we show that the overly frequent rejections of the null hypothesis are not solely due to weaknesses of widely used software for finding optimal alignments. Instead, our evidence suggests that optimal alignments are unrepresentative of true alignments and that biased evolutionary inferences may result from relying upon individual optimal alignments.


Subject(s)
Amino Acids , Nucleotides , Proteins , Algorithms , Amino Acid Substitution , Amino Acids/genetics , Nucleotides/genetics , Proteins/genetics , Sequence Alignment , Software
2.
Syst Biol ; 71(3): 630-648, 2022 04 19.
Article in English | MEDLINE | ID: mdl-34469581

ABSTRACT

Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the effective sequence length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification. [Fisher information; gaps; insertion; deletion; indel; model adequacy; goodness-of-fit test; sequence alignment.].


Subject(s)
Evolution, Molecular , INDEL Mutation , Models, Genetic , Models, Statistical , Phylogeny , Sequence Alignment
3.
Syst Biol ; 67(4): 616-632, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29309694

ABSTRACT

When inferring phylogenies, one important decision is whether and how nucleotide substitution parameters should be shared across different subsets or partitions of the data. One sort of partitioning error occurs when heterogeneous subsets are mistakenly lumped together and treated as if they share parameter values. The opposite kind of error is mistakenly treating homogeneous subsets as if they result from distinct sets of parameters. Lumping and splitting errors are not equally bad. Lumping errors can yield parameter estimates that do not accurately reflect any of the subsets that were combined whereas splitting errors yield estimates that did not benefit from sharing information across partitions. Phylogenetic partitioning decisions are often made by applying information criteria such as the Akaike information criterion (AIC). As with other information criteria, the AIC evaluates a model or partition scheme by combining the maximum log-likelihood value with a penalty that depends on the number of parameters being estimated. For the purpose of selecting an optimal partitioning scheme, we derive an adjustment to the AIC that we refer to as the AIC$^{(p)}$ and that is motivated by the idea that splitting errors are less serious than lumping errors. We also introduce a similar adjustment to the Bayesian information criterion (BIC) that we refer to as the BIC$^{(p)}$. Via simulation and empirical data analysis, we contrast AIC and BIC behavior to our suggested adjustments. We discuss these results and also emphasize why we expect the probability of lumping errors with the AIC$^{(p)}$ and the BIC$^{(p)}$ to be relatively robust to model parameterization.
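As a minimal sketch of the unadjusted criteria mentioned above, the following computes the standard AIC and BIC from a maximum log-likelihood and a parameter count. The abstract does not state the AIC$^{(p)}$ or BIC$^{(p)}$ adjustments, so they are not attempted here. The log-likelihoods, parameter counts, and site count are hypothetical, and taking the number of alignment sites as the BIC sample size is an assumption.

```python
import math

def aic(max_log_lik: float, n_params: int) -> float:
    """Standard AIC: -2 * lnL + 2 * k (lower is better)."""
    return -2.0 * max_log_lik + 2.0 * n_params

def bic(max_log_lik: float, n_params: int, n_obs: int) -> float:
    """Standard BIC: -2 * lnL + k * ln(n) (lower is better)."""
    return -2.0 * max_log_lik + n_params * math.log(n_obs)

# Hypothetical partition schemes: one lumped model versus two split subsets.
n_sites = 2000  # assumed sample size for BIC
lumped = {"lnL": -10250.0, "k": 10}
split = {"lnL": -10230.0, "k": 20}

aic_lumped = aic(lumped["lnL"], lumped["k"])
aic_split = aic(split["lnL"], split["k"])
bic_lumped = bic(lumped["lnL"], lumped["k"], n_sites)
bic_split = bic(split["lnL"], split["k"], n_sites)

print("AIC prefers:", "split" if aic_split < aic_lumped else "lumped")
print("BIC prefers:", "split" if bic_split < bic_lumped else "lumped")
```

With these made-up numbers the two criteria disagree, which illustrates why the size of the penalty matters when choosing between lumping and splitting.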


Subject(s)
Computational Biology/methods , Phylogeny , Bayes Theorem , Models, Biological , Models, Genetic
4.
PLoS One ; 11(6): e0157032, 2016.
Article in English | MEDLINE | ID: mdl-27309961

ABSTRACT

Antarctica is considered relatively free of infectious disease because of its extreme environment and geographic isolation. To characterize the genetics and molecular epidemiology of the newly found penguin adenovirus in Antarctica, entire genomes were sequenced and an annual survey of penguin adenovirus was conducted. Complete genome sequences of penguin adenoviruses were obtained from two Chinstrap penguins (Pygoscelis antarctica) and two Gentoo penguins (Pygoscelis papua). The genome lengths and G+C contents of the penguin adenoviruses were 24,630-24,662 bp and 35.5-35.6%, respectively. Notably, no putative sialidase gene was identified in the penguin adenoviruses, either by Rapid Amplification of cDNA Ends (RACE-PCR) or by consensus-specific PCR. The penguin adenoviruses were shown to be a new species within the genus Siadenovirus, with a distance of 29.9-39.3% (amino acid, 32.1-47.9%) in the DNA polymerase gene, and were most closely related to turkey adenovirus 3 (TAdV-3) in phylogenetic analysis. During the 2008-2013 study period, the penguin adenoviruses were detected annually, in 22 of 78 penguins (28.2%), and the molecular epidemiological study indicates predominant infection in the Chinstrap penguin population (12/30, 40%). Interestingly, the penguin adenovirus genome was detected in several internal samples but not in the lymph node or brain. In conclusion, entire adenoviral genomes from Antarctic penguins were analyzed, and the penguin adenoviruses, which have unique genetic characteristics, were identified as a new species within the genus Siadenovirus. Moreover, the virus was detected annually in Antarctic penguins, suggesting that it circulates within the penguin population.


Subject(s)
Adenoviridae Infections/virology , Adenoviridae/pathogenicity , Molecular Epidemiology , Spheniscidae/virology , Adenoviridae Infections/genetics , Animals , Antarctic Regions , Phylogeny , Spheniscidae/genetics
5.
Mol Phylogenet Evol ; 62(1): 329-45, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22040765

ABSTRACT

The phylum Cnidaria is comprised of remarkably diverse and ecologically significant taxa, such as the reef-forming corals, and occupies a basal position in metazoan evolution. The origin of this phylum and the most recent common ancestors (MRCAs) of its modern classes remain mostly unknown, although scattered fossil evidence provides some insights on this topic. Here, we investigate the molecular divergence times of the major taxonomic groups of Cnidaria (27 Hexacorallia, 16 Octocorallia, and 5 Medusozoa) on the basis of mitochondrial DNA sequences of 13 protein-coding genes. For this analysis, the complete mitochondrial genomes of seven octocoral and two scyphozoan species were newly sequenced and combined with all available mitogenomic data from GenBank. Five reliable fossil dates were used to calibrate the Bayesian estimates of divergence times. The molecular evidence suggests that cnidarians originated 741 million years ago (Ma) (95% credible region of 686-819), and the major taxa diversified prior to the Cambrian (543 Ma). The Octocorallia and Scleractinia may have originated from radiations of survivors of the Permian-Triassic mass extinction, which matches their fossil record well.


Subject(s)
Anthozoa/genetics , Evolution, Molecular , Genes, Mitochondrial , Genetic Speciation , Scyphozoa/genetics , Animals , Anthozoa/classification , Bayes Theorem , Calibration , Extinction, Biological , Fossils , Genetic Variation , Genome, Mitochondrial , Likelihood Functions , Models, Genetic , Phylogeny , RNA, Transfer/genetics , Scyphozoa/classification
6.
J Mol Evol ; 71(4): 250-67, 2010 Oct.
Article in English | MEDLINE | ID: mdl-20740280

ABSTRACT

Species identification is one of the most important issues in biological studies. Due to recent increases in the amount of genomic information available and the development of DNA sequencing technologies, the applicability of using DNA sequences to identify species (commonly referred to as "DNA barcoding") is being tested in many areas. Several methods have been suggested to identify species using DNA sequences, including similarity scores, analysis of phylogenetic and population genetic information, and detection of species-specific sequence patterns. Although these methods have demonstrated good performance under a range of circumstances, they also have limitations, as they are subject to loss of information, require intensive computation and are sensitive to model mis-specification, and can be difficult to evaluate in terms of the significance of identification. Here, we suggest a new DNA barcoding method in which support vector machine (SVM) procedures are adopted. Our new method is nonparametric and thus is expected to be robust for a wide range of evolutionary scenarios as well as multilocus analyses. Furthermore, we describe bootstrap procedures that can be used to test the significances of species identifications. We implemented a novel conversion technique for transforming sequence data to real-valued vectors, and therefore, bootstrap procedures can be easily combined with our SVM approach. In this study, we present the results of simulation studies and empirical data analyses to demonstrate the performance of our method and discuss its properties.


Subject(s)
Algorithms , DNA Barcoding, Taxonomic/methods , DNA/classification , Artificial Intelligence , Base Sequence , Computer Simulation , DNA/genetics , DNA, Concatenated , Genetic Loci/genetics , Molecular Sequence Data , Nucleotides/genetics , Phylogeny
7.
Syst Biol ; 58(2): 199-210, 2009 Apr.
Article in English | MEDLINE | ID: mdl-20525578

ABSTRACT

Statistical models for the evolution of molecular sequences play an important role in the study of evolutionary processes. For the evolutionary analysis of protein-coding sequences, 3 types of evolutionary models are available: 1) nucleotide, 2) amino acid, and 3) codon substitution models. Selecting appropriate models can greatly improve the estimation of phylogenies and divergence times and the detection of positive selection. Although much attention has been paid to the comparisons among the same types of models, relatively little attention has been paid to the comparisons among the different types of models. Additionally, because such models have different data structures, comparison of those models using conventional model selection criteria such as Akaike information criterion (AIC) or Bayesian information criterion (BIC) is not straightforward. Here, we suggest new procedures to convert models of the above-mentioned 3 types to 64-dimensional models with nucleotide triplet substitution. These conversion procedures render it possible to statistically compare the models of these 3 types by using AIC or BIC. By analyzing divergent and conserved interspecific mammalian sequences and intraspecific human population data, we show the superiority of the codon substitution models and discuss the advantages and disadvantages of the models of the 3 types.


Subject(s)
Evolution, Molecular , Models, Genetic , Models, Statistical
8.
Mol Phylogenet Evol ; 49(1): 327-42, 2008 Oct.
Article in English | MEDLINE | ID: mdl-18682295

ABSTRACT

Identifying causes of genetic divergence is a central goal in evolutionary biology. Although rates of nucleotide substitution vary among taxa and among genes, the causes of this variation tend to be poorly understood. In the present study, we examined the rate and pattern of molecular evolution for five DNA regions over a phylogeny of Cornus, the single genus of Cornaceae. To identify evolutionary mechanisms underlying the molecular variation, we employed Bayesian methods to estimate divergence times and to infer how absolute rates of synonymous and nonsynonymous substitutions and their ratios change over time. We found that the rates vary among genes, lineages, and through time, and differences in mutation rates, selection type and intensity, and possibly genetic drift all contributed to the variation of substitution rates observed among the major lineages of Cornus. We applied independent contrast analysis to explore whether speciation rates are linked to rates of molecular evolution. The results showed no relationships for individual genes, but suggested a possible localized link between species richness and rate of nonsynonymous nucleotide substitution for the combined cpDNA regions. Furthermore, we detected a positive correlation between rates of molecular evolution and morphological change in Cornus. This was particularly pronounced in the dwarf dogwood lineage, in which genome-wide acceleration in both molecular and morphological evolution has likely occurred.


Subject(s)
Cornus/classification , Cornus/genetics , Evolution, Molecular , Genetic Speciation , Bayes Theorem , Chloroplasts/genetics , Cornus/anatomy & histology , DNA, Chloroplast/genetics , DNA, Plant/genetics , Fossils , Genes, Plant , Genome, Chloroplast , Models, Genetic , Nucleotides/genetics , Phylogeny , Sequence Analysis, DNA
9.
Syst Biol ; 57(3): 367-77, 2008 Jun.
Article in English | MEDLINE | ID: mdl-18570032

ABSTRACT

Codon- and amino acid-substitution models are widely used for the evolutionary analysis of protein-coding DNA sequences. Using codon models, the amounts of both nonsynonymous and synonymous DNA substitutions can be estimated. The ratio of these amounts represents the strength of selective pressure. Using amino acid models, the amount of nonsynonymous substitutions is estimated, but that of synonymous substitutions is ignored. Although amino acid models lose all information regarding synonymous substitutions, they explicitly incorporate information on amino acid replacement that is empirically derived from databases. It is often presumed that when the protein-coding sequences are highly divergent, synonymous substitutions might be saturated and the evolutionary analysis may be hampered by synonymous noise. However, no quantitative procedure exists to verify whether synonymous substitutions can be ignored; therefore, amino acid models have been selected arbitrarily. In this study, we investigate the statistical comparison of codon- and amino acid-substitution models. For this purpose, we propose a new procedure to transform a 20-dimensional amino acid model to a 61-dimensional codon model. This transformation reveals that amino acid models belong to a subset of the codon models and enables us to test, by using the likelihood ratio, whether synonymous substitutions can be ignored. Our theoretical results and analyses of real data indicate that synonymous substitutions are very informative and substantially improve evolutionary inference, even when the sequences are highly divergent. Therefore, we note that amino acid models should be adopted only after carefully investigating and discarding the possibility that synonymous substitutions can reveal important evolutionary information.
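Because the transformed amino acid model is nested within the codon models, the comparison can be framed as a generic likelihood ratio test. The sketch below shows only that generic statistic, not the authors' transformation procedure. The log-likelihood values are hypothetical, and the critical value 3.841 (chi-square, 5% level, one degree of freedom) is a placeholder assumption, since the abstract does not give the difference in parameter counts.

```python
def likelihood_ratio_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Return 2 * (lnL_alt - lnL_null) for a null model nested in the alternative."""
    if lnl_alt < lnl_null:
        # A properly maximized nesting model can never fit worse than its submodel.
        raise ValueError("alternative log-likelihood is below the nested null")
    return 2.0 * (lnl_alt - lnl_null)

# Hypothetical maximized log-likelihoods on the same codon alignment.
lnl_amino_acid_as_codon = -5234.8  # null: amino acid model expressed as a codon model
lnl_codon = -5210.3                # alternative: full codon model

stat = likelihood_ratio_statistic(lnl_amino_acid_as_codon, lnl_codon)

# Placeholder threshold: chi-square 5% critical value for df = 1 (assumed df).
critical_value = 3.841
reject_null = stat > critical_value
print(f"LRT statistic = {stat:.2f}, reject null: {reject_null}")
```

In practice the degrees of freedom would equal the number of extra free parameters in the codon model, and the chi-square approximation holds only under the usual regularity conditions.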


Subject(s)
Amino Acid Substitution , Evolution, Molecular , Fungal Proteins/chemistry , Mitochondrial Proteins/chemistry , Models, Genetic , Amino Acid Sequence , Animals , Codon , Computational Biology , DNA, Mitochondrial/chemistry , Likelihood Functions , Mammals/genetics , Phylogeny
10.
Mol Biol Evol ; 25(5): 960-71, 2008 May.
Article in English | MEDLINE | ID: mdl-18281270

ABSTRACT

Phylogeny estimation is crucial in the study of molecular evolution. The increase in the amount of available genomic data facilitates phylogeny estimation from multilocus sequence data. Although maximum likelihood and Bayesian methods are available for phylogeny reconstruction using multilocus sequence data, these methods require heavy computation, and their application is limited to the analysis of a moderate number of genes and taxa. Distance matrix methods are suitable alternatives for analyzing huge amounts of sequence data. However, how distance methods should be applied to multilocus sequence data remains unclear. Here, we suggest new procedures to estimate a molecular phylogeny from multilocus sequence data and to evaluate its significance in the framework of the distance method. We found that concatenation of the multilocus sequence data may result in incorrect phylogeny estimation with an extremely high bootstrap probability (BP), which is due to incorrect estimation of the distances and to deliberately ignoring intergene variation. Therefore, we suggest that the distance matrices for multilocus sequence data be estimated separately for each gene and then combined to reconstruct the phylogeny, rather than reconstructing the phylogeny from concatenated sequence data. To calculate the BPs of the reconstructed phylogeny, we suggest that 2-stage bootstrap procedures be adopted, in which genes are resampled and then sequence columns are resampled within the resampled genes. By resampling the genes during calculation of BPs, intergene variation is properly considered. Via simulation studies and empirical data analysis, we demonstrate that our 2-stage bootstrap procedures are more suitable than the conventional bootstrap procedure that is adopted after sequence concatenation.
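The 2-stage resampling described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function name, the data layout (one list of aligned strings per gene), and the toy alignments are all assumptions, and the distance estimation and tree reconstruction steps are omitted.

```python
import random

def two_stage_bootstrap(genes, rng):
    """Draw one bootstrap replicate from a list of per-gene alignments.

    genes: list of alignments; each alignment is a list of equal-length
           strings, one per taxon (hypothetical layout).
    rng:   a random.Random instance, for reproducibility.
    """
    n_genes = len(genes)
    replicate = []
    for _ in range(n_genes):
        # Stage 1: resample a gene with replacement.
        alignment = genes[rng.randrange(n_genes)]
        n_cols = len(alignment[0])
        # Stage 2: resample alignment columns with replacement within that gene.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        replicate.append(["".join(seq[c] for c in cols) for seq in alignment])
    return replicate

# Toy data: two genes, three taxa each (made-up sequences).
genes = [
    ["ACGT", "ACGA", "TCGA"],
    ["GGATCC", "GGTTCC", "GCATCA"],
]
replicate = two_stage_bootstrap(genes, random.Random(0))
for alignment in replicate:
    print(alignment)
```

Each replicate would then be analyzed gene by gene, for example by estimating one distance matrix per resampled gene and combining them, and the bootstrap probability of a clade would be its frequency across replicates.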


Subject(s)
Phylogeny , Probability , Animals , Computer Simulation , Evolution, Molecular , Genetic Heterogeneity , Genetic Variation , Mammals/genetics , Models, Genetic
11.
Proc Natl Acad Sci U S A ; 102(12): 4436-41, 2005 Mar 22.
Article in English | MEDLINE | ID: mdl-15764703

ABSTRACT

Because of the increase of genomic data, multiple genes are often available for the inference of phylogenetic relationships. The simple approach for combining multiple genes from the same taxon is to concatenate the sequences and then ignore the fact that different positions in the concatenated sequence came from different genes. Here, we discuss two criteria for inferring the optimal tree topology from data sets with multiple genes. These criteria are designed for multigene data sets where gene-specific evolutionary features are too important to ignore. One criterion is conventional and is obtained by taking the sum of log-likelihoods over all genes. The other criterion is obtained by dividing the log-likelihood for a gene by its sequence length and then taking the arithmetic mean over genes of these ratios. A similar strategy could be adopted with parsimony scores. The optimal tree is then declared to be the one for which the sum or the arithmetic mean is maximized. These criteria are justified within a two-stage hierarchical framework. The first level of the hierarchy represents gene-specific evolutionary features, and the second represents site-specific features for given genes. For testing significance of the optimal topology, we suggest a two-stage bootstrap procedure that involves resampling genes and then resampling alignment columns within resampled genes. An advantage of this procedure over concatenation is that it can effectively account for gene-specific evolutionary features. We discuss the applicability of the two-stage bootstrap idea to the Kishino-Hasegawa test and the Shimodaira-Hasegawa test.
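The two tree-selection criteria described above can be written down directly. The sketch below is a minimal illustration with hypothetical per-gene log-likelihoods and sequence lengths; the function names are invented for this example.

```python
def sum_criterion(log_liks):
    """Conventional criterion: sum of per-gene log-likelihoods."""
    return sum(log_liks)

def mean_per_site_criterion(log_liks, lengths):
    """Alternative criterion: arithmetic mean over genes of lnL / sequence length."""
    ratios = [lnl / n for lnl, n in zip(log_liks, lengths)]
    return sum(ratios) / len(ratios)

# Hypothetical data: two genes of very different lengths, two candidate trees.
lengths = [1000, 100]
log_liks_by_tree = {
    "A": [-1000.0, -120.0],
    "B": [-1020.0, -110.0],
}

# The optimal tree is the one that maximizes the chosen criterion.
best_by_sum = max(log_liks_by_tree,
                  key=lambda t: sum_criterion(log_liks_by_tree[t]))
best_by_mean = max(log_liks_by_tree,
                   key=lambda t: mean_per_site_criterion(log_liks_by_tree[t], lengths))
print("sum criterion picks:", best_by_sum)
print("mean per-site criterion picks:", best_by_mean)
```

With these made-up values the sum favors tree A, because the long gene dominates, while the per-site mean favors tree B, because each gene contributes equally regardless of length.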


Subject(s)
Biological Evolution , Genetic Variation , Models, Genetic , Databases, Genetic , Genetic Techniques , Likelihood Functions , Phylogeny
12.
Mol Biol Evol ; 21(7): 1201-13, 2004 Jul.
Article in English | MEDLINE | ID: mdl-15014159

ABSTRACT

The rate of molecular evolution can vary among lineages. Sources of this variation have differential effects on synonymous and nonsynonymous substitution rates. Changes in effective population size or patterns of natural selection will mainly alter nonsynonymous substitution rates. Changes in generation length or mutation rates are likely to have an impact on both synonymous and nonsynonymous substitution rates. By comparing changes in synonymous and nonsynonymous rates, the relative contributions of the driving forces of evolution can be better characterized. Here, we introduce a procedure for estimating the chronological rates of synonymous and nonsynonymous substitutions on the branches of an evolutionary tree. Because the widely used ratio of nonsynonymous and synonymous rates is not designed to detect simultaneous increases or simultaneous decreases in synonymous and nonsynonymous rates, the estimation of these rates rather than their ratio can improve characterization of the evolutionary process. With our Bayesian approach, we analyze cytochrome oxidase subunit I evolution in primates and infer that nonsynonymous rates have a greater tendency to change over time than do synonymous rates. Our analysis of these data also suggests that rates have been positively correlated.


Subject(s)
Electron Transport Complex IV/genetics , Evolution, Molecular , Genetic Variation , Primates/genetics , Selection, Genetic , Animals , Base Sequence/genetics , Bayes Theorem , Nucleotides/genetics , Polymorphism, Genetic , Rodentia/genetics , Sequence Analysis, DNA/methods
13.
Genetics ; 160(4): 1283-93, 2002 Apr.
Article in English | MEDLINE | ID: mdl-11973287

ABSTRACT

Using pseudomaximum-likelihood approaches to phylogenetic inference and coalescent theory, we develop a computationally tractable method of estimating effective population size from serially sampled viral data. We show that the variance of the maximum-likelihood estimator of effective population size depends on the serial sampling design only because internal node times on a coalescent genealogy can be better estimated with some designs than with others. Given the internal node times and the number of sequences sampled, the variance of the maximum-likelihood estimator is independent of the serial sampling design. We then estimate the effective size of the HIV-1 population within nine hosts. If we assume that the mutation rate is $2.5 \times 10^{-5}$ substitutions/generation and is the same in all patients, estimated generation lengths vary from 0.73 to 2.43 days/generation and the mean (1.47) is similar to the generation lengths estimated by other researchers. If we assume that generation length is 1.47 days and is the same in all patients, mutation rate estimates vary from $1.52 \times 10^{-5}$ to $5.02 \times 10^{-5}$. Our results indicate that effective viral population size and evolutionary rate per year are negatively correlated among HIV-1 patients.
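One reading that is consistent with the reported ranges is that each patient's substitution rate per day is what the data determine, so fixing either the per-generation mutation rate or the generation length fixes the other. The arithmetic below is a check under that assumption, which the abstract does not state explicitly; the variable and function names are invented for illustration.

```python
MU_ASSUMED = 2.5e-5         # substitutions/generation, from the abstract
GEN_LENGTH_ASSUMED = 1.47   # days/generation, from the abstract

def mutation_rate_given_fixed_generation_length(gen_length_days: float) -> float:
    """Convert a generation length estimated under MU_ASSUMED into a mutation
    rate under GEN_LENGTH_ASSUMED, holding the per-day rate constant."""
    rate_per_day = MU_ASSUMED / gen_length_days   # substitutions/day
    return rate_per_day * GEN_LENGTH_ASSUMED      # substitutions/generation

# Extremes of the reported generation-length range (days/generation).
mu_high = mutation_rate_given_fixed_generation_length(0.73)
mu_low = mutation_rate_given_fixed_generation_length(2.43)
print(f"{mu_low:.2e} to {mu_high:.2e} substitutions/generation")
```

The results come out at roughly 1.5e-5 and 5.0e-5, close to the reported 1.52e-5 and 5.02e-5; the small differences are plausibly due to rounding in the published figures.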


Subject(s)
HIV-1/growth & development , Likelihood Functions , Viral Load/methods , Bayes Theorem , Humans , Models, Biological , Mutation
14.
Bioinformatics ; 18(1): 115-23, 2002 Jan.
Article in English | MEDLINE | ID: mdl-11836219

ABSTRACT

MOTIVATION: The high pace of viral sequence change means that variation in the times at which sequences are sampled can have a profound effect both on the ability to detect trends over time in evolutionary rates and on the power to reject the Molecular Clock Hypothesis (MCH). Trends in viral evolutionary rates are of particular interest because their detection may allow connections to be established between a patient's treatment or condition and the process of evolution. Variation in sequence isolation times also impacts the uncertainty associated with estimates of divergence times and evolutionary rates. Variation in isolation times can be intentionally adjusted to increase the power of hypothesis tests and to reduce the uncertainty of evolutionary parameter estimates, but this fact has received little previous attention. RESULTS: We provide approximations for the power to reject the MCH when the alternative is that rates change in a linear fashion over time and when the alternative is that rates differ randomly among branches. In addition, we approximate the standard deviation of estimated evolutionary rates and divergence times. We illustrate how these approximations can be exploited to determine which viral sample to sequence when samples representing different dates are available.


Subject(s)
Evolution, Molecular , Viruses/genetics , Computational Biology , HIV-1/genetics , HIV-1/isolation & purification , Humans , Models, Genetic , Models, Statistical , Time Factors , Viruses/isolation & purification