Results 1 - 16 of 16
1.
Nat Commun ; 13(1): 3896, 2022 07 06.
Article in English | MEDLINE | ID: mdl-35794110

ABSTRACT

Widely applicable, accurate and fast inference methods in phylodynamics are needed to fully profit from the richness of genetic data in uncovering the dynamics of epidemics. Standard methods, including maximum-likelihood and Bayesian approaches, generally rely on complex mathematical formulae and approximations, and do not scale with dataset size. We develop a likelihood-free, simulation-based approach, which combines deep learning with (1) a large set of summary statistics measured on phylogenies or (2) a complete and compact representation of trees, which avoids potential limitations of summary statistics and applies to any phylodynamics model. Our method enables both model selection and estimation of epidemiological parameters from very large phylogenies. We demonstrate its speed and accuracy on simulated data, where it performs better than the state-of-the-art methods. To illustrate its applicability, we assess the dynamics induced by superspreading individuals in an HIV dataset of men-having-sex-with-men in Zurich. Our tool PhyloDeep is available on github.com/evolbioinfo/phylodeep.


Subject(s)
Deep Learning , Bayes Theorem , Computer Simulation , Disease Outbreaks , Humans , Male , Phylogeny
2.
Bioinformatics ; 35(21): 4290-4297, 2019 11 01.
Article in English | MEDLINE | ID: mdl-30977781

ABSTRACT

MOTIVATION: The reconstruction of ancestral genetic sequences from the analysis of contemporaneous data is a powerful tool to improve our understanding of molecular evolution. Various statistical criteria defined in a phylogenetic framework can be used to infer nucleotide, amino-acid or codon states at internal nodes of the tree, for every position along the sequence. These criteria generally select the state that maximizes (or minimizes) a given criterion. Although it is perfectly sensible from a statistical perspective, that strategy fails to convey useful information about the level of uncertainty associated with the inference. RESULTS: The present study introduces a new criterion for ancestral sequence reconstruction, the minimum posterior expected error (MPEE), that selects a single state whenever the signal conveyed by the data is strong, and a combination of multiple states otherwise. We also assess the performance of a criterion based on the Brier scoring scheme which, like MPEE, does not rely on any tuning parameters. The precision and accuracy of several other criteria that involve arbitrarily set tuning parameters are also evaluated. Large scale simulations demonstrate the benefits of using the MPEE and Brier-based criteria with a substantial increase in the accuracy of the inference of past sequences compared to the standard approach and realistic compromises on the precision of the solutions returned. AVAILABILITY AND IMPLEMENTATION: The software package PhyML (https://github.com/stephaneguindon/phyml) provides an implementation of the Maximum A Posteriori (MAP) and MPEE criteria for reconstructing ancestral nucleotide and amino-acid sequences.


Subject(s)
Sequence Analysis , Amino Acid Sequence , Biometry , Evolution, Molecular , Likelihood Functions , Phylogeny
3.
Nature ; 556(7702): 452-456, 2018 04.
Article in English | MEDLINE | ID: mdl-29670290

ABSTRACT

Felsenstein's application of the bootstrap method to evolutionary trees is one of the most cited scientific papers of all time. The bootstrap method, which is based on resampling and replications, is used extensively to assess the robustness of phylogenetic inferences. However, increasing numbers of sequences are now available for a wide variety of species, and phylogenies based on hundreds or thousands of taxa are becoming routine. With phylogenies of this size Felsenstein's bootstrap tends to yield very low supports, especially on deep branches. Here we propose a new version of the phylogenetic bootstrap in which the presence of inferred branches in replications is measured using a gradual 'transfer' distance rather than the binary presence or absence index used in Felsenstein's original version. The resulting supports are higher and do not induce falsely supported branches. The application of our method to large mammal, HIV and simulated datasets reveals their phylogenetic signals, whereas Felsenstein's bootstrap fails to do so.


Subject(s)
Data Interpretation, Statistical , Datasets as Topic , HIV-1/genetics , Mammals/genetics , Phylogeny , Animals , Computer Simulation , DNA Barcoding, Taxonomic , Haplorhini/genetics , pol Gene Products, Human Immunodeficiency Virus/chemistry , pol Gene Products, Human Immunodeficiency Virus/genetics
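
The gradual 'transfer' measure in the abstract above can be illustrated with a small sketch. It assumes that the transfer distance between two bipartitions is the minimum number of taxa that must change sides for them to coincide, and that support is normalised by p - 1, where p is the size of the smaller side of the reference branch. The abstract does not spell out either detail, so both are assumptions here. The function names and the representation of a tree as a list of taxon sets are hypothetical.

```python
def transfer_distance(side_a, side_b, n_taxa):
    """Minimum number of taxa to move so that two bipartitions coincide.

    Each bipartition is given by one of its sides (a set of taxon names).
    """
    diff = len(set(side_a) ^ set(side_b))   # taxa on different sides
    return min(diff, n_taxa - diff)         # either orientation of side_b


def transfer_support(ref_side, replicate_trees, n_taxa):
    """Support of a reference branch, averaged over bootstrap replicates.

    replicate_trees: list of trees, each a list of bipartition sides.
    Assumes an internal reference branch (at least 2 taxa on each side).
    """
    p = min(len(ref_side), n_taxa - len(ref_side))
    indices = []
    for tree in replicate_trees:
        best = min(transfer_distance(ref_side, side, n_taxa) for side in tree)
        # Clamp at p - 1, the value a pendant branch would always provide.
        indices.append(min(best, p - 1))
    return 1.0 - (sum(indices) / len(indices)) / (p - 1)


# One replicate recovers {A, B, C} exactly; the other only has {A, B}.
replicates = [[{"A", "B", "C"}, {"A", "B"}], [{"A", "B"}, {"D", "E"}]]
support = transfer_support({"A", "B", "C"}, replicates, n_taxa=6)
```

In this toy case the binary presence/absence index would give 0.5 (the branch is present in one of two replicates), whereas the gradual measure credits the near miss in the second replicate.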
4.
Parasite ; 17(4): 273-83, 2010 Dec.
Article in English | MEDLINE | ID: mdl-21275233

ABSTRACT

The fight against Plasmodium falciparum, the species responsible for 90% of the lethal forms of human malaria, took a new direction with the publication of its genome in 2002. However, the hopes that the genome would help bring to the foreground the expected new "vaccine candidates" or "targets of new medicines" were disappointed by the low number of genes that could be functionally annotated--less than 40% upon the genome publication, just over 50% eight years later. This 10% gain of knowledge was made possible by the efforts of the entire scientific community in many directions which include: the production of transcriptomic and proteomic profiles at various stages of the parasite development and in response to drug or stress treatments; the proteomic study of subcellular compartments; the sequencing of numerous Plasmodium related species (allowing whole genome comparisons) and the sequencing of numerous P. falciparum strains (allowing investigations of gene polymorphism). In parallel with this production of experimental biological data, the development of original mining tools adapted to the P. falciparum specificities quickly appeared as a priority, as the performances of "classical" bioinformatic tools, used successfully for other genomes, had limited efficacy. This was the aim of the PlasmoExplore project launched in 2007. This brief review does not cover all efforts made by the international community to decipher the P. falciparum genome but focuses on improvements and novel mining methods investigated by the PlasmoExplore consortium, and some of the lessons we could learn from these efforts.


Subject(s)
Computational Biology/methods , Genome, Protozoan/genetics , Malaria, Falciparum/prevention & control , Plasmodium falciparum/genetics , Animals , Base Sequence , Gene Expression Regulation/genetics , Gene Expression Regulation, Fungal , Genes, Protozoan , Humans , Malaria, Falciparum/genetics , Proteome/genetics , Saccharomyces cerevisiae , Transcription, Genetic
5.
Nucleic Acids Res ; 36(Web Server issue): W465-9, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18424797

ABSTRACT

Phylogenetic analyses are central to many research areas in biology and typically involve the identification of homologous sequences, their multiple alignment, the phylogenetic reconstruction and the graphical representation of the inferred tree. The Phylogeny.fr platform transparently chains programs to automatically perform these tasks. It is primarily designed for biologists with no experience in phylogeny, but can also meet the needs of specialists; the first ones will find up-to-date tools chained in a phylogeny pipeline to analyze their data in a simple and robust way, while the specialists will be able to easily build and run sophisticated analyses. Phylogeny.fr offers three main modes. The 'One Click' mode targets non-specialists and provides a ready-to-use pipeline chaining programs with recognized accuracy and speed: MUSCLE for multiple alignment, PhyML for tree building, and TreeDyn for tree rendering. All parameters are set up to suit most studies, and users only have to provide their input sequences to obtain a ready-to-print tree. The 'Advanced' mode uses the same pipeline but allows the parameters of each program to be customized by users. The 'A la Carte' mode offers more flexibility and sophistication, as users can build their own pipeline by selecting and setting up the required steps from a large choice of tools to suit their specific needs. Prior to phylogenetic analysis, users can also collect neighbors of a query sequence by running BLAST on general or specialized databases. A guide tree then helps to select neighbor sequences to be used as input for the phylogeny pipeline. Phylogeny.fr is available at: http://www.phylogeny.fr/


Subject(s)
Phylogeny , Software , Internet , Sequence Alignment , Sequence Analysis, DNA , Sequence Analysis, Protein
6.
Mol Biol Evol ; 18(6): 1103-16, 2001 Jun.
Article in English | MEDLINE | ID: mdl-11371598

ABSTRACT

We analyze the performance of quartet methods in phylogenetic reconstruction. These methods first compute four-taxon trees (4-trees) and then use a combinatorial algorithm to infer a phylogeny that respects the inferred 4-trees as much as possible. Quartet puzzling (QP) is one of the few methods able to take weighting of the 4-trees, which is inferred by maximum likelihood, into account. QP seems to be widely used. We present weight optimization (WO), a new algorithm which is also based on weighted 4-trees. WO is faster and offers better theoretical guarantees than QP. Moreover, computer simulations indicate that the topological accuracy of WO is less dependent on the shape of the correct tree. However, although the performance of WO is better overall than that of QP, it is still less efficient than traditional phylogenetic reconstruction approaches based on pairwise evolutionary distances or maximum likelihood. This is likely related to long-branch attraction, a phenomenon to which quartet methods are very sensitive, and to inappropriate use of the initial results (weights) obtained by maximum likelihood for every quartet.


Subject(s)
Algorithms , Phylogeny , Models, Genetic
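
The first stage of a quartet method, choosing one of the three possible 4-trees for each set of four taxa, can be sketched as follows. The abstract above infers and weights 4-trees by maximum likelihood. This sketch swaps in a simpler distance-based rule (the four-point condition) purely for illustration, and the function name and the distance dictionary layout are hypothetical.

```python
def best_quartet(dist, a, b, c, d):
    """Pick the 4-tree xy|zw whose sum d(x,y) + d(z,w) is smallest.

    dist maps frozenset({x, y}) to the evolutionary distance between x and y.
    On additive (tree-like) distances the smallest sum identifies the true
    split; this is a distance-based stand-in, not the maximum-likelihood
    weighting used by quartet puzzling or weight optimization.
    """
    def get(x, y):
        return dist[frozenset((x, y))]

    sums = {
        ((a, b), (c, d)): get(a, b) + get(c, d),
        ((a, c), (b, d)): get(a, c) + get(b, d),
        ((a, d), (b, c)): get(a, d) + get(b, c),
    }
    return min(sums, key=sums.get)


# Distances from the tree ((A,B),(C,D)) with all branch lengths equal to 1.
pairs = {("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 3,
         ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3}
dist = {frozenset(k): v for k, v in pairs.items()}
chosen = best_quartet(dist, "A", "B", "C", "D")
```

A combinatorial algorithm would then try to assemble a full phylogeny that agrees with as many of the chosen 4-trees as possible.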
8.
Mol Biol Evol ; 17(3): 401-5, 2000 Mar.
Article in English | MEDLINE | ID: mdl-10723740

ABSTRACT

This paper discusses the optimization principle in phylogenetic analysis, in the case of distance data. We argue that the use of this principle cannot be called into question, except for computing time reasons. We show that the minimum-evolution criterion is not perfectly suited for distance data estimated from sequences, and we present another approach, implemented in the BIONJ algorithm, which allows the data features to be taken into account, while being less demanding in computing time. Simulations show that BIONJ significantly outperforms NJ.


Subject(s)
Algorithms , Evolution, Molecular , Models, Genetic , Phylogeny
9.
Mol Biol Evol ; 14(8): 875-82, 1997 Aug.
Article in English | MEDLINE | ID: mdl-9254926

ABSTRACT

Two methods are commonly employed for evaluating the extent of the uncertainty of evolutionary distances between sequences: either some estimator of the variance of the distance estimator, or the bootstrap method. However, both approaches can be misleading, particularly when the evolutionary distance is small. We propose using another statistical method which does not have the same defect: interval estimation. We show how confidence intervals may be constructed for the Jukes and Cantor (1969) and Kimura two-parameter (1980) estimators. We compare the exact confidence intervals thus obtained with the approximate intervals derived by the two previous methods, using artificial and biological data. The results show that the usual methods clearly underestimate the variability when the substitution rate is low and when sequences are short. Moreover, our analysis suggests that similar results may be expected for other evolutionary distance estimators.


Subject(s)
Algorithms , Evolution, Molecular , Animals , Confidence Intervals , Evaluation Studies as Topic , Humans , Rats , Sequence Alignment
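
A minimal sketch of interval estimation for the Jukes and Cantor distance follows. It assumes that the number of differing sites is binomial, builds an exact (Clopper-Pearson) interval for the proportion of differences, and maps its endpoints through the distance formula, which is monotone in that proportion. Whether this matches the construction used in the paper is an assumption; the abstract does not give the details. All function names are hypothetical.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance for a proportion p of differing sites."""
    if p >= 0.75:
        return math.inf          # saturation: the distance is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def _solve_decreasing(f, target, iters=60):
    """Bisection on (0, 1) for a decreasing function f with f(x) = target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided interval for a binomial proportion (k events in n)."""
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1.0 - alpha / 2.0)
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2.0)
    return lower, upper


def jc69_interval(k, n, alpha=0.05):
    """Interval for the JC69 distance from k differences over n sites."""
    lower, upper = clopper_pearson(k, n, alpha)
    return jc69_distance(lower), jc69_distance(upper)


# 5 differences over 100 aligned sites: a short, slowly evolving pair.
d_hat = jc69_distance(5 / 100)
d_low, d_high = jc69_interval(5, 100)
```

For short sequences and small distances the resulting interval is markedly asymmetric around the point estimate, which a symmetric variance-based interval cannot reflect.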
10.
Mol Biol Evol ; 14(7): 685-95, 1997 Jul.
Article in English | MEDLINE | ID: mdl-9254330

ABSTRACT

We propose an improved version of the neighbor-joining (NJ) algorithm of Saitou and Nei. This new algorithm, BIONJ, follows the same agglomerative scheme as NJ, which consists of iteratively picking a pair of taxa, creating a new node which represents the cluster of these taxa, and reducing the distance matrix by replacing both taxa by this node. Moreover, BIONJ uses a simple first-order model of the variances and covariances of evolutionary distance estimates. This model is well adapted when these estimates are obtained from aligned sequences. At each step it permits the selection, from the class of admissible reductions, of the reduction which minimizes the variance of the new distance matrix. In this way, we obtain better estimates to choose the pair of taxa to be agglomerated during the next steps. Moreover, in comparison with NJ's estimates, these estimates become better and better as the algorithm proceeds. BIONJ retains the good properties of NJ--especially its low run time. Computer simulations have been performed with 12-taxon model trees to determine BIONJ's efficiency. When the substitution rates are low (maximum pairwise divergence approximately 0.1 substitutions per site) or when they are constant among lineages, BIONJ is only slightly better than NJ. When the substitution rates are higher and vary among lineages, BIONJ clearly has better topological accuracy. In the latter case, for the model trees and the conditions of evolution tested, the topological error reduction is on the average around 20%. With highly-varying-rate trees and with high substitution rates (maximum pairwise divergence approximately 1.0 substitutions per site), the error reduction may even rise above 50%, while the probability of finding the correct tree may be augmented by as much as 15%.


Subject(s)
Biological Evolution , Phylogeny , Sequence Analysis/methods , Algorithms , Models, Biological , Software
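
One agglomeration step of the scheme described in the abstract above (pick a pair, create a node, reduce the matrix) can be sketched as follows. The pair is chosen with the standard NJ criterion. The reduction takes a weight `lam`: with `lam = 0.5` this is plain NJ, while BIONJ would choose the weight at each step from its variance model, which is not reproduced here. The function name and the return layout are hypothetical.

```python
def nj_step(labels, D, lam=0.5):
    """Perform one agglomeration step on a symmetric distance matrix D.

    labels: list of taxon names; D: list of lists with zeros on the diagonal.
    Requires at least 3 taxa. Returns the reduced labels, the reduced
    matrix, and the joined pair with its two branch lengths.
    """
    n = len(labels)
    r = [sum(row) for row in D]                     # row sums

    # NJ criterion: minimise (n - 2) * d(i, j) - r_i - r_j.
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - r[i] - r[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best

    # Branch lengths from i and j to the new node u.
    d_iu = 0.5 * D[i][j] + (r[i] - r[j]) / (2.0 * (n - 2))
    d_ju = D[i][j] - d_iu

    # Reduction: weighted average of the two estimates of d(u, k).
    keep = [k for k in range(n) if k not in (i, j)]
    new_row = [lam * (D[i][k] - d_iu) + (1.0 - lam) * (D[j][k] - d_ju)
               for k in keep]

    new_labels = [labels[k] for k in keep] + [(labels[i], labels[j])]
    new_D = [[D[a][b] for b in keep] + [new_row[x]]
             for x, a in enumerate(keep)]
    new_D.append(new_row + [0.0])
    return new_labels, new_D, (labels[i], labels[j], d_iu, d_ju)


# Distances from the tree ((A,B),(C,D)) with all branch lengths equal to 1.
labels = ["A", "B", "C", "D"]
D = [[0, 2, 3, 3],
     [2, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]
new_labels, new_D, joined = nj_step(labels, D)
```

Repeating the step until three nodes remain yields the full tree; the cost per step stays quadratic in the number of remaining taxa, consistent with the low run time the abstract mentions.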
12.
Biochimie ; 75(5): 363-70, 1993.
Article in English | MEDLINE | ID: mdl-8347723

ABSTRACT

Inductive learning, also called 'learning from examples', is a subfield of artificial intelligence. Inductive learning methods are able to deal with 'structural descriptions'. These portray objects as composite structures consisting of various components. The use of structural descriptions to represent biological objects is appealing. For instance, they have been used by Rawlings et al [1] for symbolically and comprehensively representing the folding of proteins. This paper shows how inductive learning techniques may be used for extracting information from biological objects. We briefly describe some general techniques for describing objects in a structural way and for learning from these descriptions. We present details of a program that we developed, PLAGE, and show the application of this program for a study on signal peptides, which was done in collaboration with A Danchin [2,3]. Finally, we survey some other approaches and applications of inductive learning to molecular biology.


Subject(s)
Artificial Intelligence , Protein Sorting Signals/chemistry , Sequence Analysis , Software , Algorithms , Amino Acid Sequence , Molecular Sequence Data , Protein Structure, Secondary , Sequence Analysis, DNA , Sequence Analysis, RNA
13.
Res Microbiol ; 142(7-8): 913-6, 1991.
Article in English | MEDLINE | ID: mdl-1784830

ABSTRACT

The information collected in national and international libraries on nucleotide and protein sequences cannot be directly treated for proper handling by existing software. Therefore we evaluated the feasibility of constructing a data base for Escherichia coli using the data present in the banks. The know-how thus acquired was applied to Bacillus subtilis. Specific examples of the general procedure are given.


Subject(s)
Bacillus subtilis/ultrastructure , Chromosomes, Bacterial/ultrastructure , DNA, Bacterial/ultrastructure , Databases, Factual , Escherichia coli/ultrastructure , Bacillus subtilis/genetics , Base Sequence/genetics , DNA, Bacterial/genetics , Database Management Systems , Databases, Bibliographic , Escherichia coli/genetics , In Vitro Techniques , Molecular Sequence Data
14.
Comput Appl Biosci ; 4(3): 357-65, 1988 Aug.
Article in English | MEDLINE | ID: mdl-3416198

ABSTRACT

A method is presented for predicting the secondary structure of globular proteins from their amino acid sequence. It is based on a rigorous statistical exploitation of the well-known biological fact that the amino acid compositions of each secondary structure are different. We also propose an evaluation process that allows us to estimate the capacity of a method to predict the secondary structure of a new protein which does not have any homologous proteins whose structure is already known. This evaluation process shows that our method has a prediction accuracy of 58.7% over three states for the 62 proteins of the Kabsch and Sander (1983a) data bank. This result is better than that obtained by the most widely used methods--Lim (1974), Chou and Fasman (1978) and Garnier et al. (1978)--and also than that obtained by a recent method based on local homologies (Levin et al., 1986). Our prediction method is very simple and may be implemented on any microcomputer and even on programmable pocket calculators. A simple Pascal implementation of the method's prediction algorithm is given. The interpretation of our results in terms of protein folding and directions for further work are discussed.


Subject(s)
Protein Conformation , Software , Algorithms , Amino Acid Sequence , Mathematical Computing
15.
J Mol Evol ; 24(1-2): 130-42, 1986.
Article in English | MEDLINE | ID: mdl-3104613

ABSTRACT

Investigation of possible variations between prokaryotic and eukaryotic signal sequences of exported proteins has revealed unexpected differences. Apart from the known similarities (presence of a core hydrophobic sequence preceded by a positively charged amino terminus and followed by a flexible structure), we have found that the core is much more rigid in eukaryotic signals than in their prokaryotic counterparts, and that at both ends the constraints are much more stringent in bacteria than in human cells. The differences have been summarized as a set of 17 criteria describing noteworthy features discriminating between the two classes of signal peptides. The program we used permitted each class of sequences to be learned; Escherichia coli sequences were well learned (i.e., they could be recognized by the programs as having common features), whereas human sequences were found to exhibit a much wider variation. Thus it was possible to propose a consensus in the case of the bacterial peptides, but none (or a much looser one) in the case of the human sequences. Two sequences were exceptional among the E. coli signal peptides, those of lipoprotein and plasmid-borne beta-lactamase, suggesting that they have special origins or destinations. Finally, the differences found strongly suggest that the mode of secretion is rather different in the two types of organisms, in spite of the common features of the signal sequences.


Subject(s)
Bacteria/genetics , Protein Sorting Signals/genetics , Proteins/genetics , Amino Acid Sequence , Biological Evolution , Humans , Proteins/metabolism , Software , Species Specificity
16.
Biochimie ; 67(5): 499-507, 1985 May.
Article in French | MEDLINE | ID: mdl-3839692

ABSTRACT

In order for the computer to learn about objects, the user must first provide a good description language for these objects. In this paper we present a new description language which is structural. Structural descriptions portray objects as composite structures consisting of various components. Structural descriptions can be contrasted with attribute descriptions, which specify only global properties of an object. Attribute descriptions can be expressed using propositional logic. Structural descriptions, however, must be expressed in first-order logic. We present how this language works, why it is suitable for biochemical objects and how one can discriminate on structural descriptions. Finally, we present an application on learning about tRNA which has yielded very good results and provides evidence of feasibility.


Subject(s)
Computers , Software , Base Sequence , Discrimination Learning , Logic , RNA, Transfer