Search | VHL Regional Portal

The Tree Reconstruction Game: Phylogenetic Reconstruction Using Reinforcement Learning.

Azouri, Dana; Granit, Oz; Alburquerque, Michael; Mansour, Yishay; Pupko, Tal; Mayrose, Itay.

Mol Biol Evol ; 41(6), 2024 Jun 01.

Article in English | MEDLINE | ID: mdl-38829798

ABSTRACT

The computational search for the maximum-likelihood phylogenetic tree is an NP-hard problem. As such, current tree search algorithms may return a tree that is a local optimum rather than the global one. Here, we introduce a paradigm shift for predicting the maximum-likelihood tree, by approximating long-term gains of likelihood rather than maximizing likelihood gain at each step of the search. Our proposed approach harnesses the power of reinforcement learning to learn an optimal search strategy, aiming at the global optimum of the search space. We show that when analyzing empirical data containing dozens of sequences, the log-likelihood improvement from the starting tree obtained by the reinforcement learning-based agent was 0.969 or higher relative to that achieved by current state-of-the-art techniques. Notably, this performance is attained without the need to perform costly likelihood optimizations apart from the training process, thus potentially allowing for an exponential decrease in runtime. We exemplify this for data sets containing 15 sequences of length 18,000 bp and demonstrate that the reinforcement learning-based method is roughly three times faster than the state-of-the-art software. This study illustrates the potential of reinforcement learning in addressing the challenges of phylogenetic tree reconstruction.
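The contrast drawn in the abstract above, between maximizing the immediate likelihood gain and approximating long-term gains, can be illustrated with a minimal sketch. It is not the authors' agent: the landscape, node names, and scores below are hypothetical, and the long-term value is computed here by exhaustive lookahead, whereas a reinforcement learning agent would learn an approximation of it.

```python
# Toy search landscape. Nodes stand in for tree topologies and scores for
# log-likelihoods. All names and numbers are hypothetical illustrations.
SCORES = {"start": -100.0, "a": -95.0, "d": -102.0, "best": -80.0}
NEIGHBORS = {
    "start": ["a", "d"],
    "a": ["start"],          # dead end: a local optimum
    "d": ["start", "best"],  # a worse step that leads to the global optimum
    "best": ["d"],
}


def greedy_search(start):
    """Hill climbing: always take the best immediate gain, stop when none improves."""
    current = start
    while True:
        best = max(NEIGHBORS[current], key=SCORES.get)
        if SCORES[best] <= SCORES[current]:
            return current
        current = best


def lookahead_value(node, steps, gamma):
    """Best discounted sum of score gains reachable from `node` within `steps` moves."""
    if steps == 0:
        return 0.0
    options = [0.0]  # stopping here yields no further gain
    for nxt in NEIGHBORS[node]:
        gain = SCORES[nxt] - SCORES[node]
        options.append(gain + gamma * lookahead_value(nxt, steps - 1, gamma))
    return max(options)


def long_term_search(start, horizon=3, gamma=0.9):
    """Pick each move by immediate gain plus discounted future gains."""
    current = start
    for remaining in range(horizon, 0, -1):
        def q(nxt):
            gain = SCORES[nxt] - SCORES[current]
            return gain + gamma * lookahead_value(nxt, remaining - 1, gamma)

        best = max(NEIGHBORS[current], key=q)
        if q(best) <= 0.0:
            break  # no move is expected to pay off
        current = best
    return current


print("greedy ends at:", greedy_search("start"))
print("long-term ends at:", long_term_search("start"))
```

In this toy landscape the greedy search stops at the local optimum `a`, while the long-term criterion accepts a temporary loss through `d` to reach `best`.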

Subject(s)

Algorithms; Phylogeny; Likelihood Functions; Models, Genetic; Computational Biology/methods; Software

A LASSO-based approach to sample sites for phylogenetic tree search.

Ecker, Noa; Azouri, Dana; Bettisworth, Ben; Stamatakis, Alexandros; Mansour, Yishay; Mayrose, Itay; Pupko, Tal.

Bioinformatics ; 38(Suppl 1): i118-i124, 2022 06 24.

Article in English | MEDLINE | ID: mdl-35758778

ABSTRACT

MOTIVATION: In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require using a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree.

RESULTS: Here, we propose an artificial-intelligence-based approach, which provides a means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while putting a constraint on the number of sites used for the approximation. We show that computing the likelihood based on 5% of the sites already provides an accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search decreased running time substantially while retaining the same tree-search performance.

AVAILABILITY AND IMPLEMENTATION: The code was implemented in Python version 3.8 and is available through GitHub (https://github.com/noaeker/lasso_positions_sampling). The datasets used in this paper were retrieved from Zhou et al. (2018) as described in section 3.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
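The Lasso-based site selection described in the RESULTS paragraph above can be sketched as follows. This is an illustrative sketch only, not the code from the linked repository: the per-site log-likelihoods are synthetic, the function names (`fit_lasso`, `approx_ll`) are hypothetical, and a small coordinate-descent solver is written out so that the example needs nothing beyond the standard library. Each row is one tree, each column one site, and the regression target is the full-alignment log-likelihood (the row sum).

```python
import random


def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty` (the Lasso proximal step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def fit_lasso(X, y, alpha, n_iter=200):
    """Coordinate-descent Lasso on centered data; returns (weights, intercept)."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    residual = [yi - y_mean for yi in y]  # residual when all weights are zero
    weights = [0.0] * p
    col_sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of site j with the residual that excludes site j itself.
            rho = sum(Xc[i][j] * (residual[i] + weights[j] * Xc[i][j])
                      for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - weights[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * Xc[i][j]
                weights[j] = new_w
    intercept = y_mean - sum(w * m for w, m in zip(weights, col_means))
    return weights, intercept


# Synthetic data (assumption): sites share a few latent factors, so a subset
# of sites carries most of the information about the total log-likelihood.
rng = random.Random(0)
n_trees, n_sites, n_factors = 40, 30, 3
loadings = [[rng.uniform(0.2, 1.0) for _ in range(n_factors)]
            for _ in range(n_sites)]
site_lls = []
for _ in range(n_trees):
    factors = [rng.gauss(0.0, 1.0) for _ in range(n_factors)]
    site_lls.append([
        -(5.0 + sum(l * f for l, f in zip(loadings[j], factors))
          + rng.gauss(0.0, 0.05))
        for j in range(n_sites)
    ])
total_ll = [sum(row) for row in site_lls]

weights, intercept = fit_lasso(site_lls, total_ll, alpha=0.5)
selected = [j for j, w in enumerate(weights) if w != 0.0]


def approx_ll(row):
    """Approximate the full log-likelihood from the selected sites only."""
    return intercept + sum(weights[j] * row[j] for j in selected)


print(len(selected), "of", n_sites, "sites carry non-zero weight")
```

A larger `alpha` pushes more weights to exactly zero, which is how the penalty trades prediction accuracy against the number of sites used.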

Subject(s)

Artificial Intelligence; Software; Likelihood Functions; Phylogeny

A Probabilistic Model for Indel Evolution: Differentiating Insertions from Deletions.

Loewenthal, Gil; Rapoport, Dana; Avram, Oren; Moshe, Asher; Wygoda, Elya; Itzkovitch, Alon; Israeli, Omer; Azouri, Dana; Cartwright, Reed A; Mayrose, Itay; Pupko, Tal.

Mol Biol Evol ; 38(12): 5769-5781, 2021 12 09.

Article in English | MEDLINE | ID: mdl-34469521

ABSTRACT

Insertions and deletions (indels) are common molecular evolutionary events. However, probabilistic models for indel evolution are underdeveloped due to their computational complexity. Here, we introduce several improvements to indel modeling: 1) while previous models for indel evolution assumed that the rates and length distributions of insertions and deletions are equal, here we propose a richer model that explicitly distinguishes between the two; 2) we introduce numerous summary statistics that allow approximate Bayesian computation-based parameter estimation; 3) we develop a method to correct for biases introduced by alignment programs, when inferring indel parameters from empirical data sets; and 4) using a model-selection scheme, we test whether the richer model better fits biological data compared with the simpler model. Our analyses suggest that both our inference scheme and the model-selection procedure achieve high accuracy on simulated data. We further demonstrate that our proposed richer model better fits a large number of empirical data sets and that, for the majority of these data sets, the deletion rate is higher than the insertion rate.
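Approximate Bayesian computation, mentioned in point 2 of the abstract above, can be illustrated in its simplest rejection form. The sketch below is an assumption-laden toy and not the authors' pipeline: the simulator, the uniform priors, the two summary statistics (insertion and deletion counts), and the observed values are all hypothetical.

```python
import random


def simulate_counts(ins_rate, del_rate, n_sites, rng):
    """Toy simulator: each site independently experiences an insertion and/or
    a deletion with the given per-site probabilities. Returns the two counts."""
    insertions = sum(rng.random() < ins_rate for _ in range(n_sites))
    deletions = sum(rng.random() < del_rate for _ in range(n_sites))
    return insertions, deletions


def abc_rejection(observed, n_sites, n_draws, tolerance, rng):
    """Keep prior draws whose simulated summary statistics fall close to the
    observed ones; the kept draws approximate the posterior."""
    accepted = []
    for _ in range(n_draws):
        ins_rate = rng.uniform(0.0, 0.1)  # hypothetical uniform priors
        del_rate = rng.uniform(0.0, 0.1)
        sim = simulate_counts(ins_rate, del_rate, n_sites, rng)
        distance = abs(sim[0] - observed[0]) + abs(sim[1] - observed[1])
        if distance <= tolerance:
            accepted.append((ins_rate, del_rate))
    return accepted


rng = random.Random(1)
observed = (10, 25)  # hypothetical observed insertion and deletion counts
accepted = abc_rejection(observed, n_sites=500, n_draws=4000,
                         tolerance=6, rng=rng)

if accepted:
    mean_ins = sum(a[0] for a in accepted) / len(accepted)
    mean_del = sum(a[1] for a in accepted) / len(accepted)
    print(f"accepted {len(accepted)} draws; "
          f"posterior means: insertion {mean_ins:.3f}, deletion {mean_del:.3f}")
```

Treating the insertion and deletion rates as separate parameters, as in this sketch, is what allows the two to be compared after inference.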

Subject(s)

Evolution, Molecular; INDEL Mutation; Bayes Theorem; Models, Statistical; Phylogeny

Harnessing machine learning to guide phylogenetic-tree search algorithms.

Azouri, Dana; Abadi, Shiran; Mansour, Yishay; Mayrose, Itay; Pupko, Tal.

Nat Commun ; 12(1): 1983, 2021 03 31.

Article in English | MEDLINE | ID: mdl-33790270

ABSTRACT

Inferring a phylogenetic tree is a fundamental challenge in evolutionary studies. Current paradigms for phylogenetic tree reconstruction rely on performing costly likelihood optimizations. With the aim of making tree inference feasible for problems involving more than a handful of sequences, inference under the maximum-likelihood paradigm integrates heuristic approaches to evaluate only a subset of all potential trees. Consequently, existing methods suffer from the known tradeoff between accuracy and running time. In this proof-of-concept study, we train a machine-learning algorithm over an extensive cohort of empirical data to predict the neighboring trees that increase the likelihood, without actually computing their likelihood. This provides a means to safely discard a large portion of the search space, thus potentially accelerating heuristic tree searches without losing accuracy. Our analyses suggest that machine learning can guide tree-search methodologies towards the most promising candidate trees.

Subject(s)

Algorithms; Evolution, Molecular; Machine Learning; Phylogeny; Animals; Databases, Genetic/statistics & numerical data; Databases, Protein/statistics & numerical data; Humans; Models, Genetic

Heterogeneity in the rate of molecular sequence evolution substantially impacts the accuracy of detecting shifts in diversification rates.

Shafir, Anat; Azouri, Dana; Goldberg, Emma E; Mayrose, Itay.

Evolution ; 74(8): 1620-1639, 2020 08.

Article in English | MEDLINE | ID: mdl-32510165

ABSTRACT

As species richness varies along the tree of life, there is great interest in identifying factors that affect the rates at which lineages speciate or go extinct. To this end, theoretical biologists have developed a suite of phylogenetic comparative methods that aim to identify where shifts in diversification rates have occurred along a phylogeny and whether they are associated with some traits. Using these methods, numerous studies have predicted that speciation and extinction rates vary across the tree of life. In this study, we show that asymmetric rates of sequence evolution lead to systematic biases in the inferred phylogeny, which in turn lead to erroneous inferences regarding lineage diversification patterns. The results demonstrate that as the asymmetry in sequence evolution rates increases, so does the tendency to select more complicated models that include the possibility of diversification rate shifts. These results thus suggest that any inference regarding shifts in diversification pattern should be treated with great caution, at least until any biases regarding the molecular substitution rate have been ruled out.

Subject(s)

Biological Evolution; Models, Genetic; Computer Simulation

Model selection may not be a mandatory step for phylogeny reconstruction.

Abadi, Shiran; Azouri, Dana; Pupko, Tal; Mayrose, Itay.

Nat Commun ; 10(1): 934, 2019 02 25.

Article in English | MEDLINE | ID: mdl-30804347

ABSTRACT

Determining the most suitable model for phylogeny reconstruction constitutes a fundamental step in numerous evolutionary studies. Over the years, various criteria for model selection have been proposed, leading to debate over which criterion is preferable. However, the necessity of this procedure has not been questioned to date. Here, we demonstrate that although incongruency regarding the selected model is frequent over empirical and simulated data, all criteria lead to very similar inferences. When topologies and ancestral sequence reconstruction are the desired output, choosing one criterion over another is not crucial. Moreover, skipping model selection and using instead the most parameter-rich model, GTR+I+G, leads to similar inferences, thus rendering this time-consuming step nonessential, at least under current strategies of model selection.
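The disagreement between model-selection criteria noted in the abstract above can be made concrete with two widely used criteria, AIC and BIC (taken here as representative examples; the abstract does not list the criteria compared). The log-likelihoods and alignment length below are hypothetical, and the parameter counts cover substitution-model parameters only, excluding branch lengths.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model name -> (log-likelihood, free model parameters).
FITS = {
    "JC": (-5230.0, 0),
    "HKY": (-5105.0, 4),
    "GTR+I+G": (-5098.0, 10),
}
N_SITES = 1000  # hypothetical alignment length

best_by_aic = min(FITS, key=lambda m: aic(*FITS[m]))
best_by_bic = min(FITS, key=lambda m: bic(*FITS[m], N_SITES))

print("AIC selects:", best_by_aic)
print("BIC selects:", best_by_bic)
```

With these numbers AIC selects GTR+I+G while BIC, which penalizes extra parameters more heavily at this alignment length, selects HKY.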

Subject(s)

Models, Genetic; Phylogeny; Evolution, Molecular
