
The Tree Reconstruction Game: Phylogenetic Reconstruction Using Reinforcement Learning.

Azouri, Dana; Granit, Oz; Alburquerque, Michael; Mansour, Yishay; Pupko, Tal; Mayrose, Itay.

Mol Biol Evol ; 41(6)2024 Jun 01.

Article in English | MEDLINE | ID: mdl-38829798

ABSTRACT

The computational search for the maximum-likelihood phylogenetic tree is an NP-hard problem. As such, current tree search algorithms might return a tree that is a local optimum rather than the global one. Here, we introduce a paradigm shift for predicting the maximum-likelihood tree, by approximating long-term gains of likelihood rather than maximizing likelihood gain at each step of the search. Our proposed approach harnesses the power of reinforcement learning to learn an optimal search strategy, aiming at the global optimum of the search space. We show that when analyzing empirical data containing dozens of sequences, the log-likelihood improvement from the starting tree obtained by the reinforcement learning-based agent was 0.969 or higher compared to that achieved by current state-of-the-art techniques. Notably, this performance is attained without the need to perform costly likelihood optimizations apart from the training process, thus potentially allowing for an exponential decrease in runtime. We exemplify this for data sets containing 15 sequences of length 18,000 bp and demonstrate that the reinforcement learning-based method is roughly three times faster than the state-of-the-art software. This study illustrates the potential of reinforcement learning in addressing the challenges of phylogenetic tree reconstruction.

Subject(s)

Algorithms , Phylogeny , Likelihood Functions , Models, Genetic , Computational Biology/methods , Software

Statistical framework to determine indel-length distribution.

Wygoda, Elya; Loewenthal, Gil; Moshe, Asher; Alburquerque, Michael; Mayrose, Itay; Pupko, Tal.

Bioinformatics ; 40(2)2024 Feb 01.

Article in English | MEDLINE | ID: mdl-38269647

ABSTRACT

MOTIVATION: Insertions and deletions (indels) of short DNA segments, along with substitutions, are the most frequent molecular evolutionary events. Indels were shown to affect numerous macro-evolutionary processes. Because indels may span multiple positions, their impact is a product of both their rate and their length distribution. An accurate inference of indel-length distribution is important for multiple evolutionary and bioinformatics applications, most notably for alignment software. Previous studies counted the number of continuous gap characters in alignments to determine the best-fitting length distribution. However, gap-counting methods are not statistically rigorous, as gap blocks are not synonymous with indels. Furthermore, such methods rely on alignments that regularly contain errors and are biased because alignment methods assume that indel lengths follow a geometric distribution. RESULTS: We aimed to determine which indel-length distribution best characterizes alignments using statistically rigorous methodologies. To this end, we reduced the alignment bias using a machine-learning algorithm and applied an Approximate Bayesian Computation methodology for model selection. Moreover, we developed a novel method to test if current indel models provide an adequate representation of the evolutionary process. We found that the best-fitting model varies among alignments, with a Zipf length distribution fitting the vast majority of them. AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available on GitHub, at https://github.com/elyawy/SpartaSim and https://github.com/elyawy/SpartaPipeline.

Subject(s)

Algorithms , Software , Bayes Theorem , Sequence Alignment , INDEL Mutation , Evolution, Molecular

GenomeFLTR: filtering reads made easy.

Dotan, Edo; Alburquerque, Michael; Wygoda, Elya; Huchon, Dorothée; Pupko, Tal.

Nucleic Acids Res ; 51(W1): W232-W236, 2023 Jul 05.

Article in English | MEDLINE | ID: mdl-37177997

ABSTRACT

In the last decade, advances in sequencing technology have led to an exponential increase in genomic data. These new data have dramatically changed our understanding of the evolution and function of genes and genomes. Despite improvements in sequencing technologies, identifying contaminated reads remains a complex task for many research groups. Here, we introduce GenomeFLTR, a new web server to filter contaminated reads. Reads are compared against existing sequence databases from various representative organisms to detect potential contaminants. The main features implemented in GenomeFLTR are: (i) automated updating of the relevant databases; (ii) fast comparison of each read against the database; (iii) the ability to create user-specified databases; (iv) a user-friendly interactive dashboard to investigate the origin and frequency of the contaminations; (v) the generation of a contamination-free file. Availability: https://genomefltr.tau.ac.il/.

Subject(s)

Genomics , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Genome/genetics , Databases, Nucleic Acid , Software

Natural language processing approach to model the secretion signal of type III effectors.

Wagner, Naama; Alburquerque, Michael; Ecker, Noa; Dotan, Edo; Zerah, Ben; Pena, Michelle Mendonca; Potnis, Neha; Pupko, Tal.

Front Plant Sci ; 13: 1024405, 2022.

Article in English | MEDLINE | ID: mdl-36388586

ABSTRACT

Type III effectors are proteins injected by Gram-negative bacteria into eukaryotic hosts. In many plant and animal pathogens, these effectors manipulate host cellular processes to the benefit of the bacteria. Type III effectors are secreted by a type III secretion system that must "classify" each bacterial protein into one of two categories: proteins that should be translocated and proteins that should not. It was previously shown that type III effectors have a secretion signal within their N-terminus. However, despite numerous efforts, the exact biochemical identity of this secretion signal is generally unknown. Computational characterization of the secretion signal is important for the identification of novel effectors and for better understanding the molecular translocation mechanism. In this work, we developed novel machine-learning algorithms for characterizing the secretion signal in both plant and animal pathogens. Specifically, we represented each protein as a vector in high-dimensional space using Facebook's protein language model. Classification algorithms were next used to separate effectors from non-effector proteins. We subsequently curated a benchmark dataset of hundreds of effectors and thousands of non-effector proteins. We showed that on this curated dataset, our novel approach yielded substantially better classification accuracy compared to previously developed methodologies. We have also tested the hypothesis that plant and animal pathogen effectors are characterized by different secretion signals. Finally, we integrated the novel approach into Effectidor, a web server for predicting type III effector proteins, leading to a more accurate classification of effectors from non-effectors.
