Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 54
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
Bioinform Adv ; 4(1): vbae089, 2024.
Artigo em Inglês | MEDLINE | ID: mdl-38911822

RESUMO

Motivation: Genomic islands (GEIs) are clusters of genes in bacterial genomes that are typically acquired by horizontal gene transfer. GEIs play a crucial role in the evolution of bacteria by rapidly introducing genetic diversity and thus helping them adapt to changing environments. Specifically of interest to human health, many GEIs contain pathogenicity and antimicrobial resistance genes. Detecting GEIs is, therefore, an important problem in biomedical and environmental research. There have been many previous studies for computationally identifying GEIs. Still, most of these studies rely on detecting anomalies in the unannotated nucleotide sequences or on a fixed set of known features on annotated nucleotide sequences. Results: Here, we present TreasureIsland, which uses a new unsupervised representation of DNA sequences to predict GEIs. We developed a high-precision boundary detection method featuring an incremental fine-tuning of GEI borders, and we evaluated the accuracy of this framework using a new comprehensive reference dataset, Benbow. We show that TreasureIsland's accuracy rivals other GEI predictors, enabling efficient and faster identification of GEIs in unannotated bacterial genomes. Availability and implementation: TreasureIsland is available under an MIT license at: https://github.com/FriedbergLab/GenomicIslandPrediction.

2.
J Comput Biol ; 31(4): 312-327, 2024 04.
Artigo em Inglês | MEDLINE | ID: mdl-38634854

RESUMO

Phylogenetic inference and reconstruction methods generate hypotheses on evolutionary history. Competing inference methods are frequently used, and the evaluation of the generated hypotheses is achieved using tree comparison costs. The Robinson-Foulds (RF) distance is a widely used cost to compare the topology of two trees, but this cost is sensitive to tree error and can overestimate tree differences. To overcome this limitation, a refined version of the RF distance called the Cluster Affinity (CA) distance was introduced. However, CA distances are symmetric and cannot compare different types of trees. These asymmetric comparisons occur when gene trees are compared with species trees, when disparate datasets are integrated into a supertree, or when tree comparison measures are used to infer a phylogenetic network. In this study, we introduce a relaxation of the original Affinity distance to compare heterogeneous trees called the asymmetric CA cost. We also develop a biologically interpretable cost, the Cluster Support cost that normalizes by cluster size across gene trees. The characteristics of these costs are similar to the symmetric CA cost. We describe efficient algorithms, derive the exact diameters, and use these to standardize the cost to be applicable in practice. These costs provide objective, fine-scale, and biologically interpretable values that can assess differences and similarities between phylogenetic trees.


Assuntos
Algoritmos , Filogenia , Análise por Conglomerados , Modelos Genéticos , Biologia Computacional/métodos , Evolução Molecular
3.
Bioinformatics ; 39(39 Suppl 1): i177-i184, 2023 06 30.
Artigo em Inglês | MEDLINE | ID: mdl-37387175

RESUMO

The classic quantitative measure of phylogenetic diversity (PD) has been used to address problems in conservation biology, microbial ecology, and evolutionary biology. PD is the minimum total length of the branches in a phylogeny required to cover a specified set of taxa on the phylogeny. A general goal in the application of PD has been identifying a set of taxa of size k that maximize PD on a given phylogeny; this has been mirrored in active research to develop efficient algorithms for the problem. Other descriptive statistics, such as the minimum PD, average PD, and standard deviation of PD, can provide invaluable insight into the distribution of PD across a phylogeny (relative to a fixed value of k). However, there has been limited or no research on computing these statistics, especially when required for each clade in a phylogeny, enabling direct comparisons of PD between clades. We introduce efficient algorithms for computing PD and the associated descriptive statistics for a given phylogeny and each of its clades. In simulation studies, we demonstrate the ability of our algorithms to analyze large-scale phylogenies with applications in ecology and evolutionary biology. The software is available at https://github.com/flu-crew/PD_stats.


Assuntos
Algoritmos , Evolução Biológica , Filogenia , Simulação por Computador , Software
4.
Syst Biol ; 72(5): 1052-1063, 2023 11 01.
Artigo em Inglês | MEDLINE | ID: mdl-37208300

RESUMO

The use of next-generation sequencing technology has enabled phylogenetic studies with hundreds of thousands of taxa. Such large-scale phylogenies have become a critical component in genomic epidemiology in pathogens such as SARS-CoV-2 and influenza A virus. However, detailed phenotypic characterization of pathogens or generating a computationally tractable dataset for detailed phylogenetic analyses requires objective subsampling of taxa. To address this need, we propose parnas, an objective and flexible algorithm to sample and select taxa that best represent observed diversity by solving a generalized k-medoids problem on a phylogenetic tree. parnas solves this problem efficiently and exactly by novel optimizations and adapting algorithms from operations research. For more nuanced selections, taxa can be weighted with metadata or genetic sequence parameters, and the pool of potential representatives can be user-constrained. Motivated by influenza A virus genomic surveillance and vaccine design, parnas can be applied to identify representative taxa that optimally cover the diversity in a phylogeny within a specified distance radius. We demonstrated that parnas is more efficient and flexible than existing approaches. To demonstrate its utility, we applied parnas to 1) quantify SARS-CoV-2 genetic diversity over time, 2) select representative influenza A virus in swine genes derived from over 5 years of genomic surveillance data, and 3) identify gaps in H3N2 human influenza A virus vaccine coverage. We suggest that our method, through the objective selection of representatives in a phylogeny, provides criteria for quantifying genetic diversity that has application in the the rational design of multivalent vaccines and genomic epidemiology. PARNAS is available at https://github.com/flu-crew/parnas.


Assuntos
Vírus da Influenza A Subtipo H3N2 , Vacinas , Animais , Humanos , Suínos , Filogenia , Vírus da Influenza A Subtipo H3N2/genética , Genômica
5.
Bioinformatics ; 38(8): 2144-2152, 2022 04 12.
Artigo em Inglês | MEDLINE | ID: mdl-35150239

RESUMO

MOTIVATION: A phylogenetic network is a powerful model to represent entangled evolutionary histories with both divergent (speciation) and convergent (e.g. hybridization, reassortment, recombination) evolution. The standard approach to inference of hybridization networks is to (i) reconstruct rooted gene trees and (ii) leverage gene tree discordance for network inference. Recently, we introduced a method called RF-Net for accurate inference of virus reassortment and hybridization networks from input gene trees in the presence of errors commonly found in phylogenetic trees. While RF-Net demonstrated the ability to accurately infer networks with up to four reticulations from erroneous input gene trees, its application was limited by the number of reticulations it could handle in a reasonable amount of time. This limitation is particularly restrictive in the inference of the evolutionary history of segmented RNA viruses such as influenza A virus (IAV), where reassortment is one of the major mechanisms shaping the evolution of these pathogens. RESULTS: Here, we expand the functionality of RF-Net that makes it significantly more applicable in practice. Crucially, we introduce a fast extension to RF-Net, called Fast-RF-Net, that can handle large numbers of reticulations without sacrificing accuracy. In addition, we develop automatic stopping criteria to select the appropriate number of reticulations heuristically and implement a feature for RF-Net to output error-corrected input gene trees. We then conduct a comprehensive study of the original method and its novel extensions and confirm their efficacy in practice using extensive simulation and empirical IAV evolutionary analyses. AVAILABILITY AND IMPLEMENTATION: RF-Net 2 is available at https://github.com/flu-crew/rf-net-2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Vírus da Influenza A , Filogenia , Simulação por Computador , Vírus da Influenza A/genética , Evolução Molecular , Modelos Genéticos
6.
J Comput Biol ; 28(8): 758-773, 2021 08.
Artigo em Inglês | MEDLINE | ID: mdl-34125600

RESUMO

The duplication-loss-coalescence (DLC) parsimony model is invaluable for analyzing the complex scenarios of concurrent duplication loss and deep coalescence events in the evolution of gene families. However, inferring such scenarios for already moderately sized families is prohibitive owing to the computational complexity involved. To overcome this stringent limitation, we make the first step by describing a flexible integer linear programming (ILP) formulation for inferring DLC evolutionary scenarios. Then, to make the DLC model more scalable, we introduce four sensibly constrained versions of the model and describe modified versions of our ILP formulation reflecting these constraints. Our simulation studies showcase that our constrained ILP formulations compute evolutionary scenarios that are substantially larger than scenarios computable under our original ILP formulation and the original dynamic programming algorithm by Wu et al. Furthermore, scenarios computed under our constrained DLC models are remarkably accurate compared with corresponding scenarios under the original DLC model, which we also confirm in an empirical study with thousands of gene families.


Assuntos
Biologia Computacional/métodos , Família Multigênica , Algoritmos , Evolução Molecular , Duplicação Gênica , Modelos Genéticos , Filogenia , Programação Linear
7.
Bioinformatics ; 37(22): 4064-4074, 2021 11 18.
Artigo em Inglês | MEDLINE | ID: mdl-34048529

RESUMO

MOTIVATION: The classic multispecies coalescent (MSC) model provides the means for theoretical justification of incomplete lineage sorting-aware species tree inference methods. This has motivated an extensive body of work on phylogenetic methods that are statistically consistent under MSC. One such particularly popular method is ASTRAL, a quartet-based species tree inference method. Novel studies suggest that ASTRAL also performs well when given multi-locus gene trees in simulation studies. Further, Legried et al. recently demonstrated that ASTRAL is statistically consistent under the gene duplication and loss model (GDL). GDL is prevalent in evolutionary histories and is the first core process in the powerful duplication-loss-coalescence evolutionary model (DLCoal) by Rasmussen and Kellis. RESULTS: In this work, we prove that ASTRAL is statistically consistent under the general DLCoal model. Therefore, our result supports the empirical evidence from the simulation-based studies. More broadly, we prove that the quartet-based inference approach is statistically consistent under DLCoal. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Duplicação Gênica , Filogenia , Simulação por Computador
9.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2125-2135, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-31150345

RESUMO

Tree reconciliation costs are a popular choice to account for the discordance between the evolutionary history of a gene family (i.e., a gene tree), and the species tree through which this family has evolved. This discordance is accounted for by the minimum number of postulated evolutionary events necessary for reconciling the two trees. Such events include gene duplication, loss, and deep coalescence, and are used to define different types of tree reconciliation costs. For example, the duplication-loss cost for a gene tree and species tree accounts for the minimum number of gene duplications and losses necessary to reconcile these trees. Fundamental to the understanding of how gene trees and species trees relate to each other are the diameters of tree reconciliation costs. While such diameters have been well-researched, still absent from these studies are the unconstrained diameters for two of the classic tree reconciliation costs, namely the duplication-loss cost and the loss cost. Here, we show the essential mathematical properties of these diameters and provide efficient solutions for computing them. Finally, we analyze the distributions of these diameters using simulated datasets.


Assuntos
Biologia Computacional/métodos , Duplicação Gênica/genética , Modelos Genéticos , Evolução Molecular , Filogenia
10.
Bioinformatics ; 36(Suppl_2): i668-i674, 2020 12 30.
Artigo em Inglês | MEDLINE | ID: mdl-33381825

RESUMO

MOTIVATION: The evolution of complexity is one of the most fascinating and challenging problems in modern biology, and tracing the evolution of complex traits is an open problem. In bacteria, operons and gene blocks provide a model of tractable evolutionary complexity at the genomic level. Gene blocks are structures of co-located genes with related functions, and operons are gene blocks whose genes are co-transcribed on a single mRNA molecule. The genes in operons and gene blocks typically work together in the same system or molecular complex. Previously, we proposed a method that explains the evolution of orthologous gene blocks (orthoblocks) as a combination of a small set of events that take place in vertical evolution from common ancestors. A heuristic method was proposed to solve this problem. However, no study was done to identify the complexity of the problem. RESULTS: Here, we establish that finding the homologous gene block problem is NP-hard and APX-hard. We have developed a greedy algorithm that runs in polynomial time and guarantees an O(ln⁡n) approximation. In addition, we formalize our problem as an integer linear program problem and solve it using the PuLP package and the standard CPLEX algorithm. Our exploration of several candidate operons reveals that our new method provides more optimal results than the results from the heuristic approach, and is significantly faster. AVAILABILITY AND IMPLEMENTATION: The software and data accompanying this paper are available under the GPLv3 and CC0 license respectively on: https://github.com/nguyenngochuy91/Relevant-Operon.


Assuntos
Genômica , Software , Algoritmos , Bactérias , Biologia Computacional , Dureza
11.
BMC Evol Biol ; 20(Suppl 1): 136, 2020 10 28.
Artigo em Inglês | MEDLINE | ID: mdl-33115401

RESUMO

BACKGROUND: Solving median tree problems under tree reconciliation costs is a classic and well-studied approach for inferring species trees from collections of discordant gene trees. These problems are NP-hard, and therefore are, in practice, typically addressed by local search heuristics. So far, however, such heuristics lack any provable correctness or precision. Further, even for small phylogenetic studies, it has been demonstrated that local search heuristics may only provide sub-optimal solutions. Obviating such heuristic uncertainties are exact dynamic programming solutions that allow solving tree reconciliation problems for smaller phylogenetic studies. Despite these promises, such exact solutions are only suitable for credibly rooted input gene trees, which constitute only a tiny fraction of the readily available gene trees. Standard gene tree inference approaches provide only unrooted gene trees and accurately rooting such trees is often difficult, if not impossible. RESULTS: Here, we describe complex dynamic programming solutions that represent the first nonnaïve exact solutions for solving the tree reconciliation problems for unrooted input gene trees. Further, we show that the asymptotic runtime of the proposed solutions does not increase when compared to the most time-efficient dynamic programming solutions for rooted input trees. CONCLUSIONS: In an experimental evaluation, we demonstrate that the described solutions for unrooted gene trees are, like the solutions for rooted input gene trees, suitable for smaller phylogenetic studies. Finally, for the first time, we study the accuracy of classic local search heuristics for unrooted tree reconciliation problems.


Assuntos
Biologia Computacional/métodos , Modelos Genéticos , Filogenia , Algoritmos , Evolução Molecular , Incerteza
12.
Bioinformatics ; 35(17): 2998-3004, 2019 09 01.
Artigo em Inglês | MEDLINE | ID: mdl-30689726

RESUMO

MOTIVATION: Complexity is a fundamental attribute of life. Complex systems are made of parts that together perform functions that a single component, or subsets of components, cannot. Examples of complex molecular systems include protein structures such as the F1Fo-ATPase, the ribosome, or the flagellar motor: each one of these structures requires most or all of its components to function properly. Given the ubiquity of complex systems in the biosphere, understanding the evolution of complexity is central to biology. At the molecular level, operons are classic examples of a complex system. An operon's genes are co-transcribed under the control of a single promoter to a polycistronic mRNA molecule, and the operon's gene products often form molecular complexes or metabolic pathways. With the large number of complete bacterial genomes available, we now have the opportunity to explore the evolution of these complex entities, by identifying possible intermediate states of operons. RESULTS: In this work, we developed a maximum parsimony algorithm to reconstruct ancestral operon states, and show a simple vertical evolution model of how operons may evolve from the individual component genes. We describe several ancestral states that are plausible functional intermediate forms leading to the full operon. We also offer Reconstruction of Ancestral Gene blocks Using Events or ROAGUE as a software tool for those interested in exploring gene block and operon evolution. AVAILABILITY AND IMPLEMENTATION: The software accompanying this paper is available under GPLv3 license on: https://github.com/nguyenngochuy91/Ancestral-Blocks-Reconstruction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Evolução Molecular , Genoma Bacteriano , Óperon , Bactérias , Software
13.
IEEE/ACM Trans Comput Biol Bioinform ; 16(4): 1374-1385, 2019.
Artigo em Inglês | MEDLINE | ID: mdl-29035224

RESUMO

Synthesizing large-scale phylogenetic trees is a fundamental problem in evolutionary biology. Median tree problems have evolved as a powerful tool to reconstruct such trees. Such problems seek a median tree for a given collection of input trees under some problem-specific tree distance. There has been an increased interest in the median tree problem for the classical path-difference distance between trees. While this problem is NP-hard, standard local search heuristics have been described that are based on solving a local search problem exactly. For a more effective heuristic we devise a time efficient algorithm for the local search problem that improves on the best-known solution by a factor of $n$n, where $n$n is the size of the input trees. Furthermore, we introduce a novel hybrid version of the standard local search that is exploiting our new algorithm for a more refined heuristic search. Finally, we demonstrate the performance of our hybrid heuristic in a comparative study with other commonly used methods that synthesize species trees using published empirical data sets.


Assuntos
Biologia Computacional/métodos , Filogenia , Algoritmos , Animais , Simulação por Computador , Evolução Molecular , Internet , Marsupiais , Modelos Genéticos , Software , Especificidade da Espécie
14.
IEEE/ACM Trans Comput Biol Bioinform ; 16(4): 1063-1076, 2019.
Artigo em Inglês | MEDLINE | ID: mdl-28650824

RESUMO

Median tree problems are powerful tools for inferring large-scale phylogenetic trees that hold enormous promise for society at large. Such problems seek a median tree for a given collection of input trees under some problem-specific distance. Here, we introduce a median tree problem under the classic Manhattan path-difference distance. We show that this problem is NP-hard, devise an ILP formulation, and provide an effective local search heuristic that is based on solving a local search problem exactly. Our algorithm for the local search problem improves asymptotically by a factor of $n$n on the best-known (naïve) solution, where $n$n is the overall number of taxa in the input trees. Finally, comparative phylogenetic studies using considerably large empirical data and an accuracy analysis for smaller phylogenetic trees reveal the ability of our novel heuristic.


Assuntos
Biologia Computacional/métodos , Filogenia , Software , Algoritmos , Animais , Classificação , Análise por Conglomerados , Evolução Molecular , Marsupiais , Modelos Genéticos , Linguagens de Programação
15.
IEEE/ACM Trans Comput Biol Bioinform ; 16(5): 1459-1470, 2019.
Artigo em Inglês | MEDLINE | ID: mdl-30222583

RESUMO

Median tree inference under path-difference metrics has shown great promise for large-scale phylogeny estimation. Similar to these metrics is the family of cophenetic metrics that originates from a classic dendrogram comparison method introduced more than 50 years ago. Despite the appeal of this family of metrics, the problem of computing median trees under cophenetic metrics has not been analyzed. Like other standard median tree problems relevant in practice, as we show here, this problem is also NP-hard. NP-hard median tree problems have been successfully addressed by local search heuristics that are solving thousands of instances of a corresponding (local neighborhood) search problem. For the local neighborhood search problem under a cophenetic metric, the best known (naïve) algorithm has a time complexity that is typically prohibitive for effective heuristic searches. Building on the pioneering work on path-difference median trees, we develop efficient algorithms for Manhattan and Euclidean cophenetic search problems that improve on the naïve solution by a linear and a quadratic factor, respectively. We demonstrate the performance and effectiveness of the resulting heuristic methods in a comparative study using benchmark empirical datasets.


Assuntos
Biologia Computacional/métodos , Modelos Genéticos , Filogenia , Algoritmos , Bases de Dados Genéticas , Técnicas Genéticas , Heurística
18.
IEEE/ACM Trans Comput Biol Bioinform ; 15(5): 1723-1727, 2018.
Artigo em Inglês | MEDLINE | ID: mdl-28792904

RESUMO

Synthesizing median trees from a collection of gene trees under the biologically motivated gene tree parsimony (GTP) costs has provided credible species tree estimates. GTP costs are defined for each of the classic evolutionary processes. These costs count the minimum number of events necessary to reconcile the gene tree with the species tree where the leaf-genes are mapped to the leaf-species through a function called labeling. To better understand the synthesis of median trees under these costs, there is an increased interest in analyzing their diameters. The diameters of a GTP cost between a gene tree and a species tree are the maximum values of this cost of one or both topologies of the trees involved. We are concerned about the diameters of the GTP costs under bijective labelings. While these diameters are linear time computable for the gene duplication and deep coalescence costs, this has been unknown for the classic gene duplication and loss, and for the loss cost. For the first time, we show how to compute these diameters and proof that this can be achieved in linear time, and thus, completing the computational time analysis for all of the bijective diameters under the GTP costs.


Assuntos
Biologia Computacional/métodos , Evolução Molecular , Modelos Genéticos , Filogenia , Duplicação Gênica
20.
J Bioinform Comput Biol ; 15(3): 1740002, 2017 Jun.
Artigo em Inglês | MEDLINE | ID: mdl-28513253

RESUMO

Supertree problems are a standard tool for synthesizing large-scale species trees from a given collection of gene trees under some problem-specific objective. Unfortunately, these problems are typically NP-hard, and often remain so when their instances are restricted to rooted gene trees sampled from the same species. While a class of restricted supertree problems has been effectively addressed by the parameterized strict consensus approach, in practice, most gene trees are unrooted and sampled from different species. Here, we overcome this stringent limitation by describing efficient algorithms that are adopting the strict consensus approach to also handle unrestricted supertree problems. Finally, we demonstrate the performance of our algorithms in a comparative study with classic supertree heuristics using simulated and empirical data sets.


Assuntos
Algoritmos , Filogenia , Biologia Computacional/métodos , Simulação por Computador , Duplicação Gênica
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...