Results 1 - 20 of 27
1.
J Comput Biol ; 31(4): 312-327, 2024 04.
Article in English | MEDLINE | ID: mdl-38634854

ABSTRACT

Phylogenetic inference and reconstruction methods generate hypotheses on evolutionary history. Competing inference methods are frequently used, and the evaluation of the generated hypotheses is achieved using tree comparison costs. The Robinson-Foulds (RF) distance is a widely used cost to compare the topology of two trees, but this cost is sensitive to tree error and can overestimate tree differences. To overcome this limitation, a refined version of the RF distance called the Cluster Affinity (CA) distance was introduced. However, CA distances are symmetric and cannot compare different types of trees. These asymmetric comparisons occur when gene trees are compared with species trees, when disparate datasets are integrated into a supertree, or when tree comparison measures are used to infer a phylogenetic network. In this study, we introduce a relaxation of the original Affinity distance to compare heterogeneous trees, called the asymmetric CA cost. We also develop a biologically interpretable cost, the Cluster Support cost, which normalizes by cluster size across gene trees. The characteristics of these costs are similar to the symmetric CA cost. We describe efficient algorithms, derive the exact diameters, and use these to standardize the cost to be applicable in practice. These costs provide objective, fine-scale, and biologically interpretable values that can assess differences and similarities between phylogenetic trees.
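
As an illustration of the Robinson-Foulds cost mentioned in this abstract, the following is a minimal sketch of the cluster-based RF distance for rooted trees. It is not the CA or Cluster Support cost from the article. The nested-tuple tree encoding, the function names `clusters` and `rf_distance`, and the choice to count the symmetric difference without halving are assumptions made for this example.

```python
def clusters(tree):
    """Collect the leaf sets (clusters) of all internal nodes of a rooted tree.

    The tree is a nested tuple whose leaves are taxon names (an encoding
    assumed for this sketch, not taken from the article).
    """
    found = set()

    def leaves_below(node):
        if not isinstance(node, tuple):
            return frozenset([node])          # a leaf contributes only itself
        leafset = frozenset().union(*(leaves_below(ch) for ch in node))
        found.add(leafset)                    # record the cluster of this internal node
        return leafset

    leaves_below(tree)
    return found


def rf_distance(t1, t2):
    """Rooted RF distance: number of clusters present in exactly one tree."""
    return len(clusters(t1) ^ clusters(t2))


t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
print(rf_distance(t1, t1))  # 0
print(rf_distance(t1, t2))  # 4
```

A single misplaced leaf can change many clusters at once, which is one way to see why the abstract describes RF as sensitive to tree error.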


Subjects
Algorithms, Phylogeny, Cluster Analysis, Genetic Models, Computational Biology/methods, Molecular Evolution
2.
Algorithms Mol Biol ; 19(1): 7, 2024 Feb 14.
Article in English | MEDLINE | ID: mdl-38355611

ABSTRACT

We present a novel problem, called MetaEC, which aims to infer gene-species assignments in a collection of partially leaf-labeled gene trees by minimizing the size of duplication episode clustering (EC). This problem is particularly relevant in metagenomics, where incomplete data often poses a challenge in the accurate reconstruction of gene histories. To solve MetaEC, we propose a polynomial time dynamic programming (DP) formulation that verifies the existence of a set of duplication episodes from a predefined set of episode candidates. In addition, we design a method to infer distributions of gene-species mappings. We then demonstrate how to use DP to design an algorithm that solves MetaEC. Although the algorithm is exponential in the worst case, we introduce a heuristic modification of the algorithm that provides a solution with the knowledge that it is exact. To evaluate our method, we perform two computational experiments on simulated and empirical data containing whole genome duplication events, showing that our algorithm is able to accurately infer the corresponding events.

3.
Algorithms Mol Biol ; 17(1): 11, 2022 May 19.
Article in English | MEDLINE | ID: mdl-35590416

ABSTRACT

BACKGROUND: Phylogenetic networks are mathematical models of evolutionary processes involving reticulate events such as hybridization, recombination, or horizontal gene transfer. One of the crucial notions in phylogenetic network modelling is the displayed tree, which is obtained from a network by removing a set of reticulation edges. Displayed trees may represent an evolutionary history of a gene family if the evolution is shaped by reticulation events. RESULTS: We address the problem of inferring an optimal tree displayed by a network, given a gene tree G and a tree-child network N, under the deep coalescence and duplication costs. We propose an O(mn)-time dynamic programming algorithm (DP) to compute a lower bound of the optimal displayed tree cost, where m and n are the sizes of G and N, respectively. In addition, our algorithm can verify whether the solution is exact. Moreover, it provides a set of reticulation edges corresponding to the obtained cost. If the cost is exact, the set induces an optimal displayed tree. Otherwise, the set contains pairs of conflicting edges, i.e., edges sharing a reticulation node. Next, we show a conflict resolution algorithm that requires [Formula: see text] invocations of DP in the worst case, where r is the number of reticulations. We propose a similar [Formula: see text]-time algorithm for level-k tree-child networks and a branch and bound solution to compute lower and upper bounds of optimal costs. We also extend the algorithms to a broader class of phylogenetic networks. Based on simulated data, the average runtime is [Formula: see text] under the deep-coalescence cost and [Formula: see text] under the duplication cost. CONCLUSIONS: Despite exponential complexity in the worst case, our algorithms perform well on empirical and simulated datasets, due to the strategy of resolving internal dissimilarities between gene trees and networks. Therefore, the algorithms are efficient alternatives to enumeration strategies commonly proposed in the literature and enable analyses of complex networks with dozens of reticulations.
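
To make the notion of a displayed tree concrete, here is a minimal sketch of the naive enumeration strategy that the abstract contrasts with its DP approach: keep exactly one incoming edge per reticulation node and read off the resulting tree. This is not the article's algorithm. The `parents` dictionary encoding, the function name `displayed_cluster_sets`, and the representation of each displayed tree by its set of clusters are assumptions made for this example.

```python
from itertools import product


def displayed_cluster_sets(parents, leaves):
    """Enumerate the topologies of all trees displayed by a network.

    `parents` maps every non-root node to the list of its parents; a node
    with more than one parent is a reticulation (hypothetical encoding).
    Each displayed tree is returned as a frozenset of its clusters.
    """
    retic = [v for v, ps in parents.items() if len(ps) > 1]
    results = set()
    # One choice of surviving parent per reticulation: up to 2^r combinations
    # when every reticulation has exactly two parents.
    for choice in product(*(parents[v] for v in retic)):
        chosen = dict(zip(retic, choice))
        up = {v: chosen.get(v, ps[0]) for v, ps in parents.items()}
        below = {}
        for leaf in leaves:
            v = leaf
            while v in up:                     # climb until the root is reached
                v = up[v]
                below.setdefault(v, set()).add(leaf)
        # Dropping single-leaf sets and merging equal sets suppresses
        # the degree-2 nodes left behind by the removed edges.
        results.add(frozenset(frozenset(c) for c in below.values() if len(c) > 1))
    return results


# Root r has children u and v; h is a reticulation below both u and v.
parents = {"u": ["r"], "v": ["r"], "a": ["u"], "c": ["v"], "h": ["u", "v"], "b": ["h"]}
trees = displayed_cluster_sets(parents, ["a", "b", "c"])
print(len(trees))  # 2
```

In this small network the two displayed trees are ((a,b),c) and (a,(b,c)). The number of combinations grows exponentially with the number of reticulations, which is the cost that non-enumerative methods aim to avoid.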

4.
J Comput Biol ; 28(8): 758-773, 2021 08.
Article in English | MEDLINE | ID: mdl-34125600

ABSTRACT

The duplication-loss-coalescence (DLC) parsimony model is invaluable for analyzing the complex scenarios of concurrent duplication, loss, and deep coalescence events in the evolution of gene families. However, inferring such scenarios even for moderately sized families is prohibitive owing to the computational complexity involved. To overcome this stringent limitation, we make the first step by describing a flexible integer linear programming (ILP) formulation for inferring DLC evolutionary scenarios. Then, to make the DLC model more scalable, we introduce four sensibly constrained versions of the model and describe modified versions of our ILP formulation reflecting these constraints. Our simulation studies showcase that our constrained ILP formulations compute evolutionary scenarios that are substantially larger than scenarios computable under our original ILP formulation and the original dynamic programming algorithm by Wu et al. Furthermore, scenarios computed under our constrained DLC models are remarkably accurate compared with corresponding scenarios under the original DLC model, which we also confirm in an empirical study with thousands of gene families.


Subjects
Computational Biology/methods, Multigene Family, Algorithms, Molecular Evolution, Gene Duplication, Genetic Models, Phylogeny, Linear Programming
5.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2125-2135, 2021.
Article in English | MEDLINE | ID: mdl-31150345

ABSTRACT

Tree reconciliation costs are a popular choice to account for the discordance between the evolutionary history of a gene family (i.e., a gene tree), and the species tree through which this family has evolved. This discordance is accounted for by the minimum number of postulated evolutionary events necessary for reconciling the two trees. Such events include gene duplication, loss, and deep coalescence, and are used to define different types of tree reconciliation costs. For example, the duplication-loss cost for a gene tree and species tree accounts for the minimum number of gene duplications and losses necessary to reconcile these trees. Fundamental to the understanding of how gene trees and species trees relate to each other are the diameters of tree reconciliation costs. While such diameters have been well-researched, still absent from these studies are the unconstrained diameters for two of the classic tree reconciliation costs, namely the duplication-loss cost and the loss cost. Here, we show the essential mathematical properties of these diameters and provide efficient solutions for computing them. Finally, we analyze the distributions of these diameters using simulated datasets.
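
The duplication-loss cost described in this abstract can be illustrated with the standard least-common-ancestor (LCA) mapping. The sketch below is not the article's diameter computation. The nested-tuple encoding, the function name `duplication_loss`, the restriction to binary gene trees, and the labeling of gene-tree leaves directly by species names are assumptions made for this example.

```python
def duplication_loss(gene, species):
    """Return (duplications, losses) for a binary gene tree and a species tree.

    Both trees are nested tuples; gene-tree leaves carry the name of the
    species each gene was sampled from (encoding assumed for this sketch).
    """
    parent, depth = {}, {}

    def index(node, par, d):
        parent[node], depth[node] = par, d
        if isinstance(node, tuple):
            for child in node:
                index(child, node, d + 1)

    index(species, None, 0)

    def lca(x, y):
        # Move the deeper node upward until both nodes coincide.
        while x != y:
            if depth[x] >= depth[y]:
                x = parent[x]
            else:
                y = parent[y]
        return x

    dups = losses = 0

    def visit(g):
        nonlocal dups, losses
        if not isinstance(g, tuple):
            return g                           # a leaf gene maps to its species
        left, right = (visit(child) for child in g)
        m = lca(left, right)                   # LCA mapping of this gene node
        steps = (depth[left] - depth[m]) + (depth[right] - depth[m])
        if m == left or m == right:
            dups += 1                          # node maps where a child maps
            losses += steps
        else:
            losses += steps - 2                # speciation: one step per child is free
        return m

    visit(gene)
    return dups, losses


species = (("a", "b"), "c")
gene = (("a", "c"), "b")
print(duplication_loss(gene, species))     # (1, 3)
print(duplication_loss(species, species))  # (0, 0)
```

The first call reports one duplication at the species-tree root and three losses, so the duplication-loss cost of this pair is 4. The second call shows that identical topologies need no events.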


Subjects
Computational Biology/methods, Gene Duplication/genetics, Genetic Models, Molecular Evolution, Phylogeny
6.
Comput Biol Chem ; 89: 107260, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33038778

ABSTRACT

BACKGROUND: The study of genomic duplication is fundamental to understanding the process of evolution. In evolutionary molecular biology, many approaches focus on discovering the occurrences of gene duplications and multiple gene duplication episodes and their locations in the Tree of Life. To reconstruct such episodes, one can cluster single gene duplications inferred by reconciling a set of gene trees with a species tree. RESULTS: We propose an efficient quadratic time algorithm to solve the problem of genomic duplication clustering, in which input gene trees are rooted, episode locations are restricted to preserve the minimal number of single gene duplications, clustering rules are described by the minimum episodes method, and the goal is based on the recently introduced approach to minimize the maximal number of duplication episodes on a single path, called here the MP score. Based on our theoretical results, we show new algorithmic relationships between the MP score and the minimum episodes (ME) score, defined as the minimal number of duplication episodes. CONCLUSIONS: Our evaluation analysis on three empirical datasets demonstrates that, under the model in which the minimal number of duplications is preserved, the duplication clusterings with minimal MP score support the clusterings with the minimal total number of duplication episodes. AVAILABILITY: The software is available at https://bitbucket.org/pgor17/rmp.


Subjects
Algorithms, Gene Duplication, Genetic Models, Genetic Databases/statistics & numerical data, Molecular Evolution
7.
BMC Evol Biol ; 20(Suppl 1): 136, 2020 10 28.
Article in English | MEDLINE | ID: mdl-33115401

ABSTRACT

BACKGROUND: Solving median tree problems under tree reconciliation costs is a classic and well-studied approach for inferring species trees from collections of discordant gene trees. These problems are NP-hard, and therefore are, in practice, typically addressed by local search heuristics. So far, however, such heuristics lack any provable correctness or precision. Further, even for small phylogenetic studies, it has been demonstrated that local search heuristics may only provide sub-optimal solutions. Obviating such heuristic uncertainties are exact dynamic programming solutions that allow solving tree reconciliation problems for smaller phylogenetic studies. Despite these promises, such exact solutions are only suitable for credibly rooted input gene trees, which constitute only a tiny fraction of the readily available gene trees. Standard gene tree inference approaches provide only unrooted gene trees and accurately rooting such trees is often difficult, if not impossible. RESULTS: Here, we describe complex dynamic programming solutions that represent the first nonnaïve exact solutions for solving the tree reconciliation problems for unrooted input gene trees. Further, we show that the asymptotic runtime of the proposed solutions does not increase when compared to the most time-efficient dynamic programming solutions for rooted input trees. CONCLUSIONS: In an experimental evaluation, we demonstrate that the described solutions for unrooted gene trees are, like the solutions for rooted input gene trees, suitable for smaller phylogenetic studies. Finally, for the first time, we study the accuracy of classic local search heuristics for unrooted tree reconciliation problems.


Subjects
Computational Biology/methods, Genetic Models, Phylogeny, Algorithms, Molecular Evolution, Uncertainty
8.
J Bioinform Comput Biol ; 16(5): 1840021, 2018 10.
Article in English | MEDLINE | ID: mdl-30419782

ABSTRACT

Metagenomic studies identify the species present in an environmental sample, usually by using procedures that match molecular sequences, e.g. genes, with the species taxonomy. Here, we first formulate the problem of gene-species matching in the parsimony framework using binary phylogenetic gene and species trees under the deep coalescence cost and the assumption that each gene is paired uniquely with one species. In particular, we solve the problem in the cases when one of the trees is a caterpillar. Next, we propose a dynamic programming algorithm that solves the problem exactly; however, its time and space complexity is exponential. We then generalize the problem to include non-binary trees and show the solution for caterpillar trees. We also propose time- and space-efficient heuristic algorithms for solving the gene-species matching problem for any input trees. Finally, we present the results of computational experiments on simulated and empirical datasets consisting of binary tree pairs.


Subjects
Algorithms, Metagenomics/methods, Phylogeny, Animals, Computational Biology/methods, Genetic Databases, Genetic Models
9.
Article in English | MEDLINE | ID: mdl-29990287

ABSTRACT

Based on the classical non-parametric bootstrapping for phylogenetic trees, we propose a novel bootstrap method to define support for gene duplication and speciation events. By comparing bootstrap gene trees to the original gene tree, we calculate support for evolutionary events. While this approach can be used to annotate orthology and paralogy, we show how it can be used to verify the reliability of tree reconciliation. We propose a linear time algorithm for the computation of bootstrap values, and we show the correspondence of our method with the classical non-parametric bootstrapping. Finally, we present two computational experiments. In the first one, based on simulated data and nine yeast genomes, we show a comparative study of several tree rooting methods and an evaluation of their performance using our bootstrapping method. In the second experiment, using data from the TreeFam database, we tested how the reliability of the gene trees influences the inferred supertree. We found that species trees inferred from gene trees having highly supported events are more biologically consistent.

10.
Algorithms Mol Biol ; 13: 11, 2018.
Article in English | MEDLINE | ID: mdl-29881445

ABSTRACT

BACKGROUND: Horizontal gene transfer (HGT), a process of acquisition and fixation of foreign genetic material, is an important biological phenomenon. Several approaches to HGT inference have been proposed. However, most of them rely either on approximate, non-phylogenetic methods or on tree reconciliation, which is computationally intensive and sensitive to parameter values. RESULTS: We investigate the locus tree inference problem as a possible alternative that combines the advantages of both approaches. We present several algorithms to solve the problem in the parsimony framework. We introduce a novel tree mapping, which allows us to obtain a heuristic solution to the problems of locus tree inference and duplication classification. CONCLUSIONS: Our approach allows for faster comparisons of gene and species trees and improves known algorithms for duplication inference in the presence of polytomies in the species trees. We have implemented our algorithms in a software tool available at https://github.com/mciach/LocusTreeInference.

11.
BMC Genomics ; 19(Suppl 5): 288, 2018 May 08.
Article in English | MEDLINE | ID: mdl-29745844

ABSTRACT

BACKGROUND: One of the fundamental issues in evolutionary molecular biology is to discover genomic duplication events and their correspondence to the species tree. Such events can be reconstructed by clustering single gene duplications inferred by reconciling a set of gene trees with a species tree. RESULTS: Here we propose the first solutions to the genomic duplication problem in which every reconciliation with the minimal number of single gene duplications is allowed and the clustering method is minimum episodes, under the assumption that input gene trees are unrooted. CONCLUSIONS: We show new theoretical properties of unrooted reconciliation for the duplication cost and apply them to design several exact and heuristic algorithms for solving the problem. Our evaluation study on an empirical dataset confirms several genomic duplication events from the literature and demonstrates that the algorithms can be successfully applied.


Subjects
Algorithms, Gene Duplication, Genome, Genetic Models, Phylogeny, Animals, Computational Biology, Molecular Evolution, Genomics
12.
IEEE/ACM Trans Comput Biol Bioinform ; 15(5): 1723-1727, 2018.
Article in English | MEDLINE | ID: mdl-28792904

ABSTRACT

Synthesizing median trees from a collection of gene trees under the biologically motivated gene tree parsimony (GTP) costs has provided credible species tree estimates. GTP costs are defined for each of the classic evolutionary processes. These costs count the minimum number of events necessary to reconcile the gene tree with the species tree where the leaf-genes are mapped to the leaf-species through a function called labeling. To better understand the synthesis of median trees under these costs, there is an increased interest in analyzing their diameters. The diameters of a GTP cost between a gene tree and a species tree are the maximum values of this cost over one or both topologies of the trees involved. We are concerned with the diameters of the GTP costs under bijective labelings. While these diameters are linear time computable for the gene duplication and deep coalescence costs, this has been unknown for the classic gene duplication and loss cost, and for the loss cost. For the first time, we show how to compute these diameters and prove that this can be achieved in linear time, thus completing the computational time analysis for all of the bijective diameters under the GTP costs.


Subjects
Computational Biology/methods, Molecular Evolution, Genetic Models, Phylogeny, Gene Duplication
13.
IEEE/ACM Trans Comput Biol Bioinform ; 15(5): 1571-1578, 2018.
Article in English | MEDLINE | ID: mdl-28541905

ABSTRACT

BACKGROUND: Microbial communities from environmental samples show great diversity as bacteria quickly respond to changes in their ecosystems. To assess the scenario of the actual changes, metagenomics experiments aimed at sequencing genomic DNA from such samples are performed. These newly obtained sequences, together with already known ones, are used to infer phylogenetic trees for assessing the taxonomic groups to which the species with these genes belong. Here, we propose the first approach to the gene-species assignment problem by using reconciliation with horizontal gene transfer. RESULTS: We propose efficient algorithms that search for optimal gene-species mappings taking into account gene duplication, loss and transfer events under two tractable models of HGT reconciliation. CONCLUSIONS: We calculate both the optimal cost and all possible optimal scenarios. Furthermore, as the number of optimal reconstructions can be large, we use a Monte-Carlo method for the inference of approximate distributions of gene-species assignments. We demonstrate the applicability on empirical and simulated datasets.


Subjects
Horizontal Gene Transfer/genetics, Metagenomics/methods, Genetic Models, Phylogeny, Bacterial Genes/genetics, Methanobrevibacter/genetics
14.
IEEE/ACM Trans Comput Biol Bioinform ; 15(5): 1515-1524, 2018.
Article in English | MEDLINE | ID: mdl-28541223

ABSTRACT

An important issue in evolutionary molecular biology is to discover genomic duplication episodes and their correspondence to the species tree. Existing approaches vary in two fundamental aspects: the choice of evolutionary scenarios that model allowed locations of duplications in the species tree, and the rules of clustering gene duplications from gene trees into a single multiple duplication event. Here we study the method of clustering called minimum episodes for several models of allowed evolutionary scenarios with a focus on interval models in which every gene duplication has an interval consisting of allowed locations in the species tree. We present mathematical foundations for general genomic duplication problems. Next, we propose the first linear time and space algorithm for minimum episodes clustering jointly for any interval model and an algorithm for the most general model in which every evolutionary scenario is allowed. We also present a comparative study of different models of genomic duplication based on simulated and empirical datasets. We provide algorithms and tools that can be applied to efficiently solve minimum episodes clustering problems. Our comparative study helps to identify which model is the most reasonable choice in inferring genomic duplication events.


Subjects
Algorithms, Computational Biology/methods, Gene Duplication/genetics, Genome/genetics, Genetic Models, Cluster Analysis, Phylogeny
15.
IEEE/ACM Trans Comput Biol Bioinform ; 14(5): 1002-1012, 2017.
Article in English | MEDLINE | ID: mdl-26887001

ABSTRACT

The minimizing-deep-coalescence (MDC) approach infers a median (species) tree for a given set of gene trees under the deep coalescence cost. This cost accounts for the minimum number of deep coalescences needed to reconcile a gene tree with a species tree where the leaf-genes are mapped to the leaf-species through a function called leaf labeling. In order to better understand the MDC approach, we investigate here the diameter of a gene tree, which is an important property of the deep coalescence cost. This diameter is the maximal deep coalescence cost for a given gene tree under all leaf labelings for each possible species tree topology. While we prove that this diameter is generally infinite, this result relies on the diameter's unrealistic assumption that species trees can be of infinite size. Providing a more practical definition, we introduce a natural extension of the gene tree diameter that constrains the species tree size by a given constant. For this new diameter, we describe an exact formula, present a complete classification of the trees yielding this diameter, derive formulas for its mean and variance, and demonstrate its ability using comparative studies.


Subjects
Computational Biology/methods, Molecular Evolution, Genes/genetics, Genetic Speciation, Genetic Models, Algorithms, Phylogeny
16.
BMC Genomics ; 17 Suppl 1: 15, 2016 Jan 11.
Article in English | MEDLINE | ID: mdl-26818591

ABSTRACT

BACKGROUND: Discovering the location of gene duplications and multiple gene duplication episodes is a fundamental issue in evolutionary molecular biology. The problem introduced by Guigó et al. in 1996 is to map gene duplication events from a collection of rooted, binary gene family trees onto their corresponding rooted binary species tree in such a way that the total number of multiple gene duplication episodes is minimized. There are several models in the literature that specify how gene duplications from gene families can be interpreted as one duplication episode. However, in all duplication episode problems gene trees are rooted. This restriction limits the applicability, since unrooted gene family trees are frequently inferred by phylogenetic methods. RESULTS: In this article we show the first solution to the open problem of episode clustering where the input gene family trees are unrooted. In particular, by using theoretical properties of unrooted reconciliation, we show an efficient algorithm that reduces this problem to the episode clustering problems defined for rooted trees. We show theoretical properties of the reduction algorithm and an evaluation on empirical datasets. CONCLUSIONS: We provide algorithms and tools that were successfully applied to several empirical datasets. In particular, our comparative study shows that we can improve known results on genomic duplication inference from real datasets.


Subjects
Gene Duplication/genetics, Genetic Models, Algorithms, Cluster Analysis, Genetic Databases
17.
Article in English | MEDLINE | ID: mdl-26357086

ABSTRACT

The deep coalescence cost accounts for discord caused by deep coalescence between a gene tree and a species tree. It is a major concern that the diameter of a gene tree (the tree's maximum deep coalescence cost across all species trees) depends on its topology, which can largely obfuscate phylogenetic studies. While this bias can be compensated by normalizing the deep coalescence cost using diameters, obtaining them efficiently has been posed as an open problem by Than and Rosenberg. Here, we resolve this problem by describing a linear time algorithm to compute the diameter of a gene tree. In addition, we provide a complete classification of the species trees yielding this diameter to guide phylogenetic analyses.


Subjects
Computational Biology/methods, Genetic Models, Phylogeny, Molecular Evolution
18.
BMC Bioinformatics ; 15 Suppl 13: S3, 2014.
Article in English | MEDLINE | ID: mdl-25434729

ABSTRACT

BACKGROUND: Evolutionary studies are complicated by discordance between gene trees and the species tree in which they evolved. Dealing with discordant trees often relies on comparison costs between gene and species trees, including the well-established Robinson-Foulds, gene duplication, and deep coalescence costs. While these costs have provided credible results for binary rooted gene trees, corresponding cost definitions for non-binary unrooted gene trees, which are frequently occurring in practice, are challenged by biological realism. RESULT: We propose a natural extension of the well-established costs for comparing unrooted and non-binary gene trees with rooted binary species trees using a binary refinement model. For the duplication cost we describe an efficient algorithm that is based on a linear time reduction and also computes an optimal rooted binary refinement of the given gene tree. Finally, we show that similar reductions lead to solutions for computing the deep coalescence and the Robinson-Foulds costs. CONCLUSION: Our binary refinement of Robinson-Foulds, gene duplication, and deep coalescence costs for unrooted and non-binary gene trees together with the linear time reductions provided here for computing these costs significantly extends the range of trees that can be incorporated into approaches dealing with discordance.


Subjects
Algorithms, Biological Evolution, Gene Duplication, Genetic Models, Phylogeny
19.
J Comput Biol ; 21(1): 89-98, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24073895

ABSTRACT

DrML is a software program for inferring evolutionary scenarios from a gene tree and a species tree with speciation time estimates that is based on a general maximum likelihood model. The program implements novel algorithms that efficiently infer most likely scenarios of gene duplication and loss events. Our comparative studies suggest that the general maximum likelihood model provides more credible estimates than standard parsimony reconciliation, especially when speciation times differ significantly. DrML is an open source project written in Python and is publicly available along with an online manual and sample data sets.


Subjects
Gene Duplication, Genetic Models, Software, Animals, Computational Biology, Molecular Evolution, Humans, Likelihood Functions, Statistical Models, Phylogeny
20.
Article in English | MEDLINE | ID: mdl-26355521

ABSTRACT

The minimizing deep coalescence (MDC) problem seeks a species tree that reconciles the given gene trees with the minimum number of deep coalescence events, called the deep coalescence (DC) cost. To better assess MDC species trees, we investigate a basic mathematical property of the DC cost, called the diameter. Given a gene tree, a species tree, and a leaf labeling function that assigns leaf-genes of the gene tree to a leaf-species in the species tree from which they were sampled, the DC cost describes the discordance between the trees caused by deep coalescence events. The diameter of a gene tree and a species tree is the maximum DC cost across all leaf labelings for these trees. We prove fundamental mathematical properties describing precisely these diameters for bijective and general leaf labelings, and present efficient algorithms to compute the diameters and their corresponding leaf labelings. In particular, we describe an optimal, i.e., linear time, algorithm for the bijective case. Finally, in an experimental study we demonstrate that the average diameters between a gene tree and a species tree grow significantly slower than their naive upper bounds, suggesting that our exact bounds can significantly improve on assessing DC costs when using diameters.


Subjects
Computational Biology/methods, Molecular Evolution, Genes/genetics, Genetic Models, DNA Sequence Analysis/methods