Results 1 - 20 of 26
1.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36637196

ABSTRACT

MOTIVATION: The phylogenetic signal of structural variation informs a more comprehensive understanding of evolution. As (near-)complete genome assembly becomes more commonplace, the next methodological challenge for inferring genome rearrangement trees is the identification of syntenic blocks of orthologous sequences. In this article, we studied 94 reference quality genomes of primarily Mycobacterium tuberculosis (Mtb) isolates as a benchmark to evaluate these methods. The clonal nature of Mtb evolution, the manageable genome sizes, along with substantial levels of structural variation make this an ideal benchmarking dataset. RESULTS: We tested several methods for detecting homology and obtaining syntenic blocks and two methods for inferring phylogenies from them, then compared the resulting trees to the standard method's tree, inferred from nucleotide substitutions. We found that, not only the choice of methods, but also their parameters can impact results, and that the tree inference method had less impact than the block determination method. Interestingly, a rearrangement tree based on blocks from the Cactus whole-genome aligner was fully compatible with the highly supported branches of the substitution-based tree, enabling the combination of the two into a high-resolution supertree. Overall, our results indicate that accurate trees can be inferred using genome rearrangements, but the choice of the methods for inferring homology requires care. AVAILABILITY AND IMPLEMENTATION: Analysis scripts and code written for this study are available at https://gitlab.com/LPCDRP/rearrangement-homology.pub and https://gitlab.com/LPCDRP/syntement. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Mycobacterium tuberculosis , Phylogeny , Mycobacterium tuberculosis/genetics , Genome , Synteny
2.
Article in English | MEDLINE | ID: mdl-31217125

ABSTRACT

The maximum agreement subtree method determines the consensus of a collection of phylogenetic trees by identifying maximum cardinality subsets of leaves for which all input trees agree. The trees induced by these maximum cardinality subsets are maximum agreement subtrees (MASTs). A single MAST may be misleading, since there can exist two MASTs which share almost no leaves; nevertheless, it may be impossible to inspect all MASTs, since the number of MASTs can be exponential in the number of leaves. To overcome this drawback, Swenson et al. suggested to further summarize the information common to all MASTs by their intersection, which is called the kernel agreement subtree (KAST). The construction of the KAST is the focus of this paper. Swenson et al. had an O(kn^3 + n^4 + n^{d+1}) time algorithm for computing the KAST of k trees on n leaves, in which at least one tree has maximum degree d. In this paper, an O(kn^3 + n^d)-time algorithm is presented. We demonstrate the efficiency of our algorithm on simulated trees as well as on ribosomal RNA alignments, where trees with 13,000 taxa took only hours to process, whereas the previous algorithm did not terminate after a week of computation.


Subjects
Algorithms , Computational Biology/methods , Phylogeny , Sequence Alignment/methods , Consensus Sequence , Genes, rRNA/genetics , Sequence Analysis, RNA/methods
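
The abstract of entry 2 describes the KAST as the intersection of all MASTs. The following is a minimal brute-force sketch of that definition on the level of leaf sets, not the algorithm of the paper: it tries every subset of leaves, so it is only usable on toy inputs. The representation (rooted trees as nested Python tuples with string leaves) and the function names are assumptions made for illustration.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of leaf labels of a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Restrict a tree to the leaves in `keep`, suppressing unary nodes.

    The result is a canonical, order-independent form (nested frozensets),
    or None when no leaf of the subtree is kept.
    """
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return frozenset(kids)


def mast_leaf_sets(trees):
    """Leaf sets of all maximum agreement subtrees (exhaustive search)."""
    common = sorted(set.intersection(*(leaves(t) for t in trees)))
    for size in range(len(common), 0, -1):
        hits = [set(subset) for subset in combinations(common, size)
                if len({restrict(t, set(subset)) for t in trees}) == 1]
        if hits:
            return hits
    return []


def kast_leaves(trees):
    """Leaves shared by every MAST, i.e. the leaf set of the kernel."""
    masts = mast_leaf_sets(trees)
    return set.intersection(*masts) if masts else set()
```

For the two trees `((("a","b"),"c"),"d")` and `((("a","c"),"b"),"d")` there are three MASTs on three leaves each, and the only leaf they all share is `d`, which illustrates how a single MAST can hide disagreement that the kernel exposes.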
3.
PLoS One ; 15(2): e0228676, 2020.
Article in English | MEDLINE | ID: mdl-32040487

ABSTRACT

Production of the Panton-Valentine leukocidin (PVL) by Staphylococcus aureus is mediated via the genes lukS-PV and lukF-PV which are carried on bacteriophage ϕSa2. PVL is associated with S. aureus strains that cause serious infections and clones of community-associated methicillin-resistant S. aureus (CA-MRSA) that have additionally disseminated widely. In Western Australia (WA) the original CA-MRSA were PVL negative; however, between 2005 and 2008, following the introduction of eight international PVL-positive CA-MRSA, PVL-positive WA CA-MRSA were found. There was concern that PVL bacteriophages from the international clones were transferring into the local clones; therefore, a comparative study of PVL-carrying ϕSa2 prophage genomes from historic WA PVL-positive S. aureus and representatives of all PVL-positive CA-MRSA isolated in WA between 2005 and 2008 was performed. The prophages were classified into two genera and three PVL bacteriophage groups and had undergone many recombination events during their evolution. Comparative analysis of mosaic regions of selected bacteriophages using the Alignments of bacteriophage genomes (Alpha) aligner revealed novel recombinations and modules. There was heterogeneity in the chromosomal integration sites, the lysogeny regulation regions, the defence and DNA processing modules, the structural and packaging modules and the lukSF-PV genes. One WA CA-MRSA (WA518751) and one international clone (Korean Clone) have probably acquired PVL-carrying ϕSa2 in WA; however, these clones did not disseminate in the community. Genetic heterogeneity made it impossible to trace the source of the PVL prophages in the other WA clones. Against this background of PVL prophage diversity, the sequence of one group, the ϕSa2USA/ϕSa2wa-st93 group, was remarkably stable over at least 20 years and associated with the highly virulent USA300 and ST93-IVa CA-MRSA lineages that have disseminated globally.


Subjects
Bacterial Toxins/genetics , Bacteriophages/genetics , Exotoxins/genetics , Leukocidins/genetics , Methicillin-Resistant Staphylococcus aureus/virology , Cell Lineage , DNA, Bacterial/genetics , Genotype , Geography , Lysogeny , Methicillin-Resistant Staphylococcus aureus/genetics , Molecular Epidemiology , Multilocus Sequence Typing , Open Reading Frames , Prophages/genetics , Virulence Factors/genetics , Western Australia
4.
Bioinformatics ; 35(14): i117-i126, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510664

ABSTRACT

MOTIVATION: Genome rearrangements drastically change gene order along great stretches of a chromosome. There has been initial evidence that these apparently non-local events in the 1D sense may have breakpoints that are close in the 3D sense. We harness the power of the Double Cut and Join model of genome rearrangement, along with Hi-C chromosome conformation capture data to test this hypothesis between human and mouse. RESULTS: We devise novel statistical tests that show that indeed, rearrangement scenarios that transform the human into the mouse gene order are enriched for pairs of breakpoints that have frequent chromosome interactions. This is observed for both intra-chromosomal breakpoint pairs, as well as for inter-chromosomal pairs. For intra-chromosomal rearrangements, the enrichment exists from close (<20 Mb) to very distant (100 Mb) pairs. Further, the pattern exists across multiple cell lines in Hi-C data produced by different laboratories and at different stages of the cell cycle. We show that similarities in the contact frequencies between these many experiments contribute to the enrichment. We conclude that either (i) rearrangements usually involve breakpoints that are spatially close or (ii) there is selection against rearrangements that act on spatially distant breakpoints. AVAILABILITY AND IMPLEMENTATION: Our pipeline is freely available at https://bitbucket.org/thekswenson/locality. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Chromatin , Genome , Software , Animals , Cell Cycle , Chromosome Breakpoints , Chromosomes , Humans , Mammals , Mice
5.
Algorithms Mol Biol ; 14: 15, 2019.
Article in English | MEDLINE | ID: mdl-31360217

ABSTRACT

This paper generalizes previous studies on genome rearrangement under biological constraints, using double cut and join (DCJ). We propose a model for weighted DCJ, along with a family of optimization problems called φ -MCPS (Minimum Cost Parsimonious Scenario), that are based on labeled graphs. We show how to compute solutions to general instances of φ -MCPS, given an algorithm to compute φ -MCPS on a circular genome with exactly one occurrence of each gene. These general instances can have an arbitrary number of circular and linear chromosomes, and arbitrary gene content. The practicality of the framework is displayed by presenting polynomial-time algorithms that generalize the results of Bulteau, Fertin, and Tannier on the Sorting by wDCJs and indels in intergenes problem, and that generalize previous results on the Minimum Local Parsimonious Scenario problem.
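
Entry 5 builds on the base case of a circular genome with exactly one occurrence of each gene. As background, the sketch below computes the plain (unweighted) DCJ distance for that base case using the well-known relation distance = number of genes minus number of cycles in the adjacency graph. It does not implement the weighted φ-MCPS problems of the paper. Genomes are assumed to be lists of signed integers read around one circular chromosome; the function names are illustrative.

```python
def adjacencies(genome):
    """Map each gene extremity to its neighbour on a circular signed genome.

    An extremity is a pair (gene, "t") for the tail or (gene, "h") for the head.
    """
    def ends(g):
        # (left end, right end) of gene g as read along the chromosome
        if g > 0:
            return (abs(g), "t"), (abs(g), "h")
        return (abs(g), "h"), (abs(g), "t")

    partner = {}
    for i, g in enumerate(genome):
        nxt = genome[(i + 1) % len(genome)]
        right, left = ends(g)[1], ends(nxt)[0]
        partner[right] = left
        partner[left] = right
    return partner


def dcj_distance_circular(a, b):
    """Unweighted DCJ distance between two circular unichromosomal genomes."""
    if sorted(map(abs, a)) != sorted(map(abs, b)) or len(set(map(abs, a))) != len(a):
        raise ValueError("genomes must contain the same genes, each exactly once")
    pa, pb = adjacencies(a), adjacencies(b)
    seen, cycles = set(), 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Alternate between an adjacency of genome a and one of genome b.
        while x not in seen:
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    return len(a) - cycles
```

A single inversion such as `[1, 2, 3]` versus `[1, -2, 3]` gives distance 1, while swapping two adjacent genes, `[1, 2, 3, 4]` versus `[1, 3, 2, 4]`, needs two DCJ operations.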

6.
Algorithms Mol Biol ; 13: 9, 2018.
Article in English | MEDLINE | ID: mdl-29755580

ABSTRACT

BACKGROUND: The double cut and join (DCJ) model of genome rearrangement is well studied due to its mathematical simplicity and power to account for the many events that transform gene order. These studies have mostly been devoted to the understanding of minimum length scenarios transforming one genome into another. In this paper we search instead for rearrangement scenarios that minimize the number of rearrangements whose breakpoints are unlikely due to some biological criteria. One such criterion has recently become accessible due to the advent of the Hi-C experiment, facilitating the study of 3D spatial distance between breakpoint regions. RESULTS: We establish a link between the minimum number of unlikely rearrangements required by a scenario and the problem of finding a maximum edge-disjoint cycle packing on a certain transformed version of the adjacency graph. This link leads to a 3/2-approximation as well as an exact integer linear programming formulation for our problem, which we prove to be NP-complete. We also present experimental results on fruit flies, showing that Hi-C data is informative when used as a criterion for rearrangements. CONCLUSIONS: A new variant of the weighted DCJ distance problem is addressed that ignores scenario length in its objective function. A solution to this problem provides a lower bound on the number of unlikely moves necessary when transforming one gene order into another. This lower bound aids in the study of rearrangement scenarios with respect to chromatin structure, and could eventually be used in the design of a fixed parameter algorithm with a more general objective function.

7.
Algorithms Mol Biol ; 11: 13, 2016.
Article in English | MEDLINE | ID: mdl-27190550

ABSTRACT

BACKGROUND: Traditionally, the merit of a rearrangement scenario between two gene orders has been measured based on a parsimony criterion alone; two scenarios with the same number of rearrangements are considered equally good. In this paper, we acknowledge that each rearrangement has a certain likelihood of occurring based on biological constraints, e.g. physical proximity of the DNA segments implicated or repetitive sequences. RESULTS: We propose optimization problems with the objective of maximizing overall likelihood, by weighting the rearrangements. We study a binary weight function suitable to the representation of sets of genome positions that are most likely to have swapped adjacencies. We give a polynomial-time algorithm for the problem of finding a minimum weight double cut and join scenario among all minimum length scenarios. In the process we solve an optimization problem on colored noncrossing partitions, which is a generalization of the Maximum Independent Set problem on circle graphs. CONCLUSIONS: We introduce a model for weighting genome rearrangements and show that under simple yet reasonable conditions, a fundamental distance can be computed in polynomial time. This is achieved by solving a generalization of the Maximum Independent Set problem on circle graphs. Several variants of the problem are also mentioned.

8.
BMC Bioinformatics ; 17: 30, 2016 Jan 13.
Article in English | MEDLINE | ID: mdl-26757899

ABSTRACT

BACKGROUND: In recent years, many studies focused on the description and comparison of large sets of related bacteriophage genomes. Due to the peculiar mosaic structure of these genomes, few informative approaches for comparing whole genomes exist: dot plot diagrams give a mostly qualitative assessment of the similarity/dissimilarity between two or more genomes, and clustering techniques are used to classify genomes. Multiple alignments are conspicuously absent from this scene. Indeed, whole genome aligners interpret lack of similarity between sequences as an indication of rearrangements, insertions, or losses. This behavior makes them ill-prepared to align bacteriophage genomes, where even closely related strains can accomplish the same biological function with highly dissimilar sequences. RESULTS: In this paper, we propose a multiple alignment strategy that exploits functional collinearity shared by related strains of bacteriophages, and uses partial orders to capture mosaicism of sets of genomes. As classical alignments do, the computed alignments can be used to predict that genes have the same biological function, even in the absence of detectable similarity. The Alpha aligner implements these ideas in visual interactive displays, and is used to compute several examples of alignments of Staphylococcus aureus and Mycobacterium bacteriophages, involving up to 29 genomes. Using these datasets, we prove that Alpha alignments are at least as good as those computed by standard aligners. Comparison with the progressive Mauve aligner, which implements a partial order strategy but whose alignments are linearized, shows a greatly improved interactive graphic display, while avoiding misalignments. CONCLUSIONS: Multiple alignments of whole bacteriophage genomes work, and will become an important conceptual and visual tool in comparative genomics of sets of related strains.
A Python implementation of Alpha, along with installation instructions for Ubuntu and OS X, is available on Bitbucket (https://bitbucket.org/thekswenson/alpha).


Subjects
Bacteriophages/genetics , Genome, Viral , Mycobacterium/virology , Sequence Alignment/methods , Staphylococcus aureus/virology , Algorithms , Computational Biology/methods , Genomics/methods
9.
BMC Genomics ; 17(Suppl 10): 786, 2016 11 11.
Article in English | MEDLINE | ID: mdl-28185551

ABSTRACT

BACKGROUND: Transcriptome reconstruction, defined as the identification of all protein isoforms that may be expressed by a gene, is a notably difficult computational task. With real data, the best methods based on RNA-seq data identify barely 21 % of the expressed transcripts. While waiting for algorithms and sequencing techniques to improve - as has been strongly suggested in the literature - it is important to evaluate assisted transcriptome prediction; this is the question of how alternative transcription in one species performs as a predictor of protein isoforms in another relatively close species. Most evidence-based gene predictors use transcripts from other species to annotate a genome, but the predictive power of procedures that use exclusively transcripts from external species has never been quantified. The cornerstone of such an evaluation is the correct identification of pairs of transcripts with the same splicing patterns, called splicing orthologs. RESULTS: We propose a rigorous procedural definition of splicing orthologs, based on the identification of all ortholog pairs of splicing sites in the nucleotide sequences, and alignments at the protein level. Using our definition, we compared 24 382 human transcripts and 17 909 mouse transcripts from the highly curated CCDS database, and identified 11 122 splicing orthologs. In prediction mode, we show that human transcripts can be used to infer over 62 % of mouse protein isoforms. When restricting the predictions to transcripts known eight years ago, the percentage grows to 74 %. Using CCDS timestamped releases, we also analyze the evolution of the number of splicing orthologs over the last decade. CONCLUSIONS: Alternative splicing is now recognized to play a major role in the protein diversity of eukaryotic organisms, but definitions of spliced isoform orthologs are still approximate. 
Here we propose a definition adapted to the subtle variations of conserved alternative splicing sites, and use it to validate numerous accurate orthologous isoform predictions.


Subjects
Algorithms , Proteins/genetics , Transcriptome , Alternative Splicing , Animals , Computational Biology , Exons , Humans , Mice , Protein Isoforms/chemistry , Protein Isoforms/genetics , Protein Isoforms/metabolism , Proteins/chemistry , Proteins/metabolism , RNA/chemistry , RNA/genetics , RNA/metabolism
10.
BMC Bioinformatics ; 14 Suppl 15: S5, 2013.
Article in English | MEDLINE | ID: mdl-24564227

ABSTRACT

BACKGROUND: Reconciled gene trees yield orthology and paralogy relationships between genes. This information may however contradict other information on orthology and paralogy provided by other footprints of evolution, such as conserved synteny. RESULTS: We explore a way to include external information on orthology in the process of gene tree construction. Given an initial gene tree and a set of orthology constraints on pairs of genes or on clades, we give polynomial-time algorithms for producing a modified gene tree satisfying the set of constraints, that is as close as possible to the original one according to the Robinson-Foulds distance. We assess the validity of the modifications we propose by computing the likelihood ratio between initial and modified trees according to sequence alignments on Ensembl trees, showing that often the two trees are statistically equivalent. AVAILABILITY: Software and data available upon request to the corresponding author.


Subjects
Sequence Alignment , Algorithms , Animals , Evolution, Molecular , Humans , Phylogeny , Software , Synteny
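
Entry 10 measures how close a modified gene tree stays to the original using the Robinson-Foulds distance. A minimal sketch of that distance for rooted trees on the same leaf set is given below: it counts the clades present in one tree but not in the other. The nested-tuple tree representation and the function names are assumptions for illustration, and the constraint-satisfying tree modification of the paper is not implemented here.

```python
def clades(tree):
    """Return the set of clades (leaf sets below each internal node)."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    walk(tree)
    return found


def robinson_foulds(t1, t2):
    """Rooted Robinson-Foulds distance: size of the symmetric clade difference."""
    return len(clades(t1) ^ clades(t2))
```

For example, `(("a","b"),("c","d"))` and `(("a","c"),("b","d"))` share only the root clade, so their distance is 4.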
11.
BMC Bioinformatics ; 14 Suppl 15: S17, 2013.
Article in English | MEDLINE | ID: mdl-24564731

ABSTRACT

BACKGROUND: Viruses that infect bacteria, called phages, are well-known for their extreme mosaicism, in which an individual genome shares many different parts with many others. The mechanisms for creating these mosaics are largely unknown but are believed to be recombinations, either illegitimate, or partly homologous. In order to reconstruct the history of these recombinations, we need to identify the positions where recombinations may have occurred, and develop algorithms to generate and explore the possible reconstructions. RESULTS: We first show that, provided that their gene order is co-linear, genomes of phages can be aligned, even if large parts of their sequences lack any detectable similarity and are annotated hypothetical proteins. We give such an alignment for 31 Staphylococcus aureus phage genomes, and algorithms that can be used in any similar context. These alignments provide the datasets needed for a combinatorial study of recombinations. We next reconstruct the most likely recombination history of the set of 31 phages, under the hypothesis that recombinations are partly homologous. This history relies on the computational identification of missing phages. CONCLUSIONS: This first combinatorial study of modular recombinations acts as a proof of concept. We show that alignments of whole genomes are feasible for large sets of phages, and that this representation yields data that can be used to reconstruct parts of the evolutionary history of these organisms.


Subjects
Bacteriophages/genetics , Genome, Viral , Recombination, Genetic , Staphylococcus aureus/virology , Algorithms , Sequence Analysis, DNA , Staphylococcus aureus/genetics
12.
Algorithms Mol Biol ; 7(1): 31, 2012 Nov 20.
Article in English | MEDLINE | ID: mdl-23167951

ABSTRACT

BACKGROUND: Reconciliation is the commonly used method for inferring the evolutionary scenario for a gene family. It consists in "embedding" inferred gene trees into a known species tree, revealing the evolution of the gene family by duplications and losses. When a species tree is not known, a natural algorithmic problem is to infer a species tree from a set of gene trees, such that the corresponding reconciliation minimizes the number of duplications and/or losses. The main drawback of reconciliation is that the inferred evolutionary scenario is strongly dependent on the considered gene trees, as few misplaced leaves may lead to a completely different history, with significantly more duplications and losses. RESULTS: In this paper, we take advantage of certain gene trees' properties in order to preprocess them for reconciliation or species tree inference. We flag certain duplication vertices of a gene tree, the "non-apparent duplication" (NAD) vertices, as resulting from the misplacement of leaves. In the case of species tree inference, we develop a polynomial-time heuristic for removing the minimum number of species leading to a set of gene trees that exhibit no NAD vertices with respect to at least one species tree. In the case of reconciliation, we consider the optimization problem of removing the minimum number of leaves or species leading to a tree without any NAD vertex. We develop a polynomial-time algorithm that is exact for two special classes of gene trees, and show a good performance on simulated data sets in the general case.
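
The reconciliation described in entry 12 can be illustrated with the standard LCA mapping: each gene-tree vertex is mapped to the lowest species-tree vertex containing all of its species, and a vertex is a duplication when it maps to the same place as one of its children. The sketch below also labels duplications as apparent (AD) or non-apparent (NAD), assuming the usual reading in which a duplication is apparent when its two child subtrees share a species; that reading, the nested-tuple representation (gene-tree leaves labelled directly by species name, binary gene trees), and the function names are assumptions, and the leaf-removal algorithms of the paper are not implemented.

```python
def species_clades(tree):
    """All clades of the species tree, including single-species leaves."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            clade = frozenset([node])
        else:
            clade = frozenset().union(*(walk(child) for child in node))
        found.add(clade)
        return clade

    walk(tree)
    return found


def classify_vertices(gene_tree, species_tree):
    """Label each internal gene-tree vertex as speciation, AD, or NAD.

    Returns a post-order list of (vertex, label) pairs.
    """
    s_clades = species_clades(species_tree)

    def lca(species):
        # Smallest species-tree clade containing the given species set.
        return min((c for c in s_clades if species <= c), key=len)

    report = []

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        child_sets = [walk(child) for child in node]
        here = frozenset().union(*child_sets)
        mapped = lca(here)
        if any(lca(cs) == mapped for cs in child_sets):
            # Duplication by LCA mapping; apparent if the children share a species.
            label = "AD" if child_sets[0] & child_sets[1] else "NAD"
        else:
            label = "speciation"
        report.append((node, label))
        return here

    walk(gene_tree)
    return report
```

With species tree `(("A","B"),"C")`, the gene tree `(("A","C"),"B")` gets a NAD root, the kind of duplication that a single misplaced leaf can produce, whereas `(("A","B"),("A","C"))` gets an AD root because species `A` appears on both sides.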

13.
Article in English | MEDLINE | ID: mdl-22231622

ABSTRACT

A Maximum Agreement SubTree (MAST) is a largest subtree common to a set of trees and serves as a summary of common substructure in the trees. A single MAST can be misleading, however, since there can be an exponential number of MASTs, and two MASTs for the same tree set do not even necessarily share any leaves. In this paper, we introduce the notion of the Kernel Agreement SubTree (KAST), which is the summary of the common substructure in all MASTs, and show that it can be calculated in polynomial time (for trees with bounded degree). Suppose the input trees represent competing hypotheses for a particular phylogeny. We explore the utility of the KAST as a method to discern the common structure of confidence, and as a measure of how confident we are in a given tree set. We also show the trend of the KAST, as compared to other consensus methods, on the set of all trees visited during a Bayesian analysis of flatworm genomes.


Subjects
Algorithms , Computational Biology/methods , Phylogeny , Animals , Bayes Theorem , Genome, Bacterial/genetics , Genome, Helminth , Models, Genetic , Platyhelminths/genetics , Proteobacteria
14.
BMC Bioinformatics ; 13 Suppl 19: S15, 2012.
Article in English | MEDLINE | ID: mdl-23281654

ABSTRACT

BACKGROUND: Reconciliation is the classical method for inferring a duplication and loss history from a set of extant genes. It is based upon the notion of embedding the gene tree into the species tree, the incongruence between the two indicating evidence for duplication and loss. However, results obtained by this method are highly dependent upon the considered species and gene trees. Thus, painstaking attention has been given to the development of methods for reconstructing accurate gene trees. RESULTS: This paper highlights the fact that errors in gene trees are not the only reasons for the inference of an erroneous duplication-loss history. More precisely, we prove that, under certain reasonable hypotheses based on the widely accepted link between function and sequence constraints, even a well-supported gene tree can yield a reconciliation that does not correspond to the true history. We then provide the theoretical underpinnings for a conservative approach to infer histories given such gene trees. We apply our method to the mammalian interleukin-1 (IL) gene tree, which has been used as a model example to illustrate the role of reconciliation.


Subjects
Evolution, Molecular , Gene Duplication , Genes , Phylogeny , Algorithms , Animals , Genome , Interleukin-1/genetics , Sequence Analysis, DNA
15.
BMC Bioinformatics ; 13 Suppl 19: S16, 2012.
Article in English | MEDLINE | ID: mdl-23281701

ABSTRACT

Understanding the history of a gene family that evolves through duplication, speciation, and loss is a fundamental problem in comparative genomics. Features such as function, position, and structural similarity between genes are intimately connected to this history; relationships between genes such as orthology (genes related through a speciation event) or paralogy (genes related through a duplication event) are usually correlated with these features. For example, recent work has shown that in human and mouse there is a strong connection between function and inparalogs, the paralogs that were created since the speciation event separating the human and mouse lineages. Methods exist for detecting inparalogs that either use information from only two species, or consider a set of species but rely on clustering methods. In this paper we present a graph-theoretic approach for finding lower bounds on the number of inparalogs for a given set of species; we pose an edge covering problem on the similarity graph and give an efficient 2/3-approximation as well as a faster heuristic. Since the physical position of inparalogs corresponding to recent speciations is not likely to have changed since the duplication, we also use our predictions to estimate the types of duplications that have occurred in some vertebrates and drosophila.


Subjects
Evolution, Molecular , Genomics/methods , Multigene Family , Sequence Analysis, DNA/methods , Animals , Gene Duplication , Humans , Mice , Phylogeny , Rats
16.
J Comput Biol ; 18(9): 1041-53, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21899414

ABSTRACT

We consider the following problem: given a set of gene family trees, spanning a given set of species, find a first speciation which splits these species into two subsets and minimizes the number of gene duplications that happened before this speciation. We call this problem the Minimum Duplication Bipartition Problem. Using a generalization of the Minimum Edge-Cut Problem, we propose a polynomial time 2-approximation algorithm for the Minimum Duplication Bipartition Problem. We apply this algorithm to the inference of species trees on synthetic datasets and on two datasets of eukaryotic species.


Subjects
Genetic Speciation , Models, Genetic , Phylogeny , Sequence Analysis, DNA/methods , Algorithms , Computer Simulation , Eukaryota/genetics , Evolution, Molecular , Genomics
17.
J Comput Biol ; 18(9): 1201-10, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21899425

ABSTRACT

In comparative genomics studies, finding a minimum-length sequence of reversals, so-called sorting by reversals, has been the topic of a huge literature. Since there are many minimum-length sequences, another important topic has been the problem of listing all parsimonious sequences between two genomes, called the All Sorting Sequences by Reversals (ASSR) problem. In this article, we revisit the ASSR problem for uni-chromosomal genomes when no duplications are allowed and when the relative order of the genes is known. We put the current body of work in perspective by illustrating the fundamental framework that is common for all of them, a perspective that allows us for the first time to theoretically compare their running times. The article also proposes an improved framework that empirically speeds up all known algorithms.


Subjects
Algorithms , Computer Simulation , Models, Genetic , Sequence Analysis, DNA/methods
18.
J Comput Biol ; 18(9): 1219-30, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21899427

ABSTRACT

Perfection has been used as a criterion to classify rearrangement scenarios since 2004. However, there is a fundamental bias towards extant species in the original definition: ancestral species are not bound to perfection. Here we develop a new theory of perfection that takes an egalitarian view of species, and we examine the fitness of this theory on several datasets. Supplementary Material is available at www.liebertonline.com/cmb.


Subjects
Computer Simulation , Genome , Models, Genetic , Sequence Analysis, DNA/methods , Algorithms , Animals , Chromosomes/genetics , Drosophila/genetics , Humans , Mice , Models, Statistical , Phylogeny , Rats , Sequence Alignment , Sequence Homology, Nucleic Acid
19.
Algorithms Mol Biol ; 6: 11, 2011 Apr 19.
Article in English | MEDLINE | ID: mdl-21504604

ABSTRACT

We describe an average-case O(n^2) algorithm to list all reversals on a signed permutation π that, when applied to π, produce a permutation that is closer to the identity. This algorithm is optimal in the sense that the time it takes to write the list is Ω(n^2) in the worst case.
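
To make the task in entry 19 concrete, here is a brute-force reference sketch that lists the reversals bringing a signed permutation closer to the identity. It is not the average-case quadratic algorithm of the paper: it computes exact reversal distances by breadth-first search over all signed permutations, which is exponential and only practical for very small n. Reversals are given as 0-based inclusive index pairs; the function names are illustrative assumptions.

```python
from collections import deque


def reverse(perm, i, j):
    """Apply a signed reversal to perm[i..j] (inclusive): flip order and signs."""
    return perm[:i] + tuple(-x for x in reversed(perm[i:j + 1])) + perm[j + 1:]


def all_reversals(n):
    """All index pairs (i, j) with 0 <= i <= j < n."""
    return [(i, j) for i in range(n) for j in range(i, n)]


def distances_from_identity(n):
    """Reversal distance of every signed permutation of size n, by BFS."""
    start = tuple(range(1, n + 1))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        perm = queue.popleft()
        for i, j in all_reversals(n):
            nxt = reverse(perm, i, j)
            if nxt not in dist:
                dist[nxt] = dist[perm] + 1
                queue.append(nxt)
    return dist


def sorting_reversals(perm):
    """Reversals that yield a permutation strictly closer to the identity."""
    perm = tuple(perm)
    dist = distances_from_identity(len(perm))
    return [(i, j) for i, j in all_reversals(len(perm))
            if dist[reverse(perm, i, j)] < dist[perm]]
```

Because a reversal is its own inverse, distances measured from the identity equal distances to it. For `(1, -2, 3)` the only sorting reversal is `(1, 1)`, which flips the sign of the middle element.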

20.
Article in English | MEDLINE | ID: mdl-21301032

ABSTRACT

Many of the steps in phylogenetic reconstruction can be confounded by "rogue" taxa, that is, taxa that cannot be placed with assurance anywhere within the tree, indeed, whose location within the tree varies with almost any choice of algorithm or parameters. Phylogenetic consensus methods, in particular, are known to suffer from this problem. In this paper, we provide a novel framework to define and identify rogue taxa. In this framework, we formulate a bicriterion optimization problem, the relative information criterion, that models the net increase in useful information present in the consensus tree when certain taxa are removed from the input data. We also provide an effective greedy heuristic to identify a subset of rogue taxa and use this heuristic in a series of experiments, with both pathological examples from the literature and a collection of large biological data sets. As the presence of rogue taxa in a set of bootstrap replicates can lead to deceivingly poor support values, we propose a procedure to recompute support values in light of the rogue taxa identified by our algorithm; applying this procedure to our biological data sets caused a large number of edges to move from "unsupported" to "supported" status, indicating that many existing phylogenies should be recomputed and reevaluated to reduce any inaccuracies introduced by rogue taxa. We also discuss the implementation issues encountered while integrating our algorithm into RAxML v7.2.7, particularly those dealing with scaling up the analyses. This integration enables practitioners to benefit from our algorithm in the analysis of very large data sets (up to 2,500 taxa and 10,000 trees, although we present the results of even larger analyses).


Subjects
Algorithms , Computational Biology/methods , Models, Genetic , Phylogeny , Cluster Analysis , Consensus Sequence , Databases, Genetic