Results 1 - 20 of 43
1.
Methods Mol Biol ; 1704: 317-329, 2018.
Article in English | MEDLINE | ID: mdl-29277871

ABSTRACT

Current methods for synteny analysis provide only limited support for studying large genomes at the sequence level. In this chapter, we describe a pipeline based on existing tools that, applied in a suitable fashion, enables synteny analysis of large genomic datasets. We give a hands-on description of each step of the pipeline, using four avian genomes as example data. We also provide integration scripts that simplify the conversion and setup of data between the different tools in the pipeline.


Subjects
Birds/genetics, Genome, Software, Synteny, Algorithms, Animals, Birds/classification, Computational Biology, Genetic Markers, Genomics/methods, DNA Sequence Analysis
2.
J Comput Biol ; 24(6): 616-634, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28590847

ABSTRACT

Many important questions in molecular biology, evolution, and biomedicine can be addressed by comparative genomic approaches. One of the basic tasks when comparing genomes is the definition of measures of similarity (or dissimilarity) between two genomes, for example, to elucidate the phylogenetic relationships between species. The power of different genome comparison methods varies with the underlying formal model of a genome. The simplest models impose the strong restriction that each genome under study must contain the same genes, each in exactly one copy. More realistic models allow several copies of a gene in a genome. One speaks of gene families, and comparative genomic methods that allow this kind of input are called gene family-based. The most powerful, but also most complex, models avoid this preprocessing of the input data and instead integrate the family assignment within the comparative analysis. Such methods are called gene family-free. In this article, we study an intermediate approach between family-based and family-free genomic similarity measures. Introducing this simpler model, called gene connections, we focus on the combinatorial aspects of gene family-free genome comparison. While in most cases the computational cost is the same as in the general family-free case, we also find an instance where the gene connections model has lower complexity. Within the gene connections model, we define three variants of genomic similarity measures that differ in expressive power. We give polynomial-time algorithms for two of them, while we show NP-hardness for the third, most powerful one. We also generalize the measures and algorithms to make them more robust against recent local disruptions in gene order. Our theoretical findings are supported by experimental results, demonstrating the applicability and performance of our newly defined similarity measures.


Subjects
Algorithms, Computational Biology/methods, Gene Order, Plant Genes, Plant Genome, Genomics/methods, Genetic Models, Multigene Family, Phylogeny
3.
IEEE Trans Nanobioscience ; 16(2): 131-139, 2017 03.
Article in English | MEDLINE | ID: mdl-28113347

ABSTRACT

Modeling the evolution of biological networks is a major challenge. Biological networks are usually represented as graphs; evolutionary events not only include addition and removal of vertices and edges but also duplication of vertices and their associated edges. Since duplication is viewed as a primary driver of genomic evolution, recent work has focused on duplication-based models. Missing from these models is any embodiment of modularity, a widely accepted attribute of biological networks. Some models spontaneously generate modular structures, but none is known to maintain and evolve them. We describe network evolution with modularity (NEMo), a new model that embodies modularity. NEMo allows modules to appear and disappear and to fission and to merge, all driven by the underlying edge-level events using a duplication-based process. We also introduce measures to compare biological networks in terms of their modular structure; we present comparisons between NEMo and existing duplication-based models and run our measuring tools on both generated and published networks.


Subjects
Computational Biology/methods, Genetic Models, Protein Interaction Mapping/methods, Proteins/metabolism, Molecular Evolution, Humans, Proteins/genetics
4.
J Comput Biol ; 24(6): 571-580, 2017 Jun.
Article in English | MEDLINE | ID: mdl-27788022

ABSTRACT

A fundamental problem in comparative genomics is to compute the distance between two genomes in terms of their higher-level organization (given by genes or syntenic blocks). For two genomes without duplicate genes, we can easily define (and almost always efficiently compute) a variety of distance measures, but the problem is NP-hard under most models when genomes contain duplicate genes. To tackle duplicate genes, three formulations (exemplar, maximum matching, and any matching) have been proposed, all of which aim to build a matching between homologous genes so as to minimize some distance measure. Of the many distance measures, the breakpoint distance (the number of nonconserved adjacencies) was the first one to be studied and remains of significant interest because of its simplicity and model-free property. The three breakpoint distance problems corresponding to the three formulations have been widely studied. Although we provided last year a solution for the exemplar problem that runs very fast on full genomes, computing optimal solutions for the other two problems has remained challenging. In this article, we describe very fast, exact algorithms for these two problems. Our algorithms rely on a compact integer-linear program that we further simplify by developing an algorithm to remove variables, based on new results on the structure of adjacencies and matchings. Through extensive experiments using both simulations and biological data sets, we show that our algorithms run very fast (in seconds) on mammalian genomes and scale well beyond. We also apply these algorithms (as well as the classic orthology tool MSOAR) to create orthology assignments, then compare their quality in terms of both accuracy and coverage. We find that our algorithm for the "any matching" formulation significantly outperforms other methods in terms of accuracy while achieving nearly maximum coverage.
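
The breakpoint distance defined in this abstract (the number of nonconserved adjacencies) can be sketched for the easy case of two genomes without duplicate genes. The sketch below is only an illustration, not the ILP-based algorithm of the article: the signed-integer gene encoding, the restriction to a single linear chromosome, the omission of telomeres, and the function names are all assumptions made for this example.

```python
def adjacencies(genome):
    """Return the set of adjacencies of a signed linear gene order.

    A gene is a non-zero integer; its sign is its orientation (assumed encoding).
    The pairs (a, b) and (-b, -a) are the same adjacency read from opposite
    strands, so only one canonical form of each pair is stored.
    """
    adj = set()
    for a, b in zip(genome, genome[1:]):
        adj.add(min((a, b), (-b, -a)))
    return adj


def breakpoint_distance(g1, g2):
    """Count the adjacencies of g1 that are not conserved in g2.

    Assumes both genomes contain the same genes, each exactly once.
    """
    return len(adjacencies(g1) - adjacencies(g2))


if __name__ == "__main__":
    original = [1, 2, 3, 4]
    inverted = [1, -3, -2, 4]  # genes 2 and 3 inverted as a block
    print(breakpoint_distance(original, inverted))
```

In the example, the inversion breaks the two adjacencies at its ends and conserves the one inside the inverted segment. With duplicate genes, a matching between homologous copies has to be chosen first, which is where the hardness discussed in the abstract arises.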


Subjects
Algorithms, Duplicate Genes, Genome, Genomics/methods, Mammals/genetics, Genetic Models, Animals, Biological Evolution
5.
J Comput Biol ; 23(5): 337-46, 2016 05.
Article in English | MEDLINE | ID: mdl-26953781

ABSTRACT

A fundamental problem in comparative genomics is to compute the distance between two genomes. For two genomes without duplicate genes, we can easily compute a variety of distance measures in linear time, but the problem is NP-hard under most models when genomes contain duplicate genes. Sankoff proposed the use of exemplars to tackle the problem of duplicate genes and gene families: each gene family is represented by a single gene (the exemplar for that family), chosen so as to optimize some metric. Unfortunately, choosing exemplars is itself an NP-hard problem. In this article, we propose a very fast and exact algorithm to compute the exemplar breakpoint distance, based on new insights in the underlying structure of genome rearrangements and exemplars. We evaluate the performance of our algorithm on simulation data and compare its performance to the best effort to date (a divide-and-conquer approach), showing that our algorithm runs much faster and scales much better. We also devise a new algorithm for the intermediate breakpoint distance problem, which can then be applied to assign orthologs. We compare our algorithm with the state-of-the-art method MSOAR by assigning orthologs among five well annotated mammalian genomes, showing that our algorithm runs much faster and is slightly more accurate than MSOAR.


Subjects
Genomics/methods, Mammals/genetics, Algorithms, Animals, Genome, Genetic Models, Multigene Family, Software
6.
Bioinformatics ; 31(12): i329-38, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26072500

ABSTRACT

MOTIVATION: Large-scale evolutionary events such as genomic rearrangements and segmental duplications form an important part of the evolution of genomes and are widely studied from both biological and computational perspectives. A basic computational problem is to infer these events in the evolutionary history for given modern genomes, a task for which many algorithms have been proposed under various constraints. Algorithms that can handle both rearrangements and content-modifying events such as duplications and losses remain few and limited in their applicability. RESULTS: We study the comparison of two genomes under a model including general rearrangements (through double-cut-and-join) and segmental duplications. We formulate the comparison as an optimization problem and describe an exact algorithm to solve it by using an integer linear program. We also devise a sufficient condition and an efficient algorithm to identify optimal substructures, which can simplify the problem while preserving optimality. Using the optimal substructures with the integer linear program (ILP) formulation yields a practical and exact algorithm to solve the problem. We then apply our algorithm to assign in-paralogs and orthologs (a necessary step in handling duplications) and compare its performance with that of the state-of-the-art method MSOAR, using both simulations and real data. On simulated datasets, our method outperforms MSOAR by a significant margin, and on five well-annotated species, MSOAR achieves high accuracy, yet our method performs slightly better on each of the 10 pairwise comparisons. AVAILABILITY AND IMPLEMENTATION: http://lcbb.epfl.ch/softwares/coser.


Subjects
Algorithms, Molecular Evolution, Genomics/methods, Genomic Segmental Duplications, Animals, Chromosomes, Humans, Mice, Linear Programming, Rats
7.
J Comput Biol ; 22(5): 425-35, 2015 May.
Article in English | MEDLINE | ID: mdl-25517208

ABSTRACT

Computing the edit distance between two genomes is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be computed in linear time for genomes without duplicate genes, while the problem becomes NP-hard in the presence of duplicate genes. In this article, we propose an integer linear programming (ILP) formulation to compute the DCJ distance between two genomes with duplicate genes. We also provide an efficient preprocessing approach to simplify the ILP formulation while preserving optimality. Comparison on simulated genomes demonstrates that our method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments. We also apply our method to assign orthologous gene pairs among human, mouse, and rat genomes, where once again our method outperforms MSOAR.
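
The linear-time case mentioned in this abstract (genomes without duplicate genes) can be illustrated with the standard adjacency-graph formula: for two circular genomes with the same n genes, the DCJ distance is n - c, where c is the number of cycles in the graph that joins gene extremities by the adjacencies of both genomes. The sketch below is an assumption-laden illustration, not the ILP formulation of the article: it handles only one circular chromosome per genome, encodes genes as signed integers, and uses hypothetical function names.

```python
def adjacency_map(genome):
    """Map each gene extremity to its neighbour in a circular signed genome.

    Extremities are (gene, "t") for the tail and (gene, "h") for the head.
    A positive gene is read tail-to-head, a negative gene head-to-tail.
    """
    def ends(g):
        # (left, right) extremities in reading order along the chromosome
        if g > 0:
            return (g, "t"), (g, "h")
        return (-g, "h"), (-g, "t")

    partner = {}
    for cur, nxt in zip(genome, genome[1:] + genome[:1]):
        right, left = ends(cur)[1], ends(nxt)[0]
        partner[right] = left
        partner[left] = right
    return partner


def dcj_distance(a, b):
    """DCJ distance n - c for circular genomes a and b with identical gene sets."""
    pa, pb = adjacency_map(a), adjacency_map(b)
    seen, cycles = set(), 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        while x not in seen:
            # Follow one adjacency of a, then one adjacency of b.
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    return len(a) - cycles


if __name__ == "__main__":
    print(dcj_distance([1, 2, 3], [1, -2, 3]))        # a single inversion
    print(dcj_distance([1, 2, 3, 4], [1, 3, 2, 4]))  # two adjacent genes swapped
```

Each extremity is visited once, so the cycle count takes time linear in the number of genes. Once genes may occur in several copies, the adjacency graph is no longer uniquely defined, which is the source of the NP-hardness stated in the abstract.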


Subjects
Algorithms, Duplicate Genes, Genome, Genomics/statistics & numerical data, Linear Programming, Animals, Biological Evolution, Genomics/methods, Humans, Mice, Genetic Models, Rats
8.
BMC Bioinformatics ; 15: 269, 2014 Aug 08.
Article in English | MEDLINE | ID: mdl-25104072

ABSTRACT

BACKGROUND: In cell differentiation, a cell of a less specialized type becomes one of a more specialized type, even though all cells have the same genome. Transcription factors and epigenetic marks like histone modifications can play a significant role in the differentiation process. RESULTS: In this paper, we present a simple analysis of cell types and differentiation paths using phylogenetic inference based on ChIP-Seq histone modification data. We precisely define the notion of cell-type trees and provide a procedure for building such trees. We propose new data representation techniques and distance measures for ChIP-Seq data and use these together with standard phylogenetic inference methods to build biologically meaningful cell-type trees that indicate how diverse types of cells are related. We demonstrate our approach on various kinds of histone modifications for various cell types, also using the datasets to explore various issues surrounding replicate data, variability between cells of the same type, and robustness. We use the results to derive biological findings, such as important patterns of histone modification changes during the cell differentiation process. CONCLUSIONS: We introduced and studied the novel problem of inferring cell-type trees from histone modification data. The promising results we obtain point the way to a new approach to the study of cell differentiation. We also discuss how cell-type trees can be used to study the evolution of cell types.


Subjects
Cell Differentiation/genetics, Epigenomics/methods, Histones/metabolism, Phylogeny, Chromatin Immunoprecipitation, Histones/genetics, Humans, DNA Sequence Analysis, Transcription Factors/genetics, Transcription Factors/metabolism
9.
Bioinformatics ; 30(12): i9-18, 2014 Jun 15.
Article in English | MEDLINE | ID: mdl-24932010

ABSTRACT

MOTIVATION: Comparative genomics aims to understand the structure and function of genomes by translating knowledge gained about some genomes to the object of study. Early approaches used pairwise comparisons, but today researchers are attempting to leverage the larger potential of multi-way comparisons. Comparative genomics relies on the structuring of genomes into syntenic blocks: blocks of sequence that exhibit conserved features across the genomes. Syntenic blocks are required for complex computations to scale to the billions of nucleotides present in many genomes; they enable comparisons across broad ranges of genomes because they filter out much of the individual variability; they highlight candidate regions for in-depth studies; and they facilitate whole-genome comparisons through visualization tools. However, the concept of syntenic block remains loosely defined. Tools for the identification of syntenic blocks yield quite different results, thereby preventing a systematic assessment of the next steps in an analysis. Current tools do not include measurable quality objectives and thus cannot be benchmarked against themselves. Comparisons among tools have also been neglected; the few results that are given use superficial measures unrelated to quality or consistency. RESULTS: We present a theoretical model as well as an experimental basis for comparing syntenic blocks and thus also for improving or designing tools for the identification of syntenic blocks. We illustrate the application of the model and the measures by applying them to syntenic blocks produced by three different contemporary tools (DRIMM-Synteny, i-ADHoRe and Cyntenator) on a dataset of eight yeast genomes. Our findings highlight the need for a well-founded, systematic approach to the decomposition of genomes into syntenic blocks. Our experiments demonstrate widely divergent results among these tools, throwing into question the robustness of the basic approach in comparative genomics. We have taken the first step towards a formal approach to the construction of syntenic blocks by developing a simple quality criterion based on sound evolutionary principles.


Subjects
Synteny, Fungal Genome, Genomics/methods, Sequence Alignment, Software, Yeasts/genetics
10.
Bioinformatics ; 30(17): 2406-13, 2014 Sep 01.
Article in English | MEDLINE | ID: mdl-24812341

ABSTRACT

MOTIVATION: We have witnessed an enormous increase in ChIP-Seq data for histone modifications in the past few years. Discovering significant patterns in these data is an important problem for understanding biological mechanisms. RESULTS: We propose probabilistic partitioning methods to discover significant patterns in ChIP-Seq data. Our methods take into account signal magnitude, shape, strand orientation and shifts. We compare our methods with some current methods and demonstrate significant improvements, especially with sparse data. Besides pattern discovery and classification, probabilistic partitioning can serve other purposes in ChIP-Seq data analysis. Specifically, we exemplify its merits in the context of peak finding and partitioning of nucleosome positioning patterns in human promoters. AVAILABILITY AND IMPLEMENTATION: The software and code are available in the supplementary material. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Chromatin Immunoprecipitation/methods, High-Throughput Nucleotide Sequencing/methods, Histones/metabolism, DNA Sequence Analysis/methods, Algorithms, Humans, Nucleosomes/metabolism, Probability, Genetic Promoter Regions, Software
11.
Pac Symp Biocomput ; : 285-96, 2013.
Article in English | MEDLINE | ID: mdl-23424133

ABSTRACT

The rapid accumulation of whole-genome data has renewed interest in the study of the evolution of genomic architecture, under such events as rearrangements, duplications, and losses. Comparative genomics, evolutionary biology, and cancer research all require tools to elucidate the mechanisms, history, and consequences of those evolutionary events, while phylogenetics could use whole-genome data to enhance its picture of the Tree of Life. Current approaches in the area of phylogenetic analysis are limited to very small collections of closely related genomes using low-resolution data (typically a few hundred syntenic blocks); moreover, these approaches typically do not include duplication and loss events. We describe a maximum likelihood (ML) approach for phylogenetic analysis that takes into account genome rearrangements as well as duplications, insertions, and losses. Our approach can handle high-resolution genomes (with 40,000 or more markers) and can use in the same analysis genomes with very different numbers of markers. Because our approach uses a standard ML reconstruction program (RAxML), it scales up to large trees. We present the results of extensive testing on both simulated and real data showing that our approach returns very accurate results very quickly. In particular, we analyze a dataset of 68 high-resolution eukaryotic genomes, with 3,000 to 42,000 genes each, from the eGOB database; the analysis, including bootstrapping, takes just 3 hours on a desktop system and returns a tree in agreement with all well-supported branches, while also suggesting resolutions for some disputed placements.


Subjects
Eukaryota/classification, Eukaryota/genetics, Genomics/statistics & numerical data, Phylogeny, Algorithms, Animals, Computational Biology, Computer Simulation, Genetic Databases/statistics & numerical data, Gene Rearrangement, Humans, Likelihood Functions, Genetic Models
12.
Article in English | MEDLINE | ID: mdl-24407299

ABSTRACT

Alternative splicing is now recognized as a major mechanism for transcriptome and proteome diversity in higher eukaryotes, yet its evolution is poorly understood. Most studies focus on the evolution of exons and introns at the gene level, while only few consider the evolution of transcripts. In this paper, we present a framework for transcript phylogenies where ancestral transcripts evolve along the gene tree by gains, losses, and mutation. We demonstrate the usefulness of our method on a set of 805 genes and two different topics. First, we improve a method for transcriptome reconstruction from ESTs (ASPic), then we study the evolution of function in transcripts. The use of transcript phylogenies allows us to double the precision of ASPic, whereas results on the functional study reveal that conserved transcripts are more likely to share protein domains than functional sites. These studies validate our framework for the study of evolution in large collections of organisms from the perspective of transcripts; for this purpose, we developed and provide a new tool, TrEvoR.


Subjects
Alternative Splicing, Biological Evolution, Computational Biology/methods, Algorithms, Animals, Molecular Evolution, Expressed Sequence Tags, Humans, Mutation, Phylogeny, Software, Species Specificity, Genetic Transcription, Transcriptome
13.
Bioinformatics ; 28(24): 3324-5, 2012 Dec 15.
Article in English | MEDLINE | ID: mdl-23060619

ABSTRACT

TIBA is a tool to reconstruct phylogenetic trees from rearrangement data that consist of ordered lists of synteny blocks (or genes), where each synteny block is shared with all of its homologues in the input genomes. The evolution of these synteny blocks, through rearrangement operations, is modelled by the uniform Double-Cut-and-Join model. Using a true distance estimate under this model and simple distance-based methods, TIBA reconstructs a phylogeny of the input genomes. Unlike any previous tool for inferring phylogenies from rearrangement data, TIBA uses novel methods of robustness estimation to provide support values for the edges in the inferred tree.


Subjects
Phylogeny, Software, Molecular Evolution, Genome, Synteny
14.
PLoS One ; 7(8): e39573, 2012.
Article in English | MEDLINE | ID: mdl-22870189

ABSTRACT

The advent of high-throughput technologies such as ChIP-seq has made possible the study of histone modifications. A problem of particular interest is the identification of regions of the genome where different cell types from the same organism exhibit different patterns of histone enrichment. This problem turns out to be surprisingly difficult, even in simple pairwise comparisons, because of the significant level of noise in ChIP-seq data. In this paper we propose a two-stage statistical method, called ChIPnorm, to normalize ChIP-seq data, and to find differential regions in the genome, given two libraries of histone modifications of different cell types. We show that the ChIPnorm method removes most of the noise and bias in the data and outperforms other normalization methods. We correlate the histone marks with gene expression data and confirm that histone modifications H3K27me3 and H3K4me3 act as respectively a repressor and an activator of genes. Compared to what was previously reported in the literature, we find that a substantially higher fraction of bivalent marks in ES cells for H3K27me3 and H3K4me3 move into a K27-only state. We find that most of the promoter regions in protein-coding genes have differential histone-modification sites. The software for this work can be downloaded from http://lcbb.epfl.ch/software.html.


Subjects
Histones, Theoretical Models, Peptide Library, Post-Translational Protein Processing/physiology, Software, Animals, Cultured Cells, Histones/chemistry, Histones/genetics, Histones/metabolism, Mice
15.
BMC Bioinformatics ; 13 Suppl 9: S1, 2012 Jun 11.
Article in English | MEDLINE | ID: mdl-22831154

ABSTRACT

Alternative splicing, an unknown mechanism 20 years ago, is now recognized as a major mechanism for proteome and transcriptome diversity, particularly in mammals; some researchers conjecture that up to 90% of human genes are alternatively spliced. Despite much research on exon and intron evolution, little is known about the evolution of transcripts. In this paper, we present a model of transcript evolution and an associated algorithm to reconstruct transcript phylogenies. The evolution of the gene structure (exons and introns) is used as the basis for the reconstruction of transcript phylogenies. We apply our model and reconstruction algorithm on two well-studied genes, MAG and PAX6, obtaining results consistent with current knowledge and thereby providing evidence that a phylogenetic analysis of transcripts is feasible and likely to be informative.


Subjects
Algorithms, Alternative Splicing, Genetic Models, Phylogeny, Animals, Molecular Evolution, Exons, Eye Proteins/genetics, Homeodomain Proteins/genetics, Humans, Introns, Myelin-Associated Glycoprotein/genetics, PAX6 Transcription Factor, Paired Box Transcription Factors/genetics, Repressor Proteins/genetics
16.
Article in English | MEDLINE | ID: mdl-22547434

ABSTRACT

The experimental determination of transcriptional regulatory networks in the laboratory remains difficult and time-consuming, while computational methods to infer these networks provide only modest accuracy. The latter can be attributed partly to the limitations of a single-organism approach. Computational biology has long used comparative and evolutionary approaches to extend the reach and accuracy of its analyses. In this paper, we describe ProPhyC, a probabilistic phylogenetic model and associated inference algorithms, designed to improve the inference of regulatory networks for a family of organisms by using known evolutionary relationships among these organisms. ProPhyC can be used with various network evolutionary models and any existing inference method. Extensive experimental results on both biological and synthetic data confirm that our model (through its associated refinement algorithms) yields substantial improvement in the quality of inferred networks over all current methods. We also compare ProPhyC with a transfer learning approach that we designed. This approach also uses phylogenetic relationships while inferring regulatory networks for a family of organisms. Using similar input information but designed in a very different framework, this transfer learning approach does not perform better than ProPhyC, which indicates that ProPhyC makes good use of the evolutionary information.


Subjects
Computational Biology/methods, Gene Regulatory Networks, Genetic Models, Phylogeny, Algorithms, Animals, Bayes Theorem, Binding Sites, Computer Simulation, Drosophila, Molecular Evolution, Gene Deletion, Gene Duplication, Gene Expression Profiling, Gene Expression Regulation, ROC Curve, Transcription Factors
17.
Article in English | MEDLINE | ID: mdl-22184263

ABSTRACT

Comparing two or more phylogenetic trees is a fundamental task in computational biology. The simplest outcome of such a comparison is a pairwise measure of similarity, dissimilarity, or distance. A large number of such measures have been proposed, but so far all suffer from problems varying from computational cost to lack of robustness; many can be shown to behave unexpectedly under certain plausible inputs. For instance, the widely used Robinson-Foulds distance is poorly distributed and thus affords little discrimination, while also lacking robustness in the face of very small changes: reattaching a single leaf elsewhere in a tree of any size can instantly maximize the distance. In this paper, we introduce a new pairwise distance measure, based on matching, for phylogenetic trees. We prove that our measure induces a metric on the space of trees, show how to compute it in low polynomial time, verify through statistical testing that it is robust, and finally note that it does not exhibit unexpected behavior under the same inputs that cause problems with other measures. We also illustrate its usefulness in clustering trees, demonstrating significant improvements in the quality of hierarchical clustering as compared to the same collections of trees clustered using the Robinson-Foulds distance.
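
The robustness problem of the Robinson-Foulds distance described in this abstract can be reproduced with a few lines of code. The sketch below is an illustration only and does not implement the matching-based measure of the article: it uses the rooted (clade-based) variant of Robinson-Foulds, and the nested-tuple tree encoding and function names are assumptions made for this example.

```python
def clusters(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf labels.

    A tree is a nested tuple; anything that is not a tuple is a leaf label
    (assumed encoding). The clade of the root is excluded because every tree
    on the same leaves shares it.
    """
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    root = leaves(tree)
    found.discard(root)
    return found


def rf_distance(t1, t2):
    """Rooted Robinson-Foulds distance: clades present in exactly one tree."""
    return len(clusters(t1) ^ clusters(t2))


if __name__ == "__main__":
    caterpillar = (((("A", "B"), "C"), "D"), "E")
    # Same tree after detaching leaf A and reattaching it at the root.
    moved_leaf = (((("B", "C"), "D"), "E"), "A")
    print(rf_distance(caterpillar, moved_leaf))
```

In this example each tree has three non-trivial clades and none is shared, so moving the single leaf A already yields the largest value this variant can take for these two trees.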


Subjects
Algorithms, Computational Biology/methods, Phylogeny, Cluster Analysis
18.
BMC Genomics ; 12 Suppl 2: S3, 2011.
Article in English | MEDLINE | ID: mdl-21989112

ABSTRACT

BACKGROUND: Reassortments are events in the evolution of the genome of influenza (flu), whereby segments of the genome are exchanged between different strains. As reassortments have been implicated in major human pandemics of the last century, their identification has become a health priority. While such identification can be done "by hand" on a small dataset, researchers and health authorities are building up enormous databases of genomic sequences for every flu strain, so that it is imperative to develop automated identification methods. However, current methods are limited to pairwise segment comparisons. RESULTS: We present FluReF, a fully automated flu virus reassortment finder. FluReF is inspired by the visual approach to reassortment identification and uses the reconstructed phylogenetic trees of the individual segments and of the full genome. We also present a simple flu evolution simulator, based on the current, source-sink, hypothesis for flu cycles. On synthetic datasets produced by our simulator, FluReF, tuned for a 0% false positive rate, yielded false negative rates of less than 10%. FluReF corroborated two new reassortments identified by visual analysis of 75 Human H3N2 New York flu strains from 2005-2008 and gave partial verification of reassortments found using another bioinformatics method. METHODS: FluReF finds reassortments by a bottom-up search of the full-genome and segment-based phylogenetic trees for candidate clades: groups of one or more sampled viruses that are separated from the other variants from the same season. Candidate clades in each tree are tested to guarantee confidence values, using the lengths of key edges as well as other tree parameters; clades with reassortments must have validated incongruencies among segment trees. CONCLUSIONS: FluReF demonstrates robustness of prediction for geographically and temporally expanded datasets, and is not limited to finding reassortments with previously collected sequences. The complete source code is available from http://lcbb.epfl.ch/software.html.


Subjects
Algorithms, Viral Genome, Influenza A Virus H3N2 Subtype/classification, Phylogeny, Reassortant Viruses/classification, Software, Molecular Evolution, Influenza A Virus H3N2 Subtype/genetics, Statistical Models, Point Mutation, Reassortant Viruses/genetics, Sequence Alignment
19.
J Comput Biol ; 18(9): 1055-64, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21899415

ABSTRACT

Genomic rearrangements have been studied since the beginnings of modern genetics and models for such rearrangements have been the subject of many papers over the last 10 years. However, none of the extant models can predict the evolution of genomic organization into circular unichromosomal genomes (as in most prokaryotes) and linear multichromosomal genomes (as in most eukaryotes). Very few of these models support gene duplications and losses, yet these events may be more common in evolutionary history than rearrangements and themselves cause apparent rearrangements. We propose a new evolutionary model that integrates gene duplications and losses with genome rearrangements and that leads to genomes with either one (or a very few) circular chromosome or a collection of linear chromosomes. Our model is based on existing rearrangement models and inherits their linear-time algorithms for pairwise distance computation (for rearrangement only). Moreover, our model predictions fit observations about the evolution of gene family sizes and agree with the existing predictions about the growth in the number of chromosomes in eukaryotic genomes.


Subjects
Bacteria/genetics, Eukaryota/genetics, Molecular Evolution, Gene Deletion, Gene Duplication, Gene Rearrangement, Algorithms, Chromosomes/genetics, Computer Simulation, Gene Order, Genome, Genomic Structural Variation, Genetic Models, Phylogeny
20.
Article in English | MEDLINE | ID: mdl-21301032

ABSTRACT

Many of the steps in phylogenetic reconstruction can be confounded by "rogue" taxa: taxa that cannot be placed with assurance anywhere within the tree and, indeed, whose location within the tree varies with almost any choice of algorithm or parameters. Phylogenetic consensus methods, in particular, are known to suffer from this problem. In this paper, we provide a novel framework to define and identify rogue taxa. In this framework, we formulate a bicriterion optimization problem, the relative information criterion, that models the net increase in useful information present in the consensus tree when certain taxa are removed from the input data. We also provide an effective greedy heuristic to identify a subset of rogue taxa and use this heuristic in a series of experiments, with both pathological examples from the literature and a collection of large biological data sets. As the presence of rogue taxa in a set of bootstrap replicates can lead to deceivingly poor support values, we propose a procedure to recompute support values in light of the rogue taxa identified by our algorithm; applying this procedure to our biological data sets caused a large number of edges to move from "unsupported" to "supported" status, indicating that many existing phylogenies should be recomputed and reevaluated to reduce any inaccuracies introduced by rogue taxa. We also discuss the implementation issues encountered while integrating our algorithm into RAxML v7.2.7, particularly those dealing with scaling up the analyses. This integration enables practitioners to benefit from our algorithm in the analysis of very large data sets (up to 2,500 taxa and 10,000 trees, although we present the results of even larger analyses).


Subjects
Algorithms, Computational Biology/methods, Genetic Models, Phylogeny, Cluster Analysis, Consensus Sequence, Genetic Databases