Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 14 de 14
Filtrar
Mais filtros










Base de dados
Intervalo de ano de publicação
1.
iScience ; 25(6): 104413, 2022 Jun 17.
Artigo em Inglês | MEDLINE | ID: mdl-35663029

RESUMO

One of the most basic kinds of analysis to be performed on a pangenome is the detection of its core, i.e., the information shared among all members. Pangenomic core detection is classically done on the gene level and many tools focus exclusively on core detection in prokaryotes. Here, we present a new method for sequence-based pangenomic core detection. Our model generalizes from a strict core definition allowing us to flexibly determine suitable core properties depending on the research question and the dataset under consideration. We propose an algorithm based on a colored de Bruijn graph that runs in linear time with respect to the number of k-mers in the graph. An implementation of our method is called Corer. Because of the usage of a colored de Bruijn graph, it works alignment-free, is provided with a small memory footprint, and accepts as input assembled genomes as well as sequencing reads.

2.
Bioinformatics ; 37(24): 4868-4870, 2021 12 11.
Artigo em Inglês | MEDLINE | ID: mdl-34132752

RESUMO

SUMMARY: SANS serif is a novel software for alignment-free, whole-genome-based phylogeny estimation that follows a pangenomic approach to efficiently calculate a set of splits in a phylogenetic tree or network. AVAILABILITY AND IMPLEMENTATION: Implemented in C++ and supported on Linux, MacOS and Windows. The source code is freely available for download at https://gitlab.ub.uni-bielefeld.de/gi/sans. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Genoma , Software , Filogenia
3.
Bioinformatics ; 37(16): 2266-2274, 2021 Aug 25.
Artigo em Inglês | MEDLINE | ID: mdl-33532821

RESUMO

MOTIVATION: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. RESULTS: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. AVAILABILITY AND IMPLEMENTATION: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

4.
Algorithms Mol Biol ; 15: 4, 2020.
Artigo em Inglês | MEDLINE | ID: mdl-32280365

RESUMO

BACKGROUND: The increasing amount of available genome sequence data enables large-scale comparative studies. A common task is the inference of phylogenies-a challenging task if close reference sequences are not available, genome sequences are incompletely assembled, or the high number of genomes precludes multiple sequence alignment in reasonable time. RESULTS: We present a new whole-genome based approach to infer phylogenies that is alignment- and reference-free. In contrast to other methods, it does not rely on pairwise comparisons to determine distances to infer edges in a tree. Instead, a colored de Bruijn graph is constructed, and information on common subsequences is extracted to infer phylogenetic splits. CONCLUSIONS: The introduced new methodology for large-scale phylogenomics shows high potential. Application to different datasets confirms robustness of the approach. A comparison to other state-of-the-art whole-genome based methods indicates comparable or higher accuracy and efficiency.

5.
IEEE/ACM Trans Comput Biol Bioinform ; 16(4): 1364-1373, 2019.
Artigo em Inglês | MEDLINE | ID: mdl-28166504

RESUMO

Reconstructing ancestral gene orders in a given phylogeny is a classical problem in comparative genomics. Most existing methods compare conserved features in extant genomes in the phylogeny to define potential ancestral gene adjacencies, and either try to reconstruct all ancestral genomes under a global evolutionary parsimony criterion, or, focusing on a single ancestral genome, use a scaffolding approach to select a subset of ancestral gene adjacencies, generally aiming at reducing the fragmentation of the reconstructed ancestral genome. In this paper, we describe an exact algorithm for the Small Parsimony Problem that combines both approaches. We consider that gene adjacencies at internal nodes of the species phylogeny are weighted, and we introduce an objective function defined as a convex combination of these weights and the evolutionary cost under the Single-Cut-or-Join (SCJ) model. The weights of ancestral gene adjacencies can, e.g., be obtained through the recent availability of ancient DNA sequencing data, which provide a direct hint at the genome structure of the considered ancestor, or through probabilistic analysis of gene adjacencies evolution. We show the NP-hardness of our problem variant and propose a Fixed-Parameter Tractable algorithm based on the Sankoff-Rousseau dynamic programming algorithm that also allows to sample co-optimal solutions. We apply our approach to mammalian and bacterial data providing different degrees of complexity. We show that including adjacency weights in the objective has a significant impact in reducing the fragmentation of the reconstructed ancestral gene orders. An implementation is available at http://github.com/nluhmann/PhySca.


Assuntos
Algoritmos , Biologia Computacional/métodos , Genoma Bacteriano , Genômica/métodos , Animais , Evolução Biológica , Simulação por Computador , Bases de Dados Genéticas , Evolução Molecular , Ordem dos Genes , Marcadores Genéticos/genética , Modelos Genéticos , Gambás/genética , Filogenia , Plasmídeos/metabolismo , Probabilidade , Reprodutibilidade dos Testes , Suínos/genética , Yersinia/genética
6.
J Comput Biol ; 25(7): 825-836, 2018 07.
Artigo em Inglês | MEDLINE | ID: mdl-30011247

RESUMO

The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments.


Assuntos
Biologia Computacional/métodos , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/tendências , Software , Algoritmos , Compressão de Dados , Genoma/genética
7.
IEEE/ACM Trans Comput Biol Bioinform ; 15(6): 2094-2100, 2018.
Artigo em Inglês | MEDLINE | ID: mdl-29993816

RESUMO

Ancestral genome reconstruction is an important task to analyze the evolution of genomes. Recent progress in sequencing ancient DNA led to the publication of so-called paleogenomes and allows the integration of this sequencing data in genome evolution analysis. However, the de novo assembly of ancient genomes is usually fragmented due to DNA degradation over time among others. Integrated phylogenetic assembly addresses the issue of genome fragmentation in the ancient DNA assembly while aiming to improve the reconstruction of all ancient genomes in the phylogeny simultaneously. The fragmented assembly of the ancient genome can be represented as an assembly graph, indicating contradicting ordering information of contigs. In this setting, our approach is to compare the ancient data with extant finished genomes. We generalize a reconstruction approach minimizing the Single-Cut-or-Join rearrangement distance towards multifurcating trees and include edge lengths to improve the reconstruction in practice. This results in a polynomial time algorithm that includes additional ancient DNA data at one node in the tree, resulting in consistent reconstructions of ancestral genomes.


Assuntos
DNA Antigo/análise , DNA , Genômica/métodos , Análise de Sequência de DNA/métodos , Algoritmos , Animais , DNA/análise , DNA/classificação , DNA/genética , Evolução Molecular , História Antiga , História Medieval , Humanos , Modelos Genéticos , Paleontologia , Filogenia , Peste/história , Peste/microbiologia , Ratos , Alinhamento de Sequência/métodos , Yersinia pestis/classificação , Yersinia pestis/genética
8.
Algorithms Mol Biol ; 11: 3, 2016.
Artigo em Inglês | MEDLINE | ID: mdl-27087830

RESUMO

BACKGROUND: High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences "colored" by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. RESULTS: In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52-66 times faster while using about 5.5-14.3 times less memory. CONCLUSION: We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure. AVAILABILITY: https://www.github.com/GuillaumeHolley/BloomFilterTrie.

9.
Bioinformatics ; 31(18): 2947-54, 2015 Sep 15.
Artigo em Inglês | MEDLINE | ID: mdl-25979471

RESUMO

MOTIVATION: The number of reported genetic variants is rapidly growing, empowered by ever faster accumulation of next-generation sequencing data. A major issue is comparability. Standards that address the combined problem of inaccurately predicted breakpoints and repeat-induced ambiguities are missing. This decisively lowers the quality of 'consensus' callsets and hampers the removal of duplicate entries in variant databases, which can have deleterious effects in downstream analyses. RESULTS: We introduce a sound framework for comparison of deletions that captures both tool-induced inaccuracies and repeat-induced ambiguities. We present a maximum matching algorithm that outputs virtual duplicates among two sets of predictions/annotations. We demonstrate that our approach is clearly superior over ad hoc criteria, like overlap, and that it can reduce the redundancy among callsets substantially. We also identify large amounts of duplicate entries in the Database of Genomic Variants, which points out the immediate relevance of our approach. AVAILABILITY AND IMPLEMENTATION: Implementation is open source and available from https://bitbucket.org/readdi/readdi CONTACT: roland.wittler@uni-bielefeld.de or t.marschall@mpi-inf.mpg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Biologia Computacional/métodos , Variação Genética/genética , Sequências Repetitivas de Ácido Nucleico/genética , Análise de Sequência de DNA/normas , Deleção de Sequência/genética , Software , Bases de Dados Factuais , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Humanos , Modelos Teóricos
10.
BMC Genomics ; 14 Suppl 1: S12, 2013.
Artigo em Inglês | MEDLINE | ID: mdl-23369161

RESUMO

BACKGROUND: Structural variations in human genomes, such as deletions, play an important role in cancer development. Next-Generation Sequencing technologies have been central in providing ways to detect such variations. Methods like paired-end mapping allow to simultaneously analyze data from several samples in order to, e.g., distinguish tumor from patient specific variations. However, it has been shown that, especially in this setting, there is a need to explicitly take overlapping deletions into consideration. Existing tools have only minor capabilities to call overlapping deletions, unable to unravel complex signals to obtain consistent predictions. RESULT: We present a first approach specifically designed to cluster short-read paired-end data into possibly overlapping deletion predictions. The method does not make any assumptions on the composition of the data, such as the number of samples, heterogeneity, polyploidy, etc. Taking paired ends mapped to a reference genome as input, it iteratively merges mappings to clusters based on a similarity score that takes both the putative location and size of a deletion into account. CONCLUSION: We demonstrate that agglomerative clustering is suitable to predict deletions. Analyzing real data from three samples of a cancer patient, we found putatively overlapping deletions and observed that, as a side-effect, erroneous mappings are mostly identified as singleton clusters. An evaluation on simulated data shows, compared to other methods which can output overlapping clusters, high accuracy in separating overlapping from single deletions.


Assuntos
Deleção de Genes , Genoma Humano , Algoritmos , Mapeamento Cromossômico , Análise por Conglomerados , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Máquina de Vetores de Suporte
11.
BMC Bioinformatics ; 13 Suppl 19: S11, 2012.
Artigo em Inglês | MEDLINE | ID: mdl-23281593

RESUMO

BACKGROUND: Recovering the structure of ancestral genomes can be formalized in terms of properties of binary matrices such as the Consecutive-Ones Property (C1P). The Linearization Problem asks to extract, from a given binary matrix, a maximum weight subset of rows that satisfies such a property. This problem is in general intractable, and in particular if the ancestral genome is expected to contain only linear chromosomes or a unique circular chromosome. In the present work, we consider a relaxation of this problem, which allows ancestral genomes that can contain several chromosomes, each either linear or circular. RESULT: We show that, when restricted to binary matrices of degree two, which correspond to adjacencies, the genomic characters used in most ancestral genome reconstruction methods, this relaxed version of the Linearization Problem is polynomially solvable using a reduction to a matching problem. This result holds in the more general case where columns have bounded multiplicity, which models possibly duplicated ancestral genes. We also prove that for matrices with rows of degrees 2 and 3, without multiplicity and without weights on the rows, the problem is NP-complete, thus tracing sharp tractability boundaries. CONCLUSION: As it happened for the breakpoint median problem, also used in ancestral genome reconstruction, relaxing the definition of a genome turns an intractable problem into a tractable one. The relaxation is adapted to some biological contexts, such as bacterial genomes with several replicons, possibly partially assembled. Algorithms can also be used as heuristics for hard variants. More generally, this work opens a way to better understand linearization results for ancestral genome structure inference.


Assuntos
Cromossomos/genética , Evolução Molecular , Genoma/genética , Algoritmos , Genômica/métodos
12.
BMC Bioinformatics ; 12 Suppl 9: S21, 2011 Oct 05.
Artigo em Inglês | MEDLINE | ID: mdl-22152084

RESUMO

BACKGROUND: Structural variations in human genomes, such as insertions, deletion, or rearrangements, play an important role in cancer development. Next-Generation Sequencing technologies have been central in providing ways to detect such variations. Most existing methods however are limited to the analysis of a single genome, and it is only recently that the comparison of closely related genomes has been considered. In particular, a few recent works considered the analysis of data sets obtained by sequencing both tumor and healthy tissues of the same cancer patient. In that context, the goal is to detect variations that are specific to exactly one of the genomes, for example to differentiate between patient-specific and tumor-specific variations. This is a difficult task, especially when facing the additional challenge of the possible contamination of healthy tissues by tumor cells and conversely. RESULTS: In the current work, we analyzed a data set of paired-end short-reads, obtained by sequencing tumor tissues and healthy tissues, both from the same cancer patient. Based on a combinatorial notion of conflict between deletions, we show that in the tumor data, more deletions are predicted than there could actually be in a diploid genome. In contrast, the predictions for the data from normal tissues are almost conflict-free. We designed and applied a method, specific to the analysis of such pooled and contaminated data sets, to detect potential tumor-specific deletions. Our method takes the deletion calls from both data sets and assigns reads from the mixed tumor/normal data to the normal one with the goal to minimize the number of reads that need to be discarded to obtain a set of conflict-free deletion clusters. We observed that, on the specific data set we analyze, only a very small fraction of the reads needs to be discarded to obtain a set of consistent deletions. CONCLUSIONS: We present a framework based on a rigorous definition of consistency between deletions and the assumption that the tumor sample also contains normal cells. A combined analysis of both data sets based on this model allowed a consistent explanation of almost all data, providing a detailed picture of candidate patient- and tumor-specific deletions.


Assuntos
Genômica/métodos , Neoplasias/genética , Deleção de Sequência , Variação Genética , Genoma Humano , Humanos
13.
J Comput Biol ; 18(9): 1023-39, 2011 Sep.
Artigo em Inglês | MEDLINE | ID: mdl-21899413

RESUMO

In comparative genomics, differences or similarities of gene orders are determined to predict functional relations of genes or phylogenetic relations of genomes. For this purpose, various combinatorial models can be used to specify gene clusters--groups of genes that are co-located in a set of genomes. Several approaches have been proposed to reconstruct putative ancestral gene clusters based on the gene order of contemporary species. One prevalent and natural reconstruction criterion is consistency: For a set of reconstructed gene clusters, there should exist a gene order that comprises all given clusters. For permutation-based gene cluster models, efficient methods exist to verify this condition. In this article, we discuss the consistency problem for different gene cluster models on sequences with restricted gene multiplicities. Our results range from linear-time algorithms for the simple model of adjacencies to NP-completeness proofs for more complex models like common intervals.


Assuntos
Modelos Genéticos , Família Multigênica , Análise de Sequência de DNA/métodos , Algoritmos , Sequência de Bases , Simulação por Computador , Evolução Molecular
14.
Artigo em Inglês | MEDLINE | ID: mdl-19644167

RESUMO

The order of genes in genomes provides extensive information. In comparative genomics, differences or similarities of gene orders are determined to predict functional relations of genes or phylogenetic relations of genomes. For this purpose, various combinatorial models can be used to identify gene clusters--groups of genes that are colocated in a set of genomes. We introduce a unified approach to model gene clusters and define the problem of labeling the inner nodes of a given phylogenetic tree with sets of gene clusters. Our optimization criterion in this context combines two properties: parsimony, i.e., the number of gains and losses of gene clusters has to be minimal, and consistency, i.e., for each ancestral node, there must exist at least one potential gene order that contains all the reconstructed clusters. We present and evaluate an exact algorithm to solve this problem. Despite its exponential worst-case time complexity, our method is suitable even for large-scale data. We show the effectiveness and efficiency on both simulated and real data.


Assuntos
Genômica/métodos , Modelos Genéticos , Família Multigênica , Algoritmos , Bactérias/genética , Simulação por Computador , Ordem dos Genes , Genoma Bacteriano , Filogenia
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...