Search | VHL Regional Portal

1.

Efficient short read mapping to a pangenome that is represented by a graph of ED strings.

Büchler, Thomas; Olbrich, Jannik; Ohlebusch, Enno.

Bioinformatics ; 39(5)2023 05 04.

Article in English | MEDLINE | ID: mdl-37171844

ABSTRACT

MOTIVATION: A pangenome represents many diverse genome sequences of the same species. In order to cope with small variations as well as structural variations, recent research focused on the development of graph-based models of pangenomes. Mapping is the process of finding the original location of a DNA read in a reference sequence, typically a genome. Using a pangenome instead of a (linear) reference genome can, e.g. reduce mapping bias, the tendency to incorrectly map sequences that differ from the reference genome. Mapping reads to a graph, however, is more complex and needs more resources than mapping to a reference genome. Reducing the complexity of the graph by encoding simple variations like SNPs in a simple way can accelerate read mapping and reduce the memory requirements at the same time. RESULTS: We introduce graphs based on elastic-degenerate strings (ED strings, EDS) and the linearized form of these EDS graphs as a new representation for pangenomes. In this representation, small variations are encoded directly in the sequence. Structural variations are encoded in a graph structure. This reduces the size of the representation in comparison to sequence graphs. In the linearized form, mapping techniques that are known from ordinary strings can be applied with appropriate adjustments. Since most variations are expressed directly in the sequence, the mapping process rarely has to take edges of the EDS graph into account. We developed a prototypical software tool GED-MAP that uses this representation together with a minimizer index to map short reads to the pangenome. Our experiments show that the new method works on a whole human genome scale, taking structural variants properly into account. The advantage of GED-MAP, compared with other pangenomic short read mappers, is that the new representation allows for a simple indexing method. This makes GED-MAP fast and memory efficient. AVAILABILITY AND IMPLEMENTATION: Sources are available at: https://github.com/thomas-buechler-ulm/gedmap.

Subject(s)

Genome, Human , Software , Humans , Sequence Analysis, DNA/methods , Algorithms

2.

An improved encoding of genetic variation in a Burrows-Wheeler transform.

Büchler, Thomas; Ohlebusch, Enno.

Bioinformatics ; 36(5): 1413-1419, 2020 03 01.

Article in English | MEDLINE | ID: mdl-31613311

ABSTRACT

MOTIVATION: In resequencing experiments, a high-throughput sequencer produces DNA-fragments (called reads) and each read is then mapped to the locus in a reference genome at which it fits best. Currently dominant read mappers are based on the Burrows-Wheeler transform (BWT). A read can be mapped correctly if it is similar enough to a substring of the reference genome. However, since the reference genome does not represent all known variations, read mapping tends to be biased towards the reference and mapping errors may thus occur. To cope with this problem, Huang et al. encoded single nucleotide polymorphisms (SNPs) in a BWT by the International Union of Pure and Applied Chemistry (IUPAC) nucleotide code. In a different approach, Maciuca et al. provided a 'natural encoding' of SNPs and other genetic variations in a BWT. However, their encoding resulted in a significantly increased alphabet size (the modified alphabet can have millions of new symbols, which usually implies a loss of efficiency). Moreover, the two approaches do not handle all known kinds of variation. RESULTS: In this article, we propose a method that is able to encode many kinds of genetic variation (SNPs, multi-nucleotide polymorphisms, insertions or deletions, duplications, transpositions, inversions and copy-number variation) in a BWT. It takes the best of both worlds: SNPs are encoded by the IUPAC nucleotide code as in Huang et al. (2013, Short read alignment with populations of genomes. Bioinformatics, 29, i361-i370) and the encoding of the other kinds of genetic variation relies on the idea introduced in Maciuca et al. (2016, A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In: Proceedings of the 16th International Workshop on Algorithms in Bioinformatics, Volume 9838 of Lecture Notes in Computer Science, pp. 222-233. Springer). In contrast to Maciuca et al., however, we use only one additional symbol. This symbol marks variant sites in a chromosome and delimits multiple variants, which are added at the end of the 'marked chromosome'. We show how the backward search algorithm, which is used in BWT-based read mappers, can be modified in such a way that it can cope with the genetic variation encoded in the BWT. We implemented our method and compared it with BWBBLE and gramtools. AVAILABILITY AND IMPLEMENTATION: https://www.uni-ulm.de/in/theo/research/seqana/.

Subject(s)

Genome, Human , Genomics , Algorithms , Computational Biology , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA , Software

3.

Erratum to: A representation of a compressed de Bruijn graph for pan-genome analysis that enables search.

Beller, Timo; Ohlebusch, Enno.

Algorithms Mol Biol ; 11: 28, 2016.

Article in English | MEDLINE | ID: mdl-27933096

ABSTRACT

[This corrects the article DOI: 10.1186/s13015-016-0083-7.].

4.

A representation of a compressed de Bruijn graph for pan-genome analysis that enables search.

Beller, Timo; Ohlebusch, Enno.

Algorithms Mol Biol ; 11: 20, 2016.

Article in English | MEDLINE | ID: mdl-27437028

ABSTRACT

BACKGROUND: Recently, Marcus et al. (Bioinformatics 30:3476-83, 2014) proposed to use a compressed de Bruijn graph to describe the relationship between the genomes of many individuals/strains of the same or closely related species. They devised an [Formula: see text] time algorithm called splitMEM that constructs this graph directly (i.e., without using the uncompressed de Bruijn graph) based on a suffix tree, where n is the total length of the genomes and g is the length of the longest genome. Baier et al. (Bioinformatics 32:497-504, 2016) improved their result. RESULTS: In this paper, we propose a new space-efficient representation of the compressed de Bruijn graph that adds the possibility to search for a pattern (e.g. an allele-a variant form of a gene) within the pan-genome. The ability to search within the pan-genome graph is of utmost importance and is a design goal of pan-genome data structures.

5.

Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform.

Baier, Uwe; Beller, Timo; Ohlebusch, Enno.

Bioinformatics ; 32(4): 497-504, 2016 Feb 15.

Article in English | MEDLINE | ID: mdl-26504144

ABSTRACT

MOTIVATION: Low-cost genome sequencing gives unprecedented complete information about the genetic structure of populations, and a population graph captures the variations between many individuals of a population. Recently, Marcus et al. proposed to use a compressed de Bruijn graph for representing an entire population of genomes. They devised an O(n log g) time algorithm called splitMEM that constructs this graph directly (i.e. without using the uncompressed de Bruijn graph) based on a suffix tree, where n is the total length of the genomes and g is the length of the longest genome. Since the applicability of their algorithm is limited to rather small datasets, there is a strong need for space-efficient construction algorithms. RESULTS: We present two algorithms that outperform splitMEM in theory and in practice. The first implements a novel linear-time suffix tree algorithm by means of a compressed suffix tree. The second algorithm uses the Burrows-Wheeler transform to build the compressed de Bruijn graph in [Formula: see text] time, where σ is the size of the alphabet. To demonstrate the scalability of the algorithms, we applied it to seven human genomes. AVAILABILITY AND IMPLEMENTATION: https://www.uni-ulm.de/in/theo/research/seqana/.

Subject(s)

Algorithms , Computational Biology/methods , Genome, Human , Genomics/methods , Sequence Analysis, DNA/methods , Computer Simulation , Humans , Models, Genetic

6.

A fast algorithm for the multiple genome rearrangement problem with weighted reversals and transpositions.

Bader, Martin; Abouelhoda, Mohamed I; Ohlebusch, Enno.

BMC Bioinformatics ; 9: 516, 2008 Dec 04.

Article in English | MEDLINE | ID: mdl-19055792

ABSTRACT

BACKGROUND: Due to recent progress in genome sequencing, more and more data for phylogenetic reconstruction based on rearrangement distances between genomes become available. However, this phylogenetic reconstruction is a very challenging task. For the most simple distance measures (the breakpoint distance and the reversal distance), the problem is NP-hard even if one considers only three genomes. RESULTS: In this paper, we present a new heuristic algorithm that directly constructs a phylogenetic tree w.r.t. the weighted reversal and transposition distance. Experimental results on previously published datasets show that constructing phylogenetic trees in this way results in better trees than constructing the trees w.r.t. the reversal distance, and recalculating the weight of the trees with the weighted reversal and transposition distance. An implementation of the algorithm can be obtained from the authors. CONCLUSION: The possibility of creating phylogenetic trees directly w.r.t. the weighted reversal and transposition distance results in biologically more realistic scenarios. Our algorithm can solve today's most challenging biological datasets in a reasonable amount of time.

Subject(s)

Algorithms , Computational Biology/methods , Evolution, Molecular , Gene Rearrangement , Genome , Genomics/methods , Mutation , Analysis of Variance , Animals , Campanulaceae/genetics , DNA, Chloroplast/genetics , DNA, Circular/genetics , DNA, Mitochondrial/genetics , Databases, Genetic , Models, Genetic , Phylogeny

7.

CoCoNUT: an efficient system for the comparison and analysis of genomes.

Abouelhoda, Mohamed I; Kurtz, Stefan; Ohlebusch, Enno.

BMC Bioinformatics ; 9: 476, 2008 Nov 12.

Article in English | MEDLINE | ID: mdl-19014477

ABSTRACT

BACKGROUND: Comparative genomics is the analysis and comparison of genomes from different species. This area of research is driven by the large number of sequenced genomes and heavily relies on efficient algorithms and software to perform pairwise and multiple genome comparisons. RESULTS: Most of the software tools available are tailored for one specific task. In contrast, we have developed a novel system CoCoNUT (Computational Comparative geNomics Utility Toolkit) that allows solving several different tasks in a unified framework: (1) finding regions of high similarity among multiple genomic sequences and aligning them, (2) comparing two draft or multi-chromosomal genomes, (3) locating large segmental duplications in large genomic sequences, and (4) mapping cDNA/EST to genomic sequences. CONCLUSION: CoCoNUT is competitive with other software tools w.r.t. the quality of the results. The use of state of the art algorithms and data structures allows CoCoNUT to solve comparative genomics tasks more efficiently than previous tools. With the improved user interface (including an interactive visualization component), CoCoNUT provides a unified, versatile, and easy-to-use software tool for large scale studies in comparative genomics.

Subject(s)

Comparative Genomic Hybridization/methods , Statistics as Topic/methods , Animals , Humans , Software/trends , Statistics as Topic/trends

8.

Space efficient computation of rare maximal exact matches between multiple sequences.

Ohlebusch, Enno; Kurtz, Stefan.

J Comput Biol ; 15(4): 357-77, 2008 May.

Article in English | MEDLINE | ID: mdl-18361760

ABSTRACT

In this article, we propose a new method for computing rare maximal exact matches between multiple sequences. A rare match between k sequences S(1), ... , S(k) is a string that occurs at most t(i)-times in the sequence S(i), where the t(i) > 0 are user-defined thresholds. First, the suffix tree of one of the sequences (the reference sequence) is built, and then the other sequences are matched separately against this suffix tree. Second, the resulting pairwise exact matches are combined to multiple exact matches. A clever implementation of this method yields a very fast and space efficient program. This program can be applied in several comparative genomics tasks, such as the identification of synteny blocks between whole genomes.

Subject(s)

Algorithms , Base Sequence , Computational Biology , Sequence Analysis, DNA/methods , Databases, Nucleic Acid , Genomics , Sequence Alignment

9.

GENESIS: genome evolution scenarios.

Gog, Simon; Bader, Martin; Ohlebusch, Enno.

Bioinformatics ; 24(5): 711-2, 2008 Mar 01.

Article in English | MEDLINE | ID: mdl-18204053

ABSTRACT

SUMMARY: We implemented a software tool called GENESIS for three different genome rearrangement problems: Sorting a unichromosomal genome by weighted reversals and transpositions (SwRT), sorting a multichromosomal genome by reversals, translocations, fusions and fissions (SRTl), and sorting a multichromosomal genome by weighted reversals, translocations, fusions, fissions and transpositions (SwRTTl). AVAILABILITY: Source code can be obtained by the authors, or use the web interface http://www.uni-ulm.de/in/theo/research/genesis.html.

Subject(s)

Biological Evolution , Genome , Algorithms , Animals , Anopheles/genetics , DNA/genetics , Drosophila melanogaster/genetics

10.

Sorting by weighted reversals, transpositions, and inverted transpositions.

Bader, Martin; Ohlebusch, Enno.

J Comput Biol ; 14(5): 615-36, 2007 Jun.

Article in English | MEDLINE | ID: mdl-17683264

ABSTRACT

During evolution, genomes are subject to genome rearrangements that alter the ordering and orientation of genes on the chromosomes. If a genome consists of a single chromosome (like mitochondrial, chloroplast, or bacterial genomes), the biologically relevant genome rearrangements are (1) inversions--also called reversals--where a section of the genome is excised, reversed in orientation, and reinserted and (2) transpositions, where a section of the genome is excised and reinserted at a new position in the genome; if this also involves an inversion, one speaks of an inverted transposition. To reconstruct ancient events in the evolutionary history of organisms, one is interested in finding an optimal sequence of genome rearrangements that transforms a given genome into another genome. It is well known that this problem is equivalent to the problem of "sorting" a signed permutation into the identity permutation. In this paper, we provide a 1.5-approximation algorithm for sorting by weighted reversals, transpositions and inverted transpositions for biologically realistic weights.

Subject(s)

Algorithms , Chromosome Inversion/genetics , Gene Rearrangement , Genome , Genomics/methods , Models, Genetic , Systems Biology/methods , DNA Transposable Elements/genetics , DNA, Circular/genetics , Systems Biology/trends

11.

An applications-focused review of comparative genomics tools: capabilities, limitations and future challenges.

Chain, Patrick; Kurtz, Stefan; Ohlebusch, Enno; Slezak, Tom.

Brief Bioinform ; 4(2): 105-23, 2003 06.

Article in English | MEDLINE | ID: mdl-12846393

ABSTRACT

A team at the Lawrence Livermore National Laboratory (LLNL) was given the task of using computational tools to speed up the development of DNA diagnostics for pathogen detection. This work will be described in another paper in this issue (see pages 133-149). To achieve this goal it was necessary to understand the merits and limitations of the various available comparative genomics tools. A review of some recent tools for multisequence/genome alignment and substring comparison is presented, within the general framework of applicability to a large-scale application. We note that genome alignments are important for many things, only one of which is pathogen detection. Understanding gene function, gene regulation, gene networks, phylogenetic studies and other aspects of evolution all depend on accurate nucleic acid and protein sequence alignment. Selecting appropriate tools can make a large difference in the quality of results obtained and the effort required.

Subject(s)

Genome , Genomics/methods , Sequence Alignment , Sequence Analysis, DNA/methods , Algorithms , Animals , Bacteria/genetics , Humans , Software , Viruses/genetics

12.

Efficient multiple genome alignment.

Höhl, Michael; Kurtz, Stefan; Ohlebusch, Enno.

Bioinformatics ; 18 Suppl 1: S312-20, 2002.

Article in English | MEDLINE | ID: mdl-12169561

ABSTRACT

MOTIVATION: To allow a direct comparison of the genomic DNA sequences of sufficiently similar organisms, there is an urgent need for software tools that can align more than two genomic sequences. RESULTS: We developed new algorithms and a software tool 'Multiple Genome Aligner' (MGA for short) that efficiently computes multiple genome alignments of large, closely related DNA sequences. For example, it can align 85% percent of the complete genomes of six human adenoviruses (average length 35305 bp.) in 159 seconds. An alignment of 74% of the complete genomes of three of strains of E. coli (lengths: 5528445; 5498450; 4639221 approximately bp.) is produced in 30 minutes.

Subject(s)

Algorithms , Chromosome Mapping/methods , Evolution, Molecular , Gene Expression Profiling/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Animals , Base Sequence , Humans , Molecular Sequence Data , Phylogeny , Sequence Homology, Nucleic Acid

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL