Search | VHL Regional Portal

1.

Lipták, Zsuzsanna; Masillo, Francesco; Puglisi, Simon J.

Algorithms Mol Biol ; 19(1): 11, 2024 Mar 12.

Article in English | MEDLINE | ID: mdl-38475889

ABSTRACT

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.
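To make the term concrete, the following is a minimal sketch of matching statistics under the usual definition (an assumption here, since the abstract does not spell it out): for each position i of a string s, the length of the longest prefix of s[i:] that occurs somewhere in a reference string. The function name `matching_statistics` is hypothetical, and this naive version is neither the compressed representation nor the heuristic described in the abstract.

```python
def matching_statistics(s: str, ref: str) -> list[int]:
    """Naive matching statistics of s with respect to ref.

    ms[i] is the length of the longest prefix of s[i:] that occurs
    as a substring of ref. Illustrative only: substring tests make
    this far slower than index-based methods.
    """
    ms = []
    length = 0
    for i in range(len(s)):
        # If s[i-1 : i-1+L] occurs in ref, then s[i : i-1+L] does too,
        # so the search can resume from the previous length minus one.
        length = max(length - 1, 0)
        while i + length < len(s) and s[i:i + length + 1] in ref:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(matching_statistics("nanab", "banana"))  # [4, 3, 2, 1, 1]
```

Long matches against the reference are what make highly similar collections compressible: most positions share a long prefix with the reference.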

2.

Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes.

Alanko, Jarno N; Vuohtoniemi, Jaakko; Mäklin, Tommi; Puglisi, Simon J.

Bioinformatics ; 39(39 Suppl 1): i260-i269, 2023 06 30.

Article in English | MEDLINE | ID: mdl-37387143

ABSTRACT

MOTIVATION: Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make use of these datasets, efficient indexing data structures that are both scalable and provide rapid query throughput are paramount. RESULTS: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 h. The resulting index takes 142 gigabytes. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11 000 genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets. AVAILABILITY AND IMPLEMENTATION: Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license.
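As a rough illustration of what a colored k-mer index stores, the sketch below maps each k-mer to the set of genomes ("colors") containing it and pseudoaligns a read by intersecting the color sets of its indexed k-mers. The names `build_colored_index` and `pseudoalign` are hypothetical, the plain dictionary is nothing like a scalable index, and the intersection rule is one common pseudoalignment definition assumed here, not a description of Themisto's implementation.

```python
def build_colored_index(genomes: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every k-mer to the set of genome names (colors) containing it."""
    index: dict[str, set[str]] = {}
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index


def pseudoalign(read: str, index: dict[str, set[str]], k: int) -> set[str]:
    """Intersect the color sets of the read's k-mers that are in the index.

    Assumption: k-mers absent from the index are skipped rather than
    forcing an empty result.
    """
    result = None
    for i in range(len(read) - k + 1):
        colors = index.get(read[i:i + k])
        if colors is None:
            continue
        result = set(colors) if result is None else result & colors
    return result or set()


if __name__ == "__main__":
    idx = build_colored_index({"g1": "ACGTACGT", "g2": "ACGTTTGA"}, k=4)
    print(pseudoalign("ACGTA", idx, k=4))  # {'g1'}
```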

Subject(s)

Genome, Bacterial , Nanopores , Genomics , Metagenomics

3.

Space-Efficient Indexing of Spaced Seeds for Accurate Overlap Computation of Raw Optical Mapping Data.

Walve, Riku; Puglisi, Simon J; Salmela, Leena.

IEEE/ACM Trans Comput Biol Bioinform ; PP, 2021 May 31.

Article in English | MEDLINE | ID: mdl-34057895

ABSTRACT

A key problem in processing raw optical mapping data (Rmaps) is finding Rmaps originating from the same genomic region. These sets of related Rmaps can be used to correct errors in Rmap data, and to find overlaps between Rmaps to assemble consensus optical maps. Previous Rmap overlap aligners are computationally very expensive and do not scale to large eukaryotic data sets. We present Selkie, an Rmap overlap aligner based on a spaced (l,k)-mer index which was pioneered in the Rmap error correction tool Elmeri. Here we present a space efficient version of the index which is twice as fast as prior art while using just a quarter of the memory on a human data set. Moreover, our index can be used for filtering candidates for Rmap overlap computation, whereas Elmeri used the index only for error correction of Rmaps. By combining our filtering of Rmaps with the exhaustive, but highly accurate, algorithm of Valouev et al. (2006), Selkie maintains or increases the accuracy of finding overlapping Rmaps on a bacterial dataset while being at least four times faster. Furthermore, for finding overlaps in a human dataset, Selkie is up to two orders of magnitude faster than previous methods.

4.

Data structures based on k-mers for querying large collections of sequencing data sets.

Marchet, Camille; Boucher, Christina; Puglisi, Simon J; Medvedev, Paul; Salson, Mikaël; Chikhi, Rayan.

Genome Res ; 31(1): 1-12, 2021 01.

Article in English | MEDLINE | ID: mdl-33328168

ABSTRACT

High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.

Subject(s)

Algorithms , Software , High-Throughput Nucleotide Sequencing , Reproducibility of Results

5.

Succinct dynamic de Bruijn graphs.

Alipanahi, Bahar; Kuhnle, Alan; Puglisi, Simon J; Salmela, Leena; Boucher, Christina.

Bioinformatics ; 37(14): 1946-1952, 2021 08 04.

Article in English | MEDLINE | ID: mdl-32462192

ABSTRACT

MOTIVATION: The de Bruijn graph is one of the fundamental data structures for analysis of high throughput sequencing data. In order to be applicable to population-scale studies, it is essential to build and store the graph in a space- and time-efficient manner. In addition, due to the ever-changing nature of population studies, it has become essential to update the graph after construction, e.g. add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there is a limited amount of work on building the graph in an efficient and mutable manner. Hence, most space efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes. RESULTS: In this article, we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g. greater than 15 billion k-mers). Competing dynamic methods either cannot be constructed on large-scale datasets (e.g. FDBG) or cannot support both addition and deletion (e.g. BiFrost). AVAILABILITY AND IMPLEMENTATION: DynamicBOSS is publicly available at https://github.com/baharpan/dynboss. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
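The operations at stake (adding and deleting edges after construction) can be sketched with an ordinary hash set of k-mers, where each k-mer is an edge between its (k-1)-length prefix and suffix. The class `SimpleDeBruijnGraph` and its method names are hypothetical; this uncompressed sketch shows only the interface of a mutable de Bruijn graph, not the succinct BOSS-based representation the abstract describes.

```python
class SimpleDeBruijnGraph:
    """Edge-centric de Bruijn graph kept as a plain set of k-mers.

    Nodes are (k-1)-mers; each stored k-mer is an edge from its prefix
    to its suffix. Not succinct: memory grows with the raw k-mer strings.
    """

    def __init__(self, k: int) -> None:
        self.k = k
        self.kmers: set[str] = set()

    def add_sequence(self, seq: str) -> None:
        """Insert every k-mer of seq as an edge."""
        for i in range(len(seq) - self.k + 1):
            self.kmers.add(seq[i:i + self.k])

    def remove_kmer(self, kmer: str) -> None:
        """Delete one edge; a no-op if the edge is absent."""
        self.kmers.discard(kmer)

    def successors(self, node: str) -> list[str]:
        """Nodes reachable from a (k-1)-mer by one outgoing edge."""
        return sorted(node[1:] + c for c in "ACGT" if node + c in self.kmers)


if __name__ == "__main__":
    g = SimpleDeBruijnGraph(k=3)
    g.add_sequence("ACGTC")
    g.add_sequence("CGA")
    print(g.successors("CG"))  # ['GA', 'GT']
    g.remove_kmer("CGT")
    print(g.successors("CG"))  # ['GA']
```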

Subject(s)

Algorithms , Software , High-Throughput Nucleotide Sequencing , Research Design , Sequence Analysis, DNA

6.

Fast and accurate correction of optical mapping data via spaced seeds.

Salmela, Leena; Mukherjee, Kingshuk; Puglisi, Simon J; Muggli, Martin D; Boucher, Christina.

Bioinformatics ; 36(9): 2974, 2020 05 01.

Article in English | MEDLINE | ID: mdl-32187358

7.

Fast and accurate correction of optical mapping data via spaced seeds.

Salmela, Leena; Mukherjee, Kingshuk; Puglisi, Simon J; Muggli, Martin D; Boucher, Christina.

Bioinformatics ; 36(3): 682-689, 2020 02 01.

Article in English | MEDLINE | ID: mdl-31504206

ABSTRACT

MOTIVATION: Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, makes their assembly and alignment challenging. Although there exists another method to error correct Rmap data, named cOMet, it is unable to scale to even moderately large genomes. The challenge faced in error correction is in determining pairs of Rmaps that originate from the same region of the same genome. RESULTS: We create an efficient method for determining pairs of Rmaps that contain significant overlaps between them. Our method relies on the novel and nontrivial adaptation and application of spaced seeds in the context of optical mapping, which allows for spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods but in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps by more than four times compared to cOMet. AVAILABILITY AND IMPLEMENTATION: Elmeri is publicly available under GNU Affero General Public License at https://github.com/LeenaSalmela/Elmeri. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genomics , Software , Algorithms , Genome, Human , Humans , Restriction Mapping , Sequence Analysis, DNA

8.

Kohdista: an efficient method to index and query possible Rmap alignments.

Muggli, Martin D; Puglisi, Simon J; Boucher, Christina.

Algorithms Mol Biol ; 14: 25, 2019.

Article in English | MEDLINE | ID: mdl-31867049

ABSTRACT

BACKGROUND: Genome-wide optical maps are ordered high-resolution restriction maps that give the position of occurrence of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled using an overlap-layout-consensus approach using raw optical map data, which are referred to as Rmaps. Due to the high error-rate of Rmap data, finding the overlap between Rmaps remains challenging. RESULTS: We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. CONCLUSION: We demonstrate Kohdista is the only method that is capable of finding a significant number of high quality pairwise Rmap alignments for large eukaryote organisms in reasonable time.

9.

Relative Suffix Trees.

Farruggia, Andrea; Gagie, Travis; Navarro, Gonzalo; Puglisi, Simon J; Sirén, Jouni.

Comput J ; 61(5): 773-788, 2018 May.

Article in English | MEDLINE | ID: mdl-29795706

ABSTRACT

Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.

10.

Document retrieval on repetitive string collections.

Gagie, Travis; Hartikainen, Aleksi; Karhu, Kalle; Kärkkäinen, Juha; Navarro, Gonzalo; Puglisi, Simon J; Sirén, Jouni.

Inf Retr Boston ; 20(3): 253-291, 2017.

Article in English | MEDLINE | ID: mdl-28596702

ABSTRACT

Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them. Document retrieval problems are routinely solved by search engines on large natural language collections, but the techniques are less developed on generic string collections. The case of repetitive string collections is even less understood, and there are very few existing solutions. We develop two novel ideas, interleaved LCPs and precomputed document lists, that yield highly compressed indexes solving the problem of document listing (find all the documents where a string appears), top-k document retrieval (find the k documents where a string appears most often), and document counting (count the number of documents where a string appears). We also show that a classical data structure supporting the latter query becomes highly compressible on repetitive data. Finally, we show how the tools we developed can be combined to solve ranked conjunctive and disjunctive multi-term queries under the simple tf-idf model of relevance. We thoroughly evaluate the resulting techniques in various real-life repetitiveness scenarios, and recommend the best choices for each case.
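The three queries named in the abstract can be stated as brute-force scans, which may help fix their meaning before thinking about compressed indexes. The function names below are hypothetical, overlapping occurrences are counted (an assumption), and ties in the top-k query are broken by document position; none of this reflects the interleaved-LCP or precomputed-list techniques of the paper.

```python
def count_occurrences(doc: str, pattern: str) -> int:
    """Count possibly overlapping occurrences of a non-empty pattern in doc."""
    count, start = 0, doc.find(pattern)
    while start != -1:
        count += 1
        start = doc.find(pattern, start + 1)
    return count


def document_listing(docs: list[str], pattern: str) -> list[int]:
    """Indices of all documents in which pattern appears."""
    return [i for i, d in enumerate(docs) if pattern in d]


def document_counting(docs: list[str], pattern: str) -> int:
    """Number of documents in which pattern appears."""
    return len(document_listing(docs, pattern))


def top_k_documents(docs: list[str], pattern: str, k: int) -> list[int]:
    """Indices of the k documents where pattern appears most often."""
    hits = [(count_occurrences(d, pattern), i) for i, d in enumerate(docs)]
    hits = [(c, i) for c, i in hits if c > 0]
    hits.sort(key=lambda ci: (-ci[0], ci[1]))  # most frequent first
    return [i for _, i in hits[:k]]


if __name__ == "__main__":
    docs = ["abracadabra", "cadabra", "xyz"]
    print(document_listing(docs, "abra"))    # [0, 1]
    print(document_counting(docs, "abra"))   # 2
    print(top_k_documents(docs, "abra", 1))  # [0]
```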

11.

emMAW: computing minimal absent words in external memory.

Héliou, Alice; Pissis, Solon P; Puglisi, Simon J.

Bioinformatics ; 33(17): 2746-2749, 2017 Sep 01.

Article in English | MEDLINE | ID: mdl-28407038

ABSTRACT

MOTIVATION: The biological significance of minimal absent words has been investigated in genomes of organisms from all domains of life. For instance, three minimal absent words of the human genome were found in Ebola virus genomes. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words of a sequence of length n on a fixed-sized alphabet based on suffix arrays. A standard implementation of this algorithm, when applied to a large sequence of length n, requires more than 20n bytes of RAM. Such memory requirements are a significant hurdle to the computation of minimal absent words in large datasets. RESULTS: We present emMAW, the first external-memory algorithm for computing minimal absent words. A free open-source implementation of our algorithm is made available. This allows for computation of minimal absent words on far bigger data sets than was previously possible. Our implementation requires less than 3 h on a standard workstation to process the full human genome when as little as 1 GB of RAM is made available. We stress that our implementation, despite making use of external memory, is fast; indeed, even on relatively smaller datasets when enough RAM is available to hold all necessary data structures, it is less than two times slower than state-of-the-art internal-memory implementations. AVAILABILITY AND IMPLEMENTATION: https://github.com/solonas13/maw (free software under the terms of the GNU GPL). CONTACT: alice.heliou@lix.polytechnique.fr or solon.pissis@kcl.ac.uk. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
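For readers unfamiliar with the notion, a brute-force sketch of minimal absent words follows, assuming the standard definition: a word w of length at least 2 that does not occur in the sequence, while both w without its first letter and w without its last letter do occur. The function name `minimal_absent_words` is hypothetical, and this quadratic-space enumeration is for tiny inputs only; it is unrelated to the suffix-array or external-memory algorithms mentioned in the abstract.

```python
def minimal_absent_words(x: str, alphabet: str) -> list[str]:
    """Brute-force minimal absent words of x (length >= 2) over alphabet.

    A word w qualifies when w is not a factor of x but both w[:-1]
    and w[1:] are. Enumerates all factors, so use on short strings only.
    """
    factors = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    maws = set()
    for left in factors:           # left plays the role of w[:-1]
        for letter in alphabet:
            w = left + letter
            if w not in factors and w[1:] in factors:
                maws.add(w)
    return sorted(maws)


if __name__ == "__main__":
    print(minimal_absent_words("abaab", "ab"))  # ['aaa', 'aaba', 'bab', 'bb']
```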

Subject(s)

Sequence Analysis, DNA/methods , Software , Algorithms , Genome, Human , Humans

12.

Succinct colored de Bruijn graphs.

Muggli, Martin D; Bowe, Alexander; Noyes, Noelle R; Morley, Paul S; Belk, Keith E; Raymond, Robert; Gagie, Travis; Puglisi, Simon J; Boucher, Christina.

Bioinformatics ; 33(20): 3181-3187, 2017 Oct 15.

Article in English | MEDLINE | ID: mdl-28200001

ABSTRACT

MOTIVATION: In 2012, Iqbal et al. introduced the colored de Bruijn graph, a variant of the classic de Bruijn graph, which is aimed at 'detecting and genotyping simple and complex genetic variants in an individual or population'. Because they are intended to be applied to massive population level data, it is essential that the graphs be represented efficiently. Unfortunately, current succinct de Bruijn graph representations are not directly applicable to the colored de Bruijn graph, which requires additional information to be succinctly encoded as well as support for non-standard traversal operations. RESULTS: Our data structure dramatically reduces the amount of memory required to store and use the colored de Bruijn graph, with some penalty to runtime, allowing it to be applied in much larger and more ambitious sequence projects than was previously possible. AVAILABILITY AND IMPLEMENTATION: https://github.com/cosmo-team/cosmo/tree/VARI. CONTACT: martin.muggli@colostate.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genotyping Techniques/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Bacteria/genetics , Eukaryota/genetics

13.

Misassembly detection using paired-end sequence reads and optical mapping data.

Muggli, Martin D; Puglisi, Simon J; Ronen, Roy; Boucher, Christina.

Bioinformatics ; 31(12): i80-8, 2015 Jun 15.

Article in English | MEDLINE | ID: mdl-26072512

ABSTRACT

MOTIVATION: A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method called misSEQuel that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data. Our method also fulfills the critical need for open source computational methods for analyzing optical mapping data. We apply our method to various assemblies of the loblolly pine, Francisella tularensis, rice and budgerigar genomes. We generated and used simulated optical mapping data for loblolly pine and F. tularensis and used real optical mapping data for rice and budgerigar. RESULTS: Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in assemblies of F. tularensis and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in assemblies of loblolly pine. Using the real optical mapping data, we correctly identified 75% of extensively misassembled contigs and 100% of locally misassembled contigs in rice, and 77% of extensively misassembled contigs and 80% of locally misassembled contigs in budgerigar. AVAILABILITY AND IMPLEMENTATION: misSEQuel can be used as a post-processing step in combination with any genome assembler and is freely available at http://www.cs.colostate.edu/seq/.

Subject(s)

Algorithms , Computational Biology/methods , Sequence Analysis, DNA/methods , Software , Animals , Contig Mapping , Francisella tularensis/genetics , Genome , Melopsittacus/genetics , Oryza/genetics , Pinus/genetics

14.

Searching and Indexing Genomic Databases via Kernelization.

Gagie, Travis; Puglisi, Simon J.

Front Bioeng Biotechnol ; 3: 12, 2015.

Article in English | MEDLINE | ID: mdl-25710001

ABSTRACT

The rapid advance of DNA sequencing technologies has yielded databases of thousands of genomes. To search and index these databases effectively, it is important that we take advantage of the similarity between those genomes. Several authors have recently suggested searching or indexing only one reference genome and the parts of the other genomes where they differ. In this paper, we survey the 20-year history of this idea and discuss its relation to kernelization in parameterized complexity.

15.

SHREC: a short-read error correction method.

Schröder, Jan; Schröder, Heiko; Puglisi, Simon J; Sinha, Ranjan; Schmidt, Bertil.

Bioinformatics ; 25(17): 2157-63, 2009 Sep 01.

Article in English | MEDLINE | ID: mdl-19542152

ABSTRACT

MOTIVATION: Second-generation sequencing technologies produce a massive amount of short reads in a single experiment. However, sequencing errors can cause major problems when using this approach for de novo sequencing applications. Moreover, existing error correction methods have been designed and optimized for shotgun sequencing. Therefore, there is an urgent need for the design of fast and accurate computational methods and tools for error correction of large amounts of short read data. RESULTS: We present SHREC, a new algorithm for correcting errors in short-read data that uses a generalized suffix trie on the read data as the underlying data structure. Our results show that the method can identify erroneous reads with sensitivity and specificity of over 99% and 96% for simulated data with error rates of up to 3% as well as for real data. Furthermore, it achieves an error correction accuracy of over 80% for simulated data and over 88% for real data. These results are clearly superior to previously published approaches. SHREC is available as an efficient open-source Java implementation that allows processing of 10 million short reads on a standard workstation.

Subject(s)

Algorithms , Sequence Analysis, DNA/methods , Computational Biology/methods , DNA/genetics , Databases, Nucleic Acid , Genome/genetics , Research Design , Time Factors

16.

A fast hybrid short read fragment assembly algorithm.

Schmidt, Bertil; Sinha, Ranjan; Beresford-Smith, Bryan; Puglisi, Simon J.

Bioinformatics ; 25(17): 2279-80, 2009 Sep 01.

Article in English | MEDLINE | ID: mdl-19535537

ABSTRACT

SUMMARY: The shorter and vastly more numerous reads produced by second-generation sequencing technologies require new tools that can assemble massive numbers of reads in reasonable time. Existing short-read assembly tools can be classified into two categories: greedy extension-based and graph-based. While the graph-based approaches are generally superior in terms of assembly quality, the computer resources required for building and storing a huge graph are very high. In this article, we present Taipan, an assembly algorithm which can be viewed as a hybrid of these two approaches. Taipan uses greedy extensions for contig construction but at each step realizes enough of the corresponding read graph to make better decisions as to how assembly should continue. We show that this approach can achieve an assembly quality at least as good as the graph-based approaches used in the popular Edena and Velvet assembly tools using a moderate amount of computing resources.

Subject(s)

Algorithms , Helicobacter pylori/genetics , Sequence Analysis, DNA/methods , Staphylococcus aureus/genetics , Computational Biology , Databases, Nucleic Acid
