Search | VHL Regional Portal

Metagenome SNP calling via read-colored de Bruijn graphs.

Alipanahi, Bahar; Muggli, Martin D; Jundi, Musa; Noyes, Noelle R; Boucher, Christina.

Bioinformatics ; 36(22-23): 5275-5281, 2021 04 01.

Article in English | MEDLINE | ID: mdl-32049324

ABSTRACT

MOTIVATION: Metagenomics refers to the study of complex samples containing of genetic contents of multiple individual organisms and, thus, has been used to elucidate the microbiome and resistome of a complex sample. The microbiome refers to all microbial organisms in a sample, and the resistome refers to all of the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria. Single-nucleotide polymorphisms (SNPs) can be effectively used to 'fingerprint' specific organisms and genes within the microbiome and resistome and trace their movement across various samples. However, to effectively use these SNPs for this traceability, a scalable and accurate metagenomics SNP caller is needed. Moreover, such an SNP caller should not be reliant on reference genomes since 95% of microbial species is unculturable, making the determination of a reference genome extremely challenging. In this article, we address this need. RESULTS: We present LueVari, a reference-free SNP caller based on the read-colored de Bruijn graph, an extension of the traditional de Bruijn graph that allows repeated regions longer than the k-mer length and shorter than the read length to be identified unambiguously. LueVari is able to identify SNPs in both AMR genes and chromosomal DNA from shotgun metagenomics data with reliable sensitivity (between 91% and 99%) and precision (between 71% and 99%) as the performance of competing methods varies widely. Furthermore, we show that LueVari constructs sequences containing the variation, which span up to 97.8% of genes in datasets, which can be helpful in detecting distinct AMR genes in large metagenomic datasets. AVAILABILITY AND IMPLEMENTATION: Code and datasets are publicly available at https://github.com/baharpan/cosmo/tree/LueVari. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Metagenome , Software , Algorithms , Metagenomics , Polymorphism, Single Nucleotide , Sequence Analysis, DNA

Fast and accurate correction of optical mapping data via spaced seeds.

Salmela, Leena; Mukherjee, Kingshuk; Puglisi, Simon J; Muggli, Martin D; Boucher, Christina.

Bioinformatics ; 36(9): 2974, 2020 05 01.

Article in English | MEDLINE | ID: mdl-32187358

Fast and accurate correction of optical mapping data via spaced seeds.

Salmela, Leena; Mukherjee, Kingshuk; Puglisi, Simon J; Muggli, Martin D; Boucher, Christina.

Bioinformatics ; 36(3): 682-689, 2020 02 01.

Article in English | MEDLINE | ID: mdl-31504206

ABSTRACT

MOTIVATION: Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, make assembly and alignment of them challenging. Although there exists another method to error correct Rmap data, named cOMet, it is unable to scale to even moderately large sized genomes. The challenge faced in error correction is in determining pairs of Rmaps that originate from the same region of the same genome. RESULTS: We create an efficient method for determining pairs of Rmaps that contain significant overlaps between them. Our method relies on the novel and nontrivial adaption and application of spaced seeds in the context of optical mapping, which allows for spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods but in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps by more than four times compared to cOMet. AVAILABILITY AND IMPLEMENTATION: Elmeri is publicly available under GNU Affero General Public License at https://github.com/LeenaSalmela/Elmeri. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genomics , Software , Algorithms , Genome, Human , Humans , Restriction Mapping , Sequence Analysis, DNA

Kohdista: an efficient method to index and query possible Rmap alignments.

Muggli, Martin D; Puglisi, Simon J; Boucher, Christina.

Algorithms Mol Biol ; 14: 25, 2019.

Article in English | MEDLINE | ID: mdl-31867049

ABSTRACT

BACKGROUND: Genome-wide optical maps are ordered high-resolution restriction maps that give the position of occurrence of restriction cut sites corresponding to one or more restriction enzymes. These genome-wide optical maps are assembled using an overlap-layout-consensus approach using raw optical map data, which are referred to as Rmaps. Due to the high error-rate of Rmap data, finding the overlap between Rmaps remains challenging. RESULTS: We present Kohdista, which is an index-based algorithm for finding pairwise alignments between single molecule maps (Rmaps). The novelty of our approach is the formulation of the alignment problem as automaton path matching, and the application of modern index-based data structures. In particular, we combine the use of the Generalized Compressed Suffix Array (GCSA) index with the wavelet tree in order to build Kohdista. We validate Kohdista on simulated E. coli data, showing the approach successfully finds alignments between Rmaps simulated from overlapping genomic regions. CONCLUSION: we demonstrate Kohdista is the only method that is capable of finding a significant number of high quality pairwise Rmap alignments for large eukaryote organisms in reasonable time.

Building large updatable colored de Bruijn graphs via merging.

Muggli, Martin D; Alipanahi, Bahar; Boucher, Christina.

Bioinformatics ; 35(14): i51-i60, 2019 07 15.

Article in English | MEDLINE | ID: mdl-31510647

ABSTRACT

MOTIVATION: There exist several large genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph were developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets as scalability is needed but also the ability to update the construction is also needed. RESULTS: We create a method for constructing the colored de Bruijn graph for large datasets that is based on partitioning the data into smaller datasets, building the colored de Bruijn graph using a FM-index based representation, and succinctly merging these representations to build a single graph. The last step, merging succinctly, is the algorithmic challenge which we solve in this article. We refer to the resulting method as VariMerge. This construction method also allows the graph to be updated with new data. We validate our approach and show it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8000 strains. Lastly, we compare VariMerge to other competing methods-including Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method of Almodaresi et al. and Multi-BRWT-and illustrate that VariMerge is the only method that is capable of building the colored de Bruijn graph for 16 000 strains in a manner that allows it to be updated. Competing methods either did not scale to this large of a dataset or do not allow for additions without reconstruction. AVAILABILITY AND IMPLEMENTATION: VariMerge is available at https://github.com/cosmo-team/cosmo/tree/VARI-merge under GPLv3 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Sequence Analysis, DNA , Software , Genomics , Metagenome

Error correcting optical mapping data.

Mukherjee, Kingshuk; Washimkar, Darshan; Muggli, Martin D; Salmela, Leena; Boucher, Christina.

Gigascience ; 7(6)2018 06 01.

Article in English | MEDLINE | ID: mdl-29846578

ABSTRACT

Optical mapping is a unique system that is capable of producing high-resolution, high-throughput genomic map data that gives information about the structure of a genome . Recently it has been used for scaffolding contigs and for assembly validation for large-scale sequencing projects, including the maize, goat, and Amborella genomes. However, a major impediment in the use of this data is the variety and quantity of errors in the raw optical mapping data, which are called Rmaps. The challenges associated with using Rmap data are analogous to dealing with insertions and deletions in the alignment of long reads. Moreover, they are arguably harder to tackle since the data are numerical and susceptible to inaccuracy. We develop cOMet to error correct Rmap data, which to the best of our knowledge is the only optical mapping error correction method. Our experimental results demonstrate that cOMet has high prevision and corrects 82.49% of insertion errors and 77.38% of deletion errors in Rmap data generated from the Escherichia coli K-12 reference genome. Out of the deletion errors corrected, 98.26% are true errors. Similarly, out of the insertion errors corrected, 82.19% are true errors. It also successfully scales to large genomes, improving the quality of 78% and 99% of the Rmaps in the plum and goat genomes, respectively. Last, we show the utility of error correction by demonstrating how it improves the assembly of Rmap data. Error corrected Rmap data results in an assembly that is more contiguous and covers a larger fraction of the genome.

Subject(s)

Chromosome Mapping/methods , High-Throughput Nucleotide Sequencing/methods , Animals , Computer Simulation , Databases, Genetic , Escherichia coli/genetics , Genome , Goats/genetics , Prunus domestica/genetics , Sequence Alignment

Succinct colored de Bruijn graphs.

Muggli, Martin D; Bowe, Alexander; Noyes, Noelle R; Morley, Paul S; Belk, Keith E; Raymond, Robert; Gagie, Travis; Puglisi, Simon J; Boucher, Christina.

Bioinformatics ; 33(20): 3181-3187, 2017 Oct 15.

Article in English | MEDLINE | ID: mdl-28200001

ABSTRACT

MOTIVATION: In 2012, Iqbal et al. introduced the colored de Bruijn graph, a variant of the classic de Bruijn graph, which is aimed at 'detecting and genotyping simple and complex genetic variants in an individual or population'. Because they are intended to be applied to massive population level data, it is essential that the graphs be represented efficiently. Unfortunately, current succinct de Bruijn graph representations are not directly applicable to the colored de Bruijn graph, which requires additional information to be succinctly encoded as well as support for non-standard traversal operations. RESULTS: Our data structure dramatically reduces the amount of memory required to store and use the colored de Bruijn graph, with some penalty to runtime, allowing it to be applied in much larger and more ambitious sequence projects than was previously possible. AVAILABILITY AND IMPLEMENTATION: https://github.com/cosmo-team/cosmo/tree/VARI. CONTACT: martin.muggli@colostate.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genotyping Techniques/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Bacteria/genetics , Eukaryota/genetics

Misassembly detection using paired-end sequence reads and optical mapping data.

Muggli, Martin D; Puglisi, Simon J; Ronen, Roy; Boucher, Christina.

Bioinformatics ; 31(12): i80-8, 2015 Jun 15.

Article in English | MEDLINE | ID: mdl-26072512

ABSTRACT

MOTIVATION: A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method called misSEQuel that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data. Our method also fulfills the critical need for open source computational methods for analyzing optical mapping data. We apply our method to various assemblies of the loblolly pine, Francisella tularensis, rice and budgerigar genomes. We generated and used stimulated optical mapping data for loblolly pine and F.tularensis and used real optical mapping data for rice and budgerigar. RESULTS: Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in assemblies of F.tularensis and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in assemblies of loblolly pine. Using the real optical mapping data, we correctly identified 75% of extensively misassembled contigs and 100% of locally misassembled contigs in rice, and 77% of extensively misassembled contigs and 80% of locally misassembled contigs in budgerigar. AVAILABILITY AND IMPLEMENTATION: misSEQuel can be used as a post-processing step in combination with any genome assembler and is freely available at http://www.cs.colostate.edu/seq/.

Subject(s)

Algorithms , Computational Biology/methods , Sequence Analysis, DNA/methods , Software , Animals , Contig Mapping , Francisella tularensis/genetics , Genome , Melopsittacus/genetics , Oryza/genetics , Pinus/genetics

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL