Search | VHL Regional Portal

The khmer software package: enabling efficient nucleotide sequence analysis.

Crusoe, Michael R; Alameldin, Hussien F; Awad, Sherine; Boucher, Elmar; Caldwell, Adam; Cartwright, Reed; Charbonneau, Amanda; Constantinides, Bede; Edvenson, Greg; Fay, Scott; Fenton, Jacob; Fenzl, Thomas; Fish, Jordan; Garcia-Gutierrez, Leonor; Garland, Phillip; Gluck, Jonathan; González, Iván; Guermond, Sarah; Guo, Jiarong; Gupta, Aditi; Herr, Joshua R; Howe, Adina; Hyer, Alex; Härpfer, Andreas; Irber, Luiz; Kidd, Rhys; Lin, David; Lippi, Justin; Mansour, Tamer; McA'Nulty, Pamela; McDonald, Eric; Mizzi, Jessica; Murray, Kevin D; Nahum, Joshua R; Nanlohy, Kaben; Nederbragt, Alexander Johan; Ortiz-Zuazaga, Humberto; Ory, Jeramia; Pell, Jason; Pepe-Ranney, Charles; Russ, Zachary N; Schwarz, Erich; Scott, Camille; Seaman, Josiah; Sievert, Scott; Simpson, Jared; Skennerton, Connor T; Spencer, James; Srinivasan, Ramakrishnan; Standage, Daniel.

F1000Res ; 4: 900, 2015.

Article in English | MEDLINE | ID: mdl-26535114

ABSTRACT

The khmer package is a freely available software library for working efficiently with fixed-length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/.
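
The abstract names digital normalization without describing it. As a rough illustration only, the sketch below follows one common formulation: a read is kept only if the median abundance of its k-mers, among reads kept so far, is still below a cutoff. This is not khmer's API. The function names `kmers` and `normalize`, the defaults `k=4` and `cutoff=2`, and the use of an exact `Counter` in place of khmer's probabilistic counting structure are all assumptions made for the example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every overlapping k-mer (fixed-length word) in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=4, cutoff=2):
    """Keep a read only while its median k-mer count is below cutoff.

    Hypothetical helper: an exact Counter stands in for a probabilistic
    k-mer counter, which is an assumption made to keep the example short.
    """
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read is shorter than k, nothing to estimate
        # Counter returns 0 for unseen k-mers without inserting them.
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept
```

Redundant reads are dropped once their k-mers are already well represented, while reads from regions not yet seen are kept.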

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.

Zhang, Qingpeng; Pell, Jason; Canino-Koning, Rosangela; Howe, Adina Chuang; Brown, C Titus.

PLoS One ; 9(7): e101271, 2014.

Article in English | MEDLINE | ID: mdl-25062443

ABSTRACT

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory-efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
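
To make the overcounting behaviour concrete, here is a minimal pure-Python Count-Min Sketch applied to k-mers. It is an illustrative sketch, not khmer's implementation or interface: the class name, the `width` and `depth` parameters, and the choice of salted BLAKE2b as the hash family are all assumptions. Each k-mer increments one counter per row, and a query returns the minimum across rows, so collisions can only inflate a count and never reduce it.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: depth rows of width counters each."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _cells(self, kmer):
        # One salted hash per row (assumed hash family, chosen for brevity).
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        """Online update: bump one counter in every row."""
        for row, col in self._cells(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        """Estimated count: the minimum over rows, never below the truth."""
        return min(self.tables[row][col] for row, col in self._cells(kmer))


def count_kmers(sketch, seq, k):
    """Feed every overlapping k-mer of seq into the sketch."""
    for i in range(len(seq) - k + 1):
        sketch.add(seq[i:i + k])
```

Note that only counters are stored, not the k-mers themselves, which matches the abstract's remark that the k-mers cannot be recovered from the structure.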

Subject(s)

Computational Biology, Nucleotides, Sequence Analysis, DNA, Software, Algorithms, Humans

The genome and developmental transcriptome of the strongylid nematode Haemonchus contortus.

Schwarz, Erich M; Korhonen, Pasi K; Campbell, Bronwyn E; Young, Neil D; Jex, Aaron R; Jabbar, Abdul; Hall, Ross S; Mondal, Alinda; Howe, Adina C; Pell, Jason; Hofmann, Andreas; Boag, Peter R; Zhu, Xing-Quan; Gregory, T; Loukas, Alex; Williams, Brian A; Antoshechkin, Igor; Brown, C; Sternberg, Paul W; Gasser, Robin B.

Genome Biol ; 14(8): R89, 2013 Aug 28.

Article in English | MEDLINE | ID: mdl-23985341

ABSTRACT

BACKGROUND: The barber's pole worm, Haemonchus contortus, is one of the most economically important parasites of small ruminants worldwide. Although this parasite can be controlled using anthelmintic drugs, resistance against most drugs in common use has become a widespread problem. We provide a draft of the genome and the transcriptomes of all key developmental stages of H. contortus to support biological and biotechnological research areas of this and related parasites. RESULTS: The draft genome of H. contortus is 320 Mb in size and encodes 23,610 protein-coding genes. On a fundamental level, we elucidate transcriptional alterations taking place throughout the life cycle, characterize the parasite's gene silencing machinery, and explore molecules involved in development, reproduction, host-parasite interactions, immunity, and disease. The secretome of H. contortus is particularly rich in peptidases linked to blood-feeding activity and interactions with host tissues, and a diverse array of molecules is involved in complex immune responses. On an applied level, we predict drug targets and identify vaccine molecules. CONCLUSIONS: The draft genome and developmental transcriptome of H. contortus provide a major resource to the scientific community for a wide range of genomic, genetic, proteomic, metabolomic, evolutionary, biological, ecological, and epidemiological investigations, and a solid foundation for biotechnological outcomes, including new anthelmintics, vaccines and diagnostic tests. This first draft genome of any strongylid nematode paves the way for a rapid acceleration in our understanding of a wide range of socioeconomically important parasites of one of the largest nematode orders.

Subject(s)

Antigens, Helminth/genetics, Genes, Helminth, Genome, Helminth, Haemonchus/genetics, Life Cycle Stages/genetics, Transcriptome, Animals, Anthelmintics/pharmacology, Drug Resistance/genetics, Female, Gene Expression Regulation, Genome Size, Haemonchiasis/parasitology, Haemonchiasis/veterinary, Haemonchus/drug effects, Haemonchus/growth & development, Helminth Proteins/chemistry, Helminth Proteins/genetics, Host-Parasite Interactions, Male, Peptide Hydrolases/chemistry, Peptide Hydrolases/genetics, Sheep, Sheep Diseases/parasitology

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs.

Pell, Jason; Hintze, Arend; Canino-Koning, Rosangela; Howe, Adina; Tiedje, James M; Brown, C Titus.

Proc Natl Acad Sci U S A ; 109(33): 13272-7, 2012 Aug 14.

Article in English | MEDLINE | ID: mdl-22847406

ABSTRACT

Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
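
The idea of storing an assembly graph in a Bloom filter can be sketched as follows: only k-mer membership is recorded, and edges are implied, so a node's neighbours are found by testing its four possible one-base extensions. This is a simplified illustration under stated assumptions, not the authors' code. The class and function names, the `num_bits` and `num_hashes` defaults, and the salted BLAKE2b hashes are hypothetical, and reverse complements are ignored for brevity.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter over strings; false positives are possible."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One salted hash per probe (assumed hash family).
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def load_kmers(bloom, seq, k):
    """Insert every overlapping k-mer of seq as a graph node."""
    for i in range(len(seq) - k + 1):
        bloom.add(seq[i:i + k])


def successors(bloom, kmer):
    """Implied out-edges: extensions of kmer that test as present."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]
```

Because a Bloom filter has no false negatives, every true edge is found; a false positive shows up as a spurious neighbour, which is the inexactness the abstract refers to.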

Subject(s)

Computational Biology, Genome, Bacterial/genetics, Metagenome/genetics, Sequence Analysis, DNA/methods, Base Pairing/genetics, Chromosomes, Bacterial/genetics, DNA, Circular/genetics, Escherichia coli/genetics, Information Theory, Nonlinear Dynamics, Soil Microbiology
