Search | VHL Regional Portal

1.

MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets.

Reddy, Rachamalla Maheedhar; Mohammed, Monzoorul Haque; Mande, Sharmila S.

Genomics ; 103(2-3): 161-8, 2014.

Article in English | MEDLINE | ID: mdl-24607570

ABSTRACT

A key challenge in analyzing metagenomics data pertains to assembly of sequenced DNA fragments (i.e. reads) originating from various microbes in a given environmental sample. Several existing methodologies can assemble reads originating from a single genome. However, these methodologies cannot be applied for efficient assembly of metagenomic sequence datasets. In this study, we present MetaCAA - a clustering-aided methodology which helps in improving the quality of metagenomic sequence assembly. MetaCAA initially groups sequences constituting a given metagenome into smaller clusters. Subsequently, sequences in each cluster are independently assembled using CAP3, an existing single genome assembly program. Contigs formed in each of the clusters along with the unassembled reads are then subjected to another round of assembly for generating the final set of contigs. Validation using simulated and real-world metagenomic datasets indicates that MetaCAA aids in improving the overall quality of assembly. A software implementation of MetaCAA is available at https://metagenomics.atc.tcs.com/MetaCAA.

Subject(s)

Datasets as Topic , Metagenome , Metagenomics/methods , Sequence Analysis, DNA/methods , Software
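The cluster-then-assemble workflow described in this abstract can be sketched roughly as follows. This is a minimal illustration of the control flow only, not the MetaCAA implementation: the names `two_stage_assembly`, `cluster_of` and `assemble` are hypothetical, the clustering rule and the assembler are left as caller-supplied functions, and treating first-round contigs that do not merge further as final contigs is an assumption.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Tuple

# Hypothetical assembler interface: takes sequences and returns
# (contigs, sequences left unassembled). A real pipeline might wrap an
# external assembler such as CAP3 behind this signature.
Assembler = Callable[[List[str]], Tuple[List[str], List[str]]]


def two_stage_assembly(
    reads: List[str],
    cluster_of: Callable[[str], Hashable],
    assemble: Assembler,
) -> Tuple[List[str], List[str]]:
    """Cluster reads, assemble each cluster, then assemble the pooled results."""
    # Step 1: group the reads into smaller clusters.
    clusters = defaultdict(list)
    for read in reads:
        clusters[cluster_of(read)].append(read)

    # Step 2: assemble every cluster independently.
    contigs: List[str] = []
    leftovers: List[str] = []
    for members in clusters.values():
        cluster_contigs, cluster_leftovers = assemble(members)
        contigs.extend(cluster_contigs)
        leftovers.extend(cluster_leftovers)

    # Step 3: run another round over the contigs plus the unassembled reads.
    merged, unmerged = assemble(contigs + leftovers)

    # Assumption: a first-round contig that did not merge again is still
    # reported as a contig; everything else unmerged is an unassembled read.
    first_round = set(contigs)
    final_contigs = merged + [s for s in unmerged if s in first_round]
    remaining_reads = [s for s in unmerged if s not in first_round]
    return final_contigs, remaining_reads
```

Keeping the clustering rule and the assembler as parameters keeps the sketch independent of any particular tool; only the three-step structure is taken from the abstract.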

2.

Classification of metagenomic sequences: methods and challenges.

Mande, Sharmila S; Mohammed, Monzoorul Haque; Ghosh, Tarini Shankar.

Brief Bioinform ; 13(6): 669-81, 2012 Nov.

Article in English | MEDLINE | ID: mdl-22962338

ABSTRACT

Characterizing the taxonomic diversity of microbial communities is one of the primary objectives of metagenomic studies. Taxonomic analysis of microbial communities, a process referred to as binning, is challenging for the following reasons. Primarily, query sequences originating from the genomes of most microbes in an environmental sample lack taxonomically related sequences in existing reference databases. This absence of a taxonomic context makes binning a very challenging task. Limitations of current sequencing platforms, with respect to short read lengths and sequencing errors/artifacts, are also key factors that determine the overall binning efficiency. Furthermore, the sheer volume of metagenomic datasets also demands highly efficient algorithms that can operate within reasonable requirements of compute power. This review discusses the premise, methodologies, advantages, limitations and challenges of various methods available for binning of metagenomic datasets obtained using the shotgun sequencing approach. Various parameters as well as strategies used for evaluating binning efficiency are then reviewed.

Subject(s)

Metagenome , Algorithms , Databases, Genetic , Metagenomics , Sequence Analysis, DNA/methods

3.

BIND - an algorithm for loss-less compression of nucleotide sequence data.

Bose, Tungadri; Mohammed, Monzoorul Haque; Dutta, Anirban; Mande, Sharmila S.

J Biosci ; 37(4): 785-9, 2012 Sep.

Article in English | MEDLINE | ID: mdl-22922203

ABSTRACT

Recent advances in DNA sequencing technologies have enabled the current generation of life science researchers to probe deeper into the genomic blueprint. The amount of data generated by these technologies has been increasing exponentially over the last decade. Storage, archival and dissemination of such huge data sets require efficient solutions, from both the hardware and the software perspective. The present paper describes BIND, an algorithm specialized for compressing nucleotide sequence data. By adopting a unique 'block-length' encoding for representing binary data (as a key step), BIND achieves significant compression gains as compared to the widely used general purpose compression algorithms (gzip, bzip2 and lzma). Moreover, in contrast to implementations of existing specialized genomic compression approaches, the implementation of BIND can handle non-ATGC and lowercase characters. This makes BIND a loss-less compression approach that is suitable for practical use. More importantly, validation results of BIND (with real-world data sets) indicate reasonable speeds of compression and decompression that can be achieved with minimal processor/memory usage. BIND is available for download at http://metagenomics.atc.tcs.com/compression/BIND. No license is required for academic or non-profit use.

Subject(s)

Algorithms , Data Compression/methods , Information Storage and Retrieval , Sequence Analysis, DNA , Base Sequence , Computing Methodologies , Software
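The general purpose baselines named in this abstract (gzip, bzip2 and lzma) are all available in the Python standard library, so the kind of size comparison it reports can be sketched as below. This does not reproduce BIND or its 'block-length' encoding, whose details are not given here; the function names `baseline_sizes` and `bits_per_base` are illustrative assumptions.

```python
import bz2
import gzip
import lzma


def baseline_sizes(sequence: str) -> dict:
    """Return the byte size of a sequence, raw and after each general purpose compressor."""
    raw = sequence.encode("ascii")
    return {
        "raw": len(raw),
        "gzip": len(gzip.compress(raw)),
        "bzip2": len(bz2.compress(raw)),
        "lzma": len(lzma.compress(raw)),
    }


def bits_per_base(compressed_bytes: int, n_bases: int) -> float:
    """Express a compressed size as bits per nucleotide."""
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    return 8 * compressed_bytes / n_bases
```

A useful reference point when reading such numbers: a sequence made up only of A, C, G and T can be stored at 2 bits per base with a fixed-width code, so a specialized compressor is normally judged against that figure as well as against the general purpose tools.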

4.

DELIMINATE--a fast and efficient method for loss-less compression of genomic sequences: sequence analysis.

Mohammed, Monzoorul Haque; Dutta, Anirban; Bose, Tungadri; Chadaram, Sudha; Mande, Sharmila S.

Bioinformatics ; 28(19): 2527-9, 2012 Oct 01.

Article in English | MEDLINE | ID: mdl-22833526

ABSTRACT

SUMMARY: An unprecedented quantity of genome sequence data is currently being generated using next-generation sequencing platforms. This has necessitated the development of novel bioinformatics approaches and algorithms that not only facilitate a meaningful analysis of these data but also aid in efficient compression, storage, retrieval and transmission of huge volumes of the generated data. We present a novel compression algorithm (DELIMINATE) that can rapidly compress genomic sequence data in a loss-less fashion. Validation results indicate relatively higher compression efficiency of DELIMINATE when compared with popular general purpose compression algorithms, namely, gzip, bzip2 and lzma. AVAILABILITY AND IMPLEMENTATION: Linux, Windows and Mac implementations (both 32 and 64-bit) of DELIMINATE are freely available for download at: http://metagenomics.atc.tcs.com/compression/DELIMINATE. CONTACT: sharmila@atc.tcs.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Computational Biology/methods , Data Compression/methods , Genomics/methods , Base Sequence , Sequence Analysis, DNA/methods

5.

TWARIT: an extremely rapid and efficient approach for phylogenetic classification of metagenomic sequences.

Reddy, Rachamalla Maheedhar; Mohammed, Monzoorul Haque; Mande, Sharmila S.

Gene ; 505(2): 259-65, 2012 Sep 01.

Article in English | MEDLINE | ID: mdl-22710135

ABSTRACT

Phylogenetic assignment of individual sequence reads to their respective taxa, referred to as 'taxonomic binning', constitutes a key step of metagenomic analysis. Existing binning methods have limitations either with respect to time or accuracy/specificity of binning. Given these limitations, development of a method that can bin vast amounts of metagenomic sequence data in a rapid, efficient and computationally inexpensive manner can profoundly influence metagenomic analysis in computational resource poor settings. We introduce TWARIT, a hybrid binning algorithm that employs a combination of short-read alignment and composition-based signature sorting approaches to achieve rapid binning rates without compromising on binning accuracy and specificity. TWARIT is validated with simulated and real-world metagenomes and the results demonstrate significantly lower overall binning times compared to those of existing methods. Furthermore, the binning accuracy and specificity of TWARIT are observed to be comparable or superior to those of existing methods. A web server implementing the TWARIT algorithm is available at http://metagenomics.atc.tcs.com/Twarit/

Subject(s)

Algorithms , Metagenomics/methods , Phylogeny , Sequence Analysis, DNA/classification , Sequence Alignment

6.

C16S - a Hidden Markov Model based algorithm for taxonomic classification of 16S rRNA gene sequences.

Ghosh, Tarini Shankar; Gajjalla, Purnachander; Mohammed, Monzoorul Haque; Mande, Sharmila S.

Genomics ; 99(4): 195-201, 2012 Apr.

Article in English | MEDLINE | ID: mdl-22326741

ABSTRACT

Recent advances in high throughput sequencing technologies and concurrent refinements in 16S rDNA isolation techniques have facilitated the rapid extraction and sequencing of 16S rDNA content of microbial communities. The taxonomic affiliation of these 16S rDNA fragments is subsequently obtained using either BLAST-based or word frequency based approaches. However, the classification accuracy of such methods is observed to be limited in typical metagenomic scenarios, wherein a majority of organisms are hitherto unknown. In this study, we present a 16S rDNA classification algorithm, called C16S, that uses genus-specific Hidden Markov Models for taxonomic classification of 16S rDNA sequences. Results obtained using C16S have been compared with those of the widely used RDP classifier. The performance of the C16S algorithm was observed to be consistently higher than that of the RDP classifier. In some scenarios, this increase in accuracy is as high as 34%. A web-server for the C16S algorithm is available at http://metagenomics.atc.tcs.com/C16S/.

Subject(s)

Algorithms , Markov Chains , RNA, Ribosomal, 16S/classification , RNA, Ribosomal, 16S/genetics , DNA Fragmentation , Databases, Genetic , Metagenomics , Models, Biological , Proteobacteria/classification , Proteobacteria/genetics , Reproducibility of Results , Rhizobium/classification , Rhizobium/genetics , Sequence Alignment , Sequence Analysis, DNA/methods

7.

Eu-Detect: an algorithm for detecting eukaryotic sequences in metagenomic data sets.

Mohammed, Monzoorul Haque; Chadaram, Sudha; Komanduri, Dinakar; Ghosh, Tarini Shankar; Mande, Sharmila S.

J Biosci ; 36(4): 709-17, 2011 Sep.

Article in English | MEDLINE | ID: mdl-21857117

ABSTRACT

Physical partitioning techniques are routinely employed (during sample preparation stage) for segregating the prokaryotic and eukaryotic fractions of metagenomic samples. In spite of these efforts, several metagenomic studies focusing on bacterial and archaeal populations have reported the presence of contaminating eukaryotic sequences in metagenomic data sets. Contaminating sequences originate not only from genomes of micro-eukaryotic species but also from genomes of (higher) eukaryotic host cells. The latter scenario usually occurs in the case of host-associated metagenomes. Identification and removal of contaminating sequences is important, since these sequences not only impact estimates of microbial diversity but also affect the accuracy of several downstream analyses. Currently, the computational techniques used for identifying contaminating eukaryotic sequences, being alignment based, are slow, inefficient, and require huge computing resources. In this article, we present Eu-Detect, an alignment-free algorithm that can rapidly identify eukaryotic sequences contaminating metagenomic data sets. Validation results indicate that on a desktop with modest hardware specifications, the Eu-Detect algorithm is able to rapidly segregate DNA sequence fragments of prokaryotic and eukaryotic origin, with high sensitivity. A Web server for the Eu-Detect algorithm is available at http://metagenomics.atc.tcs.com/Eu-Detect/.

Subject(s)

Algorithms , Genome, Archaeal , Genome, Bacterial , Metagenome , Metagenomics/methods , Sequence Analysis, DNA/methods , Artifacts , Base Sequence , Eukaryota/genetics , Sensitivity and Specificity , Software

8.

Metagenome of the gut of a malnourished child.

Gupta, Sourav Sen; Mohammed, Monzoorul Haque; Ghosh, Tarini Shankar; Kanungo, Suman; Nair, Gopinath Balakrish; Mande, Sharmila S.

Gut Pathog ; 3: 7, 2011 May 20.

Article in English | MEDLINE | ID: mdl-21599906

ABSTRACT

BACKGROUND: Malnutrition, a major health problem, affects a significant proportion of preschool children in developing countries. The devastating consequences of malnutrition include diarrhoea, malabsorption, increased intestinal permeability, suboptimal immune response, etc. Nutritional interventions and dietary solutions have not been effective for treatment of malnutrition till date. Metagenomic procedures allow one to access the complex cross-talk between the gut and its microbial flora and understand how a different community composition affects various states of human health. In this study, a metagenomic approach was employed for analysing the differences between gut microbial communities obtained from a malnourished and an apparently healthy child. RESULTS: Our results indicate that the malnourished child gut has an abundance of enteric pathogens which are known to cause intestinal inflammation resulting in malabsorption of nutrients. We also identified a few functional sub-systems from these pathogens, which probably impact the overall metabolic capabilities of the malnourished child gut. CONCLUSION: The present study comprehensively characterizes the microbial community resident in the gut of a malnourished child. This study has attempted to extend the understanding of the basis of malnutrition beyond nutrition deprivation.

9.

ProViDE: A software tool for accurate estimation of viral diversity in metagenomic samples.

Ghosh, Tarini Shankar; Mohammed, Monzoorul Haque; Komanduri, Dinakar; Mande, Sharmila Shekhar.

Bioinformation ; 6(2): 91-4, 2011 Mar 26.

Article in English | MEDLINE | ID: mdl-21544173

ABSTRACT

Given the absence of universal marker genes in the viral kingdom, researchers typically use BLAST (with stringent E-values) for taxonomic classification of viral metagenomic sequences. Since the majority of metagenomic sequences originate from hitherto unknown viral groups, using stringent E-values results in most sequences remaining unclassified. Furthermore, using less stringent E-values results in a high number of incorrect taxonomic assignments. The SOrt-ITEMS algorithm provides an approach to address the above issues. Based on alignment parameters, SOrt-ITEMS follows an elaborate work-flow for assigning reads originating from hitherto unknown archaeal/bacterial genomes. In SOrt-ITEMS, alignment parameter thresholds were generated by observing patterns of sequence divergence within and across various taxonomic groups belonging to bacterial and archaeal kingdoms. However, many taxonomic groups within the viral kingdom lack a typical Linnaean-like taxonomic hierarchy. In this paper, we present ProViDE (Program for Viral Diversity Estimation), an algorithm that uses a customized set of alignment parameter thresholds, specifically suited for viral metagenomic sequences. These thresholds capture the pattern of sequence divergence and the non-uniform taxonomic hierarchy observed within/across various taxonomic groups of the viral kingdom. Validation results indicate that the percentage of 'correct' assignments by ProViDE is around 1.7 to 3 times higher than that by the widely used similarity based method MEGAN. The misclassification rate of ProViDE is around 3 to 19% (as compared to 5 to 42% by MEGAN) indicating significantly better assignment accuracy. ProViDE software and a supplementary file (containing supplementary figures and tables referred to in this article) are available for download from http://metagenomics.atc.tcs.com/binning/ProViDE/

10.

SPHINX--an algorithm for taxonomic binning of metagenomic sequences.

Mohammed, Monzoorul Haque; Ghosh, Tarini Shankar; Singh, Nitin Kumar; Mande, Sharmila S.

Bioinformatics ; 27(1): 22-30, 2011 Jan 01.

Article in English | MEDLINE | ID: mdl-21030462

ABSTRACT

MOTIVATION: Compared with composition-based binning algorithms, the binning accuracy and specificity of alignment-based binning algorithms are significantly higher. However, being alignment-based, the latter class of algorithms requires an enormous amount of time and computing resources for binning huge metagenomic datasets. The motivation was to develop a binning approach that can analyze metagenomic datasets as rapidly as composition-based approaches, but nevertheless has the accuracy and specificity of alignment-based algorithms. This article describes a hybrid binning approach (SPHINX) that achieves high binning efficiency by utilizing the principles of both 'composition'- and 'alignment'-based binning algorithms. RESULTS: Validation results with simulated sequence datasets indicate that SPHINX is able to analyze metagenomic sequences as rapidly as composition-based algorithms. Furthermore, the binning efficiency (in terms of accuracy and specificity of assignments) of SPHINX is observed to be comparable with results obtained using alignment-based algorithms. AVAILABILITY: A web server for the SPHINX algorithm is available at http://metagenomics.atc.tcs.com/SPHINX/.

Subject(s)

Algorithms , Metagenomics/methods , Animals , Cluster Analysis , Databases, Nucleic Acid , Gastrointestinal Tract/microbiology , Mice , Sensitivity and Specificity , Sequence Alignment

11.

INDUS - a composition-based approach for rapid and accurate taxonomic classification of metagenomic sequences.

Mohammed, Monzoorul Haque; Ghosh, Tarini Shankar; Reddy, Rachamalla Maheedhar; Reddy, Chennareddy Venkata Siva Kumar; Singh, Nitin Kumar; Mande, Sharmila S.

BMC Genomics ; 12 Suppl 3: S4, 2011 Nov 30.

Article in English | MEDLINE | ID: mdl-22369237

ABSTRACT

BACKGROUND: Taxonomic classification of metagenomic sequences is the first step in metagenomic analysis. Existing taxonomic classification approaches are of two types, similarity-based and composition-based. Similarity-based approaches, though accurate and specific, are extremely slow. Since metagenomic projects generate millions of sequences, adopting similarity-based approaches becomes virtually infeasible for research groups having modest computational resources. In this study, we present INDUS - a composition-based approach that incorporates the following novel features. First, INDUS discards the 'one genome-one composition' model adopted by existing compositional approaches. Second, INDUS uses 'compositional distance' information for identifying appropriate assignment levels. Third, INDUS incorporates steps that attempt to reduce biases due to database representation. RESULTS: INDUS is able to rapidly classify sequences in both simulated and real metagenomic sequence data sets with classification efficiency significantly higher than existing composition-based approaches. Although the classification efficiency of INDUS is observed to be comparable to that of similarity-based approaches, the binning time (as compared to alignment based approaches) is 23-33 times lower. CONCLUSION: Given its rapid execution time and high levels of classification efficiency, INDUS is expected to be of immense interest to researchers working in metagenomics and microbial ecology. AVAILABILITY: A web-server for the INDUS algorithm is available at http://metagenomics.atc.tcs.com/INDUS/

Subject(s)

Classification/methods , Databases, Factual , Metagenomics , Algorithms , Genome , Internet , User-Computer Interface

12.

i-rDNA: alignment-free algorithm for rapid in silico detection of ribosomal gene fragments from metagenomic sequence data sets.

Mohammed, Monzoorul Haque; Ghosh, Tarini Shankar; Chadaram, Sudha; Mande, Sharmila S.

BMC Genomics ; 12 Suppl 3: S12, 2011 Nov 30.

Article in English | MEDLINE | ID: mdl-22369265

ABSTRACT

BACKGROUND: Obtaining accurate estimates of microbial diversity using rDNA profiling is the first step in most metagenomics projects. Consequently, most metagenomic projects spend considerable amounts of time, money and manpower for experimentally cloning, amplifying and sequencing the rDNA content in a metagenomic sample. In the second step, the entire genomic content of the metagenome is extracted, sequenced and analyzed. Since DNA sequences obtained in this second step also contain rDNA fragments, rapid in silico identification of these rDNA fragments would drastically reduce the cost, time and effort of current metagenomic projects by entirely bypassing the experimental steps of primer based rDNA amplification, cloning and sequencing. In this study, we present an algorithm called i-rDNA that can facilitate the rapid detection of 16S rDNA fragments from amongst millions of sequences in metagenomic data sets with high detection sensitivity. RESULTS: Performance evaluation with data sets/database variants simulating typical metagenomic scenarios indicates the significantly high detection sensitivity of i-rDNA. Moreover, i-rDNA can process a million sequences in less than an hour on a simple desktop with modest hardware specifications. CONCLUSIONS: In addition to the speed of execution, high sensitivity and low false positive rate, the utility of the algorithmic approach discussed in this paper is immense given that it would help in bypassing the entire experimental step of primer-based rDNA amplification, cloning and sequencing. Application of this algorithmic approach would thus drastically reduce the cost, time and human efforts invested in all metagenomic projects. AVAILABILITY: A web-server for the i-rDNA algorithm is available at http://metagenomics.atc.tcs.com/i-rDNA/

Subject(s)

Algorithms , Metagenomics , RNA, Ribosomal, 16S/genetics , Cloning, Molecular , Databases, Genetic , Internet , Search Engine , Sequence Analysis, RNA

13.

HabiSign: a novel approach for comparison of metagenomes and rapid identification of habitat-specific sequences.

Ghosh, Tarini Shankar; Mohammed, Monzoorul Haque; Rajasingh, Hannah; Chadaram, Sudha; Mande, Sharmila S.

BMC Bioinformatics ; 12 Suppl 13: S9, 2011.

Article in English | MEDLINE | ID: mdl-22373355

ABSTRACT

BACKGROUND: One of the primary goals of comparative metagenomic projects is to study the differences in the microbial communities residing in diverse environments. Besides providing valuable insights into the inherent structure of the microbial populations, these studies have potential applications in several important areas of medical research like disease diagnostics, detection of pathogenic contamination and identification of hitherto unknown pathogens. Here we present a novel and rapid, alignment-free method called HabiSign, which utilizes patterns of tetra-nucleotide usage in microbial genomes to bring out the differences in the composition of both diverse and related microbial communities. RESULTS: Validation results show that the metagenomic signatures obtained using the HabiSign method are able to accurately cluster metagenomes at biome, phenotypic and species levels, as compared to an average tetranucleotide frequency based approach and the recently published dinucleotide relative abundance based approach. More importantly, the method is able to identify subsets of sequences that are specific to a particular habitat. Apart from this, being alignment-free, the method can rapidly compare and group multiple metagenomic data sets in a short span of time. CONCLUSIONS: The proposed method is expected to have immense applicability in diverse areas of metagenomic research ranging from disease diagnostics and pathogen detection to bio-prospecting. A web-server for the HabiSign algorithm is available at http://metagenomics.atc.tcs.com/HabiSign/.

Subject(s)

Algorithms , Bacterial Typing Techniques , Metagenome , Animals , Metagenomics
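As background for the tetra-nucleotide usage patterns mentioned in this abstract, the sketch below computes a plain tetranucleotide frequency vector for a sequence and a distance between two such vectors. It corresponds to the simple frequency-based kind of comparison the abstract treats as a baseline, not to the HabiSign signature itself, which is not specified here. The function names, the choice to skip windows containing characters other than A, C, G and T, and the use of Euclidean distance are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from typing import List

ALPHABET = "ACGT"
# All 4**4 = 256 tetranucleotides in a fixed lexicographic order.
TETRAMERS = ["".join(p) for p in product(ALPHABET, repeat=4)]


def tetranucleotide_frequencies(sequence: str) -> List[float]:
    """Return the relative frequency of each tetranucleotide (overlapping windows)."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous characters (for example N) are ignored.
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


def euclidean_distance(u: List[float], v: List[float]) -> float:
    """Distance between two frequency vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
```

Because no alignment is involved, the cost of building such a vector grows linearly with sequence length, which is the property that makes composition-based comparisons fast on large data sets.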
