Search | VHL Regional Portal

A RESTful API for accessing microbial community data for MG-RAST.

Wilke, Andreas; Bischof, Jared; Harrison, Travis; Brettin, Tom; D'Souza, Mark; Gerlach, Wolfgang; Matthews, Hunter; Paczian, Tobias; Wilkening, Jared; Glass, Elizabeth M; Desai, Narayan; Meyer, Folker.

PLoS Comput Biol ; 11(1): e1004008, 2015 Jan.

Article in English | MEDLINE | ID: mdl-25569221

ABSTRACT

Metagenomic sequencing has produced significant amounts of data in recent years. For example, as of summer 2013, MG-RAST has been used to annotate over 110,000 data sets totaling over 43 Terabases. With metagenomic sequencing finding even wider adoption in the scientific community, the existing web-based analysis tools and infrastructure in MG-RAST provide limited capability for data retrieval and analysis, such as comparative analysis between multiple data sets. Moreover, although the system provides many analysis tools, it is not comprehensive. By opening MG-RAST up via a web services API (application programmers interface) we have greatly expanded access to MG-RAST data, as well as provided a mechanism for the use of third-party analysis tools with MG-RAST data. This RESTful API makes all data and data objects created by the MG-RAST pipeline accessible as JSON objects. As part of the DOE Systems Biology Knowledgebase project (KBase, http://kbase.us) we have implemented a web services API for MG-RAST. This API complements the existing MG-RAST web interface and constitutes the basis of KBase's microbial community capabilities. In addition, the API exposes a comprehensive collection of data to programmers. This API, which uses a RESTful (Representational State Transfer) implementation, is compatible with most programming environments and should be easy to use for end users and third parties. It provides comprehensive access to sequence data, quality control results, annotations, and many other data types. Where feasible, we have used standards to expose data and metadata. Code examples are provided in a number of languages both to show the versatility of the API and to provide a starting point for users. We present an API that exposes the data in MG-RAST for consumption by our users, greatly enhancing the utility of the MG-RAST service.
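As a rough illustration of the access pattern the abstract describes (REST resources returned as JSON objects), the following Python sketch builds a request URL and decodes a response body. The base URL, the `metagenome` resource name, the `verbosity` parameter, and the sample payload are assumptions made for this example and are not taken from the abstract; the sketch works offline on a made-up response and does not contact the service.

```python
import json
from urllib.parse import quote, urlencode

# Assumed base URL, for illustration only; the abstract does not state it.
API_BASE = "https://api.mg-rast.org"


def build_url(resource, object_id, **params):
    """Compose a REST-style URL of the form BASE/resource/id?key=value."""
    url = f"{API_BASE}/{quote(resource)}/{quote(object_id)}"
    if params:
        # Sort the parameters so the resulting URL is deterministic.
        url += "?" + urlencode(sorted(params.items()))
    return url


def parse_object(payload):
    """Decode a JSON response body and check that it is a JSON object."""
    obj = json.loads(payload)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    return obj


# Offline demonstration with a hypothetical identifier and response body.
sample_body = '{"id": "mgm0000000.0", "name": "example"}'
url = build_url("metagenome", "mgm0000000.0", verbosity="minimal")
record = parse_object(sample_body)
print(url)
print(record["name"])
```

In practice the URL would be passed to any HTTP client, and the decoded dictionary would carry whatever fields the service actually returns.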

Subject(s)

Database Management Systems; Databases, Genetic; Genome, Bacterial/genetics; Metagenomics/methods; User-Computer Interface; Internet; Molecular Sequence Annotation/methods; Software

A metagenomics portal for a democratized sequencing world.

Wilke, Andreas; Glass, Elizabeth M; Bartels, Daniela; Bischof, Jared; Braithwaite, Daniel; D'Souza, Mark; Gerlach, Wolfgang; Harrison, Travis; Keegan, Kevin; Matthews, Hunter; Kottmann, Renzo; Paczian, Tobias; Tang, Wei; Trimble, William L; Yilmaz, Pelin; Wilkening, Jared; Desai, Narayan; Meyer, Folker.

Methods Enzymol ; 531: 487-523, 2013.

Article in English | MEDLINE | ID: mdl-24060134

ABSTRACT

The democratized world of sequencing is leading to numerous data analysis challenges; MG-RAST addresses many of these challenges for diverse datasets, including amplicon datasets, shotgun metagenomes, and metatranscriptomes. The changes from version 2 to version 3 include the addition of a dedicated gene calling stage using FragGeneScan, clustering of predicted proteins at 90% identity, and the use of BLAT for the computation of similarities. Together with changes in the underlying software infrastructure, this has enabled the dramatic scaling up of pipeline throughput while remaining on a limited hardware budget. The Web-based service allows upload, fully automated analysis, and visualization of results. As a result of the plummeting cost of sequencing and the readily available analytical power of MG-RAST, over 78,000 metagenomic datasets have been analyzed, with over 12,000 of them publicly available in MG-RAST.

Subject(s)

Computational Biology/methods; Metagenomics; Software; Bacteria/classification; Bacteria/genetics; Genome, Bacterial; High-Throughput Nucleotide Sequencing; Internet

Short-read reading-frame predictors are not created equal: sequence error causes loss of signal.

Trimble, William L; Keegan, Kevin P; D'Souza, Mark; Wilke, Andreas; Wilkening, Jared; Gilbert, Jack; Meyer, Folker.

BMC Bioinformatics ; 13: 183, 2012 Jul 28.

Article in English | MEDLINE | ID: mdl-22839106

ABSTRACT

BACKGROUND: Gene prediction algorithms (or gene callers) are an essential tool for analyzing shotgun nucleic acid sequence data. Gene prediction is a ubiquitous step in sequence analysis pipelines; it reduces the volume of data by identifying the most likely reading frame for a fragment, permitting the out-of-frame translations to be ignored. In this study we evaluate five widely used ab initio gene-calling algorithms (FragGeneScan, MetaGeneAnnotator, MetaGeneMark, Orphelia, and Prodigal) for accuracy on short (75-1000 bp) fragments containing sequence error from previously published artificial data and "real" metagenomic datasets. RESULTS: While gene prediction tools have similar accuracies predicting genes on error-free fragments, in the presence of sequencing errors considerable differences between tools become evident. For error-containing short reads, FragGeneScan finds more prokaryotic coding regions than do MetaGeneAnnotator, MetaGeneMark, Orphelia, or Prodigal. This improved detection of genes in error-containing fragments, however, comes at the cost of much lower (50%) specificity and overprediction of genes in noncoding regions. CONCLUSIONS: Ab initio gene callers offer a significant reduction in the computational burden of annotating individual nucleic acid reads and are used in many metagenomic annotation systems. For predicting reading frames on raw reads, we find the hidden Markov model approach in FragGeneScan is more sensitive than other gene prediction tools, while Prodigal, MetaGeneAnnotator (MGA), and MetaGeneMark (MGM) are better suited for higher-quality sequences such as assembled contigs.

Subject(s)

Metagenomics/methods; Molecular Sequence Annotation/methods; Reading Frames; Sequence Analysis, DNA/methods; Algorithms; Base Sequence

Report of the 13th Genomic Standards Consortium Meeting, Shenzhen, China, March 4-7, 2012.

Gilbert, Jack A; Bao, Yiming; Wang, Hui; Sansone, Susanna-Assunta; Edmunds, Scott C; Morrison, Norman; Meyer, Folker; Schriml, Lynn M; Davies, Neil; Sterk, Peter; Wilkening, Jared; Garrity, George M; Field, Dawn; Robbins, Robert; Smith, Daniel P; Mizrachi, Ilene; Moreau, Corrie.

Stand Genomic Sci ; 6(2): 276-86, 2012 May 25.

Article in English | MEDLINE | ID: mdl-22768370

ABSTRACT

This report details the outcome of the 13th Meeting of the Genomic Standards Consortium. The three-day conference was held at the Kingkey Palace Hotel, Shenzhen, China, on March 5-7, 2012, and was hosted by the Beijing Genomics Institute. The meeting, titled "From Genomes to Interactions to Communities to Models", highlighted the role of data standards associated with genomic, metagenomic, and amplicon sequence data and the contextual information associated with the sample. To this end the meeting focused on genomic projects for animals, plants, fungi, and viruses; metagenomic studies in host-microbe interactions; and the dynamics of microbial communities. In addition, the meeting hosted a Genomic Observatories Network session, a Genomic Standards Consortium biodiversity working group session, and a Microbiology of the Built Environment session sponsored by the Alfred P. Sloan Foundation.

A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE.

Keegan, Kevin P; Trimble, William L; Wilkening, Jared; Wilke, Andreas; Harrison, Travis; D'Souza, Mark; Meyer, Folker.

PLoS Comput Biol ; 8(6): e1002541, 2012.

Article in English | MEDLINE | ID: mdl-22685393

ABSTRACT

We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as "noise" or "error") within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or the use of scores (e.g. Phred). Here, DRISEE is applied to (non-amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and utilize it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.
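The idea of deriving positional and global error estimates from a set of duplicate reads can be sketched in Python as follows. This is a simplified illustration, not the published DRISEE procedure: the abstract does not say how ADRs are identified or how mismatches are scored, so the sketch assumes one bin of duplicates is already given, treats the most common base in each column as the consensus, and reports the fraction of reads that disagree with it.

```python
from collections import Counter


def positional_error(reads):
    """Per-position fraction of reads that differ from the column consensus.

    Assumption for this sketch: `reads` is one bin of duplicate reads, and
    the consensus at each position is simply the most common base.
    """
    if not reads:
        return []
    length = min(len(r) for r in reads)  # compare only the shared prefix
    rates = []
    for i in range(length):
        column = [r[i] for r in reads]
        consensus_count = Counter(column).most_common(1)[0][1]
        rates.append(1 - consensus_count / len(column))
    return rates


def global_error(reads):
    """Whole-bin estimate: the mean of the positional rates."""
    rates = positional_error(reads)
    return sum(rates) / len(rates) if rates else 0.0


# Made-up bin of four duplicate reads with two mismatching bases.
duplicates = ["ACGTAC", "ACGTAC", "ACGTTC", "ACGAAC"]
per_position = positional_error(duplicates)
overall = global_error(duplicates)
print(per_position)
print(overall)
```

A positional profile of this kind is what would inform read trimming, while the single global figure is what would be compared across samples.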

Subject(s)

Metagenomics/statistics & numerical data; Sequence Analysis/statistics & numerical data; Computational Biology; Data Interpretation, Statistical; Genomics/statistics & numerical data; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans

The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools.

Wilke, Andreas; Harrison, Travis; Wilkening, Jared; Field, Dawn; Glass, Elizabeth M; Kyrpides, Nikos; Mavrommatis, Konstantinos; Meyer, Folker.

BMC Bioinformatics ; 13: 141, 2012 Jun 21.

Article in English | MEDLINE | ID: mdl-22720753

ABSTRACT

BACKGROUND: Computing of sequence similarity results is becoming a limiting factor in metagenome analysis. Sequence similarity search results encoded in an open, exchangeable format have the potential to limit the needs for computational reanalysis of these data sets. A prerequisite for sharing of similarity results is a common reference. DESCRIPTION: We introduce a mechanism for automatically maintaining a comprehensive, non-redundant protein database and for creating a quarterly release of this resource. In addition, we present tools for translating similarity searches into many annotation namespaces, e.g. KEGG or NCBI's GenBank. CONCLUSIONS: The data and tools we present allow the creation of multiple result sets using a single computation, permitting computational results to be shared between groups for large sequence data sets.
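One way to picture a non-redundant database whose entries can be translated into several annotation namespaces is the Python sketch below. It is an assumption-laden illustration rather than the M5nr implementation: keying sequences by an MD5 digest is not stated in the abstract, and the record layout, function names, and identifiers are hypothetical.

```python
import hashlib
from collections import defaultdict


def sequence_key(seq):
    """Digest of the normalized sequence, used as the non-redundant key.

    Assumption for this sketch: identical sequences are detected by MD5.
    """
    return hashlib.md5(seq.strip().upper().encode("ascii")).hexdigest()


def build_nr(records):
    """Collapse (source, identifier, sequence) records into unique sequences.

    Returns the unique sequences by key and, per key, one identifier for
    each source namespace (a simplification: one identifier per source).
    """
    nr = {}
    annotations = defaultdict(dict)
    for source, identifier, seq in records:
        key = sequence_key(seq)
        nr.setdefault(key, seq.strip().upper())
        annotations[key][source] = identifier
    return nr, annotations


def translate(key, annotations, namespace):
    """Map a hit against the non-redundant set into one namespace, if any."""
    return annotations.get(key, {}).get(namespace)


# Made-up records: the first two share a sequence across two sources.
records = [
    ("KEGG", "k:1", "MKV"),
    ("GenBank", "gb:9", "mkv"),
    ("KEGG", "k:2", "MAA"),
]
nr, annotations = build_nr(records)
hit = sequence_key("MKV")
print(len(nr))
print(translate(hit, annotations, "GenBank"))
```

Because a similarity search is run once against the unique sequences, the same hit can be reported in any namespace afterwards without recomputation, which is the sharing benefit the abstract points to.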

Subject(s)

Databases, Protein; Software; Computational Biology; Databases, Nucleic Acid; Metagenomics; Proteins/chemistry; Proteins/genetics

Connecting genotype to phenotype in the era of high-throughput sequencing.

Henry, Christopher S; Overbeek, Ross; Xia, Fangfang; Best, Aaron A; Glass, Elizabeth; Gilbert, Jack; Larsen, Peter; Edwards, Rob; Disz, Terry; Meyer, Folker; Vonstein, Veronika; Dejongh, Matthew; Bartels, Daniela; Desai, Narayan; D'Souza, Mark; Devoid, Scott; Keegan, Kevin P; Olson, Robert; Wilke, Andreas; Wilkening, Jared; Stevens, Rick L.

Biochim Biophys Acta ; 1810(10): 967-77, 2011 Oct.

Article in English | MEDLINE | ID: mdl-21421023

ABSTRACT

BACKGROUND: The development of next generation sequencing technology is rapidly changing the face of the genome annotation and analysis field. One of the primary uses for genome sequence data is to improve our understanding and prediction of phenotypes for microbes and microbial communities, but the technologies for predicting phenotypes must keep pace with the new sequences emerging. SCOPE OF REVIEW: This review presents an integrated view of the methods and technologies used in the inference of phenotypes for microbes and microbial communities based on genomic and metagenomic data. Given the breadth of this topic, we place special focus on the resources available within the SEED Project. We discuss the two steps involved in connecting genotype to phenotype: sequence annotation, and phenotype inference, and we highlight the challenges in each of these steps when dealing with both single genome and metagenome data. MAJOR CONCLUSIONS: This integrated view of the genotype-to-phenotype problem highlights the importance of a controlled ontology in the annotation of genomic data, as this benefits subsequent phenotype inference and metagenome annotation. We also note the importance of expanding the set of reference genomes to improve the annotation of all sequence data, and we highlight metagenome assembly as a potential new source for complete genomes. Finally, we find that phenotype inference, particularly from metabolic models, generates predictions that can be validated and reconciled to improve annotations. GENERAL SIGNIFICANCE: This review presents the first look at the challenges and opportunities associated with the inference of phenotype from genotype during the next generation sequencing revolution. This article is part of a Special Issue entitled: Systems Biology of Microorganisms.

Subject(s)

Genotype; Phenotype; Sequence Analysis, DNA/methods; Animals; Humans; Metagenomics/methods

Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes.

Glass, Elizabeth M; Wilkening, Jared; Wilke, Andreas; Antonopoulos, Dionysios; Meyer, Folker.

Cold Spring Harb Protoc ; 2010(1): pdb.prot5368, 2010 Jan.

Article in English | MEDLINE | ID: mdl-20150127

Subject(s)

Computational Biology/methods; Genetic Techniques; Metagenomics; DNA/metabolism; Databases, Genetic; Genome; Internet; Phylogeny; Programming Languages
