Search | VHL Regional Portal

Classifying proteins into functional groups based on all-versus-all BLAST of 10 million proteins.

Kolker, Natali; Higdon, Roger; Broomall, William; Stanberry, Larissa; Welch, Dean; Lu, Wei; Haynes, Winston; Barga, Roger; Kolker, Eugene.

OMICS ; 15(7-8): 513-21, 2011.

Article in English | MEDLINE | ID: mdl-21809957

ABSTRACT

To address the monumental challenge of assigning function to millions of sequenced proteins, we completed a first-of-its-kind all-versus-all sequence alignment using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length-normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences, resulting in over 1 million proteins being assigned to existing KOG groups and the remainder clustered into 100,000 functional groups.
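The single-linkage step described in the abstract can be illustrated with a small in-memory sketch. This is not the authors' Hadoop/MapReduce implementation: it uses a union-find structure on one machine, and the abstract does not give the LNBS formula or the cutoff, so the normalization (bit score divided by the longer sequence length), the `threshold` value, and all function and protein names below are assumptions made only for illustration.

```python
from collections import defaultdict


def normalized_score(bit_score, len_a, len_b):
    # ASSUMPTION: divide the bit score by the longer sequence length.
    # The abstract names a length-normalized bit score but does not define it.
    return bit_score / max(len_a, len_b)


def single_linkage(hits, lengths, threshold):
    """Group proteins so that any pair linked by a hit at or above
    the threshold ends up in the same group (single linkage)."""
    parent = {p: p for p in lengths}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, bits in hits:
        if normalized_score(bits, lengths[a], lengths[b]) >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two groups

    groups = defaultdict(set)
    for p in lengths:
        groups[find(p)].add(p)
    # Return sorted lists so the result is deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical toy data: protein lengths and pairwise BLAST bit scores.
lengths = {"P1": 100, "P2": 120, "P3": 110, "P4": 300, "P5": 90}
hits = [
    ("P1", "P2", 150.0),
    ("P2", "P3", 140.0),
    ("P3", "P4", 60.0),
    ("P4", "P5", 30.0),
]
groups = single_linkage(hits, lengths, threshold=1.0)
print(groups)
```

With this toy input, P1, P2 and P3 are chained into one group through their pairwise hits even though P1 and P3 share no direct hit, which is the defining behavior of single linkage; P4 and P5 stay as singletons because their normalized scores fall below the assumed cutoff.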

Subject(s)

Proteins/classification , Databases, Protein , Proteins/chemistry , Proteins/metabolism

Bioinformatics and data-intensive scientific discovery in the beginning of the 21st century.

Barga, Roger; Howe, Bill; Beck, David; Bowers, Stuart; Dobyns, William; Haynes, Winston; Higdon, Roger; Howard, Chris; Roth, Christian; Stewart, Elizabeth; Welch, Dean; Kolker, Eugene.

OMICS ; 15(4): 199-201, 2011 Apr.

Article in English | MEDLINE | ID: mdl-21476840

ABSTRACT

This article is a summary of the bioinformatics issues and challenges of data-intensive science as discussed in the NSF-funded Data-Intensive Science (DIS) workshop in Seattle, September 19-20, 2010.

Subject(s)

Biological Science Disciplines/methods , Computational Biology/methods , Computational Biology/trends
