Search | VHL Regional Portal

Enhancing navigation in biomedical databases by community voting and database-driven text classification.

Duchrow, Timo; Shtatland, Timur; Guettler, Daniel; Pivovarov, Misha; Kramer, Stefan; Weissleder, Ralph.

BMC Bioinformatics ; 10: 317, 2009 Oct 03.

Article in English | MEDLINE | ID: mdl-19799796

ABSTRACT

BACKGROUND: The breadth of biological databases and their information content continue to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries and to efficiently retrieve them. RESULTS: Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature results into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best. No other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly.
CONCLUSION: Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at http://pepbank.mgh.harvard.edu.
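
The abstract's central idea, a bagged ensemble whose vote fraction doubles as a class probability estimate, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it substitutes one-word decision stumps for full decision trees, and the toy documents, labels, and function names (`train_stump`, `train_bagged`, `predict_proba`) are all assumptions made for the example.

```python
import random

def train_stump(docs, labels):
    """Return (word, polarity): the one-word rule with the most correct
    predictions on the given sample. Assumes at least one non-empty document."""
    vocab = sorted({w for d in docs for w in d})
    best_score, best_rule = -1, None
    for word in vocab:
        # hits: documents where "word is present" agrees with a positive label
        hits = sum((word in d) == (y == 1) for d, y in zip(docs, labels))
        # polarity 1: presence predicts class 1; polarity 0: presence predicts class 0
        for polarity, score in ((1, hits), (0, len(docs) - hits)):
            if score > best_score:
                best_score, best_rule = score, (word, polarity)
    return best_rule

def stump_predict(rule, doc):
    """Apply a one-word rule to a document given as a set of words."""
    word, polarity = rule
    return polarity if word in doc else 1 - polarity

def train_bagged(docs, labels, n_models=25, seed=0):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(docs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(train_stump([docs[i] for i in idx],
                                  [labels[i] for i in idx]))
    return models

def predict_proba(models, doc):
    """Fraction of ensemble members voting for class 1."""
    return sum(stump_predict(m, doc) for m in models) / len(models)

# Hypothetical toy data: 1 = related to cancer, 0 = not related.
TRAIN_DOCS = [
    {"tumor", "metastasis"}, {"tumor", "oncogene"}, {"tumor", "carcinoma"},
    {"protein", "folding"}, {"protein", "membrane"}, {"protein", "enzyme"},
]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    ensemble = train_bagged(TRAIN_DOCS, TRAIN_LABELS)
    for doc in ({"tumor", "oncogene"}, {"protein", "enzyme"}, {"tumor", "protein"}):
        print(sorted(doc), predict_proba(ensemble, doc))
```

A value near 0 or 1 indicates that the ensemble members agree, while intermediate values flag entries where the classification is less certain. A graded estimate of this kind is what a heat map could display, and such uncertain entries are where user votes would be most useful.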

Subject(s)

Computational Biology/methods , Databases, Factual , Information Storage and Retrieval , Classification , Internet

PepBank--a database of peptides based on sequence text mining and public peptide data sources.

Shtatland, Timur; Guettler, Daniel; Kossodo, Misha; Pivovarov, Misha; Weissleder, Ralph.

BMC Bioinformatics ; 8: 280, 2007 Aug 01.

Article in English | MEDLINE | ID: mdl-17678535

ABSTRACT

BACKGROUND: Peptides are important molecules with diverse biological functions and biomedical uses. To date, no single, searchable archive exists for peptide sequences or associated biological data. Rather, peptide sequences still have to be mined from abstracts and full-length articles, and/or obtained from fragmented public sources. DESCRIPTION: We have constructed a new database (PepBank), which at the time of writing contains a total of 19,792 individual peptide entries. The database has a web-based user interface with a simple, Google-like search function, advanced text search, and BLAST and Smith-Waterman search capabilities. The major source of peptide sequence data is text mining of MEDLINE abstracts. Another component of the database is the peptide sequence data from public sources (ASPD and UniProt). An additional, smaller part of the database is manually curated from sets of full text articles and text mining results. We show the utility of the database in different examples of affinity ligand discovery. CONCLUSION: We have created and maintain a database of peptide sequences. The database has biological and medical applications, for example, to predict the binding partners of biologically interesting peptides, to develop peptide based therapeutic or diagnostic agents, or to predict molecular targets or binding specificities of peptides resulting from phage display selection. The database is freely available on http://pepbank.mgh.harvard.edu/, and the text mining source code (Peptide::Pubmed) is freely available at the same address as well as on CPAN (http://www.cpan.org/).
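
The Smith-Waterman search mentioned in the abstract ranks entries by their best local alignment with a query sequence. The sketch below illustrates the scoring recurrence only and is not PepBank's code: the flat match, mismatch, and gap values are assumptions chosen for readability (production tools typically use a substitution matrix such as BLOSUM62 with affine gap penalties), and `rank_peptides` is a hypothetical helper.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, since only the score is needed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def rank_peptides(query, entries):
    """Return (sequence, score) pairs sorted by descending score.

    The sort is stable, so ties keep their input order.
    """
    scored = [(seq, smith_waterman_score(query, seq)) for seq in entries]
    return sorted(scored, key=lambda pair: -pair[1])

if __name__ == "__main__":
    # Illustrative sequences only; not taken from the database.
    for seq, score in rank_peptides("RGD", ["YIGSR", "CRGDKGPDC", "GRGDSP"]):
        print(seq, score)
```

Because the score is floored at zero, a short query such as a three-residue motif scores highly against any entry that contains it, regardless of the flanking residues.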

Subject(s)

Database Management Systems , Databases, Protein , Internet , Natural Language Processing , Peptides/chemistry , Sequence Analysis, Protein/methods , User-Computer Interface , Amino Acid Sequence , Information Storage and Retrieval/methods , Molecular Sequence Data , Periodicals as Topic
