1.
Bioinformatics ; 28(16): 2154-61, 2012 Aug 15.
Article in English | MEDLINE | ID: mdl-22711795

ABSTRACT

MOTIVATION: Although the amount of data in biology is rapidly increasing, critical information for understanding biological events like phosphorylation or gene expression remains locked in the biomedical literature. Most current text mining (TM) approaches to extract information about biological events are focused on either limited-scale studies and/or abstracts, with data extracted lacking context and rarely available to support further research. RESULTS: Here we present BioContext, an integrated TM system which extracts, extends and integrates results from a number of tools performing entity recognition, biomolecular event extraction and contextualization. Application of our system to 10.9 million MEDLINE abstracts and 234 000 open-access full-text articles from PubMed Central yielded over 36 million mentions representing 11.4 million distinct events. Event participants included over 290 000 distinct genes/proteins that are mentioned more than 80 million times and linked where possible to Entrez Gene identifiers. Over a third of events contain contextual information such as the anatomical location of the event occurrence or whether the event is reported as negated or speculative. AVAILABILITY: The BioContext pipeline is available for download (under the BSD license) at http://www.biocontext.org, along with the extracted data which is also available for online browsing.


Subject(s)
Biochemical Phenomena , Computational Biology/methods , Data Mining , Software , MEDLINE , PubMed
2.
Database (Oxford) ; 2012: bas023, 2012.
Article in English | MEDLINE | ID: mdl-22529179

ABSTRACT

Manual curation has long been used for extracting key information found within the primary literature for input into biological databases. The human immunodeficiency virus type 1 (HIV-1), human protein interaction database (HHPID), for example, contains 2589 manually extracted interactions, linked to 14,312 mentions in 3090 articles. The advancement of text-mining (TM) techniques has offered a possibility to rapidly retrieve such data from large volumes of text to a high degree of accuracy. Here, we present a recreation of the HHPID using the current state of the art in TM. To retrieve interactions, we performed gene/protein named entity recognition (NER) and applied two molecular event extraction tools on all abstracts and titles cited in the HHPID. Our best NER scores for precision, recall and F-score were 87.5%, 90.0% and 88.6%, respectively, while event extraction achieved 76.4%, 84.2% and 80.1%, respectively. We demonstrate that over 50% of the HHPID interactions can be recreated from abstracts and titles. Furthermore, from 49 available open-access full-text articles, we extracted a total of 237 unique HIV-1-human interactions, as opposed to 187 interactions recorded in the HHPID from the same articles. On average, we extracted 23 times more mentions of interactions and events from a full-text article than from an abstract and title, with a 6-fold increase in the number of unique interactions. We further demonstrated that more frequently occurring interactions extracted by TM are more likely to be true positives. Overall, the results demonstrate that TM was able to recover a large proportion of interactions, many of which were found within the HHPID, making TM a useful assistant in the manual curation process. Finally, we also retrieved other types of interactions in the context of HIV-1 that are not currently present in the HHPID, thus, expanding the scope of this data set. All data is available at http://gnode1.mib.man.ac.uk/HIV1-text-mining.
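The F-scores quoted in this abstract can be checked against the reported precision and recall. The sketch below assumes the F-score is the balanced F1 measure (harmonic mean of precision and recall); the abstract does not say which variant was used, and the function name is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (assumed F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
ner_f = f_score(87.5, 90.0)      # gene/protein named entity recognition
event_f = f_score(76.4, 84.2)    # molecular event extraction

print(f"NER F-score:   {ner_f:.2f}")
print(f"Event F-score: {event_f:.2f}")
```

Under that assumption the event extraction figure rounds to the reported 80.1%, and the NER figure lands within about 0.1 percentage point of the reported 88.6%, presumably because the published precision and recall are themselves rounded.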


Subject(s)
Data Mining/methods , Databases, Protein , HIV-1/physiology , Protein Interaction Mapping/methods , Host-Pathogen Interactions , Humans , Molecular Sequence Annotation/methods , Protein Interaction Domains and Motifs , Proteins/chemistry , Proteins/metabolism , Viral Proteins/chemistry , Viral Proteins/metabolism
3.
BMC Bioinformatics ; 12 Suppl 8: S2, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151901

ABSTRACT

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
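The improvement percentages quoted for the composite system follow from the TAP-k scores given in the abstract. A minimal sketch of that arithmetic, using only the reported numbers (the variable and function names are illustrative):

```python
# TAP-k scores on the gold standard, as reported in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


improvements = {
    k: relative_improvement(composite[k], best_team[k]) for k in best_team
}

for k, pct in improvements.items():
    print(f"k={k}: {pct:.1f}%")
```

Rounded to one decimal place, these reproduce the 12.4%, 21.8% and 26.6% stated above.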


Subject(s)
Algorithms , Data Mining/methods , Genes , Animals , Data Mining/standards , Humans , National Library of Medicine (U.S.) , Periodicals as Topic , United States
4.
PLoS One ; 6(9): e24716, 2011.
Article in English | MEDLINE | ID: mdl-21980353

ABSTRACT

BACKGROUND: The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms dedicated teams manually curate publications about genes; however for species with no such dedicated staff many thousands of articles are never mapped to genes or genomic regions. METHODOLOGY/PRINCIPAL FINDINGS: To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several sources of curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features. Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data. 
CONCLUSION/SIGNIFICANCE: By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome informatics inspired solution to accessing the ever-increasing biomedical literature.


Subject(s)
Computational Biology/methods , PubMed , Animals , Chromosome Mapping/methods , Data Mining , Database Management Systems , Databases, Genetic , Genome , Genomics , Humans , Information Storage and Retrieval , MEDLINE , Models, Biological , Models, Genetic , Software , User-Computer Interface
5.
Bioinformatics ; 27(19): 2769-71, 2011 Oct 01.
Article in English | MEDLINE | ID: mdl-21813477

ABSTRACT

SUMMARY: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of Gnat achieves a Tap-20 score of 0.1987. AVAILABILITY: The library and web services are implemented in Java and the sources are available from http://gnat.sourceforge.net. CONTACT: jorg.hakenberg@roche.com.


Subject(s)
Data Mining , Gene Library , Electronic Data Processing , Genes , Internet , Proteins , Publishing , Terminology as Topic
6.
Bioinformatics ; 27(7): 980-6, 2011 Apr 01.
Article in English | MEDLINE | ID: mdl-21325301

ABSTRACT

MOTIVATION: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study. RESULTS: Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ∼20% of open access articles in PubMed Central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). We illustrate the utility of data extracted by text2genome from more than 150 000 PMC articles for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments. CONCLUSION: Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data. AVAILABILITY AND IMPLEMENTATION: Source code is available under a BSD license from http://sourceforge.net/projects/text2genome/ and results can be browsed and downloaded at http://text2genome.org.
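The general idea of pulling candidate DNA sequences out of running text can be illustrated with a regular expression. This is only a rough sketch and not the text2genome algorithm: the minimum length, the function name and the sample sentence (including its primer sequence) are invented for illustration, and the subsequent mapping to genomic databases is omitted entirely.

```python
import re

# Hypothetical threshold: the abstract does not state what minimum
# length, if any, text2genome requires for a candidate sequence.
MIN_LENGTH = 18

# A contiguous run of nucleotide letters at least MIN_LENGTH long.
DNA_RUN = re.compile(r"[ACGTacgt]{%d,}" % MIN_LENGTH)


def find_dna_candidates(text: str) -> list[tuple[int, str]]:
    """Return (start offset, upper-cased sequence) for each candidate run."""
    return [(m.start(), m.group().upper()) for m in DNA_RUN.finditer(text)]


# Invented example sentence; the primer is not taken from any article.
sample = "The forward primer was 5'-ATGGCCAAGTTGACCAGTGCC-3' and data were collected."
candidates = find_dna_candidates(sample)
print(candidates)
```

A length threshold of this kind keeps ordinary English words, which rarely contain long runs of only the letters a, c, g and t, from being reported as sequences.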


Subject(s)
DNA/chemistry , Data Mining/methods , Genes , Genome , Molecular Sequence Annotation , PubMed , Base Sequence , Chromatin Immunoprecipitation , Databases, Nucleic Acid , Reverse Transcriptase Polymerase Chain Reaction , Sequence Analysis, DNA , Software
7.
Kidney Int ; 77(10): 891-6, 2010 May.
Article in English | MEDLINE | ID: mdl-20200501

ABSTRACT

Nephronophthisis is a heterogenetic autosomal recessive disorder associated with multiple developmental abnormalities, including cystic kidney disease and retinal degeneration. Retinal dystrophies, in particular the X-linked forms, are believed to represent a distinct group of hereditary diseases; however, their genetic complexity and overlap with other syndromic diseases is increasingly apparent. In this study, we report that depletion of retinitis pigmentosa GTPase regulator (RPGR) during zebrafish embryogenesis causes developmental changes indistinguishable from the abnormalities caused by the depletion of nephrocystin-5 or nephrocystin-6. However, RPGR did not directly interact with either gene product. RPGR-interacting protein 1 was found to act as an adaptor connecting RPGR to nephrocystin-6, thereby linking it to the nephronophthisis protein network. This interaction was abolished by truncating mutations (c.1107delA) of the interacting protein. Our findings underline the importance of the interplay between the two protein networks, suggesting a phenotypic modulation in both retinitis pigmentosa and nephronophthisis.


Subject(s)
Mutation , Proteins/genetics , Proteins/metabolism , Retinitis Pigmentosa/genetics , Retinitis Pigmentosa/metabolism , Animals , Eye Proteins , GTP Phosphohydrolases/genetics , GTP Phosphohydrolases/metabolism , Zebrafish/genetics , Zebrafish/metabolism , Zebrafish Proteins
8.
BMC Bioinformatics ; 11: 85, 2010 Feb 11.
Article in English | MEDLINE | ID: mdl-20149233

ABSTRACT

BACKGROUND: The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. RESULTS: In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. CONCLUSIONS: LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.
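Dictionary-based matching of the kind described above can be sketched with a character trie, which is a simple form of deterministic finite-state automaton. The code below is an illustration only and not the LINNAEUS implementation: the five-entry dictionary, the function names and the sample sentence are assumptions, and the heuristics for resolving ambiguous mentions are left out.

```python
# Tiny illustrative dictionary mapping names to NCBI Taxonomy identifiers.
SPECIES = {
    "homo sapiens": 9606,
    "human": 9606,
    "danio rerio": 7955,
    "zebrafish": 7955,
    "mus musculus": 10090,
}

END = "$id"  # terminal marker; multi-character, so it never clashes with a letter


def build_trie(dictionary):
    """Build a character trie; each complete name stores its identifier."""
    root = {}
    for name, tax_id in dictionary.items():
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[END] = tax_id
    return root


def find_mentions(text, trie):
    """Return (mention, identifier) pairs, preferring the longest match."""
    lowered = text.lower()
    mentions = []
    i = 0
    while i < len(lowered):
        # Only start a match at a word boundary.
        if i > 0 and lowered[i - 1].isalnum():
            i += 1
            continue
        node, j, best = trie, i, None
        while j < len(lowered) and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            # Accept only if the name also ends at a word boundary.
            if END in node and (j == len(lowered) or not lowered[j].isalnum()):
                best = (i, j, node[END])
        if best:
            mentions.append((text[best[0]:best[1]], best[2]))
            i = best[1]
        else:
            i += 1
    return mentions


trie = build_trie(SPECIES)
sample = "Knockdown in zebrafish and Danio rerio embryos; human cells were unaffected."
print(find_mentions(sample, trie))
```

Walking the trie one character at a time means each text position is checked against every dictionary name simultaneously, which is the main reason automaton-based matchers scale to large name lists.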


Subject(s)
Computational Biology/methods , Information Storage and Retrieval/methods , Software , PubMed , Vocabulary, Controlled
9.
Hum Mol Genet ; 17(23): 3655-62, 2008 Dec 01.
Article in English | MEDLINE | ID: mdl-18723859

ABSTRACT

Nephronophthisis (NPHP) is an autosomal recessive cystic kidney disease, caused by mutations of at least nine different genes. Several extrarenal manifestations characterize this disorder, including cerebellar defects, situs inversus and retinitis pigmentosa. While the clinical manifestations vary significantly in NPHP, mutations of NPHP5 and NPHP6 are always associated with progressive blindness. This clinical finding suggests that the gene products, nephrocystin-5 and nephrocystin-6, participate in overlapping signaling pathways to maintain photoreceptor homeostasis. To analyze the genetic interaction between these two proteins in more detail, we studied zebrafish embryos after depletion of NPHP5 and NPHP6. Knockdown of zebrafish zNPHP5 and zNPHP6 produced similar phenotypes, and synergistic effects were observed after the combined knockdown of zNPHP5 and zNPHP6. The N-terminal domain of nephrocystin-6 bound nephrocystin-5, and mapping studies delineated the interacting site from amino acid 696 to 896 of NPHP6. In Xenopus laevis, knockdown of NPHP5 caused substantial neural tube closure defects. This phenotype was copied by expression of the nephrocystin-5-binding fragment of nephrocystin-6, and rescued by co-expression of nephrocystin-5, supporting a physical interaction between both gene products in vivo. Since the N- and C-terminal fragments of nephrocystin-6 engage in the formation of homo- and heteromeric protein complexes, conformational changes seem to regulate the interaction of nephrocystin-6 with its binding partners.


Subject(s)
Calmodulin-Binding Proteins/genetics , Calmodulin-Binding Proteins/metabolism , Kidney Diseases, Cystic/metabolism , Zebrafish/genetics , Zebrafish/metabolism , Amino Acid Motifs , Animals , Calmodulin-Binding Proteins/chemistry , Female , Gene Knockdown Techniques , Humans , Kidney Diseases, Cystic/complications , Kidney Diseases, Cystic/embryology , Kidney Diseases, Cystic/genetics , Male , Microinjections , Neural Tube/embryology , Neural Tube/growth & development , Neural Tube/metabolism , Phenotype , Protein Binding , Protein Structure, Tertiary , Sequence Deletion , Xenopus Proteins/genetics , Xenopus Proteins/metabolism , Xenopus laevis/embryology , Xenopus laevis/genetics , Xenopus laevis/growth & development , Xenopus laevis/metabolism , Zebrafish/embryology , Zebrafish/growth & development