Results 1 - 8 of 8
1.
J Biomed Inform ; 83: 73-86, 2018 Jul.
Article in English | MEDLINE | ID: mdl-29860093

ABSTRACT

INTRODUCTION: The FDA Adverse Event Reporting System (FAERS) is a primary data source for identifying unlabeled adverse events (AEs) in a drug or biologic drug product's postmarketing phase. Many AE reports must be reviewed by drug safety experts to identify unlabeled AEs, even if the reported AEs are previously identified, labeled AEs. Integrating the labeling status of drug product AEs into FAERS could increase report triage and review efficiency. The Medical Dictionary for Regulatory Activities (MedDRA) is the standard for coding AE terms in FAERS cases. However, drug manufacturers are not required to use MedDRA to describe AEs in product labels. We hypothesized that natural language processing (NLP) tools could assist in automating the extraction and MedDRA mapping of AE terms in drug product labels. MATERIALS AND METHODS: We evaluated the performance of three NLP systems (ETHER, I2E, and MetaMap) for their ability to extract AE terms from drug labels and translate the terms to MedDRA Preferred Terms (PTs). Pharmacovigilance-based annotation guidelines for extracting AE terms from drug labels were developed for this study. We compared each system's output to MedDRA PT AE lists, manually mapped by FDA pharmacovigilance experts using the guidelines, for ten drug product labels, known as the "gold standard AE list" (GSL) dataset. Strict time and configuration conditions were imposed to test each system's capabilities under conditions of no human intervention and minimal system configuration. Each NLP system's output was evaluated for precision, recall, and F measure in comparison to the GSL. A qualitative error analysis (QEA) was conducted to categorize a random sample of each NLP system's false positive and false negative errors. RESULTS: A total of 417, 278, and 250 false positive errors occurred in the ETHER, I2E, and MetaMap outputs, respectively. A total of 100, 80, and 187 false negative errors occurred in the ETHER, I2E, and MetaMap outputs, respectively.
Precision ranged from 64% to 77%, recall from 64% to 83%, and F measure from 67% to 79%. I2E had the highest precision (77%), recall (83%), and F measure (79%). ETHER had the lowest precision (64%). MetaMap had the lowest recall (64%). The QEA found that the most prevalent false positive errors were context errors such as "Context error/General term", "Context error/Instructions or monitoring parameters", "Context error/Medical history, preexisting condition, underlying condition, risk factor, or contraindication", and "Context error/AE manifestations or secondary complication". The most prevalent false negative errors were in the "Incomplete or missed extraction" error category. Missing AE terms were typically due to long terms, or terms containing non-contiguous words, that do not correspond exactly to MedDRA synonyms. MedDRA mapping errors were a minority of errors for ETHER and I2E but were the most prevalent false positive errors for MetaMap. CONCLUSIONS: The results demonstrate that it may be feasible to use NLP tools to extract and map AE terms to MedDRA PTs. However, the NLP tools we tested would need to be modified or reconfigured to lower the error rates to support their use in a regulatory setting. Tools specific to extracting AE terms from drug labels and mapping the terms to MedDRA PTs may need to be developed to support pharmacovigilance. Conducting research using additional NLP systems on a larger, more diverse GSL would also be informative.
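The evaluation above scores each NLP system's extracted term list against a gold standard AE list using precision, recall, and F measure. A minimal sketch of that comparison, using invented term sets rather than data from the study:

```python
# Compare an extracted set of MedDRA Preferred Terms against a gold
# standard AE list (GSL). The term sets below are hypothetical examples.

def score(extracted, gold):
    """Return (precision, recall, f_measure) for two sets of terms."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)   # terms correctly extracted
    fp = len(extracted - gold)   # extracted but not in the GSL
    fn = len(gold - extracted)   # in the GSL but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = {"Nausea", "Headache", "Rash", "Dizziness"}
extracted = {"Nausea", "Headache", "Fatigue"}
p, r, f = score(extracted, gold)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.5 0.57
```

Here "Fatigue" is a false positive and "Rash"/"Dizziness" are false negatives; the same arithmetic underlies the 64-83% figures reported for the three systems.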


Subject(s)
Adverse Drug Reaction Reporting Systems , Drug Labeling , Drug-Related Side Effects and Adverse Reactions , Natural Language Processing , Terminology as Topic , Humans , Pharmacovigilance , United States , United States Food and Drug Administration
2.
Drug Discov Today ; 21(3): 473-80, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26854423

ABSTRACT

Comparative effectiveness research (CER) provides evidence for the relative effectiveness and risks of different treatment options and informs decisions made by healthcare providers, payers, and pharmaceutical companies. CER data come from retrospective analyses as well as prospective clinical trials. Here, we describe the development of a text-mining pipeline based on natural language processing (NLP) that extracts key information from three different trial data sources: NIH ClinicalTrials.gov, WHO International Clinical Trials Registry Platform (ICTRP), and Citeline Trialtrove. The pipeline leverages tailored terminologies to produce an integrated and structured output, capturing any trials in which pharmaceutical products of interest are compared with another therapy. The timely information alerts generated by this system provide the earliest and most complete picture of emerging clinical research.
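The matching step at the core of such a pipeline can be illustrated with a toy terminology lookup: flag trial records in which a product of interest co-occurs with another therapy. The product names and records below are invented for illustration; the actual I2E-based pipeline is far more sophisticated.

```python
# Terminology-driven flagging of comparator trials (illustrative sketch;
# PRODUCTS_OF_INTEREST and OTHER_THERAPIES are hypothetical vocabularies).
import re

PRODUCTS_OF_INTEREST = {"drugalpha"}
OTHER_THERAPIES = {"drugbeta", "placebo"}

def is_comparator_trial(arm_text):
    """True if a product of interest co-occurs with another therapy."""
    tokens = set(re.findall(r"[a-z]+", arm_text.lower()))
    return bool(tokens & PRODUCTS_OF_INTEREST) and bool(tokens & OTHER_THERAPIES)

records = [
    "Arm 1: DrugAlpha 10 mg; Arm 2: DrugBeta 20 mg",
    "Single-arm open-label study of DrugAlpha",
]
flags = [is_comparator_trial(r) for r in records]
print(flags)  # [True, False]
```

Running the same lookup over the three registries and deduplicating by trial identifier yields the kind of integrated, structured alert feed the abstract describes.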


Subject(s)
Comparative Effectiveness Research , Data Mining , Clinical Trials as Topic , Databases, Factual , Humans , Natural Language Processing , Registries
3.
J Biomed Inform ; 58 Suppl: S120-S127, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26209007

ABSTRACT

This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors from patient records, as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics, and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-score of 91.7%, one percentage point behind the top-performing system.


Subject(s)
Cardiovascular Diseases/epidemiology , Data Mining/methods , Diabetes Complications/epidemiology , Electronic Health Records/organization & administration , Narration , Natural Language Processing , Aged , Cardiovascular Diseases/diagnosis , Cohort Studies , Comorbidity , Computer Security , Confidentiality , Diabetes Complications/diagnosis , Female , Humans , Incidence , Longitudinal Studies , Male , Middle Aged , Pattern Recognition, Automated/methods , Risk Assessment/methods , United Kingdom/epidemiology , Vocabulary, Controlled
4.
Mol Brain ; 7: 88, 2014 Nov 28.
Article in English | MEDLINE | ID: mdl-25429717

ABSTRACT

BACKGROUND: Synapses are fundamental components of brain circuits and are disrupted in over 100 neurological and psychiatric diseases. The synapse proteome is physically organized into multiprotein complexes and polygenic mutations converge on postsynaptic complexes in schizophrenia, autism and intellectual disability. Directly characterising human synapses and their multiprotein complexes from post-mortem tissue is essential to understanding disease mechanisms. However, multiprotein complexes have not been directly isolated from human synapses and the feasibility of their isolation from post-mortem tissue is unknown. RESULTS: Here we establish a screening assay and criteria to identify post-mortem brain samples containing well-preserved synapse proteomes, revealing that neocortex samples are best preserved. We also develop a rapid method for the isolation of synapse proteomes from human brain, allowing large numbers of post-mortem samples to be processed in a short time frame. We perform the first purification and proteomic mass spectrometry analysis of MAGUK Associated Signalling Complexes (MASC) from neurosurgical and post-mortem tissue and find genetic evidence for their involvement in over seventy human brain diseases. CONCLUSIONS: We have demonstrated that synaptic proteome integrity can be rapidly assessed from human post-mortem brain samples prior to its analysis with sophisticated proteomic methods. We have also shown that proteomics of synapse multiprotein complexes from well preserved post-mortem tissue is possible, obtaining structures highly similar to those isolated from biopsy tissue. Finally we have shown that MASC from human synapses are involved with over seventy brain disorders. These findings should have wide application in understanding the synaptic basis of psychiatric and other mental disorders.


Subject(s)
Postmortem Changes , Proteome/metabolism , Proteomics , Synapses/metabolism , Cerebral Cortex/metabolism , Chromatography, Affinity , Humans , Membrane Proteins/metabolism , Nerve Tissue Proteins/metabolism , Signal Transduction , Subcellular Fractions/metabolism , Tissue Banks
5.
J Biomed Semantics ; 2 Suppl 5: S11, 2011 Oct 06.
Article in English | MEDLINE | ID: mdl-22166494

ABSTRACT

BACKGROUND: Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). Preparation of a GSC is time-consuming and costly, and the final corpus consists of at most a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus with four different semantic groups through the harmonisation of annotations from automatic text mining solutions: the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO), and species (SPE). This corpus was used for the First CALBC Challenge, which asked the participants to annotate the corpus with their text processing solutions. RESULTS: All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could ignore the training data and deliver the annotations from their genuine annotation system, or could train a machine-learning approach on the provided pre-annotated data. In general, the performances of the annotation solutions were lower for entities from the categories CHED and PRGE than for entities categorized as DISO and SPE. The best performance over all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I. The data sets from participants were used to generate the harmonised Silver Standard Corpus II (SSC-II), provided the participant did not make use of the annotated data set from the SSC-I for training purposes. The performances of the participants' solutions were again measured against the SSC-II.
The annotation solutions again showed better results for DISO and SPE than for CHED and PRGE. CONCLUSIONS: The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier, leading to an average F-measure of 85%. Benchmarking the annotation solutions against the SSC-II leads to better performance for the CPs' annotation solutions than benchmarking against the SSC-I.
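The harmonisation that produces a silver standard can be reduced to a majority vote over candidate annotations. A toy version, using invented spans: an annotation is kept only if enough contributing systems propose the identical (start, end, semantic group) triple. CALBC's actual boundary reconciliation is more elaborate than this sketch.

```python
# Majority-vote harmonisation of entity annotations from several systems.
# Spans and semantic-group labels below are invented examples.
from collections import Counter

def harmonise(system_annotations, min_votes):
    """Keep spans proposed by at least min_votes systems."""
    votes = Counter(span for spans in system_annotations for span in set(spans))
    return sorted(span for span, n in votes.items() if n >= min_votes)

systems = [
    [(0, 7, "DISO"), (12, 20, "PRGE")],   # system A
    [(0, 7, "DISO"), (25, 30, "SPE")],    # system B
    [(0, 7, "DISO"), (12, 20, "PRGE")],   # system C
]
silver = harmonise(systems, min_votes=2)
print(silver)  # [(0, 7, 'DISO'), (12, 20, 'PRGE')]
```

The SPE span gets only one vote and is dropped; requiring agreement among systems is what makes the resulting corpus "silver" rather than gold.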

6.
J Bioinform Comput Biol ; 8(1): 163-79, 2010 Feb.
Article in English | MEDLINE | ID: mdl-20183881

ABSTRACT

The CALBC initiative aims to provide a large-scale biomedical text corpus that contains semantic annotations for named entities of different kinds. The generation of this corpus requires that the annotations from different automatic annotation systems be harmonized. In the first phase, the annotation systems from five participants (EMBL-EBI, EMC Rotterdam, NLM, JULIE Lab Jena, and Linguamatics) were gathered. All annotations were delivered in a common annotation format that included concept identifiers in the boundary assignments and that enabled comparison and alignment of the results. During the harmonization phase, the results produced by those different systems were integrated into a single harmonized corpus (the "silver standard" corpus) by applying a voting scheme. We give an overview of the processed data and the principles of harmonization: formal boundary reconciliation and semantic matching of named entities. Finally, all submissions of the participants were evaluated against that silver standard corpus. We found that species and disease annotations are better standardized amongst the partners than the annotations of genes and proteins. The raw corpus is now available for additional named entity annotations. Parts of it will be made available later for a public challenge. We expect to improve corpus building activities both in the number of named entity classes covered and in the size of the corpus in annotated documents.


Subject(s)
Computational Biology/standards , Data Mining/standards , Cooperative Behavior , Data Mining/statistics & numerical data , Databases, Factual/statistics & numerical data , Unified Medical Language System
7.
Methods Mol Biol ; 563: 3-13, 2009.
Article in English | MEDLINE | ID: mdl-19597777

ABSTRACT

Natural language processing (NLP) technology can be used to rapidly extract protein-protein interactions from large collections of published literature. In this chapter we will work through a case study using MEDLINE biomedical abstracts (1) to find how a specific set of 50 genes interact with each other. We will show what steps are required to achieve this using the I2E software from Linguamatics (www.linguamatics.com (2)). To extract protein networks from the literature, there are two typical strategies. The first is to find pairs of proteins which are mentioned together in the same context, for example the same sentence, with the assumption that textual proximity implies biological association. The second approach is to use precise linguistic patterns based on NLP to find specific relationships between proteins. This can reveal the direction of the relationship and its nature, such as "phosphorylation" or "upregulation". The I2E system uses a flexible text-mining approach, supporting both of these strategies, as well as hybrid strategies which fall between the two. In this chapter we show how multiple strategies can be combined to obtain high-quality results.
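The first (co-occurrence) strategy described above can be sketched in a few lines: pair up proteins mentioned in the same sentence, on the assumption that textual proximity implies biological association. The gene names and sentences are invented; I2E's pattern-based strategy additionally captures the type and direction of each relationship, which this sketch does not.

```python
# Sentence-level co-occurrence extraction for a fixed gene set
# (hypothetical genes and sentences, for illustration only).
from itertools import combinations

GENES = {"TP53", "MDM2", "BRCA1"}

def cooccurring_pairs(sentences):
    """Return sorted, deduplicated gene pairs mentioned in one sentence."""
    pairs = set()
    for sentence in sentences:
        mentioned = sorted(g for g in GENES if g in sentence)
        pairs.update(combinations(mentioned, 2))
    return sorted(pairs)

abstract_sentences = [
    "MDM2 negatively regulates TP53.",
    "BRCA1 expression was unchanged.",
]
pairs = cooccurring_pairs(abstract_sentences)
print(pairs)  # [('MDM2', 'TP53')]
```

A real system would replace the substring test with dictionary- or pattern-based named entity recognition, since bare string matching misses synonyms and over-matches short gene symbols.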


Subject(s)
Natural Language Processing , Protein Interaction Mapping , Proteins/metabolism , Software , Abstracting and Indexing , MEDLINE , United States
8.
Comp Funct Genomics ; 6(1-2): 67-71, 2005.
Article in English | MEDLINE | ID: mdl-18629299

ABSTRACT

Over recent years, there has been a growing interest in extracting information automatically or semi-automatically from the scientific literature. This paper describes a novel ontology-based interactive information extraction (OBIIE) framework and a specific OBIIE system. We describe how this system enables life scientists to make ad hoc queries similar to using a standard search engine, but where the results are obtained in a database format similar to a pre-programmed information extraction engine. We present a case study in which the system was evaluated for extracting co-factors from EMBASE and MEDLINE.
