Search | VHL Regional Portal

Negation's not solved: generalizability versus optimizability in clinical natural language processing.

Wu, Stephen; Miller, Timothy; Masanz, James; Coarr, Matt; Halgrim, Scott; Carrell, David; Clark, Cheryl.

PLoS One ; 9(11): e112774, 2014.

Article in English | MEDLINE | ID: mdl-25393544

ABSTRACT

A review of published work in clinical natural language processing (NLP) may suggest that the negation detection task has been "solved." This work proposes that an optimizable solution does not equal a generalizable solution. We introduce a new machine learning-based Polarity Module for detecting negation in clinical text, and extensively compare its performance across domains. Using four manually annotated corpora of clinical text, we show that negation detection performance suffers when there is no in-domain development (for manual methods) or training data (for machine learning-based methods). Various factors (e.g., annotation guidelines, named entity characteristics, the amount of data, and lexical and syntactic context) play a role in making generalizability difficult, but none completely explains the phenomenon. Furthermore, generalizability remains challenging because it is unclear whether to use a single source for accurate data, combine all sources into a single model, or apply domain adaptation methods. The most reliable means to improve negation detection is to manually annotate in-domain training data (or, perhaps, manually modify rules); this is a strategy for optimizing performance, rather than generalizing it. These results suggest a direction for future work in domain-adaptive and task-adaptive methods for clinical NLP.

Subject(s)

Algorithms , Artificial Intelligence/statistics & numerical data , Natural Language Processing , Clinical Medicine/education , Humans , Semantics , Textbooks as Topic , Vocabulary, Controlled

Open Source Clinical NLP - More than Any Single System.

Masanz, James; Pakhomov, Serguei V; Xu, Hua; Wu, Stephen T; Chute, Christopher G; Liu, Hongfang.

AMIA Jt Summits Transl Sci Proc ; 2014: 76-82, 2014.

Article in English | MEDLINE | ID: mdl-25954581

ABSTRACT

The number of Natural Language Processing (NLP) tools and systems for processing clinical free-text has grown as interest and processing capability have surged. Unfortunately, any two systems typically cannot simply interoperate, even when both are built upon a framework designed to facilitate the creation of pluggable components. We present two ongoing activities promoting open source clinical NLP. The Open Health Natural Language Processing (OHNLP) Consortium was originally founded to foster a collaborative community around clinical NLP, releasing UIMA-based open source software. OHNLP's mission currently includes maintaining a catalog of clinical NLP software and providing interfaces to simplify the interaction of NLP systems. Meanwhile, Apache cTAKES aims to integrate best-of-breed annotators, providing a world-class NLP system for accessing clinical information within free-text. These two activities are complementary. OHNLP promotes open source clinical NLP activities in the research community, and Apache cTAKES bridges research to the health information technology (HIT) practice.

Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium.

Pathak, Jyotishman; Bailey, Kent R; Beebe, Calvin E; Bethard, Steven; Carrell, David C; Chen, Pei J; Dligach, Dmitriy; Endle, Cory M; Hart, Lacey A; Haug, Peter J; Huff, Stanley M; Kaggal, Vinod C; Li, Dingcheng; Liu, Hongfang; Marchant, Kyle; Masanz, James; Miller, Timothy; Oniki, Thomas A; Palmer, Martha; Peterson, Kevin J; Rea, Susan; Savova, Guergana K; Stancl, Craig R; Sohn, Sunghwan; Solbrig, Harold R; Suesse, Dale B; Tao, Cui; Taylor, David P; Westberg, Les; Wu, Stephen; Zhuo, Ning; Chute, Christopher G.

J Am Med Inform Assoc ; 20(e2): e341-8, 2013 Dec.

Article in English | MEDLINE | ID: mdl-24190931

ABSTRACT

RESEARCH OBJECTIVE: To develop scalable informatics infrastructure for normalization of both structured and unstructured electronic health record (EHR) data into a unified, concept-based model for high-throughput phenotype extraction. MATERIALS AND METHODS: Software tools and applications were developed to extract information from EHRs. Representative and convenience samples of both structured and unstructured data from two EHR systems (Mayo Clinic and Intermountain Healthcare) were used for development and validation. Extracted information was standardized and normalized to meaningful use (MU) conformant terminology and value set standards using Clinical Element Models (CEMs). These resources were used to demonstrate semi-automatic execution of MU clinical-quality measures modeled using the Quality Data Model (QDM) and an open-source rules engine. RESULTS: Using CEMs and open-source natural language processing and terminology services engines, namely Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) and Common Terminology Services (CTS2), we developed a data-normalization platform that ensures data security, end-to-end connectivity, and reliable data flow within and across institutions. We demonstrated the applicability of this platform by executing a QDM-based MU quality measure that determines the percentage of patients between 18 and 75 years with diabetes whose most recent low-density lipoprotein cholesterol test result during the measurement year was <100 mg/dL on a randomly selected cohort of 273 Mayo Clinic patients. The platform identified 21 and 18 patients for the denominator and numerator of the quality measure, respectively. Validation results indicate that all identified patients meet the QDM-based criteria.
CONCLUSIONS: End-to-end automated systems for extracting clinical information from diverse EHR systems require extensive use of standardized vocabularies and terminologies, as well as robust information models for storing, discovering, and processing that information. This study demonstrates the application of modular and open-source resources for enabling secondary use of EHR data through normalization into standards-based, comparable, and consistent format for high-throughput phenotyping to identify patient cohorts.
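
The quality measure in the RESULTS of this abstract can be read as a simple ratio: patients aged 18 to 75 with diabetes form the denominator, and those whose most recent LDL cholesterol result was below 100 mg/dL form the numerator. The sketch below illustrates that logic only; it is not the QDM rules-engine implementation used in the study. The `Patient` record, its field names, and the toy cohort are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Patient:
    # Hypothetical, simplified record; real EHR data would be normalized first.
    age: int
    has_diabetes: bool
    latest_ldl_mg_dl: Optional[float]  # most recent LDL-C in the measurement year, if any


def in_denominator(p: Patient) -> bool:
    """Patient is 18 to 75 years old (inclusive, by assumption) and has diabetes."""
    return 18 <= p.age <= 75 and p.has_diabetes


def in_numerator(p: Patient) -> bool:
    """Denominator patient whose most recent LDL-C result was below 100 mg/dL."""
    return (
        in_denominator(p)
        and p.latest_ldl_mg_dl is not None
        and p.latest_ldl_mg_dl < 100
    )


def measure_counts(cohort: List[Patient]) -> Tuple[int, int]:
    """Return (numerator count, denominator count) for the cohort."""
    denominator = [p for p in cohort if in_denominator(p)]
    numerator = [p for p in denominator if in_numerator(p)]
    return len(numerator), len(denominator)


def measure_rate(numerator: int, denominator: int) -> float:
    """Proportion of denominator patients meeting the measure (0.0 if empty)."""
    return numerator / denominator if denominator else 0.0


# Invented toy cohort, for illustration only.
toy_cohort = [
    Patient(age=50, has_diabetes=True, latest_ldl_mg_dl=90.0),   # numerator
    Patient(age=60, has_diabetes=True, latest_ldl_mg_dl=130.0),  # denominator only
    Patient(age=80, has_diabetes=True, latest_ldl_mg_dl=90.0),   # excluded by age
    Patient(age=40, has_diabetes=False, latest_ldl_mg_dl=80.0),  # no diabetes
    Patient(age=30, has_diabetes=True, latest_ldl_mg_dl=None),   # no LDL result
]
toy_num, toy_den = measure_counts(toy_cohort)

# Counts reported in the abstract: 18 numerator, 21 denominator.
reported_rate = measure_rate(18, 21)
```

With the counts reported in the abstract, 18 of 21 patients corresponds to a rate of roughly 85.7%.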

Subject(s)

Data Mining , Electronic Health Records/standards , Medical Informatics Applications , Natural Language Processing , Phenotype , Algorithms , Biomedical Research , Computer Security , Humans , Software , Vocabulary, Controlled

A common type system for clinical natural language processing.

Wu, Stephen T; Kaggal, Vinod C; Dligach, Dmitriy; Masanz, James J; Chen, Pei; Becker, Lee; Chapman, Wendy W; Savova, Guergana K; Liu, Hongfang; Chute, Christopher G.

J Biomed Semantics ; 4(1): 1, 2013 Jan 03.

Article in English | MEDLINE | ID: mdl-23286462

ABSTRACT

BACKGROUND: One challenge in reusing clinical data stored in electronic medical records is that these data are heterogeneous. Clinical Natural Language Processing (NLP) plays an important role in transforming information in clinical text to a standard representation that is comparable and interoperable. Information may be processed and shared when a type system specifies the allowable data structures. Therefore, we aim to define a common type system for clinical NLP that enables interoperability between structured and unstructured data generated in different clinical settings. RESULTS: We describe a common type system for clinical NLP that has an end target of deep semantics based on Clinical Element Models (CEMs), thus interoperating with structured data and accommodating diverse NLP approaches. The type system has been implemented in UIMA (Unstructured Information Management Architecture) and is fully functional in a popular open-source clinical NLP system, cTAKES (clinical Text Analysis and Knowledge Extraction System) versions 2.0 and later. CONCLUSIONS: We have created a type system that targets deep semantics, thereby allowing for NLP systems to encapsulate knowledge from text and share it alongside heterogeneous clinical data sources. Rather than surface semantics that are typically the end product of NLP algorithms, CEM-based semantics explicitly build in deep clinical semantics as the point of interoperability with more structured data types.

The MiPACQ clinical question answering system.

Cairns, Brian L; Nielsen, Rodney D; Masanz, James J; Martin, James H; Palmer, Martha S; Ward, Wayne H; Savova, Guergana K.

AMIA Annu Symp Proc ; 2011: 171-80, 2011.

Article in English | MEDLINE | ID: mdl-22195068

ABSTRACT

The Multi-source Integrated Platform for Answering Clinical Questions (MiPACQ) is a QA pipeline that integrates a variety of information retrieval and natural language processing systems into an extensible question answering system. We present the system's architecture and an evaluation of MiPACQ on a human-annotated evaluation dataset based on the Medpedia health and medical encyclopedia. Compared with our baseline information retrieval system, the MiPACQ rule-based system demonstrates 84% improvement in Precision at One and the MiPACQ machine-learning-based system demonstrates 134% improvement. Other performance metrics including mean reciprocal rank and area under the precision/recall curves also showed significant improvement, validating the effectiveness of the MiPACQ design and implementation.
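
For readers unfamiliar with the metrics named in this abstract, the sketch below shows the standard definitions of Precision at One and mean reciprocal rank, plus a relative-improvement helper. It assumes the reported 84% and 134% figures are relative gains over the baseline, which the abstract does not state explicitly. The function names and the toy rankings are hypothetical and are not MiPACQ data.

```python
from typing import List, Sequence


def precision_at_one(rankings: Sequence[Sequence[bool]]) -> float:
    """Fraction of questions whose top-ranked answer is relevant."""
    if not rankings:
        return 0.0
    hits = sum(1 for ranked in rankings if ranked and ranked[0])
    return hits / len(rankings)


def mean_reciprocal_rank(rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of 1/rank of the first relevant answer (0 when none is relevant)."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked in rankings:
        for position, relevant in enumerate(ranked, start=1):
            if relevant:
                total += 1.0 / position
                break
    return total / len(rankings)


def relative_improvement(baseline: float, system: float) -> float:
    """Relative gain of `system` over a nonzero `baseline`, e.g. 0.84 for 84%."""
    return (system - baseline) / baseline


# Invented relevance judgments: one list per question, in ranked order.
toy_rankings: List[List[bool]] = [
    [True, False],          # relevant answer at rank 1
    [False, True],          # relevant answer at rank 2
    [False, False],         # no relevant answer returned
    [False, False, True],   # relevant answer at rank 3
]
toy_p_at_1 = precision_at_one(toy_rankings)
toy_mrr = mean_reciprocal_rank(toy_rankings)
```

Under this reading, a baseline Precision at One of 0.25 and a system score of 0.46 would give `relative_improvement(0.25, 0.46)` of about 0.84; these two scores are made-up values chosen only to illustrate the arithmetic.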

Subject(s)

Electronic Health Records , Natural Language Processing , Search Engine , Software , Artificial Intelligence , Computer Systems , Humans , Information Systems

Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications.

Savova, Guergana K; Masanz, James J; Ogren, Philip V; Zheng, Jiaping; Sohn, Sunghwan; Kipper-Schuler, Karin C; Chute, Christopher G.

J Am Med Inform Assoc ; 17(5): 507-13, 2010.

Article in English | MEDLINE | ID: mdl-20819853

ABSTRACT

We aim to build and evaluate an open-source natural language processing system for information extraction from electronic medical record clinical free-text. We describe and evaluate our system, the clinical Text Analysis and Knowledge Extraction System (cTAKES), released open-source at http://www.ohnlp.org. The cTAKES builds on existing open-source technologies: the Unstructured Information Management Architecture framework and OpenNLP natural language processing toolkit. Its components, specifically trained for the clinical domain, create rich linguistic and semantic annotations. Performance of individual components: sentence boundary detector accuracy=0.949; tokenizer accuracy=0.949; part-of-speech tagger accuracy=0.936; shallow parser F-score=0.924; named entity recognizer and system-level evaluation F-score=0.715 for exact and 0.824 for overlapping spans, and accuracy for concept mapping, negation, and status attributes for exact and overlapping spans of 0.957, 0.943, 0.859, and 0.580, 0.939, and 0.839, respectively. Overall performance is discussed against five applications. The cTAKES annotations are the foundation for methods and modules for higher-level semantic processing of clinical free-text.

Subject(s)

Electronic Health Records , Information Storage and Retrieval/methods , Natural Language Processing , Biomedical Research

Classification of medication status change in clinical narratives.

Sohn, Sunghwan; Murphy, Sean P; Masanz, James J; Kocher, Jean-Pierre A; Savova, Guergana K.

AMIA Annu Symp Proc ; 2010: 762-6, 2010 Nov 13.

Article in English | MEDLINE | ID: mdl-21347081

ABSTRACT

The patient's medication history and status changes play essential roles in medical treatment. A notable amount of medication status information typically resides in unstructured clinical narratives that require a sophisticated approach to automated classification. In this paper, we investigated rule-based and machine learning methods of medication status change classification from clinical free text. We also examined the impact of balancing training data in machine learning classification when using the data from skewed class distribution.

Subject(s)

Machine Learning , Narration , Artificial Intelligence , Humans , Natural Language Processing

Automatically extracting cancer disease characteristics from pathology reports into a Disease Knowledge Representation Model.

Coden, Anni; Savova, Guergana; Sominsky, Igor; Tanenblatt, Michael; Masanz, James; Schuler, Karin; Cooper, James; Guan, Wei; de Groen, Piet C.

J Biomed Inform ; 42(5): 937-49, 2009 Oct.

Article in English | MEDLINE | ID: mdl-19135551

ABSTRACT

We introduce an extensible and modifiable knowledge representation model to represent cancer disease characteristics in a comparable and consistent fashion. We describe a system, MedTAS/P which automatically instantiates the knowledge representation model from free-text pathology reports. MedTAS/P is based on an open-source framework and its components use natural language processing principles, machine learning and rules to discover and populate elements of the model. To validate the model and measure the accuracy of MedTAS/P, we developed a gold-standard corpus of manually annotated colon cancer pathology reports. MedTAS/P achieves F1-scores of 0.97-1.0 for instantiating classes in the knowledge representation model such as histologies or anatomical sites, and F1-scores of 0.82-0.93 for primary tumors or lymph nodes, which require the extractions of relations. An F1-score of 0.65 is reported for metastatic tumors, a lower score predominantly due to a very small number of instances in the training and test sets.

Subject(s)

Information Storage and Retrieval/methods , Models, Theoretical , Natural Language Processing , Neoplasms/pathology , Pattern Recognition, Automated/methods , Databases, Factual , Humans , Medical Informatics/methods , Medical Records , Terminology as Topic

Towards temporal relation discovery from the clinical narrative.

Savova, Guergana; Bethard, Steven; Styler, Will; Martin, James; Palmer, Martha; Masanz, James; Ward, Wayne.

AMIA Annu Symp Proc ; 2009: 568-72, 2009 Nov 14.

Article in English | MEDLINE | ID: mdl-20351919

ABSTRACT

Disease progression and understanding relies on temporal concepts. Discovery of automated temporal relations and timelines from the clinical narrative allows for mining large data sets of clinical text to uncover patterns at the disease and patient level. Our overall goal is the complex task of building a system for automated temporal relation discovery. As a first step, we evaluate enabling methods from the general natural language processing domain - deep parsing and semantic role labeling in predicate-argument structures - to explore their portability to the clinical domain. As a second step, we develop an annotation schema for temporal relations based on TimeML. In this paper we report results and findings from these first steps. Our next efforts will scale up the data collection to develop domain-specific modules for the enabling technologies within Mayo's open-source clinical Text Analysis and Knowledge Extraction System.

Subject(s)

Disease Progression , Narration , Natural Language Processing , Humans , Methods , Semantics , Time
