Results 1 - 7 of 7
1.
J Am Med Inform Assoc ; 28(9): 1892-1899, 2021 08 13.
Article in English | MEDLINE | ID: mdl-34157094

ABSTRACT

OBJECTIVE: The study sought to develop and evaluate neural natural language processing (NLP) packages for the syntactic analysis and named entity recognition of biomedical and clinical English text.

MATERIALS AND METHODS: We implement and train biomedical and clinical English NLP pipelines by extending the widely used Stanza library, originally designed for general NLP tasks. Our models are trained on a mix of public datasets, such as the CRAFT treebank, and a private corpus of radiology reports annotated with 5 radiology-domain entities. The resulting pipelines are fully neural and perform tokenization, part-of-speech tagging, lemmatization, dependency parsing, and named entity recognition for both biomedical and clinical text. We compare our systems against popular open-source NLP libraries such as CoreNLP and scispaCy, state-of-the-art models such as BioBERT, and winning systems from the BioNLP CRAFT shared task.

RESULTS: For syntactic analysis, our systems achieve much better performance than the released scispaCy models and CoreNLP models retrained on the same treebanks, and are on par with the winning system from the CRAFT shared task. For NER, our systems substantially outperform scispaCy and are better than or on par with the state-of-the-art performance of BioBERT, while being much more computationally efficient.

CONCLUSIONS: We introduce biomedical and clinical NLP packages built for the Stanza library. These packages offer performance close to the state of the art and are optimized for ease of use. To facilitate research, we make all our models publicly available. We also provide an online demonstration (http://stanza.run/bio).


Subject(s)
Language , Natural Language Processing , Neural Networks, Computer
2.
Proc Natl Acad Sci U S A ; 117(48): 30046-30054, 2020 12 01.
Article in English | MEDLINE | ID: mdl-32493748

ABSTRACT

This paper explores the knowledge of linguistic structure learned by large artificial neural networks, trained via self-supervision, whereby the model simply tries to predict a masked word in a given context. Human language communication is via sequences of words, but language understanding requires constructing rich hierarchical structures that are never observed explicitly. The mechanisms for this have been a prime mystery of human language acquisition, while engineering work has mainly proceeded by supervised learning on treebanks of sentences hand labeled for this latent structure. However, we demonstrate that modern deep contextual language models learn major aspects of this structure, without any explicit supervision. We develop methods for identifying linguistic hierarchical structure emergent in artificial neural networks and demonstrate that components in these models focus on syntactic grammatical relationships and anaphoric coreference. Indeed, we show that a linear transformation of learned embeddings in these models captures parse tree distances to a surprising degree, allowing approximate reconstruction of the sentence tree structures normally assumed by linguists. These results help explain why these models have brought such large improvements across many language-understanding tasks.
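The key claim above — that a linear transformation of learned embeddings captures parse tree distances — can be sketched mechanically: apply a linear map B to word vectors, compute pairwise squared distances, and read a tree off the distance matrix with a minimum spanning tree. In this toy sketch both the embeddings and B are random (the paper learns B as a probe over a trained language model's representations), so it illustrates only the mechanics, not the empirical finding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for contextual word embeddings of a 5-word sentence.
n_words, dim, rank = 5, 16, 4
H = rng.normal(size=(n_words, dim))

# The "structural probe": a linear map B such that the squared distance
# ||B(h_i - h_j)||^2 approximates distance in the parse tree.
# Here B is random rather than trained, so the tree below is arbitrary.
B = rng.normal(size=(rank, dim))

def probe_distances(H, B):
    """Pairwise squared distances between words under the map B."""
    n = H.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            diff = B @ (H[i] - H[j])
            D[i, j] = diff @ diff
    return D

def mst_edges(D):
    """Recover an undirected tree from distances via Prim's algorithm:
    words joined by small probe distance become tree neighbours."""
    n = D.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda e: D[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges

D = probe_distances(H, B)
edges = mst_edges(D)   # n-1 edges spanning all words
```

With a trained probe, the spanning tree over the probe distances approximately reconstructs the linguist's parse tree; here it simply demonstrates the distance-to-tree pipeline.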

3.
Science ; 349(6245): 261-6, 2015 Jul 17.
Article in English | MEDLINE | ID: mdl-26185244

ABSTRACT

Natural language processing employs computational techniques for the purpose of learning, understanding, and producing human language content. Early computational approaches to language research focused on automating the analysis of the linguistic structure of language and developing basic technologies such as machine translation, speech recognition, and speech synthesis. Today's researchers refine and make use of such tools in real-world applications, creating spoken dialogue systems and speech-to-speech translation engines, mining social media for information about health or finance, and identifying sentiment and emotion toward products and services. We describe successes and challenges in this rapidly advancing area.


Subject(s)
Data Mining/methods , Natural Language Processing , Translating , Humans , Social Media
4.
J Am Med Inform Assoc ; 21(5): 902-9, 2014.
Article in English | MEDLINE | ID: mdl-24970840

ABSTRACT

OBJECTIVE: To reliably extract two entity types, symptoms and conditions (SCs), and drugs and treatments (DTs), from patient-authored text (PAT) by learning lexico-syntactic patterns from data annotated with seed dictionaries.

BACKGROUND AND SIGNIFICANCE: Despite the increasing quantity of PAT (eg, online discussion threads), tools for identifying medical entities in PAT are limited. When applied to PAT, existing tools either fail to identify specific entity types or perform poorly. Identification of SC and DT terms in PAT would enable exploration of efficacy and side effects for not only pharmaceutical drugs, but also for home remedies and components of daily care.

MATERIALS AND METHODS: We use SC and DT term dictionaries compiled from online sources to label several discussion forums from MedHelp (http://www.medhelp.org). We then iteratively induce lexico-syntactic patterns corresponding strongly to each entity type to extract new SC and DT terms.

RESULTS: Our system is able to extract symptom descriptions and treatments absent from our original dictionaries, such as 'LADA', 'stabbing pain', and 'cinnamon pills'. Our system extracts DT terms with 58-70% F1 score and SC terms with 66-76% F1 score on two forums from MedHelp. We show improvements over MetaMap, OBA, a conditional random field-based classifier, and a previous pattern learning approach.

CONCLUSIONS: Our entity extractor based on lexico-syntactic patterns is a successful and preferable technique for identifying specific entity types in PAT. To the best of our knowledge, this is the first paper to extract SC and DT entities from PAT. We exhibit learning of informal terms often used in PAT but missing from typical dictionaries.
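The seed-and-bootstrap loop described in the abstract can be illustrated on toy data: seed terms label their occurrences, the surrounding context becomes a pattern, and the pattern then extracts new terms. The sentences, seed dictionaries, and two-word-context pattern shape below are all hypothetical simplifications of the paper's lexico-syntactic patterns, and only a single (non-iterative) round is shown.

```python
import re

# Hypothetical forum sentences (the paper uses MedHelp posts).
posts = [
    "I have been suffering from stabbing pain.",
    "She says she is suffering from LADA.",
    "My doctor put me on metformin.",
    "A friend put me on cinnamon pills.",
]

sc_seeds = {"stabbing pain"}   # symptoms and conditions
dt_seeds = {"metformin"}       # drugs and treatments

def induce_patterns(posts, seeds):
    """Step 1: take the two words preceding a seed term as a pattern."""
    patterns = set()
    for post in posts:
        for term in seeds:
            m = re.search(r"(\w+ \w+) " + re.escape(term), post)
            if m:
                patterns.add(m.group(1))
    return patterns

def extract(posts, patterns):
    """Step 2: apply each pattern to extract new candidate terms."""
    found = set()
    for post in posts:
        for pat in patterns:
            for m in re.finditer(re.escape(pat) + r" ([\w ]+?)\.", post):
                found.add(m.group(1))
    return found

sc_new = extract(posts, induce_patterns(posts, sc_seeds)) - sc_seeds
dt_new = extract(posts, induce_patterns(posts, dt_seeds)) - dt_seeds
```

The seed "stabbing pain" induces the pattern "suffering from", which in turn surfaces "LADA"; similarly "metformin" leads to "cinnamon pills". The full system repeats this induce/extract cycle with pattern scoring and filtering, which this one-round sketch omits.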


Subject(s)
Consumer Health Information , Data Mining/methods , Internet , Natural Language Processing , Diagnosis , Dictionaries as Topic , Disease , Drug Therapy , Health Records, Personal , Humans , Linguistics
5.
BMC Bioinformatics ; 13 Suppl 11: S9, 2012 Jun 26.
Article in English | MEDLINE | ID: mdl-22759463

ABSTRACT

BACKGROUND: We explore techniques for performing model combination between the UMass and Stanford biomedical event extraction systems. Both sub-components address event extraction as a structured prediction problem, and use dual decomposition (UMass) and parsing algorithms (Stanford) to find the best scoring event structure. Our primary focus is on stacking, where the predictions from the Stanford system are used as features in the UMass system. For comparison, we look at simpler model combination techniques such as intersection and union, which require only the outputs from each system and combine them directly.

RESULTS: First, we find that stacking substantially improves performance while intersection and union provide no significant benefits. Second, we investigate the graph properties of event structures and their impact on the combination of our systems. Finally, we trace the origins of events proposed by the stacked model to determine the role each system plays in different components of the output. We learn that, while stacking can propose novel event structures not seen in either base model, these events have extremely low precision. Removing these novel events improves our already state-of-the-art F1 to 56.6% on the test set of Genia (Task 1). Overall, the combined system formed via stacking ("FAUST") performed well in the BioNLP 2011 shared task. The FAUST system obtained 1st place in three out of four tasks: 1st place in Genia Task 1 (56.0% F1) and Task 2 (53.9%), 2nd place in the Epigenetics and Post-translational Modifications track (35.0%), and 1st place in the Infectious Diseases track (55.6%).

CONCLUSION: We present a state-of-the-art event extraction system that relies on the strengths of structured prediction and model combination through stacking. Akin to results on other tasks, stacking outperforms intersection and union and leads to very strong results. The utility of model combination hinges on complementary views of the data, and we show that our sub-systems capture different graph properties of event structures. Finally, by removing low precision novel events, we show that performance from stacking can be further improved.
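The three combination strategies compared above differ in what they consume: intersection and union need only the two output sets, while stacking folds one system's predictions into the other's scoring as a feature. The event edges, scores, and feature weight in this sketch are hypothetical; in the paper the stacked feature is learned inside the UMass dual-decomposition model rather than added with a fixed weight.

```python
# Each system proposes a set of (trigger, argument) event edges.
# These edges are invented for illustration.
umass = {("phosphorylates", "STAT3"), ("inhibits", "JAK1")}
stanford = {("phosphorylates", "STAT3"), ("binds", "GRB2")}

# Direct output combination needs only the predictions themselves.
intersection = umass & stanford   # high precision, lower recall
union = umass | stanford          # high recall, lower precision

def stacked_score(edge, base_score, stanford_preds, w_agree=1.0):
    """Stacking: re-score a UMass candidate with an extra feature
    asking whether Stanford also predicted this edge. The weight
    w_agree is a stand-in for a learned feature weight."""
    return base_score + (w_agree if edge in stanford_preds else 0.0)

# UMass candidate edges with (hypothetical) base model scores.
candidates = {("phosphorylates", "STAT3"): 0.4, ("inhibits", "JAK1"): 0.2}
stacked = {e for e, s in candidates.items()
           if stacked_score(e, s, stanford) > 0.5}
```

Here the edge both systems agree on survives stacking, while the low-scoring unconfirmed edge is dropped — the complementary-views intuition the abstract describes, reduced to a single feature.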


Subject(s)
Algorithms , Data Mining , Information Storage and Retrieval , Models, Theoretical , Natural Language Processing , Communicable Diseases , Epigenomics , Humans , Protein Processing, Post-Translational
6.
Trends Cogn Sci ; 10(7): 335-44, 2006 Jul.
Article in English | MEDLINE | ID: mdl-16784883

ABSTRACT

Probabilistic methods are providing new explanatory approaches to fundamental cognitive science questions of how humans structure, process and acquire language. This review examines probabilistic models defined over traditional symbolic structures. Language comprehension and production involve probabilistic inference in such models; and acquisition involves choosing the best model, given innate constraints and linguistic and other input. Probabilistic models can account for the learning and processing of language, while maintaining the sophistication of symbolic models. A recent burgeoning of theoretical developments and online corpus creation has enabled large models to be tested, revealing probabilistic constraints in processing, undermining acquisition arguments based on a perceived poverty of the stimulus, and suggesting fruitful links with probabilistic theories of categorization and ambiguity resolution in perception.


Subject(s)
Cognition/physiology , Comprehension/physiology , Concept Formation/physiology , Language Development , Models, Statistical , Brain/physiology , Humans , Phonetics , Probability Theory , Psycholinguistics , Reading , Semantics , Speech Perception/physiology , Uncertainty
7.
BMC Bioinformatics ; 6 Suppl 1: S5, 2005.
Article in English | MEDLINE | ID: mdl-15960839

ABSTRACT

BACKGROUND: Good automatic information extraction tools offer hope for automatic processing of the exploding biomedical literature, and successful named entity recognition is a key component for such tools.

METHODS: We present a maximum-entropy based system incorporating a diverse set of features for identifying gene and protein names in biomedical abstracts.

RESULTS: This system was entered in the BioCreative comparative evaluation and achieved a precision of 0.83 and recall of 0.84 in the "open" evaluation and a precision of 0.78 and recall of 0.85 in the "closed" evaluation.

CONCLUSION: Central contributions are rich use of features derived from the training data at multiple levels of granularity, a focus on correctly identifying entity boundaries, and the innovative use of several external knowledge sources including full MEDLINE abstracts and web searches.
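"Features at multiple levels of granularity" for a maximum-entropy tagger typically means combining surface forms, character-level shapes, affixes, and context words. The sketch below shows an illustrative subset of such features for one token; the exact feature templates are assumptions, not the paper's inventory, and the classifier itself (a logistic-regression-style model over these features) is omitted.

```python
def word_shape(token):
    """Collapse characters into classes: X upper, x lower, d digit.
    Gene names like 'BRCA1' get the distinctive shape 'XXXXd'."""
    shape = ""
    for ch in token:
        shape += ("X" if ch.isupper() else
                  "x" if ch.islower() else
                  "d" if ch.isdigit() else ch)
    return shape

def features(tokens, i):
    """Feature set for token i at several levels of granularity:
    the word itself, its shape, affixes, and neighbouring words."""
    tok = tokens[i]
    return {
        f"word={tok.lower()}",
        f"shape={word_shape(tok)}",
        f"prefix3={tok[:3].lower()}",
        f"suffix3={tok[-3:].lower()}",
        f"prev={tokens[i-1].lower() if i > 0 else '<s>'}",
        f"next={tokens[i+1].lower() if i + 1 < len(tokens) else '</s>'}",
    }

toks = "We mutated BRCA1 in vitro".split()
f = features(toks, 2)   # features for the token 'BRCA1'
```

Shape and affix features let the model generalize to gene names never seen in training, which is why they matter more here than in newswire NER.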


Subject(s)
Biomedical Research/classification , Genes , Literature , Proteins/classification , Biomedical Research/methods , Computational Biology/classification , Computational Biology/methods , Information Storage and Retrieval/classification , Information Storage and Retrieval/methods , Terminology as Topic