Improving natural language information extraction from cancer pathology reports using transfer learning and zero-shot string similarity.

Park, Briton; Altieri, Nicholas; DeNero, John; Odisho, Anobel Y; Yu, Bin.

JAMIA Open ; 4(3): ooab085, 2021 Jul.

Article in English | MEDLINE | ID: mdl-34604711

ABSTRACT

OBJECTIVE: We develop natural language processing (NLP) methods capable of accurately classifying tumor attributes from pathology reports given minimal labeled examples. Our hierarchical cancer to cancer transfer (HCTC) and zero-shot string similarity (ZSS) methods are designed to exploit shared information between cancers and auxiliary class features, respectively, to boost performance using enriched annotations which give both location-based information and document level labels for each pathology report. MATERIALS AND METHODS: Our data consists of 250 pathology reports each for kidney, colon, and lung cancer from 2002 to 2019 from a single institution (UCSF). For each report, we classified 5 attributes: procedure, tumor location, histology, grade, and presence of lymphovascular invasion. We develop novel NLP techniques involving transfer learning and string similarity trained on enriched annotations. We compare HCTC and ZSS methods to the state-of-the-art including conventional machine learning methods as well as deep learning methods. RESULTS: For our HCTC method, we see an improvement of up to 0.1 micro-F1 score and 0.04 macro-F1 averaged across cancer and applicable attributes. For our ZSS method, we see an improvement of up to 0.26 micro-F1 and 0.23 macro-F1 averaged across cancer and applicable attributes. These comparisons are made after adjusting training data sizes to correct for the 20% increase in annotation time for enriched annotations compared to ordinary annotations. CONCLUSIONS: Methods based on transfer learning across cancers and augmenting information methods with string similarity priors can significantly reduce the amount of labeled data needed for accurate information extraction from pathology reports.
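
The results above are reported as both micro-F1 and macro-F1. As a minimal sketch of how the two averages differ (the function names here are illustrative and not taken from the paper), micro-F1 pools true-positive, false-positive, and false-negative counts over all classes, while macro-F1 averages the per-class F1 scores so that rare classes count as much as common ones:

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    # Micro: pool the counts across classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro
```

For single-label classification, micro-F1 equals accuracy, so a model that ignores a rare class can still score well on micro-F1 while macro-F1 drops sharply.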

Supervised line attention for tumor attribute classification from pathology reports: Higher performance with less data.

Altieri, Nicholas; Park, Briton; Olson, Mara; DeNero, John; Odisho, Anobel Y; Yu, Bin.

J Biomed Inform ; 122: 103872, 2021 Oct.

Article in English | MEDLINE | ID: mdl-34411709

ABSTRACT

OBJECTIVE: We aim to build an accurate machine learning-based system for classifying tumor attributes from cancer pathology reports in the presence of a small amount of annotated data, motivated by the expensive and time-consuming nature of pathology report annotation. An enriched labeling scheme that includes the location of relevant information along with the final label is used along with a corresponding hierarchical method for classifying reports that leverages these enriched annotations. MATERIALS AND METHODS: Our data consists of 250 colon cancer and 250 kidney cancer pathology reports from 2002 to 2019 at the University of California, San Francisco. For each report, we classify attributes such as procedure performed, tumor grade, and tumor site. For each attribute and document, an annotator trained by an oncologist labeled both the value of that attribute as well as the specific lines in the document that indicated the value. We develop a model that uses these enriched annotations that first predicts the relevant lines of the document, then predicts the final value given the predicted lines. We compare our model to multiple state-of-the-art methods for classifying tumor attributes from pathology reports. RESULTS: Our results show that across colon and kidney cancers and varying training set sizes, our hierarchical method consistently outperforms state-of-the-art methods. Furthermore, performance comparable to these methods can be achieved with approximately half the amount of labeled data. CONCLUSION: Document annotations that are enriched with location information are shown to greatly increase the sample efficiency of machine learning methods for classifying attributes of pathology reports.

Subject(s)

Neoplasms; Attention; Humans; Machine Learning; Research Report
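
The abstract above describes a two-stage model: first predict which lines of a report are relevant to an attribute, then predict the attribute value from those lines. The sketch below shows only that control flow. The scorer callables, the threshold, and the fallback to all lines are assumptions made for illustration; they stand in for trained models and are not the paper's supervised line attention method.

```python
def classify_report(report_text, line_scorer, label_scorer, labels, threshold=0.5):
    """Two-stage classification: select relevant lines, then pick a label.

    line_scorer(line) -> relevance score in [0, 1]      (hypothetical stand-in)
    label_scorer(context, label) -> score for a label   (hypothetical stand-in)
    """
    lines = [ln.strip() for ln in report_text.splitlines() if ln.strip()]
    # Stage 1: keep only the lines the line-level model marks as relevant.
    relevant = [ln for ln in lines if line_scorer(ln) >= threshold]
    if not relevant:
        # Assumed fallback: use the whole report if no line is selected.
        relevant = lines
    # Stage 2: score each candidate label against the selected lines only.
    context = " ".join(relevant)
    best = max(labels, key=lambda lab: label_scorer(context, lab))
    return best, relevant
```

With enriched annotations, the line-level labels can supervise the first stage directly, which is one plausible reason such a model needs fewer labeled reports.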

Natural language processing systems for pathology parsing in limited data environments with uncertainty estimation.

Odisho, Anobel Y; Park, Briton; Altieri, Nicholas; DeNero, John; Cooperberg, Matthew R; Carroll, Peter R; Yu, Bin.

JAMIA Open ; 3(3): 431-438, 2020 Oct.

Article in English | MEDLINE | ID: mdl-33381748

ABSTRACT

OBJECTIVE: Cancer is a leading cause of death, but much of the diagnostic information is stored as unstructured data in pathology reports. We aim to improve uncertainty estimates of machine learning-based pathology parsers and evaluate performance in low data settings. MATERIALS AND METHODS: Our data comes from the Urologic Outcomes Database at UCSF which includes 3232 annotated prostate cancer pathology reports from 2001 to 2018. We approach 17 separate information extraction tasks, involving a wide range of pathologic features. To handle the diverse range of fields, we required 2 statistical models, a document classification method for pathologic features with a small set of possible values and a token extraction method for pathologic features with a large set of values. For each model, we used isotonic calibration to improve the model's estimates of its likelihood of being correct. RESULTS: Our best document classifier method, a convolutional neural network, achieves a weighted F1 score of 0.97 averaged over 12 fields and our best extraction method achieves an accuracy of 0.93 averaged over 5 fields. The performance saturates as a function of dataset size with as few as 128 data points. Furthermore, while our document classifier methods have reliable uncertainty estimates, our extraction-based methods do not, but after isotonic calibration, expected calibration error drops to below 0.03 for all extraction fields. CONCLUSIONS: We find that when applying machine learning to pathology parsing, large datasets may not always be needed, and that calibration methods can improve the reliability of uncertainty estimates.
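
The abstract above reports expected calibration error (ECE) before and after isotonic calibration. As a minimal sketch, assuming the common equal-width binning definition (the bin count and the half-open bin boundaries are assumptions, not details from the paper), ECE is the size-weighted average gap between mean confidence and accuracy within each confidence bin:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1].

    confidences: predicted probability of being correct, one per example
    correct:     booleans, True where the prediction was right
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the fraction of examples it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A well-calibrated model that reports 0.9 confidence should be right about 90% of the time, which gives an ECE near zero.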

Modeling the spread of the Zika virus using topological data analysis.

Lo, Derek; Park, Briton.

PLoS One ; 13(2): e0192120, 2018.

Article in English | MEDLINE | ID: mdl-29438377

ABSTRACT

Zika virus (ZIKV), a disease spread primarily through the Aedes aegypti mosquito, was identified in Brazil in 2015 and was declared a global health emergency by the World Health Organization (WHO). Epidemiologists often use common state-level attributes such as population density and temperature to determine the spread of disease. By applying techniques from topological data analysis, we believe that epidemiologists will be able to better predict how ZIKV will spread. We use the Vietoris-Rips filtration on high-density mosquito locations in Brazil to create simplicial complexes, from which we extract homology group generators. Previously epidemiologists have not relied on topological data analysis to model disease spread. Evaluating our model on ZIKV case data in the states of Brazil demonstrates the value of these techniques for the improved assessment of vector-borne diseases.

Subject(s)

Models, Theoretical; Zika Virus Infection/transmission; Humans
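
The abstract above builds simplicial complexes with the Vietoris-Rips filtration. As a minimal sketch of one step of such a filtration, the code below builds the edges and triangles of a Rips complex at a single scale and counts connected components. The convention that two points are joined when their distance is at most `eps` is an assumption (some texts use `2 * eps`), and the sketch stops at components rather than extracting the homology generators the paper uses.

```python
from itertools import combinations
from math import dist


def rips_complex(points, eps):
    """Edges and triangles of the Vietoris-Rips complex at scale eps."""
    n = len(points)
    # An edge joins every pair of points within distance eps.
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= eps]
    edge_set = set(edges)
    # A triangle is included when all three of its edges are present.
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if (i, j) in edge_set and (i, k) in edge_set and (j, k) in edge_set]
    return edges, triangles


def count_components(n, edges):
    """Number of connected components (Betti number 0) via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})
```

Repeating this for increasing values of `eps` gives the nested sequence of complexes that makes up the filtration.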

Distributions of Mutational Effects and the Estimation of Directional Selection in Divergent Lineages of Arabidopsis thaliana.

Park, Briton; Rutter, Matthew T; Fenster, Charles B; Symonds, V Vaughan; Ungerer, Mark C; Townsend, Jeffrey P.

Genetics ; 206(4): 2105-2117, 2017 Aug.

Article in English | MEDLINE | ID: mdl-28550014

ABSTRACT

Mutations are crucial to evolution, providing the ultimate source of variation on which natural selection acts. Due to their key role, the distribution of mutational effects on quantitative traits is a key component to any inference regarding historical selection on phenotypic traits. In this paper, we expand on a previously developed test for selection that could be conducted assuming a Gaussian mutation effect distribution by developing approaches to also incorporate any of a family of heavy-tailed Laplace distributions of mutational effects. We apply the test to detect directional natural selection on five traits along the divergence of Columbia and Landsberg lineages of Arabidopsis thaliana, constituting the first test for natural selection in any organism using quantitative trait locus and mutation accumulation data to quantify the intensity of directional selection on a phenotypic trait. We demonstrate that the results of the test for selection can depend on the mutation effect distribution specified. Using the distributions exhibiting the best fit to mutation accumulation data, we infer that natural directional selection caused divergence in the rosette diameter and trichome density traits of the Columbia and Landsberg lineages.

Subject(s)

Arabidopsis/genetics; Evolution, Molecular; Mutation Accumulation; Selection, Genetic; Models, Genetic; Quantitative Trait Loci
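
The abstract above contrasts a Gaussian mutation effect distribution with heavy-tailed Laplace alternatives. For reference, the standard Gaussian and symmetric Laplace densities are given below; the abstract does not specify the exact family of Laplace distributions used, so the symmetric form and the symbols $\mu$, $\sigma$, and $b$ are assumptions for illustration only.

```latex
f_{\mathrm{Gauss}}(x \mid \mu, \sigma)
  = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
f_{\mathrm{Laplace}}(x \mid \mu, b)
  = \frac{1}{2b} \exp\!\left(-\frac{|x-\mu|}{b}\right)
```

The Laplace tail decays as $\exp(-|x-\mu|/b)$ rather than $\exp(-(x-\mu)^2/2\sigma^2)$, so mutations of large effect are more probable under it than under a Gaussian of the same variance ($2b^2 = \sigma^2$).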
