1.
Nature ; 630(8015): 181-188, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38778098

ABSTRACT

Digital pathology poses unique computational challenges, as a standard gigapixel slide may comprise tens of thousands of image tiles [1-3]. Prior models have often resorted to subsampling a small portion of tiles for each slide, thus missing the important slide-level context [4]. Here we present Prov-GigaPath, a whole-slide pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides from Providence, a large US health network comprising 28 cancer centres. The slides originated from more than 30,000 patients covering 31 major tissue types. To pretrain Prov-GigaPath, we propose GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. To scale GigaPath for slide-level learning with tens of thousands of image tiles, GigaPath adapts the newly developed LongNet [5] method to digital pathology. To evaluate Prov-GigaPath, we construct a digital pathology benchmark comprising 9 cancer subtyping tasks and 17 pathomics tasks, using both Providence and TCGA data [6]. With large-scale pretraining and ultra-large-context modelling, Prov-GigaPath attains state-of-the-art performance on 25 out of 26 tasks, with significant improvement over the second-best method on 18 tasks. We further demonstrate the potential of Prov-GigaPath on vision-language pretraining for pathology [7,8] by incorporating the pathology reports. In sum, Prov-GigaPath is an open-weight foundation model that achieves state-of-the-art performance on various digital pathology tasks, demonstrating the importance of real-world data and whole-slide modelling.
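The scale described in this abstract can be checked with quick arithmetic. The sketch below is illustrative only: the slide dimensions are hypothetical, `tile_count` is a helper name introduced here, and only the tile size (256 × 256) and the corpus totals (1.3 billion tiles, 171,189 slides) come from the abstract.

```python
import math


def tile_count(width_px: int, height_px: int, tile: int = 256) -> int:
    """Number of tile x tile patches needed to cover a slide (partial edge tiles count)."""
    return math.ceil(width_px / tile) * math.ceil(height_px / tile)


# Hypothetical 2-gigapixel slide; real slide dimensions vary widely.
slide_w, slide_h = 50_000, 40_000
tiles = tile_count(slide_w, slide_h)  # 196 * 157 = 30,772 tiles

# Average tiles per slide implied by the (rounded) corpus totals in the abstract.
avg_tiles_per_slide = 1_300_000_000 / 171_189  # roughly 7,594

print(tiles, round(avg_tiles_per_slide))
```

A full grid over a gigapixel slide lands in the tens of thousands of tiles, consistent with the abstract. The lower corpus-wide average is plausibly explained by smaller slides and by discarding background tiles, but the abstract does not say so; treat that as an assumption.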


Subject(s)
Datasets as Topic; Image Processing, Computer-Assisted; Machine Learning; Pathology, Clinical; Humans; Benchmarking; Image Processing, Computer-Assisted/methods; Neoplasms/classification; Neoplasms/diagnosis; Neoplasms/pathology; Pathology, Clinical/methods; Male; Female
2.
Proc ACM Int Conf Inf Knowl Manag ; 2022: 4470-4474, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36382341

ABSTRACT

With the ever-increasing abundance of biomedical articles, improving the accuracy of keyword search results becomes crucial for ensuring reproducible research. However, keyword extraction for biomedical articles is hard due to the existence of obscure keywords and the lack of a comprehensive benchmark. PubMedAKE is an author-assigned keyword extraction dataset that contains the title, abstract, and keywords of over 843,269 articles from the PubMed open access subset database. This dataset, publicly available on Zenodo, is the largest keyword extraction benchmark with sufficient samples to train neural networks. Experimental results using state-of-the-art baseline methods illustrate the need for developing automatic keyword extraction methods for biomedical literature.

3.
Proc Conf ; 2021: 155-161, 2021 Jun.
Article in English | MEDLINE | ID: mdl-35748887

ABSTRACT

To keep pace with the increased generation and digitization of documents, automated methods that can improve search, discovery and mining of the vast body of literature are essential. Keyphrases provide a concise representation by identifying salient concepts in a document. Various supervised approaches model keyphrase extraction using local context to predict the label for each token and perform much better than their unsupervised counterparts. Unfortunately, these approaches fail for short documents where the context is unclear. Moreover, keyphrases, which are usually the gist of a document, need to reflect its central theme. We propose a new extraction model that introduces a centrality constraint to enrich the word representation of a bidirectional long short-term memory (BiLSTM) network. Performance evaluation on two publicly available datasets demonstrates that our model outperforms existing state-of-the-art approaches. Our model is publicly available at https://github.com/ZHgero/keyphrases_centrality.git.

4.
Article in English | MEDLINE | ID: mdl-35775029

ABSTRACT

To keep pace with the increased generation and digitization of documents, automated methods that can improve search, discovery and mining of the vast body of literature are essential. Keyphrases provide a concise representation by identifying salient concepts in a document. Various supervised approaches model keyphrase extraction using local context to predict the label for each token and perform much better than their unsupervised counterparts. However, existing supervised datasets have too few annotated examples to train better deep learning models. In contrast, many domains have large amounts of unannotated data that can be leveraged to improve model performance in keyphrase extraction. We introduce a self-learning based model that incorporates uncertainty estimates to select instances from large-scale unlabeled data to augment the small labeled training set. Performance evaluation on a publicly available biomedical dataset demonstrates that our method improves the performance of keyphrase extraction over state-of-the-art models.
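The selection step of uncertainty-guided self-learning can be sketched as follows. This is a minimal illustration, not the authors' method: the abstract does not say how uncertainty is estimated, so predictive entropy is used here as a stand-in, and `select_confident`, `max_entropy`, and the toy probabilities are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_confident(unlabeled, predict_proba, max_entropy=0.3):
    """Keep unlabeled instances whose predicted distribution has low entropy
    and pair each with its most likely class as a pseudo-label."""
    selected = []
    for x in unlabeled:
        probs = predict_proba(x)
        if entropy(probs) <= max_entropy:
            label = max(range(len(probs)), key=probs.__getitem__)
            selected.append((x, label))
    return selected


# Toy model output: [P(not keyphrase), P(keyphrase)] per candidate (made-up values).
toy_probs = {
    "insulin resistance": [0.03, 0.97],
    "clinical outcome": [0.40, 0.60],
    "in this study": [0.95, 0.05],
}
picked = select_confident(list(toy_probs), toy_probs.__getitem__)
print(picked)
```

In a full self-training loop, the selected pairs would be added to the labeled set and the model retrained; the ambiguous candidate ("clinical outcome" in this toy data) is left out because its entropy exceeds the threshold.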

5.
IEEE Int Conf Healthc Inform ; 2021: 83-92, 2021 Aug.
Article in English | MEDLINE | ID: mdl-35079697

ABSTRACT

There is an increased adoption of electronic health record systems by a variety of hospitals and medical centers. This provides an opportunity to leverage automated computer systems in assisting healthcare workers. One of the least utilized but richest sources of patient information is unstructured clinical text. In this work, we develop CATAN, a chart-aware temporal attention network for learning patient representations from clinical notes. We introduce a novel representation where each note is considered a single unit, like a sentence, and composed of attention-weighted words. The notes in turn are aggregated into a patient representation using a second weighting unit, note attention. Unlike standard attention computations, which focus only on the content of the note, we incorporate the chart-time for each note as a constraint for attention calculation. This allows our model to focus on notes closer to the prediction time. Using the MIMIC-III dataset, we empirically show that our patient representation and attention calculation achieve the best performance in comparison with various state-of-the-art baselines for one-year mortality prediction and 30-day hospital readmission. Moreover, the attention weights can be used to offer transparency into our model's predictions.
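The idea of constraining note attention by chart-time can be sketched as below. The abstract does not give the exact formulation, so this is an assumption: each note's content score is penalized linearly by its distance from the prediction time before a softmax. The function names, the `decay` parameter, and the toy values are hypothetical.

```python
import math


def time_aware_attention(scores, hours_before_prediction, decay=0.01):
    """Softmax over content scores, each penalized by its time gap (assumed linear decay)."""
    adjusted = [s - decay * dt for s, dt in zip(scores, hours_before_prediction)]
    peak = max(adjusted)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(note_vectors, weights):
    """Weighted sum of note vectors, giving a single patient representation."""
    dim = len(note_vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, note_vectors)) for d in range(dim)]


# Three notes with identical content scores, charted 240, 48 and 2 hours before prediction.
weights = time_aware_attention([1.0, 1.0, 1.0], [240, 48, 2])
patient = aggregate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], weights)
print(weights, patient)
```

With equal content scores, the most recent note receives the largest weight, which matches the abstract's stated goal of focusing on notes closer to the prediction time. Setting `decay=0` recovers ordinary content-only attention.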

6.
AMIA Annu Symp Proc ; 2020: 1160-1169, 2020.
Article in English | MEDLINE | ID: mdl-33936492

ABSTRACT

Hospital-acquired pressure ulcer injury (PUI) is a primary nursing quality metric, reflecting the caliber of nursing care within a hospital. Prior studies have used the Braden scale and structured data from electronic health records to detect or predict PUI, while the informative unstructured clinical notes have not been used. We propose automated PUI detection using a novel negation-detection algorithm applied to unstructured clinical notes. Our detection framework is on-demand, requiring minimal cost. In application to the MIMIC-III dataset, the text features produced using our algorithm resulted in improved PUI detection when evaluated using logistic regression, random forests, and neural networks compared to text features without negation detection. Exploratory analysis reveals substantial overlap between key classifier features and leading clinical attributes of PUI, adding interpretability to our solution. Our method could also considerably reduce nurses' evaluations by automatically detecting most cases, leaving only the most uncertain cases for nursing assessment.
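To illustrate why negation handling changes the text features, here is a simple cue-and-window tagger in the style of rule-based negation detectors. It is not the novel algorithm from this abstract, whose details are not given here; the cue list, the window size, and the `NOT_` prefix are assumptions made for the example.

```python
import re

NEGATION_CUES = {"no", "not", "denies", "without", "negative"}  # hypothetical cue list
WINDOW = 4  # hypothetical: number of tokens a cue negates


def tag_negated(text, window=WINDOW):
    """Prefix tokens that follow a negation cue with NOT_, up to `window` tokens
    or until the sentence ends ('.' or ';'). Digits and other symbols are dropped."""
    tokens = re.findall(r"[a-z]+|[.;]", text.lower())
    out = []
    remaining = 0
    for tok in tokens:
        if tok in {".", ";"}:
            remaining = 0  # sentence boundary closes the negation scope
            continue
        if tok in NEGATION_CUES:
            remaining = window  # open a new negation scope
            continue
        if remaining > 0:
            out.append("NOT_" + tok)
            remaining -= 1
        else:
            out.append(tok)
    return out


features = tag_negated("No pressure ulcer noted. Stage 2 pressure ulcer on sacrum.")
print(features)
```

Without the tagging step, both sentences would contribute the same "pressure" and "ulcer" tokens to a bag-of-words classifier; with it, the negated mention becomes a distinct feature.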


Subject(s)
Algorithms; Electronic Health Records; Pressure Ulcer; Humans; Logistic Models; Neural Networks, Computer
7.
J Biomed Inform ; 100S: 100047, 2019.
Article in English | MEDLINE | ID: mdl-34384576

ABSTRACT

Distributed semantic representation of biomedical text can be beneficial for text classification, named entity recognition, query expansion, human comprehension, and information retrieval. Despite the success of high-quality vector space models such as Word2Vec and GloVe, they only provide unigram word representations and the semantics for multi-word phrases can only be approximated by composition. This is problematic in biomedical text processing where technical phrases for diseases, symptoms, and drugs should be represented as single entities to capture the correct meaning. In this paper, we introduce PMCVec, an unsupervised technique that generates important phrases from PubMed abstracts and learns embeddings for single words and multi-word phrases simultaneously. Evaluations performed on benchmark datasets produce significant performance gains both qualitatively and quantitatively.
