Results 1 - 12 of 12
1.
Nature ; 630(8015): 181-188, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38778098

ABSTRACT

Digital pathology poses unique computational challenges, as a standard gigapixel slide may comprise tens of thousands of image tiles [1-3]. Prior models have often resorted to subsampling a small portion of tiles for each slide, thus missing the important slide-level context [4]. Here we present Prov-GigaPath, a whole-slide pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides from Providence, a large US health network comprising 28 cancer centres. The slides originated from more than 30,000 patients covering 31 major tissue types. To pretrain Prov-GigaPath, we propose GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. To scale GigaPath for slide-level learning with tens of thousands of image tiles, GigaPath adapts the newly developed LongNet [5] method to digital pathology. To evaluate Prov-GigaPath, we construct a digital pathology benchmark comprising 9 cancer subtyping tasks and 17 pathomics tasks, using both Providence and TCGA data [6]. With large-scale pretraining and ultra-large-context modelling, Prov-GigaPath attains state-of-the-art performance on 25 out of 26 tasks, with significant improvement over the second-best method on 18 tasks. We further demonstrate the potential of Prov-GigaPath on vision-language pretraining for pathology [7,8] by incorporating the pathology reports. In sum, Prov-GigaPath is an open-weight foundation model that achieves state-of-the-art performance on various digital pathology tasks, demonstrating the importance of real-world data and whole-slide modelling.


Subjects
Datasets as Topic; Image Processing, Computer-Assisted; Machine Learning; Pathology, Clinical; Humans; Benchmarking; Image Processing, Computer-Assisted/methods; Neoplasms/classification; Neoplasms/diagnosis; Neoplasms/pathology; Pathology, Clinical/methods; Male; Female
2.
Patterns (N Y) ; 4(4): 100726, 2023 Apr 14.
Article in English | MEDLINE | ID: mdl-37123439

ABSTRACT

Most detailed patient information in real-world data (RWD) is only consistently available in free-text clinical documents. Manual curation is expensive and time consuming. Developing natural language processing (NLP) methods for structuring RWD is thus essential for scaling real-world evidence generation. We propose leveraging patient-level supervision from medical registries, which are often readily available and capture key patient information, for general RWD applications. We conduct an extensive study on 135,107 patients from the cancer registry of a large integrated delivery network (IDN) comprising healthcare systems in five western US states. Our deep-learning methods attain test area under the receiver operating characteristic curve (AUROC) values of 94%-99% for key tumor attributes and comparable performance on held-out data from separate health systems and states. Ablation results demonstrate the superiority of these advanced deep-learning methods. Error analysis shows that our NLP system sometimes even corrects errors in registrar labels.

3.
Patterns (N Y) ; 4(4): 100729, 2023 Apr 14.
Article in English | MEDLINE | ID: mdl-37123444

ABSTRACT

Large neural language models have transformed modern natural language processing (NLP) applications. However, fine-tuning such models for specific tasks remains challenging as model size increases, especially with small labeled datasets, which are common in biomedical NLP. We conduct a systematic study on fine-tuning stability in biomedical NLP. We show that fine-tuning performance may be sensitive to pretraining settings and conduct an exploration of techniques for addressing fine-tuning instability. We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications. Specifically, freezing lower layers is helpful for standard BERT-BASE models, while layerwise decay is more effective for BERT-LARGE and ELECTRA models. For low-resource text similarity tasks, such as BIOSSES, reinitializing the top layers is the optimal strategy. Overall, domain-specific vocabulary and pretraining facilitate robust models for fine-tuning. Based on these findings, we establish a new state of the art on a wide range of biomedical NLP applications.
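
As a rough illustration of two of the stabilization techniques named in this abstract, the sketch below assigns per-layer learning rates under layerwise decay and zeroes the rates of the lowest layers to mimic freezing. The function names, the 12-layer depth, the base rate of 1e-4 and the decay factor of 0.9 are illustrative assumptions, not settings reported in the abstract.

```python
def layerwise_learning_rates(num_layers, base_lr, decay):
    """Return one learning rate per layer, index 0 = lowest layer.

    The top layer keeps base_lr; each layer below it is scaled by `decay`
    once more, so lower layers change more slowly during fine-tuning.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def freeze_lower_layers(rates, num_frozen):
    """Set the learning rate of the lowest `num_frozen` layers to zero."""
    return [0.0 if i < num_frozen else lr for i, lr in enumerate(rates)]


if __name__ == "__main__":
    # Hypothetical settings chosen only for demonstration.
    rates = layerwise_learning_rates(num_layers=12, base_lr=1e-4, decay=0.9)
    print(rates[0], rates[-1])
    print(freeze_lower_layers(rates, num_frozen=4)[:5])
```

In a real training setup these per-layer rates would typically be passed to the optimizer as separate parameter groups; which technique helps most depends on the model, as the abstract notes.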

4.
Nat Commun ; 14(1): 738, 2023 Feb 10.
Article in English | MEDLINE | ID: mdl-36759510

ABSTRACT

Existing annotation paradigms rely on controlled vocabularies, where each data instance is classified into one term from a predefined set of controlled vocabularies. This paradigm restricts the analysis to concepts that are known and well-characterized. Here, we present the novel multilingual translation method BioTranslator to address this problem. BioTranslator takes a user-written textual description of a new concept and then translates this description to a non-text biological data instance. The key idea of BioTranslator is to develop a multilingual translation framework, where multiple modalities of biological data are all translated to text. We demonstrate how BioTranslator enables the identification of novel cell types using only a textual description and how BioTranslator can be further generalized to protein function prediction and drug target identification. Our tool frees scientists from limiting their analyses within predefined controlled vocabularies, enabling them to interact with biological data using free text.


Subjects
Multilingualism; Vocabulary, Controlled; Proteins
5.
Brief Bioinform ; 23(6), 2022 Nov 19.
Article in English | MEDLINE | ID: mdl-36156661

ABSTRACT

Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Of the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first has been extensively studied in the biomedical domain, with models such as BioBERT and PubMedBERT. While these models have achieved great success on a variety of discriminative downstream biomedical tasks, their lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. In particular, we obtain F1 scores of 44.98%, 38.42% and 40.76% on the BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, setting a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature in generating fluent descriptions for biomedical terms.
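
For readers unfamiliar with the metric, the F1 scores in this abstract are the harmonic mean of precision and recall. The sketch below computes F1 over sets of extracted relation triples. Treating a prediction as correct only on an exact triple match is an assumption made here for illustration; the abstract does not state the matching criteria of each benchmark, and the function name and example triples are hypothetical.

```python
def relation_f1(predicted, gold):
    """F1 over relation triples, counting only exact matches as correct."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Placeholder triples of the form (head entity, relation, tail entity).
    predicted = {("drug_1", "interacts_with", "drug_2"),
                 ("drug_3", "interacts_with", "drug_4")}
    gold = {("drug_1", "interacts_with", "drug_2"),
            ("drug_5", "interacts_with", "drug_6")}
    print(relation_f1(predicted, gold))
```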


Subjects
Data Mining; Natural Language Processing
6.
PLoS Comput Biol ; 15(6): e1006758, 2019 Jun.
Article in English | MEDLINE | ID: mdl-31246951

ABSTRACT

Many biological studies involve either (i) manipulating some aspect of a cell or its environment and then simultaneously measuring the effect on thousands of genes, or (ii) systematically manipulating each gene and then measuring the effect on some response of interest. A common challenge that arises in these studies is to explain how genes identified as relevant in the given experiment are organized into a subnetwork that accounts for the response of interest. The task of inferring a subnetwork is typically dependent on the information available in publicly available, structured databases, which suffer from incompleteness. However, a wealth of potentially relevant information resides in the scientific literature, such as information about genes associated with certain concepts of interest, as well as interactions that occur among various biological entities. We contend that by exploiting this information, we can improve the explanatory power and accuracy of subnetwork inference in multiple applications. Here we propose and investigate several ways in which information extracted from the scientific literature can be used to augment subnetwork inference. We show that we can use literature-extracted information to (i) augment the set of entities identified as being relevant in a subnetwork inference task, (ii) augment the set of interactions used in the process, and (iii) support targeted browsing of a large inferred subnetwork by identifying entities and interactions that are closely related to concepts of interest. We use this approach to uncover the pathways involved in interactions between a virus and a host cell, and the pathways that are regulated by a transcription factor associated with breast cancer. Our experimental results demonstrate that these approaches can provide more accurate and more interpretable subnetworks. 
Integer program code, background network data, and pathfinding code are available at https://github.com/Craven-Biostat-Lab/subnetwork_inference.


Subjects
Computational Biology/methods; Data Mining/methods; Gene Regulatory Networks/genetics; Protein Interaction Mapping/methods; Protein Interaction Maps/genetics; Databases, Genetic; HIV; HIV Infections/genetics; HIV Infections/virology; Humans
7.
Proc Natl Acad Sci U S A ; 114(36): E7554-E7563, 2017 Sep 05.
Article in English | MEDLINE | ID: mdl-28784769

ABSTRACT

Translating the genetic and epigenetic heterogeneity underlying human cancers into therapeutic strategies is an ongoing challenge. Large-scale sequencing efforts have uncovered a spectrum of mutations in many hematologic malignancies, including acute myeloid leukemia (AML), suggesting that combinations of agents will be required to treat these diseases effectively. Combinatorial approaches will also be critical for combating the emergence of genetically heterogeneous subclones, rescue signals in the microenvironment, and tumor-intrinsic feedback pathways that all contribute to disease relapse. To identify novel and effective drug combinations, we performed ex vivo sensitivity profiling of 122 primary patient samples from a variety of hematologic malignancies against a panel of 48 drug combinations. The combinations were designed as drug pairs that target nonoverlapping biological pathways and comprise drugs from different classes, preferably with Food and Drug Administration approval. A combination ratio (CR) was derived for each drug pair, and CRs were evaluated with respect to diagnostic categories as well as against genetic, cytogenetic, and cellular phenotypes of specimens from the two largest disease categories: AML and chronic lymphocytic leukemia (CLL). Nearly all tested combinations involving a BCL2 inhibitor showed additional benefit in patients with myeloid malignancies, whereas select combinations involving PI3K, CSF1R, or bromodomain inhibitors showed preferential benefit in lymphoid malignancies. Expanded analyses of patients with AML and CLL revealed specific patterns of ex vivo drug combination efficacy that were associated with select genetic, cytogenetic, and phenotypic disease subsets, warranting further evaluation. These findings highlight the heuristic value of an integrated functional genomic approach to the identification of novel treatment strategies for hematologic malignancies.


Subjects
Antineoplastic Agents/therapeutic use; Hematologic Neoplasms/drug therapy; Leukemia, Lymphocytic, Chronic, B-Cell/drug therapy; Leukemia, Myeloid, Acute/drug therapy; Drug Combinations; Hematologic Neoplasms/metabolism; Humans; Leukemia, Lymphocytic, Chronic, B-Cell/metabolism; Leukemia, Myeloid, Acute/metabolism; Mutation/drug effects; Phosphatidylinositol 3-Kinases/metabolism; Proto-Oncogene Proteins c-bcl-2/metabolism; Receptors, Granulocyte-Macrophage Colony-Stimulating Factor/metabolism
8.
Nat Genet ; 49(9): 1319-1325, 2017 Sep.
Article in English | MEDLINE | ID: mdl-28783162

ABSTRACT

In this study, we used insurance claims for over one-third of the entire US population to create a subset of 128,989 families (481,657 unique individuals). We then used these data to (i) estimate the heritability and familial environmental patterns of 149 diseases and (ii) infer the genetic and environmental correlations for disease pairs from a set of 29 complex diseases. The majority (52 of 65) of our study's heritability estimates matched earlier reports, and 84 of our estimates appear to have been obtained for the first time. We used correlation matrices to compute environmental and genetic disease classifications and corresponding reliability measures. Among unexpected observations, we found that migraine, typically classified as a disease of the central nervous system, appeared to be most genetically similar to irritable bowel syndrome and most environmentally similar to cystitis and urethritis, all of which are inflammatory diseases.


Subjects
Disease/genetics; Environment; Genetic Predisposition to Disease/genetics; Insurance Claim Reporting/statistics & numerical data; Cystitis/classification; Cystitis/genetics; Disease/classification; Female; Humans; Inflammation/classification; Inflammation/genetics; Inheritance Patterns/genetics; Irritable Bowel Syndrome/classification; Irritable Bowel Syndrome/genetics; Linear Models; Male; Migraine Disorders/classification; Migraine Disorders/genetics; Multivariate Analysis; Pedigree; Risk Factors; United States; Urethritis/classification; Urethritis/genetics
9.
PLoS Biol ; 15(6): e2002477, 2017 Jun.
Article in English | MEDLINE | ID: mdl-28594819

ABSTRACT

Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
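
The text-mining step described in this abstract can be pictured with a minimal sketch: scan article text for accession-like identifiers, then keep those that are not in a set of publicly released accessions. The regular expression (GEO series IDs of the form GSE plus digits, SRA IDs of the form SRP/SRR/SRX/SRS plus digits), the function names, and the pre-supplied set of public accessions are assumptions for illustration; the abstract does not describe Wide-Open's actual rules, and a real system would query the repositories instead of a local set. The accession numbers in the example are placeholders.

```python
import re

# Assumed identifier shapes: GEO series (GSE...) and SRA (SRP/SRR/SRX/SRS...).
ACCESSION_PATTERN = re.compile(r"\b(GSE\d+|SR[PRXS]\d+)\b")


def find_accessions(text):
    """Return the sorted, de-duplicated accession IDs mentioned in the text."""
    return sorted(set(ACCESSION_PATTERN.findall(text)))


def overdue_datasets(article_text, public_accessions):
    """Return accessions cited in a published article but not yet public."""
    return [acc for acc in find_accessions(article_text)
            if acc not in public_accessions]


if __name__ == "__main__":
    # Placeholder article text and release status, not real data.
    text = "Data are in GEO (GSE12345) and SRA (SRP000001); see also GSE12345."
    print(find_accessions(text))
    print(overdue_datasets(text, public_accessions={"GSE12345"}))
```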


Subjects
Access to Information; Biomedical Research/methods; Databases, Genetic; Animals; Biomedical Research/trends; Biotechnology/trends; Computational Biology/trends; Data Mining; Databases, Bibliographic; Databases, Genetic/standards; Databases, Genetic/trends; Gene Expression Regulation; Humans; Library Automation; Molecular Sequence Data; National Library of Medicine (U.S.); Periodicals as Topic; Reproducibility of Results; Time Factors; United States
10.
Pac Symp Biocomput ; : 120-31, 2015.
Article in English | MEDLINE | ID: mdl-25592574

ABSTRACT

Biological pathways are central to understanding complex diseases such as cancer. The majority of this knowledge is scattered in the vast and rapidly growing research literature. To automate knowledge extraction, machine learning approaches typically require annotated examples, which are expensive and time-consuming to acquire. Recently, there has been increasing interest in leveraging databases for distant supervision in knowledge extraction, but existing applications focus almost exclusively on newswire domains. In this paper, we present the first attempt to formulate the distant supervision problem for pathway extraction and apply a state-of-the-art method to extracting pathway interactions from PubMed abstracts. Experiments show that distant supervision can effectively compensate for the lack of annotation, attaining an accuracy approaching supervised results. From 22 million PubMed abstracts, we extracted 1.5 million pathway interactions at a precision of 25%. More than 10% of interactions are mentioned in the context of one or more cancer types, analysis of which yields interesting insights.


Subjects
Data Mining/methods; Neoplasms/genetics; Neoplasms/metabolism; Computational Biology; Databases, Genetic; Humans; Knowledge Bases; Metabolic Networks and Pathways/genetics; Mutation; Oncogenes; PubMed; Supervised Machine Learning
11.
Bioinformatics ; 30(19): 2840-2, 2014 Oct.
Article in English | MEDLINE | ID: mdl-24939151

ABSTRACT

MOTIVATION: Advances in sequencing technology have led to an exponential growth of genomics data, yet it remains a formidable challenge to interpret such data for identifying disease genes and drug targets. There has been increasing interest in adopting a systems approach that incorporates prior knowledge such as gene networks and genotype-phenotype associations. The majority of such knowledge resides in text such as journal publications, which has been undergoing its own exponential growth. It has thus become a significant bottleneck to identify relevant knowledge for genomic interpretation as well as to keep up with new genomics findings. RESULTS: In the Literome project, we have developed an automatic curation system to extract genomic knowledge from PubMed articles and made this knowledge available in the cloud with a Web site to facilitate browsing, searching and reasoning. Currently, Literome focuses on two types of knowledge most pertinent to genomic medicine: directed genic interactions such as pathways and genotype-phenotype associations. Users can search for interacting genes and the nature of the interactions, as well as diseases and drugs associated with a single nucleotide polymorphism or gene. Users can also search for indirect connections between two entities, e.g. a gene and a disease might be linked because an interacting gene is associated with a related disease. AVAILABILITY AND IMPLEMENTATION: Literome is freely available at literome.azurewebsites.net. Download for non-commercial use is available via Web services.
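
The indirect-connection search mentioned in this abstract amounts to a two-hop lookup: from a gene to its interaction partners, then from each partner to its associated diseases. The sketch below shows that idea on toy dictionaries. The data structures, the function name and the gene and disease labels are hypothetical; the abstract does not describe how Literome stores or queries its knowledge.

```python
def indirect_links(gene, interactions, associations):
    """Return (intermediate_gene, disease) pairs reachable from `gene` in two hops.

    interactions: dict mapping a gene to the set of genes it interacts with.
    associations: dict mapping a gene to the set of diseases linked to it.
    """
    links = []
    for partner in sorted(interactions.get(gene, ())):
        for disease in sorted(associations.get(partner, ())):
            links.append((partner, disease))
    return links


if __name__ == "__main__":
    # Toy knowledge base with placeholder names.
    interactions = {"GENE_A": {"GENE_B", "GENE_C"}}
    associations = {"GENE_B": {"disease X"}, "GENE_C": {"disease Y"}}
    for partner, disease in indirect_links("GENE_A", interactions, associations):
        print(f"GENE_A -> {partner} -> {disease}")
```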


Subjects
Computational Biology/methods; Genomics/methods; Polymorphism, Single Nucleotide; PubMed; Algorithms; Automation; Genetic Association Studies; Genome; Genotype; Humans; Internet; Knowledge Bases; Phenotype; Software
12.
Sci Rep ; 3: 1099, 2013.
Article in English | MEDLINE | ID: mdl-23346356

ABSTRACT

We present an approach for genome-wide association analysis with improved power on the Wellcome Trust data consisting of seven common phenotypes and shared controls. We achieved improved power by expanding the control set to include other disease cohorts, multiple races, and closely related individuals. Within this setting, we conducted exhaustive univariate and epistatic interaction association analyses. Use of the expanded control set identified more known associations with Crohn's disease and potential new biology, including several plausible epistatic interactions in several diseases. Our work suggests that carefully combining data from large repositories could reveal many new biological insights through increased power. As a community resource, all results have been made available through an interactive web server.


Subjects
Epistasis, Genetic/genetics; Genetic Predisposition to Disease; Polymorphism, Single Nucleotide; Cohort Studies; Crohn Disease/genetics; Data Interpretation, Statistical; Genome-Wide Association Study/methods; Humans; Phenotype