Pesquisa | Portal Regional da BVS

1.

Me-LLaMA: Foundation Large Language Models for Medical Applications.

Xie, Qianqian; Chen, Qingyu; Chen, Aokun; Peng, Cheng; Hu, Yan; Lin, Fongci; Peng, Xueqing; Huang, Jimin; Zhang, Jeffrey; Keloth, Vipina; Zhou, Xinyu; He, Huan; Ohno-Machado, Lucila; Wu, Yonghui; Xu, Hua; Bian, Jiang.

Res Sq ; 2024 May 22.

Artigo em Inglês | MEDLINE | ID: mdl-38826372

RESUMO

Recent advancements in large language models (LLMs) such as ChatGPT and LLaMA have hinted at their potential to revolutionize medical applications, yet their application in clinical settings often reveals limitations due to a lack of specialized training on medical-specific data. In response to this challenge, this study introduces Me-LLaMA, a novel medical LLM family that includes foundation models - Me-LLaMA 13/70B, along with their chat-enhanced versions - Me-LLaMA 13/70B-chat, developed through continual pre-training and instruction tuning of LLaMA2 using large medical datasets. Our methodology leverages a comprehensive domain-specific data suite, including a large-scale, continual pre-training dataset with 129B tokens, an instruction tuning dataset with 214k samples, and a new medical evaluation benchmark (MIBE) across six critical medical tasks with 12 datasets. Our extensive evaluation using the MIBE shows that Me-LLaMA models achieve overall better performance than existing open-source medical LLMs in zero-shot, few-shot and supervised learning abilities. With task-specific instruction tuning, Me-LLaMA models outperform ChatGPT on 7 out of 8 datasets and GPT-4 on 5 out of 8 datasets. In addition, we investigated the catastrophic forgetting problem, and our results show that Me-LLaMA models outperform other open-source medical LLMs in mitigating this issue. Me-LLaMA is one of the largest open-source medical foundation LLMs that use both biomedical and clinical data. It exhibits superior performance across both general and medical tasks compared to other open-source medical LLMs, rendering it an attractive choice for medical AI applications. We release our models, datasets, and evaluation scripts at: https://github.com/BIDS-Xu-Lab/Me-LLaMA.

2.

Generative large language models are all-purpose text analytics engines: text-to-text learning is all your need.

Peng, Cheng; Yang, Xi; Chen, Aokun; Yu, Zehao; Smith, Kaleb E; Costa, Anthony B; Flores, Mona G; Bian, Jiang; Wu, Yonghui.

J Am Med Inform Assoc ; 2024 Apr 17.

Artigo em Inglês | MEDLINE | ID: mdl-38630580

RESUMO

OBJECTIVE: To solve major clinical natural language processing (NLP) tasks using a unified text-to-text learning architecture based on a generative large language model (LLM) via prompt tuning. METHODS: We formulated 7 key clinical NLP tasks as text-to-text learning and solved them using one unified generative clinical LLM, GatorTronGPT, developed using GPT-3 architecture and trained with up to 20 billion parameters. We adopted soft prompts (ie, trainable vectors) with frozen LLM, where the LLM parameters were not updated (ie, frozen) and only the vectors of soft prompts were updated, known as prompt tuning. We added additional soft prompts as a prefix to the input layer, which were optimized during the prompt tuning. We evaluated the proposed method using 7 clinical NLP tasks and compared them with previous task-specific solutions based on Transformer models. RESULTS AND CONCLUSION: The proposed approach achieved state-of-the-art performance for 5 out of 7 major clinical NLP tasks using one unified generative LLM. Our approach outperformed previous task-specific transformer models by â¼3% for concept extraction and 7% for relation extraction applied to social determinants of health, 3.4% for clinical concept normalization, 3.4%-10% for clinical abbreviation disambiguation, and 5.5%-9% for natural language inference. Our approach also outperformed a previously developed prompt-based machine reading comprehension (MRC) model, GatorTron-MRC, for clinical concept and relation extraction. The proposed approach can deliver the "one model for all" promise from training to deployment using a unified generative LLM.

3.

Effect of Eligibility Criteria on Patients' Survival and Serious Adverse Events in Colorectal Cancer Drug Trials.

Wang, Chang; Chen, Aokun; He, Xing; Balian, Patrick; George, Thomas J; Wang, Fei; Bian, Jiang; Guo, Yi.

medRxiv ; 2024 Apr 03.

Artigo em Inglês | MEDLINE | ID: mdl-38633798

RESUMO

This study investigates the impact of clinical trial eligibility criteria on patient survival and serious adverse events (SAEs) in colorectal cancer (CRC) drug trials using real-world data. We utilized the OneFlorida+ network's data repository, conducting a retrospective analysis of CRC patients receiving FDA-approved first-line metastatic treatments. Propensity score matching created balanced case-control groups, which were evaluated using survival analysis and machine learning algorithms to assess the effects of eligibility criteria. Our study included 68,375 patients, with matched case-control groups comprising 1,126 patients each. Survival analysis revealed ethnicity and race, along with specific medical history (eligibility criteria), as significant survival outcome predictors. Machine learning models, particularly the XgBoost regressor, were employed to analyze SAEs, indicating that age and study groups were notable factors in SAEs occurrence. The study's findings highlight the importance of considering patient demographics and medical history in CRC trial designs.

4.

Model tuning or prompt Tuning? a study of large language models for clinical concept and relation extraction.

Peng, Cheng; Yang, Xi; Smith, Kaleb E; Yu, Zehao; Chen, Aokun; Bian, Jiang; Wu, Yonghui.

J Biomed Inform ; 153: 104630, 2024 May.

Artigo em Inglês | MEDLINE | ID: mdl-38548007

RESUMO

OBJECTIVE: To develop soft prompt-based learning architecture for large language models (LLMs), examine prompt-tuning using frozen/unfrozen LLMs, and assess their abilities in transfer learning and few-shot learning. METHODS: We developed a soft prompt-based learning architecture and compared 4 strategies including (1) fine-tuning without prompts; (2) hard-prompting with unfrozen LLMs; (3) soft-prompting with unfrozen LLMs; and (4) soft-prompting with frozen LLMs. We evaluated GatorTron, a clinical LLM with up to 8.9 billion parameters, and compared GatorTron with 4 existing transformer models for clinical concept and relation extraction on 2 benchmark datasets for adverse drug events and social determinants of health (SDoH). We evaluated the few-shot learning ability and generalizability for cross-institution applications. RESULTS AND CONCLUSION: When LLMs are unfrozen, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming the traditional fine-tuning and hard prompt-based models by 0.6 â¼ 3.1 % and 1.2 â¼ 2.9 %, respectively; GatorTron-345 M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming other two models by 0.2 â¼ 2 % and 0.6 â¼ 11.7 %, respectively. When LLMs are frozen, small LLMs have a big gap to be competitive with unfrozen models; scaling LLMs up to billions of parameters makes frozen LLMs competitive with unfrozen models. Soft prompting with a frozen GatorTron-8.9B model achieved the best performance for cross-institution evaluation. We demonstrate that (1) machines can learn soft prompts better than hard prompts composed by human, (2) frozen LLMs have good few-shot learning ability and generalizability for cross-institution applications, (3) frozen LLMs reduce computing cost to 2.5 â¼ 6 % of previous methods using unfrozen LLMs, and (4) frozen LLMs require large models (e.g., over several billions of parameters) for good performance.

Assuntos

Processamento de Linguagem Natural , Humanos , Aprendizado de Máquina , Mineração de Dados/métodos , Algoritmos , Determinantes Sociais da Saúde , Efeitos Colaterais e Reações Adversas Relacionados a Medicamentos

5.

Feasibility of Identifying Factors Related to Alzheimer's Disease and Related Dementia in Real-World Data.

Chen, Aokun; Li, Qian; Huang, Yu; Li, Yongqiu; Chuang, Yu-Neng; Hu, Xia; Guo, Serena; Wu, Yonghui; Guo, Yi; Bian, Jiang.

medRxiv ; 2024 Feb 13.

Artigo em Inglês | MEDLINE | ID: mdl-38405723

RESUMO

A comprehensive view of factors associated with AD/ADRD will significantly aid in studies to develop new treatments for AD/ADRD and identify high-risk populations and patients for prevention efforts. In our study, we summarized the risk factors for AD/ADRD by reviewing existing meta-analyses and review articles on risk and preventive factors for AD/ADRD. In total, we extracted 477 risk factors in 10 categories from 537 studies. We constructed an interactive knowledge map to disseminate our study results. Most of the risk factors are accessible from structured Electronic Health Records (EHRs), and clinical narratives show promise as information sources. However, evaluating genomic risk factors using RWD remains a challenge, as genetic testing for AD/ADRD is still not a common practice and is poorly documented in both structured and unstructured EHRs. Considering the constantly evolving research on AD/ADRD risk factors, literature mining via NLP methods offers a solution to automatically update our knowledge map.

6.

A study of generative large language model for medical research and healthcare.

Peng, Cheng; Yang, Xi; Chen, Aokun; Smith, Kaleb E; PourNejatian, Nima; Costa, Anthony B; Martin, Cheryl; Flores, Mona G; Zhang, Ying; Magoc, Tanja; Lipori, Gloria; Mitchell, Duane A; Ospina, Naykky S; Ahmed, Mustafa M; Hogan, William R; Shenkman, Elizabeth A; Guo, Yi; Bian, Jiang; Wu, Yonghui.

NPJ Digit Med ; 6(1): 210, 2023 Nov 16.

Artigo em Inglês | MEDLINE | ID: mdl-37973919

RESUMO

There are enormous enthusiasm and concerns in applying large language models (LLMs) to healthcare. Yet current assumptions are based on general-purpose LLMs such as ChatGPT, which are not developed for medical use. This study develops a generative clinical LLM, GatorTronGPT, using 277 billion words of text including (1) 82 billion words of clinical text from 126 clinical departments and approximately 2 million patients at the University of Florida Health and (2) 195 billion words of diverse general English text. We train GatorTronGPT using a GPT-3 architecture with up to 20 billion parameters and evaluate its utility for biomedical natural language processing (NLP) and healthcare text generation. GatorTronGPT improves biomedical natural language processing. We apply GatorTronGPT to generate 20 billion words of synthetic text. Synthetic NLP models trained using synthetic text generated by GatorTronGPT outperform models trained using real-world clinical text. Physicians' Turing test using 1 (worst) to 9 (best) scale shows that there are no significant differences in linguistic readability (p = 0.22; 6.57 of GatorTronGPT compared with 6.93 of human) and clinical relevance (p = 0.91; 7.0 of GatorTronGPT compared with 6.97 of human) and that physicians cannot differentiate them (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.

7.

Contextualized medication information extraction using Transformer-based deep learning architectures.

Chen, Aokun; Yu, Zehao; Yang, Xi; Guo, Yi; Bian, Jiang; Wu, Yonghui.

J Biomed Inform ; 142: 104370, 2023 06.

Artigo em Inglês | MEDLINE | ID: mdl-37100106

RESUMO

OBJECTIVE: To develop a natural language processing (NLP) system to extract medications and contextual information that help understand drug changes. This project is part of the 2022 n2c2 challenge. MATERIALS AND METHODS: We developed NLP systems for medication mention extraction, event classification (indicating medication changes discussed or not), and context classification to classify medication changes context into 5 orthogonal dimensions related to drug changes. We explored 6 state-of-the-art pretrained transformer models for the three subtasks, including GatorTron, a large language model pretrained using > 90 billion words of text (including > 80 billion words from > 290 million clinical notes identified at the University of Florida Health). We evaluated our NLP systems using annotated data and evaluation scripts provided by the 2022 n2c2 organizers. RESULTS: Our GatorTron models achieved the best F1-scores of 0.9828 for medication extraction (ranked 3rd), 0.9379 for event classification (ranked 2nd), and the best micro-average accuracy of 0.9126 for context classification. GatorTron outperformed existing transformer models pretrained using smaller general English text and clinical text corpora, indicating the advantage of large language models. CONCLUSION: This study demonstrated the advantage of using large transformer models for contextual medication information extraction from clinical narratives.

Assuntos

Aprendizado Profundo , Processamento de Linguagem Natural , Armazenamento e Recuperação da Informação

8.

A large language model for electronic health records.

Yang, Xi; Chen, Aokun; PourNejatian, Nima; Shin, Hoo Chang; Smith, Kaleb E; Parisien, Christopher; Compas, Colin; Martin, Cheryl; Costa, Anthony B; Flores, Mona G; Zhang, Ying; Magoc, Tanja; Harle, Christopher A; Lipori, Gloria; Mitchell, Duane A; Hogan, William R; Shenkman, Elizabeth A; Bian, Jiang; Wu, Yonghui.

NPJ Digit Med ; 5(1): 194, 2022 Dec 26.

Artigo em Inglês | MEDLINE | ID: mdl-36572766

RESUMO

There is an increasing interest in developing artificial intelligence (AI) systems to process and interpret electronic health records (EHRs). Natural language processing (NLP) powered by pretrained language models is the key technology for medical AI systems utilizing clinical narratives. However, there are few clinical language models, the largest of which trained in the clinical domain is comparatively small at 110 million parameters (compared with billions of parameters in the general domain). It is not clear how large clinical language models with billions of parameters can help medical AI systems utilize unstructured EHRs. In this study, we develop from scratch a large clinical language model-GatorTron-using >90 billion words of text (including >82 billion words of de-identified clinical text) and systematically evaluate it on five clinical NLP tasks including clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference (NLI), and medical question answering (MQA). We examine how (1) scaling up the number of parameters and (2) scaling up the size of the training data could benefit these NLP tasks. GatorTron models scale up the clinical language model from 110 million to 8.9 billion parameters and improve five clinical NLP tasks (e.g., 9.6% and 9.5% improvement in accuracy for NLI and MQA), which can be applied to medical AI systems to improve healthcare delivery. The GatorTron models are publicly available at: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og .

9.

The Impact of Race-Ethnicity and Diagnosis of Alzheimer's Disease and Related Dementias on Mammography Use.

Chen, Aokun; Li, Yongqiu; Woodard, Jennifer N; Islam, Jessica Y; Yang, Shuang; George, Thomas J; Shenkman, Elizabeth A; Bian, Jiang; Guo, Yi.

Cancers (Basel) ; 14(19)2022 Sep 28.

Artigo em Inglês | MEDLINE | ID: mdl-36230648

RESUMO

Breast cancer screening (BCS) with mammography is a crucial method for improving cancer survival. In this study, we examined the association of Alzheimer's disease (AD) and AD-related dementias (ADRD) diagnosis and race-ethnicity with mammography use in BCS-eligible women. In the real-world data from the OneFlorida+ Clinical Research Network, we extracted a cohort of 21,715 BCS-eligible women with ADRD and a matching comparison cohort of 65,145 BCS-eligible women without ADRD. In multivariable regression analysis, BCS-eligible women with ADRD were more likely to undergo a mammography than the BCS-eligible women without ADRD (odds ratio [OR] = 1.19, 95% confidence interval [CI] = 1.13-1.26). Stratified by race-ethnicity, BCS-eligible Hispanic women with ADRD were more likely to undergo a mammography (OR = 1.56, 95% CI = 1.39-1.75), whereas BCS-eligible non-Hispanic black (OR = 0.72, 95% CI = 0.62-0.83) and non-Hispanic other (OR = 0.65, 95% CI = 0.45-0.93) women with ADRD were less likely to undergo a mammography. This study was the first to report the impact of ADRD diagnosis and race-ethnicity on mammography use in BCS-eligible women using real-world data. Our results suggest ADRD patients might be undergoing BCS without detailed guidelines to maximize benefits and avoid harms.

10.

Incidence Trends of New-Onset Diabetes in Children and Adolescents Before and During the COVID-19 Pandemic: Findings From Florida.

Guo, Yi; Bian, Jiang; Chen, Aokun; Wang, Fei; Posgai, Amanda L; Schatz, Desmond A; Shenkman, Elizabeth A; Atkinson, Mark A.

Diabetes ; 71(12): 2702-2706, 2022 12 01.

Artigo em Inglês | MEDLINE | ID: mdl-36094294

RESUMO

This study examined the incidence trends of new-onset type 1 and type 2 diabetes in children and adolescents in Florida before and during the coronavirus disease 2019 (COVID-19) pandemic. In this observational descriptive cohort study, we used a validated computable phenotype to identify incident diabetes cases among individuals <18 years of age in the OneFlorida+ network of the national Patient-Centered Clinical Research Network between January 2017 and June 2021. We conducted an interrupted time series analysis based on the autoregressive integrated moving average model to compare changes in age-adjusted incidence rates of type 1 and type 2 diabetes before and after March 2020, when COVID-19 was declared a national health emergency in the U.S. The age-adjusted incidence rates of both type 1 and type 2 diabetes increased post-COVID-19 for children and adolescents. These results highlight the need for longitudinal cohort studies to examine how the pandemic might influence subsequent diabetes onset in young individuals.

Assuntos

COVID-19 , Diabetes Mellitus Tipo 2 , Humanos , Incidência , Pandemias , COVID-19/epidemiologia , Diabetes Mellitus Tipo 2/epidemiologia , Estudos de Coortes , Estudos Longitudinais , Florida/epidemiologia

11.

Impacts of Eligibility Criteria on Trial Participants' Age in Alzheimer's Disease Clinical Trials.

Chen, Aokun; Li, Qian; He, Xing; Jaffee, Michael S; Hogan, William R; Wang, Fei; Guo, Yi; Bian, Jiang.

AMIA Annu Symp Proc ; 2022: 368-376, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-37128470

RESUMO

Overly restricted and poorly designed eligibility criteria reduce the generalizability of the results from clinical trials. We conducted a study to identify and quantify the impacts of study traits extracted from eligibility criteria on the age of study populations in Alzheimer's Disease (AD) clinical trials. Using machine learning methods and SHapley Additive exPlanation (SHAP) values, we identified 30 and 34 study traits that excluded older patients from AD trials in our 2 generated target populations respectively. We also found that study traits had different magnitudes of impacts on the age distributions of the generated study populations across racial-ethnic groups. To our best knowledge, this was the first study that quantified the impact of eligibility criteria on the age of AD trial participants. Our research is a first step in addressing the overly restrictive eligibility criteria in AD clinical trials.

Assuntos

Doença de Alzheimer , Humanos , Definição da Elegibilidade , Aprendizado de Máquina

12.

Learning Fast and Slow: Propedeutica for Real-Time Malware Detection.

Sun, Ruimin; Yuan, Xiaoyong; He, Pan; Zhu, Qile; Chen, Aokun; Gregio, Andre; Oliveira, Daniela; Li, Xiaolin.

IEEE Trans Neural Netw Learn Syst ; 33(6): 2518-2529, 2022 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-34723811

RESUMO

Existing malware detectors on safety-critical devices have difficulties in runtime detection due to the performance overhead. In this article, we introduce Propedeutica, a framework for efficient and effective real-time malware detection, leveraging the best of conventional machine learning (ML) and deep learning (DL) techniques. In Propedeutica, all software start executions are considered as benign and monitored by a conventional ML classifier for fast detection. If the software receives a borderline classification from the ML detector (e.g., the software is 50% likely to be benign and 50% likely to be malicious), the software will be transferred to a more accurate, yet performance demanding DL detector. To address spatial-temporal dynamics and software execution heterogeneity, we introduce a novel DL architecture (DeepMalware) for Propedeutica with multistream inputs. We evaluated Propedeutica with 9115 malware samples and 1338 benign software from various categories for the Windows OS. With a borderline interval of [30%, 70%], Propedeutica achieves an accuracy of 94.34% and a false-positive rate of 8.75%, with 41.45% of the samples moved for DeepMalwareanalysis. Even using only CPU, Propedeutica can detect malware within less than 0.1 s.

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA