Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 9 de 9
Filter
Add more filters










Database
Language
Publication year range
1.
Sci Rep ; 14(1): 14156, 2024 06 19.
Article in English | MEDLINE | ID: mdl-38898116

ABSTRACT

LLMs can accomplish specialized medical knowledge tasks, however, equitable access is hindered by the extensive fine-tuning, specialized medical data requirement, and limited access to proprietary models. Open-source (OS) medical LLMs show performance improvements and provide the transparency and compliance required in healthcare. We present OpenMedLM, a prompting platform delivering state-of-the-art (SOTA) performance for OS LLMs on medical benchmarks. We evaluated OS foundation LLMs (7B-70B) on medical benchmarks (MedQA, MedMCQA, PubMedQA, MMLU medical-subset) and selected Yi34B for developing OpenMedLM. Prompting strategies included zero-shot, few-shot, chain-of-thought, and ensemble/self-consistency voting. OpenMedLM delivered OS SOTA results on three medical LLM benchmarks, surpassing previous best-performing OS models that leveraged costly and extensive fine-tuning. OpenMedLM displays the first results to date demonstrating the ability of OS foundation models to optimize performance, absent specialized fine-tuning. The model achieved 72.6% accuracy on MedQA, outperforming the previous SOTA by 2.4%, and 81.7% accuracy on MMLU medical-subset, establishing itself as the first OS LLM to surpass 80% accuracy on this benchmark. Our results highlight medical-specific emergent properties in OS LLMs not documented elsewhere to date and validate the ability of OS models to accomplish healthcare tasks, highlighting the benefits of prompt engineering to improve performance of accessible LLMs for medical applications.


Subject(s)
Benchmarking , Humans , Software
2.
NPJ Digit Med ; 6(1): 135, 2023 Jul 29.
Article in English | MEDLINE | ID: mdl-37516790

ABSTRACT

The success of foundation models such as ChatGPT and AlphaFold has spurred significant interest in building similar models for electronic medical records (EMRs) to improve patient care and hospital operations. However, recent hype has obscured critical gaps in our understanding of these models' capabilities. In this narrative review, we examine 84 foundation models trained on non-imaging EMR data (i.e., clinical text and/or structured data) and create a taxonomy delineating their architectures, training data, and potential use cases. We find that most models are trained on small, narrowly-scoped clinical datasets (e.g., MIMIC-III) or broad, public biomedical corpora (e.g., PubMed) and are evaluated on tasks that do not provide meaningful insights on their usefulness to health systems. Considering these findings, we propose an improved evaluation framework for measuring the benefits of clinical foundation models that is more closely grounded to metrics that matter in healthcare.

3.
J Am Med Inform Assoc ; 30(9): 1532-1542, 2023 08 18.
Article in English | MEDLINE | ID: mdl-37369008

ABSTRACT

OBJECTIVE: Heatlhcare institutions are establishing frameworks to govern and promote the implementation of accurate, actionable, and reliable machine learning models that integrate with clinical workflow. Such governance frameworks require an accompanying technical framework to deploy models in a resource efficient, safe and high-quality manner. Here we present DEPLOYR, a technical framework for enabling real-time deployment and monitoring of researcher-created models into a widely used electronic medical record system. MATERIALS AND METHODS: We discuss core functionality and design decisions, including mechanisms to trigger inference based on actions within electronic medical record software, modules that collect real-time data to make inferences, mechanisms that close-the-loop by displaying inferences back to end-users within their workflow, monitoring modules that track performance of deployed models over time, silent deployment capabilities, and mechanisms to prospectively evaluate a deployed model's impact. RESULTS: We demonstrate the use of DEPLOYR by silently deploying and prospectively evaluating 12 machine learning models trained using electronic medical record data that predict laboratory diagnostic results, triggered by clinician button-clicks in Stanford Health Care's electronic medical record. DISCUSSION: Our study highlights the need and feasibility for such silent deployment, because prospectively measured performance varies from retrospective estimates. When possible, we recommend using prospectively estimated performance measures during silent trials to make final go decisions for model deployment. CONCLUSION: Machine learning applications in healthcare are extensively researched, but successful translations to the bedside are rare. By describing DEPLOYR, we aim to inform machine learning deployment best practices and help bridge the model implementation gap.


Subject(s)
Electronic Health Records , Software , Retrospective Studies , Machine Learning
4.
Int J Med Inform ; 173: 104930, 2023 05.
Article in English | MEDLINE | ID: mdl-36893656

ABSTRACT

BACKGROUND: Data drift can negatively impact the performance of machine learning algorithms (MLAs) that were trained on historical data. As such, MLAs should be continuously monitored and tuned to overcome the systematic changes that occur in the distribution of data. In this paper, we study the extent of data drift and provide insights about its characteristics for sepsis onset prediction. This study will help elucidate the nature of data drift for prediction of sepsis and similar diseases. This may aid with the development of more effective patient monitoring systems that can stratify risk for dynamic disease states in hospitals. METHODS: We devise a series of simulations that measure the effects of data drift in patients with sepsis, using electronic health records (EHR). We simulate multiple scenarios in which data drift may occur, namely the change in the distribution of the predictor variables (covariate shift), the change in the statistical relationship between the predictors and the target (concept shift), and the occurrence of a major healthcare event (major event) such as the COVID-19 pandemic. We measure the impact of data drift on model performances, identify the circumstances that necessitate model retraining, and compare the effects of different retraining methodologies and model architecture on the outcomes. We present the results for two different MLAs, eXtreme Gradient Boosting (XGB) and Recurrent Neural Network (RNN). RESULTS: Our results show that the properly retrained XGB models outperform the baseline models in all simulation scenarios, hence signifying the existence of data drift. In the major event scenario, the area under the receiver operating characteristic curve (AUROC) at the end of the simulation period is 0.811 for the baseline XGB model and 0.868 for the retrained XGB model. In the covariate shift scenario, the AUROC at the end of the simulation period for the baseline and retrained XGB models is 0.853 and 0.874 respectively. In the concept shift scenario and under the mixed labeling method, the retrained XGB models perform worse than the baseline model for most simulation steps. However, under the full relabeling method, the AUROC at the end of the simulation period for the baseline and retrained XGB models is 0.852 and 0.877 respectively. The results for the RNN models were mixed, suggesting that retraining based on a fixed network architecture may be inadequate for an RNN. We also present the results in the form of other performance metrics such as the ratio of observed to expected probabilities (calibration) and the normalized rate of positive predictive values (PPV) by prevalence, referred to as lift, at a sensitivity of 0.8. CONCLUSION: Our simulations reveal that retraining periods of a couple of months or using several thousand patients are likely to be adequate to monitor machine learning models that predict sepsis. This indicates that a machine learning system for sepsis prediction will probably need less infrastructure for performance monitoring and retraining compared to other applications in which data drift is more frequent and continuous. Our results also show that in the event of a concept shift, a full overhaul of the sepsis prediction model may be necessary because it indicates a discrete change in the definition of sepsis labels, and mixing the labels for the sake of incremental training may not produce the desired results.


Subject(s)
COVID-19 , Communicable Diseases , Sepsis , Humans , Pandemics , COVID-19/diagnosis , Sepsis/diagnosis , Machine Learning
5.
medRxiv ; 2022 Jun 07.
Article in English | MEDLINE | ID: mdl-35702157

ABSTRACT

Background: Data drift can negatively impact the performance of machine learning algorithms (MLAs) that were trained on historical data. As such, MLAs should be continuously monitored and tuned to overcome the systematic changes that occur in the distribution of data. In this paper, we study the extent of data drift and provide insights about its characteristics for sepsis onset prediction. This study will help elucidate the nature of data drift for prediction of sepsis and similar diseases. This may aid with the development of more effective patient monitoring systems that can stratify risk for dynamic disease states in hospitals. Methods: We devise a series of simulations that measure the effects of data drift in patients with sepsis. We simulate multiple scenarios in which data drift may occur, namely the change in the distribution of the predictor variables (covariate shift), the change in the statistical relationship between the predictors and the target (concept shift), and the occurrence of a major healthcare event (major event) such as the COVID-19 pandemic. We measure the impact of data drift on model performances, identify the circumstances that necessitate model retraining, and compare the effects of different retraining methodologies and model architecture on the outcomes. We present the results for two different MLAs, eXtreme Gradient Boosting (XGB) and Recurrent Neural Network (RNN). Results: Our results show that the properly retrained XGB models outperform the baseline models in all simulation scenarios, hence signifying the existence of data drift. In the major event scenario, the area under the receiver operating characteristic curve (AUROC) at the end of the simulation period is 0.811 for the baseline XGB model and 0.868 for the retrained XGB model. In the covariate shift scenario, the AUROC at the end of the simulation period for the baseline and retrained XGB models is 0.853 and 0.874 respectively. In the concept shift scenario and under the mixed labeling method, the retrained XGB models perform worse than the baseline model for most simulation steps. However, under the full relabeling method, the AUROC at the end of the simulation period for the baseline and retrained XGB models is 0.852 and 0.877 respectively. The results for the RNN models were mixed, suggesting that retraining based on a fixed network architecture may be inadequate for an RNN. We also present the results in the form of other performance metrics such as the ratio of observed to expected probabilities (calibration) and the normalized rate of positive predictive values (PPV) by prevalence, referred to as lift, at a sensitivity of 0.8. Conclusion: Our simulations reveal that retraining periods of a couple of months or using several thousand patients are likely to be adequate to monitor machine learning models that predict sepsis. This indicates that a machine learning system for sepsis prediction will probably need less infrastructure for performance monitoring and retraining compared to other applications in which data drift is more frequent and continuous. Our results also show that in the event of a concept shift, a full overhaul of the sepsis prediction model may be necessary because it indicates a discrete change in the definition of sepsis labels, and mixing the labels for the sake of incremental training may not produce the desired results.

6.
JMIR Med Inform ; 10(6): e36202, 2022 Jun 15.
Article in English | MEDLINE | ID: mdl-35704370

ABSTRACT

BACKGROUND: Acute respiratory distress syndrome (ARDS) is a condition that is often considered to have broad and subjective diagnostic criteria and is associated with significant mortality and morbidity. Early and accurate prediction of ARDS and related conditions such as hypoxemia and sepsis could allow timely administration of therapies, leading to improved patient outcomes. OBJECTIVE: The aim of this study is to perform an exploration of how multilabel classification in the clinical setting can take advantage of the underlying dependencies between ARDS and related conditions to improve early prediction of ARDS in patients. METHODS: The electronic health record data set included 40,703 patient encounters from 7 hospitals from April 20, 2018, to March 17, 2021. A recurrent neural network (RNN) was trained using data from 5 hospitals, and external validation was conducted on data from 2 hospitals. In addition to ARDS, 12 target labels for related conditions such as sepsis, hypoxemia, and COVID-19 were used to train the model to classify a total of 13 outputs. As a comparator, XGBoost models were developed for each of the 13 target labels. Model performance was assessed using the area under the receiver operating characteristic curve. Heat maps to visualize attention scores were generated to provide interpretability to the neural networks. Finally, cluster analysis was performed to identify potential phenotypic subgroups of patients with ARDS. RESULTS: The single RNN model trained to classify 13 outputs outperformed the individual XGBoost models for ARDS prediction, achieving an area under the receiver operating characteristic curve of 0.842 on the external test sets. Models trained on an increasing number of tasks resulted in improved performance. Earlier prediction of ARDS nearly doubled the rate of in-hospital survival. Cluster analysis revealed distinct ARDS subgroups, some of which had similar mortality rates but different clinical presentations. CONCLUSIONS: The RNN model presented in this paper can be used as an early warning system to stratify patients who are at risk of developing one of the multiple risk outcomes, hence providing practitioners with the means to take early action.

7.
JMIR Aging ; 5(2): e35373, 2022 Apr 01.
Article in English | MEDLINE | ID: mdl-35363146

ABSTRACT

BACKGROUND: Short-term fall prediction models that use electronic health records (EHRs) may enable the implementation of dynamic care practices that specifically address changes in individualized fall risk within senior care facilities. OBJECTIVE: The aim of this study is to implement machine learning (ML) algorithms that use EHR data to predict a 3-month fall risk in residents from a variety of senior care facilities providing different levels of care. METHODS: This retrospective study obtained EHR data (2007-2021) from Juniper Communities' proprietary database of 2785 individuals primarily residing in skilled nursing facilities, independent living facilities, and assisted living facilities across the United States. We assessed the performance of 3 ML-based fall prediction models and the Juniper Communities' fall risk assessment. Additional analyses were conducted to examine how changes in the input features, training data sets, and prediction windows affected the performance of these models. RESULTS: The Extreme Gradient Boosting model exhibited the highest performance, with an area under the receiver operating characteristic curve of 0.846 (95% CI 0.794-0.894), specificity of 0.848, diagnostic odds ratio of 13.40, and sensitivity of 0.706, while achieving the best trade-off in balancing true positive and negative rates. The number of active medications was the most significant feature associated with fall risk, followed by a resident's number of active diseases and several variables associated with vital signs, including diastolic blood pressure and changes in weight and respiratory rates. The combination of vital signs with traditional risk factors as input features achieved higher prediction accuracy than using either group of features alone. CONCLUSIONS: This study shows that the Extreme Gradient Boosting technique can use a large number of features from EHR data to make short-term fall predictions with a better performance than that of conventional fall risk assessments and other ML models. The integration of routinely collected EHR data, particularly vital signs, into fall prediction models may generate more accurate fall risk surveillance than models without vital signs. Our data support the use of ML models for dynamic, cost-effective, and automated fall predictions in different types of senior care facilities.

8.
JGH Open ; 6(3): 196-204, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35355667

ABSTRACT

Background: Non-alcoholic fatty liver (NAFL) can progress to the severe subtype non-alcoholic steatohepatitis (NASH) and/or fibrosis, which are associated with increased morbidity, mortality, and healthcare costs. Current machine learning studies detect NASH; however, this study is unique in predicting the progression of NAFL patients to NASH or fibrosis. Aim: To utilize clinical information from NAFL-diagnosed patients to predict the likelihood of progression to NASH or fibrosis. Methods: Data were collected from electronic health records of patients receiving a first-time NAFL diagnosis. A gradient boosted machine learning algorithm (XGBoost) as well as logistic regression (LR) and multi-layer perceptron (MLP) models were developed. A five-fold cross-validation grid search was utilized for hyperparameter optimization of variables, including maximum tree depth, learning rate, and number of estimators. Predictions of patients likely to progress to NASH or fibrosis within 4 years of initial NAFL diagnosis were made using demographic features, vital signs, and laboratory measurements. Results: The XGBoost algorithm achieved area under the receiver operating characteristic (AUROC) values of 0.79 for prediction of progression to NASH and 0.87 for fibrosis on both hold-out and external validation test sets. The XGBoost algorithm outperformed the LR and MLP models for both NASH and fibrosis prediction on all metrics. Conclusion: It is possible to accurately identify newly diagnosed NAFL patients at high risk of progression to NASH or fibrosis. Early identification of these patients may allow for increased clinical monitoring, more aggressive preventative measures to slow the progression of NAFL and fibrosis, and efficient clinical trial enrollment.

9.
Pancreatology ; 22(1): 43-50, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34690046

ABSTRACT

BACKGROUND: Acute pancreatitis (AP) is one of the most common causes of gastrointestinal-related hospitalizations in the United States. Severe AP (SAP) is associated with a mortality rate of nearly 30% and is distinguished from milder forms of AP. Risk stratification to identify SAP cases needing inpatient treatment is an important aspect of AP diagnosis. METHODS: We developed machine learning algorithms to predict which patients presenting with AP would require treatment for SAP. Three models were developed using logistic regression, neural networks, and XGBoost. Models were assessed in terms of area under the receiver operating characteristic (AUROC) and compared to the Harmless Acute Pancreatitis Score (HAPS) and Bedside Index for Severity in Acute Pancreatitis (BISAP) scores for AP risk stratification. RESULTS: 61,894 patients were used to train and test the machine learning models. With an AUROC value of 0.921, the model developed using XGBoost outperformed the logistic regression and neural network-based models. The XGBoost model also achieved a higher AUROC than both HAPS and BISAP for identifying patients who would be diagnosed with SAP. CONCLUSIONS: Machine learning may be able to improve the accuracy of AP risk stratification methods and allow for more timely treatment and initiation of interventions.


Subject(s)
Machine Learning , Pancreatitis/diagnosis , Acute Disease , Adolescent , Adult , Aged , Aged, 80 and over , Female , Humans , Male , Middle Aged , Predictive Value of Tests , Prognosis , ROC Curve , Retrospective Studies , Severity of Illness Index
SELECTION OF CITATIONS
SEARCH DETAIL
...