Results 1 - 5 of 5
1.
J Biomed Inform ; : 104692, 2024 Jul 13.
Article in English | MEDLINE | ID: mdl-39009174

ABSTRACT

BACKGROUND: An inherent difference exists between male and female bodies; the historical under-representation of females in clinical trials has widened this gap in existing healthcare data. The fairness of clinical decision-support tools is at risk when they are developed from biased data. This paper aims to quantitatively assess gender bias in risk prediction models. To generalize our findings, we perform this investigation on multiple use cases at different hospitals. METHODS: First, we conduct a thorough analysis of the source data to find gender-based disparities. Second, we assess model performance for different gender groups at different hospitals and on different use cases. Performance is quantified using the area under the receiver-operating characteristic curve (AUROC). Lastly, we investigate the clinical implications of these biases by analyzing underdiagnosis and overdiagnosis rates and by decision curve analysis (DCA). We also investigate the influence of model calibration on mitigating gender-related disparities in decision-making. RESULTS: Our data analysis reveals notable variations in incidence rates, AUROC, and overdiagnosis rates across genders, hospitals, and clinical use cases. However, we also observe that the underdiagnosis rate is consistently higher in the female population. In general, the female population exhibits lower incidence rates, and the models perform worse when applied to this group. Furthermore, the decision curve analysis demonstrates no statistically significant difference in the models' clinical utility across gender groups within the range of thresholds of interest. CONCLUSION: The presence of gender bias within risk prediction models varies across clinical use cases and healthcare institutions. Although inherent differences are observed between the male and female populations at the data-source level, this variance does not affect the parity of clinical utility. The evaluations conducted in this study highlight the importance of continuously monitoring gender-based disparities from multiple perspectives in clinical risk prediction models.
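A minimal sketch of the kind of per-gender evaluation this abstract describes, assuming binary outcome labels and predicted probabilities as NumPy arrays; all names are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_metrics(y_true, y_prob, group, threshold=0.5):
    """AUROC plus underdiagnosis/overdiagnosis rates per gender group."""
    results = {}
    for g in np.unique(group):
        m = group == g
        y, p = y_true[m], y_prob[m]
        pred = (p >= threshold).astype(int)
        results[g] = {
            "auroc": roc_auc_score(y, p),
            # underdiagnosis: positives the model misses (false-negative rate)
            "underdiagnosis": ((y == 1) & (pred == 0)).sum() / max((y == 1).sum(), 1),
            # overdiagnosis: negatives the model flags (false-positive rate)
            "overdiagnosis": ((y == 0) & (pred == 1)).sum() / max((y == 0).sum(), 1),
        }
    return results

def net_benefit(y_true, y_prob, threshold):
    """Net benefit at one threshold; evaluating this over a range of
    thresholds per gender group yields the decision curves compared above."""
    n = len(y_true)
    pred = y_prob >= threshold
    tp = ((y_true == 1) & pred).sum()
    fp = ((y_true == 0) & pred).sum()
    return tp / n - fp / n * threshold / (1 - threshold)
```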

2.
J Med Internet Res ; 24(6): e34295, 2022 06 07.
Article in English | MEDLINE | ID: mdl-35502887

ABSTRACT

BACKGROUND: Machine learning algorithms are currently used in a wide array of clinical domains to produce models that can predict clinical risk events. Most models are developed and evaluated with retrospective data, very few are evaluated in a clinical workflow, and even fewer report performance in different hospitals. In this study, we provide detailed evaluations of clinical risk prediction models in live clinical workflows for three different use cases in three different hospitals. OBJECTIVE: The main objective of this study was to evaluate clinical risk prediction models in live clinical workflows and compare their performance in these settings with their performance on retrospective data. We also aimed to generalize the results by applying our investigation to three different use cases in three different hospitals. METHODS: We trained clinical risk prediction models for three use cases (ie, delirium, sepsis, and acute kidney injury) in three different hospitals with retrospective data. We used machine learning, specifically deep learning, to train models based on the Transformer model. The models were trained using a calibration tool that is common to all hospitals and use cases; they share a common design but were calibrated with each hospital's specific data. The models were deployed in these three hospitals and used in daily clinical practice. The predictions made by these models were logged and correlated with the diagnosis at discharge. We compared their performance with evaluations on retrospective data and conducted cross-hospital evaluations. RESULTS: The performance of the prediction models with data from live clinical workflows was similar to their performance with retrospective data. The average area under the receiver operating characteristic curve (AUROC) decreased slightly, by 0.6 percentage points (from 94.8% to 94.2% at discharge). The cross-hospital evaluations exhibited severely reduced performance: the average AUROC decreased by 8 percentage points (from 94.2% to 86.3% at discharge), which indicates the importance of calibrating the model with data from the deployment hospital. CONCLUSIONS: Calibrating the prediction model with data from the respective deployment hospital led to good performance in live settings. The performance degradation in the cross-hospital evaluation identified limitations in developing a generic model for different hospitals. Designing a generic development process that generates a specialized prediction model for each hospital helps ensure model performance across hospitals.


Subject(s)
Electronic Health Records , Machine Learning , Hospitals , Humans , ROC Curve , Retrospective Studies
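An illustrative sketch of the cross-hospital evaluation described above, assuming one calibrated model and one labelled evaluation set per hospital; the dictionary layout and scikit-learn-style predict_proba API are assumptions, not the study's code.

```python
from sklearn.metrics import roc_auc_score

def cross_hospital_auroc(models, datasets):
    """AUROC of every hospital's model on every hospital's data.

    models:   dict hospital_name -> fitted classifier with predict_proba
    datasets: dict hospital_name -> (X, y) evaluation data at discharge
    """
    matrix = {}
    for trained_on, model in models.items():
        for evaluated_on, (X, y) in datasets.items():
            matrix[(trained_on, evaluated_on)] = roc_auc_score(
                y, model.predict_proba(X)[:, 1])
    # diagonal entries: in-hospital AUROC; off-diagonal entries
    # expose the cross-hospital performance drop reported above
    return matrix
```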
3.
Stud Health Technol Inform ; 255: 40-44, 2018.
Article in English | MEDLINE | ID: mdl-30306903

ABSTRACT

Unplanned hospital readmissions are a burden to the healthcare system and to patients. To lower readmission rates, machine learning approaches can be used to create predictive models with the intention of providing actionable information for caregivers. Under the German Diagnosis Related Groups (G-DRG) system, data are collected for every stay in a German hospital for the subsequent reimbursement calculations. After statistical evaluation, these data are summarised in the yearly updated Case Fee Catalogue, which contains not only the weights for the reimbursement calculations but also the expected length-of-stay values. The aim of the present paper was to evaluate potential enhancements of the prediction accuracy of our 30-day readmission prediction model by utilising additional information from the Case Fee Catalogue. A bagged ensemble of 25 regression trees was applied to §21 datasets from five independent German hospitals from 2013 to 2017, comprising 422,597 cases. The overall model showed an area under the receiver operating characteristic curve of 0.812. Three of the top five features, ranked by out-of-bag feature importance, emerged from the Case Fee Catalogue. We conclude that additional information from the Case Fee Catalogue can enhance the accuracy of 30-day readmission prediction.


Subject(s)
Diagnosis-Related Groups , Machine Learning , Patient Readmission , Forecasting , Germany , Hospitals , Humans , Prognosis , ROC Curve
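A rough sketch of the modelling step above, assuming a numeric feature matrix that already contains the Case Fee Catalogue columns (e.g. expected length of stay); the synthetic data and scikit-learn estimators are stand-ins, and permutation importance on held-out data substitutes for the paper's out-of-bag feature importance.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))                             # placeholder §21 + catalogue features
y = (X[:, 0] + rng.normal(size=2000) > 1.0).astype(float)   # placeholder 30-day readmission label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagged ensemble of 25 regression trees, scoring readmission risk in [0, 1]
model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, oob_score=True)
model.fit(X_tr, y_tr)

print("AUROC:", roc_auc_score(y_te, model.predict(X_te)))
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print("top features:", np.argsort(imp.importances_mean)[::-1][:5])
```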
4.
Stud Health Technol Inform ; 253: 170-174, 2018.
Article in English | MEDLINE | ID: mdl-30147066

ABSTRACT

Hospital readmissions are receiving increasing attention, since they are burdensome for patients and costly for healthcare providers. In Germany, reimbursement fees are calculated using the German Diagnosis Related Groups (G-DRG) system: for every hospital stay, data are collected as a so-called "case", the basis for the subsequent reimbursement calculations (the "§21 dataset"). Merging rules lead to a loss of information in §21 datasets. We applied machine learning to §21 datasets and evaluated the influence of case merging on the resulting accuracy of readmission risk prediction. Data from 478,966 cases were analysed with a random forest. Many cases with readmissions within 30 days had been merged, so predicting them required additional data. Using 10-fold cross-validation, the prediction of readmissions within 31-60 days showed no notable difference in the area under the ROC curve between unedited §21 datasets and §21 datasets with the original cases restored. The achieved AUC values of 0.69 lie in a similar range to those of comparable state-of-the-art models. We conclude that dealing with merged cases, i.e. adding data, is required for 30-day readmission prediction, whereas un-merging brings no improvement for readmission prediction beyond 30 days.


Subject(s)
Diagnosis-Related Groups , Machine Learning , Patient Readmission , Forecasting , Germany , Humans , Length of Stay
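A hedged sketch of the comparison above, assuming two feature matrices built from the same stays (unedited merged §21 cases versus cases restored to their original, un-merged form); the synthetic data below is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cv_auroc(X, y, folds=10):
    """Mean AUROC of a random forest under stratified k-fold cross-validation."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=folds, scoring="roc_auc").mean()

rng = np.random.default_rng(0)
X_merged = rng.normal(size=(1000, 8))                            # unedited merged §21 features
X_restored = X_merged + rng.normal(scale=0.1, size=(1000, 8))    # original cases restored
y = (X_merged[:, 0] + rng.normal(size=1000) > 1.5).astype(int)   # 31-60-day readmission label

print("merged:  ", cv_auroc(X_merged, y))
print("restored:", cv_auroc(X_restored, y))
```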
5.
Stud Health Technol Inform ; 236: 328-335, 2017.
Article in English | MEDLINE | ID: mdl-28508814

ABSTRACT

BACKGROUND: Machine learning algorithms are a promising approach to help physicians deal with the ever-increasing amount of data collected in healthcare each day. However, suggestions derived from predictive models can be difficult to interpret. OBJECTIVES: The aim of this work was to quantify the influence of a specific feature on an individual decision proposed by a random forest (RF). METHODS: For each decision tree within the RF, the influence of each feature on a specific decision (FID) was quantified: for each feature, the changes in outcome value attributable to that feature were summed along the decision path. Results from all trees in the RF were statistically merged. The ratio of FID to the respective feature's global importance was calculated (FIDrel). RESULTS: Global feature importance, FID, and FIDrel differed significantly, depending on the individual input data. We therefore suggest presenting the most important features as determined by FID and by FIDrel whenever the results of an RF are visualized. CONCLUSION: Feature influence on a specific decision can be quantified in RFs. Further studies will be necessary to evaluate our approach in a real-world scenario.


Subject(s)
Algorithms , Decision Trees , Delivery of Health Care , Machine Learning
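A minimal sketch of the FID idea for a single sample, assuming a fitted scikit-learn RandomForestRegressor rather than the authors' implementation: along each tree's decision path, the change in the node's mean prediction is attributed to the feature split on, contributions are summed per feature, and the results are averaged over trees; FIDrel then relates FID to the feature's global importance.

```python
import numpy as np

def feature_influence(forest, x):
    """Per-feature contribution (FID) to the forest's prediction for sample x."""
    fid = np.zeros(forest.n_features_in_)
    for est in forest.estimators_:
        t = est.tree_
        node = 0
        while t.children_left[node] != -1:          # walk until a leaf
            feat = t.feature[node]
            nxt = (t.children_left[node]
                   if x[feat] <= t.threshold[node]
                   else t.children_right[node])
            # outcome change caused by following this split
            fid[feat] += t.value[nxt][0, 0] - t.value[node][0, 0]
            node = nxt
    fid /= len(forest.estimators_)
    gi = forest.feature_importances_
    # FIDrel: FID relative to global importance (NaN where importance is zero)
    fid_rel = np.divide(fid, gi, out=np.full_like(fid, np.nan), where=gi > 0)
    return fid, fid_rel
```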