A standardized analytics pipeline for reliable and rapid development and validation of prediction models using observational health data.

Khalid, Sara; Yang, Cynthia; Blacketer, Clair; Duarte-Salles, Talita; Fernández-Bertolín, Sergio; Kim, Chungsoo; Park, Rae Woong; Park, Jimyung; Schuemie, Martijn J; Sena, Anthony G; Suchard, Marc A; You, Seng Chan; Rijnbeek, Peter R; Reps, Jenna M

Khalid S; Botnar Research Centre, Centre for Statistics in Medicine, Nuffield Department of Orthopaedics Rheumatology and Musculoskeletal Sciences (NDORMS), University of Oxford, Oxford, UK.
Yang C; Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.
Blacketer C; Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA.
Duarte-Salles T; Fundació Institut Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain.
Fernández-Bertolín S; Fundació Institut Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain.
Kim C; Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea.
Park RW; Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea; Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea.
Park J; Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea.
Schuemie MJ; Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA.
Sena AG; Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands; Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA.
Suchard MA; Departments of Biomathematics, University of California, Los Angeles, USA.
You SC; Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, Republic of Korea.
Rijnbeek PR; Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands.
Reps JM; Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ, USA. Electronic address: jreps@its.jnj.com.

Comput Methods Programs Biomed ; 211: 106394, 2021 Nov.

Article in English | MEDLINE | ID: covidwho-1437413

ABSTRACT

BACKGROUND AND OBJECTIVE:

In response to the ongoing COVID-19 pandemic, several prediction models were rapidly developed with the aim of providing evidence-based guidance. However, none of these COVID-19 prediction models have been found to be reliable. Models are commonly assessed to have a risk of bias, often due to insufficient reporting, use of non-representative data, and lack of large-scale external validation. In this paper, we present the Observational Health Data Sciences and Informatics (OHDSI) analytics pipeline for patient-level prediction modeling as a standardized approach for rapid yet reliable development and validation of prediction models. We demonstrate how our analytics pipeline and open-source software tools can be used to answer important prediction questions while limiting potential causes of bias (e.g., by validating phenotypes, specifying the target population, performing large-scale external validation, and publicly providing all analytical source code).

METHODS:

We show step-by-step how to implement the analytics pipeline for the question 'In patients hospitalized with COVID-19, what is the risk of death 0 to 30 days after hospitalization?'. We develop models using six different machine learning methods in a USA claims database containing over 20,000 COVID-19 hospitalizations and externally validate the models using data containing over 45,000 COVID-19 hospitalizations from South Korea, Spain, and the USA.
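The prediction question above combines a target population (patients hospitalized with COVID-19), an outcome (death), and a time-at-risk window (0 to 30 days after hospitalization). As a minimal sketch of how such a question could be turned into binary labels, the Python below assigns 1 when a recorded death falls inside the window. The `Hospitalization` record, the `label_outcome` helper, and the inclusive window boundaries are assumptions made for illustration; they are not the OHDSI tools' API, and the sketch ignores loss to follow-up.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Hospitalization:
    """One target-cohort entry (hypothetical structure, for illustration only)."""
    patient_id: int
    admission_date: date        # index date: start of the COVID-19 hospitalization
    death_date: Optional[date]  # None if no death is recorded


def label_outcome(entry: Hospitalization, tar_start: int = 0, tar_end: int = 30) -> int:
    """Return 1 if death falls within the time-at-risk window, else 0.

    Assumption: both window boundaries are inclusive.
    """
    if entry.death_date is None:
        return 0
    window_start = entry.admission_date + timedelta(days=tar_start)
    window_end = entry.admission_date + timedelta(days=tar_end)
    return int(window_start <= entry.death_date <= window_end)


cohort = [
    Hospitalization(1, date(2020, 4, 1), date(2020, 4, 10)),  # death on day 9
    Hospitalization(2, date(2020, 4, 1), None),               # no recorded death
    Hospitalization(3, date(2020, 4, 1), date(2020, 6, 1)),   # death on day 61
]
labels = [label_outcome(h) for h in cohort]
print(labels)  # [1, 0, 0]
```

Keeping the window as explicit parameters makes the time-at-risk definition visible in the code, which is one way to keep the problem specification reportable.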

RESULTS:

Our open-source software tools enabled us to efficiently go end-to-end from problem design to reliable model development and evaluation. When predicting death in patients hospitalized with COVID-19, AdaBoost, random forest, gradient boosting machine, and decision tree yielded similar or lower internal and external validation discrimination performance compared to L1-regularized logistic regression, whereas the MLP neural network consistently resulted in lower discrimination. L1-regularized logistic regression models were well calibrated.
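The results are reported in terms of discrimination and calibration. As a rough illustration of what these two properties measure, the sketch below computes the area under the ROC curve by pairwise comparison and a simple calibration-in-the-large check (mean predicted risk versus observed outcome rate). The function names and the toy data are hypothetical, and the abstract does not state which specific metrics the authors used, so this should be read as a generic example rather than their evaluation code.

```python
from typing import Sequence, Tuple


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_in_the_large(
    y_true: Sequence[int], y_score: Sequence[float]
) -> Tuple[float, float]:
    """Return (mean predicted risk, observed outcome rate)."""
    return sum(y_score) / len(y_score), sum(y_true) / len(y_true)


# Toy data, invented for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

discrimination = auc(y_true, y_score)
mean_predicted, observed_rate = calibration_in_the_large(y_true, y_score)
print(discrimination)  # 0.75
print(mean_predicted, observed_rate)
```

A model can discriminate well and still be poorly calibrated, which is why the two properties are assessed separately, both internally and in each external database.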

CONCLUSION:

Our results show that following the OHDSI analytics pipeline for patient-level prediction modeling can enable rapid progress towards reliable prediction models. The OHDSI software tools and pipeline are open source and available to researchers around the world.

Subject(s)

COVID-19; Pandemics; Humans; Logistic Models; Machine Learning; SARS-CoV-2

Keywords

COVID-19; Data harmonization; Data quality control; Distributed data network; Machine learning; Risk prediction

Full text: Available | Collection: International databases | Database: MEDLINE | Main subject: Pandemics / COVID-19 | Type of study: Experimental Studies / Observational study / Prognostic study / Randomized controlled trials | Limits: Humans | Language: English | Journal: Comput Methods Programs Biomed | Journal subject: Medical Informatics | Year: 2021 | Document Type: Article | Affiliation country: J.cmpb.2021.106394