Identifying who has long COVID in the USA: a machine learning approach using N3C data.
Pfaff, Emily R; Girvin, Andrew T; Bennett, Tellen D; Bhatia, Abhishek; Brooks, Ian M; Deer, Rachel R; Dekermanjian, Jonathan P; Jolley, Sarah Elizabeth; Kahn, Michael G; Kostka, Kristin; McMurry, Julie A; Moffitt, Richard; Walden, Anita; Chute, Christopher G; Haendel, Melissa A.
  • Pfaff ER; Department of Medicine, UNC Chapel Hill School of Medicine, Chapel Hill, NC, USA. Electronic address: epfaff@email.unc.edu.
  • Girvin AT; Palantir Technologies, Denver, CO, USA.
  • Bennett TD; Section of Informatics and Data Science, Department of Pediatrics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA; Section of Critical Care Medicine, Department of Pediatrics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
  • Bhatia A; Carolina Health Informatics Program, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA.
  • Brooks IM; Colorado Center for Personalised Medicine, Division of Biomedical Informatics & Personalized Medicine, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
  • Deer RR; Department of Nutrition, Metabolism, and Rehabilitation Sciences, University of Texas Medical Branch, Galveston, TX, USA.
  • Dekermanjian JP; Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
  • Jolley SE; Division of Pulmonary and Critical Care Medicine, Department of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
  • Kahn MG; Section of Informatics and Data Science, Department of Pediatrics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
  • Kostka K; The OHDSI Center at the Roux Institute, Northeastern University, Portland, ME, USA.
  • McMurry JA; Center for Health AI, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
  • Moffitt R; Department of Biomedical Informatics, Stony Brook Cancer Center, Stony Brook University, Stony Brook, NY, USA.
  • Walden A; Center for Health AI, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
  • Chute CG; Section of Biomedical Informatics and Data Science, Johns Hopkins University, Baltimore, MD, USA.
  • Haendel MA; Center for Health AI, University of Colorado Anschutz Medical Campus, Aurora, CO, USA.
Lancet Digit Health; 4(7): e532-e541, 2022 07.
Article in English | MEDLINE | ID: covidwho-1852294
ABSTRACT

BACKGROUND:

Post-acute sequelae of SARS-CoV-2 infection, known as long COVID, have severely affected recovery from the COVID-19 pandemic for patients and society alike. Long COVID is characterised by evolving, heterogeneous symptoms, making it challenging to derive an unambiguous definition. Studies of electronic health records are a crucial element of the US National Institutes of Health's RECOVER Initiative, which is addressing the urgent need to understand long COVID, identify treatments, and accurately identify who has it. This last goal is the aim of this study.

METHODS:

Using the National COVID Cohort Collaborative's (N3C) electronic health record repository, we developed XGBoost machine learning models to identify potential patients with long COVID. We defined our base population (n=1 793 604) as any non-deceased adult patient (age ≥18 years) with either an International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) COVID-19 diagnosis code (U07.1) from an inpatient or emergency visit, or a positive SARS-CoV-2 PCR or antigen test, and for whom at least 90 days had passed since the COVID-19 index date. We examined demographics, health-care utilisation, diagnoses, and medications for 97 995 adults with COVID-19. We used data on these features and 597 patients from a long COVID clinic to train three machine learning models to identify potential long COVID among all patients with COVID-19, patients hospitalised with COVID-19, and patients who had COVID-19 but were not hospitalised. Feature importance was determined via Shapley values. We further validated the models on data from a fourth site.
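The base-population criteria above can be read as a simple eligibility filter. The following Python sketch is illustrative only and is not the authors' code: the `PatientRecord` fields, the function names, and the choice of the earliest qualifying event as the index date are all assumptions made for the example, and the record layout is not the N3C schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class PatientRecord:
    """Hypothetical, simplified patient record (not the N3C data model)."""
    age_years: int
    deceased: bool
    # Date of a U07.1 diagnosis from an inpatient or emergency visit, if any.
    u071_dx_date: Optional[date]
    # Date of a positive SARS-CoV-2 PCR or antigen test, if any.
    positive_test_date: Optional[date]


def covid_index_date(p: PatientRecord) -> Optional[date]:
    """Assumption: the index date is the earliest qualifying COVID-19 event."""
    dates = [d for d in (p.u071_dx_date, p.positive_test_date) if d is not None]
    return min(dates) if dates else None


def in_base_population(p: PatientRecord, as_of: date) -> bool:
    """Apply the stated criteria: non-deceased, adult, qualifying event,
    and at least 90 days elapsed since the index date."""
    index = covid_index_date(p)
    if index is None:
        return False
    return (
        not p.deceased
        and p.age_years >= 18
        and (as_of - index) >= timedelta(days=90)
    )
```

For example, a living 40-year-old whose only qualifying event is a positive test exactly 90 days before the evaluation date would be included, whereas the same patient evaluated 30 days after the test would not.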

FINDINGS:

Our models identified, with high accuracy, patients who potentially have long COVID, achieving areas under the receiver operating characteristic curve of 0·92 (all patients), 0·90 (hospitalised), and 0·85 (non-hospitalised). Important features, as defined by Shapley values, include rate of health-care utilisation, patient age, dyspnoea, and other diagnosis and medication information available within the electronic health record.
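For readers unfamiliar with the metric reported above, the area under the receiver operating characteristic curve equals the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as one half. The Python sketch below computes it directly from that definition. It is a generic illustration with a hypothetical function name, not the evaluation code used in the study, and the toy labels and scores are invented for the example.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a positive case, 0 for a negative case.
    scores: model scores, where higher means more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data: three of the four positive-negative pairs are ranked correctly.
toy_auroc = auroc([1, 1, 0, 0], [0.9, 0.3, 0.4, 0.1])
```

The pairwise form is quadratic in the number of cases, so it suits small illustrations; rank-based implementations in standard libraries are preferable at the scale described in this abstract.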

INTERPRETATION:

Patients identified by our models as potentially having long COVID can be interpreted as patients warranting care at a specialty clinic for long COVID, which is an essential proxy for long COVID diagnosis as its definition continues to evolve. We also achieve the urgent goal of identifying potential long COVID in patients for clinical trials. As more data sources are identified, our models can be retrained and tuned based on the needs of individual studies.

FUNDING:

US National Institutes of Health and National Center for Advancing Translational Sciences through the RECOVER Initiative.
Subject(s)

Full text: Available Collection: International databases Database: MEDLINE Main subject: COVID-19 Type of study: Cohort study / Diagnostic study / Observational study / Prognostic study Topics: Long Covid Limits: Adolescent / Adult / Humans Country/Region as subject: North America Language: English Journal: Lancet Digit Health Year: 2022 Document Type: Article
