Who has long-COVID? A big data approach (preprint)

Emily Pfaff; Andrew Girvin; Tellen Bennett; Abhishek Bhatia; Ian Brooks; Rachel Deer; Jonathan Dekermanjian; Sarah Elizabeth Jolley; Michael Kahn; Kristin Kostka; Julie McMurry; Richard Moffitt; Anita Walden; Christopher Chute; Melissa Haendel

medRxiv; 2021.

Preprint in English | medRxiv | ID: ppzbmed-10.1101.2021.10.18.21265168

ABSTRACT

Background: Post-acute sequelae of SARS-CoV-2 infection (PASC), otherwise known as long-COVID, have severely impacted recovery from the pandemic for patients and society alike. This new disease is characterized by evolving, heterogeneous symptoms, making it challenging to derive an unambiguous long-COVID definition. Electronic health record (EHR) studies are a critical element of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, which is addressing the urgent need to understand PASC, accurately identify who has PASC, and identify treatments.

Methods: Using the National COVID Cohort Collaborative’s (N3C) EHR repository, we developed XGBoost machine learning (ML) models to identify potential long-COVID patients. We examined demographics, healthcare utilization, diagnoses, and medications for 97,995 adult COVID-19 patients. We used these features and 597 long-COVID clinic patients to train three ML models to identify potential long-COVID patients among (1) all COVID-19 patients, (2) patients hospitalized with COVID-19, and (3) patients who had COVID-19 but were not hospitalized.

Findings: Our models identified potential long-COVID patients with high accuracy, achieving areas under the receiver operating characteristic curve of 0.91 (all patients), 0.90 (hospitalized), and 0.85 (non-hospitalized). Important features include rate of healthcare utilization, patient age, dyspnea, and other diagnosis and medication information available within the EHR. Applying the “all patients” model to the larger N3C cohort identified 100,263 potential long-COVID patients.

Interpretation: Patients flagged by our models can be interpreted as “patients likely to be referred to or seek care at a long-COVID specialty clinic,” an essential proxy for long-COVID diagnosis in the current absence of a definition. We also achieve the urgent goal of identifying potential long-COVID patients for clinical trials. As more data sources are identified, the models can be retrained and tuned based on study needs.

Funding: This study was funded by NCATS and NIH through the RECOVER Initiative.
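The abstract reports model performance as the area under the receiver operating characteristic curve (AUROC). As a minimal sketch of what that metric measures, the Python below computes AUROC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `auroc` and the toy labels and scores are hypothetical and chosen for illustration only; they are not data or code from the study.

```python
def auroc(labels, scores):
    """Return the AUROC for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of positive/negative pairs in which the
    positive case has the higher score; tied pairs count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): 1 = long-COVID clinic patient.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.35, 0.7, 0.3, 0.2, 0.1]
toy_auc = auroc(toy_labels, toy_scores)
print(f"toy AUROC = {toy_auc:.3f}")
```

An AUROC of 0.5 corresponds to ranking no better than chance and 1.0 to perfect separation, which gives context for the reported values of 0.91, 0.90, and 0.85. The pairwise loop is quadratic in cohort size, so it is meant to make the definition concrete rather than to scale to EHR-sized data.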

Subject(s)

COVID-19; Dyspnea

Full text: Available | Collection: Preprints | Database: medRxiv | Main subject: Dyspnea / COVID-19 | Language: English | Year: 2021 | Document Type: Preprint
