1.
Bioinform Adv ; 4(1): vbae095, 2024.
Article in English | MEDLINE | ID: mdl-38962404

ABSTRACT

Motivation: Nonlinear low-dimensional embeddings allow humans to visualize high-dimensional data, as is often seen in bioinformatics, where datasets may have tens of thousands of dimensions. However, relating the axes of a nonlinear embedding to the original dimensions is a nontrivial problem. In particular, humans may identify patterns or interesting subsections in the embedding, but cannot easily identify what those patterns correspond to in the original data. Results: Thus, we present SlowMoMan (SLOW Motions on MANifolds), a web application which allows the user to draw a one-dimensional path onto a 2D embedding. Then, by back-projecting the manifold to the original, high-dimensional space, we sort the original features such that those most discriminative along the manifold are ranked highly. We show a number of pertinent use cases for our tool, including trajectory inference, spatial transcriptomics, and automatic cell classification. Availability and implementation: Software: https://yunwilliamyu.github.io/SlowMoMan/; Code: https://github.com/yunwilliamyu/SlowMoMan.

2.
Am J Surg ; 227: 24-33, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37852844

ABSTRACT

INTRODUCTION: Collaboration is one of the hallmarks of academic research. This study analyzes collaboration patterns in U.S. transplant research, examining publication trends, productive institutions, co-authorship networks, and citation patterns in high-impact transplant journals. METHODS: 4,265 articles published between 2012 and 2021 were analyzed using scientometric tools, logistic regression, VantagePoint software, and Gephi software for network visualization. RESULTS: 16,003 authors from 1,011 institutions and 59 countries were identified, with Harvard, Johns Hopkins, and University of Pennsylvania contributing the most papers. Odds of international collaboration significantly increased over time (OR 1.03; p = 0.040), while odds of citation in single-institution collaborations decreased (OR 0.99; p = 0.016). Five major scientific communities and central institutions (Harvard University and University of Pittsburgh) connecting them were identified, revealing interconnected research clusters. CONCLUSIONS: Collaboration enhances knowledge exchange and research productivity, with an increasing trend of institutional and international collaboration in U.S. transplant research. Understanding this community is essential for promoting research impact and forming strategic partnerships.


Subject(s)
Bibliometrics , Organ Transplantation , Humans , Authorship
3.
J Am Med Inform Assoc ; 30(12): 1985-1994, 2023 11 17.
Article in English | MEDLINE | ID: mdl-37632234

ABSTRACT

OBJECTIVE: Patients who receive most care within a single healthcare system (colloquially called a "loyalty cohort" since they typically return to the same providers) have mostly complete data within that organization's electronic health record (EHR). Loyalty cohorts have low data missingness, which can unintentionally bias research results. Using proxies of routine care and healthcare utilization metrics, we compute a per-patient score that identifies a loyalty cohort. MATERIALS AND METHODS: We implemented a computable program for the widely adopted i2b2 platform that identifies loyalty cohorts in EHRs based on a machine-learning model, which was previously validated using linked claims data. We developed a novel validation approach, which tests, using only EHR data, whether patients returned to the same healthcare system after the training period. We evaluated these tools at 3 institutions using data from 2017 to 2019. RESULTS: Loyalty cohort calculations to identify patients who returned during a 1-year follow-up yielded a mean area under the receiver operating characteristic curve of 0.77 using the original model and 0.80 after calibrating the model at individual sites. Factors such as multiple medications or visits contributed significantly at all sites. Screening tests' contributions (eg, colonoscopy) varied across sites, likely due to coding and population differences. DISCUSSION: This open-source implementation of a "loyalty score" algorithm had good predictive power. Enriching research cohorts by utilizing these low-missingness patients is a way to obtain the data completeness necessary for accurate causal analysis. CONCLUSION: i2b2 sites can use this approach to select cohorts with mostly complete EHR data.
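
The area under the receiver operating characteristic curve reported above can be read as the probability that a randomly chosen patient who returned gets a higher loyalty score than a randomly chosen patient who did not. The following is a minimal sketch of that pairwise definition, not the i2b2 implementation described in the abstract; the function name `auroc` and the toy scores and labels are hypothetical.

```python
def auroc(scores, labels):
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly.

    scores: per-patient scores (higher means more likely to return).
    labels: 1 if the patient returned during follow-up, else 0.
    Ties between a positive and a negative count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical toy data: three of the four positive/negative pairs are ordered correctly.
example = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

The quadratic loop is chosen for readability; a rank-based formula gives the same value more efficiently on large cohorts.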


Subject(s)
Algorithms , Electronic Health Records , Humans , Machine Learning , Delivery of Health Care , Electronics
4.
Sci Data ; 10(1): 452, 2023 07 19.
Article in English | MEDLINE | ID: mdl-37468503

ABSTRACT

More than 150 scientists from 17 consortia are collaborating on an international project to build a Human Reference Atlas, which maps all 37 trillion cells in the healthy adult human body. The initial release of this atlas provided hierarchical lists of the anatomical structures, cell types, and biomarkers in 11 organs. Here, we describe the methods we used as part of this initiative to build the first open, computer-readable, and comprehensive database of the adult human blood vasculature, called the Human Reference Atlas-Vasculature Common Coordinate Framework (HRA-VCCF). It includes 993 vessels and their branching connections, 10 cell types, and 10 biomarkers. With this paper we are releasing additional details on vessel types and subtypes, branching sequence, anastomoses, portal systems, microvasculature, functional tissue units, mappings to regions vessels supply or drain, geometric properties of vessels, and links to 3D reference objects. Future versions will add variants and connections to the lymph vasculature, and the database will be iteratively expanded and improved as additional experimental data become available through the participating consortia.


Subject(s)
Biomarkers , Adult , Humans
5.
J Med Internet Res ; 25: e45662, 2023 05 25.
Article in English | MEDLINE | ID: mdl-37227772

ABSTRACT

Although randomized controlled trials (RCTs) are the gold standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data has been vital in postapproval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of real-world data is electronic health records (EHRs), which contain detailed information on patient care in both structured (eg, diagnosis codes) and unstructured (eg, clinical notes and images) forms. Despite the granularity of the data available in EHRs, the critical variables required to reliably assess the relationship between a treatment and clinical outcome are challenging to extract. To address this fundamental challenge and accelerate the reliable use of EHRs for RWE, we introduce an integrated data curation and modeling pipeline consisting of 4 modules that leverage recent advances in natural language processing, computational phenotyping, and causal modeling techniques with noisy data. Module 1 consists of techniques for data harmonization. We use natural language processing to recognize clinical variables from RCT design documents and map the extracted variables to EHR features with description matching and knowledge networks. Module 2 then develops techniques for cohort construction using advanced phenotyping algorithms to both identify patients with diseases of interest and define the treatment arms. Module 3 introduces methods for variable curation, including a list of existing tools to extract baseline variables from different sources (eg, codified, free text, and medical imaging) and end points of various types (eg, death, binary, temporal, and numerical). Finally, module 4 presents validation and robust modeling methods, and we propose a strategy to create gold-standard labels for EHR variables of interest to validate data curation quality and perform subsequent causal modeling for RWE. 
In addition to the workflow proposed in our pipeline, we also develop a reporting guideline for RWE that covers the necessary information to facilitate transparent reporting and reproducibility of results. Moreover, our pipeline is highly data driven, enhancing study data with a rich variety of publicly available information and knowledge sources. We also showcase our pipeline and provide guidance on the deployment of relevant tools by revisiting the emulation of the Clinical Outcomes of Surgical Therapy Study Group Trial on laparoscopy-assisted colectomy versus open colectomy in patients with early-stage colon cancer. We also draw on existing literature on EHR emulation of RCTs together with our own studies with the Mass General Brigham EHR.


Subject(s)
Colonic Neoplasms , Electronic Health Records , Humans , Algorithms , Informatics , Research Design
7.
J Biomed Inform ; 139: 104306, 2023 03.
Article in English | MEDLINE | ID: mdl-36738870

ABSTRACT

BACKGROUND: In electronic health records, patterns of missing laboratory test results could capture patients' course of disease as well as reflect clinicians' concerns about possible conditions. These patterns are often understudied and overlooked. This study aims to identify informative patterns of missingness among laboratory data collected across 15 healthcare system sites in three countries for COVID-19 inpatients. METHODS: We collected and analyzed demographic, diagnosis, and laboratory data for 69,939 patients with positive COVID-19 PCR tests across three countries from 1 January 2020 through 30 September 2021. We analyzed missing laboratory measurements across sites, missingness stratification by demographic variables, temporal trends of missingness, correlations between labs based on missingness indicators over time, and clustering of groups of labs based on their missingness/ordering pattern. RESULTS: With these analyses, we identified mapping issues faced in seven out of 15 sites. We also identified nuances in data collection and variable definition for the various sites. Temporal trend analyses may support the use of laboratory test result missingness patterns in identifying severe COVID-19 patients. Lastly, using missingness patterns, we determined relationships between various labs that reflect clinical behaviors. CONCLUSION: In this work, we use computational approaches to relate missingness patterns to hospital treatment capacity and highlight the heterogeneity of looking at COVID-19 over time and at multiple sites, where there might be different phases, policies, etc. Changes in missingness could suggest a change in a patient's condition, and patterns of missingness among laboratory measurements could potentially identify clinical outcomes. This allows sites to consider missing data as informative to analyses and helps researchers identify which sites are better poised to study particular questions.


Subject(s)
COVID-19 , Electronic Health Records , Humans , Data Collection , Records , Cluster Analysis
8.
Front Physiol ; 14: 1099403, 2023.
Article in English | MEDLINE | ID: mdl-36814475

ABSTRACT

Enhancing our understanding of lymphatic anatomy from the microscopic to the anatomical scale is essential to discern how the structure and function of the lymphatic system interacts with different tissues and organs within the body and contributes to health and disease. The knowledge of molecular aspects of the lymphatic network is fundamental to understand the mechanisms of disease progression and prevention. Recent advances in mapping components of the lymphatic system using state of the art single cell technologies, the identification of novel biomarkers, new clinical imaging efforts, and computational tools which attempt to identify connections between these diverse technologies hold the potential to catalyze new strategies to address lymphatic diseases such as lymphedema and lipedema. This manuscript summarizes current knowledge of the lymphatic system and identifies prevailing challenges and opportunities to advance the field of lymphatic research as discussed by the experts in the workshop.

9.
PLoS One ; 18(1): e0266985, 2023.
Article in English | MEDLINE | ID: mdl-36598895

ABSTRACT

PURPOSE: In young adults (18 to 49 years old), investigation of the acute respiratory distress syndrome (ARDS) after severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection has been limited. We evaluated the risk factors and outcomes of ARDS following infection with SARS-CoV-2 in a young adult population. METHODS: A retrospective cohort study was conducted between January 1st, 2020 and February 28th, 2021 using patient-level electronic health records (EHR), across 241 United States hospitals and 43 European hospitals participating in the Consortium for Clinical Characterization of COVID-19 by EHR (4CE). To identify the risk factors associated with ARDS, we compared young patients with and without ARDS through a federated analysis. We further compared the outcomes between young and old patients with ARDS. RESULTS: Among the 75,377 hospitalized patients with positive SARS-CoV-2 PCR, 1001 young adults presented with ARDS (7.8% of young hospitalized adults). Their mortality rate at 90 days was 16.2% and they presented with a complication rate for infection similar to that of older adults with ARDS. Peptic ulcer disease, paralysis, obesity, congestive heart failure, valvular disease, diabetes, chronic pulmonary disease and liver disease were associated with a higher risk of ARDS. We described a high prevalence of obesity (53%), hypertension (38%, although not significantly associated with ARDS), and diabetes (32%). CONCLUSION: Through an innovative method, a large international cohort of young adults developing ARDS after SARS-CoV-2 infection has been gathered. It demonstrated the poor outcomes of this population and the associated risk factors.


Subject(s)
COVID-19 , Respiratory Distress Syndrome , Humans , Young Adult , Aged , Adolescent , Adult , Middle Aged , COVID-19/complications , COVID-19/epidemiology , SARS-CoV-2 , Cohort Studies , Retrospective Studies , Electronic Health Records , Respiratory Distress Syndrome/etiology , Respiratory Distress Syndrome/complications , Obesity/complications
10.
EClinicalMedicine ; 55: 101724, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36381999

ABSTRACT

Background: While acute kidney injury (AKI) is a common complication in COVID-19, data on post-AKI kidney function recovery and the clinical factors associated with poor kidney function recovery is lacking. Methods: A retrospective multi-centre observational cohort study comprising 12,891 hospitalized patients aged 18 years or older with a diagnosis of SARS-CoV-2 infection confirmed by polymerase chain reaction from 1 January 2020 to 10 September 2020, and with at least one serum creatinine value 1-365 days prior to admission. Mortality and serum creatinine values were obtained up to 10 September 2021. Findings: Advanced age (HR 2.77, 95%CI 2.53-3.04, p < 0.0001), severe COVID-19 (HR 2.91, 95%CI 2.03-4.17, p < 0.0001), severe AKI (KDIGO stage 3: HR 4.22, 95%CI 3.55-5.00, p < 0.0001), and ischemic heart disease (HR 1.26, 95%CI 1.14-1.39, p < 0.0001) were associated with worse mortality outcomes. AKI severity (KDIGO stage 3: HR 0.41, 95%CI 0.37-0.46, p < 0.0001) was associated with worse kidney function recovery, whereas remdesivir use (HR 1.34, 95%CI 1.17-1.54, p < 0.0001) was associated with better kidney function recovery. In a subset of patients without chronic kidney disease, advanced age (HR 1.38, 95%CI 1.20-1.58, p < 0.0001), male sex (HR 1.67, 95%CI 1.45-1.93, p < 0.0001), severe AKI (KDIGO stage 3: HR 11.68, 95%CI 9.80-13.91, p < 0.0001), and hypertension (HR 1.22, 95%CI 1.10-1.36, p = 0.0002) were associated with post-AKI kidney function impairment. Furthermore, patients with COVID-19-associated AKI had significant and persistent elevations of baseline serum creatinine 125% or more at 180 days (RR 1.49, 95%CI 1.32-1.67) and 365 days (RR 1.54, 95%CI 1.21-1.96) compared to COVID-19 patients with no AKI. Interpretation: COVID-19-associated AKI was associated with higher mortality, and severe COVID-19-associated AKI was associated with worse long-term post-AKI kidney function recovery. 
Funding: Authors are supported by various funders, with full details stated in the acknowledgement section.

11.
Commun Biol ; 5(1): 1369, 2022 12 13.
Article in English | MEDLINE | ID: mdl-36513738

ABSTRACT

Seventeen international consortia are collaborating on a human reference atlas (HRA), a comprehensive, high-resolution, three-dimensional atlas of all the cells in the healthy human body. Laboratories around the world are collecting tissue specimens from donors varying in sex, age, ethnicity, and body mass index. However, harmonizing tissue data across 25 organs and more than 15 bulk and spatial single-cell assay types poses challenges. Here, we present software tools and user interfaces developed to spatially and semantically annotate ("register") and explore the tissue data and the evolving HRA. A key part of these tools is a common coordinate framework, providing standard terminologies and data structures for describing specimen, biological structure, and spatial data linked to existing ontologies. As of April 22, 2022, the "registration" user interface has been used to harmonize and publish data on 5,909 tissue blocks collected by the Human Biomolecular Atlas Program (HuBMAP), the Stimulating Peripheral Activity to Relieve Conditions program (SPARC), the Human Cell Atlas (HCA), the Kidney Precision Medicine Project (KPMP), and the Genotype Tissue Expression project (GTEx). Further, 5,856 tissue sections were derived from 506 HuBMAP tissue blocks. The second "exploration" user interface enables consortia to evaluate data quality, explore tissue data spatially within the context of the HRA, and guide data acquisition. A companion website is at https://cns-iu.github.io/HRA-supporting-information/ .


Subject(s)
Software , Humans
12.
J Biomed Inform ; 134: 104176, 2022 10.
Article in English | MEDLINE | ID: mdl-36007785

ABSTRACT

OBJECTIVE: For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information. MATERIALS AND METHODS: For each of the centers from which we want to borrow information to improve the prediction performance for the target population, a penalized Cox model is fitted to estimate feature coefficients for the center. Using estimated feature coefficients and the covariance matrix of the target population, we then obtain a SurvMaximin estimated set of feature coefficients for the target population. The target population can be an entire cohort comprised of all centers, corresponding to federated learning, or a single center, corresponding to transfer learning. RESULTS: Simulation studies and a real-world international electronic health records application study, with 15 participating health care centers across three countries (France, Germany, and the U.S.), show that the proposed SurvMaximin algorithm achieves comparable or higher accuracy compared with the estimator using only the information of the target site and other existing methods. The SurvMaximin estimator is robust to variations in sample sizes and estimated feature coefficients between centers, which amounts to significantly improved estimates for target sites with fewer observations. CONCLUSIONS: The SurvMaximin method is well suited for both federated and transfer learning in the high-dimensional survival analysis setting. SurvMaximin only requires a one-time summary information exchange from participating centers. Estimated regression vectors can be very heterogeneous. SurvMaximin provides robust Cox feature coefficient estimates without outcome information in the target population and is privacy-preserving.


Subject(s)
Algorithms , Electronic Health Records , Humans , Privacy , Proportional Hazards Models , Survival Analysis
13.
NPJ Digit Med ; 5(1): 74, 2022 Jun 13.
Article in English | MEDLINE | ID: mdl-35697747

ABSTRACT

Given the growing number of prediction algorithms developed to predict COVID-19 mortality, we evaluated the transportability of a mortality prediction algorithm using a multi-national network of healthcare systems. We predicted COVID-19 mortality using baseline commonly measured laboratory values and standard demographic and clinical covariates across healthcare systems, countries, and continents. Specifically, we trained a Cox regression model with nine measured laboratory test values, standard demographics at admission, and comorbidity burden pre-admission. These models were compared at site, country, and continent level. Of the 39,969 hospitalized patients with COVID-19 (68.6% male), 5717 (14.3%) died. In the Cox model, age, albumin, AST, creatinine, CRP, and white blood cell count are most predictive of mortality. The baseline covariates are more predictive of mortality during the early days of COVID-19 hospitalization. Models trained at healthcare systems with larger cohort size largely retain good transportability performance when porting to different sites. The combination of routine laboratory test values at admission along with basic demographic features can predict mortality in patients hospitalized with COVID-19. Importantly, this potentially deployable model differs from prior work by demonstrating not only consistent performance but also reliable transportability across healthcare systems in the US and Europe, highlighting the generalizability of this model and the overall approach.

14.
J Med Internet Res ; 24(5): e37931, 2022 05 18.
Article in English | MEDLINE | ID: mdl-35476727

ABSTRACT

BACKGROUND: Admissions are generally classified as COVID-19 hospitalizations if the patient has a positive SARS-CoV-2 polymerase chain reaction (PCR) test. However, because 35% of SARS-CoV-2 infections are asymptomatic, patients admitted for unrelated indications with an incidentally positive test could be misclassified as a COVID-19 hospitalization. Electronic health record (EHR)-based studies have been unable to distinguish between a hospitalization specifically for COVID-19 versus an incidental SARS-CoV-2 hospitalization. Although the need to improve classification of COVID-19 versus incidental SARS-CoV-2 is well understood, the magnitude of the problems has only been characterized in small, single-center studies. Furthermore, there have been no peer-reviewed studies evaluating methods for improving classification. OBJECTIVE: The aims of this study are to, first, quantify the frequency of incidental hospitalizations over the first 15 months of the pandemic in multiple hospital systems in the United States and, second, to apply electronic phenotyping techniques to automatically improve COVID-19 hospitalization classification. METHODS: From a retrospective EHR-based cohort in 4 US health care systems in Massachusetts, Pennsylvania, and Illinois, a random sample of 1123 SARS-CoV-2 PCR-positive patients hospitalized from March 2020 to August 2021 was manually chart-reviewed and classified as "admitted with COVID-19" (incidental) versus specifically admitted for COVID-19 ("for COVID-19"). EHR-based phenotyping was used to find feature sets to filter out incidental admissions. RESULTS: EHR-based phenotyped feature sets filtered out incidental admissions, which occurred in an average of 26% of hospitalizations (although this varied widely over time, from 0% to 75%). The top site-specific feature sets had 79%-99% specificity with 62%-75% sensitivity, while the best-performing across-site feature sets had 71%-94% specificity with 69%-81% sensitivity. 
CONCLUSIONS: A large proportion of SARS-CoV-2 PCR-positive admissions were incidental. Straightforward EHR-based phenotypes differentiated admissions, which is important to assure accurate public health reporting and research.
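
The specificity and sensitivity ranges reported above come from comparing a feature-set classification against manual chart review. The following is a minimal sketch of that comparison, assuming a binary encoding in which 1 means "admitted for COVID-19" and 0 means "incidental"; the function name and the toy labels are hypothetical and do not reproduce the study's data.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) of predicted labels against chart review.

    predicted: 1 if the phenotype flags the admission as for-COVID-19, else 0.
    actual:    1 if chart review says for-COVID-19, 0 if incidental.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one for-COVID-19 and one incidental case")

    sensitivity = tp / (tp + fn)  # share of true for-COVID-19 admissions that are flagged
    specificity = tn / (tn + fp)  # share of incidental admissions that are filtered out
    return sensitivity, specificity


# Hypothetical toy review of six admissions.
sens, spec = sensitivity_specificity([1, 1, 0, 0, 1, 0], [1, 1, 1, 0, 0, 0])
```

Under this encoding, high specificity means few incidental admissions are counted as COVID-19 hospitalizations, which is the filtering goal the abstract describes.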


Subject(s)
COVID-19 , SARS-CoV-2 , COVID-19/diagnosis , COVID-19/epidemiology , Electronic Health Records , Hospitalization , Humans , Retrospective Studies
15.
medRxiv ; 2022 Feb 18.
Article in English | MEDLINE | ID: mdl-35350202

ABSTRACT

Admissions are generally classified as COVID-19 hospitalizations if the patient has a positive SARS-CoV-2 polymerase chain reaction (PCR) test. However, because 35% of SARS-CoV-2 infections are asymptomatic, patients admitted for unrelated indications with an incidentally positive test could be misclassified as a COVID-19 hospitalization. EHR-based studies have been unable to distinguish between a hospitalization specifically for COVID-19 versus an incidental SARS-CoV-2 hospitalization. From a retrospective EHR-based cohort in four US healthcare systems, a random sample of 1,123 SARS-CoV-2 PCR-positive patients hospitalized between 3/2020–8/2021 was manually chart-reviewed and classified as admitted-with-COVID-19 (incidental) vs. specifically admitted for COVID-19 (for-COVID-19). EHR-based phenotyped feature sets filtered out incidental admissions, which occurred in 26%. The top site-specific feature sets had 79-99% specificity with 62-75% sensitivity, while the best performing across-site feature set had 71-94% specificity with 69-81% sensitivity. A large proportion of SARS-CoV-2 PCR-positive admissions were incidental. Straightforward EHR-based phenotypes differentiated admissions, which is important to assure accurate public health reporting and research.

16.
Blood Adv ; 6(5): 1559-1565, 2022 03 08.
Article in English | MEDLINE | ID: mdl-35086145

ABSTRACT

Although intracranial hemorrhage (ICH) is frequent in the setting of brain metastases, there are limited data on the influence of antiplatelet agents on the development of brain tumor-associated ICH. To evaluate whether the administration of antiplatelet agents increases the risk of ICH, we performed a matched cohort analysis of patients with metastatic brain tumors with blinded radiology review. The study population included 392 patients with metastatic brain tumors (134 received antiplatelet agents and 258 acted as controls). Non-small cell lung cancer was the most common malignancy in the cohort (74.0%), followed by small cell lung cancer (9.9%), melanoma (4.6%), and renal cell cancer (4.3%). Among those who received an antiplatelet agent, 86.6% received aspirin alone and 23.1% received therapeutic anticoagulation during the study period. The cumulative incidence of any ICH at 1 year was 19.3% (95% CI, 14.1-24.4) in patients not receiving antiplatelet agents compared with 22.5% (95% CI, 15.2-29.8; P = .22, Gray test) in those receiving antiplatelet agents. The cumulative incidence of major ICH was 5.4% (95% CI, 2.6-8.3) among controls compared with 5.5% (95% CI, 1.5-9.5; P = .80) in those exposed to antiplatelet agents. The combination of anticoagulation plus antiplatelet agents did not increase the risk of major ICH. The use of antiplatelet agents was not associated with an increase in the incidence, size, or severity of ICH in the setting of brain metastases.


Subject(s)
Brain Neoplasms , Carcinoma, Non-Small-Cell Lung , Lung Neoplasms , Anticoagulants/therapeutic use , Brain Neoplasms/complications , Brain Neoplasms/drug therapy , Carcinoma, Non-Small-Cell Lung/chemically induced , Carcinoma, Non-Small-Cell Lung/complications , Carcinoma, Non-Small-Cell Lung/drug therapy , Humans , Intracranial Hemorrhages/chemically induced , Intracranial Hemorrhages/epidemiology , Lung Neoplasms/complications , Platelet Aggregation Inhibitors/adverse effects , Retrospective Studies
17.
IEEE Trans Knowl Data Eng ; 34(1): 328-339, 2022 Jan.
Article in English | MEDLINE | ID: mdl-38288326

ABSTRACT

In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size $O(\log n)$ to buckets of size $O(\log\log n)$ by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For an additive approximation error $\epsilon$ on a Jaccard index $t$, given a random oracle, HyperMinHash needs $O\left(\epsilon^{-2}\left(\log\log n + \log\frac{1}{\epsilon}\right)\right)$ space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with relative error of around 10% using 2 MiB of memory; MinHash can only estimate Jaccard indices for cardinalities of $10^{10}$ with the same memory consumption.
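
For readers unfamiliar with the baseline being compressed, the following is a minimal sketch of plain k-hash MinHash, not of HyperMinHash itself. It illustrates two properties the abstract says are retained: Jaccard estimation from matching buckets, and unions by element-wise minimum. The helper names, the use of seeded SHA-256 as a stand-in for a random oracle, and the toy sets are all assumptions made for illustration.

```python
import hashlib


def _hash64(item, seed):
    """Seeded 64-bit hash; SHA-256 stands in for a random oracle here."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=128):
    """One bucket per hash function, each holding the minimum hash seen."""
    items = list(items)
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash64(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of buckets whose minima agree estimates the Jaccard index."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must use the same number of hashes")
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)


def merge_signatures(sig_a, sig_b):
    """Sketch of the union: element-wise minimum of the two sketches."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must use the same number of hashes")
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


# Hypothetical toy sets with true Jaccard index 150 / 450 = 1/3.
set_a = set(range(0, 300))
set_b = set(range(150, 450))
sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)
estimate = estimate_jaccard(sig_a, sig_b)
```

Each bucket here stores a full 64-bit minimum; the compression described in the abstract replaces that with a much shorter floating-point-style encoding, which this sketch does not attempt.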

19.
J Am Med Inform Assoc ; 28(12): 2582-2592, 2021 11 25.
Article in English | MEDLINE | ID: mdl-34608931

ABSTRACT

OBJECTIVE: Large amounts of health data are becoming available for biomedical research. Synthesizing information across databases may capture more comprehensive pictures of patient health and enable novel research studies. When no gold standard mappings between patient records are available, researchers may probabilistically link records from separate databases and analyze the linked data. However, previous linked data inference methods are constrained to certain linkage settings and exhibit low power. Here, we present ATLAS, an automated, flexible, and robust association testing algorithm for probabilistically linked data. MATERIALS AND METHODS: Missing variables are imputed at various thresholds using a weighted average method that propagates uncertainty from probabilistic linkage. Next, estimated effect sizes are obtained using a generalized linear model. ATLAS then conducts the threshold combination test by optimally combining P values obtained from data imputed at varying thresholds using Fisher's method and perturbation resampling. RESULTS: In simulations, ATLAS controls for type I error and exhibits high power compared to previous methods. In a real-world genetic association study, meta-analysis of ATLAS-enabled analyses on a linked cohort with analyses using an existing cohort yielded additional significant associations between rheumatoid arthritis genetic risk score and laboratory biomarkers. DISCUSSION: Weighted average imputation weathers false matches and increases contribution of true matches to mitigate linkage error-induced bias. The threshold combination test avoids arbitrarily choosing a threshold to rule a match, thus automating linked data-enabled analyses and preserving power. CONCLUSION: ATLAS promises to enable novel and powerful research studies using linked data to capitalize on all available data sources.
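
The threshold combination test above builds on Fisher's method for pooling P values. The following is a minimal sketch of classical Fisher's method only, under the assumption that the P values are independent. That assumption does not hold for P values computed from the same data at different thresholds, which is presumably why the abstract pairs the method with perturbation resampling; the resampling step is not shown. The function name and inputs are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's method: combine independent P values into one.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail has the closed form exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    Returns (statistic, combined_p).
    """
    if not p_values:
        raise ValueError("need at least one P value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P values must lie in (0, 1]")

    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0

    # Accumulate the series sum_{i=0}^{k-1} half^i / i! term by term.
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half / i
        series += term
    return statistic, math.exp(-half) * series


# Two hypothetical P values of 0.05 each combine to roughly 0.017.
stat, combined = fisher_combined_p([0.05, 0.05])
```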


Subject(s)
Algorithms , Medical Record Linkage , Bias , Databases, Factual , Diagnostic Tests, Routine , Humans
20.
J Med Internet Res ; 23(10): e31400, 2021 10 11.
Article in English | MEDLINE | ID: mdl-34533459

ABSTRACT

BACKGROUND: Many countries have experienced 2 predominant waves of COVID-19-related hospitalizations. Comparing the clinical trajectories of patients hospitalized in separate waves of the pandemic enables further understanding of the evolving epidemiology, pathophysiology, and health care dynamics of the COVID-19 pandemic. OBJECTIVE: In this retrospective cohort study, we analyzed electronic health record (EHR) data from patients with SARS-CoV-2 infections hospitalized in participating health care systems representing 315 hospitals across 6 countries. We compared hospitalization rates, severe COVID-19 risk, and mean laboratory values between patients hospitalized during the first and second waves of the pandemic. METHODS: Using a federated approach, each participating health care system extracted patient-level clinical data on their first and second wave cohorts and submitted aggregated data to the central site. Data quality control steps were adopted at the central site to correct for implausible values and harmonize units. Statistical analyses were performed by computing individual health care system effect sizes and synthesizing these using random effect meta-analyses to account for heterogeneity. We focused the laboratory analysis on C-reactive protein (CRP), ferritin, fibrinogen, procalcitonin, D-dimer, and creatinine based on their reported associations with severe COVID-19. RESULTS: Data were available for 79,613 patients, of which 32,467 were hospitalized in the first wave and 47,146 in the second wave. The prevalence of male patients and patients aged 50 to 69 years decreased significantly between the first and second waves. Patients hospitalized in the second wave had a 9.9% reduction in the risk of severe COVID-19 compared to patients hospitalized in the first wave (95% CI 8.5%-11.3%). 
Demographic subgroup analyses indicated that patients aged 26 to 49 years and 50 to 69 years; male and female patients; and black patients had significantly lower risk for severe disease in the second wave than in the first wave. At admission, the mean values of CRP were significantly lower in the second wave than in the first wave. On the seventh hospital day, the mean values of CRP, ferritin, fibrinogen, and procalcitonin were significantly lower in the second wave than in the first wave. In general, countries exhibited variable changes in laboratory testing rates from the first to the second wave. At admission, there was a significantly higher testing rate for D-dimer in France, Germany, and Spain. CONCLUSIONS: Patients hospitalized in the second wave were at significantly lower risk for severe COVID-19. This corresponded to mean laboratory values in the second wave that were more likely to be in typical physiological ranges on the seventh hospital day compared to the first wave. Our federated approach demonstrated the feasibility and power of harmonizing heterogeneous EHR data from multiple international health care systems to rapidly conduct large-scale studies to characterize how COVID-19 clinical trajectories evolve.
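
The federated design above has each health care system compute its own effect size and then pools them with a random-effects meta-analysis. The abstract does not name the estimator, so the following sketch assumes the common DerSimonian-Laird approach; the function name and the example effect sizes and variances are hypothetical.

```python
import math


def dersimonian_laird(effects, variances):
    """Random-effects pooling of per-site effects (DerSimonian-Laird, assumed).

    effects:   per-site effect estimates (for example, log risk ratios).
    variances: per-site sampling variances of those estimates.
    Returns (pooled_effect, standard_error, tau_squared).
    """
    if not effects or len(effects) != len(variances):
        raise ValueError("effects and variances must be non-empty and equal in length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Fixed-effect (inverse-variance) estimate and Cochran's Q.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # Method-of-moments estimate of between-site variance, floored at zero.
    df = len(effects) - 1
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Re-weight each site by its total variance (within plus between).
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    return pooled, math.sqrt(1.0 / sum_w_star), tau2


# Two hypothetical sites that disagree strongly: tau_squared is large.
pooled, se, tau2 = dersimonian_laird([0.0, 1.0], [0.01, 0.01])
```

Only aggregated per-site estimates and variances enter the pooling step, which matches the abstract's description of sites submitting aggregated data to a central site.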


Subject(s)
COVID-19 , Pandemics , Adult , Aged , Female , Hospitalization , Hospitals , Humans , Male , Middle Aged , Retrospective Studies , SARS-CoV-2