Results 1 - 20 of 45
1.
ACR Open Rheumatol ; 2024 May 15.
Article in English | MEDLINE | ID: mdl-38747148

ABSTRACT

OBJECTIVE: We aimed to examine the feasibility of applying natural language processing (NLP) to unstructured electronic health record (EHR) documents to detect the presence of financial insecurity among patients with rheumatologic disease enrolled in an integrated care management program (iCMP). METHODS: We incorporated supervised, rule-based NLP and statistical methods to identify financial insecurity among patients with rheumatic conditions enrolled in an iCMP (n = 20,395) in a multihospital EHR system. We constructed a lexicon for financial insecurity using data from available knowledge sources and then reviewed EHR notes from 538 randomly selected individuals (training cohort n = 366, validation cohort n = 172). We manually categorized records as having "definite," "possible," or "no" mention of financial insecurity. All available notes were processed using Narrative Information Linear Extraction, a rule-based version of NLP. Models were trained on the NLP features for financial insecurity using logistic regression, least absolute shrinkage and selection operator (LASSO), and random forest methods, and their performance characteristics were compared with the reference standard. RESULTS: A total of 245,142 notes were processed from 538 individual patient records. Financial insecurity was present among 100 (27%) individuals in the training cohort and 63 (37%) in the validation cohort. The LASSO and random forest models performed identically and slightly better than logistic regression, with positive predictive values of 0.90, sensitivities of 0.29, and specificities of 0.98. CONCLUSION: The development of a context-driven lexicon used with rule-based NLP to extract data that identify financial insecurity is feasible and improved the capture of financial insecurity with high accuracy.
In the absence of a standard lexicon and construct definition for financial insecurity status, additional studies are needed to optimize the sensitivity of algorithms to categorize financial insecurity with construct validity.
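
The performance measures quoted in this abstract (positive predictive value, sensitivity, specificity) all derive from a 2x2 confusion matrix. The sketch below shows the arithmetic only. The function name and the four counts are hypothetical: the abstract does not report a confusion matrix, and the counts were chosen merely so that they sum to the validation cohort size (172, with 63 positives) and round to the reported values.

```python
def screening_metrics(tp, fp, fn, tn):
    """Return PPV, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "ppv": tp / (tp + fp),          # share of flagged patients who are true positives
        "sensitivity": tp / (tp + fn),  # share of true positives that were flagged
        "specificity": tn / (tn + fp),  # share of true negatives left unflagged
    }


# Hypothetical counts, NOT taken from the study; they only illustrate how a
# high PPV and specificity can coexist with a low sensitivity.
metrics = screening_metrics(tp=18, fp=2, fn=45, tn=107)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these illustrative counts, few patients are flagged, nearly all flagged patients are correct, and most patients with financial insecurity are missed, which is the pattern the abstract describes.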

2.
Semin Arthritis Rheum ; 67: 152468, 2024 May 17.
Article in English | MEDLINE | ID: mdl-38788567

ABSTRACT

OBJECTIVE: Cardiovascular disease (CVD) risk is increased in systemic lupus erythematosus (SLE) and underestimated by general population prediction algorithms. We aimed to develop a novel SLE-specific prediction tool, SLECRISK, to provide a more accurate estimate of CVD risk in SLE. METHODS: We studied patients in the Brigham and Women's Hospital SLE cohort. We collected one-year baseline data including the presence of traditional CVD factors and SLE-related features at cohort enrollment. Ten-year follow-up for the first major adverse cardiovascular event (MACE; myocardial infarction (MI), stroke, or cardiac death) began at day +1 following the baseline period (index date). MACE identified by ICD-9/10 codes were adjudicated by board-certified cardiologists. Least absolute shrinkage and selection operator regression selected SLE-related variables to add to the American College of Cardiology/American Heart Association (ACC/AHA) Pooled Cohort Risk Equations 10-year risk Cox regression model. Model fit statistics and performance (sensitivity, specificity, positive/negative predictive value, c-statistic) for predicting moderate/high 10-year risk (≥7.5%) of MACE were assessed and compared to ACC/AHA, Framingham risk score (FRS), and modified FRS (mFRS). Optimism adjustment internal validation was performed using bootstrapping. RESULTS: We included 1,243 patients with 90 MACEs (46 MIs, 36 strokes, 19 cardiac deaths) over 8946.5 person-years of follow-up. SLE variables selected for the new prediction algorithm (SLECRISK) were SLE activity (remission/mild vs. moderate/severe), disease duration (years), creatinine (mg/dL), anti-dsDNA, anti-RNP, lupus anticoagulant, anti-Ro positivity, and low C4. The sensitivity for detecting moderate/high-risk (≥7.5%) of MACE using SLECRISK was 0.74 (95% CI: 0.65, 0.83), which was better than the sensitivity of the ACC/AHA model (0.38 (95% CI: 0.28, 0.48)). SLECRISK also identified 3.4-fold more moderate/high-risk patients than the ACC/AHA model. 
Patients who were moderate/high-risk according to SLECRISK but not ACC/AHA were more likely to be young women with severe SLE and few other traditional CVD risk factors. Model performance was similar among SLECRISK, FRS, and mFRS. CONCLUSION: The novel SLECRISK tool is more sensitive than the ACC/AHA for predicting moderate/high 10-year risk for MACE and may be particularly useful in predicting risk for young females with severe SLE. Future external validation studies utilizing cohorts with more severe SLE are needed.

3.
J Am Heart Assoc ; 13(9): e030387, 2024 May 07.
Article in English | MEDLINE | ID: mdl-38686879

ABSTRACT

BACKGROUND: Coronary microvascular dysfunction as measured by myocardial flow reserve (MFR) is associated with increased cardiovascular risk in rheumatoid arthritis (RA). The objective of this study was to determine the association of reduced inflammation with MFR and other measures of cardiovascular risk. METHODS AND RESULTS: Patients with RA with active disease about to initiate a tumor necrosis factor inhibitor were enrolled (NCT02714881). All subjects underwent a cardiac perfusion positron emission tomography scan to quantify MFR at baseline before tumor necrosis factor inhibitor initiation, and after tumor necrosis factor inhibitor initiation at 24 weeks. MFR <2.5 in the absence of obstructive coronary artery disease was defined as coronary microvascular dysfunction. Blood samples at baseline and 24 weeks were measured for inflammatory markers (eg, high-sensitivity C-reactive protein [hsCRP], interleukin-1β, and high-sensitivity cardiac troponin T [hs-cTnT]). The primary outcome was mean MFR before and after tumor necrosis factor inhibitor initiation, with Δhs-cTnT as the secondary outcome. Secondary and exploratory analyses included the correlation between ΔhsCRP and other inflammatory markers with MFR and hs-cTnT. We studied 66 subjects, 82% of whom were women, with a mean RA duration of 7.4 years. The median atherosclerotic cardiovascular disease risk was 2.5%; 47% had coronary microvascular dysfunction and 23% had detectable hs-cTnT. We observed no change in mean MFR before (2.65) and after treatment (2.64, P=0.6) or in hs-cTnT. Reductions in hsCRP and interleukin-1β were correlated with a reduction in hs-cTnT. CONCLUSIONS: In this RA cohort with low prevalence of cardiovascular risk factors, nearly 50% of subjects had coronary microvascular dysfunction at baseline. A reduction in inflammation was not associated with improved MFR. 
However, a modest reduction in interleukin-1β, but in no other inflammatory pathway, was correlated with a reduction in subclinical myocardial injury. REGISTRATION: URL: https://www.clinicaltrials.gov; Unique identifier: NCT02714881.


Subject(s)
Arthritis, Rheumatoid , Biomarkers , Coronary Circulation , Inflammation , Microcirculation , Aged , Female , Humans , Male , Middle Aged , Antirheumatic Agents/therapeutic use , Arthritis, Rheumatoid/physiopathology , Arthritis, Rheumatoid/complications , Arthritis, Rheumatoid/blood , Biomarkers/blood , C-Reactive Protein/metabolism , Coronary Artery Disease/physiopathology , Coronary Artery Disease/blood , Coronary Artery Disease/diagnosis , Coronary Circulation/physiology , Coronary Vessels/physiopathology , Coronary Vessels/diagnostic imaging , Fractional Flow Reserve, Myocardial/physiology , Heart Disease Risk Factors , Inflammation/blood , Inflammation/physiopathology , Inflammation Mediators/blood , Interleukin-1beta/blood , Myocardial Perfusion Imaging/methods , Positron-Emission Tomography , Treatment Outcome , Troponin T/blood , Tumor Necrosis Factor Inhibitors/therapeutic use
4.
Patterns (N Y) ; 5(1): 100906, 2024 Jan 12.
Article in English | MEDLINE | ID: mdl-38264714

ABSTRACT

Electronic health record (EHR) data are increasingly used to support real-world evidence studies but are limited by the lack of precise timings of clinical events. Here, we propose a label-efficient incident phenotyping (LATTE) algorithm to accurately annotate the timing of clinical events from longitudinal EHR data. By leveraging the pre-trained semantic embeddings, LATTE selects predictive features and compresses their information into longitudinal visit embeddings through visit attention learning. LATTE models the sequential dependency between the target event and visit embeddings to derive the timings. To improve label efficiency, LATTE constructs longitudinal silver-standard labels from unlabeled patients to perform semi-supervised training. LATTE is evaluated on the onset of type 2 diabetes, heart failure, and relapses of multiple sclerosis. LATTE consistently achieves substantial improvements over benchmark methods while providing high prediction interpretability. The event timings are shown to help discover risk factors of heart failure among patients with rheumatoid arthritis.

5.
Pharmacoepidemiol Drug Saf ; 33(1): e5684, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37654015

ABSTRACT

BACKGROUND: We aimed to determine whether integrating concepts extracted from electronic health record (EHR) notes using natural language processing (NLP) could improve the identification of gout flares. METHODS: Using Medicare claims linked with EHR data, we selected gout patients who initiated urate-lowering therapy (ULT). Patients' 12-month baseline period and on-treatment follow-up were segmented into 1-month units. We retrieved EHR notes for months with gout diagnosis codes and processed notes for NLP concepts. We selected a random sample of 500 patients and reviewed each of their notes for the presence of a physician-documented gout flare. Months containing at least 1 note mentioning gout flares were considered months with events. We used 60% of patients to train predictive models with LASSO. We evaluated the models by the area under the curve (AUC) in the validation data and examined positive/negative predictive values (P/NPV). RESULTS: We extracted and labeled 839 months of follow-up (280 with gout flares). The claims-only model selected 20 variables (AUC = 0.69). The NLP concept-only model selected 15 (AUC = 0.69). The combined model selected 32 claims variables and 13 NLP concepts (AUC = 0.73). The claims-only model had a PPV of 0.64 [0.50, 0.77] and an NPV of 0.71 [0.65, 0.76], whereas the combined model had a PPV of 0.76 [0.61, 0.88] and an NPV of 0.71 [0.65, 0.76]. CONCLUSION: Adding NLP concept variables to claims variables resulted in a small improvement in the identification of gout flares. Our data-driven claims-only model and our combined claims/NLP-concept model outperformed existing rule-based claims algorithms reliant on medication use, diagnosis, and procedure codes.
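
The labeling rule in the methods above (a month counts as an event month if at least 1 note in it mentions a gout flare) can be sketched as a simple grouping step. This is a minimal illustration under stated assumptions: the function name, the tuple layout, and the use of calendar months are all hypothetical, since the abstract does not say whether the 1-month units follow the calendar or are counted from treatment initiation.

```python
from collections import defaultdict
from datetime import date


def months_with_flares(notes):
    """Map (patient_id, year, month) to True if any note that month mentions a flare.

    `notes` is an iterable of (patient_id, note_date, mentions_flare) tuples;
    this layout is an assumption made for the example.
    """
    flags = defaultdict(bool)
    for patient_id, note_date, mentions_flare in notes:
        key = (patient_id, note_date.year, note_date.month)
        # One flare mention is enough to mark the whole month as an event month.
        flags[key] = flags[key] or mentions_flare
    return dict(flags)


# Toy input, not study data.
notes = [
    ("p1", date(2020, 3, 2), False),
    ("p1", date(2020, 3, 20), True),
    ("p1", date(2020, 4, 5), False),
]
flags = months_with_flares(notes)
print(flags)
```

The resulting month-level labels are the kind of outcome a LASSO model could then be trained against, with claims variables and NLP concepts as candidate predictors.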


Subject(s)
Gout , Aged , Humans , United States/epidemiology , Gout/diagnosis , Gout/epidemiology , Natural Language Processing , Electronic Health Records , Medicare , Symptom Flare Up , Algorithms
6.
medRxiv ; 2023 Oct 02.
Article in English | MEDLINE | ID: mdl-37873131

ABSTRACT

Though electronic health record (EHR) systems are a rich repository of clinical information with large potential, the use of EHR-based phenotyping algorithms is often hindered by inaccurate diagnostic records, the presence of many irrelevant features, and the requirement for a human-labeled training set. In this paper, we describe a knowledge-driven online multimodal automated phenotyping (KOMAP) system that i) generates a list of informative features by an online narrative and codified feature search engine (ONCE) and ii) enables the training of a multimodal phenotyping algorithm based on summary data. Powered by composite knowledge from multiple EHR sources, online article corpora, and a large language model, features selected by ONCE show high concordance with the state-of-the-art AI models (GPT4 and ChatGPT) and encourage large-scale phenotyping by providing a smaller but highly relevant feature set. Validation of the KOMAP system across four healthcare centers suggests that it can generate efficient phenotyping algorithms with robust performance. Compared to other methods requiring patient-level inputs and gold-standard labels, the fully online KOMAP provides a significant opportunity to enable multi-center collaboration.

7.
JAMA Intern Med ; 183(10): 1090-1097, 2023 10 01.
Article in English | MEDLINE | ID: mdl-37603326

ABSTRACT

Importance: The US Food and Drug Administration (FDA) is building a national postmarketing surveillance system for medical devices, moving to a "total product life cycle" approach whereby more limited premarketing data are balanced with postmarketing surveillance to capture rare adverse events and long-term safety issues. Objective: To assess the methodological requirements and feasibility of postmarketing device surveillance using endovascular aneurysm repair devices (EVARs), which have been the subject of safety concerns, using clinical data from a large health care system. Design, Setting, and Participants: This retrospective cohort study included patients with electronic health record (EHR) data in the Veterans Affairs Corporate Data Warehouse. Exposure: Implantation of an AFX Endovascular AAA System (AFX) device (any of 3 iterations) or a non-AFX comparator EVAR device from January 1, 2011, to December 21, 2021. Main Outcomes and Measures: The primary outcomes were rates of type III endoleaks and all-cause mortality; and rates of these outcomes associated with AFX devices compared with non-AFX devices, assessed using Cox proportional hazards regression models and doubly robust causal modeling. Information on type III endoleaks was available only as free-text mentions in clinical notes, while all-cause mortality data could be extracted using structured data. Device-specific information required by the FDA is ascertained using unique device identifiers (UDIs), which include factors such as model numbers, catalog numbers, and manufacturer-specific product codes. The availability of UDIs in EHRs was assessed. Results: In total, 13 941 patients (mean [SD] age, 71.8 [7.4] years) received 1 of the devices of interest (AFX with Strata [AFX-S]: 718 patients [5.2%]; AFX with Duraply [AFX-D]: 404 patients [2.9%]; or AFX2: 682 patients [4.9%]), and 12 137 (87.1%) received non-AFX devices. 
The UDIs were not recorded in the EHR for any patient with an AFX device, and partial UDIs were available for 19 patients (0.1%) with a non-AFX device. This necessitated the development of advanced natural language processing tools to define the cohort of patients for analysis. The study identified a significantly higher risk of type III endoleaks at 5 years among patients receiving any of the AFX device iterations, including the most recent version, AFX2 (11.6%; 95% CI, 8.1%-15.1%) compared with that among patients with non-AFX devices (5.7%; 95% CI, 2.2%-9.2%; absolute risk difference, 5.9%; 95% CI, 2.3%-9.4%). However, there was no significantly higher all-cause mortality for any of the AFX device iterations, including for AFX2 (19.0%; 95% CI, 16.0%-22.0%) compared with non-AFX devices (18.0%; 95% CI, 15.0%-21.0%; absolute risk difference, 1.0%; 95% CI, -2.1% to 4.1%). Conclusions and Relevance: The findings of this cohort study suggest that clinical data can be used for the postmarketing device surveillance required by the FDA. The study also highlights ongoing challenges to performing larger-scale surveillance, including lack of consistent use of UDIs and insufficient relevant structured data to efficiently capture certain outcomes of interest.
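
The absolute risk differences quoted in the results are the difference between the two 5-year risk estimates. The sketch below reproduces only that point-estimate arithmetic from the figures in the abstract; the function name is hypothetical, and the confidence intervals are not recomputed because they come from the study's regression and doubly robust models.

```python
def absolute_risk_difference(risk_exposed_pct, risk_comparator_pct):
    """Difference in risk, in percentage points, rounded to one decimal place."""
    return round(risk_exposed_pct - risk_comparator_pct, 1)


# Point estimates as reported in the abstract (AFX2 vs non-AFX devices).
endoleak_diff = absolute_risk_difference(11.6, 5.7)
mortality_diff = absolute_risk_difference(19.0, 18.0)
print(endoleak_diff, mortality_diff)
```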


Subject(s)
Aortic Aneurysm, Abdominal , Blood Vessel Prosthesis Implantation , Endovascular Procedures , Humans , Aged , Blood Vessel Prosthesis , Endoleak/etiology , Endovascular Aneurysm Repair , Aortic Aneurysm, Abdominal/etiology , Aortic Aneurysm, Abdominal/mortality , Aortic Aneurysm, Abdominal/surgery , Retrospective Studies , Cohort Studies , Treatment Outcome , Endovascular Procedures/adverse effects , Endovascular Procedures/instrumentation
8.
Arthritis Care Res (Hoboken) ; 75(12): 2529-2536, 2023 12.
Article in English | MEDLINE | ID: mdl-37331999

ABSTRACT

OBJECTIVE: Social determinants of health (SDoH), such as poverty, are associated with increased burden and severity of rheumatic and musculoskeletal diseases. This study was undertaken to examine the prevalence and documentation of SDoH-related needs in electronic health records (EHRs) of individuals with these conditions. METHODS: We randomly selected individuals with ≥1 International Classification of Diseases, Ninth/Tenth Revision (ICD-9/10) code for a rheumatic/musculoskeletal condition enrolled in a multihospital integrated care management program that coordinates care for medically and/or psychosocially complex individuals. We assessed SDoH documentation using terms for financial needs, food insecurity, housing instability, transportation, and medication access according to EHR note review and ICD-10 SDoH billing codes (Z codes). We used multivariable logistic regression to examine associations between demographic factors (age, gender, race, ethnicity, insurance) and ≥1 (versus 0) SDoH need as the odds ratio (OR) with 95% confidence interval (95% CI). RESULTS: Among 558 individuals with rheumatic/musculoskeletal conditions, 249 (45%) had ≥1 SDoH need documented in EHR notes by social workers, care coordinators, nurses, and physicians. A total of 171 individuals (31%) had financial insecurity, 105 (19%) had transportation needs, 94 (17%) had food insecurity; 5% had ≥1 related Z code. In the multivariable model, the odds of having ≥1 SDoH need were 2.45 times higher (95% CI 1.17-5.11) for Black versus White individuals and significantly higher for Medicaid or Medicare beneficiaries versus commercially insured individuals. CONCLUSION: Nearly half of this sample of complex care management patients with rheumatic/musculoskeletal conditions had SDoH documented within EHR notes; financial insecurity was the most prevalent. Only 5% of patients had representative billing codes, suggesting that systematic strategies to extract SDoH from notes are needed.


Subject(s)
Delivery of Health Care, Integrated , Musculoskeletal Diseases , Rheumatic Diseases , United States/epidemiology , Humans , Aged , Social Determinants of Health , Medicare , Musculoskeletal Diseases/diagnosis , Musculoskeletal Diseases/epidemiology , Musculoskeletal Diseases/therapy , Documentation , Rheumatic Diseases/diagnosis , Rheumatic Diseases/epidemiology , Rheumatic Diseases/therapy
9.
medRxiv ; 2023 May 21.
Article in English | MEDLINE | ID: mdl-37293026

ABSTRACT

Objective: Electronic health record (EHR) systems contain a wealth of clinical data stored as both codified data and free-text narrative notes, covering hundreds of thousands of clinical concepts available for research and clinical care. The complex, massive, heterogeneous, and noisy nature of EHR data imposes significant challenges for feature representation, information extraction, and uncertainty quantification. To address these challenges, we proposed an efficient Aggregated naRrative Codified Health (ARCH) records analysis to generate a large-scale knowledge graph (KG) for a comprehensive set of EHR codified and narrative features. Methods: The ARCH algorithm first derives embedding vectors from a co-occurrence matrix of all EHR concepts and then generates cosine similarities along with associated p-values to measure the strength of relatedness between clinical features with statistical certainty quantification. In the final step, ARCH performs a sparse embedding regression to remove indirect linkage between entity pairs. We validated the clinical utility of the ARCH knowledge graph, generated from 12.5 million patients in the Veterans Affairs (VA) healthcare system, through downstream tasks including detecting known relationships between entity pairs, predicting drug side effects, disease phenotyping, as well as sub-typing Alzheimer's disease patients. Results: ARCH produces high-quality clinical embeddings and KG for over 60,000 EHR concepts, as visualized in the R-shiny powered web-API (https://celehs.hms.harvard.edu/ARCH/). The ARCH embeddings attained an average area under the ROC curve (AUC) of 0.926 and 0.861 for detecting pairs of similar EHR concepts when the concepts are mapped to codified data and to NLP data, respectively; and 0.810 (codified) and 0.843 (NLP) for detecting related pairs. Based on the p-values computed by ARCH, the sensitivities for detecting similar and related entity pairs were 0.906 and 0.888, respectively, under false discovery rate (FDR) control of 5%. 
For detecting drug side effects, the cosine similarity based on the ARCH semantic representations achieved an AUC of 0.723 while the AUC improved to 0.826 after few-shot training via minimizing the loss function on the training data set. Incorporating NLP data substantially improved the ability to detect side effects in the EHR. For example, based on unsupervised ARCH embeddings, the power of detecting drug-side effect pairs when using codified data only was 0.15, much lower than the power of 0.51 when using both codified and NLP concepts. Compared to existing large-scale representation learning methods including PubmedBERT, BioBERT and SAPBERT, ARCH attains the most robust performance and substantially higher accuracy in detecting these relationships. Incorporating ARCH selected features in weakly supervised phenotyping algorithms can improve the robustness of algorithm performance, especially for diseases that benefit from NLP features as supporting evidence. For example, the phenotyping algorithm for depression attained an AUC of 0.927 when using ARCH selected features but only 0.857 when using codified features selected via the KESER network[1]. In addition, embeddings and knowledge graphs generated from the ARCH network were able to cluster AD patients into two subgroups, where the fast progression subgroup had a much higher mortality rate. Conclusions: The proposed ARCH algorithm generates large-scale high-quality semantic representations and knowledge graph for both codified and NLP EHR features, useful for a wide range of predictive modeling tasks.
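
Cosine similarity between concept embeddings is the relatedness measure named in the methods above. The sketch below shows that single step in plain Python. It is not the ARCH implementation: the embedding derivation from the co-occurrence matrix, the p-values, and the sparse embedding regression are not shown, and the two-dimensional vectors and concept names are toy values invented for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings with hypothetical concept names; real embeddings would have
# many more dimensions and be learned from EHR co-occurrence data.
embeddings = {
    "concept_a": [1.0, 0.0],
    "concept_b": [0.0, 1.0],
    "concept_c": [1.0, 1.0],
}

sim_ab = cosine_similarity(embeddings["concept_a"], embeddings["concept_b"])
sim_ac = cosine_similarity(embeddings["concept_a"], embeddings["concept_c"])
print(sim_ab, sim_ac)
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions; a knowledge graph would keep only pairs whose similarity passes a chosen significance threshold.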

10.
J Med Internet Res ; 25: e45662, 2023 05 25.
Article in English | MEDLINE | ID: mdl-37227772

ABSTRACT

Although randomized controlled trials (RCTs) are the gold standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data has been vital in postapproval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of real-world data is electronic health records (EHRs), which contain detailed information on patient care in both structured (eg, diagnosis codes) and unstructured (eg, clinical notes and images) forms. Despite the granularity of the data available in EHRs, the critical variables required to reliably assess the relationship between a treatment and clinical outcome are challenging to extract. To address this fundamental challenge and accelerate the reliable use of EHRs for RWE, we introduce an integrated data curation and modeling pipeline consisting of 4 modules that leverage recent advances in natural language processing, computational phenotyping, and causal modeling techniques with noisy data. Module 1 consists of techniques for data harmonization. We use natural language processing to recognize clinical variables from RCT design documents and map the extracted variables to EHR features with description matching and knowledge networks. Module 2 then develops techniques for cohort construction using advanced phenotyping algorithms to both identify patients with diseases of interest and define the treatment arms. Module 3 introduces methods for variable curation, including a list of existing tools to extract baseline variables from different sources (eg, codified, free text, and medical imaging) and end points of various types (eg, death, binary, temporal, and numerical). Finally, module 4 presents validation and robust modeling methods, and we propose a strategy to create gold-standard labels for EHR variables of interest to validate data curation quality and perform subsequent causal modeling for RWE. 
In addition to the workflow proposed in our pipeline, we also develop a reporting guideline for RWE that covers the necessary information to facilitate transparent reporting and reproducibility of results. Moreover, our pipeline is highly data driven, enhancing study data with a rich variety of publicly available information and knowledge sources. We also showcase our pipeline and provide guidance on the deployment of relevant tools by revisiting the emulation of the Clinical Outcomes of Surgical Therapy Study Group Trial on laparoscopy-assisted colectomy versus open colectomy in patients with early-stage colon cancer. We also draw on existing literature on EHR emulation of RCTs together with our own studies with the Mass General Brigham EHR.


Subject(s)
Colonic Neoplasms , Electronic Health Records , Humans , Algorithms , Informatics , Research Design
11.
EBioMedicine ; 92: 104581, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37121095

ABSTRACT

BACKGROUND: Rheumatoid arthritis (RA) shares genetic variants with other autoimmune conditions, but existing studies test the association of RA variants with a pre-defined set of phenotypes. The objective of this study was to perform a large-scale, systemic screen to determine phenotypes that share genetic architecture with RA to inform our understanding of shared pathways. METHODS: In the UK Biobank (UKB), we constructed RA genetic risk scores (GRS) incorporating human leukocyte antigen (HLA) and non-HLA risk alleles. Phenotypes were defined using groupings of International Classification of Diseases (ICD) codes. Patients with an RA code were excluded to mitigate the possibility of associations being driven by the diagnosis or management of RA. We performed a phenome-wide association study, testing the association of the RA GRS with phenotypes using multivariate generalized estimating equations that adjusted for age, sex, and the first five principal components. Statistical significance was defined using Bonferroni correction. Results were replicated in an independent cohort and replicated phenotypes were validated using medical record review of patients. FINDINGS: We studied n = 316,166 subjects from UKB without evidence of RA and screened for association between the RA GRS and n = 1317 phenotypes. In the UKB, 20 phenotypes were significantly associated with the RA GRS, of which 13 (65%) were immune-mediated conditions including polymyalgia rheumatica, granulomatosis with polyangiitis (GPA), type 1 diabetes, and multiple sclerosis. We further identified a novel association in celiac disease where the HLA and non-HLA alleles had strong associations in opposite directions. Strikingly, we observed that the non-HLA GRS was exclusively associated with greater risk of the validated conditions, suggesting shared underlying pathways outside the HLA region. 
INTERPRETATION: This study replicated and identified novel autoimmune phenotypes verified by medical record review that share immune pathways with RA and may inform opportunities for shared treatment targets, as well as risk assessment for conditions with a paucity of genomic data, such as GPA. FUNDING: This research was funded by the US National Institutes of Health (P30AR072577, R21AR078339, R35GM142879, T32AR007530) and the Harold and DuVal Bowen Fund.
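
Two calculations underlie the screen described above: a genetic risk score and a Bonferroni-corrected significance threshold. The sketch below assumes the common construction of a GRS as a weighted sum of risk-allele dosages and a family-wise alpha of 0.05; neither detail is stated in the abstract. The function names, variant names, weights, and dosages are hypothetical, and only the number of phenotypes tested (1317) comes from the text.

```python
def genetic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per variant)."""
    return sum(dosages[variant] * weight for variant, weight in weights.items())


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that controls the family-wise error rate."""
    return alpha / n_tests


# Hypothetical weights (for example, log odds ratios) and one person's dosages.
weights = {"variant_a": 0.5, "variant_b": 0.25}
dosages = {"variant_a": 2, "variant_b": 1}
score = genetic_risk_score(dosages, weights)

# 1317 phenotypes were screened; alpha = 0.05 is an assumption.
threshold = bonferroni_threshold(0.05, 1317)
print(score, threshold)
```

Under these assumptions, a phenotype would need a p-value below roughly 3.8e-5 to count as significantly associated with the score.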


Subject(s)
Arthritis, Rheumatoid , Genetic Predisposition to Disease , Humans , Genotype , Arthritis, Rheumatoid/diagnosis , Arthritis, Rheumatoid/genetics , Risk Factors , Phenotype , HLA Antigens/genetics , Histocompatibility Antigens Class II/genetics , HLA-DRB1 Chains/genetics , Alleles
12.
Bioinformatics ; 39(2)2023 02 03.
Article in English | MEDLINE | ID: mdl-36805623

ABSTRACT

MOTIVATION: Predicting molecule-disease indications and side effects is important for drug development and pharmacovigilance. Comprehensively mining molecule-molecule, molecule-disease and disease-disease semantic dependencies can potentially improve prediction performance. METHODS: We introduce a Multi-Modal REpresentation Mapping Approach to Predicting molecular-disease relations (M2REMAP) by incorporating clinical semantics learned from electronic health records (EHR) of 12.6 million patients. Specifically, M2REMAP first learns a multimodal molecule representation that synthesizes chemical property and clinical semantic information by mapping molecule chemicals via a deep neural network onto the clinical semantic embedding space shared by drugs, diseases and other common clinical concepts. To infer molecule-disease relations, M2REMAP combines multimodal molecule representation and disease semantic embedding to jointly infer indications and side effects. RESULTS: We extensively evaluate M2REMAP on molecule indications, side effects and interactions. Results show that incorporating EHR embeddings improves performance significantly, for example, attaining an improvement over the baseline models by 23.6% in PRC-AUC on indications and 23.9% on side effects. Further, M2REMAP overcomes the limitation of existing methods and effectively predicts drugs for novel diseases and emerging pathogens. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/celehs/M2REMAP, and prediction results are provided at https://shiny.parse-health.org/drugs-diseases-dev/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Drug-Related Side Effects and Adverse Reactions , Humans , Drug Development , Electronic Health Records , Neural Networks, Computer , Pharmacovigilance
13.
Arthritis Care Res (Hoboken) ; 75(5): 1036-1045, 2023 05.
Article in English | MEDLINE | ID: mdl-34623035

ABSTRACT

OBJECTIVE: In rheumatoid arthritis (RA), there are limited data on risk factors for the clinical heart failure (HF) subtypes of HF with reduced ejection fraction (HFrEF) and HF with preserved ejection fraction (HFpEF). This study examined the association between inflammation and incident HF subtypes in RA. Because inflammation changes over time with disease activity, we hypothesized that the effect of inflammation may be stronger at the 5-year follow-up than at the standard 10-year follow-up from general population studies of cardiovascular risk. METHODS: We studied an electronic health record (EHR)-based RA cohort with data pre- and post-RA incidence. We applied a validated approach to identify HF and extract ejection fraction to classify HFrEF and HFpEF. Follow-up started from the RA incidence date (index date) to the earliest occurrence of incident HF, death, last EHR encounter, or 10 years. Baseline inflammation was assessed using erythrocyte sedimentation rate or C-reactive protein values. Covariates included demographic characteristics, established HF risk factors, and RA-related factors. We tested the association between baseline inflammation and incident HF and its subtypes using Cox proportional hazards models. RESULTS: We studied 9,087 patients with RA; 8.2% developed HF during 10 years of follow-up. Elevated inflammation was associated with increased risk for HF at both 5- and 10-year follow-ups (hazard ratio [HR] 1.66, 95% confidence interval [95% CI] 1.12-2.46 and HR 1.46, 95% CI 1.13-1.90, respectively), an association also seen for HFpEF at 5 years (HR 1.72, 95% CI 1.09-2.70) and 10 years (HR 1.45, 95% CI 1.07-1.94). HFrEF was not associated with inflammation for either follow-up time. CONCLUSION: Elevated inflammation early in RA diagnosis was associated with HF; this association was driven by HFpEF and not HFrEF, suggesting a window of opportunity for prevention of HFpEF in RA.


Subject(s)
Arthritis, Rheumatoid , Heart Failure , Humans , Stroke Volume , Heart Failure/diagnosis , Heart Failure/epidemiology , Risk Factors , Inflammation , Prognosis
14.
J Biomed Inform ; 133: 104147, 2022 09.
Article in English | MEDLINE | ID: mdl-35872266

ABSTRACT

OBJECTIVE: The growing availability of electronic health records (EHR) data opens opportunities for integrative analysis of multi-institutional EHR to produce generalizable knowledge. A key barrier to such integrative analyses is the lack of semantic interoperability across different institutions due to coding differences. We propose a Multiview Incomplete Knowledge Graph Integration (MIKGI) algorithm to integrate information from multiple sources with partially overlapping EHR concept codes to enable translations between healthcare systems. METHODS: The MIKGI algorithm combines knowledge graph information from (i) embeddings trained from the co-occurrence patterns of medical codes within each EHR system and (ii) semantic embeddings of the textual strings of all medical codes obtained from the Self-Aligning Pretrained BERT (SAPBERT) algorithm. Due to the heterogeneity in the coding across healthcare systems, each EHR source provides partial coverage of the available codes. MIKGI synthesizes the incomplete knowledge graphs derived from these multi-source embeddings by minimizing a spherical loss function that combines the pairwise directional similarities of embeddings computed from all available sources. MIKGI outputs harmonized semantic embedding vectors for all EHR codes, which improves the quality of the embeddings and enables direct assessment of both similarity and relatedness between any pair of codes from multiple healthcare systems. RESULTS: With EHR co-occurrence data from Veterans Affairs (VA) healthcare and Mass General Brigham (MGB), the MIKGI algorithm produces high-quality embeddings for a variety of downstream tasks including detecting known similar or related entity pairs and mapping VA local codes to the relevant EHR codes used at MGB. Based on the cosine similarity of the MIKGI trained embeddings, the AUC was 0.918 for detecting similar entity pairs and 0.809 for detecting related pairs. 
For cross-institutional medical code mapping, the top 1 and top 5 accuracy were 91.0% and 97.5% when mapping medication codes at VA to RxNorm medication codes at MGB; 59.1% and 75.8% when mapping VA local laboratory codes to LOINC hierarchy. When trained with 500 labels, the lab code mapping attained top 1 and 5 accuracy at 77.7% and 87.9%. MIKGI also attained best performance in selecting VA local lab codes for desired laboratory tests and COVID-19 related features for COVID EHR studies. Compared to existing methods, MIKGI attained the most robust performance with accuracy the highest or near the highest across all tasks. CONCLUSIONS: The proposed MIKGI algorithm can effectively integrate incomplete summary data from biomedical text and EHR data to generate harmonized embeddings for EHR codes for knowledge graph modeling and cross-institutional translation of EHR codes.
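The top 1 and top 5 accuracies reported above can be illustrated with a minimal sketch. It assumes each source code is mapped to whichever target codes have the most similar embedding by cosine similarity, which is how the abstract describes the evaluation; the function names and the toy two-dimensional vectors below are hypothetical and are not taken from the MIKGI implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def top_k_accuracy(source_emb, target_emb, true_idx, k):
    """Fraction of source codes whose true target is among the k most similar targets."""
    hits = 0
    for vec, truth in zip(source_emb, true_idx):
        sims = [cosine(vec, t) for t in target_emb]
        # Target indices ordered from most to least similar
        ranked = sorted(range(len(target_emb)), key=lambda j: sims[j], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_idx)

# Hypothetical toy embeddings (not real VA, RxNorm, or LOINC codes)
source = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.6]]
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
true_idx = [0, 1, 0]  # assumed correct target index for each source code

top1 = top_k_accuracy(source, target, true_idx, k=1)
top2 = top_k_accuracy(source, target, true_idx, k=2)
```

In this toy example the third source code has its true target ranked second, so it counts as a miss at k=1 and a hit at k=2, which is the same gap the abstract reports between top 1 and top 5 accuracy.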


Subject(s)
COVID-19 , Electronic Health Records , Algorithms , Humans , Logical Observation Identifiers Names and Codes , Pattern Recognition, Automated
15.
J Am Heart Assoc ; 11(15): e026014, 2022 08 02.
Article in English | MEDLINE | ID: mdl-35904194

ABSTRACT

Background: Models predicting atrial fibrillation (AF) risk, such as Cohorts for Heart and Aging Research in Genomic Epidemiology AF (CHARGE-AF), have not performed as well in electronic health records. Natural language processing (NLP) may improve models by using narrative electronic health record text. Methods and Results: From a primary care network, we included patients aged ≥65 years with visits between 2003 and 2013 in development (n=32 960) and internal validation cohorts (n=13 992). An external validation cohort from a separate network from 2015 to 2020 included 39 051 patients. Model features were defined using electronic health record codified data and narrative data with NLP. We developed 2 models to predict 5-year AF incidence using (1) codified+NLP data and (2) codified data only and evaluated model performance. The analysis included 2839 incident AF cases in the development cohort and 1057 and 2226 cases in internal and external validation cohorts, respectively. The C-statistic was greater (P<0.001) in codified+NLP model (0.744 [95% CI, 0.735-0.753]) compared with codified-only (0.730 [95% CI, 0.720-0.739]) in the development cohort. In internal validation, the C-statistic of codified+NLP was modestly higher (0.735 [95% CI, 0.720-0.749]) compared with codified-only (0.729 [95% CI, 0.715-0.744]; P=0.06) and CHARGE-AF (0.717 [95% CI, 0.703-0.731]; P=0.002). Codified+NLP and codified-only were well calibrated, whereas CHARGE-AF underestimated AF risk. In external validation, the C-statistic of codified+NLP (0.750 [95% CI, 0.740-0.760]) remained higher (P<0.001) than codified-only (0.738 [95% CI, 0.727-0.748]) and CHARGE-AF (0.735 [95% CI, 0.725-0.746]). Conclusions: Estimation of 5-year risk of AF can be modestly improved using NLP to incorporate narrative electronic health record data.
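For readers unfamiliar with the C-statistic compared above, the following is a minimal sketch of how it can be computed. It assumes a simple binary outcome with no censoring, in which case the C-statistic is the proportion of case/non-case pairs where the case received the higher predicted risk, with ties counted as half. The abstract does not state which variant the authors used, and the risks and outcomes below are made-up illustration values.

```python
def c_statistic(risks, outcomes):
    """Concordance for a binary outcome: P(risk of a case > risk of a non-case), ties count 0.5."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    controls = [r for r, y in zip(risks, outcomes) if y == 0]
    concordant = 0.0
    for rc in cases:
        for rn in controls:
            if rc > rn:
                concordant += 1.0
            elif rc == rn:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

# Hypothetical predicted 5-year risks and observed outcomes (1 = incident AF)
risks = [0.9, 0.4, 0.35, 0.8, 0.2, 0.4]
outcomes = [1, 1, 0, 0, 0, 0]

c_stat = c_statistic(risks, outcomes)
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, so differences such as 0.744 versus 0.730 represent a modest gain in ranking cases above non-cases.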


Subject(s)
Atrial Fibrillation , Natural Language Processing , Atrial Fibrillation/diagnosis , Atrial Fibrillation/epidemiology , Cohort Studies , Electronic Health Records , Humans , Incidence , Risk Assessment/methods
16.
J Biomed Inform ; 132: 104109, 2022 08.
Article in English | MEDLINE | ID: mdl-35660521

ABSTRACT

OBJECTIVE: Accurately assigning phenotype information to individual patients via computational phenotyping using Electronic Health Records (EHRs) has been seen as the first step towards enabling EHRs for precision medicine research. Chart review labels annotated by clinical experts, also known as "gold standard" labels, are essential for the development and validation of computational phenotyping algorithms. However, given the complexity of EHR systems, the process of chart review is both labor intensive and time consuming. We propose a fully automated algorithm, referred to as pGUESS, to rank EHR notes according to their relevance to a given phenotype. By identifying the most relevant notes, pGUESS can greatly improve the efficiency and accuracy of chart reviews. METHOD: pGUESS uses prior guided semantic similarity to measure the informativeness of a clinical note to a given phenotype. We first select candidate clinical concepts from a pool of comprehensive medical concepts using public knowledge sources and then derive the semantic embedding vector (SEV) for a reference article (SEVref) and each note (SEVnote). The algorithm scores the relevance of a note as the cosine similarity between SEVnote and SEVref. RESULTS: The algorithm was validated against four sets of 200 notes that were manually annotated by clinical experts to assess their informativeness to one of three disease phenotypes. pGUESS algorithm substantially outperforms existing unsupervised approaches for classifying the relevance status with respect to both accuracy and scalability across phenotypes. Averaging over the three phenotypes, the rank correlation between the algorithm ranking and gold standard label was 0.64 for pGUESS, but only 0.47 and 0.35 for the next two best performing algorithms. pGUESS is also much more computationally scalable compared to existing algorithms. 
CONCLUSION: The pGUESS algorithm can substantially reduce the burden of chart review and holds potential for improving the efficiency and accuracy of human annotation.
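The scoring and evaluation steps described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the pGUESS implementation: the embedding vectors and gold labels are invented, and the rank correlation is taken to be Spearman's (with tied values given their average rank), which the abstract does not specify.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical semantic embedding vectors: one reference article, four notes
sev_ref = [1.0, 0.5, 0.0]
sev_notes = [[0.9, 0.6, 0.1], [0.0, 0.2, 1.0], [0.5, 0.5, 0.5], [1.0, 0.0, 0.0]]
gold = [2, 0, 1, 1]  # assumed expert informativeness labels, higher = more relevant

scores = [cosine(note, sev_ref) for note in sev_notes]  # relevance of each note
rho = spearman(scores, gold)  # agreement between algorithm ranking and gold labels
```

Sorting notes by `scores` in descending order would give the review order, so a chart reviewer would see the notes most similar to the reference article first.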


Subject(s)
Algorithms , Semantics , Electronic Health Records , Humans , Natural Language Processing , Phenotype , Precision Medicine
17.
JAMA Netw Open ; 5(6): e2218371, 2022 06 01.
Article in English | MEDLINE | ID: mdl-35737384

ABSTRACT

Importance: Temporal shifts in clinical knowledge and practice need to be adjusted for in treatment outcome assessment in clinical evidence. Objective: To use electronic health record (EHR) data to (1) assess the temporal trends in treatment decisions and patient outcomes and (2) emulate a randomized clinical trial (RCT) using EHR data with proper adjustment for temporal trends. Design, Setting, and Participants: The Clinical Outcomes of Surgical Therapy (COST) Study Group Trial assessing overall survival of patients with stages I to III early-stage colon cancer was chosen as the target trial. The RCT was emulated using EHR data of patients from a single health care system cohort who underwent colectomy for early-stage colon cancer from January 1, 2006, to December 31, 2017, and were followed up to January 1, 2020, from Mass General Brigham. Analyses were conducted from December 2, 2019, to January 24, 2022. Exposures: Laparoscopy-assisted colectomy (LAC) vs open colectomy (OC). Main Outcomes and Measures: The primary outcome was 5-year overall survival. To address confounding in the emulation, pretreatment variables were selected and adjusted. The temporal trends were adjusted by stratification of the calendar year when the colectomies were performed with cotraining across strata. 
Results: A total of 943 patients met key RCT eligibility criteria in the EHR emulation cohort, including 518 undergoing LAC (median age, 63 [range, 20-95] years; 268 [52%] women; 121 [23%] with stage I, 165 [32%] with stage II, and 232 [45%] with stage III cancer; 32 [6%] with colon adhesion; 278 [54%] with right-sided colon cancer; 18 [3%] with left-sided colon cancer; and 222 [43%] with sigmoid colon cancer) and 425 undergoing OC (median age, 65 [range, 28-99] years; 223 [52%] women; 61 [14%] with stage I, 153 [36%] with stage II, and 211 [50%] with stage III cancer; 39 [9%] with colon adhesion; 202 [47%] with right-sided colon cancer; 39 [9%] with left-sided colon cancer; and 201 [47%] with sigmoid colon cancer). Tests for temporal trends in treatment assignment (χ2 = 60.3; P < .001) and overall survival (χ2 = 137.2; P < .001) were significant. The adjusted EHR emulation reached the same conclusion as the RCT: LAC is not inferior to OC in overall survival rate with risk difference at 5 years of -0.007 (95% CI, -0.070 to 0.057). The results were consistent for stratified analysis within each temporal period. Conclusions and Relevance: These findings suggest that confounding bias from temporal trends should be considered when conducting clinical evidence studies with long time spans. Stratification of calendar time and cotraining of models is one solution. With proper adjustment, clinical evidence may supplement RCTs in the assessment of treatment outcome over time.


Subject(s)
Laparoscopy , Sigmoid Neoplasms , Aged , Colectomy/methods , Electronic Health Records , Female , Humans , Laparoscopy/methods , Male , Middle Aged
18.
Int J Med Inform ; 162: 104753, 2022 Apr 01.
Article in English | MEDLINE | ID: mdl-35405530

ABSTRACT

OBJECTIVE: The use of electronic health records (EHR) systems has grown over the past decade, and with it, the need to extract information from unstructured clinical narratives. Clinical notes, however, frequently contain acronyms with several potential senses (meanings) and traditional natural language processing (NLP) techniques cannot differentiate between these senses. In this study we introduce a semi-supervised method for binary acronym disambiguation, the task of classifying a target sense for acronyms in the clinical EHR notes. METHODS: We developed a semi-supervised ensemble machine learning (CASEml) algorithm to automatically identify when an acronym means a target sense by leveraging semantic embeddings, visit-level text and billing information. The algorithm was validated using note data from the Veterans Affairs hospital system to classify the meaning of three acronyms: RA, MS, and MI. We compared the performance of CASEml against another standard semi-supervised method and a baseline metric selecting the most frequent acronym sense. Along with evaluating the performance of these methods for specific instances of acronyms, we evaluated the impact of acronym disambiguation on NLP-driven phenotyping of rheumatoid arthritis. RESULTS: CASEml achieved accuracies of 0.947, 0.911, and 0.706 for RA, MS, and MI, respectively, higher than a standard baseline metric and (on average) higher than a state-of-the-art semi-supervised method. As well, we demonstrated that applying CASEml to medical notes improves the AUC of a phenotype algorithm for rheumatoid arthritis. CONCLUSION: CASEml is a novel method that accurately disambiguates acronyms in clinical notes and has advantages over commonly used supervised and semi-supervised machine learning approaches. In addition, CASEml improves the performance of NLP tasks that rely on ambiguous acronyms, such as phenotyping.

19.
Mult Scler Relat Disord ; 57: 103333, 2022 Jan.
Article in English | MEDLINE | ID: mdl-35158446

ABSTRACT

BACKGROUND: Long-term data on multiple sclerosis (MS) inflammatory disease activity are limited. We examined electronic health records (EHR) indicators of disease activity in people with MS. METHODS: We analyzed prospectively collected research registry data and linked EHR data in a clinic-based cohort from 2000 to 2016. We used the trend of the yearly incident relapse rate from the registry data as benchmark. We then calculated the temporal trends of potentially relevant EHR measures, including mean count of the MS diagnostic code, mentions of MS-related concepts, MS-related health utilizations and selected prescriptions. RESULTS: 1,555 MS patients had both registry and EHR data. Between 2000 and 2016, the registry data showed a declining trend in the yearly incident relapse rate, parallel to an increasing trend of DMT usage. Among the EHR measures, covariate-adjusted frequency of diagnostic code of MS, procedure codes of MS-related imaging studies and emergency room visits, and electronic prescription for steroids declined over time, mirroring the temporal trend of the benchmark yearly incident relapse rate. CONCLUSION: This study highlights EHR indicators of MS relapse that could enable large-scale examination of long-term disease activities or inform individual patient monitoring in clinical settings where EHR data are available.


Subject(s)
Multiple Sclerosis , Cohort Studies , Electronic Health Records , Humans , Multiple Sclerosis/epidemiology , Recurrence , Registries
20.
JAMA Netw Open ; 4(11): e2134627, 2021 11 01.
Article in English | MEDLINE | ID: mdl-34783826

ABSTRACT

Importance: As disease-modifying treatment options for multiple sclerosis increase, comparisons of the options based on real-world evidence may guide clinical decision-making. Objective: To compare the relapse outcomes between 2 pairs of disease-modifying treatments: dimethyl fumarate vs fingolimod and natalizumab vs rituximab. Design, Setting, and Participants: This comparative effectiveness study integrated data from a clinic-based multiple sclerosis research registry and its linked electronic health records (EHR) system between January 1, 2006, and December 31, 2016, and built treatment groups for each pairwise disease-modifying treatment comparison according to both registry records and electronic prescriptions. Parallel analyses were conducted from October 11, 2019, to July 7, 2021. Main Outcomes and Measures: The main outcomes were the 1-year and 2-year relapse rates as well as the time to relapse. To compare relapse outcomes, the study adjusted for covariates from 2 sources (registry and EHR) and corrected for confounding biases among the covariates by the doubly robust estimation. Results: The study included 4 treatment groups: dimethyl fumarate (n = 260; 198 women [76.2%]; 227 non-Hispanic White individuals [87.3%]; mean [SD] age at diagnosis, 41.7 [10.4] years), fingolimod (n = 267; 190 women [71.2%]; 222 non-Hispanic White individuals [83.1%]; mean [SD] age at diagnosis, 37.9 [9.9] years), natalizumab (n = 204; 160 women [78.4%]; 172 non-Hispanic White individuals [84.3%]; mean [SD] age at diagnosis, 37.2 [10.6] years), and rituximab (n = 115; 83 women [72.2%]; 99 non-Hispanic White individuals [86.1%]; mean [SD] age at diagnosis, 44.1 [11.1] years). 
No significant differences were found in the relapse outcomes between dimethyl fumarate and fingolimod after correcting for confounding biases and multiple testing (difference in 1-year relapse rate, 0.028 [95% CI, -0.031 to 0.084]; difference in 2-year relapse rate, 0.071 [95% CI, 0.008-0.128]; relative risk of 2-year non-relapse, 0.957 [95% CI, 0.884-1.035] with dimethyl fumarate as reference). When compared with rituximab, natalizumab was associated with a higher relapse rate for all 3 outcomes after bias correction and multiple testing (difference in 1-year relapse rate, 0.080 [95% CI, 0.013-0.137]; difference in 2-year relapse rate, 0.132 [95% CI, 0.043-0.189]; relative risk of 2-year non-relapse, 0.903 [95% CI, 0.822-0.944]). Confounders were identified from EHR data not recorded in the registry data through data-driven feature selection. Conclusions and Relevance: This study reports real-world evidence of equivalent relapse outcomes between dimethyl fumarate and fingolimod and relapse reduction in favor of rituximab relative to natalizumab. This approach illustrates the value of incorporating EHR data as high-dimensional covariates in real-world treatment comparison.


Subject(s)
Dimethyl Fumarate/therapeutic use , Fingolimod Hydrochloride/therapeutic use , Multiple Sclerosis, Relapsing-Remitting/prevention & control , Multiple Sclerosis/drug therapy , Natalizumab/therapeutic use , Rituximab/therapeutic use , Adult , Female , Humans , Immunosuppressive Agents/therapeutic use , Male , Middle Aged