Results 1 - 20 of 41
1.
NPJ Digit Med ; 7(1): 171, 2024 Jun 27.
Article in English | MEDLINE | ID: mdl-38937550

ABSTRACT

Foundation models are transforming artificial intelligence (AI) in healthcare by providing modular components adaptable to various downstream tasks, making AI development more scalable and cost-effective. Foundation models for structured electronic health records (EHR), trained on coded medical records from millions of patients, have demonstrated benefits including improved performance with fewer training labels and improved robustness to distribution shifts. However, questions remain about the feasibility of sharing these models across hospitals and about their performance on local tasks. This multi-center study examined the adaptability of a publicly accessible structured EHR foundation model (FMSM), trained on 2.57 M patient records from Stanford Medicine. Experiments used EHR data from The Hospital for Sick Children (SickKids) and the Medical Information Mart for Intensive Care (MIMIC-IV). We assessed adaptability via continued pretraining on local data, and task adaptability relative to baselines of models trained locally from scratch, including a local foundation model. Evaluations on 8 clinical prediction tasks showed that adapting the off-the-shelf FMSM matched the performance of gradient boosting machines (GBM) locally trained on all data while providing a 13% improvement in settings with few task-specific training labels. With continued pretraining on local data, FMSM required fewer than 1% of training examples to match the fully trained GBM's performance, and was 60 to 90% more sample-efficient than training local foundation models from scratch. Our findings demonstrate that adapting EHR foundation models across hospitals provides improved prediction performance at lower cost, underscoring the utility of base foundation models as modular components that streamline the development of healthcare AI.
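
A minimal sketch of the label-efficiency comparison described above, assuming synthetic stand-ins: `reps` plays the role of foundation-model patient representations (probed with logistic regression) and `counts` the role of code-count features for a locally trained GBM; none of this is the study's actual pipeline.

```python
# Hedged sketch (not the paper's code): label efficiency of a linear probe
# on pretrained representations vs. a GBM trained from scratch on counts.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
reps = rng.normal(size=(n, 128))          # stand-in for foundation-model representations
counts = rng.poisson(1.0, size=(n, 300))  # stand-in for code-count features
y = (reps[:, 0] + rng.normal(size=n) > 0).astype(int)  # synthetic outcome

idx_train, idx_test = train_test_split(np.arange(n), test_size=0.3, random_state=0)
for n_labels in (100, 500, len(idx_train)):           # few-label to full-label settings
    sub = idx_train[:n_labels]
    probe = LogisticRegression(max_iter=1000).fit(reps[sub], y[sub])
    gbm = GradientBoostingClassifier().fit(counts[sub], y[sub])
    print(n_labels,
          round(roc_auc_score(y[idx_test], probe.predict_proba(reps[idx_test])[:, 1]), 3),
          round(roc_auc_score(y[idx_test], gbm.predict_proba(counts[idx_test])[:, 1]), 3))
```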

2.
BMC Med Inform Decis Mak ; 24(1): 51, 2024 Feb 14.
Article in English | MEDLINE | ID: mdl-38355486

ABSTRACT

BACKGROUND: Diagnostic codes are commonly used as inputs for clinical prediction models, to create labels for prediction tasks, and to identify cohorts for multicenter network studies. However, the coverage rates of diagnostic codes and their variability across institutions are underexplored. The primary objective was to describe lab- and diagnosis-based labels for 7 selected outcomes at three institutions. Secondary objectives were to describe the agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels. METHODS: This study included three cohorts: SickKids from The Hospital for Sick Children, and StanfordPeds and StanfordAdults from Stanford Medicine. We included seven clinical outcomes with lab-based definitions: acute kidney injury, hyperkalemia, hypoglycemia, hyponatremia, anemia, neutropenia, and thrombocytopenia. For each outcome, we created four lab-based labels (abnormal, mild, moderate, and severe) based on the test result, and one diagnosis-based label. The proportion of admissions with a positive label was reported for each outcome, stratified by cohort. Using lab-based labels as the gold standard, we calculated agreement (Cohen's kappa), sensitivity, and specificity for each lab-based severity level. RESULTS: The numbers of admissions included were: SickKids (n = 59,298), StanfordPeds (n = 24,639) and StanfordAdults (n = 159,985). The proportion of admissions with a positive diagnosis-based label was significantly higher for StanfordPeds compared to SickKids across all outcomes, with the odds ratio (99.9% confidence interval) for an abnormal diagnosis-based label ranging from 2.2 (1.7-2.7) for neutropenia to 18.4 (10.1-33.4) for hyperkalemia. Lab-based labels were more consistent across institutions. When using lab-based labels as the gold standard, Cohen's kappa and sensitivity were lower at SickKids for all severity levels compared to StanfordPeds. CONCLUSIONS: Across multiple outcomes, diagnosis codes were consistently different between the two pediatric institutions. This difference was not explained by differences in test results. These results may have implications for machine learning model development and deployment.
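
A small sketch of the label-comparison metrics named in the abstract (Cohen's kappa, sensitivity, specificity), computed on synthetic labels; the lab-based label is treated as the gold standard.

```python
# Hedged sketch: agreement of a diagnosis-based label against a lab-based
# gold standard. Labels are synthetic, not the study's data.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

rng = np.random.default_rng(1)
lab = rng.integers(0, 2, 10_000)                        # lab-based label (gold standard)
dx = np.where(rng.random(10_000) < 0.8, lab, 1 - lab)   # noisy diagnosis-based label

tn, fp, fn, tp = confusion_matrix(lab, dx).ravel()
print("kappa:", round(cohen_kappa_score(lab, dx), 3))
print("sensitivity:", round(tp / (tp + fn), 3))
print("specificity:", round(tn / (tn + fp), 3))
```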


Subject(s)
Hyperkalemia , Neutropenia , Humans , Delivery of Health Care , Machine Learning , Sensitivity and Specificity
3.
Am J Epidemiol ; 2024 Feb 22.
Article in English | MEDLINE | ID: mdl-38400644

ABSTRACT

In 2008, Oregon expanded its Medicaid program using a lottery, creating a rare opportunity to study the effects of Medicaid coverage under a randomized controlled design (the Oregon Health Insurance Experiment). Analysis showed that Medicaid coverage lowered the risk of depression. However, this effect may vary between individuals, and identifying the individuals likely to benefit most has the potential to improve the effectiveness and efficiency of the Medicaid program. By applying a causal forest, a machine learning method for estimating heterogeneous treatment effects, to data from this experiment, we found substantial heterogeneity in the effect of Medicaid coverage on depression; individuals with high predicted benefit were older and had more physical or mental health conditions at baseline. Expanding coverage to individuals with high predicted benefit generated a greater reduction in depression prevalence than expanding to all eligible individuals (21.5 vs. 8.8 percentage point reduction; adjusted difference [95% CI], +12.7 [+4.6, +20.8]; P = 0.003), at substantially lower cost per case prevented ($16,627 vs. $36,048; adjusted difference [95% CI], -$18,598 [-$156,953, -$3,120]; P = 0.04). Medicaid coverage reduces depression substantially more in a subset of the population than in others, in ways that are predictable in advance. Targeting coverage to those most likely to benefit could improve the effectiveness and efficiency of insurance expansion.
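
To illustrate the analysis pattern, a hedged sketch using econml's CausalForestDML as a stand-in for the paper's causal forest estimator; covariates, treatment, and outcome are synthetic, and the high-benefit subgroup is defined by the most negative estimated effects on depression.

```python
# Hedged sketch: heterogeneous treatment effects with a causal forest.
# econml's CausalForestDML is an assumed stand-in, not the paper's code.
import numpy as np
from econml.dml import CausalForestDML

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 5))            # baseline covariates (age, conditions, ...)
T = rng.integers(0, 2, n)              # lottery-randomized coverage
# synthetic outcome: coverage reduces depression more when X[:, 0] is high
Y = 0.5 * X[:, 0] - T * (0.2 + 0.3 * (X[:, 0] > 0)) + rng.normal(size=n)

est = CausalForestDML(discrete_treatment=True, random_state=0)
est.fit(Y, T, X=X)
cate = est.effect(X)                                 # individualized effect estimates
high_benefit = cate < np.quantile(cate, 0.25)        # most negative = largest reduction
print("mean effect, high-benefit subgroup:", cate[high_benefit].mean().round(3))
print("mean effect, everyone:", cate.mean().round(3))
```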

4.
Pac Symp Biocomput ; 29: 8-23, 2024.
Article in English | MEDLINE | ID: mdl-38160266

ABSTRACT

The rapidly expanding volume of published medical literature makes it challenging for clinicians and researchers to keep up with, and summarize, recent relevant findings in a timely manner. While several closed-source summarization tools based on large language models (LLMs) now exist, rigorous and systematic evaluations of their outputs are lacking. Furthermore, there is a paucity of high-quality datasets and appropriate benchmark tasks with which to evaluate these tools. We address these issues with four contributions: we release Clinfo.ai, an open-source WebApp that answers clinical questions based on dynamically retrieved scientific literature; we specify an information retrieval and abstractive summarization task to evaluate the performance of such retrieval-augmented LLM systems; we release a dataset of 200 questions and corresponding answers derived from published systematic reviews, which we name PubMed Retrieval and Synthesis (PubMedRS-200); and we report benchmark results for Clinfo.ai and other publicly available OpenQA systems on PubMedRS-200.
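
The retrieve-then-summarize loop behind a system like Clinfo.ai can be sketched as below. The NCBI E-utilities endpoints are real public APIs, but the pipeline is a simplified assumption and the LLM synthesis step is left as a placeholder.

```python
# Hedged sketch of the retrieve-then-summarize pattern: search PubMed via
# NCBI E-utilities, fetch abstracts, then pass them to an LLM (omitted).
import requests

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(question: str, n: int = 5) -> list[str]:
    r = requests.get(f"{BASE}/esearch.fcgi",
                     params={"db": "pubmed", "term": question,
                             "retmax": n, "retmode": "json"})
    return r.json()["esearchresult"]["idlist"]

def fetch_abstracts(pmids: list[str]) -> str:
    r = requests.get(f"{BASE}/efetch.fcgi",
                     params={"db": "pubmed", "id": ",".join(pmids),
                             "rettype": "abstract", "retmode": "text"})
    return r.text

pmids = search_pubmed("anticoagulation atrial fibrillation stroke prevention")
context = fetch_abstracts(pmids)
# summarize(question, context) would call an LLM here; omitted in this sketch.
print(context[:500])
```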


Subject(s)
Computational Biology , Natural Language Processing , Humans , PubMed , Information Storage and Retrieval , Language
5.
J Am Med Inform Assoc ; 30(12): 2004-2011, 2023 11 17.
Article in English | MEDLINE | ID: mdl-37639620

ABSTRACT

OBJECTIVE: Development of electronic health records (EHR)-based machine learning models for pediatric inpatients is challenged by limited training data. Self-supervised learning using adult data may be a promising approach to creating robust pediatric prediction models. The primary objective was to determine whether a self-supervised model trained in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients, for pediatric inpatient clinical prediction tasks. MATERIALS AND METHODS: This retrospective cohort study used EHR data and included patients with at least one admission to an inpatient unit. One admission per patient was randomly selected. Adult inpatients were 18 years or older, while pediatric inpatients were older than 28 days and younger than 18 years. Admissions were temporally split into training (January 1, 2008 to December 31, 2019), validation (January 1, 2020 to December 31, 2020), and test (January 1, 2021 to August 1, 2022) sets. The primary comparison was a self-supervised model trained in adult inpatients versus count-based logistic regression models trained in pediatric inpatients. The primary outcome was the mean area under the receiver operating characteristic curve (AUROC) across 11 distinct clinical outcomes. Models were evaluated in pediatric inpatients. RESULTS: When evaluated in pediatric inpatients, the mean AUROC of the self-supervised model trained in adult inpatients (0.902) was noninferior to that of count-based logistic regression models trained in pediatric inpatients (0.868) (mean difference = 0.034, 95% CI = 0.014-0.057; P < .001 for noninferiority and P = .006 for superiority). CONCLUSIONS: Self-supervised learning in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients. This finding suggests transferability of self-supervised models trained in adult patients to pediatric patients, without requiring costly model retraining.
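
A sketch of one way to run the noninferiority comparison: a bootstrap confidence interval for the AUROC difference checked against an assumed margin. Scores and labels are synthetic, and the margin is illustrative, not the study's.

```python
# Hedged sketch: bootstrap noninferiority check for an AUROC difference
# (adult-pretrained model vs. pediatric logistic regression).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 3000)                                # pediatric test labels
p_ssl = np.clip(y * 0.6 + rng.random(3000) * 0.5, 0, 1)     # self-supervised scores
p_lr = np.clip(y * 0.5 + rng.random(3000) * 0.5, 0, 1)      # count-based LR scores

MARGIN = -0.05                                              # assumed noninferiority margin
diffs = []
for _ in range(2000):
    idx = rng.integers(0, len(y), len(y))
    if len(np.unique(y[idx])) < 2:
        continue                                            # skip degenerate resamples
    diffs.append(roc_auc_score(y[idx], p_ssl[idx]) - roc_auc_score(y[idx], p_lr[idx]))
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"AUROC diff 95% CI: ({lo:.3f}, {hi:.3f}); noninferior: {lo > MARGIN}")
```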


Subject(s)
Inpatients , Machine Learning , Humans , Adult , Child , Retrospective Studies , Supervised Machine Learning , Electronic Health Records
6.
NPJ Digit Med ; 6(1): 135, 2023 Jul 29.
Article in English | MEDLINE | ID: mdl-37516790

ABSTRACT

The success of foundation models such as ChatGPT and AlphaFold has spurred significant interest in building similar models for electronic medical records (EMRs) to improve patient care and hospital operations. However, recent hype has obscured critical gaps in our understanding of these models' capabilities. In this narrative review, we examine 84 foundation models trained on non-imaging EMR data (i.e., clinical text and/or structured data) and create a taxonomy delineating their architectures, training data, and potential use cases. We find that most models are trained on small, narrowly scoped clinical datasets (e.g., MIMIC-III) or broad, public biomedical corpora (e.g., PubMed) and are evaluated on tasks that do not provide meaningful insight into their usefulness to health systems. Considering these findings, we propose an improved evaluation framework for measuring the benefits of clinical foundation models that is more closely grounded in metrics that matter in healthcare.

7.
Sci Rep ; 13(1): 3767, 2023 03 07.
Article in English | MEDLINE | ID: mdl-36882576

ABSTRACT

Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquiring informative global patterns that can improve the robustness of task-specific models. The objective was to evaluate the utility of EHR foundation models in improving the in-distribution (ID) and out-of-distribution (OOD) performance of clinical prediction models. Transformer- and gated recurrent unit (GRU)-based foundation models were pretrained on EHR of up to 1.8 M patients (382 M coded events) collected within predetermined year groups (e.g., 2009-2012) and were subsequently used to construct patient representations for patients admitted to inpatient units. These representations were used to train logistic regression models to predict hospital mortality, long length of stay, 30-day readmission, and ICU admission. We compared our EHR foundation models with baseline logistic regression models learned on count-based representations (count-LR) in ID and OOD year groups. Performance was measured using the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve, and absolute calibration error. Both transformer- and GRU-based foundation models generally showed better ID and OOD discrimination relative to count-LR and often exhibited less decay on tasks with observable degradation of discrimination performance (average AUROC decay of 3% for the transformer-based foundation model vs. 7% for count-LR after 5-9 years). In addition, the performance and robustness of transformer-based foundation models continued to improve as pretraining set size increased. These results suggest that pretraining EHR foundation models at scale is a useful approach for developing clinical prediction models that perform well in the presence of temporal distribution shift.
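
A compact sketch of the ID/OOD evaluation protocol, assuming synthetic year groups in which the predictive signal drifts over time; the paper's representations and tasks are replaced by stand-ins.

```python
# Hedged sketch: train a logistic regression head on one year group, then
# track AUROC decay on later year groups with a drifting synthetic signal.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

def year_group(drift: float, n: int = 3000):
    X = rng.normal(size=(n, 64))
    # the predictive signal rotates from feature 0 toward feature 1 over time
    signal = (1 - drift) * X[:, 0] + drift * X[:, 1]
    y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

X_id, y_id = year_group(0.0)                         # e.g., the 2009-2012 group
clf = LogisticRegression(max_iter=1000).fit(X_id, y_id)

baseline = None
for name, drift in [("ID", 0.0), ("OOD +4y", 0.3), ("OOD +8y", 0.6)]:
    X, y = year_group(drift)
    auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
    baseline = baseline if baseline is not None else auc
    print(f"{name}: AUROC={auc:.3f} decay={100 * (baseline - auc) / baseline:.1f}%")
```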


Subject(s)
Electric Power Supplies , Electronic Health Records , Humans , Hospital Mortality , Hospitalization
8.
Methods Inf Med ; 62(1-02): 60-70, 2023 05.
Article in English | MEDLINE | ID: mdl-36812932

ABSTRACT

BACKGROUND: Temporal dataset shift can cause degradation in model performance as discrepancies between training and deployment data grow over time. The primary objective was to determine whether parsimonious models produced by specific feature selection methods are more robust to temporal dataset shift, as measured by out-of-distribution (OOD) performance, while maintaining in-distribution (ID) performance. METHODS: Our dataset consisted of intensive care unit patients from MIMIC-IV categorized by year groups (2008-2010, 2011-2013, 2014-2016, and 2017-2019). We trained baseline models using L2-regularized logistic regression on 2008-2010 to predict in-hospital mortality, long length of stay (LOS), sepsis, and invasive ventilation in all year groups. We evaluated three feature selection methods: L1-regularized logistic regression (L1), Remove and Retrain (ROAR), and causal feature selection. We assessed whether a feature selection method could maintain ID performance (2008-2010) and improve OOD performance (2017-2019). We also assessed whether parsimonious models retrained on OOD data performed as well as oracle models trained on all features in the OOD year group. RESULTS: The baseline model showed significantly worse OOD performance on the long LOS and sepsis tasks compared with its ID performance. L1 and ROAR retained 3.7 to 12.6% of all features, whereas causal feature selection generally retained fewer features. Models produced by L1 and ROAR exhibited ID and OOD performance similar to the baseline models. Retraining these models on 2017-2019 data, using features selected from training on 2008-2010 data, generally reached parity with oracle models trained directly on 2017-2019 data using all available features. Causal feature selection led to heterogeneous results: the superset maintained ID performance while improving OOD calibration only on the long LOS task. CONCLUSIONS: While model retraining can mitigate the impact of temporal dataset shift on parsimonious models produced by L1 and ROAR, new methods are required to proactively improve temporal robustness.
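
A sketch of the L1 route, assuming synthetic cohorts: select features with L1-regularized logistic regression on the early year group, then retrain the parsimonious model on the later group.

```python
# Hedged sketch: parsimonious model via L1-based feature selection on the
# early year group, then retraining on a later group. Data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)

def cohort(shift: float, n: int = 4000):
    X = rng.normal(loc=shift, size=(n, 200))
    y = (X[:, :5].sum(axis=1) - 5 * shift + rng.normal(size=n) > 0).astype(int)
    return X, y

X_old, y_old = cohort(0.0)          # stand-in for 2008-2010
X_new, y_new = cohort(0.3)          # stand-in for 2017-2019

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_old, y_old)
kept = l1.coef_[0] != 0             # features retained by L1
print(f"retained {kept.sum()} of {kept.size} features")

# retrain the parsimonious model on the OOD year group's training split
refit = LogisticRegression(max_iter=1000).fit(X_new[:3000][:, kept], y_new[:3000])
auc = roc_auc_score(y_new[3000:], refit.predict_proba(X_new[3000:][:, kept])[:, 1])
print("OOD AUROC after retraining:", round(auc, 3))
```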


Subject(s)
Clinical Medicine , Sepsis , Female , Pregnancy , Humans , Hospital Mortality , Length of Stay , Machine Learning
9.
JMIR Med Inform ; 10(11): e40039, 2022 Nov 17.
Article in English | MEDLINE | ID: mdl-36394938

ABSTRACT

BACKGROUND: Given the costs of machine learning implementation, a systematic approach to prioritizing which models to implement into clinical practice may be valuable. OBJECTIVE: The primary objective was to determine the health care attributes respondents at 2 pediatric institutions rate as important when prioritizing machine learning model implementation. The secondary objective was to describe their perspectives on implementation using a qualitative approach. METHODS: In this mixed methods study, we distributed a survey to health system leaders, physicians, and data scientists at 2 pediatric institutions. We asked respondents to rank the following 5 attributes by usefulness for implementation: the clinical problem is common; the clinical problem causes substantial morbidity and mortality; risk stratification leads to different actions that could reasonably improve patient outcomes; the model reduces physician workload; and the model saves money. Important attributes were those ranked first or second most important. Individual qualitative interviews were conducted with a subsample of respondents. RESULTS: Among 613 eligible respondents, 275 (44.9%) responded. Qualitative interviews were conducted with 17 respondents. The most commonly selected important attributes were risk stratification leading to different actions (205/275, 74.5%) and the clinical problem causing substantial morbidity or mortality (177/275, 64.4%). The attributes considered least important were reducing physician workload and saving money. Qualitative interviews consistently prioritized implementations that improved patient outcomes. CONCLUSIONS: Respondents prioritized machine learning model implementation where risk stratification would lead to different actions and where the clinical problem caused substantial morbidity and mortality. Implementations that improved patient outcomes were prioritized. These results can help provide a framework for machine learning model implementation.

10.
Front Digit Health ; 4: 943768, 2022.
Article in English | MEDLINE | ID: mdl-36339512

ABSTRACT

Multiple reporting guidelines for artificial intelligence (AI) models in healthcare recommend that models be audited for reliability and fairness. However, there is little operational guidance for performing reliability and fairness audits in practice. Following guideline recommendations, we conducted a reliability audit of two models based on model performance and calibration, as well as a fairness audit based on summary statistics, subgroup performance, and subgroup calibration. We assessed the Epic End-of-Life (EOL) Index model and an internally developed Stanford Hospital Medicine (HM) Advance Care Planning (ACP) model in 3 practice settings: Primary Care, Inpatient Oncology, and Hospital Medicine, using clinicians' answers to the surprise question ("Would you be surprised if [patient X] passed away in [Y years]?") as a surrogate outcome. For performance, the models had a positive predictive value (PPV) at or above 0.76 in all settings. In Hospital Medicine and Inpatient Oncology, the Stanford HM ACP model had higher sensitivity (0.69 and 0.89, respectively) than the EOL model (0.20 and 0.27), and better calibration (O/E 1.5 and 1.7) than the EOL model (O/E 2.5 and 3.0). The Epic EOL model flagged fewer patients (11% and 21%, respectively) than the Stanford HM ACP model (38% and 75%). There were no differences in performance and calibration by sex. Both models had lower sensitivity in Hispanic/Latino male patients with Race listed as "Other." Ten clinicians were surveyed after a presentation summarizing the audit. All 10 reported that summary statistics, overall performance, and subgroup performance would affect their decision to use the model to guide care; 9 of 10 said the same for overall and subgroup calibration. The most commonly identified barriers to routinely conducting such reliability and fairness audits were poor demographic data quality and lack of data access. This audit required 115 person-hours across 8 to 10 months. Our recommendations for performing reliability and fairness audits include verifying data validity, analyzing model performance on intersectional subgroups, and collecting the clinician-patient linkages needed for label generation by clinicians. Those responsible for AI models should require such audits before model deployment and mediate between model auditors and impacted stakeholders.
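
The subgroup portion of such an audit reduces to computing the same metrics per group. A hedged sketch with synthetic predictions, showing PPV, sensitivity, and an observed/expected (O/E) calibration ratio per subgroup:

```python
# Hedged sketch of per-subgroup audit metrics; predictions and outcomes
# are synthetic, not the audited models' outputs.
import numpy as np

rng = np.random.default_rng(6)
n = 4000
group = rng.choice(["A", "B"], n)                  # demographic subgroup
risk = rng.random(n)                               # model's predicted probability
y = (rng.random(n) < risk * 0.8).astype(int)       # surrogate outcome
flag = risk > 0.5                                  # model's binary flag

for g in ("A", "B"):
    m = group == g
    tp = (flag & (y == 1) & m).sum()
    ppv = tp / (flag & m).sum()
    sens = tp / ((y == 1) & m).sum()
    oe = y[m].sum() / risk[m].sum()                # observed / expected events
    print(f"group {g}: PPV={ppv:.2f} sensitivity={sens:.2f} O/E={oe:.2f}")
```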

11.
Appl Clin Inform ; 13(2): 431-438, 2022 03.
Article in English | MEDLINE | ID: mdl-35508197

ABSTRACT

OBJECTIVE: The purpose of this study was to evaluate the ability of three metrics to monitor for a reduction in performance of a chronic kidney disease (CKD) model deployed at a pediatric hospital. METHODS: The CKD risk model estimates a patient's risk of developing CKD 3 to 12 months following an inpatient admission. The model was developed on a retrospective dataset of 4,879 admissions from 2014 to 2018, then run silently on 1,270 admissions from April to October 2019. Three metrics were used to monitor its performance during the silent phase: (1) standardized mean differences (SMDs); (2) performance of a "membership model"; and (3) response distribution analysis. Observed patient outcomes for the 1,270 admissions were used to calculate prospective model performance and the ability of the three metrics to detect performance changes. RESULTS: The deployed model had an area under the receiver operating characteristic curve (AUROC) of 0.63 in the prospective evaluation, a significant decrease from an AUROC of 0.76 on retrospective data (p = 0.033). Among the three metrics, SMDs were significantly different for 66 of 75 (88%) of the model's input variables (p < 0.05) between retrospective and deployment data. The membership model was able to discriminate between the two settings (AUROC = 0.71, p < 0.0001), and the response distributions were significantly different (p < 0.0001) between the two settings. CONCLUSION: This study suggests that the three metrics examined could provide early indication of deterioration in deployed models' performance.
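
All three monitoring metrics can be sketched in a few lines, assuming synthetic development and deployment matrices: (1) per-variable SMDs, (2) a membership model whose AUROC measures how distinguishable the two settings are, and (3) a distribution test on the risk model's outputs (a stand-in scoring function here).

```python
# Hedged sketch of the three monitoring metrics on synthetic data.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X_dev = rng.normal(0.0, 1.0, size=(4000, 20))      # retrospective inputs
X_dep = rng.normal(0.3, 1.2, size=(1200, 20))      # deployment inputs (shifted)

# (1) SMD per input variable
smd = (X_dev.mean(0) - X_dep.mean(0)) / np.sqrt((X_dev.var(0) + X_dep.var(0)) / 2)
print("variables with |SMD| > 0.1:", int((np.abs(smd) > 0.1).sum()))

# (2) membership model: AUROC well above 0.5 means the settings differ
X = np.vstack([X_dev, X_dep])
z = np.r_[np.zeros(len(X_dev)), np.ones(len(X_dep))]
Xtr, Xte, ztr, zte = train_test_split(X, z, test_size=0.3, random_state=0)
mm = LogisticRegression(max_iter=1000).fit(Xtr, ztr)
print("membership AUROC:", round(roc_auc_score(zte, mm.predict_proba(Xte)[:, 1]), 3))

# (3) response distribution: compare risk scores across the two settings
def score(X):                                      # stand-in risk model
    return 1 / (1 + np.exp(-X[:, 0]))
print("KS test p-value:", ks_2samp(score(X_dev), score(X_dep)).pvalue)
```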


Subject(s)
Computer Simulation , Machine Learning , Renal Insufficiency, Chronic/physiopathology , Benchmarking , Child , Female , Hospitalization , Humans , Male , Models, Biological , Prospective Studies , ROC Curve , Renal Insufficiency, Chronic/diagnosis , Retrospective Studies , Risk Factors
12.
Article in English | MEDLINE | ID: mdl-35272095

ABSTRACT

BACKGROUND: Few studies to date have characterized functional connectivity (FC) within emotion and reward networks in relation to family dynamics in youth at high familial risk for bipolar disorder (HR-BD) and major depressive disorder (HR-MDD) relative to low-risk youth (LR). Such characterization may advance our understanding of the neural underpinnings of mood disorders and lead to more effective interventions. METHODS: A total of 139 youth (43 HR-BD, 46 HR-MDD, and 50 LR) aged 12.9 ± 2.7 years were longitudinally followed for 4.5 ± 2.4 years. We characterized differences in striatolimbic FC that distinguished between HR-BD, HR-MDD, and LR and between resilience and conversion to psychopathology. We then examined whether risk status moderated FC-family dynamic associations. Finally, we examined whether baseline between-group FC differences predicted resilience versus conversion to psychopathology. RESULTS: HR-BD had greater amygdala-middle frontal gyrus and dorsal striatum-middle frontal gyrus FC relative to HR-MDD and LR, and HR-MDD had lower amygdala-fusiform gyrus and dorsal striatum-precentral gyrus FC relative to HR-BD and LR (voxel-level p < .001, cluster-level false discovery rate-corrected p < .05). Resilient youth had greater amygdala-orbitofrontal cortex and ventral striatum-dorsal anterior cingulate cortex FC relative to youth with conversion to psychopathology (voxel-level p < .001, cluster-level false discovery rate-corrected p < .05). Greater family rigidity was inversely associated with amygdala-fusiform gyrus FC across all groups (false discovery rate-corrected p = .017), with a moderating effect of bipolar risk status (HR-BD vs. HR-MDD p < .001; HR-BD vs. LR p = .005). Baseline FC differences did not predict resilience versus conversion to psychopathology. CONCLUSIONS: Findings represent neural signatures of risk and resilience in emotion and reward processing networks in youth at familial risk for mood disorders that may be targets for novel interventions tailored to the family context.
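
For readers unfamiliar with seed-based FC, a minimal sketch of the core computation, using synthetic time series rather than preprocessed fMRI: correlate a seed region with a target region per subject, Fisher z-transform, and compare groups.

```python
# Hedged sketch of seed-based functional connectivity on synthetic data.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(8)

def subject_fc(coupling: float, t: int = 200) -> float:
    seed = rng.normal(size=t)                          # e.g., amygdala time series
    target = coupling * seed + rng.normal(size=t)      # e.g., middle frontal gyrus
    r = np.corrcoef(seed, target)[0, 1]
    return np.arctanh(r)                               # Fisher z-transform

fc_hr_bd = [subject_fc(0.5) for _ in range(43)]        # synthetic HR-BD group
fc_lr = [subject_fc(0.2) for _ in range(50)]           # synthetic LR group
print(ttest_ind(fc_hr_bd, fc_lr))
```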


Subject(s)
Depressive Disorder, Major , Mood Disorders , Adolescent , Family Relations , Genetic Predisposition to Disease , Humans , Magnetic Resonance Imaging
13.
J Pers Med ; 12(2)2022 Feb 06.
Article in English | MEDLINE | ID: mdl-35207712

ABSTRACT

The diagnostic categories in psychiatry often encompass heterogeneous symptom profiles associated with differences in underlying etiology, pathogenesis, and prognosis. Prior work demonstrated that some of this heterogeneity can be quantified through dimensional analysis of the Depression Anxiety Stress Scale (DASS), yielding unique transdiagnostic symptom subtypes. This study investigated whether classifying patients according to these symptom profiles would have prognostic value for the treatment response to therapeutic transcranial magnetic stimulation (TMS) in comorbid major depressive disorder (MDD) and posttraumatic stress disorder (PTSD). A linear discriminant model was constructed using a simulation dataset to classify 35 participants into one of the following six pre-defined symptom profiles: Normative Mood, Tension, Anxious Arousal, Generalized Anxiety, Anhedonia, and Melancholia. Clinical outcomes with TMS across MDD and PTSD were assessed. All six symptom profiles were present. After TMS, participants with Anxious Arousal were less likely to achieve MDD remission compared to other subtypes (Fisher's exact test [FET], odds ratio 0.16, p = 0.034), exhibited poorer PTSD symptom reduction (21% vs. 46%; t(33) = 2.025, p = 0.051), and were less likely to complete TMS (FET, odds ratio 0.066, p = 0.011). These results offer preliminary evidence that classifying individuals according to these transdiagnostic symptom profiles may offer a simple method to inform TMS treatment decisions.
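
A hedged sketch of the classification step, with synthetic stand-ins for the simulation dataset and DASS item scores; only the six profile names come from the abstract.

```python
# Hedged sketch: assign patients to predefined symptom profiles with a
# linear discriminant model. Training data and labels are simulated.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

PROFILES = ["Normative Mood", "Tension", "Anxious Arousal",
            "Generalized Anxiety", "Anhedonia", "Melancholia"]

rng = np.random.default_rng(9)
X_sim = rng.normal(size=(600, 21))                    # simulated DASS item scores
y_sim = rng.integers(0, len(PROFILES), 600)           # simulated profile labels

lda = LinearDiscriminantAnalysis().fit(X_sim, y_sim)
patients = rng.normal(size=(35, 21))                  # 35 participants' scores
assigned = [PROFILES[i] for i in lda.predict(patients)]
print(assigned[:5])
```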

14.
Jt Comm J Qual Patient Saf ; 48(3): 131-138, 2022 03.
Article in English | MEDLINE | ID: mdl-34866024

ABSTRACT

BACKGROUND: Hospital-acquired pressure injuries (HAPIs) cause patient harm and increase health care costs. We sought to evaluate how the performance of the Braden QD Scale changed in association with changes in HAPI incidence. METHODS: Using electronic health records data from a quaternary children's hospital, we evaluated the association between Braden QD scores and patient risk of HAPI. We analyzed how this relationship changed during a hospitalwide quality improvement initiative to reduce HAPI incidence. RESULTS: Of 23,532 unique patients, 108 (0.46%, 95% confidence interval [CI] = 0.38%-0.55%) experienced a HAPI. Every 1-point increase in the Braden QD score was associated with a 41% increase in the patient's odds of developing a HAPI (odds ratio [OR] = 1.41, 95% CI = 1.36-1.46, p < 0.001). HAPI incidence declined significantly following implementation of the HAPI-reduction initiative (β = -0.09, 95% CI = -0.11 to -0.07, p < 0.001), as did the Braden QD positive predictive value (β = -0.29, 95% CI = -0.44 to -0.14, p < 0.001) and specificity (β = -0.28, 95% CI = -0.43 to -0.14, p < 0.001), while sensitivity (β = 0.93, 95% CI = 0.30 to 1.75, p = 0.01) and the concordance statistic (β = 0.18, 95% CI = 0.15 to 0.21, p < 0.001) increased significantly. CONCLUSION: Decreases in HAPI incidence following a quality improvement initiative were associated with (1) significant deterioration in threshold-dependent performance measures such as specificity and precision and (2) significant improvements in threshold-independent performance measures such as the concordance statistic. The Braden QD Scale performs more stably as a tool that continuously measures risk than as a prediction tool.
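
A sketch distinguishing the two kinds of performance measures discussed here, on synthetic data: a per-point odds ratio from logistic regression, the threshold-independent concordance statistic (AUROC), and a threshold-dependent PPV at an assumed alert threshold.

```python
# Hedged sketch: per-point OR for a risk score, plus threshold-dependent
# (PPV) vs. threshold-independent (AUROC) metrics. Data are synthetic.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(10)
score = rng.integers(0, 26, 20_000).astype(float)     # stand-in Braden QD scores
p = 1 / (1 + np.exp(-(-8 + 0.34 * score)))            # rare-outcome risk curve
hapi = (rng.random(20_000) < p).astype(int)

fit = sm.Logit(hapi, sm.add_constant(score)).fit(disp=0)
print("OR per 1-point increase:", round(np.exp(fit.params[1]), 2))
print("AUROC (threshold-independent):", round(roc_auc_score(hapi, score), 3))
flag = score >= 15                                    # assumed alert threshold
print("PPV at threshold (threshold-dependent):", round(hapi[flag].mean(), 4))
```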


Subject(s)
Pressure Ulcer , Child , Humans , Incidence , Pressure Ulcer/epidemiology , Pressure Ulcer/prevention & control , Quality Improvement , Retrospective Studies , Risk Assessment , Risk Factors
15.
Biol Psychiatry ; 91(6): 561-571, 2022 03 15.
Article in English | MEDLINE | ID: mdl-34482948

ABSTRACT

BACKGROUND: Despite tremendous advances in characterizing the human neural circuits that govern the emotional and cognitive functions impaired in depression and anxiety, we lack a circuit-based taxonomy for depression and anxiety that captures transdiagnostic heterogeneity and informs clinical decision making. METHODS: We developed and tested a novel system for quantifying 6 brain circuits reproducibly and at the individual patient level. We implemented standardized circuit definitions relative to a healthy reference sample and algorithms to generate circuit clinical scores for the overall circuit and its constituent regions. RESULTS: In new data from primary and generalizability samples of depression and anxiety (N = 250), we demonstrated that overall disconnections within task-free salience and default mode circuits map onto symptoms of anxious avoidance, loss of pleasure, threat dysregulation, and negative emotional biases (core characteristics that transcend diagnoses), as well as poorer daily function. Regional dysfunctions within task-evoked cognitive control and affective circuits may implicate symptoms of cognitive and valence-congruent emotional functions. Circuit dysfunction scores also distinguished response to antidepressant and behavioral intervention treatments in an independent sample (n = 205). CONCLUSIONS: Our findings articulate circuit dimensions that relate to transdiagnostic symptoms across mood and anxiety disorders. Our novel system offers a foundation for deploying standardized circuit assessments across research groups, trials, and clinics to advance more precise classifications and treatment targets for psychiatry.
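
The "standardized circuit definitions relative to a healthy reference sample" step can be sketched as z-scoring against reference statistics; circuit names and values below are illustrative stand-ins, not the paper's definitions.

```python
# Hedged sketch: standardize an individual's circuit connectivity against
# a healthy reference sample to yield circuit clinical scores (z-scores).
import numpy as np

rng = np.random.default_rng(11)
CIRCUITS = ["default mode", "salience", "attention",
            "cognitive control", "negative affect", "positive affect"]

reference = rng.normal(size=(500, 6))           # healthy reference sample, 6 circuits
mu, sd = reference.mean(0), reference.std(0)

patient = rng.normal(loc=-0.8, size=6)          # one patient's circuit connectivity
z = (patient - mu) / sd                         # circuit dysfunction scores
for name, value in zip(CIRCUITS, z):
    print(f"{name}: z = {value:+.2f}")
```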


Subject(s)
Depression , Psychiatry , Anxiety , Anxiety Disorders , Humans
16.
Npj Ment Health Res ; 1(1): 19, 2022 Dec 02.
Article in English | MEDLINE | ID: mdl-38609510

ABSTRACT

Although individual psychotherapy is generally effective for a range of mental health conditions, little is known about the moment-to-moment language use of effective therapists. Increased access to computational power, coupled with a rise in computer-mediated communication (telehealth), makes feasible the large-scale analysis of language use during psychotherapy. Transparent methodological approaches are lacking, however. Here we present novel methods to increase the efficiency of efforts to examine language use in psychotherapy. We evaluate three important aspects of therapist language use (timing, responsiveness, and consistency) across five clinically relevant language domains: pronouns, time orientation, emotional polarity, therapist tactics, and paralinguistic style. We find that therapist language is dynamic within sessions, responds to patient language, and relates to patient symptom diagnosis but not symptom severity. Our results demonstrate that analyzing therapist language at scale is feasible and may help answer longstanding questions about the specific behaviors of effective therapists.
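
A toy sketch of the lexicon-counting building block such analyses rest on, with tiny illustrative word lists (not the study's lexicons) and naive whitespace tokenization:

```python
# Hedged sketch: per-utterance counts for simple language domains.
# Lexicons are illustrative toys; tokenization is deliberately naive.
LEXICONS = {
    "first_person": {"i", "me", "my", "we", "our"},
    "second_person": {"you", "your"},
    "past": {"was", "were", "did", "had"},
    "present": {"is", "are", "am", "do"},
}

def domain_counts(utterance: str) -> dict[str, int]:
    tokens = utterance.lower().split()       # ignores punctuation; fine for a sketch
    return {d: sum(t in lex for t in tokens) for d, lex in LEXICONS.items()}

session = [
    ("therapist", "How are you feeling about what we discussed"),
    ("patient", "I was anxious all week"),
    ("therapist", "You said your sleep did improve though"),
]
for speaker, text in session:
    print(speaker, domain_counts(text))
```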

17.
Appl Clin Inform ; 12(4): 808-815, 2021 08.
Article in English | MEDLINE | ID: mdl-34470057

ABSTRACT

OBJECTIVE: The change in performance of machine learning models over time as a result of temporal dataset shift is a barrier to machine learning-derived models facilitating decision-making in clinical practice. Our aim was to describe technical procedures used to preserve the performance of machine learning models in the presence of temporal dataset shifts. METHODS: Studies were included if they were fully published articles that used machine learning and implemented a procedure to mitigate the effects of temporal dataset shift in a clinical setting. We described how dataset shift was measured, the procedures used to preserve model performance, and their effects. RESULTS: Of 4,457 potentially relevant publications identified, 15 were included. The impact of temporal dataset shift was primarily quantified using changes, usually deterioration, in calibration or discrimination. Calibration deterioration was more common (n = 11) than discrimination deterioration (n = 3). Mitigation strategies were categorized as model level or feature level. Model-level approaches (n = 15) were more common than feature-level approaches (n = 2), with the most common approaches being model refitting (n = 12), probability calibration (n = 7), model updating (n = 6), and model selection (n = 6). In general, all mitigation strategies were successful at preserving calibration but not uniformly successful in preserving discrimination. CONCLUSION: There was limited research in preserving the performance of machine learning models in the presence of temporal dataset shift in clinical medicine. Future research could focus on the impact of dataset shift on clinical decision making, benchmark the mitigation strategies on a wider range of datasets and tasks, and identify optimal strategies for specific settings.
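
Among the mitigation strategies named, probability calibration is easy to sketch: keep the original model's scores and refit only the score-to-probability mapping (Platt scaling) on recent data. Everything below is synthetic.

```python
# Hedged sketch: recalibrate an old model's scores on recent data via
# Platt scaling (a logistic regression on the scores).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(12)
scores_recent = rng.random(2000)                      # old model's scores, new era
y_recent = (rng.random(2000) < scores_recent * 0.5).astype(int)  # drifted event rate

platt = LogisticRegression().fit(scores_recent.reshape(-1, 1), y_recent)
recalibrated = platt.predict_proba(scores_recent.reshape(-1, 1))[:, 1]
print("mean raw score:", scores_recent.mean().round(3),
      "| mean recalibrated:", recalibrated.mean().round(3),
      "| event rate:", y_recent.mean().round(3))
```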


Subject(s)
Clinical Medicine , Machine Learning , Clinical Decision-Making , Cognition
18.
Nat Commun ; 12(1): 2017, 2021 04 01.
Article in English | MEDLINE | ID: mdl-33795682

ABSTRACT

In the electronic health record, using clinical notes to identify entities such as disorders and their temporality (e.g. the order of an event relative to a time index) can inform many important analyses. However, creating training data for clinical entity tasks is time consuming and sharing labeled data is challenging due to privacy concerns. The information needs of the COVID-19 pandemic highlight the need for agile methods of training machine learning models for clinical notes. We present Trove, a framework for weakly supervised entity classification using medical ontologies and expert-generated rules. Our approach, unlike hand-labeled notes, is easy to share and modify, while offering performance comparable to learning from manually labeled training data. In this work, we validate our framework on six benchmark tasks and demonstrate Trove's ability to analyze the records of patients visiting the emergency department at Stanford Health Care for COVID-19 presenting symptoms and risk factors.
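
A stripped-down sketch of the weak-supervision pattern Trove builds on: noisy labeling functions vote on candidate spans and a label model aggregates the votes. Here the label model is a plain majority vote rather than Trove's, and the ontology fragment is a toy.

```python
# Hedged sketch of weak supervision: labeling functions vote, a simple
# majority-vote label model aggregates. Not Trove's actual implementation.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1
FEVER_TERMS = {"fever", "febrile", "pyrexia"}          # toy ontology fragment

def lf_ontology(span: str) -> int:
    return POSITIVE if span.lower() in FEVER_TERMS else ABSTAIN

def lf_negation(context: str) -> int:
    return NEGATIVE if "denies" in context.lower() else ABSTAIN

def majority_vote(votes: list[int]) -> int:
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)            # ties resolved arbitrarily

examples = [("fever", "Patient reports fever and cough."),
            ("fever", "Patient denies fever."),
            ("cough", "Patient reports fever and cough.")]
for span, context in examples:
    print(span, "->", majority_vote([lf_ontology(span), lf_negation(context)]))
```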


Subject(s)
COVID-19 , Data Curation/methods , Expert Systems , Machine Learning , Datasets as Topic , Electronic Health Records , Humans , Natural Language Processing , SARS-CoV-2
19.
Neuropsychopharmacology ; 46(4): 809-819, 2021 03.
Article in English | MEDLINE | ID: mdl-33230268

ABSTRACT

There is a critical need to better understand the neural basis of antidepressant medication (ADM) response with respect to both symptom alleviation and quality of life (QoL) in major depressive disorder (MDD). Reward neurocircuitry has been implicated in QoL, the neural basis of MDD, and the mechanisms of ADM response. Yet, we do not know whether change in reward neurocircuitry as a function of ADM is associated with change in symptoms and QoL. To address this gap in knowledge, we analyzed data from 128 patients with MDD who participated in the iSPOT-D trial and were assessed with functional neuroimaging pre- and post-ADM treatment (randomized to sertraline, venlafaxine-XR, or escitalopram). Fifty-eight matched healthy controls were scanned at the same time points. We quantified functional connectivity (FC) of reward neurocircuitry using nucleus accumbens (NAc) seed regions of interest, and then characterized how changes in FC relate to symptom response (primary outcome) and QoL response (secondary outcome). Symptom responders showed an increase in NAc-dorsal anterior cingulate cortex (ACC) FC relative to non-responders (p < 0.001) which was associated with improvement in physical QoL (p < 0.0003), and a decrease in NAc-inferior parietal lobule FC relative to controls (p < 0.001). QoL response was characterized by increases in FC between NAc-ventral ACC for environmental, NAc-thalamus for physical, and NAc-paracingulate gyrus for social domains (p < 0.001). Symptom responders to sertraline were distinguished by a decrease in NAc-insula FC (p < 0.001) and to venlafaxine-XR by an increase in NAc-inferior temporal gyrus FC (p < 0.005). Findings suggest that change in reward neurocircuitry may underlie differential ADM response profiles with respect to symptoms and QoL in depression.
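
A hedged sketch of the pre/post FC analysis pattern: compare the change in seed-based FC between responders and non-responders, and relate it to QoL change; all values are synthetic stand-ins for the paper's neuroimaging measures.

```python
# Hedged sketch: pre-to-post change in NAc-seeded FC vs. response and QoL.
import numpy as np
from scipy.stats import pearsonr, ttest_ind

rng = np.random.default_rng(13)
d_fc_resp = rng.normal(0.15, 0.2, 60)      # synthetic ΔFC (NAc-dACC), responders
d_fc_non = rng.normal(0.00, 0.2, 68)       # synthetic ΔFC, non-responders
print("responders vs non-responders:", ttest_ind(d_fc_resp, d_fc_non))

d_qol = 0.8 * d_fc_resp + rng.normal(0, 0.1, 60)   # synthetic QoL change
print("ΔFC vs ΔQoL (responders):", pearsonr(d_fc_resp, d_qol))
```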


Subject(s)
Depressive Disorder, Major , Quality of Life , Antidepressive Agents/therapeutic use , Citalopram/therapeutic use , Depressive Disorder, Major/drug therapy , Humans , Magnetic Resonance Imaging , Reward
20.
ArXiv ; 2020 Aug 05.
Article in English | MEDLINE | ID: mdl-32793768

ABSTRACT

In the electronic health record, using clinical notes to identify entities such as disorders and their temporality (e.g. the order of an event relative to a time index) can inform many important analyses. However, creating training data for clinical entity tasks is time consuming and sharing labeled data is challenging due to privacy concerns. The information needs of the COVID-19 pandemic highlight the need for agile methods of training machine learning models for clinical notes. We present Trove, a framework for weakly supervised entity classification using medical ontologies and expert-generated rules. Our approach, unlike hand-labeled notes, is easy to share and modify, while offering performance comparable to learning from manually labeled training data. In this work, we validate our framework on six benchmark tasks and demonstrate Trove's ability to analyze the records of patients visiting the emergency department at Stanford Health Care for COVID-19 presenting symptoms and risk factors.
