1.
J Clin Med ; 12(9)2023 May 05.
Article in English | MEDLINE | ID: mdl-37176726

ABSTRACT

This study aimed to develop and temporally validate an electronic medical record (EMR)-based insomnia prediction model. In this nested case-control study, we analyzed EMR data from 2011-2018 obtained from a statewide health information exchange. The study sample included 19,843 insomnia cases and 19,843 controls matched by age, sex, and race. Models using different machine learning (ML) techniques were trained to predict insomnia using demographics, diagnosis, and medication order data from two surveillance periods: -1 to -365 days and -180 to -365 days before the first documentation of insomnia. Separate models were also trained with patient data from three time periods (2011-2013, 2011-2015, and 2011-2017). After selecting the best model, predictive performance was evaluated on holdout patients as well as patients from subsequent years to assess the temporal validity of the models. An extreme gradient boosting (XGBoost) model outperformed all other classifiers. XGBoost models trained on 2011-2017 data from -1 to -365 and -180 to -365 days before index had AUCs of 0.80 (SD 0.005) and 0.70 (SD 0.006), respectively, on the holdout set. On patients with data from subsequent years, a drop of at most 4% in AUC was observed for all models, even when there was a five-year difference between the collection period of the training and the temporal validation data. The proposed EMR-based prediction models can be used to identify insomnia up to six months before clinical detection. These models may provide an inexpensive, scalable, and longitudinally viable method to screen for individuals at high risk of insomnia.
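
The two surveillance periods in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the window labels, and the (date, code) event layout are assumptions made for illustration; only the day offsets (1 to 365 and 180 to 365 days before the index date) come from the abstract.

```python
from datetime import date

# Surveillance windows from the abstract, expressed as days before the
# index date (first documentation of insomnia). Bounds are inclusive.
# The dictionary keys are hypothetical labels, not the authors' names.
WINDOWS = {"1-365": (1, 365), "180-365": (180, 365)}


def in_window(event_date, index_date, window):
    """Return True if the event falls inside the named surveillance window."""
    nearest, farthest = WINDOWS[window]
    days_before = (index_date - event_date).days
    return nearest <= days_before <= farthest


def select_events(events, index_date, window):
    """Keep (event_date, code) pairs recorded inside the window.

    `events` is an assumed layout: a list of (date, code) tuples for one patient.
    """
    return [e for e in events if in_window(e[0], index_date, window)]
```

Under these assumptions, an event recorded 31 days before the index date would feed only the 1 to 365 day model, while an event recorded 273 days before would feed both.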

2.
Heliyon ; 9(3): e14636, 2023 Mar.
Article in English | MEDLINE | ID: mdl-37020943

ABSTRACT

Background and objectives: Medical notes are narratives that describe the health of the patient in free text format. These notes can be more informative than structured data such as the history of medications or disease conditions. They are routinely collected and can be used to evaluate the patient's risk for developing chronic diseases such as dementia. This study investigates different methodologies for transforming routine care notes into dementia risk classifiers and evaluates the generalizability of these classifiers to new patients and new health care institutions. Methods: The notes collected over the relevant history of the patient are lengthy. In this study, TF-ICF is used to select keywords with the highest discriminative ability between at-risk dementia patients and healthy controls. The medical notes are then summarized in the form of occurrences of the selected keywords. Two different encodings of the summary are compared. The first encoding consists of the average of the vector embedding of each keyword occurrence as produced by the BERT or Clinical BERT pre-trained language models. The second encoding aggregates the keywords according to UMLS concepts and uses each concept as an exposure variable. For both encodings, misspellings of the selected keywords are also considered in an effort to improve the predictive performance of the classifiers. A neural network is developed over the first encoding and a gradient boosted trees model is applied to the second encoding. Patients from a single health care institution are used to develop all the classifiers, which are then evaluated on held-out patients from the same health care institution as well as test patients from two other health care institutions.
Results: The results indicate that it is possible to identify patients at risk for dementia one year ahead of the onset of the disease using medical notes with an AUC of 75% when a gradient boosted trees model is used in conjunction with exposure variables derived from UMLS concepts. However, this performance is not maintained with an embedded feature space and when the classifier is applied to patients from other health care institutions. Moreover, an analysis of the top predictors of the gradient boosted trees model indicates that different features inform the classification depending on whether or not spelling variants of the keywords are included. Conclusion: The present study demonstrates that medical notes can enable risk prediction models for complex chronic diseases such as dementia. However, additional research efforts are needed to improve the generalizability of these models. These efforts should take into consideration the length and localization of the medical notes; the availability of sufficient training data for each disease condition; and the variabilities resulting from different feature engineering techniques.

3.
Sleep Health ; 9(2): 128-135, 2023 04.
Article in English | MEDLINE | ID: mdl-36858835

ABSTRACT

OBJECTIVE: Examine the association between race and time to pharmacologic treatment of insomnia in a large multi-institutional cohort. METHODS: Retrospective analysis of electronic medical records from a regional health information exchange. Eligible patients included adults with at least one healthcare visit per year from 2010 to 2019, a new insomnia diagnosis code during the study period, and no prior insomnia diagnosis codes or medications. A Cox frailty model was used to examine the association between race and time to an insomnia medication after diagnosis. RESULTS: In total, 9557 patients were analyzed, 7773 (81.3%) of whom were White, 1294 (13.5%) Black, 238 (2.5%) Other, and 252 (2.6%) unknown race. About 6.2% of Black and 8% of Other race patients received an order for a Food and Drug Administration-approved insomnia medication after diagnosis compared with 13.5% of White patients. Black patients were significantly less likely to have an order for a Food and Drug Administration-approved insomnia medication at all time points (adjusted hazard ratio [aHR] range: 0.37-0.73), and patients reporting Other race were less likely to have received an order at 2 (aHR 0.51, 95% confidence interval [CI] 0.28-0.94), 3 (aHR 0.33, 95% CI 0.13-0.79), and 4 years (aHR 0.21, 95% CI 0.06-0.71) of follow-up. Similar results were observed in a sensitivity analysis including off-label medications. CONCLUSIONS: Patients belonging to racial minority groups are less likely to be prescribed an insomnia medication than White patients after accounting for sociodemographic and clinical factors. Further research is needed to determine the extent to which patient preferences and physician perceptions affect these prescribing patterns and investigate potential disparities in nonpharmacologic treatment.


Subject(s)
Healthcare Disparities , Hypnotics and Sedatives , Practice Patterns, Physicians' , Racial Groups , Sleep Initiation and Maintenance Disorders , Time-to-Treatment , Adult , Humans , Black People/statistics & numerical data , Minority Groups/statistics & numerical data , Racial Groups/statistics & numerical data , Retrospective Studies , Sleep Initiation and Maintenance Disorders/drug therapy , Sleep Initiation and Maintenance Disorders/epidemiology , Healthcare Disparities/ethnology , Healthcare Disparities/statistics & numerical data , Hypnotics and Sedatives/administration & dosage , Hypnotics and Sedatives/therapeutic use , Practice Patterns, Physicians'/statistics & numerical data , Time-to-Treatment/statistics & numerical data , White/statistics & numerical data , United States/epidemiology
4.
Sci Rep ; 13(1): 2185, 2023 02 07.
Article in English | MEDLINE | ID: mdl-36750631

ABSTRACT

Machine learning models can help improve health care services. However, they need to be practical to gain wide adoption. In this study, we investigate the practical utility of different data modalities and cohort segmentation strategies when designing models for emergency department (ED) and inpatient hospital (IH) visits. The data modalities include socio-demographics, diagnosis, and medications. Segmentation compares a cohort of insomnia patients to a cohort of general non-insomnia patients under varying age and disease severity criteria. Transfer testing between the two cohorts is introduced to demonstrate that an insomnia-specific model is not necessary when predicting future ED visits, but may have merit when predicting IH visits, especially for patients with an insomnia diagnosis. The results also indicate that using both diagnosis and medications as a source of data does not generally improve model performance and may increase its overhead. Based on these findings, the proposed evaluation methodologies are recommended to ascertain the utility of disease-specific models in addition to the traditional intra-cohort testing.


Subject(s)
Emergency Service, Hospital , Machine Learning , Humans , Critical Care , Retrospective Studies
5.
Data Brief ; 43: 108442, 2022 Aug.
Article in English | MEDLINE | ID: mdl-35859786

ABSTRACT

Topic modeling is an active research area with several unanswered questions. The focus of recent research in this area is on the use of a vector embedding representation of the input text with both generative and evolutionary topic modeling techniques. Unfortunately, it is hard to compare different techniques when the underlying data and preprocessing steps that were used to develop the models are not available. This paper presents two secondary datasets that can help address this gap. These datasets are derived from two primary datasets. The first consists of 8145 posts from the r/Cancer health forum and the second consists of 18,294 messages submitted to 20 different newsgroups. The same preprocessing procedure is applied to both datasets by removing punctuation, stop words, and high-frequency words. Each dataset is then clustered using three different topic modeling techniques (pPSO, ETM, and NVDM) and three topic numbers (10, 20, and 30). In addition, for pPSO two text embedding representations are considered: sBERT and Skipgram. The secondary datasets were originally developed in support of a comparative analysis of the aforementioned topic modeling techniques in a study titled "Comparing PSO-based Clustering over Contextual Vector Embeddings to Modern Topic Modeling" submitted to the Journal of Information Processing and Management. The present paper provides a detailed description of the two secondary datasets, including the unique identifier that can be used to retrieve the original documents, the pre-processing scripts, and the topic keywords generated by the three topic modeling techniques with varying topic numbers and embedding representations. As such, the datasets allow direct comparison with other topic modeling techniques. To further facilitate this process, the algorithm underlying the evolutionary topic modeling technique, pPSO, proposed by the authors is also provided.

6.
J Biomed Inform ; 125: 103976, 2022 01.
Article in English | MEDLINE | ID: mdl-34906737

ABSTRACT

Broader patient-reported experiences in oncology are largely unknown due to the lack of available information from traditional data sources. Online health community data provide an exploratory way to uncover these experiences at a large scale. Analyzing these data can guide further studies towards understanding patients' needs and experiences. However, analysis of online health data is inherently difficult due to the unstructured nature of these data and the variety of ways information can be expressed over text. Specifically, subscribers may not disclose critical information such as the age of the patient in their posts. In fact, the number of health forum posts that explicitly mention the age of the patient is significantly lower than the number of posts that do not include this information in the Reddit r/Cancer health forum under consideration in the present paper. Health-focused studies often need to consider or control for age as a confounder, hence the importance of having sufficient age data. This paper presents a methodology that can help classify health forum posts according to four age groups (0-17, 18-39, 40-64 and 65 + years) even when the posts do not contain explicit mention of the age of the patient. First, the subset of the posts that include explicit mention of the age of the patient is identified. Second, the explicit age clues are removed from these posts and used to train the proposed age classifier. The resulting classifier is able to infer the age of the patient using only implicit age clues with an average true positive rate (TPR) of 71%. This TPR is comparable to the average TPR of 69% obtained from human annotations for the same set of posts.
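
The two-step labeling procedure described in this abstract (find posts with an explicit age, then strip the age clue before training) might look roughly like the Python sketch below. The four age groups come from the abstract; the regular expression, the function names, and the example phrasing are assumptions for illustration and are not the authors' actual extraction rules.

```python
import re

# Hypothetical pattern for explicit age mentions such as "67 years old",
# "45 yo", or "12 y/o". The authors' real rules are not given in the abstract.
AGE_PATTERN = re.compile(
    r"\b(\d{1,3})[\s-]*(?:years?[\s-]*old|yo|y/o)\b", re.IGNORECASE
)


def age_group(age):
    """Map an age in years to one of the four groups named in the abstract."""
    if age <= 17:
        return "0-17"
    if age <= 39:
        return "18-39"
    if age <= 64:
        return "40-64"
    return "65+"


def make_training_example(post):
    """Return (post_without_explicit_age, label), or None if no age is stated.

    The label is taken from the first explicit age mention; every explicit
    mention is then removed so a classifier sees only implicit clues.
    """
    match = AGE_PATTERN.search(post)
    if match is None:
        return None
    label = age_group(int(match.group(1)))
    stripped = AGE_PATTERN.sub("", post)
    return stripped, label
```

Posts for which `make_training_example` returns `None` are the ones the trained classifier would later be applied to, since they carry no explicit age.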


Subject(s)
Health Records, Personal , Age Factors , Humans
7.
Artif Intell Med ; 102: 101771, 2020 01.
Article in English | MEDLINE | ID: mdl-31980108

ABSTRACT

Our aim is to develop a machine learning (ML) model that can predict dementia in a general patient population from multiple health care institutions one year and three years prior to the onset of the disease without any additional monitoring or screening. The purpose of the model is to automate the cost-effective, non-invasive, digital pre-screening of patients at risk for dementia. Towards this purpose, routine care data, which are widely available through Electronic Medical Record (EMR) systems, are used as a data source. These data embody rich knowledge and make related medical applications easy to deploy at scale in a cost-effective manner. Specifically, the model is trained by using structured and unstructured data from three EMR data sets: diagnosis, prescriptions, and medical notes. Each of these three data sets is used to construct an individual model, along with a combined model which is derived by using all three data sets. Human-interpretable data processing and ML techniques are selected in order to facilitate adoption of the proposed model by health care providers from multiple institutions. The results show that the combined model is generalizable across multiple institutions and is able to predict dementia within one year of its onset with an accuracy of nearly 80% despite the fact that it was trained using routine care data. Moreover, the analysis of the models identified important predictors for dementia. Some of these predictors (e.g., age and hypertensive disorders) are already confirmed by the literature while others, especially the ones derived from the unstructured medical notes, require further clinical analysis.


Subject(s)
Dementia/diagnosis , Electronic Health Records , Age Factors , Aged , Aged, 80 and over , Cost-Benefit Analysis , Drug Prescriptions/statistics & numerical data , Electronic Health Records/economics , Humans , Hypertension/complications , Machine Learning , Mass Screening , Middle Aged , Models, Theoretical , Neuropsychological Tests , Predictive Value of Tests , Reproducibility of Results , Risk Factors
8.
JMIR Med Inform ; 7(2): e12561, 2019 Apr 04.
Article in English | MEDLINE | ID: mdl-30946020

ABSTRACT

BACKGROUND: Medication nonadherence can compound into severe medical problems for patients. Identifying patients who are likely to become nonadherent may help reduce these problems. Data-driven machine learning models can predict medication adherence by using selected indicators from patients' past health records. Sources of data for these models traditionally fall under two main categories: (1) proprietary data from insurance claims, pharmacy prescriptions, or electronic medical records and (2) survey data collected from representative groups of patients. Models developed using these data sources often are limited because they are proprietary, subject to high cost, have limited scalability, or lack timely accessibility. These limitations suggest that social health forums might be an alternate source of data for adherence prediction. Indeed, these data are accessible, affordable, timely, and available at scale. However, they can be inaccurate. OBJECTIVE: This paper proposes a medication adherence machine learning model for fibromyalgia therapies that can mitigate the inaccuracy of social health forum data. METHODS: Transfer learning is a machine learning technique that allows knowledge acquired from one dataset to be transferred to another dataset. In this study, predictive adherence models for the target disease were first developed by using accurate but limited survey data. These models were then used to predict medication adherence from health social forum data. Random forest, an ensemble machine learning technique, was used to develop the predictive models. This transfer learning methodology is demonstrated in this study by examining data from the Medical Expenditure Panel Survey and the PatientsLikeMe social health forum. RESULTS: When the models are carefully designed, less than a 5% difference in accuracy is observed between the Medical Expenditure Panel Survey and the PatientsLikeMe medication adherence predictions for fibromyalgia treatments. 
This design must take into consideration the mapping between the predictors and the outcomes in the two datasets. CONCLUSIONS: This study exemplifies the potential and limitations of transfer learning in medication adherence-predictive models based on survey data and social health forum data. The proposed approach can make timely medication adherence monitoring cost-effective and widely accessible. Additional investigation is needed to improve the robustness of the approach and extend its applicability to other therapies and other sources of data.

9.
J Chem Inf Comput Sci ; 43(1): 25-35, 2003.
Article in English | MEDLINE | ID: mdl-12546534

ABSTRACT

The recent advances in laboratory technologies have resulted in a wealth of chemical and biological data. The rapid proliferation of a vast amount of data has led to a set of cheminformatics and bioinformatics applications that manipulate dynamic, heterogeneous, and massive data. An example of such application in the pharmaceutical industry is the computational process involved in the early discovery of lead drug candidates for a given target disease. In this paper, an efficient implementation of a drug candidate database is presented and evaluated. This study shows that high performance data access can be achieved through proper choices of data representation, database schema design, and parallel processing techniques.


Subject(s)
Databases, Factual , Drug Design , Computational Biology , Drug Industry , Humans