Results 1 - 14 of 14
1.
Front Digit Health ; 5: 1193467, 2023.
Article in English | MEDLINE | ID: mdl-37588022

ABSTRACT

Introduction: The SARS-CoV-2 (COVID-19) pandemic has created substantial health and economic burdens in the US and worldwide. As new variants continuously emerge, predicting critical clinical events in the context of relevant individual risks is a promising option for reducing the overall burden of COVID-19. This study aims to train an AI-driven decision support system that helps build a model to understand the most important features that predict the "mortality" of patients hospitalized with COVID-19. Methods: We conducted a retrospective analysis of "5,371" patients hospitalized for COVID-19-related symptoms from the South Florida Memorial Health Care System between March 14th, 2020, and January 16th, 2021. A data set comprising patients' sociodemographic characteristics, pre-existing health information, and medication was analyzed. We trained a Random Forest classifier to predict "mortality" for patients hospitalized with COVID-19. Results: Based on the interpretability of the model, age emerged as the primary predictor of "mortality", followed by diarrhea, diabetes, hypertension, BMI, early stages of kidney disease, smoking status, sex, pneumonia, and race in descending order of importance. Notably, individuals aged over 65 years (referred to as "older adults"), males, Whites, Hispanics, and current smokers were identified as being at higher risk of death. Additionally, BMI, specifically in the overweight and obese categories, significantly predicted "mortality". These findings indicated that the model effectively learned from various categories, such as patients' sociodemographic characteristics, pre-hospital comorbidities, and medications, with a predominant focus on characterizing pre-hospital comorbidities. Consequently, the model demonstrated the ability to predict "mortality" with transparency and reliability.
Conclusion: AI can potentially provide healthcare workers with the ability to stratify patients and streamline optimal care solutions when time is of the essence and resources are limited. This work sets the platform for future work that forecasts patient responses to treatments at various levels of disease severity and assesses health disparities and patient conditions that promote improved health care in a broader context. This study contributed to one of the first predictive analyses applying AI/ML techniques to COVID-19 data using a vast sample from South Florida.

2.
SN Comput Sci ; 4(4): 389, 2023.
Article in English | MEDLINE | ID: mdl-37200563

ABSTRACT

Automated methods for detecting fraudulent healthcare providers have the potential to save billions of dollars in healthcare costs and improve the overall quality of patient care. This study presents a data-centric approach to improve healthcare fraud classification performance and reliability using Medicare claims data. Publicly available data from the Centers for Medicare & Medicaid Services (CMS) are used to construct nine large-scale labeled data sets for supervised learning. First, we leverage CMS data to curate the 2013-2019 Part B, Part D, and Durable Medical Equipment, Prosthetics, Orthotics, and Supplies (DMEPOS) Medicare fraud classification data sets. We provide a review of each data set and data preparation techniques to create Medicare data sets for supervised learning and we propose an improved data labeling process. Next, we enrich the original Medicare fraud data sets with up to 58 new provider summary features. Finally, we address a common model evaluation pitfall and propose an adjusted cross-validation technique that mitigates target leakage to provide reliable evaluation results. Each data set is evaluated on the Medicare fraud classification task using extreme gradient boosting and random forest learners, multiple complementary performance metrics, and 95% confidence intervals. Results show that the new enriched data sets consistently outperform the original Medicare data sets that are currently used in related works. Our results encourage the data-centric machine learning workflow and provide a strong foundation for data understanding and preparation techniques for machine learning applications in healthcare fraud.
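The abstract above does not spell out its adjusted cross-validation technique. As a rough illustration of the general idea of keeping related records out of both the training and test side of a split, the following minimal sketch assigns whole groups to folds. It assumes that leakage arises when rows from the same provider land on both sides; that assumption, the function name `group_k_fold`, and the provider ids are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs so that no group id spans both sides."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    # Assign whole groups to folds, largest group first, always filling the
    # currently smallest fold so fold sizes stay roughly balanced.
    folds = [[] for _ in range(k)]
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_group[g])

    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for j in range(k) if j != f for i in folds[j])
        yield train, test


# Hypothetical data: one row per provider-year, identified by provider id.
provider_ids = ["A", "A", "B", "C", "C", "C", "D", "E"]
splits = list(group_k_fold(provider_ids, k=3))
```

Each split can then be passed to any learner; the point of the sketch is only that a provider's rows never appear in both the training and the test indices of the same split.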

3.
J Big Data ; 8(1): 101, 2021.
Article in English | MEDLINE | ID: mdl-34306963

ABSTRACT

Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to preview a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.

4.
J Big Data ; 8(1): 18, 2021.
Article in English | MEDLINE | ID: mdl-33457181

ABSTRACT

This survey explores how Deep Learning has battled the COVID-19 pandemic and provides directions for future research on COVID-19. We cover Deep Learning applications in Natural Language Processing, Computer Vision, Life Sciences, and Epidemiology. We describe how each of these applications varies with the availability of big data and how learning tasks are constructed. We begin by evaluating the current state of Deep Learning and conclude with key limitations of Deep Learning for COVID-19 applications. These limitations include Interpretability, Generalization Metrics, Learning from Limited Labeled Data, and Data Privacy. Natural Language Processing applications include mining COVID-19 research for Information Retrieval and Question Answering, as well as Misinformation Detection, and Public Sentiment Analysis. Computer Vision applications cover Medical Image Analysis, Ambient Intelligence, and Vision-based Robotics. Within Life Sciences, our survey looks at how Deep Learning can be applied to Precision Diagnostics, Protein Structure Prediction, and Drug Repurposing. Deep Learning has additionally been utilized in Spread Forecasting for Epidemiology. Our literature review has found many examples of Deep Learning systems to fight COVID-19. We hope that this survey will help accelerate the use of Deep Learning for COVID-19 research.

5.
J Big Data ; 7(1): 94, 2020.
Article in English | MEDLINE | ID: mdl-33169094

ABSTRACT

Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDTs in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost's effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.

6.
J Alzheimers Dis ; 77(4): 1545-1558, 2020.
Article in English | MEDLINE | ID: mdl-32894241

ABSTRACT

BACKGROUND: The widespread incidence and prevalence of Alzheimer's disease and mild cognitive impairment (MCI) has prompted an urgent call for research to validate early detection cognitive screening and assessment. OBJECTIVE: Our primary research aim was to determine if selected MemTrax performance metrics and relevant demographics and health profile characteristics can be effectively utilized in predictive models developed with machine learning to classify cognitive health (normal versus MCI), as would be indicated by the Montreal Cognitive Assessment (MoCA). METHODS: We conducted a cross-sectional study on 259 neurology, memory clinic, and internal medicine adult patients recruited from two hospitals in China. Each patient was given the Chinese-language MoCA and self-administered the continuous recognition MemTrax online episodic memory test on the same day. Predictive classification models were built using machine learning with 10-fold cross validation, and model performance was measured using Area Under the Receiver Operating Characteristic Curve (AUC). Models were built using two MemTrax performance metrics (percent correct, response time), along with the eight common demographic and personal history features. RESULTS: Comparing the learners across selected combinations of MoCA scores and thresholds, Naïve Bayes was generally the top-performing learner with an overall classification performance of 0.9093. Further, among the top three learners, MemTrax-based classification performance overall was superior using just the top-ranked four features (0.9119) compared to using all 10 common features (0.8999). CONCLUSION: MemTrax performance can be effectively utilized in a machine learning classification predictive model screening application for detecting early stage cognitive impairment.
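For readers less familiar with the AUC figures quoted above, the metric can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name `roc_auc` and the toy labels and scores are illustrative assumptions and have nothing to do with the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = impaired, 0 = normal (made-up scores).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a sample of a few hundred patients; rank-based formulas give the same value faster on large data.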


Subject(s)
Cognitive Dysfunction/classification , Cognitive Dysfunction/psychology , Machine Learning/classification , Mental Status and Dementia Tests , Models, Psychological , Aged , Cognitive Dysfunction/diagnosis , Cross-Sectional Studies , Female , Humans , Machine Learning/standards , Male , Mental Status and Dementia Tests/standards , Middle Aged , Neuropsychological Tests/standards
7.
Health Care Manag Sci ; 23(1): 2-19, 2020 Mar.
Article in English | MEDLINE | ID: mdl-30368641

ABSTRACT

Quality and affordable healthcare is an important aspect in people's lives, particularly as they age. The rising elderly population in the United States (U.S.), with an increasing number of chronic diseases, implies continuing healthcare later in life and the need for programs, such as U.S. Medicare, to help with associated medical expenses. Unfortunately, due to healthcare fraud, these programs are being adversely affected, draining resources and reducing quality and accessibility of necessary healthcare services. The detection of fraud is critical in being able to identify and, subsequently, stop these perpetrators. The application of machine learning methods and data mining strategies can be leveraged to improve current fraud detection processes and reduce the resources needed to find and investigate possible fraudulent activities. In this paper, we employ an approach to predict a physician's expected specialty based on the type and number of procedures performed. From this approach, we generate a baseline model, comparing Logistic Regression and Multinomial Naive Bayes, in order to test and assess several new approaches to improve the detection of U.S. Medicare Part B provider fraud. Our results indicate that our proposed improvement strategies (specialty grouping, class removal, and class isolation), applied to different medical specialties, have mixed results over the selected Logistic Regression baseline model's fraud detection performance. Through our work, we demonstrate that improvements to current detection methods can be effective in identifying potential fraud.


Subject(s)
Fraud , Insurance Claim Review , Medicare/legislation & jurisprudence , Bayes Theorem , Data Mining/methods , Humans , Logistic Models , Machine Learning , Physicians/classification , United States
8.
J Alzheimers Dis ; 70(1): 277-286, 2019.
Article in English | MEDLINE | ID: mdl-31177223

ABSTRACT

BACKGROUND: Memory dysfunction is characteristic of aging and often attributed to Alzheimer's disease (AD). An easily administered tool for preliminary assessment of memory function and early AD detection would be integral in improving patient management. OBJECTIVE: Our primary aim was to utilize machine learning in determining initial viable models to serve as complementary instruments in demonstrating efficacy of the MemTrax online Continuous Recognition Tasks (M-CRT) test for episodic-memory screening and assessing cognitive impairment. METHODS: We used an existing dataset subset (n = 18,395) of demographic information, general health screening questions (addressing memory, sleep quality, medications, and medical conditions affecting thinking), and test results from a convenience sample of adults who took the M-CRT test. M-CRT performance and participant features were used as independent attributes: true positive/negative, percent responses/correct, response time, age, sex, and recent alcohol consumption. For predictive modeling, we used demographic information and test scores to predict binary classification of the health-related questions (yes/no) and general health status (healthy/unhealthy), based on the screening questions. RESULTS: ANOVA revealed significant differences among HealthQScore groups for response time true positive (p = 0.000) and true positive (p = 0.020), but none for true negative (p = 0.0551). Both % responses and % correct had significant differences (p = 0.026 and p = 0.037, respectively). Logistic regression was generally the top-performing learner with moderately robust prediction performance (AUC) for HealthQScore (0.648-0.680) and selected general health questions (0.713-0.769). CONCLUSION: Our novel application of supervised machine learning and predictive modeling helps to demonstrate and validate cross-sectional utility of MemTrax in assessing early-stage cognitive impairment and general screening for AD.


Subject(s)
Aging/psychology , Alzheimer Disease/diagnosis , Cognition/physiology , Cognitive Dysfunction/diagnosis , Dementia/diagnosis , Machine Learning , Memory, Episodic , Adult , Aged , Aged, 80 and over , Alzheimer Disease/psychology , Cognitive Dysfunction/psychology , Databases, Factual , Dementia/psychology , Female , Health Status , Humans , Male , Mass Screening , Middle Aged , Models, Psychological , Neuropsychological Tests
9.
Comput Biol Med ; 110: 29-39, 2019 07.
Article in English | MEDLINE | ID: mdl-31112896

ABSTRACT

BACKGROUND: Building cancer risk models from real-world data requires overcoming challenges in data preprocessing, efficient representation, and computational performance. We present a case study of a cloud-based approach to learning from de-identified electronic health record data and demonstrate its effectiveness for melanoma risk prediction. METHODS: We used a hybrid distributed and non-distributed approach to computing in the cloud: distributed processing with Apache Spark for data preprocessing and labeling, and non-distributed processing for machine learning model training with scikit-learn. Moreover, we explored the effects of sampling the training dataset to improve computational performance. Risk factors were evaluated using regression weights as well as tree SHAP values. RESULTS: Among 4,061,172 patients who did not have melanoma through the 2016 calendar year, 10,129 were diagnosed with melanoma within one year. A gradient-boosted classifier achieved the best predictive performance with cross-validation (AUC = 0.799, Sensitivity = 0.753, Specificity = 0.688). Compared to a model built on the original data, a dataset two orders of magnitude smaller could achieve statistically similar or better performance with less than 1% of the training time and cost. CONCLUSIONS: We produced a model that can effectively predict melanoma risk for a diverse dermatology population in the U.S. by using hybrid computing infrastructure and data sampling. For this de-identified clinical dataset, sampling approaches significantly shortened the time for model building while retaining predictive accuracy, allowing for more rapid machine learning model experimentation on familiar computing machinery. A large number of risk factors (>300) were required to produce the best model.


Subject(s)
Big Data , Electronic Health Records , Machine Learning , Melanoma , Models, Biological , Humans , Melanoma/epidemiology , Melanoma/metabolism , Melanoma/pathology , Predictive Value of Tests , Risk Assessment , Risk Factors
10.
Med Sci Sports Exerc ; 51(7): 1362-1371, 2019 07.
Article in English | MEDLINE | ID: mdl-30694980

ABSTRACT

INTRODUCTION: Concussion prevalence in sport is well recognized; so too is the challenge of clinical and return-to-play management for an injury with an inherent indeterminate time course of resolve. A clear, valid insight into the anticipated resolution time could assist in planning treatment intervention. PURPOSE: This study implemented a supervised machine learning-based approach in modeling estimated symptom resolve time in high school athletes who incurred a concussion during sport activity. METHODS: We examined the efficacy of 10 classification algorithms using machine learning for the prediction of symptom resolution time (within 7, 14, or 28 d), with a data set representing 3 yr of concussions suffered by high school student-athletes in football (most concussion incidents) and other contact sports. RESULTS: The most prevalent sport-related concussion reported symptom was headache (94.9%), followed by dizziness (74.3%) and difficulty concentrating (61.1%). For all three category thresholds of predicted symptom resolution time, single-factor ANOVA revealed statistically significant performance differences across the 10 classification models for all learners at a 95% confidence interval (P = 0.000). Naïve Bayes and Random Forest with either 100 or 500 trees were the top-performing learners with an area under the receiver operating characteristic curve performance ranging between 0.656 and 0.742 (0.0-1.0 scale). CONCLUSIONS: Considering the limitations of these data specific to symptom presentation and resolve, supervised machine learning demonstrated efficacy, while warranting further exploration, in developing symptom-based prediction models for practical estimation of sport-related concussion recovery in enhancing clinical decision support.


Subject(s)
Athletic Injuries/physiopathology , Brain Concussion/physiopathology , Machine Learning , Adolescent , Athletic Injuries/diagnosis , Attention/physiology , Brain Concussion/diagnosis , Clinical Decision-Making , Dizziness/etiology , Football/injuries , Headache/etiology , Humans , Return to Sport , Time Factors
11.
Health Inf Sci Syst ; 6(1): 9, 2018 Dec.
Article in English | MEDLINE | ID: mdl-30186595

ABSTRACT

Healthcare in the United States is a critical aspect of most people's lives, particularly for the aging demographic. This rising elderly population continues to demand more cost-effective healthcare programs. Medicare is a vital program serving the needs of the elderly in the United States. The growing number of Medicare beneficiaries, along with the enormous volume of money in the healthcare industry, increases the appeal for, and risk of, fraud. In this paper, we focus on the detection of Medicare Part B provider fraud which involves fraudulent activities, such as patient abuse or neglect and billing for services not rendered, perpetrated by providers and other entities who have been excluded from participating in Federal healthcare programs. We discuss Part B data processing and describe a unique process for mapping fraud labels with known fraudulent providers. The labeled big dataset is highly imbalanced with a very limited number of fraud instances. In order to combat this class imbalance, we generate seven class distributions and assess the behavior and fraud detection performance of six different machine learning methods. Our results show that RF100 using a 90:10 class distribution is the best learner with a 0.87302 AUC. Moreover, learner behavior with the 50:50 balanced class distribution is similar to more imbalanced distributions which keep more of the original data. Based on the performance and significance testing results, we posit that retaining more of the majority class information leads to better Medicare Part B fraud detection performance over the balanced datasets across the majority of learners.
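A class distribution such as 90:10 can be produced in several ways, and the abstract above does not say which one was used. The sketch below assumes random undersampling of the majority (non-fraud) class while keeping every minority (fraud) row, and it reads 90:10 as majority:minority. The function name `undersample` and the toy label vector are hypothetical.

```python
import random


def undersample(labels, majority_share, seed=0):
    """Return row indices with all minority rows (label 1) and a random
    subset of majority rows (label 0) sized so that the majority class
    makes up `majority_share` of the result (e.g. 0.9 for 90:10)."""
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_major = round(len(minority) * majority_share / (1 - majority_share))
    n_major = min(n_major, len(majority))  # cannot keep more than exist
    rng = random.Random(seed)
    return sorted(minority + rng.sample(majority, n_major))


# Toy data: 10 fraud rows among 1,000 (made-up proportions).
labels = [1] * 10 + [0] * 990
idx_90_10 = undersample(labels, majority_share=0.9)
idx_50_50 = undersample(labels, majority_share=0.5)
```

Under this reading, a 90:10 sample keeps nine times as many majority rows as a 50:50 sample, which matches the abstract's remark about retaining more of the majority class information.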

12.
Artif Intell Med ; 90: 1-14, 2018 08.
Article in English | MEDLINE | ID: mdl-30017512

ABSTRACT

Advancements are constantly being made in oncology, improving prevention and treatment of cancers. To help reduce the impact and deadliness of cancers, they must be detected early. Additionally, there is a risk of cancers recurring after potentially curative treatments are performed. Predictive models can be built using historical patient data to model the characteristics of patients that developed cancer or relapsed. These models can then be deployed into clinical settings to determine if new patients are at high risk for cancer development or recurrence. For large-scale predictive models to be built, structured data must be captured for a wide range of diverse patients. This paper explores current methods for building cancer risk models using structured clinical patient data. Trends in statistical and machine learning techniques are explored, and gaps are identified for future research. The field of cancer risk prediction is a high-impact one, and research must continue for these models to be embraced for clinical decision support of both practitioners and patients.


Subject(s)
Data Mining/methods , Decision Support Techniques , Diagnosis, Computer-Assisted/methods , Early Detection of Cancer/methods , Electronic Health Records , Machine Learning , Neoplasms/diagnosis , Clinical Decision-Making , Data Interpretation, Statistical , Data Mining/statistics & numerical data , Decision Trees , Early Detection of Cancer/statistics & numerical data , Electronic Health Records/statistics & numerical data , Humans , Neoplasm Staging , Neoplasms/epidemiology , Neoplasms/therapy , Nomograms , Recurrence , Risk Assessment , Risk Factors
13.
IEEE Trans Neural Netw ; 21(5): 813-30, 2010 May.
Article in English | MEDLINE | ID: mdl-20236881

ABSTRACT

Neural network algorithms such as multilayer perceptrons (MLPs) and radial basis function networks (RBFNets) have been used to construct learners which exhibit strong predictive performance. Two data related issues that can have a detrimental impact on supervised learning initiatives are class imbalance and labeling errors (or class noise). Imbalanced data can make it more difficult for the neural network learning algorithms to distinguish between examples of the various classes, and class noise can lead to the formulation of incorrect hypotheses. Both class imbalance and labeling errors are pervasive problems encountered in a wide variety of application domains. Many studies have been performed to investigate these problems in isolation, but few have focused on their combined effects. This study presents a comprehensive empirical investigation using neural network algorithms to learn from imbalanced data with labeling errors. In particular, the first component of our study investigates the impact of class noise and class imbalance on two common neural network learning algorithms, while the second component considers the ability of data sampling (which is commonly used to address the issue of class imbalance) to improve their performances. Our results, for which over two million models were trained and evaluated, show that conclusions drawn using the more commonly studied C4.5 classifier may not apply when using neural networks.
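Class noise of the kind studied above is often simulated by corrupting a known fraction of training labels. The abstract does not describe the study's injection procedure, so the following is only a minimal sketch that assumes binary labels and uniform random flipping; `inject_class_noise` and the toy data are hypothetical.

```python
import random


def inject_class_noise(labels, noise_rate, seed=0):
    """Flip a fixed fraction of binary labels chosen uniformly at random."""
    rng = random.Random(seed)
    n_flip = round(len(labels) * noise_rate)
    flip = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip else y for i, y in enumerate(labels)]


# Toy imbalanced data: 90 negatives, 10 positives, 10% of labels corrupted.
clean = [0] * 90 + [1] * 10
noisy = inject_class_noise(clean, noise_rate=0.1)
```

Because the flips are drawn uniformly, most of them fall on the majority class in imbalanced data, which inflates the apparent size of the minority class.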


Subject(s)
Algorithms , Artificial Intelligence , Learning/physiology , Neural Networks, Computer , Computer Simulation , Databases, Factual/statistics & numerical data , Humans , ROC Curve
14.
IEEE Trans Image Process ; 15(9): 2755-61, 2006 Sep.
Article in English | MEDLINE | ID: mdl-16948319

ABSTRACT

We present an unsupervised multiscale color image segmentation algorithm. The basic idea is to apply mean shift clustering to obtain an over-segmentation and then merge regions at multiple scales to minimize the minimum description length criterion. The performance on the Berkeley segmentation benchmark compares favorably with some existing approaches.
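The mean shift step mentioned above can be illustrated with a minimal sketch. It is simplified in ways the abstract does not specify: it works on one-dimensional gray levels rather than color vectors, uses a flat kernel, and omits the multiscale region merging under the minimum description length criterion. The function name `mean_shift_1d` and the sample values are hypothetical.

```python
def mean_shift_1d(points, bandwidth, iters=50, tol=1e-6):
    """Move each point to the mean of its neighbors within `bandwidth`
    until it stops moving, then group points whose modes coincide."""
    modes = []
    for x in points:
        for _ in range(iters):
            window = [p for p in points if abs(p - x) <= bandwidth]
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)

    # Modes that landed within one bandwidth of an existing center
    # are treated as the same cluster.
    centers, labels = [], []
    for m in modes:
        for c_idx, c in enumerate(centers):
            if abs(m - c) <= bandwidth:
                labels.append(c_idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


# Toy gray levels with two obvious groups (made-up values).
centers, labels = mean_shift_1d([10, 11, 12, 50, 51, 52], bandwidth=5)
```

A small bandwidth tends to produce many small clusters, which is the kind of over-segmentation a later merging stage would then reduce.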


Subject(s)
Algorithms , Artificial Intelligence , Colorimetry/methods , Image Enhancement/methods , Image Interpretation, Computer-Assisted/methods , Information Storage and Retrieval/methods , Pattern Recognition, Automated/methods , Cluster Analysis , Color , Computer Simulation , Models, Statistical