Results 1 - 20 of 26
1.
Entropy (Basel) ; 26(9)2024 Sep 12.
Article in English | MEDLINE | ID: mdl-39330116

ABSTRACT

Although deep learning (DL) algorithms have proven effective in diverse research domains, their application to models for tabular data remains limited. On tabular data, traditional machine learning models typically outperform DL models, a gap largely attributed to the size and structure of tabular datasets and the specific contexts in which they are used. The primary objective of this paper is therefore to propose a method that harnesses the pattern-discovery strength of stacked bidirectional LSTM (Long Short-Term Memory) networks on tabular data by feeding them customized 3D tensor representations. Our findings are empirically validated on six diverse, publicly available datasets, each varying in size and learning objective. This paper shows that the proposed model, based on time-sequence DL algorithms generally described as inadequate for tabular data, yields satisfactory results and competes effectively with algorithms designed specifically for tabular data. An additional benefit of this approach is that it preserves simplicity while ensuring fast training, even on large datasets. Even with extremely small datasets, the models achieve strong predictive results and make full use of their capacity.
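
A minimal sketch of the core idea in PyTorch, under the assumption that each row is folded into a 4-step pseudo-sequence; the actual 3D tensor construction and network sizes in the paper may differ:

```python
# Minimal sketch of feeding tabular rows to a stacked bidirectional LSTM
# by reshaping each row into a 3D tensor (batch, steps, features per step).
# The 4x4 reshape below is an illustrative assumption, not the paper's scheme.
import torch
import torch.nn as nn

class TabularBiLSTM(nn.Module):
    def __init__(self, steps=4, feats_per_step=4, hidden=64, n_classes=2):
        super().__init__()
        self.steps, self.feats = steps, feats_per_step
        self.lstm = nn.LSTM(feats_per_step, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                         # x: (batch, steps * feats)
        x = x.view(-1, self.steps, self.feats)    # 2D table row -> 3D tensor
        out, _ = self.lstm(x)                     # (batch, steps, 2*hidden)
        return self.head(out[:, -1, :])           # last step -> class logits

model = TabularBiLSTM()
logits = model(torch.randn(32, 16))               # 32 rows, 16 features each
```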

2.
Comput Struct Biotechnol J ; 23: 2892-2910, 2024 Dec.
Article in English | MEDLINE | ID: mdl-39108677

ABSTRACT

Synthetic data generation has emerged as a promising solution to the challenges posed by data scarcity and privacy concerns, as well as to the need to train artificial intelligence (AI) algorithms on unbiased data with sufficient sample size and statistical power. Our review explores the application and efficacy of synthetic data methods in healthcare, considering the diversity of medical data. To this end, we systematically searched the PubMed and Scopus databases with a focus on tabular, imaging, radiomics, time-series, and omics data. Studies involving multi-modal synthetic data generation were also explored. The type of method used for synthetic data generation was identified in each study and categorized as statistical, probabilistic, machine learning, or deep learning. Emphasis was placed on the programming languages used to implement each method. Our evaluation revealed that the majority of studies use synthetic data generators to: (i) reduce the cost and time required for clinical trials for rare diseases and conditions, (ii) enhance the predictive power of AI models in personalized medicine, (iii) ensure the delivery of fair treatment recommendations across diverse patient populations, and (iv) enable researchers to access high-quality, representative multimodal datasets without exposing sensitive patient information, among others. We underline the wide use of deep learning-based synthetic data generators in 72.6% of the included studies, with 75.3% of the generators implemented in Python. A thorough documentation of open-source repositories is finally provided to accelerate research in the field.
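
As a concrete illustration of the Python tooling this review surveys, a minimal sketch of fitting one statistical generator with the open-source SDV library (SDV 1.x API assumed; this is an example of the category, not a method proposed by the review):

```python
# Minimal sketch: fitting a statistical synthesizer on a tabular dataset
# with the SDV library and sampling synthetic rows (SDV 1.x API assumed).
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

real = pd.DataFrame({
    "age": [34, 51, 29, 62, 45],
    "sex": ["F", "M", "F", "M", "F"],
    "blood_pressure": [118, 135, 110, 142, 125],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)        # infer column types from the data

synth = GaussianCopulaSynthesizer(metadata)
synth.fit(real)
fake = synth.sample(num_rows=100)           # synthetic rows, same schema
print(fake.head())
```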

3.
Stud Health Technol Inform ; 316: 621-625, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176818

ABSTRACT

The sharing of personal health data is highly regulated due to privacy and security concerns. An alternative to sharing personal data is to share synthetic data: ideally, it should be impossible to reconstruct real personal data from synthetic data, a property called privacy. At the same time, the structure of the synthetic data should be as similar as possible to that of the real data, so that conclusions drawn from the synthetic data also hold for the real data, a property called fidelity. Typically, there is a tradeoff between fidelity and privacy for synthetic health data. We study the fidelity and privacy of cancer data synthesized using generative machine learning approaches: variational autoencoders (VAEs), generative adversarial networks (GANs), and denoising diffusion probabilistic models (DDPMs). The tabular cancer registry data studied comprise nine categorical variables from breast cancer patients. We find that DDPMs generate synthetic cancer data with higher fidelity; that is, the structure of the synthetic data is more similar to the real cancer data than that of data generated by VAEs and GANs. At the same time, synthetic cancer data from DDPMs pose a greater privacy risk because they are more likely to reveal information from real patients than synthetic data from VAEs and GANs.
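
A rough sketch of how such a fidelity/privacy trade-off can be quantified for categorical tables; the two metrics below (marginal similarity and distance to closest record) are common illustrative choices, not necessarily the ones used in the study:

```python
# Sketch of measuring the fidelity/privacy trade-off for categorical
# synthetic data: marginal similarity as fidelity, distance to closest
# real record (DCR) as a simple privacy proxy. Illustrative metrics only.
import numpy as np
import pandas as pd

def marginal_fidelity(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Mean (1 - total variation distance) over categorical columns."""
    scores = []
    for col in real.columns:
        p = real[col].value_counts(normalize=True)
        q = synth[col].value_counts(normalize=True)
        p, q = p.align(q, fill_value=0.0)
        scores.append(1.0 - 0.5 * np.abs(p - q).sum())
    return float(np.mean(scores))

def mean_dcr(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Mean Hamming distance from each synthetic row to its closest real
    row; values near zero suggest memorization, i.e., higher privacy risk.
    O(n^2) pairwise comparison, fine for a sketch."""
    r, s = real.to_numpy(), synth.to_numpy()
    dists = (s[:, None, :] != r[None, :, :]).mean(axis=2)  # pairwise Hamming
    return float(dists.min(axis=1).mean())
```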


Subject(s)
Registries; Humans; Confidentiality; Machine Learning; Computer Security; Neoplasms; Breast Neoplasms; Female; Privacy
4.
Stud Health Technol Inform ; 316: 963-967, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39176952

ABSTRACT

Synthetic tabular health data plays a crucial role in healthcare research, addressing privacy regulations and the scarcity of publicly available datasets; this is essential for diagnostic and treatment advances. Among the most promising models are transformer-based Large Language Models (LLMs) and Generative Adversarial Networks (GANs). In this paper, we compare LLMs from the Pythia LLM Scaling Suite, with model sizes ranging from 14M to 1B parameters, against a reference GAN model (CTGAN). The generated synthetic data are used to train random forest estimators for classification tasks that make predictions on real-world data. Our findings indicate that as the number of parameters increases, LLMs outperform the reference GAN model; even the smallest 14M-parameter models perform comparably to GANs. Moreover, we observe a positive correlation between training dataset size and model performance. We discuss implications, challenges, and considerations for the real-world use of LLMs in synthetic tabular data generation.
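
A minimal sketch of the train-on-synthetic, test-on-real protocol described above, using scikit-learn; the dataset variables are placeholders:

```python
# Sketch of the evaluation protocol (train-on-synthetic, test-on-real):
# fit a random forest on generated rows, score it on held-out real rows.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tstr_score(synth_X, synth_y, real_X, real_y):
    """Train on synthetic data, evaluate on real data (TSTR)."""
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(synth_X, synth_y)
    return roc_auc_score(real_y, clf.predict_proba(real_X)[:, 1])

# usage: tstr_score(llm_rows, llm_labels, holdout_rows, holdout_labels)
```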


Subject(s)
Benchmarking; Computer Simulation
5.
Entropy (Basel) ; 26(8)2024 Aug 04.
Article in English | MEDLINE | ID: mdl-39202134

ABSTRACT

To optimize the utilization and analysis of tables, it is essential to recognize and understand their semantics comprehensively. This requirement is especially critical given that many tables lack explicit annotations, necessitating the identification of column types and inter-column relationships. Such identification can significantly improve data quality, streamline data integration, and support data analysis and mining. Current table annotation models often address each subtask independently, which may neglect constraints and contextual information, causing relational ambiguities and inference errors. To address this issue, we propose a unified multi-task learning framework capable of concurrently handling multiple tasks within a single model, including column named entity recognition, column type identification, and inter-column relationship detection. By integrating these tasks, the framework exploits their interrelations, facilitating the exchange of shallow features and the sharing of representations. This cooperation enables each task to leverage insights from the others, improving the performance of individual subtasks and enhancing the model's overall generalization capability. Notably, our model is designed to use only the internal information of the tabular data, avoiding reliance on external context or knowledge graphs; this design ensures robust performance even with limited input information. Extensive experiments demonstrate the superior performance of our model across various tasks, validating the effectiveness of the unified multi-task learning framework in the recognition and comprehension of table semantics.
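
A schematic PyTorch sketch of the shared-encoder, multi-head idea (one head per subtask); the encoder, dimensions, and label spaces are illustrative assumptions, not the paper's architecture:

```python
# Sketch of a shared-encoder multi-task setup for table annotation:
# one column encoder feeds three heads (column NER, column type,
# inter-column relation). Dimensions and label counts are assumptions.
import torch
import torch.nn as nn

class MultiTaskTableAnnotator(nn.Module):
    def __init__(self, d_in=128, d_h=256, n_ner=10, n_types=20, n_rels=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_h), nn.ReLU())
        self.ner_head = nn.Linear(d_h, n_ner)       # column named entities
        self.type_head = nn.Linear(d_h, n_types)    # column types
        self.rel_head = nn.Linear(2 * d_h, n_rels)  # pairwise relations

    def forward(self, cols):                 # cols: (n_cols, d_in)
        h = self.encoder(cols)               # shared representation
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        return self.ner_head(h), self.type_head(h), self.rel_head(pairs)

ner, types, rels = MultiTaskTableAnnotator()(torch.randn(5, 128))
```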

6.
Entropy (Basel) ; 26(7)2024 Jul 11.
Article in English | MEDLINE | ID: mdl-39056955

ABSTRACT

We introduce NodeFlow, a flexible framework for probabilistic regression on tabular data that combines Neural Oblivious Decision Ensembles (NODEs) and Conditional Continuous Normalizing Flows (CNFs). It offers improved modeling capabilities for arbitrary probabilistic distributions, addressing the limitations of traditional parametric approaches. In NodeFlow, the NODE captures complex relationships in tabular data through a tree-like structure, while the conditional CNF uses the NODE's output space as a conditioning factor. Training employs standard gradient-based learning, enabling end-to-end optimization of the NODEs and the CNF-based density estimation. This approach ensures strong performance, ease of implementation, and scalability, making NodeFlow an appealing choice for practitioners and researchers. Comprehensive assessments on benchmark datasets underscore NodeFlow's efficacy, showing state-of-the-art results in the multivariate probabilistic regression setting and strong performance in univariate regression tasks. Ablation studies are also conducted to justify NodeFlow's design choices. In conclusion, NodeFlow's end-to-end training process and strong performance make it a compelling solution, and it opens new avenues for research and application in probabilistic regression on tabular data.
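
A heavily simplified sketch of the conditioning mechanism: here a plain MLP stands in for the NODE ensemble, and a one-layer affine flow stands in for the CNF, trained end-to-end by exact log-likelihood:

```python
# Heavily simplified sketch of NodeFlow's conditioning idea: a tabular
# encoder (an MLP here, standing in for the NODE ensemble) emits a context
# vector that parameterizes a one-layer affine flow over the target y.
import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    def __init__(self, d_in=8, d_ctx=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_ctx), nn.ReLU())
        self.mu = nn.Linear(d_ctx, 1)
        self.log_sigma = nn.Linear(d_ctx, 1)

    def log_prob(self, x, y):              # x: (batch, d_in), y: (batch, 1)
        ctx = self.encoder(x)
        mu, log_sigma = self.mu(ctx), self.log_sigma(ctx)
        z = (y - mu) * torch.exp(-log_sigma)        # inverse affine transform
        base = -0.5 * (z ** 2) - 0.5 * torch.log(torch.tensor(2 * torch.pi))
        return (base - log_sigma).sum(dim=1)        # change-of-variables term

flow = ConditionalAffineFlow()
x, y = torch.randn(16, 8), torch.randn(16, 1)
loss = -flow.log_prob(x, y).mean()                  # maximize likelihood
loss.backward()
```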

7.
Front Med (Lausanne) ; 11: 1373005, 2024.
Article in English | MEDLINE | ID: mdl-38919938

ABSTRACT

Background: Liver transplantation (LT) is one of the main curative treatments for hepatocellular carcinoma (HCC). The Milan criteria have long been applied to candidate LT patients with HCC but fail to precisely identify patients at risk of recurrence. We therefore aimed to establish and validate a deep learning model, compare it with the Milan criteria, and better guide post-LT treatment. Methods: A total of 356 HCC patients who received LT and had complete follow-up data were evaluated. The entire cohort was randomly divided into a training set (n = 286) and a validation set (n = 70). A multilayer perceptron model provided by the pycox library was first used to construct the recurrence prediction model. Then TabNet, a tabular neural network that combines elements of deep learning with tabular data processing techniques, was used for comparison with the Milan criteria and to verify the performance of the proposed model. Results: Patients with tumor size over 7 cm, poorly differentiated tumor grade, and multiple tumors were first classified as at high risk of recurrence. We trained a classification model with TabNet, and our proposed model performed better than the Milan criteria in terms of accuracy (0.95 vs. 0.86, p < 0.05). In addition, our model showed better performance with improved AUC, NRI, and hazard ratio, demonstrating its robustness. Conclusion: A prognostic model based on TabNet applied to various parameters from HCC patients was proposed. The model performed well in post-LT recurrence prediction and in the identification of high-risk subgroups.
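
A sketch of training a TabNet classifier with the open-source pytorch-tabnet package; the features, sizes, and hyperparameters below are placeholders rather than the study's configuration:

```python
# Sketch of training a TabNet classifier for recurrence prediction with the
# pytorch-tabnet package; feature matrices here are random placeholders.
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

X_train = np.random.rand(286, 10).astype(np.float32)   # e.g., tumor size,
y_train = np.random.randint(0, 2, 286)                 # grade, number, ...
X_valid = np.random.rand(70, 10).astype(np.float32)
y_valid = np.random.randint(0, 2, 70)

clf = TabNetClassifier(seed=0)
clf.fit(X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        eval_metric=["auc"],
        max_epochs=100, patience=20)
recurrence_prob = clf.predict_proba(X_valid)[:, 1]
```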

8.
Front Artif Intell ; 7: 1345179, 2024.
Article in English | MEDLINE | ID: mdl-38720912

ABSTRACT

The rapid proliferation of data across diverse fields has accentuated the importance of accurate imputation for missing values. This task is crucial for ensuring data integrity and deriving meaningful insights. In response to this challenge, we present Xputer, a novel imputation tool that adeptly integrates Non-negative Matrix Factorization (NMF) with the predictive strengths of XGBoost. One of Xputer's standout features is its versatility: it supports zero imputation, enables hyperparameter optimization through Optuna, and allows users to define the number of iterations. For enhanced user experience and accessibility, we have equipped Xputer with an intuitive Graphical User Interface (GUI) ensuring ease of handling, even for those less familiar with computational tools. In performance benchmarks, Xputer often outperforms IterativeImputer in terms of imputation accuracy. Furthermore, Xputer autonomously handles a diverse spectrum of data types, including categorical, continuous, and Boolean, eliminating the need for prior preprocessing. Given its blend of performance, flexibility, and user-friendly design, Xputer emerges as a state-of-the-art solution in the realm of data imputation.
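
Not Xputer itself, but a sketch of the same family of ideas, model-based imputation with gradient-boosted trees, using scikit-learn's IterativeImputer with an XGBoost estimator:

```python
# Not Xputer itself: a sketch of model-based imputation with gradient-
# boosted trees, via scikit-learn's IterativeImputer using an XGBoost
# regressor as the per-column estimator.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from xgboost import XGBRegressor

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 4.0, 9.0],
              [np.nan, 8.0, 12.0]])

imputer = IterativeImputer(estimator=XGBRegressor(n_estimators=50),
                           max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)     # missing cells predicted per column
```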

9.
J Med Internet Res ; 26: e54363, 2024 May 02.
Article in English | MEDLINE | ID: mdl-38696251

ABSTRACT

BACKGROUND: Clinical notes contain contextualized information beyond structured data related to patients' past and current health status. OBJECTIVE: This study aimed to design a multimodal deep learning approach to improve the evaluation precision of hospital outcomes for heart failure (HF) using admission clinical notes and easily collected tabular data. METHODS: Data for the development and validation of the multimodal model were retrospectively derived from 3 open-access US databases, including the Medical Information Mart for Intensive Care III v1.4 (MIMIC-III) and MIMIC-IV v1.0, collected from a teaching hospital from 2001 to 2019, and the eICU Collaborative Research Database v1.2, collected from 208 hospitals from 2014 to 2015. The study cohorts consisted of all patients with critical HF. The clinical notes, including chief complaint, history of present illness, physical examination, medical history, and admission medication, as well as clinical variables recorded in electronic health records, were analyzed. We developed a deep learning mortality prediction model for in-hospital patients, which underwent complete internal, prospective, and external evaluation. The Integrated Gradients and SHapley Additive exPlanations (SHAP) methods were used to analyze the importance of risk factors. RESULTS: The study included 9989 (16.4%) patients in the development set, 2497 (14.1%) patients in the internal validation set, 1896 (18.3%) in the prospective validation set, and 7432 (15%) patients in the external validation set. The area under the receiver operating characteristic curve of the models was 0.838 (95% CI 0.827-0.851), 0.849 (95% CI 0.841-0.856), and 0.767 (95% CI 0.762-0.772), for the internal, prospective, and external validation sets, respectively. The area under the receiver operating characteristic curve of the multimodal model outperformed that of the unimodal models in all test sets, and tabular data contributed to higher discrimination. The medical history and physical examination were more useful than other factors in early assessments. CONCLUSIONS: The multimodal deep learning model for combining admission notes and clinical tabular data showed promising efficacy as a potentially novel method in evaluating the risk of mortality in patients with HF, providing more accurate and timely decision support.
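
A schematic sketch of the late-fusion idea, with deliberately simple stand-ins (an embedding-bag note encoder and a small tabular MLP) for the paper's encoders:

```python
# Sketch of the late-fusion idea: a note encoder and a tabular encoder
# whose representations are concatenated for mortality prediction.
# The encoders are simple stand-ins for the paper's architecture.
import torch
import torch.nn as nn

class NotesTabularFusion(nn.Module):
    def __init__(self, vocab=5000, d_text=64, n_tab=20, d_tab=32):
        super().__init__()
        self.text = nn.EmbeddingBag(vocab, d_text)        # notes -> vector
        self.tab = nn.Sequential(nn.Linear(n_tab, d_tab), nn.ReLU())
        self.head = nn.Linear(d_text + d_tab, 1)          # mortality logit

    def forward(self, token_ids, offsets, tabular):
        fused = torch.cat([self.text(token_ids, offsets),
                           self.tab(tabular)], dim=1)
        return self.head(fused)

model = NotesTabularFusion()
tokens = torch.randint(0, 5000, (50,))        # two notes, flattened
offsets = torch.tensor([0, 30])               # note boundaries
logit = model(tokens, offsets, torch.randn(2, 20))
```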


Subject(s)
Deep Learning; Heart Failure; Humans; Heart Failure/mortality; Heart Failure/therapy; Male; Female; Prognosis; Aged; Retrospective Studies; Middle Aged; Electronic Health Records; Hospitalization/statistics & numerical data; Hospital Mortality; Aged, 80 and over
10.
Entropy (Basel) ; 26(5)2024 May 04.
Article in English | MEDLINE | ID: mdl-38785651

ABSTRACT

Due to various factors, such as limitations in data collection and interruptions in network transmission, gathered data often contain missing values. Existing state-of-the-art generative adversarial imputation methods face three main issues: limited applicability, neglect of latent categorical information that could reflect relationships among samples, and an inability to balance local and global information. We propose a novel generative adversarial model named DTAE-CGAN that incorporates detracking autoencoding and conditional labels to address these issues. This design enhances the network's ability to learn inter-sample correlations and makes full use of all the information in incomplete datasets, rather than learning from random noise. We conducted experiments on six real datasets of varying sizes, comparing our method with four classic imputation baselines. The results demonstrate that our proposed model consistently achieves superior imputation accuracy.

11.
Artif Intell Med ; 149: 102804, 2024 03.
Article in English | MEDLINE | ID: mdl-38462275

ABSTRACT

Sepsis is a common syndrome in intensive care units (ICUs), and severe sepsis and septic shock are among the leading causes of death worldwide. The purpose of this study is to develop a deep learning model that supports clinicians in efficiently managing sepsis patients in the ICU by predicting mortality, ICU length of stay (>14 days), and hospital length of stay (>30 days). The proposed model was developed using 591 retrospective records with 16 tabular features related to the sequential organ failure assessment (SOFA) score. To analyze the tabular data, we designed a modified transformer architecture; transformers have achieved extraordinary success in language and computer vision tasks in recent years. The main idea of the proposed model is a skip-connected token, which combines local (feature-wise token) and global (classification token) information as the output of the transformer encoder. The proposed model was compared with machine learning models (ElasticNet, Extreme Gradient Boosting [XGBoost], and Random Forest) and three deep learning models (Multi-Layer Perceptron [MLP], transformer, and Feature-Tokenizer transformer [FT-Transformer]) and achieved the best performance (mortality, area under the receiver operating characteristic curve (AUROC) 0.8047; ICU length of stay, AUROC 0.8314; hospital length of stay, AUROC 0.7342). We anticipate that the proposed model architecture will provide a promising approach to predicting various clinical endpoints from tabular data such as electronic health and medical records.
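
A simplified PyTorch sketch of the skip-connected token: feature-wise tokens and a classification token pass through a transformer encoder, and the head consumes both the global token and pooled local tokens; sizes are illustrative, not the paper's configuration:

```python
# Sketch of the skip-connected-token idea: per-feature tokens plus a
# classification token pass through a transformer encoder; the output
# combines the global [CLS] token with pooled feature-wise (local) tokens.
import torch
import torch.nn as nn

class SkipTokenTransformer(nn.Module):
    def __init__(self, n_feats=16, d=64, n_classes=2):
        super().__init__()
        self.tokenize = nn.Linear(1, d)                  # scalar -> token
        self.cls = nn.Parameter(torch.zeros(1, 1, d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(2 * d, n_classes)          # global + local

    def forward(self, x):                                # x: (batch, n_feats)
        tok = self.tokenize(x.unsqueeze(-1))             # (batch, n_feats, d)
        tok = torch.cat([self.cls.expand(x.size(0), -1, -1), tok], dim=1)
        h = self.encoder(tok)
        global_tok = h[:, 0]                             # classification token
        local_tok = h[:, 1:].mean(dim=1)                 # feature-wise tokens
        return self.head(torch.cat([global_tok, local_tok], dim=1))

logits = SkipTokenTransformer()(torch.randn(8, 16))
```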


Subject(s)
Sepsis; Humans; Retrospective Studies; Prognosis; Sepsis/diagnosis; Organ Dysfunction Scores; ROC Curve; Intensive Care Units
12.
Neural Netw ; 173: 106180, 2024 May.
Article in English | MEDLINE | ID: mdl-38447303

ABSTRACT

All industries are trying to leverage Artificial Intelligence (AI) on their existing big data, which is often available in so-called tabular form, where each record is composed of a number of heterogeneous continuous and categorical columns, also known as features. Deep Learning (DL) has constituted a major breakthrough for AI in fields related to human skills like natural language processing, but its applicability to tabular data has been more challenging; more classical Machine Learning (ML) models like tree-based ensembles usually perform better. This paper presents a novel DL model using a Graph Neural Network (GNN), more specifically an Interaction Network (IN), for contextual embedding and for modeling interactions among tabular features. Its results outperform those of a recently published survey with a DL benchmark based on seven public datasets, and it achieves competitive results when compared to boosted-tree solutions.
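
A compact sketch of an Interaction Network over feature nodes, with a relation MLP scoring every ordered pair of feature embeddings; dimensions and the pooling choice are assumptions, not the paper's exact design:

```python
# Sketch of an Interaction Network over tabular features: each feature
# embedding is a node, a relation MLP scores every ordered pair, and
# aggregated messages update the node embeddings before pooling.
import torch
import torch.nn as nn

class FeatureInteractionNet(nn.Module):
    def __init__(self, n_feats=10, d=32, n_classes=2):
        super().__init__()
        self.embed = nn.Linear(1, d)                    # per-feature embedding
        self.relation = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.update = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.head = nn.Linear(d, n_classes)

    def forward(self, x):                               # x: (batch, n_feats)
        h = self.embed(x.unsqueeze(-1))                 # (batch, n, d)
        b, n, d = h.shape
        pairs = torch.cat([h.unsqueeze(2).expand(b, n, n, d),
                           h.unsqueeze(1).expand(b, n, n, d)], dim=-1)
        msg = self.relation(pairs).sum(dim=2)           # aggregate messages
        h = self.update(torch.cat([h, msg], dim=-1))    # node update
        return self.head(h.mean(dim=1))                 # pool nodes

logits = FeatureInteractionNet()(torch.randn(4, 10))
```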


Subject(s)
Artificial Intelligence; Deep Learning; Humans; Neural Networks, Computer; Benchmarking; Big Data
13.
Int J Med Inform ; 185: 105413, 2024 May.
Article in English | MEDLINE | ID: mdl-38493547

ABSTRACT

BACKGROUND: Ensuring the safe adoption of AI tools in healthcare hinges on access to sufficient data for training, testing, and validation. Synthetic data has been suggested in response to privacy concerns and regulatory requirements; it can be created by training a generator on real data to produce a dataset with similar statistical properties. Competing metrics with differing taxonomies for quality evaluation have been proposed, resulting in a complex landscape. Optimising quality entails balancing considerations that make the data fit for use, yet relevant dimensions are left out of existing frameworks. METHOD: We performed a comprehensive literature review on the use of quality evaluation metrics for synthetic tabular healthcare data generated by deep generative methods. Based on this review and the collective team experience, we developed a conceptual framework for quality assurance. Its applicability was benchmarked against a practical case from the Dutch National Cancer Registry. CONCLUSION: We present a conceptual framework for quality assurance of synthetic data for AI applications in healthcare that aligns diverging taxonomies, expands common quality dimensions to include Fairness and Carbon footprint, and proposes the stages necessary to support real-life applications. Building trust in synthetic data by increasing transparency and reducing safety risk will accelerate the development and uptake of trustworthy AI tools for the benefit of patients. DISCUSSION: Despite the growing emphasis on algorithmic fairness and carbon footprint, these metrics were scarce in the literature review. The overwhelming focus was on statistical similarity using distance metrics, while sequential logic detection was scarce. A consensus-backed framework that includes all relevant quality dimensions can provide assurance for safe and responsible real-life applications of synthetic data. As the choice of appropriate metrics is highly context-dependent, further validation studies are needed to guide metric choice and support the development of technical standards.


Subject(s)
Delivery of Health Care; Trust; Humans; Health Facilities
14.
JMIR Med Inform ; 11: e47859, 2023 Nov 24.
Article in English | MEDLINE | ID: mdl-37999942

ABSTRACT

BACKGROUND: Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but preserving logical relationships in synthetic tabular data (STD) remains challenging, and filtering methods for SDG can lose important information. OBJECTIVE: This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm while preserving data with logical relationships. METHODS: The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) we used 2 different partitioning methods (the class-specific criterion distinguished between survival and death groups, while the Cramer's V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input to the conditional tabular generative adversarial network and the copula generative adversarial network to generate synthetic data; and (3) the generated synthetic data were consolidated into a single entity. For validation, we compared DC-based SDG and conditional sampling (CS)-based SDG through the performance of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) models. RESULTS: The synthetic data for the 3 diseases (non-small cell lung cancer [NSCLC], breast cancer, and diabetes) generated by our proposed DC-based model outperformed the CS-based synthetic data across all 4 classifiers (DT, RF, XGBoost, and LGBM). The DC- versus CS-based model performances were compared using mean area under the curve (SD) values: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes (DT); 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes (RF); 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes (XGBoost); and 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes (LGBM). In addition, we found that balanced synthetic data performed better. CONCLUSIONS: This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity of balanced SDG was also demonstrated.
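
A sketch of the class-specific divide-and-conquer step using SDV's CTGAN synthesizer (SDV 1.x API assumed): fit one generator per group, then consolidate the samples. This illustrates the strategy, not the study's implementation:

```python
# Sketch of the divide-and-conquer strategy with the class-specific
# criterion: split the data into groups (e.g., survival vs. death), fit
# one CTGAN per subset, then pool the synthetic rows so each group's
# logical structure is preserved.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

def dc_synthesize(df: pd.DataFrame, class_col: str, rows_per_group: int):
    parts = []
    for _, subset in df.groupby(class_col):      # partition by class
        metadata = SingleTableMetadata()
        metadata.detect_from_dataframe(subset)
        synth = CTGANSynthesizer(metadata, epochs=100)
        synth.fit(subset)
        parts.append(synth.sample(num_rows=rows_per_group))
    return pd.concat(parts, ignore_index=True)   # consolidate into one table
```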

15.
Biomimetics (Basel) ; 8(4)2023 Aug 07.
Article in English | MEDLINE | ID: mdl-37622956

ABSTRACT

Parkinson's disease (PD) affects a large proportion of elderly people. Symptoms include tremors, slow movement, rigid muscles, and trouble speaking. With the aging of the developed world's population, this number is expected to rise. Early detection of PD and avoidance of its severe consequences require a precise and efficient system. Our goal is to create an accurate AI model that can identify PD from human voices. We developed a transformer-based method for detecting PD by retrieving dysphonia measures from a subject's voice recording. It is uncommon to use a neural network (NN)-based solution for tabular vocal characteristics, but it has several advantages over a tree-based approach, including compatibility with continuous learning and the network's potential to be linked with an image/voice encoder for a more accurate multimodal solution; shifting the SOTA approach from tree-based models to NNs is crucial for advancing research in multimodal solutions. Our method outperforms the state of the art (SOTA), namely Gradient-Boosted Decision Trees (GBDTs), by at least 1% AUC, and the precision and recall scores are also improved. In addition to the solution network, we offer an XGBoost-based feature-selection method and a fully connected NN layer technique for including continuous dysphonia measures. We also discuss several important findings relating to our solution and the application of deep learning (DL) to dysphonia measures, such as the observation that a transformer-based network is more resilient to increased depth than a simple MLP network. The performance of the proposed approach is also compared with conventional machine learning techniques such as MLP, SVM, and Random Forest (RF). A detailed performance comparison matrix, along with the proposed solution's space and time complexity, is included in this article.
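
A sketch of an XGBoost-based feature-selection step in the spirit of the one the paper offers; the dataset shapes and threshold are placeholders:

```python
# Sketch of XGBoost-based feature selection for dysphonia measures:
# rank features by gradient-boosting importance and keep the strongest
# ones as inputs for the downstream network.
import numpy as np
from sklearn.feature_selection import SelectFromModel
from xgboost import XGBClassifier

X = np.random.rand(195, 22)                 # dysphonia measures (placeholder)
y = np.random.randint(0, 2, 195)            # PD vs. healthy labels

selector = SelectFromModel(XGBClassifier(n_estimators=100),
                           threshold="median")
X_selected = selector.fit_transform(X, y)   # keep above-median importances
print(X_selected.shape)
```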

16.
J Arthroplasty ; 38(10): 1943-1947, 2023 10.
Article in English | MEDLINE | ID: mdl-37598784

ABSTRACT

Electronic health records have facilitated the extraction and analysis of a vast amount of data with many variables for clinical care and research. Conventional regression-based statistical methods may not capture all the complexities in high-dimensional data analysis. Therefore, researchers are increasingly using machine learning (ML)-based methods to better handle these more challenging datasets for the discovery of hidden patterns in patients' data and for classification and predictive purposes. This article describes commonly used ML methods in structured data analysis with examples in orthopedic surgery. We present practical considerations in starting an ML project and appraising published studies in this field.


Subject(s)
Electronic Health Records; Machine Learning; Humans
17.
Diagnostics (Basel) ; 13(12)2023 Jun 06.
Article in English | MEDLINE | ID: mdl-37370876

ABSTRACT

Chronic Kidney Disease (CKD) represents a considerable global health challenge, emphasizing the need for precise and prompt prediction of disease progression to enable early intervention and enhance patient outcomes. In this study, we introduce an innovative fusion deep learning model that combines a Graph Neural Network (GNN) and a tabular data model for predicting CKD progression, capitalizing on the strengths of both graph-structured and tabular data representations. The GNN model processes graph-structured data, uncovering intricate relationships between patients and their medical conditions, while the tabular data model adeptly manages patient-specific features in a conventional data format. An extensive comparison of the fusion model, the GNN model, the tabular data model, and a baseline model was conducted using various evaluation metrics, encompassing accuracy, precision, recall, and F1-score. The fusion model exhibited outstanding performance across all metrics, underlining its augmented capacity for predicting CKD progression. The GNN model's performance closely trailed the fusion model's, accentuating the advantages of integrating graph-structured data into the prediction process. Hyperparameter optimization was performed using grid search, ensuring a fair comparison among the models. The fusion model displayed consistent performance across diverse data splits, demonstrating adaptability to dataset variations and resilience against noise and outliers. In conclusion, the proposed fusion deep learning model, which amalgamates the capabilities of the GNN model and the tabular data model, substantially surpasses the individual models and the baseline in predicting CKD progression. This approach provides a more precise and dependable method for early detection and management of CKD, highlighting its potential to advance precision medicine and elevate patient care.

18.
BMC Med Res Methodol ; 23(1): 8, 2023 01 11.
Article in English | MEDLINE | ID: mdl-36631766

ABSTRACT

BACKGROUND: In the older general population, neurodegenerative diseases (NDs) are associated with increased disability and decreased physical and cognitive function. Detecting risk factors can help implement prevention measures. Using deep neural networks (DNNs), a machine-learning algorithm could be an alternative to Cox regression in tabular datasets with many predictive features. We aimed to compare the performance of different types of DNNs with regularized Cox proportional hazards models to predict NDs in the older general population. METHODS: We performed a longitudinal analysis with participants of the English Longitudinal Study of Ageing. We included men and women with no NDs at baseline, aged 60 years and older, assessed every 2 years from 2004-2005 (wave 2) to 2016-2017 (wave 8). The features were a set of 91 epidemiological and clinical baseline variables. The outcome was new events of Parkinson's, Alzheimer's, or dementia. After applying multiple imputation, we trained three DNN algorithms: Feedforward, TabTransformer, and Dense Convolutional (Densenet). In addition, we trained two algorithms based on Cox models: Elastic Net regularization (CoxEn) and selected features (CoxSf). RESULTS: 5433 participants were included in wave 2. During follow-up, 12.7% of participants developed NDs. Although all five models predicted ND events, the discriminative ability was superior using TabTransformer (Uno's C-statistic (coefficient (95% confidence interval)) 0.757 (0.702, 0.805)). TabTransformer also showed superior time-dependent balanced accuracy (0.834 (0.779, 0.889)) and specificity (0.855 (0.773, 0.909)) compared with the other models. With CoxSf (hazard ratio (95% confidence interval)), age (10.0 (6.9, 14.7)), poor hearing (1.3 (1.1, 1.5)), and weight loss (1.3 (1.1, 1.6)) were associated with higher ND risk. In contrast, executive function (0.3 (0.2, 0.6)), memory (0.0 (0.0, 0.1)), increased gait speed (0.2 (0.1, 0.4)), vigorous physical activity (0.7 (0.6, 0.9)), and higher BMI (0.4 (0.2, 0.8)) were associated with lower ND risk. CONCLUSION: TabTransformer is promising for the prediction of NDs from heterogeneous tabular datasets with numerous features, and it can handle censored data. However, Cox models perform well and are easier to interpret than DNNs, so they remain a good choice for predicting NDs.
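
A sketch of the CoxEn baseline with scikit-survival's elastic-net Cox implementation; the arrays below are placeholders for the ELSA features and ND outcomes:

```python
# Sketch of an elastic-net-regularized Cox model (the CoxEn baseline)
# fitted with scikit-survival; feature and outcome arrays are placeholders.
import numpy as np
from sksurv.linear_model import CoxnetSurvivalAnalysis
from sksurv.util import Surv

X = np.random.rand(500, 91)                      # 91 baseline features
event = np.random.rand(500) < 0.13               # ND diagnosed in follow-up
time = np.random.uniform(1, 12, 500)             # years under observation

y = Surv.from_arrays(event=event, time=time)     # structured survival outcome
coxen = CoxnetSurvivalAnalysis(l1_ratio=0.5)     # mix of L1 and L2 penalties
coxen.fit(X, y)
risk_scores = coxen.predict(X)                   # higher = earlier event
```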


Subject(s)
Neurodegenerative Diseases; Male; Humans; Female; Middle Aged; Aged; Cohort Studies; Longitudinal Studies; Neurodegenerative Diseases/diagnosis; Neurodegenerative Diseases/epidemiology; Machine Learning; Neural Networks, Computer
19.
Article in English | MEDLINE | ID: mdl-39027675

ABSTRACT

Machine learning applications are widespread because supervised learning from known data labels is straightforward. However, many data samples in real-world scenarios, including medicine, are unlabeled, because data annotation can be time-consuming and error-prone. The application and evaluation of unsupervised clustering methods are not trivial and are often limited to traditional methods (e.g., k-means), even when clinicians demand deeper insights into patient data beyond classification accuracy. The contribution of this paper is three-fold: 1) to introduce a patient stratification strategy based on a clinical variable instead of a diagnostic label, 2) to evaluate clustering performance using within-cluster homogeneity and between-cluster statistical difference, and 3) to compare widely used traditional clustering algorithms (e.g., k-means) with a state-of-the-art deep learning solution for clustering tabular data. The deep clustering method achieves superior within-cluster homogeneity and between-cluster separation compared with k-means and identifies three statistically distinct and clinically interpretable high-blood-pressure patient clusters. The proposed clustering strategy and evaluation metrics will facilitate the stratification of large patient cohorts in health science research without requiring explicit diagnostic labels.
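
A sketch of the proposed evaluation strategy with k-means standing in for the clustering model: stratify by a clinical variable, then test between-cluster difference and within-cluster homogeneity. Data and group counts are placeholders:

```python
# Sketch of the evaluation idea: cluster patients, then test whether the
# stratifying clinical variable (e.g., blood pressure) differs between
# clusters (Kruskal-Wallis) and how homogeneous it is within each cluster.
import numpy as np
from scipy.stats import kruskal
from sklearn.cluster import KMeans

X = np.random.rand(300, 12)                 # tabular patient features
bp = np.random.normal(130, 15, 300)         # stratifying clinical variable

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

groups = [bp[labels == k] for k in range(3)]
stat, p = kruskal(*groups)                  # between-cluster difference
within_sd = [g.std() for g in groups]       # within-cluster homogeneity
print(f"Kruskal-Wallis p={p:.3g}, within-cluster SDs={within_sd}")
```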

20.
Front Big Data ; 6: 1296508, 2023.
Article in English | MEDLINE | ID: mdl-38260053

ABSTRACT

The usage of synthetic data is gaining momentum, in part due to the unavailability of original data because of privacy and legal considerations, and in part due to its utility as an augmentation of authentic data. Generative adversarial networks (GANs), a paragon of generative models, initially for images and subsequently for tabular data, have contributed many of the state-of-the-art synthesizers. As GANs improve, the synthesized data increasingly resemble the real data, risking privacy leakage. Differential privacy (DP) provides theoretical guarantees on privacy loss but degrades data utility; striking the best trade-off remains a challenging research question. In this study, we propose CTAB-GAN+, a novel conditional tabular GAN. CTAB-GAN+ improves upon the state of the art by (i) adding downstream losses to the conditional GAN for higher-utility synthetic data in both classification and regression domains; (ii) using Wasserstein loss with gradient penalty for better training convergence; (iii) introducing novel encoders targeting mixed continuous-categorical variables and variables with unbalanced or skewed data; and (iv) training with DP stochastic gradient descent to impose strict privacy guarantees. We extensively evaluate CTAB-GAN+ against state-of-the-art tabular GANs on statistical similarity and machine learning utility. The results show that CTAB-GAN+ synthesizes privacy-preserving data with at least 21.9% higher machine learning utility (i.e., F1-score) across multiple datasets and learning tasks under a given privacy budget.
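
A minimal PyTorch sketch of the Wasserstein gradient penalty named in (ii); `critic` is any network scoring tabular rows, and the penalty weight in the usage comment is the conventional value, not necessarily CTAB-GAN+'s setting:

```python
# Sketch of the Wasserstein gradient penalty: the critic's gradient norm
# is pushed toward 1 on random interpolations between real and fake rows.
import torch

def gradient_penalty(critic, real, fake):
    eps = torch.rand(real.size(0), 1)                   # per-row mix weight
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(interp)
    grads, = torch.autograd.grad(score.sum(), interp, create_graph=True)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()     # push norm toward 1

# usage inside the critic step (a penalty weight of 10 is conventional):
# loss_d = fake_score.mean() - real_score.mean() + 10 * gradient_penalty(...)
```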
