Search | VHL Regional Portal

1.

DREAMER: a computational framework to evaluate readiness of datasets for machine learning.

Ahangaran, Meysam; Zhu, Hanzhi; Li, Ruihui; Yin, Lingkai; Jang, Joseph; Chaudhry, Arnav P; Farrer, Lindsay A; Au, Rhoda; Kolachalama, Vijaya B.

BMC Med Inform Decis Mak ; 24(1): 152, 2024 Jun 04.

Article in English | MEDLINE | ID: mdl-38831432

ABSTRACT

BACKGROUND: Machine learning (ML) has emerged as the predominant computational paradigm for analyzing large-scale datasets across diverse domains. The assessment of dataset quality stands as a pivotal precursor to the successful deployment of ML models. In this study, we introduce DREAMER (Data REAdiness for MachinE learning Research), an algorithmic framework leveraging supervised and unsupervised machine learning techniques to autonomously evaluate the suitability of tabular datasets for ML model development. DREAMER is openly accessible as a tool on GitHub and Docker, facilitating its adoption and further refinement within the research community.. RESULTS: The proposed model in this study was applied to three distinct tabular datasets, resulting in notable enhancements in their quality with respect to readiness for ML tasks, as assessed through established data quality metrics. Our findings demonstrate the efficacy of the framework in substantially augmenting the original dataset quality, achieved through the elimination of extraneous features and rows. This refinement yielded improved accuracy across both supervised and unsupervised learning methodologies. CONCLUSION: Our software presents an automated framework for data readiness, aimed at enhancing the integrity of raw datasets to facilitate robust utilization within ML pipelines. Through our proposed framework, we streamline the original dataset, resulting in enhanced accuracy and efficiency within the associated ML algorithms.

Subject(s)

Machine Learning , Humans , Datasets as Topic , Unsupervised Machine Learning , Algorithms , Supervised Machine Learning , Software

2.

A cautionary tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement.

Huckvale, Erik D; Moseley, Hunter N B.

PLoS One ; 19(5): e0299583, 2024.

Article in English | MEDLINE | ID: mdl-38696410

ABSTRACT

The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representations strings (KEGG-SMILES dataset) and contained a sizable proportion (~26%) duplicate entries. The presence of so many duplicates taint the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.

Subject(s)

Metabolic Networks and Pathways , Supervised Machine Learning , Humans , Datasets as Topic

3.

Harnessing Consumer Wearable Digital Biomarkers for Individualized Recognition of Postpartum Depression Using the All of Us Research Program Data Set: Cross-Sectional Study.

Hurwitz, Eric; Butzin-Dozier, Zachary; Master, Hiral; O'Neil, Shawn T; Walden, Anita; Holko, Michelle; Patel, Rena C; Haendel, Melissa A.

JMIR Mhealth Uhealth ; 12: e54622, 2024 May 02.

Article in English | MEDLINE | ID: mdl-38696234

ABSTRACT

BACKGROUND: Postpartum depression (PPD) poses a significant maternal health challenge. The current approach to detecting PPD relies on in-person postpartum visits, which contributes to underdiagnosis. Furthermore, recognizing PPD symptoms can be challenging. Therefore, we explored the potential of using digital biomarkers from consumer wearables for PPD recognition. OBJECTIVE: The main goal of this study was to showcase the viability of using machine learning (ML) and digital biomarkers related to heart rate, physical activity, and energy expenditure derived from consumer-grade wearables for the recognition of PPD. METHODS: Using the All of Us Research Program Registered Tier v6 data set, we performed computational phenotyping of women with and without PPD following childbirth. Intraindividual ML models were developed using digital biomarkers from Fitbit to discern between prepregnancy, pregnancy, postpartum without depression, and postpartum with depression (ie, PPD diagnosis) periods. Models were built using generalized linear models, random forest, support vector machine, and k-nearest neighbor algorithms and evaluated using the κ statistic and multiclass area under the receiver operating characteristic curve (mAUC) to determine the algorithm with the best performance. The specificity of our individualized ML approach was confirmed in a cohort of women who gave birth and did not experience PPD. Moreover, we assessed the impact of a previous history of depression on model performance. We determined the variable importance for predicting the PPD period using Shapley additive explanations and confirmed the results using a permutation approach. Finally, we compared our individualized ML methodology against a traditional cohort-based ML model for PPD recognition and compared model performance using sensitivity, specificity, precision, recall, and F1-score. RESULTS: Patient cohorts of women with valid Fitbit data who gave birth included <20 with PPD and 39 without PPD. Our results demonstrated that intraindividual models using digital biomarkers discerned among prepregnancy, pregnancy, postpartum without depression, and postpartum with depression (ie, PPD diagnosis) periods, with random forest (mAUC=0.85; κ=0.80) models outperforming generalized linear models (mAUC=0.82; κ=0.74), support vector machine (mAUC=0.75; κ=0.72), and k-nearest neighbor (mAUC=0.74; κ=0.62). Model performance decreased in women without PPD, illustrating the method's specificity. Previous depression history did not impact the efficacy of the model for PPD recognition. Moreover, we found that the most predictive biomarker of PPD was calories burned during the basal metabolic rate. Finally, individualized models surpassed the performance of a conventional cohort-based model for PPD detection. CONCLUSIONS: This research establishes consumer wearables as a promising tool for PPD identification and highlights personalized ML approaches, which could transform early disease detection strategies.

Subject(s)

Biomarkers , Depression, Postpartum , Wearable Electronic Devices , Humans , Depression, Postpartum/diagnosis , Depression, Postpartum/psychology , Female , Adult , Biomarkers/analysis , Cross-Sectional Studies , Wearable Electronic Devices/statistics & numerical data , Wearable Electronic Devices/standards , Machine Learning/standards , Pregnancy , United States , Datasets as Topic , ROC Curve

4.

BarlowTwins-CXR: enhancing chest X-ray abnormality localization in heterogeneous data with cross-domain self-supervised learning.

Sheng, Haoyue; Ma, Linrui; Samson, Jean-François; Liu, Dianbo.

BMC Med Inform Decis Mak ; 24(1): 126, 2024 May 16.

Article in English | MEDLINE | ID: mdl-38755563

ABSTRACT

BACKGROUND: Chest X-ray imaging based abnormality localization, essential in diagnosing various diseases, faces significant clinical challenges due to complex interpretations and the growing workload of radiologists. While recent advances in deep learning offer promising solutions, there is still a critical issue of domain inconsistency in cross-domain transfer learning, which hampers the efficiency and accuracy of diagnostic processes. This study aims to address the domain inconsistency problem and improve autonomic abnormality localization performance of heterogeneous chest X-ray image analysis, particularly in detecting abnormalities, by developing a self-supervised learning strategy called "BarlwoTwins-CXR". METHODS: We utilized two publicly available datasets: the NIH Chest X-ray Dataset and the VinDr-CXR. The BarlowTwins-CXR approach was conducted in a two-stage training process. Initially, self-supervised pre-training was performed using an adjusted Barlow Twins algorithm on the NIH dataset with a Resnet50 backbone pre-trained on ImageNet. This was followed by supervised fine-tuning on the VinDr-CXR dataset using Faster R-CNN with Feature Pyramid Network (FPN). The study employed mean Average Precision (mAP) at an Intersection over Union (IoU) of 50% and Area Under the Curve (AUC) for performance evaluation. RESULTS: Our experiments showed a significant improvement in model performance with BarlowTwins-CXR. The approach achieved a 3% increase in mAP50 accuracy compared to traditional ImageNet pre-trained models. In addition, the Ablation CAM method revealed enhanced precision in localizing chest abnormalities. The study involved 112,120 images from the NIH dataset and 18,000 images from the VinDr-CXR dataset, indicating robust training and testing samples. CONCLUSION: BarlowTwins-CXR significantly enhances the efficiency and accuracy of chest X-ray image-based abnormality localization, outperforming traditional transfer learning methods and effectively overcoming domain inconsistency in cross-domain scenarios. Our experiment results demonstrate the potential of using self-supervised learning to improve the generalizability of models in medical settings with limited amounts of heterogeneous data. This approach can be instrumental in aiding radiologists, particularly in high-workload environments, offering a promising direction for future AI-driven healthcare solutions.

Subject(s)

Radiography, Thoracic , Supervised Machine Learning , Humans , Deep Learning , Radiographic Image Interpretation, Computer-Assisted/methods , Datasets as Topic

5.

Generalizing the Enhanced-Deep-Super-Resolution Neural Network to Brain MR Images: A Retrospective Study on the Cam-CAN Dataset.

Fiscone, Cristiana; Curti, Nico; Ceccarelli, Mattia; Remondini, Daniel; Testa, Claudia; Lodi, Raffaele; Tonon, Caterina; Manners, David Neil; Castellani, Gastone.

eNeuro ; 11(5)2024 May.

Article in English | MEDLINE | ID: mdl-38729763

ABSTRACT

The Enhanced-Deep-Super-Resolution (EDSR) model is a state-of-the-art convolutional neural network suitable for improving image spatial resolution. It was previously trained with general-purpose pictures and then, in this work, tested on biomedical magnetic resonance (MR) images, comparing the network outcomes with traditional up-sampling techniques. We explored possible changes in the model response when different MR sequences were analyzed. T1w and T2w MR brain images of 70 human healthy subjects (F:M, 40:30) from the Cambridge Centre for Ageing and Neuroscience (Cam-CAN) repository were down-sampled and then up-sampled using EDSR model and BiCubic (BC) interpolation. Several reference metrics were used to quantitatively assess the performance of up-sampling operations (RMSE, pSNR, SSIM, and HFEN). Two-dimensional and three-dimensional reconstructions were evaluated. Different brain tissues were analyzed individually. The EDSR model was superior to BC interpolation on the selected metrics, both for two- and three- dimensional reconstructions. The reference metrics showed higher quality of EDSR over BC reconstructions for all the analyzed images, with a significant difference of all the criteria in T1w images and of the perception-based SSIM and HFEN in T2w images. The analysis per tissue highlights differences in EDSR performance related to the gray-level values, showing a relative lack of outperformance in reconstructing hyperintense areas. The EDSR model, trained on general-purpose images, better reconstructs MR T1w and T2w images than BC, without any retraining or fine-tuning. These results highlight the excellent generalization ability of the network and lead to possible applications on other MR measurements.

Subject(s)

Brain , Magnetic Resonance Imaging , Neural Networks, Computer , Humans , Magnetic Resonance Imaging/methods , Male , Female , Retrospective Studies , Brain/diagnostic imaging , Adult , Middle Aged , Image Processing, Computer-Assisted/methods , Aged , Deep Learning , Datasets as Topic

6.

A meta-analysis on global change drivers and the risk of infectious disease.

Mahon, Michael B; Sack, Alexandra; Aleuy, O Alejandro; Barbera, Carly; Brown, Ethan; Buelow, Heather; Civitello, David J; Cohen, Jeremy M; de Wit, Luz A; Forstchen, Meghan; Halliday, Fletcher W; Heffernan, Patrick; Knutie, Sarah A; Korotasz, Alexis; Larson, Joanna G; Rumschlag, Samantha L; Selland, Emily; Shepack, Alexander; Vincent, Nitin; Rohr, Jason R.

Nature ; 629(8013): 830-836, 2024 May.

Article in English | MEDLINE | ID: mdl-38720068

ABSTRACT

Anthropogenic change is contributing to the rise in emerging infectious diseases, which are significantly correlated with socioeconomic, environmental and ecological factors1. Studies have shown that infectious disease risk is modified by changes to biodiversity2-6, climate change7-11, chemical pollution12-14, landscape transformations15-20 and species introductions21. However, it remains unclear which global change drivers most increase disease and under what contexts. Here we amassed a dataset from the literature that contains 2,938 observations of infectious disease responses to global change drivers across 1,497 host-parasite combinations, including plant, animal and human hosts. We found that biodiversity loss, chemical pollution, climate change and introduced species are associated with increases in disease-related end points or harm, whereas urbanization is associated with decreases in disease end points. Natural biodiversity gradients, deforestation and forest fragmentation are comparatively unimportant or idiosyncratic as drivers of disease. Overall, these results are consistent across human and non-human diseases. Nevertheless, context-dependent effects of the global change drivers on disease were found to be common. The findings uncovered by this meta-analysis should help target disease management and surveillance efforts towards global change drivers that increase disease. Specifically, reducing greenhouse gas emissions, managing ecosystem health, and preventing biological invasions and biodiversity loss could help to reduce the burden of plant, animal and human diseases, especially when coupled with improvements to social and economic determinants of health.

Subject(s)

Biodiversity , Climate Change , Communicable Diseases , Environmental Pollution , Introduced Species , Animals , Humans , Anthropogenic Effects , Climate Change/statistics & numerical data , Communicable Diseases/epidemiology , Communicable Diseases/etiology , Conservation of Natural Resources/trends , Datasets as Topic , Environmental Pollution/adverse effects , Forestry , Forests , Introduced Species/statistics & numerical data , Plant Diseases/etiology , Risk Assessment , Urbanization

7.

Data normalization for addressing the challenges in the analysis of single-cell transcriptomic datasets.

Cuevas-Diaz Duran, Raquel; Wei, Haichao; Wu, Jiaqian.

BMC Genomics ; 25(1): 444, 2024 May 06.

Article in English | MEDLINE | ID: mdl-38711017

ABSTRACT

BACKGROUND: Normalization is a critical step in the analysis of single-cell RNA-sequencing (scRNA-seq) datasets. Its main goal is to make gene counts comparable within and between cells. To do so, normalization methods must account for technical and biological variability. Numerous normalization methods have been developed addressing different sources of dispersion and making specific assumptions about the count data. MAIN BODY: The selection of a normalization method has a direct impact on downstream analysis, for example differential gene expression and cluster identification. Thus, the objective of this review is to guide the reader in making an informed decision on the most appropriate normalization method to use. To this aim, we first give an overview of the different single cell sequencing platforms and methods commonly used including isolation and library preparation protocols. Next, we discuss the inherent sources of variability of scRNA-seq datasets. We describe the categories of normalization methods and include examples of each. We also delineate imputation and batch-effect correction methods. Furthermore, we describe data-driven metrics commonly used to evaluate the performance of normalization methods. We also discuss common scRNA-seq methods and toolkits used for integrated data analysis. CONCLUSIONS: According to the correction performed, normalization methods can be broadly classified as within and between-sample algorithms. Moreover, with respect to the mathematical model used, normalization methods can further be classified into: global scaling methods, generalized linear models, mixed methods, and machine learning-based methods. Each of these methods depict pros and cons and make different statistical assumptions. However, there is no better performing normalization method. Instead, metrics such as silhouette width, K-nearest neighbor batch-effect test, or Highly Variable Genes are recommended to assess the performance of normalization methods.

Subject(s)

Single-Cell Analysis , Animals , Humans , Algorithms , Gene Expression Profiling/methods , Gene Expression Profiling/standards , RNA-Seq/methods , RNA-Seq/standards , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Transcriptome , Datasets as Topic

8.

A multidimensional dataset for structure-based machine learning.

Holcomb, Matthew; Forli, Stefano.

Nat Comput Sci ; 4(5): 318-319, 2024 May.

Article in English | MEDLINE | ID: mdl-38745109

Subject(s)

Machine Learning , Humans , Datasets as Topic

9.

A refined approach for evaluating small datasets via binary classification using machine learning.

Steinert, Steffen; Ruf, Verena; Dzsotjan, David; Großmann, Nicolas; Schmidt, Albrecht; Kuhn, Jochen; Küchemann, Stefan.

PLoS One ; 19(5): e0301276, 2024.

Article in English | MEDLINE | ID: mdl-38771767

ABSTRACT

Classical statistical analysis of data can be complemented or replaced with data analysis based on machine learning. However, in certain disciplines, such as education research, studies are frequently limited to small datasets, which raises several questions regarding biases and coincidentally positive results. In this study, we present a refined approach for evaluating the performance of a binary classification based on machine learning for small datasets. The approach includes a non-parametric permutation test as a method to quantify the probability of the results generalising to new data. Furthermore, we found that a repeated nested cross-validation is almost free of biases and yields reliable results that are only slightly dependent on chance. Considering the advantages of several evaluation metrics, we suggest a combination of more than one metric to train and evaluate machine learning classifiers. In the specific case that both classes are equally important, the Matthews correlation coefficient exhibits the lowest bias and chance for coincidentally good results. The results indicate that it is essential to avoid several biases when analysing small datasets using machine learning.

Subject(s)

Machine Learning , Humans , Algorithms , Datasets as Topic

10.

Evolvability predicts macroevolution under fluctuating selection.

Holstad, Agnes; Voje, Kjetil L; Opedal, Øystein H; Bolstad, Geir H; Bourg, Salomé; Hansen, Thomas F; Pélabon, Christophe.

Science ; 384(6696): 688-693, 2024 May 10.

Article in English | MEDLINE | ID: mdl-38723067

ABSTRACT

Heritable variation is a prerequisite for evolutionary change, but the relevance of genetic constraints on macroevolutionary timescales is debated. By using two datasets on fossil and contemporary taxa, we show that evolutionary divergence among populations, and to a lesser extent among species, increases with microevolutionary evolvability. We evaluate and reject several hypotheses to explain this relationship and propose that an effect of evolvability on population and species divergence can be explained by the influence of genetic constraints on the ability of populations to track rapid, stationary environmental fluctuations.

Subject(s)

Biological Evolution , Fossils , Selection, Genetic , Animals , Genetic Variation , Datasets as Topic

11.

Standardised Versioning of Datasets: a FAIR-compliant Proposal.

González-Cebrián, Alba; Bradford, Michael; Chis, Adriana E; González-Vélez, Horacio.

Sci Data ; 11(1): 358, 2024 Apr 09.

Article in English | MEDLINE | ID: mdl-38594314

ABSTRACT

This paper presents a standardised dataset versioning framework for improved reusability, recognition and data version tracking, facilitating comparisons and informed decision-making for data usability and workflow integration. The framework adopts a software engineering-like data versioning nomenclature ("major.minor.patch") and incorporates data schema principles to promote reproducibility and collaboration. To quantify changes in statistical properties over time, the concept of data drift metrics (d) is introduced. Three metrics (dP, dE,PCA, and dE,AE) based on unsupervised Machine Learning techniques (Principal Component Analysis and Autoencoders) are evaluated for dataset creation, update, and deletion. The optimal choice is the dE,PCA metric, combining PCA models with splines. It exhibits efficient computational time, with values below 50 for new dataset batches and values consistent with seasonal or trend variations. Major updates (i.e., values of 100) occur when scaling transformations are applied to over 30% of variables while efficiently handling information loss, yielding values close to 0. This metric achieved a favourable trade-off between interpretability, robustness against information loss, and computation time.

Subject(s)

Datasets as Topic , Software , Principal Component Analysis , Reproducibility of Results , Workflow , Datasets as Topic/standards , Machine Learning

12.

MSLTE: multiple self-supervised learning tasks for enhancing EEG emotion recognition.

Li, Guangqiang; Chen, Ning; Niu, Yixiang; Xu, Zhangyong; Dong, Yuxuan; Jin, Jing; Zhu, Hongqin.

J Neural Eng ; 21(2)2024 Apr 17.

Article in English | MEDLINE | ID: mdl-38588700

ABSTRACT

Objective. The instability of the EEG acquisition devices may lead to information loss in the channels or frequency bands of the collected EEG. This phenomenon may be ignored in available models, which leads to the overfitting and low generalization of the model.Approach. Multiple self-supervised learning tasks are introduced in the proposed model to enhance the generalization of EEG emotion recognition and reduce the overfitting problem to some extent. Firstly, channel masking and frequency masking are introduced to simulate the information loss in certain channels and frequency bands resulting from the instability of EEG, and two self-supervised learning-based feature reconstruction tasks combining masked graph autoencoders (GAE) are constructed to enhance the generalization of the shared encoder. Secondly, to take full advantage of the complementary information contained in these two self-supervised learning tasks to ensure the reliability of feature reconstruction, a weight sharing (WS) mechanism is introduced between the two graph decoders. Thirdly, an adaptive weight multi-task loss (AWML) strategy based on homoscedastic uncertainty is adopted to combine the supervised learning loss and the two self-supervised learning losses to enhance the performance further.Main results. Experimental results on SEED, SEED-V, and DEAP datasets demonstrate that: (i) Generally, the proposed model achieves higher averaged emotion classification accuracy than various baselines included in both subject-dependent and subject-independent scenarios. (ii) Each key module contributes to the performance enhancement of the proposed model. (iii) It achieves higher training efficiency, and significantly lower model size and computational complexity than the state-of-the-art (SOTA) multi-task-based model. (iv) The performances of the proposed model are less influenced by the key parameters.Significance. The introduction of the self-supervised learning task helps to enhance the generalization of the EEG emotion recognition model and eliminate overfitting to some extent, which can be modified to be applied in other EEG-based classification tasks.

Subject(s)

Electroencephalography , Emotions , Supervised Machine Learning , Supervised Machine Learning/standards , Datasets as Topic , Humans

13.

Environmental drivers of increased ecosystem respiration in a warming tundra.

Maes, S L; Dietrich, J; Midolo, G; Schwieger, S; Kummu, M; Vandvik, V; Aerts, R; Althuizen, I H J; Biasi, C; Björk, R G; Böhner, H; Carbognani, M; Chiari, G; Christiansen, C T; Clemmensen, K E; Cooper, E J; Cornelissen, J H C; Elberling, B; Faubert, P; Fetcher, N; Forte, T G W; Gaudard, J; Gavazov, K; Guan, Z; Guðmundsson, J; Gya, R; Hallin, S; Hansen, B B; Haugum, S V; He, J-S; Hicks Pries, C; Hovenden, M J; Jalava, M; Jónsdóttir, I S; Juhanson, J; Jung, J Y; Kaarlejärvi, E; Kwon, M J; Lamprecht, R E; Le Moullec, M; Lee, H; Marushchak, M E; Michelsen, A; Munir, T M; Myrsky, E M; Nielsen, C S; Nyberg, M; Olofsson, J; Óskarsson, H; Parker, T C.

Nature ; 629(8010): 105-113, 2024 May.

Article in English | MEDLINE | ID: mdl-38632407

ABSTRACT

Arctic and alpine tundra ecosystems are large reservoirs of organic carbon1,2. Climate warming may stimulate ecosystem respiration and release carbon into the atmosphere3,4. The magnitude and persistency of this stimulation and the environmental mechanisms that drive its variation remain uncertain5-7. This hampers the accuracy of global land carbon-climate feedback projections7,8. Here we synthesize 136 datasets from 56 open-top chamber in situ warming experiments located at 28 arctic and alpine tundra sites which have been running for less than 1 year up to 25 years. We show that a mean rise of 1.4 °C [confidence interval (CI) 0.9-2.0 °C] in air and 0.4 °C [CI 0.2-0.7 °C] in soil temperature results in an increase in growing season ecosystem respiration by 30% [CI 22-38%] (n = 136). Our findings indicate that the stimulation of ecosystem respiration was due to increases in both plant-related and microbial respiration (n = 9) and continued for at least 25 years (n = 136). The magnitude of the warming effects on respiration was driven by variation in warming-induced changes in local soil conditions, that is, changes in total nitrogen concentration and pH and by context-dependent spatial variation in these conditions, in particular total nitrogen concentration and the carbon:nitrogen ratio. Tundra sites with stronger nitrogen limitations and sites in which warming had stimulated plant and microbial nutrient turnover seemed particularly sensitive in their respiration response to warming. The results highlight the importance of local soil conditions and warming-induced changes therein for future climatic impacts on respiration.

Subject(s)

Cell Respiration , Ecosystem , Global Warming , Tundra , Arctic Regions , Carbon/metabolism , Carbon/analysis , Carbon Cycle , Datasets as Topic , Hydrogen-Ion Concentration , Nitrogen/metabolism , Nitrogen/analysis , Plants/metabolism , Seasons , Soil/chemistry , Soil Microbiology , Temperature , Time Factors

14.

Development and external validation of deep learning clinical prediction models using variable-length time series data.

Bashiri, Fereshteh S; Carey, Kyle A; Martin, Jennie; Koyner, Jay L; Edelson, Dana P; Gilbert, Emily R; Mayampurath, Anoop; Afshar, Majid; Churpek, Matthew M.

J Am Med Inform Assoc ; 31(6): 1322-1330, 2024 May 20.

Article in English | MEDLINE | ID: mdl-38679906

ABSTRACT

OBJECTIVES: To compare and externally validate popular deep learning model architectures and data transformation methods for variable-length time series data in 3 clinical tasks (clinical deterioration, severe acute kidney injury [AKI], and suspected infection). MATERIALS AND METHODS: This multicenter retrospective study included admissions at 2 medical centers that spanned 2007-2022. Distinct datasets were created for each clinical task, with 1 site used for training and the other for testing. Three feature engineering methods (normalization, standardization, and piece-wise linear encoding with decision trees [PLE-DTs]) and 3 architectures (long short-term memory/gated recurrent unit [LSTM/GRU], temporal convolutional network, and time-distributed wrapper with convolutional neural network [TDW-CNN]) were compared in each clinical task. Model discrimination was evaluated using the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC). RESULTS: The study comprised 373 825 admissions for training and 256 128 admissions for testing. LSTM/GRU models tied with TDW-CNN models with both obtaining the highest mean AUPRC in 2 tasks, and LSTM/GRU had the highest mean AUROC across all tasks (deterioration: 0.81, AKI: 0.92, infection: 0.87). PLE-DT with LSTM/GRU achieved the highest AUPRC in all tasks. DISCUSSION: When externally validated in 3 clinical tasks, the LSTM/GRU model architecture with PLE-DT transformed data demonstrated the highest AUPRC in all tasks. Multiple models achieved similar performance when evaluated using AUROC. CONCLUSION: The LSTM architecture performs as well or better than some newer architectures, and PLE-DT may enhance the AUPRC in variable-length time series data for predicting clinical outcomes during external validation.

Subject(s)

Deep Learning , Humans , Retrospective Studies , Acute Kidney Injury , Neural Networks, Computer , ROC Curve , Male , Datasets as Topic , Female , Middle Aged

15.

Large dataset analyses advance knowledge of seed ecology and evolutionary biology.

Rosbakh, Sergey; Carta, Angelino; Fernández-Pascual, Eduardo; Phartyal, Shyam S; Dayrell, Roberta L C; Mattana, Efisio; Saatkamp, Arne; Vandelook, Filip; Baskin, Jerry M; Baskin, Carol C.

New Phytol ; 242(6): 2399-2400, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38643978

Subject(s)

Biological Evolution , Ecology , Seeds , Seeds/physiology , Seeds/genetics , Datasets as Topic

16.

PresRecST: a novel herbal prescription recommendation algorithm for real-world patients with integration of syndrome differentiation and treatment planning.

Dong, Xin; Zhao, Chenxi; Song, Xinpeng; Zhang, Lei; Liu, Yu; Wu, Jun; Xu, Yiran; Xu, Ning; Liu, Jialing; Yu, Haibin; Yang, Kuo; Zhou, Xuezhong.

J Am Med Inform Assoc ; 31(6): 1268-1279, 2024 May 20.

Article in English | MEDLINE | ID: mdl-38598532

ABSTRACT

OBJECTIVES: Herbal prescription recommendation (HPR) is a hot topic and challenging issue in field of clinical decision support of traditional Chinese medicine (TCM). However, almost all previous HPR methods have not adhered to the clinical principles of syndrome differentiation and treatment planning of TCM, which has resulted in suboptimal performance and difficulties in application to real-world clinical scenarios. MATERIALS AND METHODS: We emphasize the synergy among diagnosis and treatment procedure in real-world TCM clinical settings to propose the PresRecST model, which effectively combines the key components of symptom collection, syndrome differentiation, treatment method determination, and herb recommendation. This model integrates a self-curated TCM knowledge graph to learn the high-quality representations of TCM biomedical entities and performs 3 stages of clinical predictions to meet the principle of systematic sequential procedure of TCM decision making. RESULTS: To address the limitations of previous datasets, we constructed the TCM-Lung dataset, which is suitable for the simultaneous training of the syndrome differentiation, treatment method determination, and herb recommendation. Overall experimental results on 2 datasets demonstrate that the proposed PresRecST outperforms the state-of-the-art algorithm by significant improvements (eg, improvements of P@5 by 4.70%, P@10 by 5.37%, P@20 by 3.08% compared with the best baseline). DISCUSSION: The workflow of PresRecST effectively integrates the embedding vectors of the knowledge graph for progressive recommendation tasks, and it closely aligns with the actual diagnostic and treatment procedures followed by TCM doctors. A series of ablation experiments and case study show the availability and interpretability of PresRecST, indicating the proposed PresRecST can be beneficial for assisting the diagnosis and treatment in real-world TCM clinical settings. CONCLUSION: Our technology can be applied in a progressive recommendation scenario, providing recommendations for related items in a progressive manner, which can assist in providing more reliable diagnoses and herbal therapies for TCM clinical task.

Subject(s)

Algorithms , Drugs, Chinese Herbal , Medicine, Chinese Traditional , Humans , Medicine, Chinese Traditional/methods , Drugs, Chinese Herbal/therapeutic use , Decision Support Systems, Clinical , Diagnosis, Differential , Syndrome , Datasets as Topic , Drug Prescriptions

17.

Constructing synthetic datasets with generative artificial intelligence to train large language models to classify acute renal failure from clinical notes.

Litake, Onkar; Park, Brian H; Tully, Jeffrey L; Gabriel, Rodney A.

J Am Med Inform Assoc ; 31(6): 1404-1410, 2024 May 20.

Article in English | MEDLINE | ID: mdl-38622901

ABSTRACT

OBJECTIVES: To compare performances of a classifier that leverages language models when trained on synthetic versus authentic clinical notes. MATERIALS AND METHODS: A classifier using language models was developed to identify acute renal failure. Four types of training data were compared: (1) notes from MIMIC-III; and (2, 3, and 4) synthetic notes generated by ChatGPT of varied text lengths of 15 (GPT-15 sentences), 30 (GPT-30 sentences), and 45 (GPT-45 sentences) sentences, respectively. The area under the receiver operating characteristics curve (AUC) was calculated from a test set from MIMIC-III. RESULTS: With RoBERTa, the AUCs were 0.84, 0.80, 0.84, and 0.76 for the MIMIC-III, GPT-15, GPT-30- and GPT-45 sentences training sets, respectively. DISCUSSION: Training language models to detect acute renal failure from clinical notes resulted in similar performances when using synthetic versus authentic training data. CONCLUSION: The use of training data derived from protected health information may not be needed.

Subject(s)

Acute Kidney Injury , Artificial Intelligence , Electronic Health Records , Humans , Acute Kidney Injury/classification , Acute Kidney Injury/diagnosis , ROC Curve , Natural Language Processing , Area Under Curve , Datasets as Topic

18.

Ghost roads and the destruction of Asia-Pacific tropical forests.

Engert, Jayden E; Campbell, Mason J; Cinner, Joshua E; Ishida, Yoko; Sloan, Sean; Supriatna, Jatna; Alamgir, Mohammed; Cislowski, Jaime; Laurance, William F.

Nature ; 629(8011): 370-375, 2024 May.

Article in English | MEDLINE | ID: mdl-38600390

ABSTRACT

Roads are expanding at the fastest pace in human history. This is the case especially in biodiversity-rich tropical nations, where roads can result in forest loss and fragmentation, wildfires, illicit land invasions and negative societal effects1-5. Many roads are being constructed illegally or informally and do not appear on any existing road map6-10; the toll of such 'ghost roads' on ecosystems is poorly understood. Here we use around 7,000 h of effort by trained volunteers to map ghost roads across the tropical Asia-Pacific region, sampling 1.42 million plots, each 1 km2 in area. Our intensive sampling revealed a total of 1.37 million km of roads in our plots-from 3.0 to 6.6 times more roads than were found in leading datasets of roads globally. Across our study area, road building almost always preceded local forest loss, and road density was by far the strongest correlate11 of deforestation out of 38 potential biophysical and socioeconomic covariates. The relationship between road density and forest loss was nonlinear, with deforestation peaking soon after roads penetrate a landscape and then declining as roads multiply and remaining accessible forests largely disappear. Notably, after controlling for lower road density inside protected areas, we found that protected areas had only modest additional effects on preventing forest loss, implying that their most vital conservation function is limiting roads and road-related environmental disruption. Collectively, our findings suggest that burgeoning, poorly studied ghost roads are among the gravest of all direct threats to tropical forests.

Subject(s)

Automobiles , Conservation of Natural Resources , Forestry , Forests , Trees , Tropical Climate , Asia , Conservation of Natural Resources/statistics & numerical data , Conservation of Natural Resources/trends , Trees/growth & development , Datasets as Topic , Forestry/methods , Forestry/statistics & numerical data , Forestry/trends

19.

Biogeographic response of marine plankton to Cenozoic environmental changes.

Swain, Anshuman; Woodhouse, Adam; Fagan, William F; Fraass, Andrew J; Lowery, Christopher M.

Nature ; 629(8012): 616-623, 2024 May.

Article in English | MEDLINE | ID: mdl-38632405

ABSTRACT

In palaeontological studies, groups with consistent ecological and morphological traits across a clade's history (functional groups)1 afford different perspectives on biodiversity dynamics than do species and genera2,3, which are evolutionarily ephemeral. Here we analyse Triton, a global dataset of Cenozoic macroperforate planktonic foraminiferal occurrences4, to contextualize changes in latitudinal equitability gradients1, functional diversity, palaeolatitudinal specialization and community equitability. We identify: global morphological communities becoming less specialized preceding the richness increase after the Cretaceous-Palaeogene extinction; ecological specialization during the Early Eocene Climatic Optimum, suggesting inhibitive equatorial temperatures during the peak of the Cenozoic hothouse; increased specialization due to circulation changes across the Eocene-Oligocene transition, preceding the loss of morphological diversity; changes in morphological specialization and richness about 19 million years ago, coeval with pelagic shark extinctions5; delayed onset of changing functional group richness and specialization between hemispheres during the mid-Miocene plankton diversification. The detailed nature of the Triton dataset permits a unique spatiotemporal view of Cenozoic pelagic macroevolution, in which global biogeographic responses of functional communities and richness are decoupled during Cenozoic climate events. The global response of functional groups to similar abiotic selection pressures may depend on the background climatic state (greenhouse or icehouse) to which a group is adapted.

Subject(s)

Aquatic Organisms , Biodiversity , Extinction, Biological , Foraminifera , Plankton , Plankton/classification , Plankton/physiology , Foraminifera/classification , Foraminifera/physiology , Aquatic Organisms/physiology , Aquatic Organisms/classification , Fossils , Datasets as Topic , Phylogeography , Biological Evolution , Climate Change , History, Ancient , Animals

20.

[Data-driven intensive care: a lack of comprehensive datasets]. / "Data-driven-Intensivmedizin": Mangel an umfassenden Datensätzen.

Hardenberg, Jan-Hendrik B.

Med Klin Intensivmed Notfmed ; 119(5): 352-357, 2024 Jun.

Article in German | MEDLINE | ID: mdl-38668882

ABSTRACT

Intensive care units provide a data-rich environment with the potential to generate datasets in the realm of big data, which could be utilized to train powerful machine learning (ML) models. However, the currently available datasets are too small and exhibit too little diversity due to their limitation to individual hospitals. This lack of extensive and varied datasets is a primary reason for the limited generalizability and resulting low clinical utility of current ML models. Often, these models are based on data from single centers and suffer from poor external validity. There is an urgent need for the development of large-scale, multicentric, and multinational datasets. Ensuring data protection and minimizing re-identification risks pose central challenges in this process. The "Amsterdam University Medical Center database (AmsterdamUMCdb)" and the "Salzburg Intensive Care database (SICdb)" demonstrate that open access datasets are possible in Europe while complying with the data protection regulations of the General Data Protection Regulation (GDPR). Another challenge in building intensive care datasets is the absence of semantic definitions in the source data and the heterogeneity of data formats. Establishing binding industry standards for the semantic definition is crucial to ensure seamless semantic interoperability between datasets.

Subject(s)

Critical Care , Intensive Care Units , Machine Learning , Humans , Critical Care/standards , Germany , Computer Security , Europe , Databases, Factual , Datasets as Topic , Big Data , Confidentiality

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL