Search | VHL Regional Portal

1.

Datasets for Benchmarking RNA Design Algorithms.

Badura, Jan; Zok, Tomasz; Rybarczyk, Agnieszka.

Methods Mol Biol ; 2847: 229-240, 2025.

Article in English | MEDLINE | ID: mdl-39312148

ABSTRACT

RNA molecules play vital roles in many biological processes, such as gene regulation or protein synthesis. The adoption of a specific secondary and tertiary structure by RNA is essential to perform these diverse functions, making RNA a popular tool in bioengineering therapeutics. The field of RNA design responds to the need to develop novel RNA molecules that possess specific functional attributes. In recent years, computational tools for predicting RNA sequences with desired folding characteristics have improved and expanded. However, there is still a lack of well-defined and standardized datasets to assess these programs. Here, we present a large dataset of internal and multibranched loops extracted from PDB-deposited RNA structures that encompass a wide spectrum of design difficulties. Furthermore, we conducted benchmarking tests of widely utilized open-source RNA design algorithms employing this dataset.

Subject(s)

Algorithms , Benchmarking , Computational Biology , Nucleic Acid Conformation , RNA , RNA/genetics , RNA/chemistry , Computational Biology/methods , Software

2.

Daily life in the Open Biologist's second job, as a Data Curator.

Scorza, Livia C T; Zielinski, Tomasz; Kalita, Irina; Lepore, Alessia; El Karoui, Meriem; Millar, Andrew J.

Wellcome Open Res ; 9: 523, 2024.

Article in English | MEDLINE | ID: mdl-39360219

ABSTRACT

Background: Data reusability is the driving force of the research data life cycle. However, implementing strategies to generate reusable data from the data creation to the sharing stages is still a significant challenge. Even when datasets supporting a study are publicly shared, the outputs are often incomplete and/or not reusable. The FAIR (Findable, Accessible, Interoperable, Reusable) principles were published as a general guidance to promote data reusability in research, but the practical implementation of FAIR principles in research groups is still falling behind. In biology, the lack of standard practices for a large diversity of data types, data storage and preservation issues, and the lack of familiarity among researchers are some of the main impeding factors to achieve FAIR data. Past literature describes biological curation from the perspective of data resources that aggregate data, often from publications. Methods: Our team works alongside data-generating, experimental researchers so our perspective aligns with publication authors rather than aggregators. We detail the processes for organizing datasets for publication, showcasing practical examples from data curation to data sharing. We also recommend strategies, tools and web resources to maximize data reusability, while maintaining research productivity. Conclusion: We propose a simple approach to address research data management challenges for experimentalists, designed to promote FAIR data sharing. This strategy not only simplifies data management, but also enhances data visibility, recognition and impact, ultimately benefiting the entire scientific community.

Researchers should openly share data associated with their publications unless there is a valid reason not to. Additionally, datasets have to be described with enough detail to ensure that they are reproducible and reusable by others. Since most research institutions offer limited professional support in this area, the responsibility for data sharing largely falls to researchers themselves. However, many research groups still struggle to follow data reusability principles in practice. In this work, we describe our data curation (data organization and management) efforts working directly with the researchers who create the data. We show the steps we took to organize, standardize, and share several datasets in biological sciences, pointing out the main challenges we faced. Finally, we suggest simple and practical data management actions, as well as tools that experimentalists can integrate into their daily work, to make sharing data easier and more effective.

3.

Hepatitis C virus infection and Parkinson's disease: insights from a joint sex-stratified BioOptimatics meta-analysis.

Narváez-Bandera, Isis; Suárez-Gómez, Deiver; Castro-Rivera, Coral Del Mar; Camasta-Beníquez, Alaina; Durán-Quintana, Morelia; Cabrera-Ríos, Mauricio; Isaza, Clara E.

Sci Rep ; 14(1): 22838, 2024 10 01.

Article in English | MEDLINE | ID: mdl-39354018

ABSTRACT

Hepatitis C virus (HCV) infection poses a significant public health challenge and often leads to long-term health complications and even death. Parkinson's disease (PD) is a progressive neurodegenerative disorder with a proposed viral etiology. HCV infection and PD have been previously suggested to be related. This work aimed to identify potential biomarkers and pathways that may play a role in the joint development of PD and HCV infection. Using BioOptimatics-bioinformatics driven by mathematical global optimization-, 22 publicly available microarray and RNAseq datasets for both diseases were analyzed, focusing on sex-specific differences. Our results revealed that 19 genes, including MT1H, MYOM2, and RPL18, exhibited significant changes in expression in both diseases. Pathway and network analyses stratified by sex indicated that these gene expression changes were enriched in processes related to immune response regulation in females and immune cell activation in males. These findings suggest a potential link between HCV infection and PD, highlighting the importance of further investigation into the underlying mechanisms and potential therapeutic targets involved.

Subject(s)

Hepatitis C , Parkinson Disease , Humans , Parkinson Disease/genetics , Parkinson Disease/virology , Female , Male , Hepatitis C/virology , Hepacivirus/genetics , Computational Biology/methods , Sex Factors , Gene Expression Profiling , Biomarkers , Gene Regulatory Networks

4.

Multi-datasets for different keyboard key sound recognition.

Rawf, Karwan M Hama; Abdulrahman, Ayub O; Kamel, Hana O; Hassan, Lawen M; Ali, Ahmad O.

Data Brief ; 57: 110949, 2024 Dec.

Article in English | MEDLINE | ID: mdl-39391001

ABSTRACT

Keyboard acoustic recognition is a pivotal area within cybersecurity and human-computer interaction, where the identification and analysis of keyboard sounds are used to enhance security measures. The performance of acoustic-based security systems can be influenced by factors such as the platform used, typing style, and environmental noise. To address these variations and provide a comprehensive resource, we present the Multi-Keyboard Acoustic (MKA) Datasets. These extensive datasets, meticulously gathered by a team in the Computer Science Department at the University of Halabja, include recordings from six widely-used platforms: HP, Lenovo, MSI, Mac, Messenger, and Zoom. The MKA datasets have structured data for each platform, including raw recordings, segmented sound files, and matrices derived from these sounds. They can be used by researchers in keylogging detection, cybersecurity, and other fields related to acoustic emanation attacks on keyboards. Moreover, the datasets capture the intricacies of typing behaviour with both hands and all ten fingers by carefully segmenting and pre-processing the data using the Praat tool, thus ensuring high-quality and dependable data. This comprehensive approach allows researchers to explore various aspects of keyboard sound recognition, contributing to the development of robust recognition algorithms and enhanced security measures. The MKA Datasets stand as one of the largest and most detailed datasets in this domain, offering significant potential for advancing research and improving defences against acoustic-based threats.

5.

Integrated bioinformatics reveals genetic links between visceral obesity and uterine tumors.

Samantaray, Swayamprabha; Joshi, Nidhi; Vasa, Shrinal; Shibu, Shan; Kaloni, Aditi; Parekh, Bhavin; Modi, Anupama.

Mol Genet Genomics ; 299(1): 93, 2024 Oct 05.

Article in English | MEDLINE | ID: mdl-39368016

ABSTRACT

Visceral obesity (VO), characterized by excess fat around internal organs, is a recognized risk factor for gynecological tumors, including benign uterine leiomyoma (ULM) and malignant uterine leiomyosarcoma (ULS). Despite this association, the shared molecular mechanisms remain underexplored. This study utilizes an integrated bioinformatics approach to elucidate common molecular pathways and identify potential therapeutic targets linking VO, ULM, and ULS. We analyzed gene expression datasets from the Gene Expression Omnibus (GEO) to identify differentially expressed genes (DEGs) in each condition. We found 101, 145, and 18 DEGs in VO, ULM, and ULS, respectively, with 37 genes overlapping across all three conditions. Functional enrichment analysis revealed that these overlapping DEGs were significantly enriched in pathways related to cell proliferation, immune response, and transcriptional regulation, suggesting shared biological processes. Protein-protein interaction network analysis identified 14 hub genes, of which TOP2A, APOE, and TYMS showed significant differential expression across all three conditions. Drug-gene interaction analysis identified 26 FDA-approved drugs targeting these hub genes, highlighting potential therapeutic opportunities. In conclusion, this study uncovers shared molecular pathways and actionable drug targets across VO, ULM, and ULS. These findings deepen our understanding of disease etiology and offer promising avenues for drug repurposing. Experimental validation is needed to translate these insights into clinical applications and innovative treatments.

Subject(s)

Computational Biology , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks , Leiomyoma , Obesity, Abdominal , Protein Interaction Maps , Uterine Neoplasms , Female , Humans , Uterine Neoplasms/genetics , Uterine Neoplasms/pathology , Computational Biology/methods , Leiomyoma/genetics , Leiomyoma/pathology , Protein Interaction Maps/genetics , Obesity, Abdominal/genetics , Leiomyosarcoma/genetics , Leiomyosarcoma/pathology , Gene Expression Profiling/methods , DNA Topoisomerases, Type II/genetics , Apolipoproteins E/genetics , Databases, Genetic , Poly-ADP-Ribose Binding Proteins

6.

Epidemiological modeling of COVID-19 data with Advanced statistical inference based on Type-II progressive censoring.

Alotaibi, Naif; Al-Moisheer, A S; Hassan, Amal S; Elbatal, Ibrahim; Alyami, Salem A; Almetwally, Ehab M.

Heliyon ; 10(18): e36774, 2024 Sep 30.

Article in English | MEDLINE | ID: mdl-39315172

ABSTRACT

This research proposes the Kavya-Manoharan Unit Exponentiated Half Logistic (KM-UEHL) distribution as a novel tool for epidemiological modeling of COVID-19 data. Specifically designed to analyze data constrained to the unit interval, the KM-UEHL distribution builds upon the unit exponentiated half logistic model, making it suitable for various data from COVID-19. The paper emphasizes the KM-UEHL distribution's adaptability by examining its density and hazard rate functions. Its effectiveness is demonstrated in handling the diverse nature of COVID-19 data through these functions. Key characteristics like moments, quantile functions, stress-strength reliability, and entropy measures are also comprehensively investigated. Furthermore, the KM-UEHL distribution is employed for forecasting future COVID-19 data under a progressive Type-II censoring scheme, which acknowledges the time-dependent nature of data collection during outbreaks. The paper presents various methods for constructing prediction intervals for future-order statistics, including maximum likelihood estimation, Bayesian inference (both point and interval estimates), and upper-order statistics approaches. The Metropolis-Hastings and Gibbs sampling procedures are combined to create the Markov chain Monte Carlo simulations because it is mathematically difficult to acquire closed-form solutions for the posterior density function in the Bayesian framework. The theoretical developments are validated with numerical simulations, and the practical applicability of the KM-UEHL distribution is showcased using real-world COVID-19 datasets.

7.

Bayesian model of tilling wheat confronting climatic and sustainability challenges.

Ali, Qaisar.

Front Artif Intell ; 7: 1402098, 2024.

Article in English | MEDLINE | ID: mdl-39258233

ABSTRACT

Conventional farming poses threats to sustainable agriculture in growing food demands and increasing flooding risks. This research introduces a Bayesian Belief Network (BBN) to address these concerns. The model explores tillage adaptation for flood management in soils with varying organic carbon (OC) contents for winter wheat production. Three real soils, emphasizing texture and soil water properties, were sourced from the NETMAP soilscape of the Pang catchment area in Berkshire, United Kingdom. Modified with OC content at four levels (1, 3, 5, 7%), they were modeled alongside relevant variables in a BBN. The Decision Support System for Agrotechnology Transfer (DSSAT) simulated datasets across 48 cropping seasons to parameterize the BBN. The study compared tillage effects on wheat yield, surface runoff, and GHG-CO2 emissions, categorizing model parameters (from lower to higher bands) based on statistical data distribution. Results revealed that NT outperformed CT in the highest parametric category, comparing probabilistic estimates with reduced GHG-CO2 emissions from "7.34 to 7.31%" and cumulative runoff from "8.52 to 8.50%," while yield increased from "7.46 to 7.56%." Conversely, CT exhibited increased emissions from "7.34 to 7.36%" and cumulative runoff from "8.52 to 8.55%," along with reduced yield from "7.46 to 7.35%." The BBN model effectively captured uncertainties, offering posterior probability distributions reflecting conditional relationships across variables and offered decision choice for NT favoring soil carbon stocks in winter wheat (highest among soils "NT.OC-7%PDPG8," e.g., 286,634 kg/ha) over CT (lowest in "CT.OC-3.9%PDPG8," e.g., 5,894 kg/ha). On average, NT released minimum GHG- CO2 emissions to "3,985 kgCO2eqv/ha," while CT emitted "7,415 kgCO2eqv/ha." Conversely, NT emitted "8,747 kgCO2eqv/ha" for maximum emissions, while CT emitted "15,356 kgCO2eqv/ha." NT resulted in lower surface runoff against CT in all soils and limits runoff generations naturally for flood alleviation with the potential for customized improvement. The study recommends the model for extensive assessments of various spatiotemporal conditions. The research findings align with sustainable development goals, e.g., SDG12 and SDG13 for responsible production and climate actions, respectively, as defined by the Agriculture and Food Organization of the United Nations.

8.

Multilabel classification of biomedical data.

Diakou, Io; Iliopoulos, Eddie; Papakonstantinou, Eleni; Dragoumani, Konstantina; Yapijakis, Christos; Iliopoulos, Costas; Spandidos, Demetrios A; Chrousos, George P; Eliopoulos, Elias; Vlachakis, Dimitrios.

Med Int (Lond) ; 4(6): 68, 2024.

Article in English | MEDLINE | ID: mdl-39301328

ABSTRACT

Biomedical datasets constitute a rich source of information, containing multivariate data collected during medical practice. In spite of inherent challenges, such as missing or imbalanced data, these types of datasets are increasingly utilized as a basis for the construction of predictive machine-learning models. The prediction of disease outcomes and complications could inform the process of decision-making in the hospital setting and ensure the best possible patient management according to the patient's features. Multi-label classification algorithms, which are trained to assign a set of labels to input samples, can efficiently tackle outcome prediction tasks. Myocardial infarction (MI) represents a widespread health risk, accounting for a significant portion of heart disease-related mortality. Moreover, the danger of potential complications occurring in patients with MI during their period of hospitalization underlines the need for systems to efficiently assess the risks of patients with MI. In order to demonstrate the critical role of applying machine-learning methods in medical challenges, in the present study, a set of multi-label classifiers was evaluated on a public dataset of MI-related complications to predict the outcomes of hospitalized patients with MI, based on a set of input patient features. Such methods can be scaled through the use of larger datasets of patient records, along with fine-tuning for specific patient sub-groups or patient populations in specific regions, to increase the performance of these approaches. Overall, a prediction system based on classifiers trained on patient records may assist healthcare professionals in providing personalized care and efficient monitoring of high-risk patient subgroups.

9.

Deepfake: definitions, performance metrics and standards, datasets, and a meta-review.

Altuncu, Enes; Franqueira, Virginia N L; Li, Shujun.

Front Big Data ; 7: 1400024, 2024.

Article in English | MEDLINE | ID: mdl-39296632

ABSTRACT

Recent advancements in AI, especially deep learning, have contributed to a significant increase in the creation of new realistic-looking synthetic media (video, image, and audio) and manipulation of existing media, which has led to the creation of the new term "deepfake." Based on both the research literature and resources in English, this paper gives a comprehensive overview of deepfake, covering multiple important aspects of this emerging concept, including (1) different definitions, (2) commonly used performance metrics and standards, and (3) deepfake-related datasets. In addition, the paper also reports a meta-review of 15 selected deepfake-related survey papers published since 2020, focusing not only on the mentioned aspects but also on the analysis of key challenges and recommendations. We believe that this paper is the most comprehensive review of deepfake in terms of the aspects covered.

10.

Dataset factors associated with age-related changes in brain structure and function in neurodevelopmental conditions.

Vandewouw, Marlee M; Ye, Yifan Julia; Crosbie, Jennifer; Schachar, Russell J; Iaboni, Alana; Georgiades, Stelios; Nicolson, Robert; Kelley, Elizabeth; Ayub, Muhammad; Jones, Jessica; Arnold, Paul D; Taylor, Margot J; Lerch, Jason P; Anagnostou, Evdokia; Kushki, Azadeh.

Hum Brain Mapp ; 45(13): e26815, 2024 Sep.

Article in English | MEDLINE | ID: mdl-39254138

ABSTRACT

With brain structure and function undergoing complex changes throughout childhood and adolescence, age is a critical consideration in neuroimaging studies, particularly for those of individuals with neurodevelopmental conditions. However, despite the increasing use of large, consortium-based datasets to examine brain structure and function in neurotypical and neurodivergent populations, it is unclear whether age-related changes are consistent between datasets and whether inconsistencies related to differences in sample characteristics, such as demographics and phenotypic features, exist. To address this, we built models of age-related changes of brain structure (regional cortical thickness and regional surface area; N = 1218) and function (resting-state functional connectivity strength; N = 1254) in two neurodiverse datasets: the Province of Ontario Neurodevelopmental Network and the Healthy Brain Network. We examined whether deviations from these models differed between the datasets, and explored whether these deviations were associated with demographic and clinical variables. We found significant differences between the two datasets for measures of cortical surface area and functional connectivity strength throughout the brain. For regional measures of cortical surface area, the patterns of differences were associated with race/ethnicity, while for functional connectivity strength, positive associations were observed with head motion. Our findings highlight that patterns of age-related changes in the brain may be influenced by demographic and phenotypic characteristics, and thus future studies should consider these when examining or controlling for age effects in analyses.

Subject(s)

Datasets as Topic , Magnetic Resonance Imaging , Humans , Female , Male , Child , Adolescent , Young Adult , Adult , Neurodevelopmental Disorders/diagnostic imaging , Neurodevelopmental Disorders/physiopathology , Neurodevelopmental Disorders/pathology , Connectome , Brain/diagnostic imaging , Brain/growth & development , Brain/anatomy & histology , Cerebral Cortex/diagnostic imaging , Cerebral Cortex/growth & development , Cerebral Cortex/anatomy & histology , Aging/physiology

11.

radMLBench: A dataset collection for benchmarking in radiomics.

Demircioglu, Aydin.

Comput Biol Med ; 182: 109140, 2024 Sep 12.

Article in English | MEDLINE | ID: mdl-39270457

ABSTRACT

BACKGROUND: New machine learning methods and techniques are frequently introduced in radiomics, but they are often tested on a single dataset, which makes it challenging to assess their true benefit. Currently, there is a lack of a larger, publicly accessible dataset collection on which such assessments could be performed. In this study, a collection of radiomics datasets with binary outcomes in tabular form was curated to allow benchmarking of machine learning methods and techniques. METHODS: A variety of journals and online sources were searched to identify tabular radiomics data with binary outcomes, which were then compiled into a homogeneous data collection that is easily accessible via Python. To illustrate the utility of the dataset collection, it was applied to investigate whether feature decorrelation prior to feature selection could improve predictive performance in a radiomics pipeline. RESULTS: A total of 50 radiomic datasets were collected, with sample sizes ranging from 51 to 969 and 101 to 11165 features. Using this data, it was observed that decorrelating features did not yield any significant improvement on average. CONCLUSIONS: A large collection of datasets, easily accessible via Python, suitable for benchmarking and evaluating new machine learning techniques and methods was curated. Its utility was exemplified by demonstrating that feature decorrelation prior to feature selection does not, on average, lead to significant performance gains and could be omitted, thereby increasing the robustness and reliability of the radiomics pipeline.

12.

The Concept of a Versatile Computing Tool Chain for Utilizing the Core Data Set of the Medical Informatics Initiative in the INTERPOLAR Project.

Stäubert, Sebastian; Strübing, Alexander; Schmidt, Florian; Yahiaoui-Doktor, Maryam; Reusche, Matthias; Meineke, Frank; Neumann, Daniel; Loeffler, Markus.

Stud Health Technol Inform ; 317: 59-66, 2024 Aug 30.

Article in English | MEDLINE | ID: mdl-39234707

ABSTRACT

INTRODUCTION: To support research projects that require medical data from multiple sites is one of the goals of the German Medical Informatics Initiative (MII). The data integration centers (DIC) at university medical centers in Germany provide patient data via FHIR® in compliance with the MII core data set (CDS). Requirements for data protection and other legal bases for processing prefer decentralized processing of the relevant data in the DICs and the subsequent exchange of aggregated results for cross-site evaluation. METHODS: Requirements from clinical experts were obtained in the context of the MII use case INTERPOLAR. A software architecture was then developed, modeled using 3LGM2, finally implemented and published in a github repository. RESULTS: With the CDS tool chain, we have created software components for decentralized processing on the basis of the MII CDS. The CDS tool chain requires access to a local FHIR endpoint and then transfers the data to an SQL database. This is accessed by the DataProcessor component, which performs calculations with the help of rules (input repo) and writes the results back to the database. The CDS tool chain also has a frontend module (REDCap), which is used to display the output data and calculated results, and allows verification, evaluation, comments and other responses. This feedback is also persisted in the database and is available for further use, analysis or data sharing in the future. DISCUSSION: Other solutions are conceivable. Our solution utilizes the advantages of an SQL database. This enables flexible and direct processing of the stored data using established analysis methods. Due to the modularization, adjustments can be made so that it can be used in other projects. We are planning further developments to support pseudonymization and data sharing. Initial experience is being gathered. An evaluation is pending and planned.

Subject(s)

Software , Germany , Electronic Health Records , Humans , Medical Informatics , Computer Security , Datasets as Topic

13.

The 'SmartNIALMeter' electrical appliance disaggregation dataset.

Vogel, Manuel; Friedli, Martin; Camenzind, Martin; Kniesel, Guido; Klemenjak, Christoph; Gugolz, Gianni; Huber, Patrick; Calatroni, Alberto; Kaufmann, Lukas; Rumsch, Andreas; Paice, Andrew.

Data Brief ; 56: 110854, 2024 Oct.

Article in English | MEDLINE | ID: mdl-39286425

ABSTRACT

Electrical disaggregation, also known as non-intrusive load monitoring (NILM) or non-intrusive appliance load monitoring (NIALM), attempts to recognize the energy consumption of single electrical appliances from the aggregated signal. This capability unlocks several applications, such as giving feedback to users regarding their energy consumption patterns or helping distribution system operators (DSOs) to recognize loads which could be shifted to stabilize the electrical grid. The project "SmartNIALMeter" brought together universities, companies and DSOs and involved the collection of a large data corpus comprising 20 buildings with a total of 100 electrical appliances for a period of up to two years at a sampling interval of five seconds. The variability of the loads, including heat pumps and a charging station for electric vehicles, and the presence of single-phase and three-phase devices make this dataset suitable for several investigations. The total consumption was collected through smart meters and each appliance's consumption was measured with a dedicated sensor, providing sub-metering for all loads. The dataset can be used to tackle several open research questions, for example to investigate new NILM algorithms able to learn with a limited amount of sub-metered data.

14.

Multiple remotely sensed datasets and machine learning models to predict chlorophyll-a concentration in the Nakdong River, South Korea.

Lee, Byeongwon; Im, Jong Kwon; Han, Ji Woo; Kang, Taegu; Kim, Wonkook; Kim, Moonil; Lee, Sangchul.

Environ Sci Pollut Res Int ; 31(48): 58505-58526, 2024 Oct.

Article in English | MEDLINE | ID: mdl-39316212

ABSTRACT

The Nakdong River is a crucial water resource in South Korea, supplying water for various purposes such as potable water, irrigation, and recreation. However, the river is vulnerable to algal blooms due to the inflow of pollutants from multiple points and non-point sources. Monitoring chlorophyll-a (Chl-a) concentrations, a proxy for algal biomass is essential for assessing the trophic status of the river and managing its ecological health. This study aimed to improve the accuracy and reliability of Chl-a estimation in the Nakdong River using machine learning models (MLMs) and simultaneous use of multiple remotely sensed datasets. This study compared the performances of four MLMs: multi-layer perceptron (MLP), support vector machine (SVM), random forest (RF), and eXetreme Gradient Boosting (XGB) using three different input datasets: (1) two remotely sensed datasets (Sentinel-2 and Landsat-8), (2) standalone Sentinel-2, and (3) standalone Landsat-8. The results showed that the MLP model with multiple remotely sensed datasets outperformed other MLMs with 0.43 - 0.86 greater in R2 and 0.36 - 5.88 lower in RMSE. The MLP model demonstrated the highest performance across the range of Chl-a concentrations and predicted peaks above 20 mg/m3 relatively well compared to other models. This was likely due to the capacity of MLP to handle imbalanced datasets. The predictive map of the spatial distribution of Chl-a generated by MLP well captured the areas with high and low Chl-a concentrations. This study pointed out the impacts of imbalanced Chl-a concentration observations (dominated by low Chl-a concentrations) on the performance of MLMs. The data imbalance likely led to MLMs poorly trained for high Chl-a values, producing low prediction accuracy. In conclusion, this study demonstrated the value of multiple remotely sensed datasets in enhancing the accuracy and reliability of Chl-a estimation, mainly when using the MLP model. These findings would provide valuable insights into utilizing MLMs effectively for Chl-a monitoring.

Subject(s)

Chlorophyll A , Environmental Monitoring , Machine Learning , Rivers , Republic of Korea , Environmental Monitoring/methods , Rivers/chemistry , Chlorophyll/analysis , Remote Sensing Technology , Support Vector Machine

15.

Unlocking the potential of AI: Machine learning and deep learning models for predicting carcinogenicity of chemicals.

Guo, Wenjing; Liu, Jie; Dong, Fan; Hong, Huixiao.

J Environ Sci Health C Toxicol Carcinog ; : 1-28, 2024 Sep 03.

Article in English | MEDLINE | ID: mdl-39228157

ABSTRACT

The escalating apprehension surrounding the carcinogenic potential of chemicals emphasizes the imperative need for efficient methods of assessing carcinogenicity. Conventional experimental approaches such as in vitro and in vivo assays, albeit effective, suffer from being costly and time-consuming. In response to this challenge, new alternative methodologies, notably machine learning and deep learning techniques, have attracted attention for their potential in developing carcinogenicity prediction models. This article reviews the progress in predicting carcinogenicity using various machine learning and deep learning algorithms. A comparative analysis on these developed models reveals that support vector machine, random forest, and ensemble learning are commonly preferred for their robustness and effectiveness in predicting chemical carcinogenicity. Conversely, models based on deep learning algorithms, such as feedforward neural network, convolutional neural network, graph convolutional neural network, capsule neural network, and hybrid neural networks, exhibit promising capabilities but are limited by the size of available carcinogenicity datasets. This review provides a comprehensive analysis of current machine learning and deep learning models for carcinogenicity prediction, underscoring the importance of high-quality and large datasets. These observations are anticipated to catalyze future advancements in developing effective and generalizable machine learning and deep learning models for predicting chemical carcinogenicity.

16.

Time Sequence Deep Learning Model for Ubiquitous Tabular Data with Unique 3D Tensors Manipulation.

Gicic, Adaleta; Donko, Dzenana; Subasi, Abdulhamit.

Entropy (Basel) ; 26(9)2024 Sep 12.

Article in English | MEDLINE | ID: mdl-39330116

ABSTRACT

Although deep learning (DL) algorithms have been proved to be effective in diverse research domains, their application in developing models for tabular data remains limited. Models trained on tabular data demonstrate higher efficacy using traditional machine learning models than DL models, which are largely attributed to the size and structure of tabular datasets and the specific application contexts in which they are utilized. Thus, the primary objective of this paper is to propose a method to use the supremacy of Stacked Bidirectional LSTM (Long Short-Term Memory) deep learning algorithms in pattern discovery incorporating tabular data with customized 3D tensor modeling in feeding neural networks. Our findings are empirically validated using six diverse, publicly available datasets each varying in size and learning objectives. This paper proves that the proposed model based on time-sequence DL algorithms, which were generally described as inadequate when dealing with tabular data, yields satisfactory results and competes effectively with other algorithms specifically designed for tabular data. An additional benefit of this approach is its ability to preserve simplicity while ensuring fast model training also with large datasets. Even with extremely small datasets, models can be applied to achieve exceptional predictive results and fully utilize their capacity.

17.

Predicting diabetes in adults: identifying important features in unbalanced data over a 5-year cohort study using machine learning algorithm.

Talebi Moghaddam, Maryam; Jahani, Yones; Arefzadeh, Zahra; Dehghan, Azizallah; Khaleghi, Mohsen; Sharafi, Mehdi; Nikfar, Ghasem.

BMC Med Res Methodol ; 24(1): 220, 2024 Sep 27.

Article in English | MEDLINE | ID: mdl-39333899

ABSTRACT

BACKGROUND: Imbalanced datasets pose significant challenges in predictive modeling, leading to biased outcomes and reduced model reliability. This study addresses data imbalance in diabetes prediction using machine learning techniques. Utilizing data from the Fasa Adult Cohort Study (FACS) with a 5-year follow-up of 10,000 participants, we developed predictive models for Type 2 diabetes. METHODS: We employed various data-level and algorithm-level interventions, including SMOTE, ADASYN, SMOTEENN, Random Over Sampling and KMeansSMOTE, paired with Random Forest, Gradient Boosting, Decision Tree and Multi-Layer Perceptron (MLP) classifier. We evaluated model performance using F1 score, AUC, and G-means-metrics chosen to provide a comprehensive assessment of model accuracy, discrimination ability, and overall balance in performance, particularly in the context of imbalanced datasets. RESULTS: our study uncovered key factors influencing diabetes risk and evaluated the performance of various machine learning models. Feature importance analysis revealed that the most influential predictors of diabetes differ between males and females. For females, the most important factors are triglyceride (TG), basal metabolic rate (BMR), and total cholesterol (CHOL), whereas for males, the key predictors are body Mass Index (BMI), serum glutamate Oxaloacetate Transaminase (SGOT), and Gamma-Glutamyl (GGT). Across the entire dataset, BMI remains the most important variable, followed by SGOT, BMR, and energy intake. These insights suggest that gender-specific risk profiles should be considered in diabetes prevention and management strategies. In terms of model performance, our results show that ADASYN with MLP classifier achieved an F1 score of 82.17 ± 3.38, AUC of 89.61 ± 2.09, and G-means of 89.15 ± 2.31. SMOTE with MLP followed closely with an F1 score of 79.85 ± 3.91, AUC of 89.7 ± 2.54, and G-means of 89.31 ± 2.78. The SMOTEENN with Random Forest combination achieved an F1 score of 78.27 ± 1.54, AUC of 87.18 ± 1.12, and G-means of 86.47 ± 1.28. CONCLUSION: These combinations effectively address class imbalance, improving the accuracy and reliability of diabetes predictions. The findings highlight the importance of using appropriate data-balancing techniques in medical data analysis.

Subject(s)

Algorithms , Diabetes Mellitus, Type 2 , Machine Learning , Humans , Diabetes Mellitus, Type 2/blood , Diabetes Mellitus, Type 2/diagnosis , Female , Male , Adult , Cohort Studies , Middle Aged , Risk Factors , Reproducibility of Results

18.

Skin Type Diversity in Skin Lesion Datasets: A Review.

Alipour, Neda; Burke, Ted; Courtney, Jane.

Curr Dermatol Rep ; 13(3): 198-210, 2024.

Article in English | MEDLINE | ID: mdl-39184010

ABSTRACT

Purpose of review: Skin type diversity in image datasets refers to the representation of various skin types. This diversity allows for the verification of comparable performance of a trained model across different skin types. A widespread problem in datasets involving human skin is the lack of verifiable diversity in skin types, making it difficult to evaluate whether the performance of the trained models generalizes across different skin types. For example, the diversity issues in skin lesion datasets, which are used to train deep learning-based models, often result in lower accuracy for darker skin types that are typically under-represented in these datasets. Under-representation in datasets results in lower performance in deep learning models for under-represented skin types. Recent findings: This issue has been discussed in previous works; however, the reporting of skin types, and inherent diversity, have not been fully assessed. Some works report skin types but do not attempt to assess the representation of each skin type in datasets. Others, focusing on skin lesions, identify the issue but do not measure skin type diversity in the datasets examined. Summary: Effort is needed to address these shortcomings and move towards facilitating verifiable diversity. Building on previous works in skin lesion datasets, this review explores the general issue of skin type diversity by investigating and evaluating skin lesion datasets specifically. The main contributions of this work are an evaluation of publicly available skin lesion datasets and their metadata to assess the frequency and completeness of reporting of skin type and an investigation into the diversity and representation of each skin type within these datasets. Supplementary Information: The online version contains material available at 10.1007/s13671-024-00440-0.

19.

Artificial Intelligence as a Replacement for Animal Experiments in Neurology: Potential, Progress, and Challenges.

Rudroff, Thorsten.

Neurol Int ; 16(4): 805-820, 2024 Jul 29.

Article in English | MEDLINE | ID: mdl-39195562

ABSTRACT

Animal experimentation has long been a cornerstone of neurology research, but it faces growing scientific, ethical, and economic challenges. Advances in artificial intelligence (AI) are providing new opportunities to replace animal testing with more human-relevant and efficient methods. This article explores the potential of AI technologies such as brain organoids, computational models, and machine learning to revolutionize neurology research and reduce reliance on animal models. These approaches can better recapitulate human brain physiology, predict drug responses, and uncover novel insights into neurological disorders. They also offer faster, cheaper, and more ethical alternatives to animal experiments. Case studies demonstrate AI's ability to accelerate drug discovery for Alzheimer's, predict neurotoxicity, personalize treatments for Parkinson's, and restore movement in paralysis. While challenges remain in validating and integrating these technologies, the scientific, economic, practical, and moral advantages are driving a paradigm shift towards AI-based, animal-free research in neurology. With continued investment and collaboration across sectors, AI promises to accelerate the development of safer and more effective therapies for neurological conditions while significantly reducing animal use. The path forward requires the ongoing development and validation of these technologies, but a future in which they largely replace animal experiments in neurology appears increasingly likely. This transition heralds a new era of more humane, human-relevant, and innovative brain research.

20.

Challenges in LncRNA Biology: Views and Opinions.

Adjeroh, Donald A; Zhou, Xiaobo; Paschoal, Alexandre Rossi; Dimitrova, Nadya; Derevyanchuk, Ekaterina G; Shkurat, Tatiana P; Loeb, Jeffrey A; Martinez, Ivan; Lipovich, Leonard.

Noncoding RNA ; 10(4)2024 Aug 01.

Article in English | MEDLINE | ID: mdl-39195572

ABSTRACT

This is a mini-review capturing the views and opinions of selected participants at the 2021 IEEE BIBM 3rd Annual LncRNA Workshop, held in Dubai, UAE. The views and opinions are expressed on five broad themes related to problems in lncRNA, namely, challenges in the computational analysis of lncRNAs, lncRNAs and cancer, lncRNAs in sports, lncRNAs and COVID-19, and lncRNAs in human brain activity.

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL