Results 1 - 20 of 218
1.
J Int Med Res ; 52(7): 3000605241259655, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39068529

ABSTRACT

OBJECTIVE: This study aimed to identify significantly differentially expressed genes (DEGs) related to cervical cancer by exploring extensive gene expression datasets to unveil new therapeutic targets. METHODS: Gene expression profiles were extracted from the Gene Expression Omnibus, The Cancer Genome Atlas, and the Genotype-Tissue Expression platforms. A differential expression analysis identified DEGs in cervical cancer cases. Weighted gene co-expression network analysis (WGCNA) was implemented to locate genes closely linked to the clinical traits of diseases. Machine learning algorithms, including LASSO regression and the random forest algorithm, were applied to pinpoint key genes. RESULTS: The investigation successfully isolated DEGs pertinent to cervical cancer. Interleukin-24 was recognized as a pivotal gene via WGCNA and machine learning techniques. Experimental validations demonstrated that human interleukin (hIL)-24 inhibited proliferation, migration, and invasion, while promoting apoptosis, in SiHa and HeLa cervical cancer cells, affirming its role as a therapeutic target. CONCLUSION: The multi-database analysis strategy employed herein emphasized hIL-24 as a principal gene in cervical cancer pathogenesis. The findings suggest hIL-24 as a promising candidate for targeted therapy, offering a potential avenue for innovative treatment modalities. This study enhances the understanding of molecular mechanisms of cervical cancer and aids in the pursuit of novel oncological therapies.


Subject(s)
Apoptosis , Cell Movement , Cell Proliferation , Gene Expression Regulation, Neoplastic , Interleukins , Neoplasm Invasiveness , Uterine Cervical Neoplasms , Humans , Uterine Cervical Neoplasms/genetics , Uterine Cervical Neoplasms/pathology , Uterine Cervical Neoplasms/metabolism , Female , Cell Proliferation/genetics , Cell Movement/genetics , Interleukins/genetics , Interleukins/metabolism , Apoptosis/genetics , Gene Regulatory Networks , Gene Expression Profiling , HeLa Cells , Machine Learning , Cell Line, Tumor
2.
Technol Health Care ; 32(S1): 229-239, 2024.
Article in English | MEDLINE | ID: mdl-38759052

ABSTRACT

BACKGROUND: Selecting an appropriate similarity measurement method is crucial for obtaining biologically meaningful clustering modules. Commonly used measurement methods are insufficient in capturing the complexity of biological systems and fail to accurately represent their intricate interactions. OBJECTIVE: This study aimed to obtain biologically meaningful gene modules by using a clustering algorithm based on a similarity measurement method. METHODS: A new algorithm called the Dual-Index Nearest Neighbor Similarity Measure (DINNSM) was proposed. This algorithm first calculated the similarity matrix between genes using Pearson's or Spearman's correlation. The similarity matrix was then used to construct a nearest-neighbor table. The final similarity matrix was reconstructed using the positions of shared genes in the nearest-neighbor table and the number of shared genes. RESULTS: Experiments were conducted on five different gene expression datasets, and DINNSM was compared with five widely used similarity measurement techniques for gene expression data. The findings demonstrate that clustering with DINNSM as the similarity measure performed better than clustering with the alternative measurement techniques. CONCLUSIONS: DINNSM provided more accurate insights into the intricate biological connections among genes, facilitating the identification of more accurate and biologically meaningful gene co-expression modules.
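
The three steps described for DINNSM (correlation matrix, nearest-neighbor table, similarity rebuilt from the number and positions of shared neighbors) can be illustrated with a minimal Python sketch. The abstract does not give the actual formula, so the rank weighting `(k - position) / k`, the normalization, and all function names below are assumptions chosen for illustration; this is not the authors' implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def neighbor_table(expr, k):
    """For each gene, indices of its k most correlated other genes, best first."""
    n = len(expr)
    table = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: pearson(expr[i], expr[j]), reverse=True)
        table.append(others[:k])
    return table


def shared_neighbor_similarity(expr, k):
    """Similarity built from shared neighbors and their ranks (assumed weighting)."""
    table = neighbor_table(expr, k)
    # Assumption: a neighbor at 0-based position p gets weight (k - p) / k.
    weight = [(k - p) / k for p in range(k)]
    max_score = sum(w * w for w in weight)  # reached when both lists are identical
    n = len(expr)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        pos_i = {g: p for p, g in enumerate(table[i])}
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0
                continue
            score = sum(
                weight[pos_i[g]] * weight[p]
                for p, g in enumerate(table[j])
                if g in pos_i
            )
            sim[i][j] = score / max_score
    return sim


# Toy data (invented): genes 0-2 rise over the samples, genes 3-5 fall.
expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
    [2.0, 4.1, 6.0, 8.2, 10.0, 12.1],
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],
    [6.1, 5.0, 4.2, 3.0, 2.1, 1.0],
    [12.0, 10.1, 8.0, 6.2, 4.0, 2.0],
]
sim = shared_neighbor_similarity(expr, k=2)
```

In this toy case, genes 0 and 1 share a top-ranked neighbor (gene 2) and get a positive similarity, while genes 0 and 3 share none and get zero. Spearman's correlation could be substituted for `pearson` by ranking each profile first.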


Subject(s)
Algorithms , Gene Expression Profiling , Cluster Analysis , Humans , Gene Expression Profiling/methods , Computational Biology/methods
3.
Artif Intell Med ; 151: 102840, 2024 May.
Article in English | MEDLINE | ID: mdl-38658129

ABSTRACT

High-throughput technologies are becoming increasingly important in discovering prognostic biomarkers and in identifying novel drug targets. With Mammaprint, Oncotype DX, and many other prognostic molecular signatures, breast cancer is one of the paradigmatic examples of the utility of high-throughput data to deliver prognostic biomarkers that can be represented in the form of a rather short gene list. Such gene lists can be obtained as a set of features (genes) that are important for the decisions of a Machine Learning (ML) method applied to high-dimensional gene expression data. Several studies have identified predictive gene lists for patient prognosis in breast cancer, but these lists are unstable and have only a few genes in common. Instability of feature selection impedes biological interpretability: genes that are relevant for cancer pathology should be members of any predictive gene list obtained for the same clinical type of patients. Stability and interpretability of selected features can be improved by including information on molecular networks in ML methods. Graph Convolutional Neural Network (GCNN) is a contemporary deep learning approach applicable to gene expression data structured by a prior knowledge molecular network. Layer-wise Relevance Propagation (LRP) and SHapley Additive exPlanations (SHAP) are methods to explain individual decisions of deep learning models. We used both GCNN+LRP and GCNN+SHAP techniques to construct feature sets by aggregating individual explanations. We suggest a methodology to systematically and quantitatively analyze the stability, the impact on the classification performance, and the interpretability of the selected feature sets. We used this methodology to compare GCNN+LRP to GCNN+SHAP and to more classical ML-based feature selection approaches. 
Utilizing a large breast cancer gene expression dataset we show that, while feature selection with SHAP is useful in applications where selected features have to be impactful for classification performance, among all studied methods GCNN+LRP delivers the most stable (reproducible) and interpretable gene lists.
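
The notion of gene lists that "have only a few genes in common" can be quantified in several ways. A minimal Python sketch, assuming stability is measured as the mean pairwise Jaccard index between lists selected on different resamplings; the abstract does not state which metric the authors use, so this choice, the function names, and the example gene lists are illustrative assumptions only.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two gene lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_jaccard(gene_lists):
    """Average Jaccard index over all pairs of selected gene lists."""
    pairs = list(combinations(gene_lists, 2))
    if not pairs:
        raise ValueError("need at least two gene lists")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical lists selected on three resamplings of the same cohort.
runs = [
    ["TP53", "BRCA1", "ESR1"],
    ["TP53", "BRCA1", "MKI67"],
    ["TP53", "ERBB2", "MKI67"],
]
stability = mean_pairwise_jaccard(runs)  # (0.5 + 0.2 + 0.5) / 3
```

A value near 1 means the selection is reproducible across resamplings; a value near 0 means the lists barely overlap.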


Subject(s)
Biomarkers, Tumor , Breast Neoplasms , Neural Networks, Computer , Humans , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Biomarkers, Tumor/genetics , Female , Gene Expression Profiling/methods , Deep Learning , Prognosis , Machine Learning
4.
Helicobacter ; 29(2): e13074, 2024.
Article in English | MEDLINE | ID: mdl-38615332

ABSTRACT

BACKGROUND: Helicobacter pylori is considered a true human pathogen for which rising drug resistance constitutes a serious global concern. The present study aimed to reconstruct a genome-scale metabolic model (GSMM) to decipher the metabolic capability of H. pylori strains in response to clarithromycin and rifampicin, and to identify novel drug targets. MATERIALS AND METHODS: The iIT341 model of H. pylori was updated based on genome annotation data and biochemical knowledge from the literature and databases. Context-specific models were generated by integrating the transcriptomic data of clarithromycin and rifampicin resistance into the model. Flux balance analysis was employed to identify essential genes in each strain, which were further prioritized based on non-homology to human genes, virulence factor analysis, druggability, and broad-spectrum analysis. Additionally, metabolic differences between sensitive and resistant strains were investigated based on flux variability analysis and pathway enrichment analysis of transcriptomic data. RESULTS: The reconstructed GSMM was named the HpM485 model. Pathway enrichment and flux variability analyses demonstrated reduced activity in the ribosomal pathway in both clarithromycin- and rifampicin-resistant strains. Also, a significant decrease was detected in the activity of metabolic pathways of the clarithromycin-resistant strain. Moreover, 23 and 16 essential genes were exclusively detected in clarithromycin- and rifampicin-resistant strains, respectively. Based on prioritization analysis, cyclopropane fatty acid synthase and phosphoenolpyruvate synthase were identified as putative drug targets in clarithromycin- and rifampicin-resistant strains, respectively. CONCLUSIONS: We present a robust and reliable metabolic model of H. pylori. This model can predict novel drug targets to combat drug resistance and explore the metabolic capability of H. pylori in various conditions.


Subject(s)
Helicobacter Infections , Helicobacter pylori , Humans , Helicobacter pylori/genetics , Clarithromycin/pharmacology , Rifampin/pharmacology , Helicobacter Infections/drug therapy , Databases, Factual
5.
BMC Bioinformatics ; 25(1): 133, 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38539106

ABSTRACT

Cancer is one of the leading causes of death worldwide. Survival analysis and prediction for cancer patients are of great significance for precision medicine. The robustness and interpretability of survival prediction models are important: robustness indicates whether a model has actually learned the underlying knowledge, and interpretability indicates whether a model can show humans what it has learned. In this paper, we propose a robust and interpretable model, SurvConvMixer, which uses pathway-customized gene expression images and ConvMixer for short-term, mid-term and long-term overall survival prediction in cancer. With ConvMixer, the representation of each pathway can be learned separately. We show the robustness of our model by testing the trained model on external datasets that were never used in training. The interpretability of SurvConvMixer depends on gradient-weighted class activation mapping (Grad-CAM), by which we can obtain the pathway-level activation heat map. Wilcoxon rank-sum tests are then conducted to obtain the statistically significant pathways, thereby revealing which pathways the model focuses on more. SurvConvMixer achieves remarkable performance on the short-term, mid-term and long-term overall survival of lung adenocarcinoma, lung squamous cell carcinoma and skin cutaneous melanoma, and the external validation tests show that SurvConvMixer generalizes to external datasets and is therefore robust. Finally, we investigate the activation maps generated by Grad-CAM; after Wilcoxon rank-sum tests and Kaplan-Meier estimation, we find that some survival-related pathways play an important role in SurvConvMixer.


Subject(s)
Adenocarcinoma of Lung , Lung Neoplasms , Melanoma , Skin Neoplasms , Humans , Gene Expression
6.
BMC Bioinformatics ; 25(1): 125, 2024 Mar 22.
Article in English | MEDLINE | ID: mdl-38519883

ABSTRACT

In the battle of the host against lentiviral pathogenesis, the immune response is crucial. However, several questions remain unanswered about the interaction with different viruses and their influence on disease progression. The simian immunodeficiency virus (SIV) infecting nonhuman primates (NHP) is widely used as a model for the study of the human immunodeficiency virus (HIV) both because they are evolutionarily linked and because they share physiological and anatomical similarities that are largely explored to understand the disease progression. The HIHISIV database was developed to support researchers to integrate and evaluate the large number of transcriptional data associated with the presence/absence of the pathogen (SIV or HIV) and the host response (NHP and human). The datasets are composed of microarray and RNA-Seq gene expression data that were selected, curated, analyzed, enriched, and stored in a relational database. Six query templates comprise the main data analysis functions and the resulting information can be downloaded. The HIHISIV database, available at  https://hihisiv.github.io , provides accurate resources for browsing and visualizing results and for more robust analyses of pre-existing data in transcriptome repositories.


Subject(s)
HIV Infections , Simian Acquired Immunodeficiency Syndrome , Simian Immunodeficiency Virus , Animals , Humans , Simian Immunodeficiency Virus/genetics , HIV , Simian Acquired Immunodeficiency Syndrome/genetics , Disease Progression , Immunity , Gene Expression
7.
Front Immunol ; 15: 1285785, 2024.
Article in English | MEDLINE | ID: mdl-38433833

ABSTRACT

Introduction: Enteric infections are a major cause of under-5 (age) mortality in low/middle-income countries. Although vaccines against these infections have already been licensed, unwavering efforts are required to boost suboptimal efficacy and effectiveness in regions that are highly endemic to enteric pathogens. The role of baseline immunological profiles in influencing vaccine-induced immune responses is becoming increasingly clear for several vaccines. Hence, for the development of advanced and region-specific enteric vaccines, insights into differences in immune responses to perturbations in endemic and non-endemic settings become crucial. Materials and methods: For this reason, we employed a two-tiered system and computational pipeline (i) to study the variations in differentially expressed genes (DEGs) associated with immune responses to enteric infections in endemic and non-endemic study groups, and (ii) to derive features (genes) of importance that keenly distinguish between these two groups using unsupervised machine learning algorithms on an aggregated gene expression dataset. The derived genes were further curated using topological analysis of the constructed STRING networks. The findings from these two tiers were validated using a multilayer perceptron classifier and were further explored using correlation and regression analysis for the retrieval of associated gene regulatory modules. Results: Our analysis reveals aggressive suppression of GRB-2, an adaptor molecule integral for TCR signaling, as a primary immunomodulatory response against S. typhi infection in endemic settings. Moreover, using retrieved correlation modules and multivariant regression models, we found a positive association between regulators of activated T cells and mediators of Hedgehog signaling in the endemic population, which indicates the initiation of an effector (involving differentiation and homing) rather than an inductive response upon infection. 
On further exploration, we found STAT3 to be instrumental in designating T-cell functions upon early responses to enteric infections in endemic settings. Conclusion: Overall, through a systems and computational biology approach, we characterized distinct molecular players involved in immune responses to enteric infections in endemic settings in the process, contributing to the mounting evidence of endemicity being a major determiner of pathogen/vaccine-induced immune responses. The gained insights will have important implications in the design and development of region/endemicity-specific vaccines.


Subject(s)
Hedgehog Proteins , Vaccines , Immunomodulation , Immunity , Gene Expression
8.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38436561

ABSTRACT

Enrichment analysis (EA) is a common approach to gain functional insights from genome-scale experiments. As a consequence, a large number of EA methods have been developed, yet it is unclear from previous studies which method is the best for a given dataset. The main issues with previous benchmarks include the complexity of correctly assigning true pathways to a test dataset, and lack of generality of the evaluation metrics, for which the rank of a single target pathway is commonly used. We here provide a generalized EA benchmark and apply it to the most widely used EA methods, representing all four categories of current approaches. The benchmark employs a new set of 82 curated gene expression datasets from DNA microarray and RNA-Seq experiments for 26 diseases, of which only 13 are cancers. In order to address the shortcomings of the single target pathway approach and to enhance the sensitivity evaluation, we present the Disease Pathway Network, in which related Kyoto Encyclopedia of Genes and Genomes pathways are linked. We introduce a novel approach to evaluate pathway EA by combining sensitivity and specificity to provide a balanced evaluation of EA methods. This approach identifies Network Enrichment Analysis methods as the overall top performers compared with overlap-based methods. By using randomized gene expression datasets, we explore the null hypothesis bias of each method, revealing that most of them produce skewed P-values.
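
For readers unfamiliar with the overlap-based category mentioned above, its classic form is a one-sided hypergeometric test on the overlap between a selected gene list and a pathway. A minimal Python sketch follows; the function and parameter names are hypothetical and do not correspond to any specific tool evaluated in the benchmark.

```python
from math import comb


def overlap_enrichment_pvalue(universe_size, pathway_size, selected_size, overlap):
    """P(X >= overlap) when X is hypergeometric.

    universe_size: number of genes measured in the experiment
    pathway_size:  number of those genes annotated to the pathway
    selected_size: number of genes in the selected (e.g. DEG) list
    overlap:       number of selected genes that belong to the pathway
    """
    total = comb(universe_size, selected_size)
    upper = min(pathway_size, selected_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, selected_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# 10 genes measured, 4 in the pathway, 3 selected, 2 of them in the pathway.
p = overlap_enrichment_pvalue(10, 4, 3, 2)  # (36 + 4) / 120
```

Under a correct null model such P-values should be uniformly distributed, which is the property the randomized-dataset experiment in the abstract probes.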


Subject(s)
Benchmarking , RNA-Seq
9.
Biosystems ; 236: 105126, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38278505

ABSTRACT

The inference of gene regulatory networks (GRNs) is a widely addressed problem in Systems Biology. GRNs can be modeled as Boolean networks, which is the simplest approach for this task. However, Boolean models need binarized data. Several approaches have been developed for the discretization of gene expression data (GED). Also, the advance of data extraction technologies, such as single-cell RNA-Sequencing (scRNA-Seq), provides a new vision of gene expression and brings new challenges for dealing with its specificities, such as a large occurrence of zero data. This work proposes a new discretization approach for dealing with scRNA-Seq time-series data, named Distribution and Successive Spline Points Discretization (DSSPD), which considers the data distribution and a proper preprocessing step. Here, Cartesian Genetic Programming (CGP) is used to infer GRNs using the results of DSSPD. The proposal is compared with CGP with the standard data handling and five state-of-the-art algorithms on curated models and experimental data. The results show that the proposal improves the results of CGP in all tested cases and outperforms the state-of-the-art algorithms in most cases.
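
To make the requirement that "Boolean models need binarized data" concrete, here is a minimal Python sketch of a simple baseline discretization: each gene's time series is thresholded at its own mean. This is explicitly not DSSPD, whose distribution- and spline-based steps are not specified in the abstract; the threshold rule and the toy values are assumptions for illustration.

```python
def binarize_series(values):
    """Map a gene's time series to 0/1 using its own mean as the threshold."""
    threshold = sum(values) / len(values)
    return [1 if v > threshold else 0 for v in values]


def binarize_matrix(expr):
    """Binarize every gene in a dict mapping gene name -> time series."""
    return {gene: binarize_series(series) for gene, series in expr.items()}


# Toy scRNA-Seq-like series with many zeros (hypothetical values).
expr = {
    "geneA": [0.0, 0.0, 2.0, 6.0, 8.0],
    "geneB": [5.0, 4.0, 0.0, 0.0, 0.0],
}
states = binarize_matrix(expr)
```

With zero-inflated data, a mean threshold like this is sensitive to the proportion of zeros, which is one reason a distribution-aware scheme may be preferable.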


Subject(s)
Gene Regulatory Networks , Single-Cell Gene Expression Analysis , Tosyl Compounds , Gene Regulatory Networks/genetics , Algorithms , Systems Biology , Gene Expression Profiling/methods
10.
Comput Biol Med ; 170: 107981, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38262204

ABSTRACT

A framework is developed for gene expression analysis by introducing fuzzy Jaccard similarity (FJS) and combining Lukasiewicz implication with it through weights in a hybrid ensemble framework for gene selection in cancer. The method is called weighted combination of Lukasiewicz implication and fuzzy Jaccard similarity in hybrid ensemble framework (WCLFJHEF). While the fuzziness in Jaccard similarity is incorporated by using the existing Gödel fuzzy logic, the weights are obtained by maximizing the average F-score of selected genes in classifying the cancer patients. The patients are first divided into different clusters, based on the number of patient groups, using average linkage agglomerative clustering and a new score, called WCLFJ (weighted combination of Lukasiewicz implication and fuzzy Jaccard similarity). The genes are then selected from each cluster separately using filter-based Relief-F and wrapper-based SVMRFE (Support Vector Machine with Recursive Feature Elimination). A gene (feature) pool is created by considering the union of selected features for all the clusters. A set of informative genes is selected from the pool using the sequential backward floating search (SBFS) algorithm. Patients are then classified using Naïve Bayes (NB) and Support Vector Machine (SVM) separately, using the selected genes, and the related F-scores are calculated. The weights in WCLFJ are then updated iteratively to maximize the average F-score obtained from the results of the classifier. The effectiveness of WCLFJHEF is demonstrated on six gene expression datasets. The average values of accuracy, F-score, recall, precision and MCC over all the datasets are 95%, 94%, 94%, 94%, and 90%, respectively. The explainability of the selected genes is shown using SHapley Additive exPlanations (SHAP) values and this information is further used to rank them. 
The relevance of the selected gene set is biologically validated using the KEGG Pathway, Gene Ontology (GO), and existing literature. It is seen that the genes that are selected by WCLFJHEF are candidates for genomic alterations in the various cancer types. The source code of WCLFJHEF is available at http://www.isical.ac.in/~shubhra/WCLFJHEF.html.
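
The two building blocks named in this abstract have standard definitions that can be sketched in a few lines of Python. Under Gödel logic, fuzzy intersection and union are `min` and `max`, and the Lukasiewicz implication is `min(1, 1 - a + b)`. How WCLFJ actually turns the implication into a similarity and combines the two scores is not stated in the abstract; the bi-implication and the single weight `w` below are assumptions for illustration, and the inputs are assumed to be scaled to [0, 1].

```python
def fuzzy_jaccard(a, b):
    """Fuzzy Jaccard similarity with Goedel min/max as intersection/union."""
    inter = sum(min(x, y) for x, y in zip(a, b))
    union = sum(max(x, y) for x, y in zip(a, b))
    return inter / union if union > 0 else 1.0


def lukasiewicz_implication(a, b):
    """Lukasiewicz implication I(a, b) = min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)


def lukasiewicz_similarity(a, b):
    """Assumption: mean bi-implication min(I(x, y), I(y, x)) = 1 - |x - y|."""
    vals = [
        min(lukasiewicz_implication(x, y), lukasiewicz_implication(y, x))
        for x, y in zip(a, b)
    ]
    return sum(vals) / len(vals)


def weighted_score(a, b, w):
    """Assumed convex combination of the two scores with weight w in [0, 1]."""
    return w * lukasiewicz_similarity(a, b) + (1.0 - w) * fuzzy_jaccard(a, b)


# Two hypothetical patients described by three genes scaled to [0, 1].
p1 = [0.2, 0.8, 0.5]
p2 = [0.4, 0.6, 0.5]
score = weighted_score(p1, p2, w=0.5)
```

In the described framework the weights are tuned iteratively to maximize the average F-score; here `w` is simply fixed.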


Subject(s)
Gene Expression Profiling , Neoplasms , Humans , Bayes Theorem , Gene Expression Profiling/methods , Algorithms , Neoplasms/metabolism , Software
11.
Comput Methods Programs Biomed ; 244: 107966, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38091844

ABSTRACT

BACKGROUND: In Diffuse Large B-Cell Lymphoma (DLBCL), several methodologies are emerging to derive novel biomarkers to be incorporated in the risk assessment. We realized a pipeline that relies on autoencoders (AE) and Explainable Artificial Intelligence (XAI) to stratify prognosis and derive a gene-based signature. METHODS: AE was exploited to learn an unsupervised representation of the gene expression (GE) from three publicly available datasets, each with its own technology. Multi-layer perceptron (MLP) was used to classify prognosis from latent representation. GE data were preprocessed as normalized, scaled, and standardized. Four different AE architectures (Large, Medium, Small and Extra Small) were compared to find the most suitable for GE data. The joint AE-MLP classified patients on six different outcomes: overall survival at 12, 36, 60 months and progression-free survival (PFS) at 12, 36, 60 months. XAI techniques were used to derive a gene-based signature aimed at refining the Revised International Prognostic Index (R-IPI) risk, which was validated in a fourth independent publicly available dataset. We named our tool SurvIAE: Survival prediction with Interpretable AE. RESULTS: From the latent space of AEs, we observed that scaled and standardized data reduced the batch effect. SurvIAE models outperformed R-IPI with Matthews Correlation Coefficient up to 0.42 vs. 0.18 for the validation-set (PFS36) and to 0.30 vs. 0.19 for the test-set (PFS60). We selected the SurvIAE-Small-PFS36 as the best model and, from its gene signature, we stratified patients in three risk groups: R-IPI Poor patients with High levels of GAB1, R-IPI Poor patients with Low levels of GAB1 or R-IPI Good/Very Good patients with Low levels of GPR132, and R-IPI Good/Very Good patients with High levels of GPR132. CONCLUSIONS: SurvIAE showed the potential to derive a gene signature with translational purpose in DLBCL. 
The pipeline was made publicly available and can be reused for other pathologies.
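
The Matthews Correlation Coefficient used to compare SurvIAE with R-IPI is computed from the four cells of a binary confusion matrix. A minimal Python sketch follows; the counts in the example are invented and are not taken from the study.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical confusion matrix for a binary outcome such as PFS at 36 months.
mcc = matthews_corrcoef(tp=6, tn=3, fp=1, fn=2)
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, which makes it more informative than accuracy when the outcome classes are imbalanced.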


Subject(s)
Artificial Intelligence , Lymphoma, Large B-Cell, Diffuse , Humans , Antineoplastic Combined Chemotherapy Protocols , Lymphoma, Large B-Cell, Diffuse/genetics , Lymphoma, Large B-Cell, Diffuse/drug therapy , Prognosis , Gene Expression , Retrospective Studies
12.
Sci Total Environ ; 912: 169021, 2024 Feb 20.
Article in English | MEDLINE | ID: mdl-38061659

ABSTRACT

Coral reefs are facing unprecedented threats due to global climate change, particularly elevated sea surface temperatures causing coral bleaching. Understanding coral responses at the molecular level is crucial for predicting their resilience and developing effective conservation strategies. In this study, we conducted a comprehensive gene expression analysis of four coral species to investigate their long-term molecular response to heat stress. We identified distinct gene expression patterns among the coral species, with laminar corals exhibiting a stronger response compared to branching corals. Heat shock proteins (HSPs) showed an overall decreasing expression trend, indicating the high energy cost associated with sustaining elevated HSP levels during prolonged heat stress. Peroxidases and oxidoreductases involved in oxidative stress response demonstrated significant upregulation, highlighting their role in maintaining cellular redox balance. Differential expression of genes related to calcium homeostasis and bioluminescence suggested distinct mechanisms for coping with heat stress among the coral species. Furthermore, the impact of heat stress on coral biomineralization varied, with downregulation of carbonic anhydrase and skeletal organic matrix proteins indicating reduced capacity for biomineralization in the later stages of heat stress. Our findings provide insights into the molecular mechanisms underlying coral responses to heat stress and highlight the importance of considering species-specific responses in assessing coral resilience. The identified biomarkers may serve as indicators of heat stress and contribute to early detection of coral bleaching events. These findings contribute to our understanding of coral resilience and provide a basis for future research aimed at enhancing coral survival in the face of climate change.


Subject(s)
Anthozoa , Resilience, Psychological , Animals , Anthozoa/physiology , Heat-Shock Response , Coral Reefs , Gene Expression
13.
PeerJ Comput Sci ; 9: e1686, 2023.
Article in English | MEDLINE | ID: mdl-38077583

ABSTRACT

Background: Identifying the genes responsible for diseases requires precise prioritization of significant genes. Gene expression analysis enables differentiation between gene expressions in disease and normal samples. Increasing the number of high-quality samples enhances the strength of evidence regarding gene involvement in diseases. This process has led to the discovery of disease biomarkers through the collection of diverse gene expression data. Methods: This study presents GeneCompete, a web-based tool that integrates gene expression data from multiple platforms and experiments to identify the most promising biomarkers. GeneCompete incorporates a novel union strategy and eight well-established ranking methods, including Win-Loss, Massey, Colley, Keener, Elo, Markov, PageRank, and Bi-directional PageRank algorithms, to prioritize genes across multiple gene expression datasets. Each gene in the competition is assigned a score based on log-fold change values, and significant genes are determined as winners. Results: We tested the tool on expression datasets of hypertrophic cardiomyopathy (HCM) and on datasets from the Microarray Quality Control (MAQC) project, which include both microarray and RNA-Sequencing techniques. The results demonstrate that all ranking scores have more power to predict new occurrence datasets than the classical method. Moreover, the PageRank method with a union strategy delivers the best performance for both up-regulated and down-regulated genes. Furthermore, the top-ranking genes exhibit a strong association with the disease. For MAQC, the two-sided ranking score shows a strong relationship with the TaqMan validation set at all log-fold change thresholds. Conclusion: GeneCompete is a powerful web-based tool that revolutionizes the identification of disease-causing genes through the integration of gene expression data from multiple platforms and experiments.
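
The idea of genes "competing" on log-fold change can be illustrated with the simplest of the listed schemes, a win-loss score. The Python sketch below is a guess at the general mechanism, not GeneCompete's code: it assumes that within each dataset every pair of genes plays one match won by the gene with the larger log-fold change, and that under a union strategy a gene takes part in every dataset where it was measured. The function name and the toy values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def win_loss_scores(datasets):
    """Fraction of pairwise matches each gene wins across all datasets.

    datasets: list of dicts mapping gene name -> log-fold change.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for lfc in datasets:
        for g, h in combinations(sorted(lfc), 2):
            games[g] += 1
            games[h] += 1
            if lfc[g] > lfc[h]:
                wins[g] += 1
            elif lfc[h] > lfc[g]:
                wins[h] += 1
    return {g: wins[g] / games[g] for g in games}


# Two hypothetical experiments; gene B was not measured in the second one.
datasets = [
    {"A": 2.0, "B": 1.0, "C": -0.5},
    {"A": 1.5, "C": 0.2},
]
scores = win_loss_scores(datasets)
ranking = sorted(scores, key=scores.get, reverse=True)
```

This ranks up-regulated genes first; ranking down-regulated genes would reverse the comparison. Methods such as Colley, Massey, or PageRank replace the raw win fraction with a score that also accounts for the strength of the opponents.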

14.
BMC Bioinformatics ; 24(1): 427, 2023 Nov 13.
Article in English | MEDLINE | ID: mdl-37957576

ABSTRACT

BACKGROUND: Although gene expression data play significant roles in biological and medical studies, their applications are hampered by the difficulty and high expense of gathering them through biological experiments. Generating high-quality gene expression data with computational methods is therefore an urgent problem. WGAN-GP, a generative adversarial network-based method, has been successfully applied in augmenting gene expression data. However, mode collapse or over-fitting may occur with small training samples because the method uses only one discriminator. RESULTS: In this study, an improved data augmentation approach, MDWGAN-GP, a generative adversarial network model with multiple discriminators, is proposed. In addition, a novel method is devised for enriching training samples based on a linear graph convolutional network. Extensive experiments were implemented on real biological data. CONCLUSIONS: The experimental results demonstrate that, compared with other state-of-the-art methods, the MDWGAN-GP method can produce higher-quality generated gene expression data in most cases.


Subject(s)
Data Accuracy , Gene Expression
15.
Big Data ; 2023 Sep 04.
Article in English | MEDLINE | ID: mdl-37668992

ABSTRACT

Over the years, many studies have been carried out to reduce and eliminate the effects of diseases on human health. Gene expression data sets play a critical role in diagnosing and treating diseases. These data sets consist of thousands of genes and a small number of samples. This situation creates the curse of dimensionality, and it becomes problematic to analyze such data sets. One of the most effective strategies to solve this problem is feature selection. Feature selection is a preprocessing step that improves classification accuracy by selecting the most relevant and informative features. In this article, we propose a new statistically based filter method for feature selection named Effective Range-based Feature Selection Algorithm (FSAER). As an extension of the previous Effective Range based Gene Selection (ERGS) and Improved Feature Selection based on Effective Range (IFSER) algorithms, our novel method includes the advantages of both methods while taking into account the disjoint area. To illustrate the efficacy of the proposed algorithm, experiments have been conducted on six benchmark gene expression data sets. The results of FSAER and the other filter methods have been compared in terms of classification accuracy to demonstrate the effectiveness of the proposed method. For classification, support vector machines, the naive Bayes classifier, and the k-nearest neighbor algorithm have been used.
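
The effective-range idea behind this family of filter methods can be sketched for a single gene and two classes. This is a simplified Python illustration, not the FSAER algorithm: the interval formula (class mean plus or minus `(1 - prior) * gamma * std`), the constant `gamma = 1.732`, and the use of the population standard deviation are stated here as assumptions, and FSAER's handling of the disjoint area is not reproduced because the abstract does not describe it.

```python
import statistics


def effective_range(values, prior, gamma=1.732):
    """Assumed interval: mean +/- (1 - prior) * gamma * std for one class."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    half = (1.0 - prior) * gamma * sigma
    return mu - half, mu + half


def overlap_coefficient(class_a, class_b):
    """Share of the total span where the two class ranges overlap (lower is better)."""
    n = len(class_a) + len(class_b)
    lo_a, hi_a = effective_range(class_a, len(class_a) / n)
    lo_b, hi_b = effective_range(class_b, len(class_b) / n)
    overlap = max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    span = max(hi_a, hi_b) - min(lo_a, lo_b)
    return overlap / span if span > 0 else 1.0


# Hypothetical expression values of two genes in two patient groups.
well_separated = overlap_coefficient([1.0, 1.2, 0.8], [5.0, 5.1, 4.9])
overlapping = overlap_coefficient([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A gene whose class ranges do not overlap is a strong candidate feature, so a filter of this kind would rank genes by increasing overlap and keep the top ones for the classifier.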

16.
BMC Bioinformatics ; 24(1): 362, 2023 Sep 26.
Article in English | MEDLINE | ID: mdl-37752445

ABSTRACT

BACKGROUND: The central biological clock governs numerous facets of mammalian physiology, including sleep, metabolism, and immune system regulation. Understanding gene regulatory relationships is crucial for unravelling the mechanisms that underlie various cellular biological processes. While it is possible to infer circadian gene regulatory relationships from time-series gene expression data, relying solely on correlation-based inference may not provide sufficient information about causation. Moreover, gene expression data often have high dimensions but a limited number of observations, posing challenges in their analysis. METHODS: In this paper, we introduce a new hybrid framework, referred to as Circadian Gene Regulatory Framework (CGRF), to infer circadian gene regulatory relationships from gene expression data of rats. The framework addresses the challenges of high-dimensional data by combining the fuzzy C-means clustering algorithm with dynamic time warping distance. Through this approach, we efficiently identify the clusters of genes related to the target gene. To determine the significance of genes within a specific cluster, we employ the Wilcoxon signed-rank test. Subsequently, we use a dynamic vector autoregressive method to analyze the selected significant gene expression profiles and reveal directed causal regulatory relationships based on partial correlation. CONCLUSION: The proposed CGRF framework offers a comprehensive and efficient solution for understanding circadian gene regulation. Circadian gene regulatory relationships are inferred from the gene expression data of rats based on the Aanat target gene. The results show that genes Pde10a, Atp7b, Prok2, Per1, Rhobtb3 and Dclk1 stand out, which have been known to be essential for the regulation of circadian activity. The potential relationships between genes Tspan15, Eprs, Eml5 and Fsbp with a circadian rhythm need further experimental research.
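
Dynamic time warping, the distance combined with fuzzy C-means in this framework, compares two time series while allowing them to be locally stretched or shifted, which suits expression profiles that peak at slightly different phases. A minimal Python sketch of the textbook dynamic program follows, assuming an absolute-difference local cost and no warping-window constraint; the abstract does not say which variant CGRF uses.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] aligned to an already matched b[j-1]
                cost[i][j - 1],      # b[j-1] aligned to an already matched a[i-1]
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# A profile and a time-stretched copy of it align perfectly.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# A constant offset cannot be warped away.
offset = dtw_distance([1, 2, 3], [2, 3, 4])
```

A pairwise matrix of such distances can then replace the Euclidean distance inside a clustering algorithm.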


Subject(s)
Gene Expression Profiling , Gene Expression Regulation , Rats , Animals , Gene Expression Profiling/methods , Transcription Factors/metabolism , Algorithms , Circadian Rhythm/genetics , Gene Expression , Mammals/genetics
17.
Front Genet ; 14: 1139082, 2023.
Article in English | MEDLINE | ID: mdl-37671046

ABSTRACT

Introduction: Identifying significant sets of genes that are up/downregulated under specific conditions is vital to understand disease development mechanisms at the molecular level. Along this line, in order to analyze transcriptomic data, several computational feature selection (i.e., gene selection) methods have been proposed. On the other hand, uncovering the core functions of the selected genes provides a deep understanding of diseases. In order to address this problem, biological domain knowledge-based feature selection methods have been proposed. Unlike computational gene selection approaches, these domain knowledge-based methods take the underlying biology into account and integrate knowledge from external biological resources. Gene Ontology (GO) is one such biological resource that provides ontology terms for defining the molecular function, cellular component, and biological process of the gene product. Methods: In this study, we developed a tool named GeNetOntology which performs GO-based feature selection for gene expression data analysis. In the proposed approach, the process of Grouping, Scoring, and Modeling (G-S-M) is used to identify significant GO terms. GO information has been used as the grouping information, which has been embedded into a machine learning (ML) algorithm to select informative ontology terms. The genes annotated with the selected ontology terms have been used in the training part to carry out the classification task of the ML model. The output is an important set of ontologies for the two-class classification task applied to gene expression data for a given phenotype. Results: Our approach has been tested on 11 different gene expression datasets, and the results showed that GeNetOntology successfully identified important disease-related ontology terms to be used in the classification model. 
Discussion: GeNetOntology will assist geneticists and scientists to identify a range of disease-related genes and ontologies in transcriptomic data analysis, and it will also help doctors design diagnosis platforms and improve patient treatment plans.

18.
Med Biol Eng Comput ; 61(11): 2895-2919, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37530887

ABSTRACT

Prediction of the stage of cancer plays an important role in planning the course of treatment and has been largely reliant on imaging tools which do not capture molecular events that cause cancer progression. Gene-expression data-based analyses are able to identify these events, allowing RNA-sequence and microarray cancer data to be used for cancer analyses. Breast cancer is the most common cancer worldwide, and is classified into four stages - stages 1, 2, 3, and 4 [2]. While machine learning models have previously been explored to perform stage classification with limited success, multi-class stage classification has not had significant progress. There is a need for improved multi-class classification models, such as by investigating deep learning models. Gene-expression-based cancer data is characterised by the small size of available datasets, class imbalance, and high dimensionality. Class balancing methods must be applied to the dataset. Since not all genes are necessary for stage prediction, retaining only the necessary genes can improve classification accuracy. The breast cancer samples are to be classified into 4 classes of stages 1 to 4. Invasive ductal carcinoma breast cancer samples are obtained from The Cancer Genome Atlas (TCGA) and Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) datasets and combined. Two class balancing techniques are explored, synthetic minority oversampling technique (SMOTE) and SMOTE followed by random undersampling. A hybrid feature selection pipeline is proposed, with three pipelines explored involving combinations of filter and embedded feature selection methods: Pipeline 1 - minimum-redundancy maximum-relevancy (mRMR) and correlation feature selection (CFS), Pipeline 2 - mRMR, mutual information (MI) and CFS, and Pipeline 3 - mRMR and support vector machine-recursive feature elimination (SVM-RFE). 
The classification is done using deep learning models, namely deep neural network, convolutional neural network, recurrent neural network, a modified deep neural network, and an AutoKeras generated model. Classification performance post class-balancing and various feature selection techniques show marked improvement over classification prior to feature selection. The best multiclass classification was found to be by a deep neural network post SMOTE and random undersampling, and feature selection using mRMR and recursive feature elimination, with a Cohen-Kappa score of 0.303 and a classification accuracy of 53.1%. For binary classification into early and late-stage cancer, the best performance is obtained by a modified deep neural network (DNN) post SMOTE and random undersampling, and feature selection using mRMR and recursive feature elimination, with an accuracy of 81.0% and a Cohen-Kappa score (CKS) of 0.280. This pipeline also showed improved multiclass classification performance on neuroblastoma cancer data, with a best area under the receiver operating characteristic (auROC) curve score of 0.872, as compared to 0.71 obtained in previous work, an improvement of 22.81%. The results and analysis reveal that feature selection techniques play a vital role in gene-expression data-based classification, and the proposed hybrid feature selection pipeline improves classification performance. Multi-class classification is possible using deep learning models, though further improvement particularly in late-stage classification is necessary and should be explored further.
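
The abstract reports both accuracy and a Cohen-Kappa score, and the two can diverge sharply on imbalanced data. The sketch below computes Cohen's kappa from its standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name `cohen_kappa` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length label sequences (standard definition)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label sequences must be non-empty and of equal length")
    # Observed agreement: fraction of samples where the labels match (the accuracy).
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over classes of the product of the marginal frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


# Toy imbalanced example (made-up labels): always predicting the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
kappa = cohen_kappa(y_true, y_pred)
```

In this toy case the accuracy is 0.8 while kappa is zero, because the classifier does no better than chance given the class frequencies. This is one reason a high binary accuracy can coexist with a modest kappa.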


Subject(s)
Breast Neoplasms , Deep Learning , Humans , Female , Breast Neoplasms/genetics , Transcriptome , Neoplasm Staging , Gene Expression Profiling/methods
19.
BMC Bioinformatics ; 24(1): 289, 2023 Jul 19.
Article in English | MEDLINE | ID: mdl-37468832

ABSTRACT

BACKGROUND: Cancer subtype classification is helpful for personalized cancer treatment. Although some approaches have been developed to classify cancer subtypes based on high-dimensional gene expression data, it is difficult to obtain satisfactory classification results. Meanwhile, some cancers have been well studied and classified into subtypes that are adopted by most researchers. Hence, this a priori knowledge is significant for further identifying new meaningful subtypes. RESULTS: In this paper, we present a combined parallel random forest and autoencoder approach for cancer subtype identification based on high-dimensional gene expression data, ForestSubtype. ForestSubtype first adopts the parallel RF and the a priori knowledge of cancer subtypes to train a module and extract significant candidate features. Second, ForestSubtype uses a random forest as the base module and ten parallel random forests to compute each feature weight and rank them separately. Then, the intersection of the features with the larger weights output by the ten parallel random forests is taken as our subsequent candidate features. Third, ForestSubtype uses an autoencoder to condense the selected features into two-dimensional data. Fourth, ForestSubtype utilizes k-means++ to obtain new cancer subtype identification results. In this paper, the breast cancer gene expression data obtained from The Cancer Genome Atlas are used for training and validation, and an independent breast cancer dataset from the Molecular Taxonomy of Breast Cancer International Consortium is used for testing. Additionally, we use two other cancer datasets for validating the generalizability of ForestSubtype. ForestSubtype outperforms the other two methods in terms of the distribution of clusters, internal and external metric results. The open-source code is available at https://github.com/lffyd/ForestSubtype . 
CONCLUSIONS: Our work shows that the combination of high-dimensional gene expression data and parallel random forests and autoencoder, guided by a priori knowledge, can identify new subtypes more effectively than existing methods of cancer subtype classification.
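
The feature-selection step described above (rank features in each of several parallel forests, then intersect the top-weighted sets) can be sketched as below. This is a minimal illustration, not the ForestSubtype implementation: the helper names `top_k` and `consensus_features`, the fixed cutoff `k`, and the feature weights are all hypothetical, and the abstract does not say how "larger weights" is thresholded. In practice each dictionary would hold the feature importances of one trained random forest.

```python
def top_k(importances, k):
    """Return the k feature names with the largest weights in one forest."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return set(ranked[:k])


def consensus_features(runs, k):
    """Intersect the top-k feature sets across all forests (assumed cutoff rule)."""
    if not runs:
        raise ValueError("at least one set of feature weights is required")
    selected = top_k(runs[0], k)
    for importances in runs[1:]:
        selected &= top_k(importances, k)
    return selected


# Hypothetical feature weights from three forests (placeholder gene names).
runs = [
    {"g1": 0.50, "g2": 0.30, "g3": 0.20},
    {"g1": 0.40, "g3": 0.35, "g2": 0.25},
    {"g2": 0.10, "g1": 0.60, "g3": 0.30},
]
stable = consensus_features(runs, k=2)
```

Only `g1` is in the top two of every forest here, so it alone survives. The intersection keeps features whose high rank is stable across forests rather than an artifact of one random seed.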


Subject(s)
Breast Neoplasms , Random Forest , Humans , Female , Genomics , Software
20.
Adv Exp Med Biol ; 1424: 273-279, 2023.
Article in English | MEDLINE | ID: mdl-37486504

ABSTRACT

A significant challenge in high-dimensional and big data analysis is related to the classification and prediction of the variables of interest. The massive genetic datasets are complex. Gene expression datasets are enriched with useful genes that are associated with specific diseases such as cancer. In this study, we used two gene expression datasets from the Gene Expression Omnibus and preprocessed them before classification. We used optimal kernel principal component analysis in which the optimal kernel function was chosen for dataset dimensionality reduction and extraction of the most important features. The gene sets with a high validity index were collected using a combined hierarchical clustering and optimal kernel principal component analysis (KHC-RLR) algorithm. Logistic regression is one of the most common methods for classification, and it has been shown to be a useful classification approach for gene expression data analysis. In this study, we used multi-class logistic regression to classify the collected gene sets. We found that ordinary logistic regression caused a major overfitting problem; therefore, we used regularized multi-class logistic regression to classify the gene sets. The proposed KHC-RLR algorithm showed high performance and satisfactory accuracy measures.


Subject(s)
Algorithms , Neoplasms , Humans , Logistic Models , Gene Expression Profiling/methods , Neoplasms/metabolism , Cluster Analysis