Results 1 - 20 of 34
1.
J Vis Exp ; (205)2024 Mar 01.
Article in English | MEDLINE | ID: mdl-38497637

ABSTRACT

The transcriptome represents the expression levels of many genes in a sample and has been widely used in biological research and clinical practice. Researchers have usually focused on transcriptomic biomarkers with differential representations between a phenotype group and a control group of samples. This study presented a multitask graph-attention network (GAT) learning framework to learn the complex inter-genic interactions of the reference samples. A demonstrative reference model was pre-trained on healthy samples (HealthModel) and could be used directly to generate the model-based quantitative transcriptional regulation (mqTrans) view of independent test transcriptomes. The generated mqTrans view of transcriptomes was demonstrated through prediction tasks and dark biomarker detection. The term "dark biomarker" reflects its definition: a dark biomarker shows differential representation in the mqTrans view but no differential expression at its original expression level. Dark biomarkers are therefore overlooked in traditional biomarker detection studies because they lack differential expression. The source code and the manual of the pipeline HealthModelPipe can be downloaded from http://www.healthinformaticslab.org/supp/resources.php.


Subject(s)
Gene Expression Profiling , Transcriptome , Gene Expression Regulation , Biomarkers , Phenotype
2.
Genes (Basel) ; 14(12)2023 Dec 01.
Article in English | MEDLINE | ID: mdl-38136991

ABSTRACT

A transcriptome profiles the expression levels of genes in cells, and a huge amount of public transcriptome data has accumulated. Most existing biomarker-related studies investigated the differential expression of individual transcriptomic features under the assumption of inter-feature independence. Many transcriptomic features without differential expression were excluded from the biomarker lists. This study proposed a computational analysis protocol (mqTrans) to analyze transcriptomes from the view of high-dimensional inter-feature correlations. The mqTrans protocol trained a regression model to predict the expression of an mRNA feature from those of the transcription factors (TFs). The difference between the predicted and real expression of an mRNA feature in a query sample was defined as the mqTrans feature. The new mqTrans view facilitated the detection of thirteen transcriptomic features with differentially expressed mqTrans features, but without differential expression in the original transcriptomic values, in three independent datasets of lung cancer. These features were called dark biomarkers because they would have been ignored in a conventional differential analysis. The detailed discussion of one dark biomarker, GBP5, and additional validation experiments suggested that the overlapping long non-coding RNAs might have contributed to this interesting phenomenon. In summary, this study aimed to find non-differentially expressed genes with significantly changed mqTrans values in lung cancer. Such genes are usually ignored in biomarker detection studies because they lack differential expression. However, their differentially expressed mqTrans values in three independent datasets suggested strong associations with lung cancer.
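The mqTrans definition in this abstract can be illustrated with a minimal sketch. It assumes ordinary least squares as the regression model and takes the mqTrans feature as predicted minus real expression; the abstract fixes neither choice, and all function names and toy values below are hypothetical.

```python
# Minimal mqTrans sketch (illustrative only; not the authors' pipeline).
# Assumption: ordinary least squares stands in for the regression model.

def fit_ols(tf_rows, mrna_values):
    """Fit mrna ~ intercept + sum(w_i * tf_i) by solving the normal equations."""
    X = [[1.0] + list(row) for row in tf_rows]  # prepend an intercept column
    k = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(X, mrna_values))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for row in range(k):
            if row != col:
                factor = A[row][col] / A[col][col]
                A[row] = [a - factor * b for a, b in zip(A[row], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(weights, tf_row):
    """Predicted mRNA expression from TF expression levels."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], tf_row))


def mqtrans_value(weights, tf_row, real_expression):
    """mqTrans feature, assumed here to be predicted minus real expression."""
    return predict(weights, tf_row) - real_expression


# Toy reference samples in which mRNA = 1 + 2*TF1 - TF2 exactly.
reference_tfs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 1.0), (1.0, 3.0)]
reference_mrna = [1 + 2 * a - b for a, b in reference_tfs]
weights = fit_ols(reference_tfs, reference_mrna)

# A query sample whose mRNA level departs from the reference relationship.
query_value = mqtrans_value(weights, (1.0, 1.0), real_expression=0.5)
```

A value near zero would mean the query sample follows the TF-to-mRNA relationship learned from the reference samples; a large absolute value signals a change in that relationship.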


Subject(s)
Lung Neoplasms , Humans , Lung Neoplasms/genetics , Lung Neoplasms/diagnosis , Gene Expression Profiling , Transcriptome/genetics , Biomarkers , RNA, Messenger/genetics
3.
BMC Infect Dis ; 23(1): 622, 2023 Sep 21.
Article in English | MEDLINE | ID: mdl-37735372

ABSTRACT

BACKGROUND: Coronavirus disease 2019 (COVID-19) is a rapidly developing and sometimes lethal pulmonary disease. Accurately predicting COVID-19 mortality would facilitate optimal patient treatment and medical resource deployment, but clinical practice has yet to address this need. Both complete blood counts and cytokine levels have been observed to be altered by COVID-19 infection. This study aimed to use inexpensive and easily accessible complete blood counts to build an accurate COVID-19 mortality prediction model. Cytokine fluctuations reflect the inflammatory storm induced by COVID-19, but cytokine levels are not as commonly accessible as complete blood counts. Therefore, this study explored the possibility of predicting cytokine levels from complete blood counts. METHODS: We used complete blood counts to predict cytokine levels. The predictive model includes an autoencoder, principal component analysis, and linear regression models. We used classifiers such as support vector machine and feature selection models such as adaptive boost to predict the mortality of COVID-19 patients. RESULTS: Complete blood counts and original cytokine levels reached COVID-19 mortality classification area under the curve (AUC) values of 0.9678 and 0.9111, respectively, and the cytokine levels predicted by the feature set alone reached a classification AUC value of 0.9844. The predicted cytokine levels were more significantly associated with COVID-19 mortality than the original values. CONCLUSIONS: Integrating the predicted cytokine levels with complete blood counts improved on a COVID-19 mortality prediction model that used complete blood counts only. Both the cytokine level prediction models and the COVID-19 mortality prediction models are publicly available at http://www.healthinformaticslab.org/supp/resources.php .


Subject(s)
COVID-19 , Humans , Area Under Curve , Cytokines , Linear Models , Principal Component Analysis
4.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37427963

ABSTRACT

Survival analysis is critical to cancer prognosis estimation. High-throughput technologies have increased the dimension of genic features, but the number of clinical samples in cohorts remains relatively small for various reasons, including difficulties in participant recruitment and high data-generation costs. The transcriptome is one of the most abundantly available OMIC data types (OMIC refers to high-throughput data, including genomic, transcriptomic, proteomic and epigenomic data). This study introduced a multitask graph attention network (GAT) framework, DQSurv, for the survival analysis task. We first used a large dataset of healthy tissue samples to pretrain the GAT-based HealthModel for the quantitative measurement of gene regulatory relations. The multitask survival analysis framework DQSurv used the idea of transfer learning to initialize the GAT model with the pretrained HealthModel and further fine-tuned this model using two tasks, i.e., the main task of survival analysis and the auxiliary task of gene expression prediction. This refined GAT was denoted as DiseaseModel. We fused the original transcriptomic features with the difference vector between the latent features encoded by the HealthModel and DiseaseModel for the final task of survival analysis. The proposed DQSurv model consistently outperformed the existing models for the survival analysis of 10 benchmark cancer types and an independent dataset. The ablation study also supported the necessity of the main modules. We released the code and the pretrained HealthModel to facilitate feature encoding and survival analysis in future transcriptome-based studies, especially on small datasets. The model and the code are available at http://www.healthinformaticslab.org/supp/.


Subject(s)
Algorithms , Neoplasms , Humans , Proteomics , Survival Analysis
5.
Genes (Basel) ; 14(6)2023 05 24.
Article in English | MEDLINE | ID: mdl-37372321

ABSTRACT

Background: Colon cancer (CC) is common, and the mortality rate greatly increases as the disease progresses to the metastatic stage. Early detection of metastatic colon cancer (mCC) is crucial for reducing the mortality rate. Most previous studies have focused on the top-ranked differentially expressed transcriptomic biomarkers between mCC and primary CC while ignoring non-differentially expressed genes. Results: This study proposed that the complicated inter-feature correlations could be quantitatively formulated as a complementary transcriptomic view. We used a regression model to formulate the correlation between the expression levels of a messenger RNA (mRNA) and its regulatory transcription factors (TFs). The change between the predicted and real expression levels of a query mRNA was defined as the mqTrans value in the given sample, reflecting transcription regulatory changes compared with the model-training samples. A dark biomarker in mCC is defined as an mRNA gene that is non-differentially expressed in mCC but demonstrates mqTrans values significantly associated with mCC. This study detected seven dark biomarkers using 805 samples from three independent datasets. Evidence from the literature supports the role of some of these dark biomarkers. Conclusions: This study presented a complementary high-dimensional analysis procedure for transcriptome-based biomarker investigations with a case study on mCC.
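The dark biomarker definition above can be sketched as a two-part screen. This is an illustration under simplifying assumptions, not the study's procedure: it uses Welch's t statistic with a hypothetical cutoff on |t| instead of p-values with multiple-testing correction, and all names and toy values are invented for the example.

```python
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples."""
    se2 = variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    return (mean(group_a) - mean(group_b)) / se2 ** 0.5


def is_dark_biomarker(expr_a, expr_b, mq_a, mq_b, t_cutoff=2.0):
    """Flag a gene whose expression is not differential but whose mqTrans values are.

    t_cutoff is a hypothetical threshold on |t|; a real analysis would use
    p-values with multiple-testing correction.
    """
    expression_differs = abs(welch_t(expr_a, expr_b)) >= t_cutoff
    mqtrans_differs = abs(welch_t(mq_a, mq_b)) >= t_cutoff
    return mqtrans_differs and not expression_differs


# Toy values for one gene in primary (a) and metastatic (b) samples.
expr_primary = [5.0, 5.2, 4.9, 5.1]
expr_metastatic = [5.1, 4.9, 5.0, 5.2]
mq_primary = [0.1, -0.1, 0.0, 0.05]
mq_metastatic = [1.0, 1.2, 0.9, 1.1]

dark = is_dark_biomarker(expr_primary, expr_metastatic, mq_primary, mq_metastatic)
```

In the toy data the two groups have nearly identical raw expression but clearly separated mqTrans values, so the gene is flagged.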


Subject(s)
Colonic Neoplasms , Gene Expression Profiling , Humans , Biomarkers , Gene Expression Profiling/methods , Transcriptome/genetics , Colonic Neoplasms/genetics , Colonic Neoplasms/pathology , RNA, Messenger/genetics
6.
Comput Biol Med ; 160: 107030, 2023 06.
Article in English | MEDLINE | ID: mdl-37196456

ABSTRACT

Methylation is a major epigenetic modification of DNA that regulates biological processes without altering the DNA sequence, and multiple types of DNA methylation have been discovered, including 6mA, 5hmC, and 4mC. Multiple computational approaches have been developed to automatically identify DNA methylation residues using machine learning or deep learning algorithms. Machine learning (ML) based methods are difficult to transfer to other DNA methylation site prediction tasks using additional knowledge. Deep learning (DL) methods may facilitate the transfer of knowledge from similar tasks, but they are often ineffective on small datasets. This study proposes an integrated feature representation framework, EpiTEAmDNA, based on the strategies of transfer learning and ensemble learning, which is evaluated on multiple DNA methylation types across 15 species. EpiTEAmDNA integrates a convolutional neural network (CNN) with conventional machine learning methods and performs better than the existing DL-based methods on small datasets when no additional knowledge is available. The experimental data suggest that the EpiTEAmDNA models may be further improved via transfer learning based on additional knowledge. The evaluation experiments on the independent test datasets also suggest that the proposed EpiTEAmDNA framework outperforms the existing models in most prediction tasks of the 3 DNA methylation types across 15 species. The source code, pre-trained global model, and the EpiTEAmDNA feature representation framework are freely available at http://www.healthinformaticslab.org/supp/.


Subject(s)
Machine Learning , Neural Networks, Computer , DNA/genetics , Epigenesis, Genetic , DNA Methylation
7.
Comput Biol Chem ; 104: 107858, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37058814

ABSTRACT

Colon cancer is a common cancer type in both sexes and its mortality rate increases at the metastatic stage. Most studies exclude nondifferentially expressed genes from biomarker analysis of metastatic colon cancers. The motivation of this study is to find the latent associations of the nondifferentially expressed genes with metastatic colon cancers and to evaluate the gender specificity of such associations. This study formulates the expression level prediction of a gene as a regression model trained for primary colon cancers. The difference between a gene's predicted and original expression levels in a testing sample is defined as its mqTrans value (model-based quantitative measure of transcription regulation), which quantitatively measures the change of the gene's transcription regulation in this testing sample. We use the mqTrans analysis to detect the messenger RNA (mRNA) genes with nondifferential expression on their original expression levels but differentially expressed mqTrans values between primary and metastatic colon cancers. These genes are referred to as dark biomarkers of metastatic colon cancer. All dark biomarker genes were verified by two transcriptome profiling technologies, RNA-seq and microarray. The mqTrans analysis of a mixed cohort of both sexes could not recover gender-specific dark biomarkers. Most dark biomarkers overlap with long non-coding RNAs (lncRNAs), and these lncRNAs might have contributed their transcripts to calculating the dark biomarkers' expression levels. Therefore, mqTrans analysis serves as a complementary approach to identify dark biomarkers generally ignored by conventional studies, and it is essential to separate the female and male samples into two analysis experiments. The dataset and mqTrans analysis code are available at https://figshare.com/articles/dataset/22250536.


Subject(s)
Adenocarcinoma , Colonic Neoplasms , RNA, Long Noncoding , Humans , Male , Female , RNA, Long Noncoding/genetics , Biomarkers, Tumor/genetics , Biomarkers, Tumor/metabolism , Colonic Neoplasms/genetics , Gene Expression Profiling , Adenocarcinoma/genetics , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks
8.
Per Med ; 20(2): 143-155, 2023 03.
Article in English | MEDLINE | ID: mdl-36705049

ABSTRACT

Aim: Transcriptional regulation is actively involved in the onset and progression of various diseases. This study used the feature-engineering approach model-based quantitative transcription regulation to quantitatively measure the correlation between mRNA and transcription factors in a reference dataset of chronic lymphocytic leukemia (CLL) transcriptomes. Methods: A comprehensive investigation of transcriptional regulation changes in CLL was conducted using 973 samples in six independent datasets. Results & conclusion: Seven mRNAs were detected to have significantly differential model-based quantitative transcription regulation values but no differential expression between CLL patients and controls. We called these genes 'dark biomarkers' because their original expression levels did not show differential changes in the CLL patients. The overlapping lncRNAs might have contributed their transcripts to the expression miscalculations of these dark biomarkers.


Subject(s)
Leukemia, Lymphocytic, Chronic, B-Cell , Humans , Leukemia, Lymphocytic, Chronic, B-Cell/genetics , Leukemia, Lymphocytic, Chronic, B-Cell/metabolism , Transcription Factors/genetics , Transcriptome/genetics , Biomarkers, Tumor/genetics
9.
Genes (Basel) ; 13(10)2022 10 21.
Article in English | MEDLINE | ID: mdl-36292801

ABSTRACT

Melanoma is a lethal skin disease that develops from moles. This study aimed to integrate multimodal data to predict metastatic melanoma, which is highly aggressive and difficult to treat. The proposed EnsembleSKCM method evaluated the prediction performances of long noncoding RNAs (lncRNAs), protein-coding messenger RNAs (mRNAs) and pathology images (images) for metastatic melanoma. Feature selection was used to screen for metastatic biomarkers in the lncRNA and mRNA datasets. The integrated EnsembleSKCM model was built on the weighted results of the lncRNA-, mRNA- and image-based models. EnsembleSKCM achieved a prediction accuracy of 0.9444 for metastatic melanoma and outperformed the single-modal prediction models based on the lncRNA, mRNA and image data. The experimental data suggest the importance of integrating the complementary information from the three data modalities. WGCNA was used to analyze the relationship between molecular-level features and image features, and the results show connections between them. Another cohort was used to validate our prediction.
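The phrase "weighted results" of the three single-modal models can be read as a weighted average of their outputs. The sketch below shows that reading only; the abstract does not state how EnsembleSKCM derives or applies its weights, so the weights, probabilities, and 0.5 decision threshold are all hypothetical.

```python
def weighted_ensemble(probabilities, weights):
    """Combine per-model metastasis probabilities by a weighted average."""
    total = sum(weights.values())
    return sum(weights[name] * probabilities[name] for name in weights) / total


# Hypothetical per-modality outputs for one patient and hypothetical weights.
model_probabilities = {"lncRNA": 0.70, "mRNA": 0.40, "image": 0.90}
model_weights = {"lncRNA": 0.3, "mRNA": 0.3, "image": 0.4}

score = weighted_ensemble(model_probabilities, model_weights)
predicted_metastatic = score >= 0.5  # hypothetical decision threshold
```

Here the image-based model carries the largest weight, so its confident prediction outweighs the weaker mRNA-based one.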


Subject(s)
Melanoma , Neoplasms, Second Primary , RNA, Long Noncoding , Humans , RNA, Long Noncoding/genetics , Melanoma/diagnostic imaging , Melanoma/genetics , Melanoma/pathology , RNA, Messenger/genetics , Biomarkers
10.
J Bioinform Comput Biol ; 20(3): 2250013, 2022 06.
Article in English | MEDLINE | ID: mdl-35818996

ABSTRACT

Modern biotechnologies have generated huge amounts of OMIC data, among which transcriptomes and methylomes are two major types. Transcriptomes measure the expression levels of all transcripts, while methylomes depict the cytosine methylation levels across a genome. Both OMIC data types can be generated by arrays or sequencing. Some studies deliver many more features per sample (the number of features is denoted as p) than the number n of samples in a cohort, which leads to the "large p small n" paradigm. This study focused on the classification problem for OMIC data under the "large p small n" paradigm. A Siamese convolutional network was used to transform the OMIC features into a new space with minimized intra-class distances and maximized inter-class distances between the samples. The proposed feature engineering algorithm SiaCo was comprehensively evaluated using both transcriptome and methylome datasets. The experimental data showed that the SiaCo features improved classification accuracy for binary classification problems and also achieved improvements on the independent test dataset. The individual SiaCo features did not show better inter-class discrimination power than the original OMIC features. This may be because the Siamese convolutional network optimized the collective performance of the SiaCo features rather than the discrimination power of each individual feature. The inherent transformation nature of the Siamese twin network also means that the SiaCo features lack interpretability. The source code of SiaCo is freely available at http://www.healthinformaticslab.org/supp/resources.php.
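The goal of small intra-class and large inter-class distances is commonly expressed with a contrastive loss over pairs of embedded samples. The sketch below shows that standard loss as an assumption; the abstract does not state which loss SiaCo uses, and the margin value and embeddings are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(embedding_a, embedding_b, same_class, margin=1.0):
    """Standard contrastive loss for one pair of embedded samples.

    Same-class pairs are penalized by squared distance; different-class pairs
    are penalized only when closer than `margin` (a hypothetical setting).
    """
    d = euclidean(embedding_a, embedding_b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Three toy pairs: same class at distance 0.5, different classes at 0.5 and 5.0.
close_pair = contrastive_loss([0.0, 0.0], [0.3, 0.4], same_class=True)
near_rivals = contrastive_loss([0.0, 0.0], [0.3, 0.4], same_class=False)
far_rivals = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=False)
```

Minimizing such a loss over many pairs acts on the embedding as a whole, which is consistent with the abstract's remark that individual transformed features need not be more discriminative on their own.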


Subject(s)
Algorithms , Genome , Humans , Software
11.
Comput Biol Med ; 148: 105883, 2022 09.
Article in English | MEDLINE | ID: mdl-35878490

ABSTRACT

The transcriptome describes the expression of all genes in a sample. Most studies have investigated the differential patterns or discrimination powers of transcript expression levels. In this study, we hypothesized that the quantitative correlations between the expression levels of transcription factors (TFs) and their regulated target genes (mRNAs) serve as a novel view of healthy status, and a disease sample exhibits a differential landscape (mqTrans) of transcription regulations compared with healthy status. We formulated quantitative transcription regulation relationships of metabolism-related genes as a multi-input multi-output regression model via a gated recurrent unit (GRU) network. The GRU model was trained using healthy blood transcriptomes and the expression levels of mRNAs were predicted by those of the TFs. The mqTrans feature of a gene was defined as the difference between its predicted and actual expression levels. A pan-cancer investigation of the differentially expressed mqTrans features was conducted between the early- and late-stage cancers in 26 cancer types of The Cancer Genome Atlas database. This study focused on the differentially expressed mqTrans features whose genes did not show differential expression in their actual expression levels. These genes could not be detected by conventional differential analysis. Such dark biomarkers are worthy of further wet-lab investigation. The experimental data also showed that the proposed mqTrans investigation improved the classification between early- and late-stage samples for some cancer types. Thus, the mqTrans features serve as a complementary view to transcriptomes, an OMIC type with mature high-throughput production technologies and abundant public resources.


Subject(s)
Gene Expression Regulation, Neoplastic , Neoplasms , Gene Expression Profiling , Gene Regulatory Networks , Humans , RNA, Messenger , Transcription Factors , Transcriptome
12.
Skin Res Technol ; 28(5): 677-688, 2022 Sep.
Article in English | MEDLINE | ID: mdl-35639819

ABSTRACT

BACKGROUND: Acne is one of the most common skin conditions in adolescents. Severe or inflammatory acne can lead to scars, which may have major impacts on patients' quality of life or even job prospects. Grading acne plays an important role in diagnosis, and the grade is determined by counting the number of acne lesions. This is labor-intensive and error-prone for dermatologists, so it is very important to develop automatic diagnosis methods. Ensemble learning may improve the prediction results of the base models, but its time complexity is relatively high. The ensemble pruning strategy may solve this computational challenge by removing redundant base models. MATERIALS AND METHODS: This study proposed a novel ensemble pruning framework of deep learning models to accurately detect and grade acne using images. First, we trained multiple base models and pruned the redundant models according to the performance and diversity of the models. Then, we constructed new features of the training data using the base models selected in the previous step. Next, we removed further redundant models with a feature selection algorithm. Finally, we integrated the remaining base models by classifiers. The ensemble pruning algorithm was proposed to prune the deep learning base models. RESULTS: The experimental data showed that the ensemble pruned framework achieved a prediction accuracy of 85.82% on the acne dataset, better than the existing studies. To verify the effectiveness of our method, we tested it on a skin cancer dataset, where it greatly outperformed the state-of-the-art methods. CONCLUSION: The proposed method grades acne. It outperforms state-of-the-art methods on two datasets, and it also removes redundant models to reduce computational complexity.


Subject(s)
Acne Vulgaris , Deep Learning , Acne Vulgaris/diagnostic imaging , Adolescent , Algorithms , Humans , Quality of Life
13.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-35514183

ABSTRACT

Human Leukocyte Antigen (HLA) is a type of molecule residing on the surfaces of most human cells and plays an essential role in the immune system's response to foreign invaders. The T cell antigen receptors may recognize the HLA-peptide complexes on the surfaces of cancer cells, and cytotoxic T lymphocytes may then destroy these cancer cells. The computational determination of HLA-binding peptides will facilitate the rapid development of cancer immunotherapies. This study hypothesized that the natural language processing-encoded peptide features may be further enriched by another deep neural network. The hypothesis was tested with the Bi-directional Long Short-Term Memory-extracted features from the pretrained Protein Bidirectional Encoder Representations from Transformers-encoded features of the class I HLA (HLA-I)-binding peptides. The experimental data showed that our proposed HLAB feature engineering algorithm outperformed the existing ones in detecting the HLA-I-binding peptides. The extensive evaluation data show that the proposed HLAB algorithm outperforms all seven existing studies in AUC on predicting the peptides binding to the HLA-A*01:01 allele, and achieves the best average AUC values on six of the seven k-mer prediction tasks (k = 8, 9, ..., 14, where each task concerns peptides consisting of k amino acids), the exception being the 9-mer task. The source code and the fine-tuned feature extraction models are available at http://www.healthinformaticslab.org/supp/resources.php.


Subject(s)
Histocompatibility Antigens Class I , Peptides , Amino Acids/metabolism , HLA Antigens/chemistry , HLA Antigens/genetics , HLA-A Antigens/metabolism , Histocompatibility Antigens Class I/chemistry , Humans , Peptides/chemistry , Protein Binding
14.
Mol Ther Nucleic Acids ; 28: 477-487, 2022 Jun 14.
Article in English | MEDLINE | ID: mdl-35505964

ABSTRACT

Immune thrombocytopenia (ITP) is an autoimmune disease with the typical symptom of a low platelet count in blood. ITP shows age and sex biases in both occurrence and prognosis, and adult ITP is mainly induced by living environments. The current diagnosis guideline does not integrate molecular heterogeneity. This study recruited the largest cohort of platelet transcriptome samples. A comprehensive procedure of feature selection, feature engineering, and stacking classification was carried out to detect the ITP biomarkers using RNA sequencing (RNA-seq) transcriptomes. The 40 detected biomarkers were used to train the final ITP detection model, which achieved an overall accuracy of 0.974. The biomarkers suggested that ITP onset may be associated with various transcribed components, including protein-coding genes, long intergenic non-coding RNA (lincRNA) genes, and pseudogenes with apparent transcription. The delivered ITP detection model may also be used as a complementary ITP diagnosis tool. The code and the example dataset are freely available at http://www.healthinformaticslab.org/supp/resources.php.

15.
Mol Psychiatry ; 27(4): 2114-2125, 2022 04.
Article in English | MEDLINE | ID: mdl-35136228

ABSTRACT

Small average differences in the left-right asymmetry of cerebral cortical thickness have been reported in individuals with autism spectrum disorder (ASD) compared to typically developing controls, affecting widespread cortical regions. The possible impacts of these regional alterations in terms of structural network effects have not previously been characterized. Inter-regional morphological covariance analysis can capture network connectivity between different cortical areas at the macroscale level. Here, we used cortical thickness data from 1455 individuals with ASD and 1560 controls, across 43 independent datasets of the ENIGMA consortium's ASD Working Group, to assess hemispheric asymmetries of intra-individual structural covariance networks, using graph theory-based topological metrics. Compared with typical features of small-world architecture in controls, the ASD sample showed significantly altered average asymmetry of networks involving the fusiform, rostral middle frontal, and medial orbitofrontal cortex, involving higher randomization of the corresponding right-hemispheric networks in ASD. A network involving the superior frontal cortex showed decreased right-hemisphere randomization. Based on comparisons with meta-analyzed functional neuroimaging data, the altered connectivity asymmetry particularly affected networks that subserve executive functions, language-related and sensorimotor processes. These findings provide a network-level characterization of altered left-right brain asymmetry in ASD, based on a large combined sample. Altered asymmetrical brain development in ASD may be partly propagated among spatially distant regions through structural connectivity.


Subject(s)
Autism Spectrum Disorder , Brain , Brain Mapping , Cerebral Cortex/diagnostic imaging , Humans , Magnetic Resonance Imaging/methods , Neural Pathways
16.
J Healthc Eng ; 2021: 6698176, 2021.
Article in English | MEDLINE | ID: mdl-34188791

ABSTRACT

Results: This study developed the mole detection and segmentation software DiaMole, which uses mobile phone images. DiaMole utilized multiple deep learning algorithms for the object detection problem and the mole segmentation problem. An object detection algorithm generated a rectangle tightly surrounding a mole in the mobile phone image, and the segmentation algorithm then detected the precise boundary of that mole. Three deep learning algorithms were evaluated for their object detection performance. The popular performance metric mean average precision (mAP) was used to evaluate the algorithms. Among the utilized algorithms, Faster R-CNN achieved the best mAP of 0.835, and the integrated algorithm achieved an mAP of 0.4228. Although the integrated algorithm did not achieve the best mAP, it can avoid missing moles. A popular Unet model was utilized to find the precise mole boundary. Clinical users may annotate the detected moles based on their experience. Conclusions: DiaMole is user-friendly software for researchers focusing on skin lesions. DiaMole may automatically detect and segment the moles in mobile phone skin images. The users may also annotate each candidate mole according to their own experience. The automatically calculated mole image masks and the annotations may be saved for further investigations.


Subject(s)
Algorithms , Cell Phone , Humans , Image Processing, Computer-Assisted/methods , Skin , Software
17.
Comput Biol Med ; 135: 104571, 2021 08.
Article in English | MEDLINE | ID: mdl-34166881

ABSTRACT

Cancer is one of the major causes of mortality worldwide. Regional lymph node metastasis is an important mechanism during the spread of human cancers, in which transcription regulation plays an essential role. This study formulated a regression-model-based quantitative transcription regulation (mqTrans) between one mRNA gene and multiple transcription factors (TFs). Computational pan-cancer screening was carried out to detect the quantitative dysregulation of transcription regulation in the regional lymph node metastasis of 18 cancer types. Only a few metastasis-dysregulated mqTrans models were shared among the cancer types. The mRNA genes of the metastasis-dysregulated mqTrans models were not differentially expressed in regional lymph node metastasis. The experimental data suggested that mqTrans technology provided a complementary approach to the evaluation of transcription regulation mechanisms and may facilitate its quantitative investigation in other phenotypes.


Subject(s)
Lymph Nodes , Humans , Lymphatic Metastasis , RNA, Messenger
19.
Int J Mol Sci ; 22(6)2021 Mar 17.
Article in English | MEDLINE | ID: mdl-33802922

ABSTRACT

Enhancers are short genomic regions exerting tissue-specific regulatory roles, usually for remote coding regions. Enhancers are observed in both prokaryotic and eukaryotic genomes, and their detection facilitates a better understanding of the transcriptional regulation mechanism. The accurate detection of enhancers and the evaluation of their transcriptional regulation strength remain a major bioinformatics challenge. Most current studies utilize the statistical features of short fixed-length nucleotide sequences. This study introduces the location information of each k-mer (SeqPose) into the encoding strategy of a DNA sequence and employs the attention mechanism in the two-layer bi-directional long-short term memory (BD-LSTM) model (spEnhancer) for the enhancer detection problem. The first layer of the delivered classifier discriminates between enhancers and non-enhancers, and the second layer evaluates the transcriptional regulation strength of the detected enhancer. The SeqPose-encoded features are selected by the Chi-squared test, and 45 positions are removed from further analysis. Existing studies may focus on selecting the statistical DNA sequence descriptors with large contributions to the prediction models. This study does not utilize these statistical DNA sequence descriptors. Instead, the word vector of the SeqPose-encoded features is obtained by using the word embedding layer. This study hypothesizes that different word vector features may contribute differently to the enhancer detection model, and assigns different weights to these word vectors through the attention mechanism in the BD-LSTM model. The previous study generously provided the training and independent test datasets, and the proposed spEnhancer is compared with the three existing state-of-the-art studies using the same experimental procedure. The leave-one-out validation data on the training dataset show that the proposed spEnhancer achieves detection performance similar to that of the three existing studies. On the independent test dataset, however, spEnhancer achieves the best overall performance metric MCC for both binary classification problems. The experimental data show that the strategy of removing redundant positions (SeqPose) may help improve DNA sequence-based prediction models. spEnhancer may serve well as a complementary model to the existing studies, especially for novel query enhancers that are not included in the training dataset.


Subject(s)
Algorithms , Computational Biology/methods , Enhancer Elements, Genetic , Base Sequence , Databases, Genetic
20.
Comput Biol Med ; 133: 104405, 2021 06.
Article in English | MEDLINE | ID: mdl-33930763

ABSTRACT

The era of big data introduces both opportunities and challenges for biomedical researchers. One of the inherent difficulties in the biomedical research field is to recruit large cohorts of samples, while high-throughput biotechnologies may produce thousands or even millions of features for each sample. Researchers tend to evaluate the individual correlation of each feature with the class label and use the incremental feature selection (IFS) strategy to select the top-ranked features with the best prediction performance. Recent experimental data showed that a subset of consecutively ranked features randomly restarted from a low-ranked feature (an RIFS block) may outperform the subset of top-ranked features. This study proposed a feature selection algorithm, RIFS2D, that integrates multiple RIFS blocks. A comprehensive comparative experiment was conducted with the IFS, RIFS and existing feature selection algorithms and demonstrated that a subset of low-ranked features may also achieve promising prediction performance. This study suggested that a prediction model with promising performance may be trained on low-ranked features, even when top-ranked features do not achieve satisfactory prediction performance. Further comparative experiments were conducted between RIFS2D and t-tests for the detection of early-stage breast cancer. The data showed that the RIFS2D-recommended features achieved better prediction accuracy and were targeted by more drugs than the t-test top-ranked features.
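The contrast between IFS and a single RIFS block can be sketched as follows. This is a simplified illustration, not the RIFS2D algorithm: the restart position is fixed rather than random so that the example is deterministic, the integration of multiple blocks is omitted, and the evaluator, feature names, and scores are hypothetical.

```python
def best_block(ranked_features, start, evaluate):
    """Grow a block of consecutively ranked features from `start`; keep the best."""
    best_score, best_subset = float("-inf"), []
    for end in range(start + 1, len(ranked_features) + 1):
        subset = ranked_features[start:end]
        score = evaluate(subset)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_score, best_subset


def ifs(ranked_features, evaluate):
    """Incremental feature selection: the block always starts at rank 1."""
    return best_block(ranked_features, 0, evaluate)


# Toy evaluator (hypothetical): "f4" and "f5" are jointly informative,
# and every extra feature costs a small penalty.
def toy_evaluate(subset):
    informative = {"f4", "f5"}
    return len(informative & set(subset)) / 2 - 0.05 * len(subset)


ranked = ["f1", "f2", "f3", "f4", "f5", "f6"]  # ordered by individual ranking
ifs_score, ifs_subset = ifs(ranked, toy_evaluate)
rifs_score, rifs_subset = best_block(ranked, 3, toy_evaluate)  # restart at rank 4
```

In this toy setting IFS must carry the three uninformative top-ranked features to reach "f4" and "f5", while the block restarted at rank 4 reaches them directly and scores higher with fewer features.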


Subject(s)
Algorithms , Biomarkers