Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 12 de 12
Filter
Add more filters










Publication year range
1.
J Chem Inf Model ; 64(10): 4322-4333, 2024 May 27.
Article in English | MEDLINE | ID: mdl-38733561

ABSTRACT

Revealing the mechanisms that influence transcription factor binding specificity is the key to understanding gene regulation. In previous studies, DNA double helix structure and one-hot embedding have been used successfully to design computational methods for predicting transcription factor binding sites (TFBSs). However, DNA sequence as a kind of biological language, the method of word embedding representation in natural language processing, has not been considered properly in TFBS prediction models. In our work, we integrate different types of features of DNA sequence to design a multichanneled deep learning framework, namely MulTFBS, in which independent one-hot encoding, word embedding encoding, which can incorporate contextual information and extract the global features of the sequences, and double helix three-dimensional structural features have been trained in different channels. To extract sequence high-level information effectively, in our deep learning framework, we select the spatial-temporal network by combining convolutional neural networks and bidirectional long short-term memory networks with attention mechanism. Compared with six state-of-the-art methods on 66 universal protein-binding microarray data sets of different transcription factors, MulTFBS performs best on all data sets in the regression tasks, with the average R2 of 0.698 and the average PCC of 0.833, which are 5.4% and 3.2% higher, respectively, than the suboptimal method CRPTS. In addition, we evaluate the classification performance of MulTFBS for distinguishing bound or unbound regions on TF ChIP-seq data. The results show that our framework also performs well in the TFBS classification tasks.


Subject(s)
Transcription Factors , Transcription Factors/metabolism , Transcription Factors/chemistry , Binding Sites , Deep Learning , DNA/chemistry , DNA/metabolism , Computational Biology/methods , Neural Networks, Computer
2.
J Bioinform Comput Biol ; 22(1): 2450001, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38406833

ABSTRACT

Antimicrobial peptides (AMPs), as the preferred alternatives to antibiotics, have wide application with good prospects. Identifying AMPs through wet lab experiments remains expensive, time-consuming and challenging. Many machine learning methods have been proposed to predict AMPs and achieved good results. In this work, we combine two kinds of word embedding features with the statistical features of peptide sequences to develop an ensemble classifier, named EnAMP, in which, two deep neural networks are trained based on Word2vec and Glove word embedding features of peptide sequences, respectively, meanwhile, we utilize statistical features of peptide sequences to train random forest and support vector machine classifiers. The average of four classifiers is the final prediction result. Compared with other state-of-the-art algorithms on six datasets, EnAMP outperforms most existing models with similar computational costs, even when compared with high computational cost algorithms based on Bidirectional Encoder Representation from Transformers (BERT), the performance of our model is comparable. EnAMP source code and the data are available at https://github.com/ruisue/EnAMP.


Subject(s)
Deep Learning , Algorithms , Neural Networks, Computer , Anti-Bacterial Agents/pharmacology , Peptides
3.
Math Biosci Eng ; 20(9): 15809-15829, 2023 07 31.
Article in English | MEDLINE | ID: mdl-37919990

ABSTRACT

Transcription factors (TFs) are important factors that regulate gene expression. Revealing the mechanism affecting the binding specificity of TFs is the key to understanding gene regulation. Most of the previous studies focus on TF-DNA binding sites at the sequence level, and they seldom utilize the contextual features of DNA sequences. In this paper, we develop an integrated spatiotemporal context-aware neural network framework, named GNet, for predicting TF-DNA binding signal at single nucleotide resolution by achieving three tasks: single nucleotide resolution signal prediction, identification of binding regions at the sequence level, and TF-DNA binding motif prediction. GNet extracts implicit spatial contextual information with a gated highway neural mechanism, which captures large context multi-level patterns using linear shortcut connections, and the idea of it permeates the encoder and decoder parts of GNet. The improved dual external attention mechanism, which learns implicit relationships both within and among samples, and improves the performance of the model. Experimental results on 53 human TF ChIP-seq datasets and 6 chromatin accessibility ATAC-seq datasets shows that GNet outperforms the state-of-the-art methods in the three tasks, and the results of cross-species studies on 15 human and 18 mouse TF datasets of the corresponding TF families indicate that GNet also shows the best performance in cross-species prediction over the competitive methods.


Subject(s)
Nucleotides , Transcription Factors , Humans , Animals , Mice , Nucleotides/metabolism , Protein Binding , Transcription Factors/genetics , Chromatin , DNA
4.
J Comput Biol ; 30(12): 1289-1304, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38010531

ABSTRACT

Interleukins (ILs) are a group of multifunctional cytokines, which play important roles in immune regulations and inflammatory responses. Recently, IL-6 has been found to affect the development of COVID-19, and significantly elevated levels of IL-6 cytokines have been reported in patients with severe COVID-19. IL-10 and IL-17 are anti-inflammatory and proinflammatory cytokines, respectively, which play multiple protective roles in host defense against pathogens. At present, a number of machine learning methods have been proposed to predict ILs inducing peptides, but their predictive performance needs to be further improved, and the inducing peptides of different ILs are predicted separately, rather than using a general approach. In our work, we combine the statistical features of peptide sequence with word embedding to design a general ensemble model named EnILs to predict inducing peptides of different ILs, in which the predictive probabilities of random forest, eXtreme Gradient Boosting and neural network are integrated in an average way. Compared with the state-of-the-art machine learning methods, EnILs shows considerable performance in the prediction of IL-6, IL-10, and IL-17 inducing peptides. In addition, we predict the most promising IL-6 inducing peptides in Severe Acute Respiratory Syndrome Coronavirus 2 spike protein in the case study for further experimental verification.


Subject(s)
COVID-19 , Interleukin-17 , Humans , Interleukin-10 , Interleukin-6 , Interleukins/metabolism , Peptides , Cytokines
5.
Biosaf Health ; 5(3): 152-158, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37362223

ABSTRACT

Human-virus protein-protein interactions (PPIs) play critical roles in viral infection. For example, the spike protein of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) binds primarily to human angiotensin-converting enzyme 2 (ACE2) protein to infect human cells. Thus, identifying and blocking these PPIs contribute to controlling and preventing viruses. However, wet-lab experiment-based identification of human-virus PPIs is usually expensive, labor-intensive, and time-consuming, which presents the need for computational methods. Many machine-learning methods have been proposed recently and achieved good results in predicting human-virus PPIs. However, most methods are based on protein sequence features and apply manually extracted features, such as statistical characteristics, phylogenetic profiles, and physicochemical properties. In this work, we present an embedding-based neural framework with convolutional neural network (CNN) and bi-directional long short-term memory unit (Bi-LSTM) architecture, named Emvirus, to predict human-virus PPIs (including human-SARS-CoV-2 PPIs). In addition, we conduct cross-viral experiments to explore the generalization ability of Emvirus. Compared to other feature extraction methods, Emvirus achieves better prediction accuracy.

6.
Comput Biol Med ; 146: 105697, 2022 07.
Article in English | MEDLINE | ID: mdl-35697529

ABSTRACT

Recent advances in single-cell RNA sequencing (scRNA-seq) provide exciting opportunities for transcriptome analysis at single-cell resolution. Clustering individual cells is a key step to reveal cell subtypes and infer cell lineage in scRNA-seq analysis. Although many dedicated algorithms have been proposed, clustering quality remains a computational challenge for scRNA-seq data, which is exacerbated by inflated zero counts due to various technical noise. To address this challenge, we assess the combinations of nine popular dropout imputation methods and eight clustering methods on a collection of 10 well-annotated scRNA-seq datasets with different sample sizes. Our results show that (i) imputation algorithms do typically improve the performance of clustering methods, and the quality of data visualization using t-Distributed Stochastic Neighbor Embedding; and (ii) the performance of a particular combination of imputation and clustering methods varies with dataset size. For example, the combination of single-cell analysis via expression recovery and Sparse Subspace Clustering (SSC) methods usually works well on smaller datasets, while the combination of adaptively-thresholded low-rank approximation and single-cell interpretation via multikernel learning (SIMLR) usually achieves the best performance on larger datasets.


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Algorithms , Cluster Analysis , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods
7.
Front Genet ; 12: 773882, 2021.
Article in English | MEDLINE | ID: mdl-34868261

ABSTRACT

Background: Pseudouridine (Ψ) is a common ribonucleotide modification that plays a significant role in many biological processes. The identification of Ψ modification sites is of great significance for disease mechanism and biological processes research in which machine learning algorithms are desirable as the lab exploratory techniques are expensive and time-consuming. Results: In this work, we propose a deep learning framework, called PseUdeep, to identify Ψ sites of three species: H. sapiens, S. cerevisiae, and M. musculus. In this method, three encoding methods are used to extract the features of RNA sequences, that is, one-hot encoding, K-tuple nucleotide frequency pattern, and position-specific nucleotide composition. The three feature matrices are convoluted twice and fed into the capsule neural network and bidirectional gated recurrent unit network with a self-attention mechanism for classification. Conclusion: Compared with other state-of-the-art methods, our model gets the highest accuracy of the prediction on the independent testing data set S-200; the accuracy improves 12.38%, and on the independent testing data set H-200, the accuracy improves 0.68%. Moreover, the dimensions of the features we derive from the RNA sequences are only 109,109, and 119 in H. sapiens, M. musculus, and S. cerevisiae, which is much smaller than those used in the traditional algorithms. On evaluation via tenfold cross-validation and two independent testing data sets, PseUdeep outperforms the best traditional machine learning model available. PseUdeep source code and data sets are available at https://github.com/dan111262/PseUdeep.

8.
Front Oncol ; 11: 797057, 2021.
Article in English | MEDLINE | ID: mdl-34917514

ABSTRACT

Critical in revealing cell heterogeneity and identifying new cell subtypes, cell clustering based on single-cell RNA sequencing (scRNA-seq) is challenging. Due to the high noise, sparsity, and poor annotation of scRNA-seq data, existing state-of-the-art cell clustering methods usually ignore gene functions and gene interactions. In this study, we propose a feature extraction method, named FEGFS, to analyze scRNA-seq data, taking advantage of known gene functions. Specifically, we first derive the functional gene sets based on Gene Ontology (GO) terms and reduce their redundancy by semantic similarity analysis and gene repetitive rate reduction. Then, we apply the kernel principal component analysis to select features on each non-redundant functional gene set, and we combine the selected features (for each functional gene set) together for subsequent clustering analysis. To test the performance of FEGFS, we apply agglomerative hierarchical clustering based on FEGFS and compared it with seven state-of-the-art clustering methods on six real scRNA-seq datasets. For small datasets like Pollen and Goolam, FEGFS outperforms all methods on all four evaluation metrics including adjusted Rand index (ARI), normalized mutual information (NMI), homogeneity score (HOM), and completeness score (COM). For example, the ARIs of FEGFS are 0.955 and 0.910, respectively, on Pollen and Goolam; and those of the second-best method are only 0.938 and 0.910, respectively. For large datasets, FEGFS also outperforms most methods. For example, the ARIs of FEGFS are 0.781 on both Klein and Zeisel, which are higher than those of all other methods but slight lower than those of SC3 (0.798 and 0.807, respectively). Moreover, we demonstrate that CMF-Impute is powerful in reconstructing cell-to-cell and gene-to-gene correlation and in inferring cell lineage trajectories. As for application, take glioma as an example; we demonstrated that our clustering methods could identify important cell clusters related to glioma and also inferred key marker genes related to these cell clusters.

9.
Front Genet ; 10: 607, 2019.
Article in English | MEDLINE | ID: mdl-31396256

ABSTRACT

Phylogenetic networks are used to estimate evolutionary relationships among biological entities or taxa involving reticulate events such as horizontal gene transfer, hybridization, recombination, and reassortment. In the past decade, many phylogenetic tree and network reconstruction methods have been proposed. Despite that they are highly accurate in reconstructing simple to moderate complex reticulate events, the performance decreases when several reticulate events are present simultaneously. In this paper, we proposed QS-Net, a phylogenetic network reconstruction method taking advantage of information on the relationship among six taxa. To evaluate the performance of QS-Net, we conducted experiments on three artificial sequence data simulated from an evolutionary tree, an evolutionary network involving three reticulate events, and a complex evolutionary network involving five reticulate events. Comparison with popular phylogenetic methods including Neighbor-Joining, Split-Decomposition, Neighbor-Net, and Quartet-Net suggests that QS-Net is comparable with other methods in reconstructing tree-like evolutionary histories, while it outperforms them in reconstructing reticulate events. In addition, we also applied QS-Net in real data including a bacterial taxonomy data consisting of 36 bacterial species and the whole genome sequences of 22 H7N9 influenza A viruses. The results indicate that QS-Net is capable of inferring commonly believed bacterial taxonomy and influenza evolution as well as identifying novel reticulate events. The software QS-Net is publically available at https://github.com/Tmyiri/QS-Net.

10.
Sci Rep ; 9(1): 6220, 2019 04 17.
Article in English | MEDLINE | ID: mdl-30996271

ABSTRACT

With the rapid growth of the aging population, exploring the biological basis of aging and related molecular mechanisms has become an important topic in modern scientific research. Aging can cause multiple organ function attenuations, leading to the occurrence and development of various age-related metabolic, nervous system, and cardiovascular diseases. In addition, aging is closely related to the occurrence and development of tumors. Although a number of studies have used various mouse models to study aging, further research is needed to associate mouse and human aging at the molecular level. In this paper, we systematically assessed the relationship between human and mouse aging by comparing multi-tissue age-related gene expression sets. We compared 18 human and mouse tissues, and found 9 significantly correlated tissue pairs. Functional analysis also revealed some terms related to aging in human and mouse. And we performed a crosswise comparison of homologous age-related genes with 18 tissues in human and mouse respectively, and found that human Brain_Cortex was significantly correlated with Brain_Hippocampus, which was also found in mouse. In addition, we focused on comparing four brain-related tissues in human and mouse, and found a gene-GFAP-related to aging in both human and mouse.


Subject(s)
Aging/genetics , Cerebral Cortex/metabolism , Hippocampus/metabolism , Transcriptome , Adult , Aged , Algorithms , Animals , Databases, Genetic , Glial Fibrillary Acidic Protein/genetics , Humans , Mice , RNA-Seq
11.
Oncotarget ; 9(1): 1063-1074, 2018 Jan 02.
Article in English | MEDLINE | ID: mdl-29416677

ABSTRACT

Aging is a major risk factor for age-related diseases such as certain cancers. In this study, we developed Age Associated Gene Co-expression Identifier (AAGCI), a liquid association based method to infer age-associated gene co-expressions at thousands of biological processes and pathways across 9 human tissues. Several hundred to thousands of gene pairs were inferred to be age co-expressed across different tissues, the genes involved in which are significantly enriched in functions like immunity, ATP binding, DNA damage, and many cancer pathways. The age co-expressed genes are significantly overlapped with aging genes curated in the GenAge database across all 9 tissues, suggesting a tissue-wide correlation between age-associated genes and co-expressions. Interestingly, age-associated gene co-expressions are significantly different from gene co-expressions identified through correlation analysis, indicating that aging might only contribute to a small portion of gene co-expressions. Moreover, the key driver analysis identified biologically meaningful genes in important function modules. For example, IGF1, ERBB2, TP53 and STAT5A were inferred to be key genes driving age co-expressed genes in the network module associated with function "T cell proliferation". Finally, we prioritized a few anti-aging drugs such as metformin based on an enrichment analysis between age co-expressed genes and drug signatures from a recent study. The predicted drugs were partially validated by literature mining and can be readily used to generate hypothesis for further experimental validations.

12.
Sci Rep ; 7(1): 3367, 2017 06 13.
Article in English | MEDLINE | ID: mdl-28611393

ABSTRACT

Considering as one of the major goals in quantitative proteomics, detection of the differentially expressed proteins (DEPs) plays an important role in biomarker selection and clinical diagnostics. There have been plenty of algorithms and tools focusing on DEP detection in proteomics research. However, due to the different application scopes of these methods, and various kinds of experiment designs, it is not very apparent about the best choice for large-scale proteomics data analyses. Moreover, given the fact that proteomics data usually contain high percentage of missing values (MVs), but few replicates, a systematic evaluation of the DEP detection methods combined with the MV imputation methods is essential and urgent. Here, we analyzed a total of four representative imputation methods and five DEP methods on different experimental and simulated datasets. The results showed that (i) MV imputation could not always improve the performances of DEP detection methods and the imputation effects differed in the missing value percentages; (ii) the DEP detection methods had different statistical powers affected by the percentage of MVs. Two statistical methods (i.e. the empirical Bayesian random censoring threshold model, and the significance analysis of microarray) performed better than the other evaluated methods in terms of accuracy and sensitivity.


Subject(s)
Data Interpretation, Statistical , Escherichia coli Proteins/metabolism , Neoplasm Proteins/metabolism , Proteome/analysis , Proteomics/methods , Algorithms , Bayes Theorem , HeLa Cells , Humans , ROC Curve
SELECTION OF CITATIONS
SEARCH DETAIL
...