Search | VHL Regional Portal

1.

BLAM6A-Merge: Leveraging Attention Mechanisms and Feature Fusion Strategies to Improve the Identification of RNA N6-methyladenosine Sites.

Xia, Yunpeng; Zhang, Ying; Liu, Dian; Zhu, Yi-Heng; Wang, Zhikang; Song, Jiangning; Yu, Dong-Jun.

IEEE/ACM Trans Comput Biol Bioinform ; PP, 2024 Jun 24.

Article in English | MEDLINE | ID: mdl-38913512

ABSTRACT

RNA N6-methyladenosine is a prevalent and abundant type of RNA modification that exerts significant influence on diverse biological processes. To date, numerous computational approaches have been developed for predicting methylation, with most of them ignoring the correlations of different encoding strategies and failing to explore the adaptability of various attention mechanisms for methylation identification. To solve the above issues, we proposed an innovative framework for predicting RNA m6A modification sites, termed BLAM6A-Merge. Specifically, it utilized a multimodal feature fusion strategy to combine the classification results of four features and the Blastn tool. Apart from this, different attention mechanisms were employed for extracting higher-level features on specific features after the screening process. Extensive experiments on 12 benchmarking datasets demonstrated that BLAM6A-Merge achieved superior performance (average AUC: 0.849 for the full transcript mode and 0.784 for the mature mRNA mode). Notably, the Blastn tool was employed for the first time in the identification of methylation sites. The data and code can be accessed at https://github.com/DoraemonXia/BLAM6A-Merge.
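The score-level fusion mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say how the five classification results are combined, so the weighted average below, the `fuse_scores` name, the weights, and the 0.5 threshold are all assumptions for illustration and not the BLAM6A-Merge implementation.

```python
from typing import Sequence


def fuse_scores(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-model probabilities into one score by weighted averaging.

    Hypothetical helper: the weighting scheme is an assumption and is not
    taken from the paper.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


# Five illustrative probabilities for one candidate site: four feature-based
# classifiers plus a Blastn-derived score (all values are made up).
site_scores = [0.82, 0.75, 0.64, 0.91, 0.70]
equal_weights = [1.0] * 5
fused = fuse_scores(site_scores, equal_weights)
is_m6a = fused >= 0.5  # 0.5 is an arbitrary decision threshold
```

With equal weights this reduces to a plain mean of the five scores; unequal weights would let a stronger classifier dominate the final decision.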

2.

ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein-DNA binding site prediction.

Zhu, Yi-Heng; Liu, Zi; Liu, Yan; Ji, Zhiwei; Yu, Dong-Jun.

Brief Bioinform ; 25(2), 2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38349057

ABSTRACT

Efficient and accurate recognition of protein-DNA interactions is vital for understanding the molecular mechanisms of related biological processes and further guiding drug discovery. Although the current experimental protocols are the most precise way to determine protein-DNA binding sites, they tend to be labor-intensive and time-consuming. There is an immediate need to design efficient computational approaches for predicting DNA-binding sites. Here, we proposed ULDNA, a new deep-learning model, to deduce DNA-binding sites from protein sequences. This model leverages an LSTM-attention architecture, embedded with three unsupervised language models that are pre-trained on large-scale sequences from multiple database sources. To prove its effectiveness, ULDNA was tested on 229 protein chains with experimental annotation of DNA-binding sites. Results from computational experiments revealed that ULDNA significantly improves the accuracy of DNA-binding site prediction in comparison with 17 state-of-the-art methods. In-depth data analyses showed that the major strength of ULDNA stems from employing three transformer language models. Specifically, these language models capture complementary feature embeddings with evolution diversity, in which the complex DNA-binding patterns are buried. Meanwhile, the specially crafted LSTM-attention network effectively decodes evolution diversity-based embeddings as DNA-binding results at the residue level. Our findings demonstrated a new pipeline for predicting DNA-binding sites on a large scale with high accuracy from protein sequence alone.

Subject(s)

Data Analysis , Language , Binding Sites , Amino Acid Sequence , Databases, Factual
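The attention step that the ULDNA abstract pairs with an LSTM can be sketched generically. This is plain scaled dot-product attention over toy hidden states, not the ULDNA architecture; the function names and all numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: Sequence[float], states: Sequence[Sequence[float]]
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for one query over all states."""
    scale = math.sqrt(len(query))
    # Similarity of the query to every position, scaled by sqrt(dimension).
    scores = [sum(q * h for q, h in zip(query, state)) / scale for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    # Context vector: attention-weighted sum of the hidden states.
    context = [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]
    return weights, context


# Toy hidden states for a four-residue protein (made-up numbers).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
weights, context = attend(hidden[2], hidden)
```

Here the third residue attends over the whole sequence; the weights sum to one, and the position most similar to the query receives the largest share.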

3.

Integrating unsupervised language model with multi-view multiple sequence alignments for high-accuracy inter-chain contact prediction.

Liu, Zi; Zhu, Yi-Heng; Shen, Long-Chen; Xiao, Xuan; Qiu, Wang-Ren; Yu, Dong-Jun.

Comput Biol Med ; 166: 107529, 2023 Sep 20.

Article in English | MEDLINE | ID: mdl-37748220

ABSTRACT

Accurate identification of inter-chain contacts in the protein complex is critical to determine the corresponding 3D structures and understand the biological functions. We proposed a new deep learning method, ICCPred, to deduce the inter-chain contacts from the amino acid sequences of the protein complex. This pipeline was built on the designed deep residual network architecture, integrating the pre-trained language model with three multiple sequence alignments (MSAs) from different biological views. Experimental results on 709 non-redundant benchmarking protein complexes showed that the proposed ICCPred significantly increased inter-chain contact prediction accuracy compared to the state-of-the-art approaches. Detailed data analyses showed that the significant advantage of ICCPred lies in the utilization of pre-trained transformer language models which can effectively extract the complementary co-evolution diversity from three MSAs. Meanwhile, the designed deep residual network enhances the correlation between the co-evolution diversity and the patterns of inter-chain contacts. These results demonstrated a new avenue for high-accuracy deep-learning inter-chain contact prediction that is applicable to large-scale protein-protein interaction annotations from sequence alone.

4.

GCmapCrys: Integrating graph attention network with predicted contact map for multi-stage protein crystallization propensity prediction.

Wang, Peng-Hao; Zhu, Yi-Heng; Yang, Xibei; Yu, Dong-Jun.

Anal Biochem ; 663: 115020, 2023 02 15.

Article in English | MEDLINE | ID: mdl-36521558

ABSTRACT

X-ray crystallography is the major approach for atomic-level protein structure determination. Since not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity is critical to guiding the experimental design and improving the success rate of X-ray crystallography experiments. In this work, we proposed a new deep learning pipeline, GCmapCrys, for multi-stage crystallization propensity prediction through integrating graph attention network with predicted protein contact map. Experimental results on 1548 proteins with known crystallization records demonstrated that GCmapCrys increased the value of the Matthews correlation coefficient by 37.0% on average compared to state-of-the-art protein crystallization propensity predictors. Detailed analyses show that the major advantages of GCmapCrys lie in the efficiency of the graph attention network with predicted contact map, which effectively associates the residue-interaction knowledge with crystallization pattern. Meanwhile, the designed four sequence-based features can be complementary to further enhance crystallization propensity prediction.

Subject(s)

Computational Biology , Proteins , Crystallization/methods , Proteins/chemistry , Crystallography, X-Ray , Computational Biology/methods
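A graph attention network needs a graph, and a predicted contact map is one natural source of edges. The sketch below shows one way to turn a contact-probability matrix into an adjacency list. The 0.5 threshold, the averaging of the two triangles, and the `contact_graph` name are assumptions; the abstract does not describe how GCmapCrys builds its graph.

```python
from typing import Dict, List, Sequence


def contact_graph(
    contact_map: Sequence[Sequence[float]], threshold: float = 0.5
) -> Dict[int, List[int]]:
    """Build an undirected residue graph from an L x L contact-probability map."""
    n = len(contact_map)
    graph: Dict[int, List[int]] = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            # Average both triangles in case the predicted map is not symmetric.
            prob = (contact_map[i][j] + contact_map[j][i]) / 2
            if prob >= threshold:
                graph[i].append(j)
                graph[j].append(i)
    return graph


# Toy 4-residue contact map (made-up probabilities).
toy_map = [
    [1.0, 0.9, 0.2, 0.7],
    [0.9, 1.0, 0.6, 0.1],
    [0.2, 0.6, 1.0, 0.8],
    [0.7, 0.1, 0.8, 1.0],
]
edges = contact_graph(toy_map)
```

Each residue becomes a node, and the neighbour lists are what an attention layer would aggregate over.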

5.

Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction.

Zhu, Yi-Heng; Zhang, Chengxin; Yu, Dong-Jun; Zhang, Yang.

PLoS Comput Biol ; 18(12): e1010793, 2022 12.

Article in English | MEDLINE | ID: mdl-36548439

ABSTRACT

Accurate identification of protein function is critical to elucidate life mechanisms and design new drugs. We proposed a novel deep-learning method, ATGO, to predict Gene Ontology (GO) attributes of proteins through a triplet neural-network architecture embedded with pre-trained language models from protein sequences. The method was systematically tested on 1068 non-redundant benchmarking proteins and 3328 targets from the third Critical Assessment of Protein Function Annotation (CAFA) challenge. Experimental results showed that ATGO achieved a significant increase of the GO prediction accuracy compared to the state-of-the-art approaches in all aspects of molecular function, biological process, and cellular component. Detailed data analyses showed that the major advantage of ATGO lies in the utilization of pre-trained transformer language models which can extract discriminative functional pattern from the feature embeddings. Meanwhile, the proposed triplet network helps enhance the association of functional similarity with feature similarity in the sequence embedding space. In addition, it was found that the combination of the network scores with the complementary homology-based inferences could further improve the accuracy of the predicted models. These results demonstrated a new avenue for high-accuracy deep-learning function prediction that is applicable to large-scale protein function annotations from sequence alone.

Subject(s)

Computational Biology , Proteins , Gene Ontology , Computational Biology/methods , Proteins/genetics , Proteins/metabolism , Neural Networks, Computer , Language

6.

Spatial and Temporal Evolutionary Characteristics and Its Influencing Factors of Economic Spatial Polarization in the Yangtze River Delta Region.

Zhu, Yiheng; Yang, Shan; Lin, Jinping; Yin, Shanggang.

Int J Environ Res Public Health ; 19(12), 2022 06 07.

Article in English | MEDLINE | ID: mdl-35742246

ABSTRACT

Economic spatial polarization is a manifestation of unbalanced urban development. To study the unbalanced development of Chinese cities, this paper selects 41 cities in the Yangtze River Delta (YRD) region, introduces the polarization index and exploratory spatio-temporal analysis to portray their spatio-temporal evolution process, and analyzes the differences in spatial polarization patterns of economic development in three dimensions of economic quantity, quality, and structure. Finally, we use the geographic detector model to explore the driving factors and then propose corresponding policy recommendations. The results show that: (1) the degree of difference in economic development in the YRD region narrowed from 2000 to 2019, and the spatial polarization level of urban economic development showed a fluctuating downward trend, among which the spatial polarization level of the economic structure dimension has been increasing. (2) In terms of spatial distribution, the "Yangtze River Delta urban agglomeration" has become the peak contiguous zone of economic spatial polarization in the YRD region, and the spatial polarization of economic quantity and quality dimensions has formed a "polycentric" pattern, while the spatial polarization of economic structure dimensions shows a stable "one core, multiple sub-center" distribution. (3) From the evolution of spatial polarization, most cities have strong spatial locking characteristics without a transition. Spatially positively polarized cities are concentrated in the YRD urban agglomeration, and the inter-city neighboring relations are mainly positive synergistic growth, while the negatively polarized cities are mostly distributed in the peripheral areas of the YRD and the neighboring relations are negative synergistic growth. 
At the same time, the spatially positive polarization effect of the economic quantity dimension and the spatially negative polarization effect of the economic structure dimension among cities are more significant. (4) The economic spatial polarization in the YRD region is mainly dominated by market prosperity and urbanization level, while the driving effect of scientific and technological innovation development on the urban economy has also been expanding in recent years. Promoting the reasonable allocation of marketization, urbanization, and technology among cities with positive and negative spatial polarization in the future will contribute to balanced urban and regional economic development in a coordinated and orderly manner.

Subject(s)

Rivers , Urbanization , China , Cities , Economic Development

7.

TripletGO: Integrating Transcript Expression Profiles with Protein Homology Inferences for Gene Function Prediction.

Zhu, Yi-Heng; Zhang, Chengxin; Liu, Yan; Omenn, Gilbert S; Freddolino, Peter L; Yu, Dong-Jun; Zhang, Yang.

Genomics Proteomics Bioinformatics ; 20(5): 1013-1027, 2022 10.

Article in English | MEDLINE | ID: mdl-35568117

ABSTRACT

Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we proposed a new method, TripletGO, to deduce GO terms of protein-coding and non-coding genes, through the integration of four complementary pipelines built on transcript expression profile, genetic sequence alignment, protein sequence alignment, and naïve probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.

Subject(s)

Computational Biology , Proteins , Animals , Mice , Rats , Humans , Proteins/metabolism , Molecular Sequence Annotation , Amino Acid Sequence , Sequence Alignment , Computational Biology/methods

8.

TargetMM: Accurate Missense Mutation Prediction by Utilizing Local and Global Sequence Information with Classifier Ensemble.

Ge, Fang; Hu, Jun; Zhu, Yi-Heng; Arif, Muhammad; Yu, Dong-Jun.

Comb Chem High Throughput Screen ; 25(1): 38-52, 2022.

Article in English | MEDLINE | ID: mdl-33280588

ABSTRACT

AIM AND OBJECTIVE: Missense mutation (MM) may lead to various human diseases by disabling proteins. Accurate prediction of MM is important and challenging for both protein function annotation and drug design. Although several computational methods yielded acceptable success rates, there is still room for further enhancing the prediction performance of MM. MATERIALS AND METHODS: In the present study, we designed a new feature extracting method, which considers the impact degree of residues in the microenvironment range to the mutation site. Stringent cross-validation and independent test on benchmark datasets were performed to evaluate the efficacy of the proposed feature extracting method. Furthermore, three heterogeneous prediction models were trained and then ensembled for the final prediction. By combining the feature representation method and classifier ensemble technique, we reported a novel MM predictor called TargetMM for identifying the pathogenic mutations from the neutral ones. RESULTS: Comparison outcomes based on statistical evaluation demonstrate that TargetMM outperforms the prior advanced methods on the independent test data. The source codes and benchmark datasets of TargetMM are freely available at https://github.com/sera616/TargetMM.git for academic use.

Subject(s)

Algorithms , Mutation, Missense , Humans , Proteins/chemistry , Software

9.

TGSA: protein-protein association-based twin graph neural networks for drug response prediction with similarity augmentation.

Zhu, Yiheng; Ouyang, Zhenqiu; Chen, Wenbo; Feng, Ruiwei; Chen, Danny Z; Cao, Ji; Wu, Jian.

Bioinformatics ; 38(2): 461-468, 2022 01 03.

Article in English | MEDLINE | ID: mdl-34559177

ABSTRACT

MOTIVATION: Drug response prediction (DRP) plays an important role in precision medicine (e.g. for cancer analysis and treatment). Recent advances in deep learning algorithms make it possible to predict drug responses accurately based on genetic profiles. However, existing methods ignore the potential relationships among genes. In addition, similarity among cell lines/drugs was rarely considered explicitly. RESULTS: We propose a novel DRP framework, called TGSA, to make better use of prior domain knowledge. TGSA consists of Twin Graph neural networks for Drug Response Prediction (TGDRP) and a Similarity Augmentation (SA) module to fuse fine-grained and coarse-grained information. Specifically, TGDRP abstracts cell lines as graphs based on STRING protein-protein association networks and uses Graph Neural Networks (GNNs) for representation learning. SA views DRP as an edge regression problem on a heterogeneous graph and utilizes GNNs to smooth the representations of similar cell lines/drugs. Besides, we introduce an auxiliary pre-training strategy to remedy the identified limitations of scarce data and poor out-of-distribution generalization. Extensive experiments on the GDSC2 dataset demonstrate that our TGSA consistently outperforms all the state-of-the-art baselines under various experimental settings. We further evaluate the effectiveness and contributions of each component of TGSA via ablation experiments. The promising performance of TGSA shows enormous potential for clinical applications in precision medicine. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/violet-sto/TGSA. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Neoplasms , Neural Networks, Computer , Humans , Algorithms , Software , Precision Medicine , Proteins

10.

MAResNet: predicting transcription factor binding sites by combining multi-scale bottom-up and top-down attention and residual network.

Han, Ke; Shen, Long-Chen; Zhu, Yi-Heng; Xu, Jian; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 23(1), 2022 01 17.

Article in English | MEDLINE | ID: mdl-34664074

ABSTRACT

Accurate identification of transcription factor binding sites is of great significance in understanding gene expression, biological development and drug design. Although a variety of methods based on deep-learning models and large-scale data have been developed to predict transcription factor binding sites in DNA sequences, there is room for further improvement in prediction performance. In addition, effective interpretation of deep-learning models is greatly desirable. Here we present MAResNet, a new deep-learning method, for predicting transcription factor binding sites on 690 ChIP-seq datasets. More specifically, MAResNet combines the bottom-up and top-down attention mechanisms and a state-of-the-art feed-forward network (ResNet), which is constructed by stacking attention modules that generate attention-aware features. In particular, the multi-scale attention mechanism is utilized at the first stage to extract rich and representative sequence features. We further discuss the attention-aware features learned from different attention modules in accordance with the changes as the layers go deeper. The features learned by MAResNet are also visualized through the TMAP tool to illustrate that the method can extract the unique characteristics of transcription factor binding sites. The performance of MAResNet is extensively tested on 690 test subsets with an average AUC of 0.927, which is higher than that of the current state-of-the-art methods. Overall, this study provides a new and useful framework for the prediction of transcription factor binding sites by combining the funnel attention modules with the residual network.

Subject(s)

Deep Learning , Binding Sites/genetics , Neural Networks, Computer , Protein Binding , Transcription Factors/metabolism
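The average AUC reported in this abstract is a mean over per-dataset AUC values. A minimal sketch of that bookkeeping follows, using the pairwise (Mann-Whitney) definition of AUC. The scores are made up and the helper names are hypothetical; this is not the evaluation code of MAResNet.

```python
from typing import List, Sequence


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is O(P * N), which is fine for
    a small illustration.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def mean_auc(per_dataset: Sequence[float]) -> float:
    """Unweighted mean of per-dataset AUC values."""
    return sum(per_dataset) / len(per_dataset)


# Two toy test sets with made-up prediction scores.
aucs: List[float] = [
    auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]),
    auc([0.6, 0.6], [0.6, 0.1]),
]
average = mean_auc(aucs)
```

An unweighted mean treats every dataset equally regardless of its size, which is one common reading of "average AUC" but is an assumption here.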

11.

MutTMPredictor: Robust and accurate cascade XGBoost classifier for prediction of mutations in transmembrane proteins.

Ge, Fang; Zhu, Yi-Heng; Xu, Jian; Muhammad, Arif; Song, Jiangning; Yu, Dong-Jun.

Comput Struct Biotechnol J ; 19: 6400-6416, 2021.

Article in English | MEDLINE | ID: mdl-34938415

ABSTRACT

Transmembrane proteins have critical biological functions and play a role in a multitude of cellular processes including cell signaling, transport of molecules and ions across membranes. Approximately 60% of transmembrane proteins are considered as drug targets. Missense mutations in such proteins can lead to many diverse diseases and disorders, such as neurodegenerative diseases and cystic fibrosis. However, there are limited studies on mutations in transmembrane proteins. In this work, we first design a new feature encoding method, termed weight attenuation position-specific scoring matrix (WAPSSM), which builds upon the protein evolutionary information. Then, we propose a new mutation prediction algorithm (cascade XGBoost) by leveraging the idea learned from consensus predictors and gcForest. Multi-level experiments illustrate the effectiveness of WAPSSM and cascade XGBoost algorithms. Finally, based on WAPSSM and three other types of features, in combination with the cascade XGBoost algorithm, we develop a new transmembrane protein mutation predictor, named MutTMPredictor. We benchmark the performance of MutTMPredictor against several existing predictors on seven datasets. On the 546 mutations dataset, MutTMPredictor achieves an accuracy (ACC) of 0.9661 and a Matthews correlation coefficient (MCC) of 0.8950. On the 67,584 mutations dataset, MutTMPredictor achieves an MCC of 0.7523 and an area under the curve (AUC) of 0.8746, which are 0.1625 and 0.0801 higher, respectively, than those of the existing best predictor (fathmm). Besides, MutTMPredictor also outperforms two specific predictors on the Pred-MutHTP datasets. The results suggest that MutTMPredictor can be used as an effective method for predicting and prioritizing missense mutations in transmembrane proteins. The MutTMPredictor webserver and datasets are freely accessible at http://csbio.njust.edu.cn/bioinf/muttmpredictor/ for academic use.

12.

Improving protein fold recognition using triplet network and ensemble deep learning.

Liu, Yan; Han, Ke; Zhu, Yi-Heng; Zhang, Ying; Shen, Long-Chen; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 22(6), 2021 11 05.

Article in English | MEDLINE | ID: mdl-34226918

ABSTRACT

Protein fold recognition is a critical step toward protein structure and function prediction, aiming at providing the most likely fold type of the query protein. In recent years, the development of deep learning (DL) techniques has led to massive advances in this important field, and accordingly, the sensitivity of protein fold recognition has been dramatically improved. Most DL-based methods take an intermediate bottleneck layer as the feature representation of proteins with new fold types. However, this strategy is indirect, inefficient and relies on the assumption that the bottleneck layer provides a good representation of proteins with new fold types. To address the above problem, in this work, we develop a new computational framework by combining triplet network and ensemble DL. We first train a DL-based model, termed FoldNet, which employs triplet loss to train the deep convolutional network. FoldNet directly optimizes the protein fold embedding itself, bringing proteins with the same fold type closer to each other than those with different fold types in the new protein embedding space. Subsequently, using the trained FoldNet, we implement a new residue-residue contact-assisted predictor, termed FoldTR, which improves protein fold recognition. Furthermore, we propose a new ensemble DL method, termed FSD_XGBoost, which combines protein fold embedding with the other two discriminative fold-specific features extracted by two DL-based methods SSAfold and DeepFR. The Top 1 sensitivity of FSD_XGBoost increases to 74.8% at the fold level, which is ~9% higher than that of the state-of-the-art method. Together, the results suggest that fold-specific features extracted by different DL methods complement each other, and their combination can further improve fold recognition at the fold level. The implemented web server of FoldTR and benchmark datasets are publicly available at http://csbio.njust.edu.cn/bioinf/foldtr/.

Subject(s)

Computational Biology/methods , Deep Learning , Models, Molecular , Protein Conformation , Protein Folding , Proteins/chemistry , Algorithms , Databases, Protein , Neural Networks, Computer , Reproducibility of Results , Sensitivity and Specificity
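The triplet loss named in this abstract can be written down in a few lines. The sketch below uses the standard margin-based form with Euclidean distance. The margin value, the two-dimensional embeddings, and the helper names are illustrative assumptions, and FoldNet may use a different distance or margin.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Margin-based triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy embeddings: the anchor and positive share a fold, the negative does not.
anchor = [0.0, 0.0]
same_fold = [0.0, 1.0]
far_negative = [3.0, 4.0]   # already well separated from the anchor
near_negative = [0.6, 0.8]  # as close to the anchor as the positive

easy = triplet_loss(anchor, same_fold, far_negative)
hard = triplet_loss(anchor, same_fold, near_negative)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only "hard" triplets contribute gradient; this is what pulls same-fold proteins together in the embedding space.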

13.

Why can deep convolutional neural networks improve protein fold recognition? A visual explanation by interpretation.

Liu, Yan; Zhu, Yi-Heng; Song, Xiaoning; Song, Jiangning; Yu, Dong-Jun.

Brief Bioinform ; 22(5), 2021 09 02.

Article in English | MEDLINE | ID: mdl-33537753

ABSTRACT

As an essential task in protein structure and function prediction, protein fold recognition has attracted increasing attention. The majority of the existing machine learning-based protein fold recognition approaches strongly rely on handcrafted features, which depict the characteristics of different protein folds; however, effective feature extraction methods still represent the bottleneck for further performance improvement of protein fold recognition. As a powerful feature extractor, deep convolutional neural network (DCNN) can automatically extract discriminative features for fold recognition without human intervention, which has demonstrated an impressive performance on protein fold recognition. Despite the encouraging progress, DCNN often acts as a black box, and as such, it is challenging for users to understand what really happens in DCNN and why it works well for protein fold recognition. In this study, we explore the intrinsic mechanism of DCNN and explain why it works for protein fold recognition using a visual explanation technique. More specifically, we first trained a VGGNet-based DCNN model, termed VGGNet-FE, which can extract fold-specific features from the predicted protein residue-residue contact map for protein fold recognition. Subsequently, based on the trained VGGNet-FE, we implemented a new contact-assisted predictor, termed VGGfold, for protein fold recognition; we then visualized what features were extracted by each of the convolutional layers in VGGNet-FE using a deconvolution technique. Furthermore, we visualized the high-level semantic information, termed fold-discriminative region, of a predicted contact map from the localization map obtained from the last convolutional layer of VGGNet-FE. It is visually confirmed that VGGNet-FE could effectively extract distinct fold-discriminative regions for different types of protein folds, thereby accounting for the improved performance of VGGfold for protein fold recognition. 
In summary, this study is of great significance for both understanding the working principle of DCNNs in protein fold recognition and exploring the relationship between the predicted protein contact map and protein tertiary structure. This proposed visualization method is flexible and applicable to address other DCNN-based bioinformatics and computational biology questions. The online web server of VGGfold is freely available at http://csbio.njust.edu.cn/bioinf/vggfold/.

Subject(s)

Computational Biology/methods , Machine Learning , Neural Networks, Computer , Protein Folding , Proteins/chemistry , Data Visualization , Humans , Protein Interaction Maps , Protein Structure, Tertiary , Proteins/metabolism , Semantics

14.

TargetDBP+: Enhancing the Performance of Identifying DNA-Binding Proteins via Weighted Convolutional Features.

Hu, Jun; Rao, Liang; Zhu, Yi-Heng; Zhang, Gui-Jun; Yu, Dong-Jun.

J Chem Inf Model ; 61(1): 505-515, 2021 01 25.

Article in English | MEDLINE | ID: mdl-33410688

ABSTRACT

Protein-DNA interactions exist ubiquitously and play important roles in the life cycles of living cells. The accurate identification of DNA-binding proteins (DBPs) is one of the key steps to understand the mechanisms of protein-DNA interactions. Although many DBP identification methods have been proposed, the current performance is still unsatisfactory. In this study, a new method, called TargetDBP+, is developed to further enhance the performance of identifying DBPs. In TargetDBP+, five convolutional features are first extracted from five feature sources, i.e., amino acid one-hot matrix (AAOHM), position-specific scoring matrix (PSSM), predicted secondary structure probability matrix (PSSPM), predicted solvent accessibility probability matrix (PSAPM), and predicted probabilities of DNA-binding sites (PPDBSs); second, the five features are serially combined using element-wise weights learned by the differential evolution algorithm; and finally, the DBP identification model of TargetDBP+ is trained using the support vector machine (SVM) algorithm. To evaluate the developed TargetDBP+ and compare it with other existing methods, a new gold-standard benchmark data set, called UniSwiss, is constructed, which consists of 4881 DBPs and 4881 non-DBPs extracted from the UniProtKB/Swiss-Prot database. Experimental results demonstrate that TargetDBP+ can obtain an accuracy of 85.83% and precision of 88.45% covering 82.41% of all DBP data on the independent validation subset of UniSwiss, with the MCC value (0.718) being significantly higher than those of other state-of-the-art control methods. The web server of TargetDBP+ is accessible at http://csbio.njust.edu.cn/bioinf/targetdbpplus/; the UniSwiss data set and stand-alone program of TargetDBP+ are accessible at https://github.com/jun-csbio/TargetDBPplus.

Subject(s)

DNA-Binding Proteins , Support Vector Machine , Algorithms , Binding Sites , DNA-Binding Proteins/metabolism , Databases, Protein , Position-Specific Scoring Matrices
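The metrics quoted in the TargetDBP+ abstract are linked through the confusion matrix, which a short worked example makes concrete. The sketch assumes that "covering 82.41% of all DBP data" is recall and that the validation subset is class-balanced with 1000 proteins per class. Both are assumptions, since the abstract states the balance only for the full UniSwiss set, and the helper names are hypothetical.

```python
import math
from typing import Tuple


def mcc(tp: float, fp: float, tn: float, fn: float) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when a row or column is empty
    return (tp * tn - fp * fn) / denom


def confusion_from_rates(
    recall: float, precision: float, n_pos: float, n_neg: float
) -> Tuple[float, float, float, float]:
    """Recover (tp, fp, tn, fn) from recall, precision and class sizes."""
    tp = recall * n_pos
    fn = n_pos - tp
    fp = tp * (1.0 / precision - 1.0)  # from precision = tp / (tp + fp)
    tn = n_neg - fp
    return tp, fp, tn, fn


# Reported recall and precision; the class sizes are an assumed balanced split.
tp, fp, tn, fn = confusion_from_rates(0.8241, 0.8845, 1000, 1000)
accuracy = (tp + tn) / (tp + fp + tn + fn)
coefficient = mcc(tp, fp, tn, fn)
```

Under these assumptions the derived accuracy comes out near 0.858 and the MCC near 0.718, which agrees with the figures in the abstract and suggests the reported metrics are mutually consistent for a balanced test set.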

15.

Accurate multistage prediction of protein crystallization propensity using deep-cascade forest with sequence-based features.

Zhu, Yi-Heng; Hu, Jun; Ge, Fang; Li, Fuyi; Song, Jiangning; Zhang, Yang; Yu, Dong-Jun.

Brief Bioinform ; 22(3), 2021 05 20.

Article in English | MEDLINE | ID: mdl-32436937

ABSTRACT

X-ray crystallography is the major approach for determining atomic-level protein structures. Because not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity provides critical help in guiding experimental design and improving the success rate of X-ray crystallography experiments. This study has developed a new machine-learning-based pipeline that uses a newly developed deep-cascade forest (DCF) model with multiple types of sequence-based features to predict protein crystallization propensity. Based on the developed pipeline, two new protein crystallization propensity predictors, denoted as DCFCrystal and MDCFCrystal, have been implemented. DCFCrystal is a multistage predictor that can estimate the success propensities of the three individual steps (production of protein material, purification and production of crystals) in the protein crystallization process. MDCFCrystal is a single-stage predictor that aims to estimate the probability that a protein will pass through the entire crystallization process. Moreover, DCFCrystal is designed for general proteins, whereas MDCFCrystal is specially designed for membrane proteins, which are notoriously difficult to crystallize. DCFCrystal and MDCFCrystal were separately tested on two benchmark datasets consisting of 12 289 and 950 proteins, respectively, with known crystallization results from various experimental records. The experimental results demonstrated that DCFCrystal and MDCFCrystal increased the value of the Matthews correlation coefficient by 199.7% and 77.8%, respectively, compared to the best of other state-of-the-art protein crystallization propensity predictors. 
Detailed analyses show that the major advantages of DCFCrystal and MDCFCrystal lie in the efficiency of the DCF model and the sensitivity of the sequence-based features used, especially the newly designed pseudo-predicted hybrid solvent accessibility (PsePHSA) feature, which improves crystallization recognition by incorporating sequence-order information with solvent accessibility of residues. Meanwhile, the new crystal-dataset constructions help to train the models with more comprehensive crystallization knowledge.

Subject(s)

Computational Biology/methods , Crystallization/methods , Proteins/chemistry , Amino Acid Sequence , Crystallography, X-Ray , Databases, Protein , Models, Chemical

16.

SSCpred: Single-Sequence-Based Protein Contact Prediction Using Deep Fully Convolutional Network.

Chen, Ming-Cai; Li, Yang; Zhu, Yi-Heng; Ge, Fang; Yu, Dong-Jun.

J Chem Inf Model ; 60(6): 3295-3303, 2020 06 22.

Article in English | MEDLINE | ID: mdl-32338512

ABSTRACT

There has been a significant improvement in protein residue contact prediction in recent years. Nevertheless, state-of-the-art methods still show deficiencies in the contact prediction of proteins with low-homology information. These top methods depend largely on statistical features that are derived from homologous sequences, but previous studies, along with our analyses, show that they are insufficient for inferring an accurate contact map for nonhomology protein targets. To compensate, we proposed a brand new single-sequence-based contact predictor (SSCpred) that performs prediction through the deep fully convolutional network (Deep FCN) with only the target sequence itself, i.e., without additional homology information. The proposed pipeline makes good use of the target sequence by utilizing the pair-wise encoding technique and Deep FCN. Experimental results demonstrated that SSCpred can produce accurate predictions based on the efficient pipeline. Compared with several of the most recent methods, SSCpred achieves competitive performance on nonhomology targets. Overall, we explored the possibilities of single-sequence-based contact prediction and designed a novel pipeline without using a complex and redundant feature set. The proposed SSCpred can compensate for current methods' disadvantages and achieves better performance on the nonhomology targets. The web server of SSCpred is freely available at http://csbio.njust.edu.cn/bioinf/sscpred/.
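
The abstract mentions a pair-wise encoding of the target sequence without describing it. One common way to build such an input, shown here purely as an assumption and not as the SSCpred implementation, is to concatenate the one-hot vectors of residues i and j for every position pair, giving an L x L x 40 array that a fully convolutional network could consume. The names `one_hot` and `pairwise_encode` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(residue: str) -> list:
    """20-dimensional one-hot vector; all zeros for a non-standard residue."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def pairwise_encode(sequence: str) -> list:
    """Return an L x L grid whose cell (i, j) is one_hot(i) + one_hot(j).

    This is an assumed encoding for illustration only; the paper may use
    different per-residue features.
    """
    per_residue = [one_hot(residue) for residue in sequence]
    length = len(sequence)
    return [
        [per_residue[i] + per_residue[j] for j in range(length)]
        for i in range(length)
    ]
```

Each cell carries information about both residues of a candidate contact, which is why the encoding needs no homologous sequences.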

Subject(s)

Computational Biology , Proteins , Algorithms , Proteins/genetics

17.

TargetDBP: Accurate DNA-Binding Protein Prediction Via Sequence-Based Multi-View Feature Learning.

Hu, Jun; Zhou, Xiao-Gen; Zhu, Yi-Heng; Yu, Dong-Jun; Zhang, Gui-Jun.

IEEE/ACM Trans Comput Biol Bioinform ; 17(4): 1419-1429, 2020.

Article in English | MEDLINE | ID: mdl-30668479

ABSTRACT

Accurately identifying DNA-binding proteins (DBPs) from protein sequence information is an important but challenging task for protein function annotations. In this paper, we establish a novel computational method, named TargetDBP, for accurately targeting DBPs from primary sequences. In TargetDBP, four single-view features, i.e., AAC (Amino Acid Composition), PsePSSM (Pseudo Position-Specific Scoring Matrix), PsePRSA (Pseudo Predicted Relative Solvent Accessibility), and PsePPDBS (Pseudo Predicted Probabilities of DNA-Binding Sites), are first extracted to represent different base features. Second, a differential evolution algorithm is employed to learn the weights of the four base features. Using the learned weights, we combine these base features in a weighted manner to form the original super feature. An excellent subset of the super feature is then selected by using a suitable feature selection algorithm, SVM-RFE+CBR (Support Vector Machine Recursive Feature Elimination with Correlation Bias Reduction). Finally, the prediction model is learned using a support vector machine on the selected feature subset. We also construct a new gold-standard and non-redundant benchmark dataset from the PDB database to evaluate and compare the proposed TargetDBP with other existing predictors. On this new dataset, TargetDBP can achieve higher performance than other state-of-the-art predictors. The TargetDBP web server and datasets are freely available at http://csbio.njust.edu.cn/bioinf/targetdbp/ for academic use.
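
The weighted combination step described above can be pictured with a short sketch. It assumes that each single-view feature vector is scaled by its learned weight and the scaled vectors are concatenated into one super feature; the abstract does not spell out the exact operation, so this reading, the function name `combine_views`, and the example numbers are all assumptions.

```python
def combine_views(views: list, weights: list) -> list:
    """Scale each feature view by its weight and concatenate the results.

    `views` is a list of feature vectors (one per view) and `weights`
    holds one scalar per view, e.g. as found by an optimiser such as
    differential evolution. Assumed interpretation, not the paper's code.
    """
    if len(views) != len(weights):
        raise ValueError("need exactly one weight per feature view")
    combined = []
    for view, weight in zip(views, weights):
        combined.extend(weight * value for value in view)
    return combined


if __name__ == "__main__":
    # Two toy views with made-up values and weights.
    print(combine_views([[1.0, 2.0], [3.0]], [0.5, 2.0]))
```

Feature selection and SVM training would then operate on the combined vector.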

Subject(s)

Computational Biology/methods , DNA-Binding Proteins , Machine Learning , Sequence Analysis, Protein/methods , Algorithms , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/genetics , DNA-Binding Proteins/metabolism , Databases, Protein , Position-Specific Scoring Matrices , Support Vector Machine

18.

Boosting Granular Support Vector Machines for the Accurate Prediction of Protein-Nucleotide Binding Sites.

Zhu, Yi-Heng; Hu, Jun; Qi, Yong; Song, Xiao-Ning; Yu, Dong-Jun.

Comb Chem High Throughput Screen ; 22(7): 455-469, 2019.

Article in English | MEDLINE | ID: mdl-31553288

ABSTRACT

AIM AND OBJECTIVE: The accurate identification of protein-ligand binding sites helps elucidate protein function and facilitate the design of new drugs. Machine-learning-based methods have been widely used for the prediction of protein-ligand binding sites. Nevertheless, the severe class imbalance phenomenon, where the number of nonbinding (majority) residues is far greater than that of binding (minority) residues, has a negative impact on the performance of such machine-learning-based predictors. MATERIALS AND METHODS: In this study, we aim to relieve the negative impact of class imbalance by Boosting Multiple Granular Support Vector Machines (BGSVM). In BGSVM, each base SVM is trained on a granular training subset consisting of all minority samples and some reasonably selected majority samples. The efficacy of BGSVM for dealing with class imbalance was validated by benchmarking it with several typical imbalance learning algorithms. We further implemented a protein-nucleotide binding site predictor, called BGSVM-NUC, with the BGSVM algorithm. RESULTS: Rigorous cross-validation and independent validation tests for five types of protein-nucleotide interactions demonstrated that the proposed BGSVM-NUC achieves promising prediction performance and outperforms several popular sequence-based protein-nucleotide binding site predictors. The BGSVM-NUC web server is freely available at http://csbio.njust.edu.cn/bioinf/BGSVM-NUC/ for academic use.
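
The granular training subsets described above can be sketched as follows. Each subset keeps every minority sample and adds a fixed number of majority samples. The abstract says the majority samples are "reasonably selected" without giving the rule, so this sketch substitutes plain random sampling as a stand-in; the function name, parameters, and seed are hypothetical.

```python
import random


def granular_subsets(minority: list, majority: list, n_subsets: int,
                     majority_per_subset: int, seed: int = 0) -> list:
    """Build `n_subsets` training subsets for an ensemble of base learners.

    Every subset contains all minority samples plus `majority_per_subset`
    majority samples drawn at random (an assumed selection rule).
    """
    rng = random.Random(seed)
    return [
        list(minority) + rng.sample(majority, majority_per_subset)
        for _ in range(n_subsets)
    ]


if __name__ == "__main__":
    binding = ["b1", "b2"]                      # toy minority class
    nonbinding = [f"n{i}" for i in range(10)]   # toy majority class
    for subset in granular_subsets(binding, nonbinding, 3, 4):
        print(subset)
```

Because each base learner sees a far more balanced subset than the full data, the ensemble is less dominated by the majority class.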

Subject(s)

Nucleotides/chemistry , Proteins/chemistry , Support Vector Machine , Binding Sites , High-Throughput Screening Assays

19.

DNAPred: Accurate Identification of DNA-Binding Sites from Protein Sequence by Ensembled Hyperplane-Distance-Based Support Vector Machines.

Zhu, Yi-Heng; Hu, Jun; Song, Xiao-Ning; Yu, Dong-Jun.

J Chem Inf Model ; 59(6): 3057-3071, 2019 06 24.

Article in English | MEDLINE | ID: mdl-30943723

ABSTRACT

Accurate identification of protein-DNA binding sites is significant for both understanding protein function and drug design. Machine-learning-based methods have been extensively used for the prediction of protein-DNA binding sites. However, the data imbalance problem, in which the number of nonbinding residues (negative-class samples) is far larger than that of binding residues (positive-class samples), seriously restricts the performance improvements of machine-learning-based predictors. In this work, we designed a two-stage imbalanced learning algorithm, called ensembled hyperplane-distance-based support vector machines (E-HDSVM), to improve the prediction performance of protein-DNA binding sites. The first stage of E-HDSVM designs a new iterative sampling algorithm, called hyperplane-distance-based under-sampling (HD-US), to extract multiple subsets from the original imbalanced data set, each of which is used to train a support vector machine (SVM). Unlike traditional sampling algorithms, HD-US selects samples by calculating the distances between the samples and the separating hyperplane of the SVM. The second stage of E-HDSVM proposes an enhanced AdaBoost (EAdaBoost) algorithm to ensemble multiple trained SVMs. As an enhanced version of the original AdaBoost algorithm, EAdaBoost overcomes the overfitting problem. Stringent cross-validation and independent tests on benchmark data sets demonstrated the superiority of E-HDSVM over several popular imbalanced learning algorithms. Based on the proposed E-HDSVM algorithm, we further implemented a sequence-based protein-DNA binding site predictor, called DNAPred, which is freely available at http://csbio.njust.edu.cn/bioinf/dnapred/ for academic use. The computational experimental results showed that our predictor achieved an average overall accuracy of 91.7% and a Matthews correlation coefficient of 0.395 on five benchmark data sets and outperformed several state-of-the-art sequence-based protein-DNA binding site predictors.
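
The distance computation that HD-US relies on can be illustrated for a linear hyperplane w·x + b = 0. The sketch below computes the geometric distance of each sample to the hyperplane and keeps the k closest ones. Keeping the closest samples is an assumption made only for illustration: the abstract states that selection is distance-based but not which samples are retained, and the actual method may use a non-linear kernel. All names are hypothetical.

```python
import math


def hyperplane_distance(x: list, w: list, b: float) -> float:
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm


def select_by_distance(samples: list, w: list, b: float, k: int) -> list:
    """Return the k samples nearest to the hyperplane (assumed criterion)."""
    ranked = sorted(samples, key=lambda x: hyperplane_distance(x, w, b))
    return ranked[:k]


if __name__ == "__main__":
    # Toy majority-class points and a made-up hyperplane x1 = 0.
    points = [[3.0, 0.0], [-1.0, 5.0], [0.5, 2.0]]
    print(select_by_distance(points, w=[1.0, 0.0], b=0.0, k=2))
```

In an iterative scheme, the hyperplane would be refitted after each round so that later subsets differ from earlier ones.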

Subject(s)

DNA-Binding Proteins/metabolism , DNA/metabolism , Models, Molecular , Support Vector Machine , DNA/chemistry , DNA-Binding Proteins/chemistry , Nucleic Acid Conformation , Protein Conformation

20.

[Two-dimensional measurement of blood flow velocity in rat arteries based on ultrasonic particle image velocimetry].

Zhu, Yiheng; Qian, Ming; Niu, Lili; Zheng, Hairong; Lu, Guangwen.

Nan Fang Yi Ke Da Xue Xue Bao ; 34(9): 1305-9, 2014 Aug.

Article in Chinese | MEDLINE | ID: mdl-25263364

ABSTRACT

OBJECTIVE: The ultrasonic pulse wave Doppler technique for noninvasive blood flow imaging does not provide precise information on complex blood flow fields, whereas observing the two-dimensional distribution of the arterial blood flow field provides important clinical information for cardiovascular disease. METHODS: Ultrasonic particle image velocimetry (Echo PIV) was used to measure blood flow on B-mode ultrasonic particle images to assess the whole-field velocity of the blood vessels in 5 groups of healthy rats. The reliability of Echo PIV was verified by comparison with the ultrasonic Doppler method over 3 cardiac cycles. RESULTS AND CONCLUSION: The results of Echo PIV were similar to those of ultrasound spectral Doppler. The Echo PIV-measured peak and average velocities within 3 cardiac cycles were about 5%-10% and 2%-8% below the values measured by ultrasonic spectral Doppler, respectively, but these differences were not statistically significant (P>0.05). As a new technique for monitoring complex blood flow in stenotic arteries, Echo PIV can be used to directly and non-invasively assess whole-field hemodynamic changes in blood vessels in real time and distinguish different groups of rats by velocity.

Subject(s)

Arteries/diagnostic imaging , Blood Flow Velocity , Ultrasonics , Animals , Hemodynamics , Rats , Reproducibility of Results , Rheology , Ultrasonography
