Search | VHL Regional Portal

LMCrot: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model.

Pratyush, Pawel; Bahmani, Soufia; Pokharel, Suresh; Ismail, Hamid D; Kc, Dukka B.

Bioinformatics ; 40(5)2024 May 02.

Article in English | MEDLINE | ID: mdl-38662579

ABSTRACT

MOTIVATION: Recent advancements in natural language processing have highlighted the effectiveness of global contextualized representations from protein language models (pLMs) in numerous downstream tasks. Nonetheless, strategies to encode the site-of-interest leveraging pLMs for per-residue prediction tasks, such as crotonylation (Kcr) prediction, remain largely uncharted. RESULTS: Herein, we adopt a range of approaches for utilizing pLMs by experimenting with different input sequence types (full-length protein sequence versus window sequence), assessing the implications of utilizing per-residue embedding of the site-of-interest as well as embeddings of window residues centered around it. Building upon these insights, we developed a novel residual ConvBiLSTM network designed to process window-level embeddings of the site-of-interest generated by the ProtT5-XL-UniRef50 pLM using full-length sequences as input. This model, termed T5ResConvBiLSTM, surpasses existing state-of-the-art Kcr predictors in performance across three diverse datasets. To validate our approach of utilizing full sequence-based window-level embeddings, we also delved into the interpretability of ProtT5-derived embedding tensors in two ways: firstly, by scrutinizing the attention weights obtained from the transformer's encoder block; and secondly, by computing SHAP values for these tensors, providing a model-agnostic interpretation of the prediction results. Additionally, we enhance the latent representation of ProtT5 by incorporating two additional local representations, one derived from amino acid properties and the other from a supervised embedding layer, through an intermediate fusion stacked generalization approach, using an n-mer window sequence (or, peptide/fragment). The resultant stacked model, dubbed LMCrot, exhibits a more pronounced improvement in predictive performance across the tested datasets.
AVAILABILITY AND IMPLEMENTATION: LMCrot is publicly available at https://github.com/KCLabMTU/LMCrot.
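The idea of a window-level embedding taken from a full-length sequence can be illustrated with a minimal sketch. It assumes the language model has already produced one vector per residue; the toy vectors, the function name `window_embedding`, the flank size, and the zero-padding at the sequence ends are all illustrative assumptions, not details taken from the abstract or from the LMCrot code.

```python
from typing import List

Vector = List[float]


def window_embedding(per_residue: List[Vector], site: int, flank: int) -> List[Vector]:
    """Return the (2 * flank + 1) embedding rows centered on `site` (0-based).

    Positions that fall outside the protein are zero-padded. The padding
    scheme is an assumption of this sketch, not something the abstract states.
    """
    if not 0 <= site < len(per_residue):
        raise IndexError("site is outside the sequence")
    dim = len(per_residue[0])
    window = []
    for pos in range(site - flank, site + flank + 1):
        if 0 <= pos < len(per_residue):
            window.append(per_residue[pos])
        else:
            window.append([0.0] * dim)
    return window


# Toy stand-in for a pLM output: one 2-dimensional vector per residue.
sequence = "MKTAYK"
toy_embeddings = [[float(i), float(ord(aa))] for i, aa in enumerate(sequence)]

# Candidate crotonylation sites are lysines (K).
lysine_sites = [i for i, aa in enumerate(sequence) if aa == "K"]
windows = {site: window_embedding(toy_embeddings, site, flank=2) for site in lysine_sites}
```

Because the per-residue vectors are computed once over the whole protein, each window row still carries context from residues outside the window, which is the distinction the abstract draws between full-length and window-sequence inputs.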

Subject(s)

Proteins , Proteins/chemistry , Proteins/metabolism , Natural Language Processing , Computational Biology/methods , Databases, Protein , Software , Protein Processing, Post-Translational , Amino Acid Sequence

Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction.

Pokharel, Suresh; Pratyush, Pawel; Ismail, Hamid D; Ma, Junfeng; Kc, Dukka B.

Int J Mol Sci ; 24(21)2023 Nov 06.

Article in English | MEDLINE | ID: mdl-37958983

ABSTRACT

O-linked β-N-acetylglucosamine (O-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. O-GlcNAc modification (i.e., O-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcription, epigenetic modifications, and cell signaling. Despite the great progress in experimentally mapping O-GlcNAc sites, there is an unmet need to develop robust prediction tools that can effectively locate the presence of O-GlcNAc sites in protein sequences of interest. In this work, we performed a comprehensive evaluation of a framework for prediction of protein O-GlcNAc sites using embeddings from pre-trained protein language models. In particular, we compared the performance of three protein sequence-based large protein language models (pLMs), Ankh, ESM-2, and ProtT5, for prediction of O-GlcNAc sites and also evaluated various ensemble strategies to integrate embeddings from these protein language models. Upon investigation, the decision-level fusion approach that integrates the decisions of the three embedding models, which we call LM-OGlcNAc-Site, outperformed the models trained on these individual language models as well as other fusion approaches and other existing predictors in almost all of the parameters evaluated. The precise prediction of O-GlcNAc sites will facilitate the probing of O-GlcNAc site-specific functions of proteins in physiology and diseases. Moreover, these findings also indicate the effectiveness of combined uses of multiple protein language models in post-translational modification prediction and open exciting avenues for further research and exploration in other protein downstream tasks. LM-OGlcNAc-Site's web server and source code are publicly available to the community.
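Decision-level fusion, as mentioned above, combines the final decisions of separate models rather than their embeddings. The sketch below uses a simple majority vote over three per-site probability lists. The abstract does not state which combination rule LM-OGlcNAc-Site uses, so the majority rule, the 0.5 threshold, the function name `majority_vote`, and the toy scores are all assumptions made for illustration.

```python
from typing import Dict, List


def majority_vote(probabilities: Dict[str, List[float]], threshold: float = 0.5) -> List[int]:
    """Fuse per-model probabilities into one binary label per site.

    Each model first makes its own decision (probability >= threshold);
    a site is called positive when more than half of the models agree.
    """
    names = list(probabilities)
    n_sites = len(probabilities[names[0]])
    fused = []
    for i in range(n_sites):
        votes = sum(1 for name in names if probabilities[name][i] >= threshold)
        fused.append(1 if votes * 2 > len(names) else 0)
    return fused


# Hypothetical scores for three candidate S/T sites from three base models.
toy_scores = {
    "Ankh": [0.9, 0.4, 0.2],
    "ESM-2": [0.7, 0.6, 0.1],
    "ProtT5": [0.3, 0.8, 0.4],
}
fused_labels = majority_vote(toy_scores)
```

With an odd number of base models a strict majority always exists, so no tie-breaking rule is needed in this setting.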

Subject(s)

Protein Processing, Post-Translational , Proteins , Proteins/chemistry , Amino Acid Sequence , Acetylglucosamine/metabolism , N-Acetylglucosaminyltransferases/metabolism

LMPhosSite: A Deep Learning-Based Approach for General Protein Phosphorylation Site Prediction Using Embeddings from the Local Window Sequence and Pretrained Protein Language Model.

Pakhrin, Subash C; Pokharel, Suresh; Pratyush, Pawel; Chaudhari, Meenal; Ismail, Hamid D; Kc, Dukka B.

J Proteome Res ; 22(8): 2548-2557, 2023 08 04.

Article in English | MEDLINE | ID: mdl-37459437

ABSTRACT

Phosphorylation is one of the most important post-translational modifications and plays a pivotal role in various cellular processes. Although there exist several computational tools to predict phosphorylation sites, existing tools have not yet harnessed the knowledge distilled by pretrained protein language models. Herein, we present a novel deep learning-based approach called LMPhosSite for general phosphorylation site prediction that integrates embeddings from the local window sequence and the contextualized embedding obtained using the global (overall) protein sequence from a pretrained protein language model to improve the prediction performance. Thus, LMPhosSite consists of two base-models: one for capturing effective local representation and the other for capturing global per-residue contextualized embedding from a pretrained protein language model. The outputs of these base-models are integrated using a score-level fusion approach. LMPhosSite achieves a precision, recall, Matthews correlation coefficient, and F1-score of 38.78%, 67.12%, 0.390, and 49.15%, respectively, for the combined serine and threonine independent test data set and 34.90%, 62.03%, 0.298, and 44.67%, respectively, for the tyrosine independent test data set, which is better than the compared approaches. These results demonstrate that LMPhosSite is a robust computational tool for the prediction of general phosphorylation sites in proteins.
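The reported F1-scores can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below recomputes it from the rounded figures in the abstract; the helper name `f1_score` is an illustrative choice, and small differences in the last digit are expected because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, expressed as fractions.
serine_threonine_f1 = f1_score(0.3878, 0.6712)
tyrosine_f1 = f1_score(0.3490, 0.6203)

print(f"S/T F1: {serine_threonine_f1:.4f}")
print(f"Y F1:   {tyrosine_f1:.4f}")
```

Both recomputed values agree with the abstract's 49.15% and 44.67% to within rounding.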

Subject(s)

Deep Learning , Phosphorylation , Proteins/metabolism , Protein Processing, Post-Translational , Amino Acid Sequence

A Review of the Neutrophil Extracellular Traps (NETs) from Cow, Sheep and Goat Models.

Worku, Mulumebet; Rehrah, Djaafar; Ismail, Hamid D; Asiamah, Emmanuel; Adjei-Fremah, Sarah.

Int J Mol Sci ; 22(15)2021 Jul 28.

Article in English | MEDLINE | ID: mdl-34360812

ABSTRACT

This review provides insight into the importance of understanding NETosis in cows, sheep, and goats in light of the importance to their health, welfare and use as animal models. Neutrophils are essential to innate immunity, pathogen infection, and inflammatory diseases. The relevance of NETosis as a conserved innate immune response mechanism and the translational implications for public health are presented. Increased understanding of NETosis in ruminants will contribute to the prediction of pathologies and design of strategic interventions targeting NETs. This will help to control pathogens such as coronaviruses and inflammatory diseases such as mastitis that impact all mammals, including humans. Definition of unique attributes of NETosis in ruminants, in comparison to what has been observed in humans, has significant translational implications for one health and global food security, and thus warrants further study.

Subject(s)

Extracellular Traps/immunology , Immunity, Innate , Neutrophils/immunology , Ruminants/immunology , Animals , Humans , Neutrophils/cytology

RF-NR: Random Forest Based Approach for Improved Classification of Nuclear Receptors.

Ismail, Hamid D; Saigo, Hiroto; Kc, Dukka B.

IEEE/ACM Trans Comput Biol Bioinform ; 15(6): 1844-1852, 2018.

Article in English | MEDLINE | ID: mdl-29990125

ABSTRACT

The Nuclear Receptor (NR) superfamily plays an important role in key biological, developmental, and physiological processes. Developing a method for the classification of NR proteins is an important step towards understanding the structure and functions of newly discovered NR proteins. Recent studies on NR classification are either unable to achieve optimum accuracy or are not designed for all the known NR subfamilies. In this study, we developed RF-NR, which is a Random Forest based approach for improved classification of nuclear receptors. RF-NR can predict whether a query protein sequence belongs to one of the eight NR subfamilies or is a non-NR sequence. RF-NR uses spectrum-like features, namely Amino Acid Composition, Di-peptide Composition, and Tripeptide Composition. Benchmarking on two independent datasets with varying sequence redundancy reduction criteria, RF-NR achieves better (or comparable) accuracy than other existing methods. The added advantage of our approach is that we can also obtain biological insights about the important features that are required to classify NR subfamilies. RF-NR is freely available at http://bcb.ncat.edu/RF_NR.
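The spectrum-like features named above are normalized k-mer frequencies: k = 1 gives Amino Acid Composition (20 values), k = 2 Di-peptide Composition (400 values), and k = 3 Tripeptide Composition (8,000 values). The following is a minimal sketch of that computation; the function name `kmer_composition` and the toy sequence are illustrative, and the exact normalization used by RF-NR is not given in the abstract, so dividing by the number of overlapping k-mers is an assumption.

```python
from collections import Counter
from itertools import product
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence: str, k: int) -> Dict[str, float]:
    """Frequency of every possible k-mer over the 20 standard amino acids.

    Frequencies are counts divided by the number of overlapping k-mers in
    the sequence. K-mers containing non-standard residues are counted in
    that total but do not appear in the output vector.
    """
    total = len(sequence) - k + 1
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = Counter(sequence[i:i + k] for i in range(total))
    all_kmers = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return {kmer: counts.get(kmer, 0) / total for kmer in all_kmers}


toy_sequence = "ACACD"
aac = kmer_composition(toy_sequence, 1)  # Amino Acid Composition
dpc = kmer_composition(toy_sequence, 2)  # Di-peptide Composition
```

Concatenating the three vectors yields a fixed-length input regardless of protein length, which is what makes such features convenient for a Random Forest.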

Subject(s)

Computational Biology/methods , Receptors, Cytoplasmic and Nuclear/chemistry , Receptors, Cytoplasmic and Nuclear/classification , Algorithms , Databases, Protein , Machine Learning

CNN-BLPred: a Convolutional neural network based predictor for β-Lactamases (BL) and their classes.

White, Clarence; Ismail, Hamid D; Saigo, Hiroto; Kc, Dukka B.

BMC Bioinformatics ; 18(Suppl 16): 577, 2017 12 28.

Article in English | MEDLINE | ID: mdl-29297322

ABSTRACT

BACKGROUND: The β-Lactamase (BL) enzyme family is an important class of enzymes that plays a key role in bacterial resistance to antibiotics. As the number of newly identified BL enzymes is increasing daily, it is imperative to develop a computational tool to classify the newly identified BL enzymes into one of its classes. There are two types of classification of BL enzymes: Molecular Classification and Functional Classification. Existing computational methods only address Molecular Classification and the performance of these existing methods is unsatisfactory. RESULTS: We addressed the unsatisfactory performance of the existing methods by implementing a Deep Learning approach called Convolutional Neural Network (CNN). We developed CNN-BLPred, an approach for the classification of BL proteins. CNN-BLPred uses Gradient Boosted Feature Selection (GBFS) in order to select the ideal feature set for each BL classification. Based on the rigorous benchmarking of CNN-BLPred using both leave-one-out cross-validation and independent test sets, CNN-BLPred performed better than the other existing algorithms. Compared with other architectures of CNN, Recurrent Neural Network, and Random Forest, the simple CNN architecture with only one convolutional layer performs the best. After feature extraction, we were able to remove ~95% of the 10,912 features using Gradient Boosted Trees. During 10-fold cross validation, we increased the accuracy of the classic BL predictions by 7%. We also increased the accuracy of Class A, Class B, Class C, and Class D performance by an average of 25.64%. The independent test results followed a similar trend. CONCLUSIONS: We implemented a deep learning algorithm known as Convolutional Neural Network (CNN) to develop a classifier for BL classification. Combined with feature selection on an exhaustive feature set and using balancing methods such as Random Oversampling (ROS), Random Undersampling (RUS) and Synthetic Minority Oversampling Technique (SMOTE), CNN-BLPred performs significantly better than existing algorithms for BL classification.
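Of the balancing methods listed, Random Oversampling is the simplest: minority-class examples are duplicated at random until every class matches the size of the largest one. The sketch below shows the general technique using only the standard library; the function name `random_oversample`, the fixed seed, and the toy data are illustrative assumptions and do not reproduce the CNN-BLPred pipeline.

```python
import random
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def random_oversample(
    samples: Sequence, labels: Sequence[Hashable], seed: int = 0
) -> Tuple[List, List]:
    """Duplicate randomly chosen minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(members) for members in by_class.values())

    out_samples, out_labels = [], []
    for label, members in by_class.items():
        # Draw with replacement only the number of extra copies needed.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for sample in members + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy imbalanced set: four non-BL feature vectors and one Class A example.
toy_samples = [[0.1], [0.2], [0.3], [0.4], [0.9]]
toy_labels = ["non-BL", "non-BL", "non-BL", "non-BL", "Class A"]
balanced_samples, balanced_labels = random_oversample(toy_samples, toy_labels)
class_counts = Counter(balanced_labels)
```

Oversampling should be applied to the training folds only; duplicating examples before splitting would leak copies of the same protein into the test set.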

Subject(s)

Algorithms , Neural Networks, Computer , beta-Lactamases/classification , Amino Acid Sequence , Databases, Protein , Models, Molecular , ROC Curve , Reproducibility of Results

RF-Hydroxysite: a random forest based predictor for hydroxylation sites.

Ismail, Hamid D; Newman, Robert H; Kc, Dukka B.

Mol Biosyst ; 12(8): 2427-35, 2016 07 19.

Article in English | MEDLINE | ID: mdl-27292874

ABSTRACT

Protein hydroxylation is an emerging posttranslational modification involved in both normal cellular processes and a growing number of pathological states, including several cancers. Protein hydroxylation is mediated by members of the hydroxylase family of enzymes, which catalyze the conversion of an alkyl group at select lysine or proline residues on their target substrates to a hydroxyl. Traditionally, hydroxylation has been identified using expensive and time-consuming experimental methods, such as tandem mass spectrometry. Therefore, to facilitate identification of putative hydroxylation sites and to complement existing experimental approaches, computational methods designed to predict the hydroxylation sites in protein sequences have recently been developed. Building on these efforts, we have developed a new method, termed RF-Hydroxysite, that uses random forest to identify putative hydroxylysine and hydroxyproline residues in proteins using only the primary amino acid sequence as input. RF-Hydroxysite integrates features previously shown to contribute to hydroxylation site prediction with several new features that we found to augment the performance remarkably. These include features that capture physicochemical, structural, sequence-order and evolutionary information from the protein sequences. The features used in the final model were selected based on their contribution to the prediction. Physicochemical information was found to contribute the most to the model. The present study also sheds light on the contribution of evolutionary, sequence order, and protein disordered region information to hydroxylation site prediction. The web server for RF-Hydroxysite is available online at .

Subject(s)

Computational Biology/methods , Lysine/chemistry , Proline/chemistry , Proteins/chemistry , Amino Acid Sequence , Amino Acids/chemistry , Amino Acids/metabolism , Hydrophobic and Hydrophilic Interactions , Hydroxylation , Lysine/metabolism , Proline/metabolism , Proteins/metabolism , ROC Curve

RF-Phos: A Novel General Phosphorylation Site Prediction Tool Based on Random Forest.

Ismail, Hamid D; Jones, Ahoi; Kim, Jung H; Newman, Robert H; Kc, Dukka B.

Biomed Res Int ; 2016: 3281590, 2016.

Article in English | MEDLINE | ID: mdl-27066500

ABSTRACT

Protein phosphorylation is one of the most widespread regulatory mechanisms in eukaryotes. Over the past decade, phosphorylation site prediction has emerged as an important problem in the field of bioinformatics. Here, we report a new method, termed Random Forest-based Phosphosite predictor 2.0 (RF-Phos 2.0), to predict phosphorylation sites given only the primary amino acid sequence of a protein as input. RF-Phos 2.0, which uses random forest with sequence and structural features, is able to identify putative sites of phosphorylation across many protein families. In side-by-side comparisons based on 10-fold cross validation and an independent dataset, RF-Phos 2.0 compares favorably to other popular mammalian phosphosite prediction methods, such as PhosphoSVM, GPS2.1, and Musite.

Subject(s)

Computational Biology/methods , Decision Trees , Proteins/chemistry , Proteins/metabolism , Sequence Analysis, Protein/methods , Software , Models, Statistical , Phosphorylation , Proteins/analysis
