1.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38662579

ABSTRACT

MOTIVATION: Recent advancements in natural language processing have highlighted the effectiveness of global contextualized representations from protein language models (pLMs) in numerous downstream tasks. Nonetheless, strategies to encode the site-of-interest leveraging pLMs for per-residue prediction tasks, such as crotonylation (Kcr) prediction, remain largely uncharted. RESULTS: Herein, we adopt a range of approaches for utilizing pLMs by experimenting with different input sequence types (full-length protein sequence versus window sequence), assessing the implications of utilizing per-residue embedding of the site-of-interest as well as embeddings of window residues centered around it. Building upon these insights, we developed a novel residual ConvBiLSTM network designed to process window-level embeddings of the site-of-interest generated by the ProtT5-XL-UniRef50 pLM using full-length sequences as input. This model, termed T5ResConvBiLSTM, surpasses existing state-of-the-art Kcr predictors in performance across three diverse datasets. To validate our approach of utilizing full sequence-based window-level embeddings, we also delved into the interpretability of ProtT5-derived embedding tensors in two ways: firstly, by scrutinizing the attention weights obtained from the transformer's encoder block; and secondly, by computing SHAP values for these tensors, providing a model-agnostic interpretation of the prediction results. Additionally, we enhance the latent representation of ProtT5 by incorporating two additional local representations, one derived from amino acid properties and the other from a supervised embedding layer, through an intermediate fusion stacked generalization approach, using an n-mer window sequence (or peptide/fragment). The resultant stacked model, dubbed LMCrot, exhibits a more pronounced improvement in predictive performance across the tested datasets.
AVAILABILITY AND IMPLEMENTATION: LMCrot is publicly available at https://github.com/KCLabMTU/LMCrot.
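
The window-level encoding described in this abstract can be illustrated with a minimal sketch. It assumes the per-residue embeddings of the full-length protein are already available as a list of vectors; the function name `window_embeddings`, the `flank` parameter, and the zero-padding at the protein termini are assumptions made for illustration and are not taken from the LMCrot code.

```python
from typing import List

Vector = List[float]


def window_embeddings(per_residue: List[Vector], site: int, flank: int) -> List[Vector]:
    """Return the 2*flank+1 embedding vectors centered on `site` (0-based).

    Positions that fall outside the protein are filled with zero vectors.
    Zero-padding is an assumption of this sketch, not a documented choice.
    """
    if not 0 <= site < len(per_residue):
        raise IndexError("site is outside the sequence")
    dim = len(per_residue[0])
    window = []
    for pos in range(site - flank, site + flank + 1):
        if 0 <= pos < len(per_residue):
            window.append(per_residue[pos])
        else:
            window.append([0.0] * dim)
    return window


# Toy 2-dimensional "embeddings" for a 5-residue protein (made-up numbers).
toy = [[float(i), float(i) * 10] for i in range(5)]
win = window_embeddings(toy, site=0, flank=2)
```

Because the embeddings are computed once on the full-length sequence, each vector in the window still carries context from residues outside the window; only the slicing is local.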


Subjects
Proteins; Proteins/chemistry; Proteins/metabolism; Natural Language Processing; Computational Biology/methods; Databases, Protein; Software; Protein Processing, Post-Translational; Amino Acid Sequence
2.
Int J Mol Sci ; 24(21)2023 Nov 06.
Article in English | MEDLINE | ID: mdl-37958983

ABSTRACT

O-linked β-N-acetylglucosamine (O-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. O-GlcNAc modification (i.e., O-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcription, epigenetic modifications, and cell signaling. Despite the great progress in experimentally mapping O-GlcNAc sites, there is an unmet need to develop robust prediction tools that can effectively locate the presence of O-GlcNAc sites in protein sequences of interest. In this work, we performed a comprehensive evaluation of a framework for prediction of protein O-GlcNAc sites using embeddings from pre-trained protein language models. In particular, we compared the performance of three protein sequence-based large protein language models (pLMs), Ankh, ESM-2, and ProtT5, for prediction of O-GlcNAc sites and also evaluated various ensemble strategies to integrate embeddings from these protein language models. Upon investigation, the decision-level fusion approach that integrates the decisions of the three embedding models, which we call LM-OGlcNAc-Site, outperformed the models trained on these individual language models as well as other fusion approaches and other existing predictors in almost all of the parameters evaluated. The precise prediction of O-GlcNAc sites will facilitate the probing of O-GlcNAc site-specific functions of proteins in physiology and diseases. Moreover, these findings also indicate the effectiveness of combined uses of multiple protein language models in post-translational modification prediction and open exciting avenues for further research and exploration in other protein downstream tasks. LM-OGlcNAc-Site's web server and source code are publicly available to the community.
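
Decision-level fusion, as named in this abstract, combines the final decisions of several models rather than their embeddings. The abstract does not state the exact combination rule, so the sketch below uses a simple majority vote as a stand-in; the 0.5 threshold, the function names, and the example probabilities are all hypothetical.

```python
from typing import Dict


def majority_vote(decisions: Dict[str, int]) -> int:
    """Fuse binary decisions (1 = site, 0 = non-site) by simple majority."""
    if not decisions:
        raise ValueError("at least one decision is required")
    positives = sum(decisions.values())
    return 1 if positives * 2 > len(decisions) else 0


def fuse(probabilities: Dict[str, float], threshold: float = 0.5) -> int:
    """Threshold each model's probability, then vote on the decisions."""
    decisions = {name: int(p >= threshold) for name, p in probabilities.items()}
    return majority_vote(decisions)


# Hypothetical per-model probabilities for one candidate S/T residue.
example = {"Ankh": 0.91, "ESM-2": 0.62, "ProtT5": 0.18}
label = fuse(example)
```

With three models a tie cannot occur; with an even number of models this sketch resolves ties to the negative class.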


Subjects
Protein Processing, Post-Translational; Proteins; Proteins/chemistry; Amino Acid Sequence; Acetylglucosamine/metabolism; N-Acetylglucosaminyltransferases/metabolism
3.
J Proteome Res ; 22(8): 2548-2557, 2023 08 04.
Article in English | MEDLINE | ID: mdl-37459437

ABSTRACT

Phosphorylation is one of the most important post-translational modifications and plays a pivotal role in various cellular processes. Although several computational tools exist to predict phosphorylation sites, they have not yet harnessed the knowledge distilled by pretrained protein language models. Herein, we present a novel deep learning-based approach called LMPhosSite for general phosphorylation site prediction that integrates embeddings from the local window sequence and the contextualized embedding obtained using the global (overall) protein sequence from a pretrained protein language model to improve the prediction performance. Thus, LMPhosSite consists of two base-models: one for capturing effective local representation and the other for capturing global per-residue contextualized embedding from a pretrained protein language model. The output of these base-models is integrated using a score-level fusion approach. LMPhosSite achieves a precision, recall, Matthews correlation coefficient, and F1-score of 38.78%, 67.12%, 0.390, and 49.15%, respectively, for the combined serine and threonine independent test data set and 34.90%, 62.03%, 0.298, and 44.67%, respectively, for the tyrosine independent test data set, which is better than the compared approaches. These results demonstrate that LMPhosSite is a robust computational tool for the prediction of general phosphorylation sites in proteins.
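
As a quick consistency check on the figures quoted in this abstract, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the function name `f1_score` is just a label for this illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
st_f1 = f1_score(0.3878, 0.6712)  # combined serine/threonine test set
y_f1 = f1_score(0.3490, 0.6203)   # tyrosine test set
```

Both recomputed values agree with the reported 49.15% and 44.67% to within rounding of the published precision and recall.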


Subjects
Deep Learning; Phosphorylation; Proteins/metabolism; Protein Processing, Post-Translational; Amino Acid Sequence
4.
Methods Mol Biol ; 2499: 65-104, 2022.
Article in English | MEDLINE | ID: mdl-35696075

ABSTRACT

Machine learning has become one of the most popular choices for developing computational approaches in protein structural bioinformatics. The ability to extract features from protein sequence/structure often becomes one of the crucial steps for the development of machine learning-based approaches. Over the years, various sequence, structural, and physicochemical descriptors have been developed for proteins and these descriptors have been used to predict/solve various bioinformatics problems. Hence, several feature extraction tools have been developed over the years to help researchers generate numeric features from protein sequences. Most of these tools have some limitations regarding the number of sequences they can handle and the subsequent preprocessing that is required for the generated features before they can be fed to machine learning methods. Here, we present Feature Extraction from Protein Sequences (FEPS), a toolkit for feature extraction. FEPS is a versatile software package for generating various descriptors from protein sequences and can handle several sequences, the number of which is limited only by the computational resources. In addition, the features extracted by FEPS do not require subsequent processing and are ready to be fed to machine learning techniques, as FEPS provides various output formats as well as the ability to concatenate the generated features. FEPS is made freely available via an online web server as well as a stand-alone toolkit. FEPS, a comprehensive toolkit for feature extraction, will help spur the development of machine learning-based models for various bioinformatics problems.


Subjects
Computational Biology; Software; Algorithms; Amino Acid Sequence; Computational Biology/methods; Machine Learning; Proteins/chemistry
5.
Int J Mol Sci ; 22(15)2021 Jul 28.
Article in English | MEDLINE | ID: mdl-34360812

ABSTRACT

This review provides insight into the importance of understanding NETosis in cows, sheep, and goats in light of the importance to their health, welfare and use as animal models. Neutrophils are essential to innate immunity, pathogen infection, and inflammatory diseases. The relevance of NETosis as a conserved innate immune response mechanism and the translational implications for public health are presented. Increased understanding of NETosis in ruminants will contribute to the prediction of pathologies and design of strategic interventions targeting NETs. This will help to control pathogens such as coronaviruses and inflammatory diseases such as mastitis that impact all mammals, including humans. Definition of unique attributes of NETosis in ruminants, in comparison to what has been observed in humans, has significant translational implications for one health and global food security, and thus warrants further study.


Subjects
Extracellular Traps/immunology; Immunity, Innate; Neutrophils/immunology; Ruminants/immunology; Animals; Humans; Neutrophils/cytology
6.
Front Cell Dev Biol ; 9: 662983, 2021.
Article in English | MEDLINE | ID: mdl-34249915

ABSTRACT

Phosphorylation, which is mediated by protein kinases and opposed by protein phosphatases, is an important post-translational modification that regulates many cellular processes, including cellular metabolism, cell migration, and cell division. Due to its essential role in cellular physiology, a great deal of attention has been devoted to identifying sites of phosphorylation on cellular proteins and understanding how modification of these sites affects their cellular functions. This has led to the development of several computational methods designed to predict sites of phosphorylation based on a protein's primary amino acid sequence. In contrast, much less attention has been paid to dephosphorylation and its role in regulating the phosphorylation status of proteins inside cells. Indeed, to date, dephosphorylation site prediction tools have been restricted to a few tyrosine phosphatases. To fill this knowledge gap, we have employed a transfer learning strategy to develop a deep learning-based model to predict sites that are likely to be dephosphorylated. Based on independent test results, our model, which we termed DTL-DephosSite, achieved efficiency scores for phosphoserine/phosphothreonine residues of 84%, 84% and 0.68 with respect to sensitivity (SN), specificity (SP) and Matthews correlation coefficient (MCC), respectively. Similarly, DTL-DephosSite exhibited efficiency scores of 75%, 88% and 0.64 for phosphotyrosine residues with respect to SN, SP, and MCC, respectively.
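
The three metrics quoted in this abstract are all functions of the confusion matrix. The sketch below computes them from true/false positive and negative counts. The counts used in the example are hypothetical (100 positives and 100 negatives, chosen so that SN and SP equal the reported percentages); they are not the paper's actual test-set counts.

```python
import math
from typing import Tuple


def sn_sp_mcc(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float]:
    """Sensitivity, specificity and Matthews correlation coefficient."""
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc


# Hypothetical balanced test sets matching the reported SN and SP.
st = sn_sp_mcc(tp=84, fp=16, tn=84, fn=16)  # phosphoserine/phosphothreonine
y = sn_sp_mcc(tp=75, fp=12, tn=88, fn=25)   # phosphotyrosine
```

Under this balanced-set assumption, the MCC values come out at 0.68 and roughly 0.64, which matches the reported figures; an imbalanced test set with the same SN and SP would generally give a different MCC.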

7.
IEEE/ACM Trans Comput Biol Bioinform ; 15(6): 1844-1852, 2018.
Article in English | MEDLINE | ID: mdl-29990125

ABSTRACT

The Nuclear Receptor (NR) superfamily plays an important role in key biological, developmental, and physiological processes. Developing a method for the classification of NR proteins is an important step towards understanding the structure and functions of the newly discovered NR protein. The recent studies on NR classification are either unable to achieve optimum accuracy or are not designed for all the known NR subfamilies. In this study, we developed RF-NR, which is a Random Forest based approach for improved classification of nuclear receptors. The RF-NR can predict whether a query protein sequence belongs to one of the eight NR subfamilies or it is a non-NR sequence. The RF-NR uses spectrum-like features namely: Amino Acid Composition, Di-peptide Composition, and Tripeptide Composition. Benchmarking on two independent datasets with varying sequence redundancy reduction criteria, the RF-NR achieves better (or comparable) accuracy than other existing methods. The added advantage of our approach is that we can also obtain biological insights about the important features that are required to classify NR subfamilies. RF-NR is freely available at http://bcb.ncat.edu/RF_NR.
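
The three spectrum-like features named in this abstract are k-mer frequency vectors for k = 1, 2 and 3. A minimal sketch follows, assuming the 20 standard amino acids and skipping any k-mer that contains a non-standard residue; the function name `kmer_composition` and the handling of non-standard residues are assumptions, not details from RF-NR.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence: str, k: int) -> List[float]:
    """Relative frequency of every possible k-mer, in a fixed order."""
    seq = sequence.upper()
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers with non-standard residues
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


seq = "ACDA"  # toy sequence
aac = kmer_composition(seq, 1)  # amino acid composition, 20 values
dpc = kmer_composition(seq, 2)  # dipeptide composition, 400 values
tpc = kmer_composition(seq, 3)  # tripeptide composition, 8000 values
features = aac + dpc + tpc
```

The concatenated vector has 20 + 400 + 8000 = 8420 entries regardless of sequence length, which is what makes it usable as fixed-size input to a random forest.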


Subjects
Computational Biology/methods; Receptors, Cytoplasmic and Nuclear/chemistry; Receptors, Cytoplasmic and Nuclear/classification; Algorithms; Databases, Protein; Machine Learning
8.
BMC Bioinformatics ; 18(Suppl 16): 577, 2017 12 28.
Article in English | MEDLINE | ID: mdl-29297322

ABSTRACT

BACKGROUND: The β-Lactamase (BL) enzyme family is an important class of enzymes that plays a key role in bacterial resistance to antibiotics. As the number of newly identified BL enzymes is increasing daily, it is imperative to develop a computational tool to classify the newly identified BL enzymes into one of its classes. There are two types of classification of BL enzymes: Molecular Classification and Functional Classification. Existing computational methods only address Molecular Classification and the performance of these existing methods is unsatisfactory. RESULTS: We addressed the unsatisfactory performance of the existing methods by implementing a Deep Learning approach called Convolutional Neural Network (CNN). We developed CNN-BLPred, an approach for the classification of BL proteins. CNN-BLPred uses Gradient Boosted Feature Selection (GBFS) in order to select the ideal feature set for each BL classification. Based on the rigorous benchmarking of CNN-BLPred using both leave-one-out cross-validation and independent test sets, CNN-BLPred performed better than the other existing algorithms. Compared with other architectures of CNN, Recurrent Neural Network, and Random Forest, the simple CNN architecture with only one convolutional layer performs the best. After feature extraction, we were able to remove ~95% of the 10,912 features using Gradient Boosted Trees. During 10-fold cross validation, we increased the accuracy of the classic BL predictions by 7%. We also increased the accuracy of Class A, Class B, Class C, and Class D performance by an average of 25.64%. The independent test results followed a similar trend. CONCLUSIONS: We implemented a deep learning algorithm known as Convolutional Neural Network (CNN) to develop a classifier for BL classification. Combined with feature selection on an exhaustive feature set and using balancing methods such as Random Oversampling (ROS), Random Undersampling (RUS) and Synthetic Minority Oversampling Technique (SMOTE), CNN-BLPred performs significantly better than existing algorithms for BL classification.
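
Of the balancing methods listed in this abstract, Random Oversampling is the simplest: minority-class samples are duplicated at random until every class is as large as the largest one. The sketch below is a plain-Python illustration under that definition; the function name, the fixed seed, and the toy data are hypothetical and do not come from CNN-BLPred.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def random_oversample(
    samples: Sequence, labels: Sequence[Hashable], seed: int = 0
) -> Tuple[List, List]:
    """Duplicate random minority samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw with replacement from the original members of this class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy data: four samples of class "A", one of class "B".
X = ["a1", "a2", "a3", "a4", "b1"]
y = ["A", "A", "A", "A", "B"]
X_bal, y_bal = random_oversample(X, y)
```

Oversampling is normally applied only to the training portion of each split, since duplicating samples before splitting would leak copies of the same sample into the evaluation data.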


Subjects
Algorithms; Neural Networks, Computer; beta-Lactamases/classification; Amino Acid Sequence; Databases, Protein; Models, Molecular; ROC Curve; Reproducibility of Results
9.
Genom Data ; 10: 15-8, 2016 Dec.
Article in English | MEDLINE | ID: mdl-27656413

ABSTRACT

Probiotic supplements are beneficial for animal health and rumen function, and lipopolysaccharides (LPS) from gram-negative bacteria have been associated with inflammatory diseases. In this study, the transcriptional profile of whole blood collected from probiotics-treated cows was investigated in response to stimulation with LPS in vitro. A microarray experiment was performed between LPS-treated and control samples using Agilent one-color bovine (v2) 4x44K array slides. Global gene expression analysis identified 13,658 differentially expressed genes (fold change cutoff ≥ 2, P < 0.05): 3816 upregulated genes and 9842 downregulated genes in blood in response to LPS. Treatment with LPS increased the expression of TLR4 (fold change (FC) = 3.16) and the transcription factor NFkB (FC = 5.4) and decreased the expression of genes including TLR1 (FC = -2.54), TLR3 (FC = -2.43), TLR10 (FC = -3.88), NOD2 (FC = -2.4), NOD1 (FC = -2.45) and the pro-inflammatory cytokine IL1B (FC = -3.27). The regulation of the genes involved in the inflammation signaling pathway suggests that probiotics may stimulate the innate immune response of animals against parasitic and bacterial infections. We have provided a detailed description of the experimental design, microarray experiment, and normalization and analysis of the data, which have been deposited in the NCBI Gene Expression Omnibus (GEO): GSE75240.
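
The selection rule stated in this abstract (fold change cutoff of at least 2 and P < 0.05) can be written as a small filter. The sketch assumes the signed fold-change convention that the abstract's negative values suggest, where a negative sign marks downregulation; the function name and the per-gene P values in the example are hypothetical, since the abstract does not report them.

```python
def classify(signed_fc: float, p_value: float,
             fc_cutoff: float = 2.0, alpha: float = 0.05) -> str:
    """Label a gene 'up', 'down' or 'unchanged' from signed FC and P value."""
    if p_value >= alpha or abs(signed_fc) < fc_cutoff:
        return "unchanged"
    return "up" if signed_fc > 0 else "down"


# Fold changes from the abstract; the P values are placeholders.
calls = {
    "TLR4": classify(3.16, 0.01),
    "TLR1": classify(-2.54, 0.01),
    "IL1B": classify(-3.27, 0.01),
}

# The reported up- and downregulated counts sum to the reported total.
total_de = 3816 + 9842
```

The upregulated and downregulated counts in the abstract add up to the stated 13,658 differentially expressed genes.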

10.
Mol Biosyst ; 12(8): 2427-35, 2016 07 19.
Article in English | MEDLINE | ID: mdl-27292874

ABSTRACT

Protein hydroxylation is an emerging posttranslational modification involved in both normal cellular processes and a growing number of pathological states, including several cancers. Protein hydroxylation is mediated by members of the hydroxylase family of enzymes, which catalyze the conversion of an alkyne group at select lysine or proline residues on their target substrates to a hydroxyl. Traditionally, hydroxylation has been identified using expensive and time-consuming experimental methods, such as tandem mass spectrometry. Therefore, to facilitate identification of putative hydroxylation sites and to complement existing experimental approaches, computational methods designed to predict the hydroxylation sites in protein sequences have recently been developed. Building on these efforts, we have developed a new method, termed RF-Hydroxysite, that uses random forest to identify putative hydroxylysine and hydroxyproline residues in proteins using only the primary amino acid sequence as input. RF-Hydroxysite integrates features previously shown to contribute to hydroxylation site prediction with several new features that we found to augment the performance remarkably. These include features that capture physicochemical, structural, sequence-order and evolutionary information from the protein sequences. The features used in the final model were selected based on their contribution to the prediction. Physicochemical information was found to contribute the most to the model. The present study also sheds light on the contribution of evolutionary, sequence-order, and protein disordered region information to hydroxylation site prediction. The web server for RF-Hydroxysite is available online.


Subjects
Computational Biology/methods; Lysine/chemistry; Proline/chemistry; Proteins/chemistry; Amino Acid Sequence; Amino Acids/chemistry; Amino Acids/metabolism; Hydrophobic and Hydrophilic Interactions; Hydroxylation; Lysine/metabolism; Proline/metabolism; Proteins/metabolism; ROC Curve
11.
Biomed Res Int ; 2016: 3281590, 2016.
Article in English | MEDLINE | ID: mdl-27066500

ABSTRACT

Protein phosphorylation is one of the most widespread regulatory mechanisms in eukaryotes. Over the past decade, phosphorylation site prediction has emerged as an important problem in the field of bioinformatics. Here, we report a new method, termed Random Forest-based Phosphosite predictor 2.0 (RF-Phos 2.0), to predict phosphorylation sites given only the primary amino acid sequence of a protein as input. RF-Phos 2.0, which uses random forest with sequence and structural features, is able to identify putative sites of phosphorylation across many protein families. In side-by-side comparisons based on 10-fold cross validation and an independent dataset, RF-Phos 2.0 compares favorably to other popular mammalian phosphosite prediction methods, such as PhosphoSVM, GPS2.1, and Musite.


Subjects
Computational Biology/methods; Decision Trees; Proteins/chemistry; Proteins/metabolism; Sequence Analysis, Protein/methods; Software; Models, Statistical; Phosphorylation; Proteins/analysis