Results 1 - 7 of 7
1.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38855913

ABSTRACT

MOTIVATION: Coding and noncoding RNA molecules participate in many important biological processes. Noncoding RNAs fold into well-defined secondary structures to exert their functions. However, the computational prediction of the secondary structure from a raw RNA sequence is a long-standing unsolved problem, which after decades of almost unchanged performance has now re-emerged due to deep learning. Traditional RNA secondary structure prediction algorithms have been mostly based on thermodynamic models and dynamic programming for free energy minimization. More recently, deep learning methods have shown competitive performance compared with the classical ones, but there is still a wide margin for improvement. RESULTS: In this work we present sincFold, an end-to-end deep learning approach that predicts the nucleotide contact matrix using only the RNA sequence as input. The model is based on 1D and 2D residual neural networks that can learn short- and long-range interaction patterns. We show that structures can be accurately predicted with minimal physical assumptions. Extensive experiments were conducted on several benchmark datasets, considering sequence homology and cross-family validation. sincFold was compared with classical methods and recent deep learning models, showing that it can outperform the state-of-the-art methods.
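For readers unfamiliar with the term, a contact matrix is a symmetric L x L binary matrix in which entry (i, j) is 1 when nucleotides i and j are paired. The sketch below is only an illustration of that data structure, not part of sincFold: it assumes the structure is given in dot-bracket notation without pseudoknots, and the function name dot_bracket_to_contacts is hypothetical.

```python
def dot_bracket_to_contacts(structure):
    """Convert a dot-bracket string into a symmetric 0/1 contact matrix.

    Assumption: only '(', ')' and '.' are used (no pseudoknot brackets).
    """
    n = len(structure)
    contacts = [[0] * n for _ in range(n)]
    open_positions = []  # stack of indices of unmatched '('
    for j, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(j)
        elif symbol == ")":
            if not open_positions:
                raise ValueError("unbalanced structure: extra ')'")
            i = open_positions.pop()
            # A base pair is undirected, so mark both (i, j) and (j, i).
            contacts[i][j] = 1
            contacts[j][i] = 1
        elif symbol != ".":
            raise ValueError(f"unsupported symbol: {symbol!r}")
    if open_positions:
        raise ValueError("unbalanced structure: extra '('")
    return contacts


# Toy hairpin of length 6: positions 0-5 and 1-4 are paired.
example = dot_bracket_to_contacts("((..))")
```

A model that outputs such a matrix is predicting all base pairs at once, rather than building the structure through energy minimization.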


Subject(s)
Computational Biology , Deep Learning , Nucleic Acid Conformation , RNA , RNA/chemistry , RNA/genetics , Computational Biology/methods , Algorithms , Neural Networks, Computer , Thermodynamics
2.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38706315

ABSTRACT

To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, but at a low rate compared with protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on large unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.


Subject(s)
Databases, Protein , Proteins , Proteins/chemistry , Molecular Sequence Annotation/methods , Computational Biology/methods , Machine Learning
3.
Patterns (N Y) ; 4(2): 100691, 2023 Feb 10.
Article in English | MEDLINE | ID: mdl-36873903

ABSTRACT

The automatic annotation of the protein universe is still an unresolved challenge. Today, there are 229,149,489 entries in the UniProtKB database, but only 0.25% of them have been functionally annotated. This manual process integrates knowledge from the protein families database Pfam, annotating family domains using sequence alignments and hidden Markov models. This approach has grown the Pfam annotations at a low rate in recent years. Recently, deep learning models appeared with the capability of learning evolutionary patterns from unaligned protein sequences. However, this requires large-scale data, while many families contain just a few sequences. Here, we contend that this limitation can be overcome by transfer learning, exploiting the full potential of self-supervised learning on large unannotated data and then supervised learning on a small labeled dataset. We show results where errors in protein family prediction can be reduced by 55% with respect to standard methods.

4.
Brief Bioinform ; 23(4)2022 Jul 18.
Article in English | MEDLINE | ID: mdl-35758229

ABSTRACT

A representation method is an algorithm that calculates numerical feature vectors for samples in a dataset. Such vectors, also known as embeddings, define a relatively low-dimensional space able to efficiently encode high-dimensional data. Very recently, many types of learned data representations based on machine learning have appeared and are being applied to several tasks in bioinformatics. In particular, protein representation learning methods integrate different types of protein information (sequence, domains, etc.), in supervised or unsupervised learning approaches, and provide embeddings of protein sequences that can be used for downstream tasks. One task of special interest is the automatic function prediction of the huge number of novel proteins that are being discovered and are still totally uncharacterized. However, despite its importance, to date there is no fair benchmark study of the predictive performance of existing proposals on the same large set of proteins and for very concrete and common bioinformatics tasks. This lack of benchmark studies prevents the community from using adequate predictive methods for accelerating the functional characterization of proteins. In this study, we performed a detailed comparison of protein sequence representation learning methods, explaining each approach and comparing them with an experimental benchmark on several bioinformatics tasks: (i) determining protein sequence similarity in the embedding space; (ii) inferring protein domains and (iii) predicting ontology-based protein functions. We examine the advantages and disadvantages of each representation approach over the benchmark results. We hope the results and the discussion of this study can help the community to select the most adequate machine learning-based technique for protein representation according to the bioinformatics task at hand.
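As a rough illustration of task (i), similarity in an embedding space is often measured with cosine similarity between vectors. The sketch below assumes that metric (the abstract does not say which one the study used), and the protein names and two-dimensional vectors are made-up toy values, not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings):
    """Return the key whose embedding is closest to the query vector."""
    return max(embeddings, key=lambda name: cosine_similarity(query, embeddings[name]))


# Hypothetical toy embeddings; real protein embeddings have hundreds of dimensions.
toy_embeddings = {
    "protein_a": [1.0, 0.0],
    "protein_b": [0.0, 1.0],
}
nearest = most_similar([0.9, 0.1], toy_embeddings)
```

With real embeddings, the same nearest-neighbour lookup can serve as a simple baseline for transferring annotations from characterized proteins to uncharacterized ones.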


Subject(s)
Computational Biology , Proteins , Algorithms , Amino Acid Sequence , Computational Biology/methods , Machine Learning
5.
J Immunol ; 207(8): 1965-1977, 2021 Oct 15.
Article in English | MEDLINE | ID: mdl-34507950

ABSTRACT

Parasite-specific CD8 T cell responses play a key role in mediating immunity against Theileria parva in cattle (Bos taurus), and there is evidence that efficient induction of these responses requires CD4 T cell responses. However, information on the antigenic specificity of the CD4 T cell response is lacking. The current study used a high-throughput system for Ag identification using CD4 T cells from immune animals to screen a library of ∼40,000 synthetic peptides representing 499 T. parva gene products. Use of CD4 T cells from 12 immune cattle, representing 12 MHC class II types, identified 26 Ags. Unlike CD8 T cell responses, which are focused on a few dominant Ags, multiple Ags were recognized by CD4 T cell responses of individual animals. The Ags had diverse properties, but included proteins encoded by two multimember gene families: five haloacid dehalogenases and five subtelomere-encoded variable secreted proteins. Most Ags had predicted signal peptides and/or were encoded by abundantly transcribed genes, but neither parameter on their own was reliable for predicting antigenicity. Mapping of the epitopes confirmed presentation by DR or DQ class II alleles and comparison of available T. parva genome sequences demonstrated that they included both conserved and polymorphic epitopes. Immunization of animals with vaccine vectors expressing two of the Ags demonstrated induction of CD4 T cell responses capable of recognizing parasitized cells. The results of this study provide detailed insight into the CD4 T cell responses induced by T. parva and identify Ags suitable for use in vaccine development.


Subject(s)
CD4-Positive T-Lymphocytes/immunology , Protozoan Vaccines/immunology , Theileria parva/physiology , Theileriasis/immunology , Animals , Antigen Presentation , Antigens, Protozoan/immunology , Cattle , Cells, Cultured , Epitope Mapping , Epitopes, T-Lymphocyte/immunology , High-Throughput Screening Assays , Histocompatibility Antigens Class II , Lymphocyte Activation , Peptide Library , Peptides/chemical synthesis , Peptides/immunology , T-Cell Antigen Receptor Specificity
6.
Bioinformatics ; 35(7): 1098-1107, 2019 Apr 01.
Article in English | MEDLINE | ID: mdl-30169744

ABSTRACT

MOTIVATION: Understanding the specificity of protein receptor-ligand interactions is pivotal for our comprehension of biological mechanisms and systems. Receptor protein families often have a certain level of sequence diversity that converges into fewer conserved protein structures, allowing the exertion of well-defined functions. T and B cell receptors of the immune system and protein kinases that control the dynamic behaviour and decision processes in eukaryotic cells by catalysing phosphorylation represent prime examples. Driven by the large sequence diversity, the receptors within such protein families are often found to share specificities although divergent at the sequence level. This observation has led to the notion that prediction models of such systems are most effectively handled in a receptor-specific manner. RESULTS: We show that this approach is suboptimal in many cases, and describe an alternative improved framework for generating models with pan-receptor-predictive power for receptor protein families. The framework is based on deep artificial neural networks and integrates information from individual receptors into a single pan-receptor model, leveraging information across multiple receptor-specific datasets and allowing predictions of the receptor specificity for all members of a given protein family, including those described by limited or no ligand data. The approach was applied to the protein kinase superfamily, leading to the method NetPhosPan. The method was extensively validated and benchmarked against state-of-the-art prediction methods and was found to have unprecedented performance, in particular for kinase domains characterized by limited or no experimental data. AVAILABILITY AND IMPLEMENTATION: The method is freely available to non-commercial users and can be downloaded at http://www.cbs.dtu.dk/services/NetPhospan-1.0. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neural Networks, Computer , Ligands , Phosphorylation , Protein Kinases , Proteins
7.
J Immunol ; 197(4): 1517-24, 2016 Aug 15.
Article in English | MEDLINE | ID: mdl-27402703

ABSTRACT

Binding of peptides to MHC class I (MHC-I) molecules is the most selective event in the processing and presentation of Ags to CTL, and insights into the mechanisms that govern peptide-MHC-I binding should facilitate our understanding of CTL biology. Peptide-MHC-I interactions have traditionally been quantified by the strength of the interaction, that is, the binding affinity, yet it has been shown that the stability of the peptide-MHC-I complex is a better correlate of immunogenicity compared with binding affinity. In this study, we have experimentally analyzed peptide-MHC-I complex stability of a large panel of human MHC-I allotypes and generated a body of data sufficient to develop a neural network-based pan-specific predictor of peptide-MHC-I complex stability. Integrating the neural network predictors of peptide-MHC-I complex stability with state-of-the-art predictors of peptide-MHC-I binding is shown to significantly improve the prediction of CTL epitopes. The method is publicly available at http://www.cbs.dtu.dk/services/NetMHCstabpan.


Subject(s)
Antigen Presentation/immunology , Epitopes, T-Lymphocyte/immunology , Histocompatibility Antigens Class I/immunology , Lymphocyte Activation/immunology , Neural Networks, Computer , Histocompatibility Antigens Class I/chemistry , Histocompatibility Antigens Class I/metabolism , Humans , Peptides/chemistry , Peptides/immunology , Peptides/metabolism , Protein Stability