Your browser doesn't support javascript.
loading
Show: 20 | 50 | 100
Results 1 - 20 de 17.796
Filter
1.
BMC Bioinformatics ; 25(1): 204, 2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38824535

ABSTRACT

BACKGROUND: Protein solubility is a critically important physicochemical property closely related to protein expression. For example, it is one of the main factors to be considered in the design and production of antibody drugs and a prerequisite for realizing various protein functions. Although several solubility prediction models have emerged in recent years, many of these models are limited to capturing information embedded in one-dimensional amino acid sequences, resulting in unsatisfactory predictive performance. RESULTS: In this study, we introduce a novel Graph Attention network-based protein Solubility model, GATSol, which represents the 3D structure of proteins as a protein graph. In addition to the node features of amino acids extracted by the state-of-the-art protein large language model, GATSol utilizes amino acid distance maps generated using the latest AlphaFold technology. Rigorous testing on independent eSOL and the Saccharomyces cerevisiae test datasets has shown that GATSol outperforms most recently introduced models, especially with respect to the coefficient of determination R2, which reaches 0.517 and 0.424, respectively. It outperforms the current state-of-the-art GraphSol by 18.4% on the S. cerevisiae_test set. CONCLUSIONS: GATSol captures 3D dimensional features of proteins by building protein graphs, which significantly improves the accuracy of protein solubility prediction. Recent advances in protein structure modeling allow our method to incorporate spatial structure features extracted from predicted structures into the model by relying only on the input of protein sequences, which simplifies the entire graph neural network prediction process, making it more user-friendly and efficient. As a result, GATSol may help prioritize highly soluble proteins, ultimately reducing the cost and effort of experimental work. The source code and data of the GATSol model are freely available at https://github.com/binbinbinv/GATSol .


Subject(s)
Proteins , Solubility , Proteins/chemistry , Proteins/metabolism , Protein Conformation , Databases, Protein , Computational Biology/methods , Software , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae/chemistry , Algorithms , Models, Molecular , Amino Acid Sequence
2.
Sci Data ; 11(1): 568, 2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38824125

ABSTRACT

Technological advances in massively parallel sequencing have led to an exponential growth in the number of known protein sequences. Much of this growth originates from metagenomic projects producing new sequences from environmental and clinical samples. The Unified Human Gastrointestinal Proteome (UHGP) catalogue is one of the most relevant metagenomic datasets with applications ranging from medicine to biology. However, the low levels of sequence annotation may impair its usability. This work aims to produce a family classification of UHGP sequences to facilitate downstream structural and functional annotation. This is achieved through the release of the DPCfam-UHGP50 dataset containing 10,778 putative protein families generated using DPCfam clustering, an unsupervised pipeline grouping sequences into single or multi-domain architectures. DPCfam-UHGP50 considerably improves family coverage at protein and residue levels compared to the manually curated repository Pfam. In the hope that DPCfam-UHGP50 will foster future discoveries in the field of metagenomics of the human gut, we release a FAIR-compliant database of our results that is easily accessible via a searchable web server and Zenodo repository.


Subject(s)
Proteome , Humans , Gastrointestinal Tract/metabolism , Cluster Analysis , Molecular Sequence Annotation , Metagenomics , Databases, Protein
3.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicate that the alignments produced using the new $E$-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various $E$-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.


Subject(s)
Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Sequence Analysis, Protein/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Databases, Protein
4.
BMC Bioinformatics ; 25(1): 174, 2024 May 02.
Article in English | MEDLINE | ID: mdl-38698340

ABSTRACT

BACKGROUND: In last two decades, the use of high-throughput sequencing technologies has accelerated the pace of discovery of proteins. However, due to the time and resource limitations of rigorous experimental functional characterization, the functions of a vast majority of them remain unknown. As a result, computational methods offering accurate, fast and large-scale assignment of functions to new and previously unannotated proteins are sought after. Leveraging the underlying associations between the multiplicity of features that describe proteins could reveal functional insights into the diverse roles of proteins and improve performance on the automatic function prediction task. RESULTS: We present GO-LTR, a multi-view multi-label prediction model that relies on a high-order tensor approximation of model weights combined with non-linear activation functions. The model is capable of learning high-order relationships between multiple input views representing the proteins and predicting high-dimensional multi-label output consisting of protein functional categories. We demonstrate the competitiveness of our method on various performance measures. Experiments show that GO-LTR learns polynomial combinations between different protein features, resulting in improved performance. Additional investigations establish GO-LTR's practical potential in assigning functions to proteins under diverse challenging scenarios: very low sequence similarity to previously observed sequences, rarely observed and highly specific terms in the gene ontology. IMPLEMENTATION: The code and data used for training GO-LTR is available at https://github.com/aalto-ics-kepaco/GO-LTR-prediction .


Subject(s)
Computational Biology , Proteins , Proteins/chemistry , Proteins/metabolism , Computational Biology/methods , Databases, Protein , Algorithms
5.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701416

ABSTRACT

Predicting protein function is crucial for understanding biological life processes, preventing diseases and developing new drug targets. In recent years, methods based on sequence, structure and biological networks for protein function annotation have been extensively researched. Although obtaining a protein in three-dimensional structure through experimental or computational methods enhances the accuracy of function prediction, the sheer volume of proteins sequenced by high-throughput technologies presents a significant challenge. To address this issue, we introduce a deep neural network model DeepSS2GO (Secondary Structure to Gene Ontology). It is a predictor incorporating secondary structure features along with primary sequence and homology information. The algorithm expertly combines the speed of sequence-based information with the accuracy of structure-based features while streamlining the redundant data in primary sequences and bypassing the time-consuming challenges of tertiary structure analysis. The results show that the prediction performance surpasses state-of-the-art algorithms. It has the ability to predict key functions by effectively utilizing secondary structure information, rather than broadly predicting general Gene Ontology terms. Additionally, DeepSS2GO predicts five times faster than advanced algorithms, making it highly applicable to massive sequencing data. The source code and trained models are available at https://github.com/orca233/DeepSS2GO.


Subject(s)
Algorithms , Computational Biology , Neural Networks, Computer , Protein Structure, Secondary , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Computational Biology/methods , Databases, Protein , Gene Ontology , Sequence Analysis, Protein/methods , Software
6.
Life Sci Alliance ; 7(7)2024 Jul.
Article in English | MEDLINE | ID: mdl-38719747

ABSTRACT

The differential expression of plasma membrane proteins is integrally analyzed for their diagnosis, prognosis, and therapeutic applications in diverse clinical manifestations. Necessarily, distinct membrane protein enrichment methods and mass spectrometry platforms are employed for their global and relative quantitation. First of its kind to explore, we compiled membrane-associated proteomes in human and mouse systems into a database named, Resource of Experimental Membrane-Enriched Mass spectrometry-derived Proteome (REMEMProt). It currently hosts 14,626 proteins (9,507 proteins in Homo sapiens; 5,119 proteins in Mus musculus) with information on their membrane-protein enrichment methods, experimental/physiological context of detection in cells or tissues, transmembrane domain analysis, and their current attribution as biomarkers. Based on these annotations and the transmembrane domain analysis in proteins or their binary/complex protein-protein interactors, REMEMProt facilitates the assessment of the plasma membrane localization potential of proteins through batch query. A cross-study enrichment analysis platform is enabled in REMEMProt for comparative analysis of proteomes using novel/modified membrane enrichment methods and evaluation of methods for targeted enrichment of membrane proteins. REMEMProt data are made freely accessible to explore and download at https://rememprot.ciods.in/.


Subject(s)
Biomarkers , Databases, Protein , Membrane Proteins , Proteome , Proteomics , Humans , Proteome/metabolism , Membrane Proteins/metabolism , Biomarkers/metabolism , Animals , Mice , Proteomics/methods , Cell Membrane/metabolism , Mass Spectrometry/methods
7.
Sci Rep ; 14(1): 10527, 2024 05 08.
Article in English | MEDLINE | ID: mdl-38719885

ABSTRACT

Plasmodium falciparum, the causative agent of malaria, poses a significant global health challenge, yet much of its biology remains elusive. A third of the genes in the P. falciparum genome lack annotations regarding their function, impeding our understanding of the parasite's biology. In this study, we employ structure predictions and the DALI search algorithm to analyse proteins encoded by uncharacterized genes in the reference strain 3D7 of P. falciparum. By comparing AlphaFold predictions to experimentally determined protein structures in the Protein Data Bank, we found similarities to known domains in 353 proteins of unknown function, shedding light on their potential functions. The lowest-scoring 5% of similarities were additionally validated using the size-independent TM-align algorithm, confirming the detected similarities in 88% of the cases. Notably, in over 70 P. falciparum proteins the presence of domains resembling heptatricopeptide repeats, which are typically involvement in RNA binding and processing, was detected. This suggests this family, which is important in transcription in mitochondria and apicoplasts, is much larger in Plasmodium parasites than previously thought. The results of this domain search provide a resource to the malaria research community that is expected to inform and enable experimental studies.


Subject(s)
Plasmodium falciparum , Protozoan Proteins , Plasmodium falciparum/genetics , Plasmodium falciparum/metabolism , Protozoan Proteins/genetics , Protozoan Proteins/metabolism , Protozoan Proteins/chemistry , Algorithms , Protein Domains , Databases, Protein , Models, Molecular
8.
Curr Protoc ; 4(5): e1047, 2024 May.
Article in English | MEDLINE | ID: mdl-38720559

ABSTRACT

Recent advancements in protein structure determination and especially in protein structure prediction techniques have led to the availability of vast amounts of macromolecular structures. However, the accessibility and integration of these structures into scientific workflows are hindered by the lack of standardization among publicly available data resources. To address this issue, we introduced the 3D-Beacons Network, a unified platform that aims to establish a standardized framework for accessing and displaying protein structure data. In this article, we highlight the importance of standardized approaches for accessing protein structure data and showcase the capabilities of 3D-Beacons. We describe four protocols for finding and accessing macromolecular structures from various specialist data resources via 3D-Beacons. First, we describe three scenarios for programmatically accessing and retrieving data using the 3D-Beacons API. Next, we show how to perform sequence-based searches to find structures from model providers. Then, we demonstrate how to search for structures and fetch them directly into a workflow using JalView. Finally, we outline the process of facilitating access to data from providers interested in contributing their structures to the 3D-Beacons Network. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol 1: Programmatic access to the 3D-Beacons API Basic Protocol 2: Sequence-based search using the 3D-Beacons API Basic Protocol 3: Accessing macromolecules from 3D-Beacons with JalView Basic Protocol 4: Enhancing data accessibility through 3D-Beacons.


Subject(s)
Protein Conformation , Proteins , Proteins/chemistry , Databases, Protein , Software
9.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38706315

ABSTRACT

In UniProtKB, up to date, there are more than 251 million proteins deposited. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, however at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learningThis requires the use of protein large language models (LLMs), trained with self-supervision on big unnanotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the actual prediction of protein domain annotations. Results are significatively better than state-of-the-art for protein families classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLMs embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a github repo. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.


Subject(s)
Databases, Protein , Proteins , Proteins/chemistry , Molecular Sequence Annotation/methods , Computational Biology/methods , Machine Learning
10.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38706316

ABSTRACT

Protein-ligand interactions (PLIs) are essential for cellular activities and drug discovery. But due to the complexity and high cost of experimental methods, there is a great demand for computational approaches to recognize PLI patterns, such as protein-ligand docking. In recent years, more and more models based on machine learning have been developed to directly predict the root mean square deviation (RMSD) of a ligand docking pose with reference to its native binding pose. However, new scoring methods are pressingly needed in methodology for more accurate RMSD prediction. We present a new deep learning-based scoring method for RMSD prediction of protein-ligand docking poses based on a Graphormer method and Shell-like graph architecture, named GSScore. To recognize near-native conformations from a set of poses, GSScore takes atoms as nodes and then establishes the docking interface of protein-ligand into multiple bipartite graphs within different shell ranges. Benefiting from the Graphormer and Shell-like graph architecture, GSScore can effectively capture the subtle differences between energetically favorable near-native conformations and unfavorable non-native poses without extra information. GSScore was extensively evaluated on diverse test sets including a subset of PDBBind version 2019, CASF2016 as well as DUD-E, and obtained significant improvements over existing methods in terms of RMSE, $R$ (Pearson correlation coefficient), Spearman correlation coefficient and Docking power.


Subject(s)
Molecular Docking Simulation , Proteins , Ligands , Proteins/chemistry , Proteins/metabolism , Protein Binding , Software , Algorithms , Computational Biology/methods , Protein Conformation , Databases, Protein , Deep Learning
11.
BMC Bioinformatics ; 25(1): 176, 2024 May 04.
Article in English | MEDLINE | ID: mdl-38704533

ABSTRACT

BACKGROUND: Protein residue-residue distance maps are used for remote homology detection, protein information estimation, and protein structure research. However, existing prediction approaches are time-consuming, and hundreds of millions of proteins are discovered each year, necessitating the development of a rapid and reliable prediction method for protein residue-residue distances. Moreover, because many proteins lack known homologous sequences, a waiting-free and alignment-free deep learning method is needed. RESULT: In this study, we propose a learning framework named FreeProtMap. In terms of protein representation processing, the proposed group pooling in FreeProtMap effectively mitigates issues arising from high-dimensional sparseness in protein representation. In terms of model structure, we have made several careful designs. Firstly, it is designed based on the locality of protein structures and triangular inequality distance constraints to improve prediction accuracy. Secondly, inference speed is improved by using additive attention and lightweight design. Besides, the generalization ability is improved by using bottlenecks and a neural network block named local microformer. As a result, FreeProtMap can predict protein residue-residue distances in tens of milliseconds and has higher precision than the best structure prediction method. CONCLUSION: Several groups of comparative experiments and ablation experiments verify the effectiveness of the designs. The results demonstrate that FreeProtMap significantly outperforms other state-of-the-art methods in accurate protein residue-residue distance prediction, which is beneficial for lots of protein research works. It is worth mentioning that we could scan all proteins discovered each year based on FreeProtMap to find structurally similar proteins in a short time because the fact that the structure similarity calculation method based on distance maps is much less time-consuming than algorithms based on 3D structures.


Subject(s)
Proteins , Proteins/chemistry , Computational Biology/methods , Databases, Protein , Protein Conformation , Algorithms , Sequence Analysis, Protein/methods , Neural Networks, Computer
12.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38725156

ABSTRACT

Protein acetylation is one of the extensively studied post-translational modifications (PTMs) due to its significant roles across a myriad of biological processes. Although many computational tools for acetylation site identification have been developed, there is a lack of benchmark dataset and bespoke predictors for non-histone acetylation site prediction. To address these problems, we have contributed to both dataset creation and predictor benchmark in this study. First, we construct a non-histone acetylation site benchmark dataset, namely NHAC, which includes 11 subsets according to the sequence length ranging from 11 to 61 amino acids. There are totally 886 positive samples and 4707 negative samples for each sequence length. Secondly, we propose TransPTM, a transformer-based neural network model for non-histone acetylation site predication. During the data representation phase, per-residue contextualized embeddings are extracted using ProtT5 (an existing pre-trained protein language model). This is followed by the implementation of a graph neural network framework, which consists of three TransformerConv layers for feature extraction and a multilayer perceptron module for classification. The benchmark results reflect that TransPTM has the competitive performance for non-histone acetylation site prediction over three state-of-the-art tools. It improves our comprehension on the PTM mechanism and provides a theoretical basis for developing drug targets for diseases. Moreover, the created PTM datasets fills the gap in non-histone acetylation site datasets and is beneficial to the related communities. The related source code and data utilized by TransPTM are accessible at https://www.github.com/TransPTM/TransPTM.


Subject(s)
Neural Networks, Computer , Protein Processing, Post-Translational , Acetylation , Computational Biology/methods , Databases, Protein , Software , Algorithms , Humans , Proteins/chemistry , Proteins/metabolism
13.
Protein Sci ; 33(6): e4988, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38757367

ABSTRACT

Identifying unknown functional properties of proteins is essential for understanding their roles in both health and disease states. The domain composition of a protein can reveal critical information in this context, as domains are structural and functional units that dictate how the protein should act at the molecular level. The expensive and time-consuming nature of wet-lab experimental approaches prompted researchers to develop computational strategies for predicting the functions of proteins. In this study, we proposed a new method called Domain2GO that infers associations between protein domains and function-defining gene ontology (GO) terms, thus redefining the problem as domain function prediction. Domain2GO uses documented protein-level GO annotations together with proteins' domain annotations. Co-annotation patterns of domains and GO terms in the same proteins are examined using statistical resampling to obtain reliable associations. As a use-case study, we evaluated the biological relevance of examples selected from the Domain2GO-generated domain-GO term mappings via literature review. Then, we applied Domain2GO to predict unknown protein functions by propagating domain-associated GO terms to proteins annotated with these domains. For function prediction performance evaluation and comparison against other methods, we employed Critical Assessment of Function Annotation 3 (CAFA3) challenge datasets. The results demonstrated the high potential of Domain2GO, particularly for predicting molecular function and biological process terms, along with advantages such as producing interpretable results and having an exceptionally low computational cost. The approach presented here can be extended to other ontologies and biological entities to investigate unknown relationships in complex and large-scale biological data. The source code, datasets, results, and user instructions for Domain2GO are available at https://github.com/HUBioDataLab/Domain2GO. Additionally, we offer a user-friendly online tool at https://huggingface.co/spaces/HUBioDataLab/Domain2GO, which simplifies the prediction of functions of previously unannotated proteins solely using amino acid sequences.


Subject(s)
Molecular Sequence Annotation , Protein Domains , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Databases, Protein , Computational Biology/methods , Gene Ontology , Humans , Software
14.
Sci Data ; 11(1): 495, 2024 May 14.
Article in English | MEDLINE | ID: mdl-38744964

ABSTRACT

Single amino acid substitutions can profoundly affect protein folding, dynamics, and function. The ability to discern between benign and pathogenic substitutions is pivotal for therapeutic interventions and research directions. Given the limitations in experimental examination of these variants, AlphaMissense has emerged as a promising predictor of the pathogenicity of missense variants. Since heterogenous performance on different types of proteins can be expected, we assessed the efficacy of AlphaMissense across several protein groups (e.g. soluble, transmembrane, and mitochondrial proteins) and regions (e.g. intramembrane, membrane interacting, and high confidence AlphaFold segments) using ClinVar data for validation. Our comprehensive evaluation showed that AlphaMissense delivers outstanding performance, with MCC scores predominantly between 0.6 and 0.74. We observed low performance on disordered datasets and ClinVar data related to the CFTR ABC protein. However, a superior performance was shown when benchmarked against the high quality CFTR2 database. Our results with CFTR emphasizes AlphaMissense's potential in pinpointing functional hot spots, with its performance likely surpassing benchmarks calculated from ClinVar and ProteinGym datasets.


Subject(s)
Databases, Protein , Proteins , Humans , Amino Acid Substitution , Cystic Fibrosis Transmembrane Conductance Regulator/genetics , Cystic Fibrosis Transmembrane Conductance Regulator/chemistry , Mutation, Missense , Protein Folding , Proteins/chemistry , Proteins/genetics
15.
Comput Biol Chem ; 110: 108083, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38691894

ABSTRACT

Amino acid propensities for protein secondary structures are vital for protein structure prediction, understanding folding, and design, and have been studied using various theoretical and experimental methods. Traditional assessments of average propensities using statistical methods have been done on relatively smaller dataset for only a few secondary structures. They also involve averaging out the environmental factors and lack insights into consistency of preferences across diverse protein structures. While a few studies have explored variations in propensities across protein structural classes and folds, exploration of such variations across protein structures remains to be carried out. In this work, we have revised the average propensities for all six different secondary structures, namely α-helix, ß-strand, 310-helix, π-helix, turn and coil, analyzing the most exhaustive dataset available till date using two robust secondary structure assignment algorithms, DSSP and STRIDE. The propensities evaluated here can serve as a standard reference. Moreover, we present here, for the first time, the propensities within individual protein structures and investigated how the preferences of residues and more interestingly, of their groups formed based on their structural features, vary across different unique structures. We devised a novel approach- the minimal set analysis, based on the propensity distribution of residues, which along with the group propensities led us to the conclusion that a residue's preference for a specific secondary structure is primarily dictated by its side chain's structural features. The findings in this study provide a more insightful picture of residues propensities and can be useful in protein folding and design studies.


Subject(s)
Amino Acids , Databases, Protein , Protein Structure, Secondary , Proteins , Proteins/chemistry , Amino Acids/chemistry , Algorithms , Protein Folding
16.
Int J Food Microbiol ; 418: 110706, 2024 Jun 16.
Article in English | MEDLINE | ID: mdl-38696985

ABSTRACT

The metaproteomics field has recently gained more and more interest as a valuable tool for studying both the taxonomy and function of microbiomes, including those used in food fermentations. One crucial step in the metaproteomics pipeline is selecting a database to obtain high-quality taxonomical and functional information from microbial communities. One of the best strategies described for building protein databases is using sample-specific or study-specific protein databases obtained from metagenomic sequencing. While this is true for high-diversity microbiomes (such as gut and soil), there is still a lack of validation for different database construction strategies in low-diversity microbiomes, such as those found in fermented dairy products where starter cultures containing few species are used. In this study, we assessed the performance of various database construction strategies applied to metaproteomics on two low-diversity microbiomes obtained from cheese production using commercial starter cultures and analyzed by LC-MS/MS. Substantial differences were detected between the strategies, and the best performance in terms of the number of peptides and proteins identified from the spectra was achieved by metagenomic-derived databases. However, extensive databases constructed from a high number of available online genomes obtained a similar taxonomical and functional annotation of the metaproteome compared to the metagenomic-derived databases. Our results indicate that, in the case of low-diversity dairy microbiomes, the use of publically available genomes to construct protein databases can be considered as an alternative to metagenome-derived databases.


Subject(s)
Microbiota , Proteomics , Microbiota/genetics , Proteomics/methods , Bacteria/classification , Bacteria/genetics , Bacteria/isolation & purification , Tandem Mass Spectrometry , Metagenomics/methods , Cheese/microbiology , Dairy Products/microbiology , Databases, Protein , Chromatography, Liquid
17.
J Chem Inf Model ; 64(10): 4149-4157, 2024 May 27.
Article in English | MEDLINE | ID: mdl-38713459

ABSTRACT

A substantial portion of various organisms' proteomes comprises intrinsically disordered proteins (IDPs) that lack a defined three-dimensional structure. These IDPs exhibit a diverse array of conformations, displaying remarkable spatiotemporal heterogeneity and exceptional conformational flexibility. Characterizing the structure or structural ensemble of IDPs presents significant conceptual and methodological challenges owing to the absence of a well-defined native structure. While databases such as the Protein Ensemble Database (PED) provide IDP ensembles obtained through a combination of experimental data and molecular modeling, the absence of reaction coordinates poses challenges in comprehensively understanding pertinent aspects of the system. In this study, we leverage the energy landscape visualization method (JCTC, 6482, 2019) to scrutinize four IDP ensembles sourced from PED. ELViM, a methodology that circumvents the need for a priori reaction coordinates, aids in analyzing the ensembles. The specific IDP ensembles investigated are as follows: two fragments of nucleoporin (NUL: 884-993 and NUS: 1313-1390), yeast sic 1 N-terminal (1-90), and the N-terminal SH3 domain of Drk (1-59). Utilizing ELViM enables the comprehensive validation of ensembles, facilitating the detection of potential inconsistencies in the sampling process. Additionally, it allows for identifying and characterizing the most prevalent conformations within an ensemble. Moreover, ELViM facilitates the comparative analysis of ensembles obtained under diverse conditions, thereby providing a powerful tool for investigating the functional mechanisms of IDPs.


Subject(s)
Intrinsically Disordered Proteins , Protein Conformation , Intrinsically Disordered Proteins/chemistry , Thermodynamics , Models, Molecular , Databases, Protein , Saccharomyces cerevisiae Proteins/chemistry , Nuclear Pore Complex Proteins/chemistry , Nuclear Pore Complex Proteins/metabolism , Cyclin-Dependent Kinase Inhibitor Proteins
18.
Bioinformatics ; 40(5)2024 May 02.
Article in English | MEDLINE | ID: mdl-38718225

ABSTRACT

MOTIVATION: Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification. RESULTS: This article presents Chainsaw, a supervised learning approach to domain parsing that achieves accuracy that surpasses current state-of-the-art methods. Chainsaw uses a fully convolutional neural network which is trained to predict the probability that each pair of residues is in the same domain. Domain predictions are then derived from these pairwise predictions using an algorithm that searches for the most likely assignment of residues to domains given the set of pairwise co-membership probabilities. Chainsaw matches CATH domain annotations in 78% of protein domains versus 72% for the next closest method. When predicting on AlphaFold models, expert human evaluators were twice as likely to prefer Chainsaw's predictions versus the next best method. AVAILABILITY AND IMPLEMENTATION: github.com/JudeWells/Chainsaw.


Subject(s)
Algorithms , Neural Networks, Computer , Protein Domains , Proteins , Proteins/chemistry , Databases, Protein , Computational Biology/methods , Software , Humans
19.
Int J Mol Sci ; 25(9)2024 Apr 28.
Article in English | MEDLINE | ID: mdl-38732022

ABSTRACT

The molecular weight (MW) of an enzyme is a critical parameter in enzyme-constrained models (ecModels). It is determined by two factors: the presence of subunits and the abundance of each subunit. Although the number of subunits (NS) can potentially be obtained from UniProt, this information is not readily available for most proteins. In this study, we addressed this gap by extracting and curating subunit information from the UniProt database to establish a robust benchmark dataset. Subsequently, we propose a novel model named DeepSub, which leverages the protein language model and Bi-directional Gated Recurrent Unit (GRU), to predict NS in homo-oligomers solely based on protein sequences. DeepSub demonstrates remarkable accuracy, achieving an accuracy rate as high as 0.967, surpassing the performance of QUEEN. To validate the effectiveness of DeepSub, we performed predictions for protein homo-oligomers that have been reported in the literature but are not documented in the UniProt database. Examples include homoserine dehydrogenase from Corynebacterium glutamicum, Matrilin-4 from Mus musculus and Homo sapiens, and the Multimerins protein family from M. musculus and H. sapiens. The predicted results align closely with the reported findings in the literature, underscoring the reliability and utility of DeepSub.


Subject(s)
Databases, Protein , Deep Learning , Protein Subunits , Protein Subunits/chemistry , Protein Subunits/metabolism , Animals , Humans , Protein Multimerization , Mice , Computational Biology/methods
20.
Nat Commun ; 15(1): 3956, 2024 May 10.
Article in English | MEDLINE | ID: mdl-38730277

ABSTRACT

Immunopeptidomics is crucial for immunotherapy and vaccine development. Because the generation of immunopeptides from their parent proteins does not adhere to clear-cut rules, rather than being able to use known digestion patterns, every possible protein subsequence within human leukocyte antigen (HLA) class-specific length restrictions needs to be considered during sequence database searching. This leads to an inflation of the search space and results in lower spectrum annotation rates. Peptide-spectrum match (PSM) rescoring is a powerful enhancement of standard searching that boosts the spectrum annotation performance. We analyze 302,105 unique synthesized non-tryptic peptides from the ProteomeTools project on a timsTOF-Pro to generate a ground-truth dataset containing 93,227 MS/MS spectra of 74,847 unique peptides, that is used to fine-tune the deep learning-based fragment ion intensity prediction model Prosit. We demonstrate up to 3-fold improvement in the identification of immunopeptides, as well as increased detection of immunopeptides from low input samples.


Subject(s)
Deep Learning , Peptides , Tandem Mass Spectrometry , Humans , Peptides/chemistry , Peptides/immunology , Tandem Mass Spectrometry/methods , Databases, Protein , Proteomics/methods , HLA Antigens/immunology , HLA Antigens/genetics , Software , Ions
SELECTION OF CITATIONS
SEARCH DETAIL
...