Search | VHL Regional Portal

1.

Predicting human and viral protein variants affecting COVID-19 susceptibility and repurposing therapeutics.

Waman, Vaishali P; Ashford, Paul; Lam, Su Datt; Sen, Neeladri; Abbasian, Mahnaz; Woodridge, Laurel; Goldtzvik, Yonathan; Bordin, Nicola; Wu, Jiaxin; Sillitoe, Ian; Orengo, Christine A.

Sci Rep ; 14(1): 14208, 2024 06 20.

Article in English | MEDLINE | ID: mdl-38902252

ABSTRACT

The COVID-19 disease is an ongoing global health concern. Although vaccination provides some protection, people are still susceptible to re-infection. Ostensibly, certain populations or clinical groups may be more vulnerable. Factors causing these differences are unclear and whilst socioeconomic and cultural differences are likely to be important, human genetic factors could influence susceptibility. Experimental studies indicate SARS-CoV-2 uses innate immune suppression as a strategy to speed-up entry and replication into the host cell. Therefore, it is necessary to understand the impact of variants in immunity-associated human proteins on susceptibility to COVID-19. In this work, we analysed missense coding variants in several SARS-CoV-2 proteins and their human protein interactors that could enhance binding affinity to SARS-CoV-2. We curated a dataset of 19 SARS-CoV-2: human protein 3D-complexes, from the experimentally determined structures in the Protein Data Bank and models built using AlphaFold2-multimer, and analysed the impact of missense variants occurring in the protein-protein interface region. We analysed 468 missense variants from human proteins and 212 variants from SARS-CoV-2 proteins and computationally predicted their impacts on binding affinities for the human viral protein complexes. We predicted a total of 26 affinity-enhancing variants from 13 human proteins implicated in increased binding affinity to SARS-CoV-2. These include key-immunity associated genes (TOMM70, ISG15, IFIH1, IFIT2, RPS3, PALS1, NUP98, AXL, ARF6, TRIMM, TRIM25) as well as important spike receptors (KREMEN1, AXL and ACE2). We report both common (e.g., Y13N in IFIH1) and rare variants in these proteins and discuss their likely structural and functional impact, using information on known and predicted functional sites. Potential mechanisms associated with immune suppression implicated by these variants are discussed. 
Occurrence of certain predicted affinity-enhancing variants should be monitored as they could lead to increased susceptibility and reduced immune response to SARS-CoV-2 infection in individuals/populations carrying them. Our analyses aid in understanding the potential impact of genetic variation in immunity-associated proteins on COVID-19 susceptibility and help guide drug-repurposing strategies.

Subjects

COVID-19; Mutation, Missense; SARS-CoV-2; Humans; SARS-CoV-2/genetics; SARS-CoV-2/immunology; COVID-19/genetics; COVID-19/virology; COVID-19/immunology; Drug Repositioning; Viral Proteins/genetics; Viral Proteins/metabolism; Protein Binding; Genetic Predisposition to Disease; Disease Susceptibility; COVID-19 Drug Treatment

2.

Chainsaw: protein domain segmentation with fully convolutional neural networks.

Wells, Jude; Hawkins-Hooker, Alex; Bordin, Nicola; Sillitoe, Ian; Paige, Brooks; Orengo, Christine.

Bioinformatics ; 40(5), 2024 May 02.

Article in English | MEDLINE | ID: mdl-38718225

ABSTRACT

MOTIVATION: Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification. RESULTS: This article presents Chainsaw, a supervised learning approach to domain parsing that achieves accuracy that surpasses current state-of-the-art methods. Chainsaw uses a fully convolutional neural network which is trained to predict the probability that each pair of residues is in the same domain. Domain predictions are then derived from these pairwise predictions using an algorithm that searches for the most likely assignment of residues to domains given the set of pairwise co-membership probabilities. Chainsaw matches CATH domain annotations in 78% of protein domains versus 72% for the next closest method. When predicting on AlphaFold models, expert human evaluators were twice as likely to prefer Chainsaw's predictions versus the next best method. AVAILABILITY AND IMPLEMENTATION: github.com/JudeWells/Chainsaw.
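The step from pairwise co-membership probabilities to a domain assignment can be illustrated with a toy example. The sketch below is not the Chainsaw algorithm: it swaps the likelihood search described in the abstract for a much simpler rule (connected components over residue pairs whose probability exceeds a cut-off). The function name `assign_domains`, the 0.5 threshold and the toy matrix are all assumptions made for illustration.

```python
from typing import List


def assign_domains(pairwise: List[List[float]], threshold: float = 0.5) -> List[int]:
    """Label residues by connected components of the graph whose edges are
    residue pairs with co-membership probability above `threshold`.

    This is a simplified stand-in, not the published assignment search.
    """
    n = len(pairwise)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if pairwise[i][j] > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Renumber component roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    out = []
    for i in range(n):
        root = find(i)
        if root not in labels:
            labels[root] = len(labels)
        out.append(labels[root])
    return out


# Hypothetical 5-residue matrix: residues 0-2 and 3-4 form two blocks.
toy = [
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.3, 0.2],
    [0.1, 0.2, 0.3, 1.0, 0.9],
    [0.2, 0.1, 0.2, 0.9, 1.0],
]
print(assign_domains(toy))  # [0, 0, 0, 1, 1]
```

A hard threshold is brittle when a single spurious high-probability pair bridges two domains, which is one reason a search over whole assignments, as the abstract describes, may be preferable in practice.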

Subjects

Algorithms; Neural Networks, Computer; Protein Domains; Proteins; Proteins/chemistry; Databases, Protein; Computational Biology/methods; Software; Humans

3.

CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds.

Waman, Vaishali P; Bordin, Nicola; Alcraft, Rachel; Vickerstaff, Robert; Rauer, Clemens; Chan, Qian; Sillitoe, Ian; Yamamori, Hazuki; Orengo, Christine.

J Mol Biol ; : 168551, 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38548261

ABSTRACT

CATH (https://www.cathdb.info) classifies domain structures from experimental protein structures in the PDB and predicted structures in the AlphaFold Database (AFDB). To cope with the scale of the predicted data a new NextFlow workflow (CATH-AlphaFlow), has been developed to classify high-quality domains into CATH superfamilies and identify novel fold groups and superfamilies. CATH-AlphaFlow uses a novel state-of-the-art structure-based domain boundary prediction method (ChainSaw) for identifying domains in multi-domain proteins. We applied CATH-AlphaFlow to process PDB structures not classified in CATH and AFDB structures from 21 model organisms, expanding CATH by over 100%. Domains not classified in existing CATH superfamilies or fold groups were used to seed novel folds, giving 253 new folds from PDB structures (September 2023 release) and 96 from AFDB structures of proteomes of 21 model organisms. Where possible, functional annotations were obtained using (i) predictions from publicly available methods (ii) annotations from structural relatives in AFDB/UniProt50. We also predicted functional sites and highly conserved residues. Some folds are associated with important functions such as photosynthetic acclimation (in flowering plants), iron permease activity (in fungi) and post-natal spermatogenesis (in mice). CATH-AlphaFlow will allow us to identify many more CATH relatives in the AFDB, further characterising the protein structure landscape.

4.

KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units.

Adeyelu, Tolulope; Bordin, Nicola; Waman, Vaishali P; Sadlej, Marta; Sillitoe, Ian; Moya-Garcia, Aurelio A; Orengo, Christine A.

Biomolecules ; 13(2), 2023 02 02.

Article in English | MEDLINE | ID: mdl-36830646

ABSTRACT

Protein kinases are important targets for treating human disorders, and they are the second most targeted families after G-protein coupled receptors. Several resources provide classification of kinases into evolutionary families (based on sequence homology); however, very few systematically classify functional families (FunFams) comprising evolutionary relatives that share similar functional properties. We have developed the FunFam-MARC (Multidomain ARchitecture-based Clustering) protocol, which uses multi-domain architectures of protein kinases and specificity-determining residues for functional family classification. FunFam-MARC predicts 2210 kinase functional families (KinFams), which have increased functional coherence, in terms of EC annotations, compared to the widely used KinBase classification. Our protocol provides a comprehensive classification for kinase sequences from >10,000 organisms. We associate human KinFams with diseases and drugs and identify 28 druggable human KinFams, i.e., enriched in clinically approved drugs. Since relatives in the same druggable KinFam tend to be structurally conserved, including the drug-binding site, these KinFams may be valuable for shortlisting therapeutic targets. Information on the human KinFams and associated 3D structures from AlphaFold2 are provided via our CATH FTP website and Zenodo. This gives the domain structure representative of each KinFam together with information on any drug compounds available. For 32% of the KinFams, we provide information on highly conserved residue sites that may be associated with specificity.

Subjects

Protein Kinases; Proteins; Humans; Protein Kinases/metabolism; Proteins/chemistry; Databases, Protein; Sequence Homology, Amino Acid

5.

ModelCIF: An Extension of PDBx/mmCIF Data Representation for Computed Structure Models.

Vallat, Brinda; Tauriello, Gerardo; Bienert, Stefan; Haas, Juergen; Webb, Benjamin M; Zídek, Augustin; Zheng, Wei; Peisach, Ezra; Piehl, Dennis W; Anischanka, Ivan; Sillitoe, Ian; Tolchard, James; Varadi, Mihaly; Baker, David; Orengo, Christine; Zhang, Yang; Hoch, Jeffrey C; Kurisu, Genji; Patwardhan, Ardan; Velankar, Sameer; Burley, Stephen K; Sali, Andrej; Schwede, Torsten; Berman, Helen M; Westbrook, John D.

J Mol Biol ; 435(14): 168021, 2023 07 15.

Article in English | MEDLINE | ID: mdl-36828268

ABSTRACT

ModelCIF (github.com/ihmwg/ModelCIF) is a data information framework developed for and by computational structural biologists to enable delivery of Findable, Accessible, Interoperable, and Reusable (FAIR) data to users worldwide. ModelCIF describes the specific set of attributes and metadata associated with macromolecular structures modeled by solely computational methods and provides an extensible data representation for deposition, archiving, and public dissemination of predicted three-dimensional (3D) models of macromolecules. It is an extension of the Protein Data Bank Exchange / macromolecular Crystallographic Information Framework (PDBx/mmCIF), which is the global data standard for representing experimentally-determined 3D structures of macromolecules and associated metadata. The PDBx/mmCIF framework and its extensions (e.g., ModelCIF) are managed by the Worldwide Protein Data Bank partnership (wwPDB, wwpdb.org) in collaboration with relevant community stakeholders such as the wwPDB ModelCIF Working Group (wwpdb.org/task/modelcif). This semantically rich and extensible data framework for representing computed structure models (CSMs) accelerates the pace of scientific discovery. Herein, we describe the architecture, contents, and governance of ModelCIF, and tools and processes for maintaining and extending the data standard. Community tools and software libraries that support ModelCIF are also described.

Subjects

Databases, Protein; Macromolecular Substances/chemistry; Protein Conformation; Software

6.

AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms.

Bordin, Nicola; Sillitoe, Ian; Nallapareddy, Vamsi; Rauer, Clemens; Lam, Su Datt; Waman, Vaishali P; Sen, Neeladri; Heinzinger, Michael; Littmann, Maria; Kim, Stephanie; Velankar, Sameer; Steinegger, Martin; Rost, Burkhard; Orengo, Christine.

Commun Biol ; 6(1): 160, 2023 02 08.

Article in English | MEDLINE | ID: mdl-36755055

ABSTRACT

Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining cluster into 2367 putative novel superfamilies. Detailed manual analysis on 618 of these, having at least one human relative, reveal extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67% and increases the number of unique 'global' folds by 36% and will provide valuable insights on structure function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.

Subjects

Furylfuramide; Proteins; Humans; Databases, Protein; Proteins/chemistry

7.

CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models.

Nallapareddy, Vamsi; Bordin, Nicola; Sillitoe, Ian; Heinzinger, Michael; Littmann, Maria; Waman, Vaishali P; Sen, Neeladri; Rost, Burkhard; Orengo, Christine.

Bioinformatics ; 39(1), 2023 01 01.

Article in English | MEDLINE | ID: mdl-36648327

ABSTRACT

MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art hidden Markov model (HMM)-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein language models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on 1773 largest and 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models is available on https://github.com/vam-sin/CATHe, and the datasets developed in this study can be accessed on https://zenodo.org/record/6327572. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Algorithms; Proteins; Humans; Sequence Homology, Amino Acid; Proteins/chemistry; Databases, Protein

8.

InterPro in 2022.

Paysan-Lafosse, Typhaine; Blum, Matthias; Chuguransky, Sara; Grego, Tiago; Pinto, Beatriz Lázaro; Salazar, Gustavo A; Bileschi, Maxwell L; Bork, Peer; Bridge, Alan; Colwell, Lucy; Gough, Julian; Haft, Daniel H; Letunic, Ivica; Marchler-Bauer, Aron; Mi, Huaiyu; Natale, Darren A; Orengo, Christine A; Pandurangan, Arun P; Rivoire, Catherine; Sigrist, Christian J A; Sillitoe, Ian; Thanki, Narmada; Thomas, Paul D; Tosatto, Silvio C E; Wu, Cathy H; Bateman, Alex.

Nucleic Acids Res ; 51(D1): D418-D427, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36350672

ABSTRACT

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide a more user friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.

Subjects

Databases, Protein; Humans; Amino Acid Sequence; Artificial Intelligence; Internet; Proteins/chemistry; Software

9.

Dissecting peripheral protein-membrane interfaces.

Tubiana, Thibault; Sillitoe, Ian; Orengo, Christine; Reuter, Nathalie.

PLoS Comput Biol ; 18(12): e1010346, 2022 12.

Article in English | MEDLINE | ID: mdl-36516231

ABSTRACT

Peripheral membrane proteins (PMPs) include a wide variety of proteins that have in common to bind transiently to the chemically complex interfacial region of membranes through their interfacial binding site (IBS). In contrast to protein-protein or protein-DNA/RNA interfaces, peripheral protein-membrane interfaces are poorly characterized. We collected a dataset of PMP domains representative of the variety of PMP functions: membrane-targeting domains (Annexin, C1, C2, discoidin C2, PH, PX), enzymes (PLA, PLC/D) and lipid-transfer proteins (START). The dataset contains 1328 experimental structures and 1194 AlphaFold models. We mapped the amino acid composition and structural patterns of the IBS of each protein in this dataset, and evaluated which were more likely to be found at the IBS compared to the rest of the domains' accessible surface. In agreement with earlier work we find that about two thirds of the PMPs in the dataset have protruding hydrophobes (Leu, Ile, Phe, Tyr, Trp and Met) at their IBS. The three aromatic amino acids Trp, Tyr and Phe are a hallmark of PMPs IBS regardless of whether they protrude on loops or not. This is also the case for lysines but not arginines suggesting that, unlike for Arg-rich membrane-active peptides, the less membrane-disruptive lysine is preferred in PMPs. Another striking observation was the over-representation of glycines at the IBS of PMPs compared to the rest of their surface, possibly procuring IBS loops a much-needed flexibility to insert in-between membrane lipids. The analysis of the 9 superfamilies revealed amino acid distribution patterns in agreement with their known functions and membrane-binding mechanisms. Besides revealing novel amino acids patterns at protein-membrane interfaces, our work contributes a new PMP dataset and an analysis pipeline that can be further built upon for future studies of PMPs properties, or for developing PMPs prediction tools using for example, machine learning approaches.
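One way to picture the comparison between the IBS and the rest of the accessible surface is a per-residue-type log ratio of frequencies. The sketch below is only an illustration of that idea, not the authors' pipeline: the function name `log2_enrichment`, the pseudocount and the toy residue lists are assumptions.

```python
import math
from collections import Counter


def log2_enrichment(ibs_residues, other_surface_residues, pseudocount=1.0):
    """Return log2(frequency at IBS / frequency on the rest of the surface)
    for every residue type seen in either list.

    A pseudocount keeps the ratio finite for types absent from one side.
    """
    ibs = Counter(ibs_residues)
    rest = Counter(other_surface_residues)
    kinds = sorted(set(ibs) | set(rest))
    ibs_total = sum(ibs.values()) + pseudocount * len(kinds)
    rest_total = sum(rest.values()) + pseudocount * len(kinds)
    return {
        aa: math.log2(
            ((ibs[aa] + pseudocount) / ibs_total)
            / ((rest[aa] + pseudocount) / rest_total)
        )
        for aa in kinds
    }


# Hypothetical residue lists, not taken from the published dataset.
ibs_toy = ["GLY", "GLY", "TRP", "LYS"]
rest_toy = ["ASP", "GLU", "LYS", "ASP"]
scores = log2_enrichment(ibs_toy, rest_toy)
for aa, score in scores.items():
    print(f"{aa}: {score:+.2f}")
```

Positive values mark residue types over-represented at the IBS in the toy data and negative values mark depletion. A real analysis would also need a significance test, which this sketch leaves out.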

Subjects

Cell Membrane; Peptides; Amino Acids/chemistry; Binding Sites; Peptides/chemistry; Cell Membrane/chemistry

10.

3D-Beacons: decreasing the gap between protein sequences and structures through a federated network of protein structure data resources.

Varadi, Mihaly; Nair, Sreenath; Sillitoe, Ian; Tauriello, Gerardo; Anyango, Stephen; Bienert, Stefan; Borges, Clemente; Deshpande, Mandar; Green, Tim; Hassabis, Demis; Hatos, Andras; Hegedus, Tamas; Hekkelman, Maarten L; Joosten, Robbie; Jumper, John; Laydon, Agata; Molodenskiy, Dmitry; Piovesan, Damiano; Salladini, Edoardo; Salzberg, Steven L; Sommer, Markus J; Steinegger, Martin; Suhajda, Erzsebet; Svergun, Dmitri; Tenorio-Ku, Luiggi; Tosatto, Silvio; Tunyasuvunakool, Kathryn; Waterhouse, Andrew Mark; Zídek, Augustin; Schwede, Torsten; Orengo, Christine; Velankar, Sameer.

Gigascience ; 11, 2022 11 30.

Article in English | MEDLINE | ID: mdl-36448847

ABSTRACT

While scientists can often infer the biological function of proteins from their 3-dimensional quaternary structures, the gap between the number of known protein sequences and their experimentally determined structures keeps increasing. A potential solution to this problem is presented by ever more sophisticated computational protein modeling approaches. While often powerful on their own, most methods have strengths and weaknesses. Therefore, it benefits researchers to examine models from various model providers and perform comparative analysis to identify what models can best address their specific use cases. To make data from a large array of model providers more easily accessible to the broader scientific community, we established 3D-Beacons, a collaborative initiative to create a federated network with unified data access mechanisms. The 3D-Beacons Network allows researchers to collate coordinate files and metadata for experimentally determined and theoretical protein models from state-of-the-art and specialist model providers and also from the Protein Data Bank.

Subjects

Metadata; Records; Amino Acid Sequence; Databases, Protein; Computer Simulation

11.

Comprehensive Collection and Prediction of ABC Transmembrane Protein Structures in the AI Era of Structural Biology.

Tordai, Hedvig; Suhajda, Erzsebet; Sillitoe, Ian; Nair, Sreenath; Varadi, Mihaly; Hegedus, Tamas.

Int J Mol Sci ; 23(16), 2022 Aug 09.

Article in English | MEDLINE | ID: mdl-36012140

ABSTRACT

The number of unique transmembrane (TM) protein structures doubled in the last four years, which can be attributed to the revolution of cryo-electron microscopy. In addition, AlphaFold2 (AF2) also provided a large number of predicted structures with high quality. However, if a specific protein family is the subject of a study, collecting the structures of the family members is highly challenging in spite of existing general and protein domain-specific databases. Here, we demonstrate this and assess the applicability and usability of automatic collection and presentation of protein structures via the ABC protein superfamily. Our pipeline identifies and classifies transmembrane ABC protein structures using the PFAM search and also aims to determine their conformational states based on special geometric measures, conftors. Since the AlphaFold database contains structure predictions only for single polypeptide chains, we performed AF2-Multimer predictions for human ABC half transporters functioning as dimers. Our AF2 predictions warn of possibly ambiguous interpretation of some biochemical data regarding interaction partners and call for further experiments and experimental structure determination. We made our predicted ABC protein structures available through a web application, and we joined the 3D-Beacons Network to reach the broader scientific community through platforms such as PDBe-KB.

Subjects

ATP-Binding Cassette Transporters; Furylfuramide; ATP-Binding Cassette Transporters/metabolism; Artificial Intelligence; Biology; Cryoelectron Microscopy; Humans; Models, Molecular; Protein Conformation

12.

Contrastive learning on protein embeddings enlightens midnight zone.

Heinzinger, Michael; Littmann, Maria; Sillitoe, Ian; Bordin, Nicola; Orengo, Christine; Rost, Burkhard.

NAR Genom Bioinform ; 4(2): lqac043, 2022 Jun.

Article in English | MEDLINE | ID: mdl-35702380

ABSTRACT

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, has an improved ability to recognize distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that ascertained good performance comparable or better to existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments it is also orders of magnitudes faster. The code is available at https://github.com/Rostlab/EAT.
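The two ideas in this abstract, embedding-based annotation transfer and contrastive learning, can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the ProtTucker implementation: a triplet margin loss is used as one common contrastive objective (the abstract does not specify the loss), Euclidean distance is assumed, and the names `triplet_loss`, `transfer_annotation`, the labels and the two-dimensional toy embeddings are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative
    by at least `margin`; otherwise the size of the violation."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def transfer_annotation(query, lookup):
    """Return the label of the lookup embedding nearest to `query`.

    `lookup` is a list of (embedding, label) pairs with known annotations.
    """
    _, label = min(lookup, key=lambda item: euclidean(query, item[0]))
    return label


# Hypothetical 2-D embeddings; real pLM embeddings have far more dimensions.
lookup_toy = [((0.0, 0.0), "family_A"), ((5.0, 5.0), "family_B")]
print(transfer_annotation((0.5, 0.2), lookup_toy))            # family_A
print(triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0)))       # 0.0
print(triplet_loss((0.0, 0.0), (3.0, 4.0), (0.0, 1.0)))       # 5.0
```

Because the lookup is a distance comparison rather than an alignment, it needs no alignment step, which fits the speed advantage the abstract reports.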

13.

Characterizing and explaining the impact of disease-associated mutations in proteins without known structures or structural homologs.

Sen, Neeladri; Anishchenko, Ivan; Bordin, Nicola; Sillitoe, Ian; Velankar, Sameer; Baker, David; Orengo, Christine.

Brief Bioinform ; 23(4), 2022 07 18.

Article in English | MEDLINE | ID: mdl-35641150

ABSTRACT

Mutations in human proteins lead to diseases. The structure of these proteins can help understand the mechanism of such diseases and develop therapeutics against them. With improved deep learning techniques, such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologs. We modeled and extracted the domains from 553 disease-associated human proteins without known protein structures or close homologs in the Protein Databank. We noticed that the model quality was higher and the Root mean square deviation (RMSD) lower between AlphaFold and RoseTTAFold models for domains that could be assigned to CATH families as compared to those which could only be assigned to Pfam families of unknown structure or could not be assigned to either. We predicted ligand-binding sites, protein-protein interfaces and conserved residues in these predicted structures. We then explored whether the disease-associated missense mutations were in the proximity of these predicted functional sites, whether they destabilized the protein structure based on ddG calculations or whether they were predicted to be pathogenic. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization or pathogenicity. When compared to polymorphisms, a larger percentage of disease-associated missense mutations were buried, closer to predicted functional sites, predicted as destabilizing and pathogenic. Usage of models from the two state-of-the-art techniques provide better confidence in our predictions, and we explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.
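The RMSD used to compare AlphaFold and RoseTTAFold models of the same domain is the root of the mean squared distance between matched atoms. The sketch below assumes the two coordinate sets are already superposed and listed in the same atom order; it does not perform the optimal superposition step (for example the Kabsch algorithm) that a structural comparison would normally run first. The function name `rmsd` and the toy coordinates are assumptions.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two pre-superposed coordinate
    lists of (x, y, z) tuples given in the same atom order."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical coordinates: model B is model A shifted by 1 unit along z.
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(model_a, model_b))  # 1.0
```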

Subjects

Mutation, Missense; Proteins; Databases, Protein; Humans; Models, Molecular; Mutation; Proteins/chemistry; Proteins/genetics

14.

Assigning protein function from domain-function associations using DomFun.

Rojano, Elena; Jabato, Fernando M; Perkins, James R; Córdoba-Caballero, José; García-Criado, Federico; Sillitoe, Ian; Orengo, Christine; Ranea, Juan A G; Seoane-Zonjic, Pedro.

BMC Bioinformatics ; 23(1): 43, 2022 Jan 15.

Article in English | MEDLINE | ID: mdl-35033002

ABSTRACT

BACKGROUND: Protein function prediction remains a key challenge. Domain composition affects protein function. Here we present DomFun, a Ruby gem that uses associations between protein domains and functions, calculated using multiple indices based on tripartite network analysis. These domain-function associations are combined at the protein level, to generate protein-function predictions. RESULTS: We analysed 16 tripartite networks connecting homologous superfamily and FunFam domains from CATH-Gene3D with functional annotations from the three Gene Ontology (GO) sub-ontologies, KEGG, and Reactome. We validated the results using the CAFA 3 benchmark platform for GO annotation, finding that out of the multiple association metrics and domain datasets tested, Simpson index for FunFam domain-function associations combined with Stouffer's method leads to the best performance in almost all scenarios. We also found that using FunFams led to better performance than superfamilies, and better results were found for GO molecular function compared to GO biological process terms. DomFun performed as well as the highest-performing method in certain CAFA 3 evaluation procedures in terms of [Formula: see text] and [Formula: see text] We also implemented our own benchmark procedure, Pathway Prediction Performance (PPP), which can be used to validate function prediction for additional annotations sources, such as KEGG and Reactome. Using PPP, we found similar results to those found with CAFA 3 for GO, moreover we found good performance for the other annotation sources. As with CAFA 3, Simpson index with Stouffer's method led to the top performance in almost all scenarios. CONCLUSIONS: DomFun shows competitive performance with other methods evaluated in CAFA 3 when predicting proteins function with GO, although results vary depending on the evaluation procedure. Through our own benchmark procedure, PPP, we have shown it can also make accurate predictions for KEGG and Reactome. 
It performs best when using FunFams, combining Simpson index derived domain-function associations using Stouffer's method. The tool has been implemented so that it can be easily adapted to incorporate other protein features, such as domain data from other sources, amino acid k-mers and motifs. The DomFun Ruby gem is available from https://rubygems.org/gems/DomFun . Code maintained at https://github.com/ElenaRojano/DomFun . Validation procedure scripts can be found at https://github.com/ElenaRojano/DomFun_project .
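The two statistics named in this abstract have compact textbook definitions, shown here in Python rather than in DomFun's own Ruby. This is a generic sketch, not the DomFun code: how the gem builds its tripartite networks, derives per-association scores and feeds them into Stouffer's method is not described in the abstract, so the function names, the protein sets and the p-values below are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def simpson_index(a: set, b: set) -> float:
    """Simpson (overlap) index: shared items divided by the smaller set size."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def stouffer(p_values):
    """Combine one-sided p-values with Stouffer's unweighted Z method.

    Each p-value must lie strictly between 0 and 1.
    """
    normal = NormalDist()
    z_scores = [normal.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - normal.cdf(combined_z)


# Hypothetical sets: proteins carrying a domain vs proteins with a function.
with_domain = {"P1", "P2", "P3"}
with_function = {"P2", "P3", "P4", "P5"}
print(simpson_index(with_domain, with_function))  # 2 shared / 3 in smaller set

# Two hypothetical domain-level p-values combined into one protein-level value.
print(stouffer([0.05, 0.05]))
```

Two moderately significant p-values combine into a smaller one, which is the behaviour that makes the method suitable for pooling evidence from several domains of the same protein.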

Subjects

Computational Biology; Proteins; Databases, Protein; Gene Ontology; Molecular Sequence Annotation; Proteins/genetics

15.

Tracing Evolution Through Protein Structures: Nature Captured in a Few Thousand Folds.

Bordin, Nicola; Sillitoe, Ian; Lees, Jonathan G; Orengo, Christine.

Front Mol Biosci ; 8: 668184, 2021.

Article in English | MEDLINE | ID: mdl-34041266

ABSTRACT

This article is dedicated to the memory of Cyrus Chothia, who was a leading light in the world of protein structure evolution. His elegant analyses of protein families and their mechanisms of structural and functional evolution provided important evolutionary and biological insights and firmly established the value of structural perspectives. He was a mentor and supervisor to many other leading scientists who continued his quest to characterise structure and function space. He was also a generous and supportive colleague to those applying different approaches. In this article we review some of his accomplishments and the history of protein structure classifications, particularly SCOP and CATH. We also highlight some of the evolutionary insights these two classifications have brought. Finally, we discuss how the expansion and integration of protein sequence data into these structural families helps reveal the dark matter of function space and can inform the emergence of novel functions in Metazoa. Since we cover 25 years of structural classification, it has not been feasible to review all structure based evolutionary studies and hence we focus mainly on those undertaken by the SCOP and CATH groups and their collaborators.

16.

CATH: increased structural coverage of functional space.

Sillitoe, Ian; Bordin, Nicola; Dawson, Natalie; Waman, Vaishali P; Ashford, Paul; Scholes, Harry M; Pang, Camilla S M; Woodridge, Laurel; Rauer, Clemens; Sen, Neeladri; Abbasian, Mahnaz; Le Cornu, Sean; Lam, Su Datt; Berka, Karel; Varekova, Ivana Hutarová; Svobodova, Radka; Lees, Jon; Orengo, Christine A.

Nucleic Acids Res ; 49(D1): D266-D273, 2021 01 08.

Article in English | MEDLINE | ID: mdl-33237325

ABSTRACT

CATH (https://www.cathdb.info) identifies domains in protein structures from wwPDB and classifies these into evolutionary superfamilies, thereby providing structural and functional annotations. There are two levels: CATH-B, a daily snapshot of the latest domain structures and superfamily assignments, and CATH+, with additional derived data, such as predicted sequence domains, and functionally coherent sequence subsets (Functional Families or FunFams). The latest CATH+ release, version 4.3, significantly increases coverage of structural and sequence data, with an addition of 65,351 fully-classified domains structures (+15%), providing 500 238 structural domains, and 151 million predicted sequence domains (+59%) assigned to 5481 superfamilies. The FunFam generation pipeline has been re-engineered to cope with the increased influx of data. Three times more sequences are captured in FunFams, with a concomitant increase in functional purity, information content and structural coverage. FunFam expansion increases the structural annotations provided for experimental GO terms (+59%). We also present CATH-FunVar web-pages displaying variations in protein sequences and their proximity to known or predicted functional sites. We present two case studies (1) putative cancer drivers and (2) SARS-CoV-2 proteins. Finally, we have improved links to and from CATH including SCOP, InterPro, Aquaria and 2DProt.

Subjects

Computational Biology/statistics & numerical data , Databases, Protein/statistics & numerical data , Protein Domains , Proteins/chemistry , Amino Acid Sequence , COVID-19/epidemiology , COVID-19/prevention & control , COVID-19/virology , Computational Biology/methods , Epidemics , Humans , Internet , Molecular Sequence Annotation , Proteins/genetics , Proteins/metabolism , SARS-CoV-2/genetics , SARS-CoV-2/metabolism , SARS-CoV-2/physiology , Sequence Analysis, Protein/methods , Sequence Homology, Amino Acid , Viral Proteins/chemistry , Viral Proteins/genetics , Viral Proteins/metabolism

17.

The InterPro protein families and domains database: 20 years on.

Blum, Matthias; Chang, Hsin-Yu; Chuguransky, Sara; Grego, Tiago; Kandasaamy, Swaathi; Mitchell, Alex; Nuka, Gift; Paysan-Lafosse, Typhaine; Qureshi, Matloob; Raj, Shriya; Richardson, Lorna; Salazar, Gustavo A; Williams, Lowri; Bork, Peer; Bridge, Alan; Gough, Julian; Haft, Daniel H; Letunic, Ivica; Marchler-Bauer, Aron; Mi, Huaiyu; Natale, Darren A; Necci, Marco; Orengo, Christine A; Pandurangan, Arun P; Rivoire, Catherine; Sigrist, Christian J A; Sillitoe, Ian; Thanki, Narmada; Thomas, Paul D; Tosatto, Silvio C E; Wu, Cathy H; Bateman, Alex; Finn, Robert D.

Nucleic Acids Res ; 49(D1): D344-D354, 2021 01 08.

Article in English | MEDLINE | ID: mdl-33156333

ABSTRACT

The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.

Subjects

Databases, Protein , Proteins/chemistry , Amino Acid Sequence , COVID-19/metabolism , Internet , Molecular Sequence Annotation , Protein Domains , Protein Interaction Maps , SARS-CoV-2/metabolism , Sequence Alignment

18.

The Genome3D Consortium for Structural Annotations of Selected Model Organisms.

Waman, Vaishali P; Blundell, Tom L; Buchan, Daniel W A; Gough, Julian; Jones, David; Kelley, Lawrence; Murzin, Alexey; Pandurangan, Arun Prasad; Sillitoe, Ian; Sternberg, Michael; Torres, Pedro; Orengo, Christine.

Methods Mol Biol ; 2165: 27-67, 2020.

Article in English | MEDLINE | ID: mdl-32621218

ABSTRACT

The Genome3D consortium is a collaborative project involving protein structure prediction and annotation resources developed by six world-leading structural bioinformatics groups, based in the United Kingdom (namely Blundell, Murzin, Gough, Sternberg, Orengo, and Jones). The main objective of Genome3D is to serve as a common portal providing both predicted models and annotations of proteins in model organisms, using several resources developed by these labs such as CATH-Gene3D, DOMSERF, pDomTHREADER, PHYRE, SUPERFAMILY, FUGUE/TOCATTA, and VIVACE. These resources primarily use SCOP- and/or CATH-based protein domain assignments. Another objective of Genome3D is to compare structural classifications of protein domains in CATH and SCOP databases and to provide a consensus mapping of CATH and SCOP protein superfamilies. CATH/SCOP mapping analyses led to the identification of a total of 1429 consensus superfamilies. Currently, Genome3D provides structural annotations for ten model organisms, including Homo sapiens, Arabidopsis thaliana, Mus musculus, Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Plasmodium falciparum, Staphylococcus aureus, and Schizosaccharomyces pombe. Thus, Genome3D serves as a common gateway to each structure prediction/annotation resource and allows users to perform comparative assessment of the predictions. It thus helps researchers to broaden their perspective on structure/function predictions for their query protein of interest in selected model organisms.

Subjects

Genomics/organization & administration , Knowledge Bases , Molecular Sequence Annotation/methods , Proteome/chemistry , Animals , Arabidopsis , Genome , Genomics/methods , Humans , Information Dissemination , Sequence Alignment/methods , United Kingdom , Yeasts

19.

VarSite: Disease variants and protein structure.

Laskowski, Roman A; Stephenson, James D; Sillitoe, Ian; Orengo, Christine A; Thornton, Janet M.

Protein Sci ; 29(1): 111-119, 2020 01.

Article in English | MEDLINE | ID: mdl-31606900

ABSTRACT

VarSite is a web server mapping known disease-associated variants from UniProt and ClinVar, together with natural variants from gnomAD, onto protein 3D structures in the Protein Data Bank. The analyses are primarily image-based and provide both an overview for each human protein and a report for any specific variant of interest. The information can be useful in assessing whether a given variant might be pathogenic or benign. The structural annotations for each position in the protein include protein secondary structure, interactions with ligand, metal, DNA/RNA, or other protein, and various measures of a given variant's possible impact on the protein's function. The 3D locations of the disease-associated variants can be viewed interactively via the 3Dmol.js JavaScript viewer, as well as in RasMol and PyMOL. Users can search for specific variants, or sets of variants, by providing the DNA coordinates of the base change(s) of interest. Additionally, various agglomerative analyses are given, such as the mapping of disease and natural variants onto specific Pfam or CATH domains. The server is freely accessible to all at: https://www.ebi.ac.uk/thornton-srv/databases/VarSite.

Subjects

Databases, Genetic , Proteins/chemistry , Proteins/genetics , Cloud Computing , Computational Biology , Genetic Predisposition to Disease , Genetic Variation , Humans , Models, Molecular , Protein Conformation , User-Computer Interface

20.

Genome3D: integrating a collaborative data pipeline to expand the depth and breadth of consensus protein structure annotation.

Sillitoe, Ian; Andreeva, Antonina; Blundell, Tom L; Buchan, Daniel W A; Finn, Robert D; Gough, Julian; Jones, David; Kelley, Lawrence A; Paysan-Lafosse, Typhaine; Lam, Su Datt; Murzin, Alexey G; Pandurangan, Arun Prasad; Salazar, Gustavo A; Skwark, Marcin J; Sternberg, Michael J E; Velankar, Sameer; Orengo, Christine.

Nucleic Acids Res ; 48(D1): D314-D319, 2020 01 08.

Article in English | MEDLINE | ID: mdl-31733063

ABSTRACT

Genome3D (https://www.genome3d.eu) is a freely available resource that provides consensus structural annotations for representative protein sequences taken from a selection of model organisms. Since the last NAR update in 2015, the method of data submission has been overhauled, with annotations now being 'pushed' to the database via an API. As a result, contributing groups are now able to manage their own structural annotations, making the resource more flexible and maintainable. The new submission protocol brings a number of additional benefits, including providing instant validation of data and avoiding the requirement to synchronise releases between resources. It also makes it possible to implement the submission of these structural annotations as an automated part of existing internal workflows. In turn, these improvements facilitate Genome3D being opened up to new prediction algorithms and groups. For the latest release of Genome3D (v2.1), the underlying dataset of sequences used as prediction targets has been updated using the latest reference proteomes available in UniProtKB. A number of new reference proteomes of particular interest to the wider scientific community have also been added: cow, pig, wheat and Mycobacterium tuberculosis. These additions, along with improvements to the underlying predictions from contributing resources, have ensured that the number of annotations in Genome3D has nearly doubled since the last NAR update article. The new API has also been used to facilitate the dissemination of Genome3D data into InterPro, thereby widening the visibility of both the annotation data and annotation algorithms.

Subjects

Proteins/chemistry , Databases, Protein , Proteins/classification , Proteins/genetics , User-Computer Interface
