Search | VHL Regional Portal

1.

Illuminating enzyme design using deep learning.

Dallago, Christian; Yang, Kevin K.

Nat Chem ; 15(6): 749-750, 2023 06.

Article in English | MEDLINE | ID: mdl-37248345

Subject(s)

Deep Learning , Enzymes , Enzymes/chemistry , Protein Conformation

2.

Structural Analysis of Genomic and Proteomic Signatures Reveal Dynamic Expression of Intrinsically Disordered Regions in Breast Cancer and Tissue.

Zatorski, Nicole; Sun, Yifei; Elmas, Abdulkadir; Dallago, Christian; Karl, Timothy; Stein, David; Rost, Burkhard; Huang, Kuan-Lin; Walsh, Martin; Schlessinger, Avner.

bioRxiv ; 2023 Feb 24.

Article in English | MEDLINE | ID: mdl-36865220

ABSTRACT

Structural features of proteins capture underlying information about protein evolution and function, which enhances the analysis of proteomic and transcriptomic data. Here we develop Structural Analysis of Gene and protein Expression Signatures (SAGES), a method that describes expression data using features calculated from sequence-based prediction methods and 3D structural models. We used SAGES, along with machine learning, to characterize tissues from healthy individuals and those with breast cancer. We analyzed gene expression data from 23 breast cancer patients and genetic mutation data from the COSMIC database as well as 17 breast tumor protein expression profiles. We identified prominent expression of intrinsically disordered regions in breast cancer proteins as well as relationships between drug perturbation signatures and breast cancer disease signatures. Our results suggest that SAGES is generally applicable to describe diverse biological phenomena including disease states and drug effects.

3.

LambdaPP: Fast and accessible protein-specific phenotype predictions.

Olenyi, Tobias; Marquet, Céline; Heinzinger, Michael; Kröger, Benjamin; Nikolova, Tiha; Bernhofer, Michael; Sändig, Philip; Schütze, Konstantin; Littmann, Maria; Mirdita, Milot; Steinegger, Martin; Dallago, Christian; Rost, Burkhard.

Protein Sci ; 32(1): e4524, 2023 01.

Article in English | MEDLINE | ID: mdl-36454227

ABSTRACT

The availability of accurate and fast artificial intelligence (AI) solutions predicting aspects of proteins is revolutionizing experimental and computational molecular biology. The webserver LambdaPP aspires to supersede PredictProtein, the first internet server making AI protein predictions available in 1992. Given a protein sequence as input, LambdaPP provides easily accessible visualizations of protein 3D structure, along with predictions at the protein level (GeneOntology, subcellular location), and the residue level (binding to metal ions, small molecules, and nucleotides; conservation; intrinsic disorder; secondary structure; alpha-helical and beta-barrel transmembrane segments; signal-peptides; variant effect) in seconds. The structure prediction provided by LambdaPP (leveraging ColabFold and computed in minutes) is based on MMseqs2 multiple sequence alignments. All other feature prediction methods are based on the pLM ProtT5. Queried by a protein sequence, LambdaPP computes protein and residue predictions almost instantly for various phenotypes, including 3D structure and aspects of protein function. LambdaPP is freely available for everyone to use at embed.predictprotein.org; the interactive results for the case study can be found at https://embed.predictprotein.org/o/Q9NZC2. The frontend of LambdaPP can be found on GitHub (github.com/sacdallago/embed.predictprotein.org), and can be freely used and distributed under the academic free use license (AFL-2). For high-throughput applications, all methods can be executed locally via the bio-embeddings (bioembeddings.com) python package, or docker image at ghcr.io/bioembeddings/bio_embeddings, which also includes the backend of LambdaPP.

Subject(s)

Artificial Intelligence , Proteins , Proteins/chemistry , Amino Acid Sequence , Protein Structure, Secondary , Sequence Alignment , Software

4.

Novel machine learning approaches revolutionize protein knowledge.

Bordin, Nicola; Dallago, Christian; Heinzinger, Michael; Kim, Stephanie; Littmann, Maria; Rauer, Clemens; Steinegger, Martin; Rost, Burkhard; Orengo, Christine.

Trends Biochem Sci ; 48(4): 345-359, 2023 04.

Article in English | MEDLINE | ID: mdl-36504138

ABSTRACT

Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.

Subject(s)

Machine Learning , Proteins , Proteins/chemistry , Computational Biology/methods , Protein Conformation

5.

From sequence to function through structure: Deep learning for protein design.

Ferruz, Noelia; Heinzinger, Michael; Akdel, Mehmet; Goncearenco, Alexander; Naef, Luca; Dallago, Christian.

Comput Struct Biotechnol J ; 21: 238-250, 2023.

Article in English | MEDLINE | ID: mdl-36544476

ABSTRACT

The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields, to producing plausible, complex sequences fast via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision techniques, coupled with advances in computing hardware to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can be used to provide novel insights into mechanistic biology and the design of biomolecules. However, navigating and understanding the practical applications for the many recent protein design tools is complex. To facilitate this, we 1) document recent advances in deep learning (DL) assisted protein design from the last three years, 2) present a practical pipeline that goes from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and 3) leverage it to suggest a generated protein sequence which might be used to engineer a biosynthetic gene cluster to produce a molecular glue-like compound. Lastly, we discuss challenges and highlight opportunities for the protein design field.

6.

GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.

Zvyagin, Maxim; Brace, Alexander; Hippe, Kyle; Deng, Yuntian; Zhang, Bin; Bohorquez, Cindy Orozco; Clyde, Austin; Kale, Bharat; Perez-Rivera, Danilo; Ma, Heng; Mann, Carla M; Irvin, Michael; Pauloski, J Gregory; Ward, Logan; Hayot-Sasson, Valerie; Emani, Murali; Foreman, Sam; Xie, Zhen; Lin, Diangen; Shukla, Maulik; Nie, Weili; Romero, Josh; Dallago, Christian; Vahdat, Arash; Xiao, Chaowei; Gibbs, Thomas; Foster, Ian; Davis, James J; Papka, Michael E; Brettin, Thomas; Stevens, Rick; Anandkumar, Anima; Vishwanath, Venkatram; Ramanathan, Arvind.

bioRxiv ; 2022 Nov 23.

Article in English | MEDLINE | ID: mdl-36451881

ABSTRACT

We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.

7.

A roadmap for the functional annotation of protein families: a community perspective.

de Crécy-Lagard, Valérie; Amorin de Hegedus, Rocio; Arighi, Cecilia; Babor, Jill; Bateman, Alex; Blaby, Ian; Blaby-Haas, Crysten; Bridge, Alan J; Burley, Stephen K; Cleveland, Stacey; Colwell, Lucy J; Conesa, Ana; Dallago, Christian; Danchin, Antoine; de Waard, Anita; Deutschbauer, Adam; Dias, Raquel; Ding, Yousong; Fang, Gang; Friedberg, Iddo; Gerlt, John; Goldford, Joshua; Gorelik, Mark; Gyori, Benjamin M; Henry, Christopher; Hutinet, Geoffrey; Jaroch, Marshall; Karp, Peter D; Kondratova, Liudmyla; Lu, Zhiyong; Marchler-Bauer, Aron; Martin, Maria-Jesus; McWhite, Claire; Moghe, Gaurav D; Monaghan, Paul; Morgat, Anne; Mungall, Christopher J; Natale, Darren A; Nelson, William C; O'Donoghue, Seán; Orengo, Christine; O'Toole, Katherine H; Radivojac, Predrag; Reed, Colbie; Roberts, Richard J; Rodionov, Dmitri; Rodionova, Irina A; Rudolf, Jeffrey D; Saleh, Lana; Sheynkman, Gloria.

Database (Oxford) ; 2022, 2022 08 12.

Article in English | MEDLINE | ID: mdl-35961013

ABSTRACT

Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.

Subject(s)

Genomics , Proteins , Base Sequence , Computational Biology , Genome , Molecular Sequence Annotation

8.

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning.

Elnaggar, Ahmed; Heinzinger, Michael; Dallago, Christian; Rehawi, Ghalia; Wang, Yu; Jones, Llion; Gibbs, Tom; Feher, Tamas; Angerer, Christoph; Steinegger, Martin; Bhowmik, Debsindhu; Rost, Burkhard.

IEEE Trans Pattern Anal Mach Intell ; 44(10): 7112-7127, 2022 10.

Article in English | MEDLINE | ID: mdl-34232869

ABSTRACT

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy: Q10=81%) and membrane versus water-soluble (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.
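The per-protein "pooling" step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes simple mean pooling over per-residue vectors, which the abstract does not specify; the function name `mean_pool` and the toy numbers are hypothetical and are not taken from the ProtTrans code.

```python
from typing import List


def mean_pool(residue_embeddings: List[List[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-length per-protein vector.

    Assumption: plain arithmetic mean; the abstract only says "pooling".
    """
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    if any(len(vec) != dim for vec in residue_embeddings):
        raise ValueError("all residue embeddings must share one dimension")
    n = len(residue_embeddings)
    # Average each embedding dimension across all residues.
    return [sum(vec[i] for vec in residue_embeddings) / n for i in range(dim)]


# Toy protein with 3 residues and 2-dimensional embeddings (made-up numbers).
toy = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
pooled = mean_pool(toy)
```

Whatever the sequence length, the pooled vector has the embedding dimension, so it can be fed to a fixed-size classifier for per-protein tasks such as subcellular location.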

Subject(s)

Algorithms , Natural Language Processing , Computational Biology/methods , Proteins/chemistry , Supervised Machine Learning

9.

ProteomicsDB: toward a FAIR open-source resource for life-science research.

Lautenbacher, Ludwig; Samaras, Patroklos; Muller, Julian; Grafberger, Andreas; Shraideh, Marwin; Rank, Johannes; Fuchs, Simon T; Schmidt, Tobias K; The, Matthew; Dallago, Christian; Wittges, Holger; Rost, Burkhard; Krcmar, Helmut; Kuster, Bernhard; Wilhelm, Mathias.

Nucleic Acids Res ; 50(D1): D1541-D1552, 2022 01 07.

Article in English | MEDLINE | ID: mdl-34791421

ABSTRACT

ProteomicsDB (https://www.ProteomicsDB.org) is a multi-omics and multi-organism resource for life science research. In this update, we present our efforts to continuously develop and expand ProteomicsDB. The major focus over the last two years was improving the findability, accessibility, interoperability and reusability (FAIR) of the data as well as its implementation. For this purpose, we release a new application programming interface (API) that provides systematic access to essentially all data in ProteomicsDB. Second, we release a new open-source user interface (UI) and show the advantages the scientific community gains from such software. With the new interface, two new visualizations of protein primary, secondary and tertiary structure as well as an updated spectrum viewer were added. Furthermore, we integrated ProteomicsDB with our deep-neural-network Prosit that can predict the fragmentation characteristics and retention time of peptides. The result is an automatic processing pipeline that can be used to reevaluate database search engine results stored in ProteomicsDB. In addition, we extended the data content with experiments investigating different human biology as well as a newly supported organism.

Subject(s)

Databases, Protein , Proteins/classification , Proteomics/classification , Software , Biological Science Disciplines , Humans , Neural Networks, Computer , Proteins/chemistry

10.

Embeddings from protein language models predict conservation and variant effects.

Marquet, Céline; Heinzinger, Michael; Olenyi, Tobias; Dallago, Christian; Erckert, Kyra; Bernhofer, Michael; Nechaev, Dmitrii; Rost, Burkhard.

Hum Genet ; 141(10): 1629-1647, 2022 Oct.

Article in English | MEDLINE | ID: mdl-34967936

ABSTRACT

The emergence of SARS-CoV-2 variants stressed the demand for tools to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient (MCC) for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.

Subject(s)

COVID-19 , SARS-CoV-2 , Algorithms , Amino Acids , COVID-19/genetics , Humans , Language , Proteome , SARS-CoV-2/genetics

11.

Protein embeddings and deep learning predict binding residues for various ligand classes.

Littmann, Maria; Heinzinger, Michael; Dallago, Christian; Weissenow, Konstantin; Rost, Burkhard.

Sci Rep ; 11(1): 23916, 2021 12 13.

Article in English | MEDLINE | ID: mdl-34903827

ABSTRACT

One important aspect of protein function is the binding of proteins to ligands, including small molecules, metal ions, and macromolecules such as DNA or RNA. Despite decades of experimental progress many binding sites remain obscure. Here, we proposed bindEmbed21, a method predicting whether a protein residue binds to metal ions, nucleic acids, or small molecules. The Artificial Intelligence (AI)-based method exclusively uses embeddings from the Transformer-based protein Language Model (pLM) ProtT5 as input. Using only single sequences without creating multiple sequence alignments (MSAs), bindEmbed21DL outperformed MSA-based predictions. Combination with homology-based inference increased performance to F1 = 48 ± 3% (95% CI) and MCC = 0.46 ± 0.04 when merging all three ligand classes into one. All results were confirmed by three independent data sets. Focusing on very reliably predicted residues could complement experimental evidence: For the 25% most strongly predicted binding residues, at least 73% were correctly predicted even when ignoring the problem of missing experimental annotations. The new method bindEmbed21 is fast, simple, and broadly applicable, using neither structure nor MSAs. Thereby, it found binding residues in over 42% of all human proteins not otherwise implied in binding and predicted about 6% of all residues as binding to metal ions, nucleic acids, or small molecules.
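The two performance measures quoted in the abstract above, F1 and the Matthews Correlation Coefficient (MCC), are both computed from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name `f1_and_mcc` and the counts are illustrative assumptions and do not reproduce the paper's data.

```python
import math
from typing import Tuple


def f1_and_mcc(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (F1, MCC) for a binary binding / non-binding confusion matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # MCC also accounts for true negatives, unlike F1.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Made-up counts for illustration only.
f1, mcc = f1_and_mcc(tp=40, fp=10, tn=40, fn=10)
```

Because binding residues are a small minority of all residues, reporting MCC alongside F1 guards against a score that looks good only because one class dominates.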

Subject(s)

Deep Learning , Molecular Docking Simulation/methods , Sequence Analysis, Protein/methods , Binding Sites , Ligands , Metals/chemistry , Nucleic Acids/chemistry , Protein Binding , Protein Conformation , Software

12.

Author-sourced capture of pathway knowledge in computable form using Biofactoid.

Wong, Jeffrey V; Franz, Max; Siper, Metin Can; Fong, Dylan; Durupinar, Funda; Dallago, Christian; Luna, Augustin; Giorgi, John; Rodchenkov, Igor; Babur, Özgün; Bachman, John A; Gyori, Benjamin M; Demir, Emek; Bader, Gary D; Sander, Chris.

Elife ; 10, 2021 12 03.

Article in English | MEDLINE | ID: mdl-34860157

ABSTRACT

Making the knowledge contained in scientific papers machine-readable and formally computable would allow researchers to take full advantage of this information by enabling integration with other knowledge sources to support data analysis and interpretation. Here we describe Biofactoid, a web-based platform that allows scientists to specify networks of interactions between genes, their products, and chemical compounds, and then translates this information into a representation suitable for computational analysis, search and discovery. We also report the results of a pilot study to encourage the wide adoption of Biofactoid by the scientific community.

Subject(s)

Computational Biology/methods , Genomics/methods , Computational Biology/instrumentation , Databases, Factual , Genomics/instrumentation , Pilot Projects

13.

Protein matchmaking through representation learning.

Heinzinger, Michael; Dallago, Christian; Rost, Burkhard.

Cell Syst ; 12(10): 948-950, 2021 10 20.

Article in English | MEDLINE | ID: mdl-34672956

ABSTRACT

Sledzieski, Singh, Cowen, and Berger employ representation learning to predict protein interactions and associations, additionally identifying binding residues between protein pairs. Generalizability is showcased by training on one organism while evaluating on others. The work exemplifies how transfer of AI-learned representations can advance knowledge in molecular biology.

Subject(s)

Knowledge , Machine Learning

14.

SARS-CoV-2 structural coverage map reveals viral protein assembly, mimicry, and hijacking mechanisms.

O'Donoghue, Seán I; Schafferhans, Andrea; Sikta, Neblina; Stolte, Christian; Kaur, Sandeep; Ho, Bosco K; Anderson, Stuart; Procter, James B; Dallago, Christian; Bordin, Nicola; Adcock, Matt; Rost, Burkhard.

Mol Syst Biol ; 17(9): e10079, 2021 09.

Article in English | MEDLINE | ID: mdl-34519429

ABSTRACT

We modeled 3D structures of all SARS-CoV-2 proteins, generating 2,060 models that span 69% of the viral proteome and provide details not available elsewhere. We found that ~6% of the proteome mimicked human proteins, while ~7% was implicated in hijacking mechanisms that reverse post-translational modifications, block host translation, and disable host defenses; a further ~29% self-assembled into heteromeric states that provided insight into how the viral replication and translation complex forms. To make these 3D models more accessible, we devised a structural coverage map, a novel visualization method to show what is (and is not) known about the 3D structure of the viral proteome. We integrated the coverage map into an accompanying online resource (https://aquaria.ws/covid) that can be used to find and explore models corresponding to the 79 structural states identified in this work. The resulting Aquaria-COVID resource helps scientists use emerging structural data to understand the mechanisms underlying coronavirus infection and draws attention to the 31% of the viral proteome that remains structurally unknown or dark.

Subject(s)

Angiotensin-Converting Enzyme 2/metabolism , Host-Pathogen Interactions/genetics , Protein Processing, Post-Translational , SARS-CoV-2/metabolism , Spike Glycoprotein, Coronavirus/metabolism , Amino Acid Transport Systems, Neutral/chemistry , Amino Acid Transport Systems, Neutral/genetics , Amino Acid Transport Systems, Neutral/metabolism , Angiotensin-Converting Enzyme 2/chemistry , Angiotensin-Converting Enzyme 2/genetics , Binding Sites , COVID-19/genetics , COVID-19/metabolism , COVID-19/virology , Computational Biology/methods , Coronavirus Envelope Proteins/chemistry , Coronavirus Envelope Proteins/genetics , Coronavirus Envelope Proteins/metabolism , Coronavirus Nucleocapsid Proteins/chemistry , Coronavirus Nucleocapsid Proteins/genetics , Coronavirus Nucleocapsid Proteins/metabolism , Humans , Mitochondrial Membrane Transport Proteins/chemistry , Mitochondrial Membrane Transport Proteins/genetics , Mitochondrial Membrane Transport Proteins/metabolism , Mitochondrial Precursor Protein Import Complex Proteins , Models, Molecular , Molecular Mimicry , Neuropilin-1/chemistry , Neuropilin-1/genetics , Neuropilin-1/metabolism , Phosphoproteins/chemistry , Phosphoproteins/genetics , Phosphoproteins/metabolism , Protein Binding , Protein Conformation, alpha-Helical , Protein Conformation, beta-Strand , Protein Interaction Domains and Motifs , Protein Interaction Mapping/methods , Protein Multimerization , SARS-CoV-2/chemistry , SARS-CoV-2/genetics , Spike Glycoprotein, Coronavirus/chemistry , Spike Glycoprotein, Coronavirus/genetics , Viral Matrix Proteins/chemistry , Viral Matrix Proteins/genetics , Viral Matrix Proteins/metabolism , Viroporin Proteins/chemistry , Viroporin Proteins/genetics , Viroporin Proteins/metabolism , Virus Replication

15.

Clustering FunFams using sequence embeddings improves EC purity.

Littmann, Maria; Bordin, Nicola; Heinzinger, Michael; Schütze, Konstantin; Dallago, Christian; Orengo, Christine; Rost, Burkhard.

Bioinformatics ; 37(20): 3449-3455, 2021 Oct 25.

Article in English | MEDLINE | ID: mdl-33978744

ABSTRACT

MOTIVATION: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be 'pure', i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. 11% of all FunFams (22 830 of 203 639) contain EC annotations and of those, 7% (1526 of 22 830) have inconsistent functional annotations.

RESULTS: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help generate FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.

AVAILABILITY AND IMPLEMENTATION: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
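The clustering idea in the abstract above (density-based clustering of embedding distances, with unreachable points flagged as outliers) can be sketched with a small, self-contained DBSCAN. This is an illustrative re-implementation, not the authors' code: the Euclidean metric, the `eps` and `min_samples` values, and the 2-dimensional toy points standing in for embeddings are all assumptions.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label used for outliers


def dbscan(points: Sequence[Sequence[float]], eps: float, min_samples: int) -> List[int]:
    """Return one cluster label per point; NOISE (-1) marks outliers."""
    n = len(points)
    # Precompute eps-neighbourhoods; each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return [label if label is not None else NOISE for label in labels]


# Toy stand-ins for embeddings: two tight groups and one isolated point.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(toy_points, eps=1.5, min_samples=3)
```

In this toy run the isolated point receives the noise label, which mirrors how outliers within a family would be separated from the denser, functionally more consistent clusters.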

16.

Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets.

Dallago, Christian; Schütze, Konstantin; Heinzinger, Michael; Olenyi, Tobias; Littmann, Maria; Lu, Amy X; Yang, Kevin K; Min, Seonwoo; Yoon, Sungroh; Morton, James T; Rost, Burkhard.

Curr Protoc ; 1(5): e113, 2021 May.

Article in English | MEDLINE | ID: mdl-33961736

ABSTRACT

Models from machine learning (ML) or artificial intelligence (AI) increasingly assist in guiding experimental design and decision making in molecular biology and medicine. Recently, Language Models (LMs) have been adapted from Natural Language Processing (NLP) to encode the implicit language written in protein sequences. Protein LMs show enormous potential in generating descriptive representations (embeddings) for proteins from just their sequences, in a fraction of the time with respect to previous approaches, yet with comparable or improved predictive ability. Researchers have trained a variety of protein LMs that are likely to illuminate different angles of the protein language. By leveraging the bio_embeddings pipeline and modules, simple and reproducible workflows can be laid out to generate protein embeddings and rich visualizations. Embeddings can then be leveraged as input features through machine learning libraries to develop methods predicting particular aspects of protein function and structure. Beyond the workflows included here, embeddings have been leveraged as proxies to traditional homology-based inference and even to align similar protein sequences. A wealth of possibilities remain for researchers to harness through the tools provided in the following protocols. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC. 
The following protocols are included in this manuscript:
Basic Protocol 1: Generic use of the bio_embeddings pipeline to plot protein sequences and annotations
Basic Protocol 2: Generate embeddings from protein sequences using the bio_embeddings pipeline
Basic Protocol 3: Overlay sequence annotations onto a protein space visualization
Basic Protocol 4: Train a machine learning classifier on protein embeddings
Alternate Protocol 1: Generate 3D instead of 2D visualizations
Alternate Protocol 2: Visualize protein solubility instead of protein subcellular localization
Support Protocol: Join embedding generation and sequence space visualization in a pipeline.

Subject(s)

Artificial Intelligence , Deep Learning , Machine Learning , Natural Language Processing , Proteins

17.

PredictProtein - Predicting Protein Structure and Function for 29 Years.

Bernhofer, Michael; Dallago, Christian; Karl, Tim; Satagopam, Venkata; Heinzinger, Michael; Littmann, Maria; Olenyi, Tobias; Qiu, Jiajun; Schütze, Konstantin; Yachdav, Guy; Ashkenazy, Haim; Ben-Tal, Nir; Bromberg, Yana; Goldberg, Tatyana; Kajan, Laszlo; O'Donoghue, Sean; Sander, Chris; Schafferhans, Andrea; Schlessinger, Avner; Vriend, Gerrit; Mirdita, Milot; Gawron, Piotr; Gu, Wei; Jarosz, Yohan; Trefois, Christophe; Steinegger, Martin; Schneider, Reinhard; Rost, Burkhard.

Nucleic Acids Res ; 49(W1): W535-W540, 2021 07 02.

Article in English | MEDLINE | ID: mdl-33999203

ABSTRACT

Since 1992, PredictProtein (https://predictprotein.org) has been a one-stop online resource for protein sequence analysis; its main site is hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and was queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA binding). PredictProtein's infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering performance of prediction methods); user interface elements improved usability, and new prediction methods were added. PredictProtein recently included predictions from deep learning embeddings (GO and secondary structure) and a method for the prediction of proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.

Subject(s)

Protein Conformation , Software , Binding Sites , Coronavirus Nucleocapsid Proteins/chemistry , DNA-Binding Proteins/chemistry , Phosphoproteins/chemistry , Protein Structure, Secondary , Proteins/chemistry , Proteins/physiology , RNA-Binding Proteins/chemistry , Sequence Alignment , Sequence Analysis, Protein

18.

Embeddings from deep learning transfer GO annotations beyond homology.

Littmann, Maria; Heinzinger, Michael; Dallago, Christian; Olenyi, Tobias; Rost, Burkhard.

Sci Rep ; 11(1): 1160, 2021 01 13.

Article in English | MEDLINE | ID: mdl-33441905

ABSTRACT

Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer by identifying sequence-similar proteins with known function or through prediction methods using evolutionary information. Here, we propose predicting GO terms through annotation transfer based on proximity of proteins in the SeqVec embedding rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec) transferring the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the annotation transfer to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for proteins from smaller families or proteins with intrinsically disordered regions.

Subject(s)

Computational Biology/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Amino Acids/chemistry , Deep Learning , Gene Ontology , Humans , Machine Learning , Molecular Sequence Annotation/methods , Proteins/chemistry , Sequence Homology, Amino Acid , Software

19.

Light attention predicts protein location from the language of life.

Stärk, Hannes; Dallago, Christian; Heinzinger, Michael; Rost, Burkhard.

Bioinform Adv ; 1(1): vbab035, 2021.

Article in English | MEDLINE | ID: mdl-36700108

ABSTRACT

Summary: Although knowing where a protein functions in a cell is important to characterize biological processes, this information remains unavailable for most known proteins. Machine learning narrows the gap through predictions from expert-designed input features leveraging information from multiple sequence alignments (MSAs) that is resource expensive to generate. Here, we showcased using embeddings from protein language models for competitive localization prediction without MSAs. Our lightweight deep neural network architecture used a softmax weighted aggregation mechanism with linear complexity in sequence length referred to as light attention. The method significantly outperformed the state-of-the-art (SOTA) for 10 localization classes by about 8 percentage points (Q10). So far, this might be the highest improvement of just embeddings over MSAs. Our new test set highlighted the limits of standard static datasets: while inviting new models, they might not suffice to claim improvements over the SOTA. Availability and implementation: The novel models are available as a web-service at http://embed.protein.properties. Code needed to reproduce results is provided at https://github.com/HannesStark/protein-localization. Predictions for the human proteome are available at https://zenodo.org/record/5047020. Supplementary information: Supplementary data are available at Bioinformatics Advances online.

20.

AlignmentViewer: Sequence Analysis of Large Protein Families.

Reguant, Roc; Antipin, Yevgeniy; Sheridan, Rob; Dallago, Christian; Diamantoukos, Drew; Luna, Augustin; Sander, Chris; Gauthier, Nicholas Paul.

F1000Res ; 9, 2020.

Article in English | MEDLINE | ID: mdl-33123346

ABSTRACT

AlignmentViewer is a web-based tool to view and analyze multiple sequence alignments of protein families. The particular strengths of AlignmentViewer include flexible visualization at different scales as well as analysis of conservation patterns and of the distribution of proteins in sequence space. The tool is directly accessible in web browsers without the need for software installation. It can handle protein families with tens of thousands of sequences and is particularly suitable for evolutionary coupling analysis, e.g. via EVcouplings.org.

Subject(s)

Proteins , Sequence Alignment , Software , Humans , Proteins/genetics , Sequence Analysis, Protein , Web Browser
