Search | VHL Regional Portal

1.

Designing proteins with language models.

Ruffolo, Jeffrey A; Madani, Ali.

Nat Biotechnol ; 42(2): 200-202, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38361067

Subject(s)

Protein Engineering

2.

Flexible protein-protein docking with a multitrack iterative transformer.

Chu, Lee-Shin; Ruffolo, Jeffrey A; Harmalkar, Ameya; Gray, Jeffrey J.

Protein Sci ; 33(2): e4862, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38148272

ABSTRACT

Conventional protein-protein docking algorithms usually rely on heavy candidate sampling and reranking, but these steps are time-consuming and hinder applications that require high-throughput complex structure prediction, for example, structure-based virtual screening. Existing deep learning methods for protein-protein docking, despite being much faster, suffer from low docking success rates. In addition, they simplify the problem to assume no conformational changes within any protein upon binding (rigid docking). This assumption precludes applications when binding-induced conformational changes play a role, such as allosteric inhibition or docking from uncertain unbound model structures. To address these limitations, we present GeoDock, a multitrack iterative transformer network to predict a docked structure from separate docking partners. Unlike deep learning models for protein structure prediction that input multiple sequence alignments, GeoDock inputs just the sequences and structures of the docking partners, which suits the tasks when the individual structures are given. GeoDock is flexible at the protein residue level, allowing the prediction of conformational changes upon binding. On the Database of Interacting Protein Structures (DIPS) test set, GeoDock achieves a 43% top-1 success rate, outperforming all other tested methods. However, in the standard DIPS train/test splits, we discovered contamination of close homologs in the training set. After decontaminating the training set, the success rate is 31%. On the DB5.5 test set and a benchmark dataset of antibody-antigen complexes, GeoDock outperforms the deep learning models trained using the same dataset but falls behind most of the conventional methods and AlphaFold-Multimer. GeoDock attains an average inference speed of under 1 s on a single GPU, enabling its application in large-scale structure screening. Although binding-induced conformational changes are still a challenge owing to limited training and evaluation data, our architecture sets up the foundation to capture this backbone flexibility. Code and a demonstration Jupyter notebook are available at https://github.com/Graylab/GeoDock.

Subject(s)

Algorithms , Proteins , Salicylates , Protein Conformation , Protein Binding , Proteins/chemistry , Molecular Docking Simulation

3.

IgLM: Infilling language modeling for antibody sequence design.

Shuai, Richard W; Ruffolo, Jeffrey A; Gray, Jeffrey J.

Cell Syst ; 14(11): 979-989.e4, 2023 Nov 15.

Article in English | MEDLINE | ID: mdl-37909045

ABSTRACT

Discovery and optimization of monoclonal antibodies for therapeutic applications relies on large sequence libraries but is hindered by developability issues such as low solubility, high aggregation, and high immunogenicity. Generative language models, trained on millions of protein sequences, are a powerful tool for the on-demand generation of realistic, diverse sequences. We present the Immunoglobulin Language Model (IgLM), a deep generative language model for creating synthetic antibody libraries. Compared with prior methods that leverage unidirectional context for sequence generation, IgLM formulates antibody design based on text-infilling in natural language, allowing it to re-design variable-length spans within antibody sequences using bidirectional context. We trained IgLM on 558 million (M) antibody heavy- and light-chain variable sequences, conditioning on each sequence's chain type and species of origin. We demonstrate that IgLM can generate full-length antibody sequences from a variety of species and its infilling formulation allows it to generate infilled complementarity-determining region (CDR) loop libraries with improved in silico developability profiles. A record of this paper's transparent peer review process is included in the supplemental information.
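The infilling formulation described in this abstract can be illustrated with a minimal sketch. It assumes a simple rearrangement in which the span to be redesigned is replaced by a mask token and appended after a separator, so that a left-to-right model sees context on both sides of the gap before generating the span. The function name and the token strings `[MASK]`, `[SEP]`, and `[ANS]` are placeholders chosen for illustration and are not taken from the record above; this is not the IgLM code or API.

```python
def make_infilling_example(sequence, start, end,
                           mask="[MASK]", sep="[SEP]", stop="[ANS]"):
    """Rearrange a sequence so a left-to-right model can infill a span.

    The span sequence[start:end] is replaced by a mask token, and the
    removed residues are appended after a separator. All token strings
    are hypothetical placeholders.
    """
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("span must lie inside the sequence")
    span = sequence[start:end]
    context = sequence[:start] + mask + sequence[end:]
    return context + sep + span + stop


# Toy nine-residue fragment; residues 3..5 ("LVE") are the span to redesign.
example = make_infilling_example("EVQLVESGG", 3, 6)
print(example)  # EVQ[MASK]SGG[SEP]LVE[ANS]
```

At generation time, a model trained on such examples would be given everything up to and including the separator and asked to continue until the stop token, which is what allows spans of variable length.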

Subject(s)

Complementarity Determining Regions , Peptide Library , Amino Acid Sequence , Complementarity Determining Regions/genetics , Antibodies, Monoclonal

4.

ProGen2: Exploring the boundaries of protein language models.

Nijkamp, Erik; Ruffolo, Jeffrey A; Weinstein, Eli N; Naik, Nikhil; Madani, Ali.

Cell Syst ; 14(11): 968-978.e3, 2023 Nov 15.

Article in English | MEDLINE | ID: mdl-37909046

ABSTRACT

Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.

Subject(s)

Artificial Intelligence , Proteins , Proteins/genetics , Amino Acid Sequence , Language , Databases, Factual

5.

Flexible Protein-Protein Docking with a Multi-Track Iterative Transformer.

Chu, Lee-Shin; Ruffolo, Jeffrey A; Harmalkar, Ameya; Gray, Jeffrey J.

bioRxiv ; 2023 Jul 01.

Article in English | MEDLINE | ID: mdl-37425754

ABSTRACT

Conventional protein-protein docking algorithms usually rely on heavy candidate sampling and re-ranking, but these steps are time-consuming and hinder applications that require high-throughput complex structure prediction, e.g., structure-based virtual screening. Existing deep learning methods for protein-protein docking, despite being much faster, suffer from low docking success rates. In addition, they simplify the problem to assume no conformational changes within any protein upon binding (rigid docking). This assumption precludes applications when binding-induced conformational changes play a role, such as allosteric inhibition or docking from uncertain unbound model structures. To address these limitations, we present GeoDock, a multi-track iterative transformer network to predict a docked structure from separate docking partners. Unlike deep learning models for protein structure prediction that input multiple sequence alignments (MSAs), GeoDock inputs just the sequences and structures of the docking partners, which suits the tasks when the individual structures are given. GeoDock is flexible at the protein residue level, allowing the prediction of conformational changes upon binding. For a benchmark set of rigid targets, GeoDock obtains a 41% success rate, outperforming all the other tested methods. For a more challenging benchmark set of flexible targets, GeoDock achieves a similar number of top-model successes as the traditional method ClusPro [1], but fewer than ReplicaDock2 [2]. GeoDock attains an average inference speed of under one second on a single GPU, enabling its application in large-scale structure screening. Although binding-induced conformational changes are still a challenge owing to limited training and evaluation data, our architecture sets up the foundation to capture this backbone flexibility. Code and a demonstration Jupyter notebook are available at https://github.com/Graylab/GeoDock.

6.

Contextual protein and antibody encodings from equivariant graph transformers.

Mahajan, Sai Pooja; Ruffolo, Jeffrey A; Gray, Jeffrey J.

bioRxiv ; 2023 Jul 29.

Article in English | MEDLINE | ID: mdl-37503113

ABSTRACT

The optimal residue identity at each position in a protein is determined by its structural, evolutionary, and functional context. We seek to learn the representation space of the optimal amino-acid residue in different structural contexts in proteins. Inspired by masked language modeling (MLM), our training aims to transduce learning of amino-acid labels from non-masked residues to masked residues in their structural environments and from general (e.g., a residue in a protein) to specific contexts (e.g., a residue at the interface of a protein or antibody complex). Our results on native sequence recovery and forward folding with AlphaFold2 suggest that the amino acid label for a protein residue may be determined from its structural context alone (i.e., without knowledge of the sequence labels of surrounding residues). We further find that the sequence space sampled from our masked models recapitulates the evolutionary sequence neighborhood of the wildtype sequence. Remarkably, the sequences conditioned on highly plastic structures recapitulate the conformational flexibility encoded in the structures. Furthermore, maximum-likelihood interfaces designed with masked models recapitulate wildtype binding energies for a wide range of protein interfaces and binding strengths. We also propose and compare fine-tuning strategies to train models for designing CDR loops of antibodies in the structural context of the antibody-antigen interface by leveraging structural databases for proteins, antibodies (synthetic and experimental) and protein-protein complexes. We show that pretraining on more general contexts improves native sequence recovery for antibody CDR loops, especially for the hypervariable CDR H3, while fine-tuning helps to preserve patterns observed in special contexts.

7.

Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies.

Ruffolo, Jeffrey A; Chu, Lee-Shin; Mahajan, Sai Pooja; Gray, Jeffrey J.

Nat Commun ; 14(1): 2389, 2023 Apr 25.

Article in English | MEDLINE | ID: mdl-37185622

ABSTRACT

Antibodies have the capacity to bind a diverse set of antigens, and they have become critical therapeutics and diagnostic molecules. The binding of antibodies is facilitated by a set of six hypervariable loops that are diversified through genetic recombination and mutation. Even with recent advances, accurate structural prediction of these loops remains a challenge. Here, we present IgFold, a fast deep learning method for antibody structure prediction. IgFold consists of a pre-trained language model trained on 558 million natural antibody sequences followed by graph networks that directly predict backbone atom coordinates. IgFold predicts structures of similar or better quality than alternative methods (including AlphaFold) in significantly less time (under 25 s). Accurate structure prediction on this timescale makes possible avenues of investigation that were previously infeasible. As a demonstration of IgFold's capabilities, we predicted structures for 1.4 million paired antibody sequences, providing structural insights to 500-fold more antibodies than have experimentally determined structures.

Subject(s)

Deep Learning , Protein Conformation , Antibodies/chemistry , Complementarity Determining Regions/chemistry , Antigens

8.

Hallucinating structure-conditioned antibody libraries for target-specific binders.

Mahajan, Sai Pooja; Ruffolo, Jeffrey A; Frick, Rahel; Gray, Jeffrey J.

Front Immunol ; 13: 999034, 2022.

Article in English | MEDLINE | ID: mdl-36341416

ABSTRACT

Antibodies are widely developed and used as therapeutics to treat cancer, infectious disease, and inflammation. During development, initial leads routinely undergo additional engineering to increase their target affinity. Experimental methods for affinity maturation are expensive, laborious, and time-consuming and rarely allow the efficient exploration of the relevant design space. Deep learning (DL) models are transforming the field of protein engineering and design. While several DL-based protein design methods have shown promise, the antibody design problem is distinct, and specialized models for antibody design are desirable. Inspired by hallucination frameworks that leverage accurate structure prediction DL models, we propose the FvHallucinator for designing antibody sequences, especially the CDR loops, conditioned on an antibody structure. Such a strategy generates targeted CDR libraries that retain the conformation of the binder and thereby the mode of binding to the epitope on the antigen. On a benchmark set of 60 antibodies, FvHallucinator generates sequences resembling natural CDRs and recapitulates perplexity of canonical CDR clusters. Furthermore, the FvHallucinator designs amino acid substitutions at the VH-VL interface that are enriched in human antibody repertoires and therapeutic antibodies. We propose a pipeline that screens FvHallucinator designs to obtain a library enriched in binders for an antigen of interest. We apply this pipeline to the CDR H3 of the Trastuzumab-HER2 complex to generate in silico designs predicted to improve upon the binding affinity and interfacial properties of the original antibody. Thus, the FvHallucinator pipeline enables generation of inexpensive, diverse, and targeted antibody libraries enriched in binders for antibody affinity maturation.

Subject(s)

Antibodies , Complementarity Determining Regions , Humans , Complementarity Determining Regions/chemistry , Amino Acid Sequence , Antibody Affinity , Antigens , Hallucinations

9.

Simultaneous prediction of antibody backbone and side-chain conformations with deep learning.

Akpinaroglu, Deniz; Ruffolo, Jeffrey A; Mahajan, Sai Pooja; Gray, Jeffrey J.

PLoS One ; 17(6): e0258173, 2022.

Article in English | MEDLINE | ID: mdl-35704640

ABSTRACT

Antibody engineering is becoming increasingly popular in medicine for the development of diagnostics and immunotherapies. Antibody function relies largely on the recognition and binding of antigenic epitopes via the loops in the complementarity determining regions. Hence, accurate high-resolution modeling of these loops is essential for effective antibody engineering and design. Deep learning methods have previously been shown to effectively predict antibody backbone structures described as a set of inter-residue distances and orientations. However, antigen binding is also dependent on the specific conformations of surface side-chains. To address this shortcoming, we created DeepSCAb: a deep learning method that predicts inter-residue geometries as well as side-chain dihedrals of the antibody variable fragment. The network requires only sequence as input, rendering it particularly useful for antibodies without any known backbone conformations. Rotamer predictions use an interpretable self-attention layer, which learns to identify structurally conserved anchor positions across several species. We evaluate the performance of the model for discriminating near-native structures from sets of decoys and find that DeepSCAb outperforms similar methods lacking side-chain context. When compared to alternative rotamer repacking methods, which require an input backbone structure, DeepSCAb predicts side-chain conformations competitively. Our findings suggest that DeepSCAb improves antibody structure prediction with accurate side-chain modeling and is adaptable to applications in docking of antibody-antigen complexes and design of new therapeutic antibody sequences.

Subject(s)

Deep Learning , Antigen-Antibody Complex , Protein Conformation , Structural Homology, Protein

10.

Antibody structure prediction using interpretable deep learning.

Ruffolo, Jeffrey A; Sulam, Jeremias; Gray, Jeffrey J.

Patterns (N Y) ; 3(2): 100406, 2022 Feb 11.

Article in English | MEDLINE | ID: mdl-35199061

ABSTRACT

Therapeutic antibodies make up a rapidly growing segment of the biologics market. However, rational design of antibodies is hindered by reliance on experimental methods for determining antibody structures. Here, we present DeepAb, a deep learning method for predicting accurate antibody Fv structures from sequence. We evaluate DeepAb on a set of structurally diverse, therapeutically relevant antibodies and find that our method consistently outperforms the leading alternatives. Previous deep learning methods have operated as "black boxes" and offered few insights into their predictions. By introducing a directly interpretable attention mechanism, we show our network attends to physically important residue pairs (e.g., proximal aromatics and key hydrogen bonding interactions). Finally, we present a novel mutant scoring metric derived from network confidence and show that for a particular antibody, all eight of the top-ranked mutations improve binding affinity. This model will be useful for a broad range of antibody prediction and design tasks.

11.

Geometric potentials from deep learning improve prediction of CDR H3 loop structures.

Ruffolo, Jeffrey A; Guerra, Carlos; Mahajan, Sai Pooja; Sulam, Jeremias; Gray, Jeffrey J.

Bioinformatics ; 36(Suppl_1): i268-i275, 2020 Jul 01.

Article in English | MEDLINE | ID: mdl-32657412

ABSTRACT

MOTIVATION: Antibody structure is largely conserved, except for a complementarity-determining region featuring six variable loops. Five of these loops adopt canonical folds which can typically be predicted with existing methods, while the remaining loop (CDR H3) remains a challenge due to its highly diverse set of observed conformations. In recent years, deep neural networks have proven to be effective at capturing the complex patterns of protein structure. This work proposes DeepH3, a deep residual neural network that learns to predict inter-residue distances and orientations from antibody heavy and light chain sequence. The output of DeepH3 is a set of probability distributions over distances and orientation angles between pairs of residues. These distributions are converted to geometric potentials and used to discriminate between decoy structures produced by RosettaAntibody and predict new CDR H3 loop structures de novo. RESULTS: When evaluated on the Rosetta antibody benchmark dataset of 49 targets, DeepH3-predicted potentials identified better, same and worse structures [measured by root-mean-squared distance (RMSD) from the experimental CDR H3 loop structure] than the standard Rosetta energy function for 33, 6 and 10 targets, respectively, and improved the average RMSD of predictions by 32.1% (1.4 Å). Analysis of individual geometric potentials revealed that inter-residue orientations were more effective than inter-residue distances for discriminating near-native CDR H3 loops. When applied to de novo prediction of CDR H3 loop structures, DeepH3 achieves an average RMSD of 2.2 ± 1.1 Å on the Rosetta antibody benchmark. AVAILABILITY AND IMPLEMENTATION: DeepH3 source code and pre-trained model parameters are freely available at https://github.com/Graylab/deepH3-distances-orientations. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
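The step in which predicted distributions are "converted to geometric potentials" can be sketched as follows. The abstract does not state the functional form, so this example assumes the common choice of a negative log-probability per bin; the function names, the three-bin layout, and the probabilities are invented for illustration and do not come from the DeepH3 code.

```python
import math


def distribution_to_potential(probs, eps=1e-6):
    """Turn a binned probability distribution into a per-bin potential.

    Assumed form: potential = -ln(p), after normalising the bins to sum
    to one. eps guards against taking the log of zero.
    """
    total = sum(probs)
    return [-math.log(max(p / total, eps)) for p in probs]


def score_decoy(bin_indices, potentials):
    """Sum the potential of the observed bin for each residue pair."""
    return sum(pot[b] for b, pot in zip(bin_indices, potentials))


# Hypothetical predicted distributions for two residue pairs over three bins.
predicted = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
potentials = [distribution_to_potential(p) for p in predicted]

near_native = [0, 2]  # each pair falls in its most probable bin
far_decoy = [2, 0]    # each pair falls in its least probable bin

print(score_decoy(near_native, potentials) < score_decoy(far_decoy, potentials))  # True
```

Under this assumption a lower summed potential marks a decoy that agrees better with the predicted geometry, which is how such a score can be used to rank decoy structures.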

Subject(s)

Deep Learning , Antibodies , Complementarity Determining Regions , Models, Molecular , Protein Conformation

12.

Modeling of lamprey reticulospinal neurons: multiple distinct parameter sets yield realistic simulations.

Ruffolo, Jeffrey A; McClellan, Andrew D.

J Neurophysiol ; 124(3): 895-913, 2020 Sep 01.

Article in English | MEDLINE | ID: mdl-32697608

ABSTRACT

For the lamprey and other vertebrates, reticulospinal (RS) neurons project descending axons to the spinal cord and activate motor networks to initiate locomotion and other behaviors. In the present study, a biophysically detailed computer model of lamprey RS neurons was constructed consisting of three compartments: dendritic, somatic, and axon initial segment (AIS). All compartments included passive channels. In addition, the soma and AIS had fast potassium and sodium channels. The soma included three additional voltage-gated ion channels (slow sodium and high- and low-voltage-activated calcium) and calcium-activated potassium channels. An initial manually adjusted default parameter set, which was based, in part, on modified parameters from models of lamprey spinal neurons, generated simulations of single action potentials and repetitive firing that scored favorably (0.658; maximum = 0.964) compared with experimentally derived properties of lamprey RS neurons. Subsequently, a dual-annealing search paradigm identified 4,302 viable parameter sets at local maxima within parameter space that yielded higher scores than the default parameter set, including many with much higher scores of approximately 0.85-0.87 (i.e., ~30% improvement). In addition, 5- and 2-conductance grid searches identified a relatively large number of viable parameter sets for which significant correlations were present between maximum conductances for pairs of ion channels. The present results indicated that multiple model parameter sets ("solutions") generated action potentials and repetitive firing that mimicked many of the properties of lamprey RS neurons. To our knowledge, this is the first study to systematically explore parameter space for a biophysically detailed model of lamprey RS neurons. NEW & NOTEWORTHY: A computer model of lamprey reticulospinal neurons with a default parameter set produced simulations of action potentials and repetitive firing that scored favorably compared with the properties of these neurons. A dual-annealing search algorithm explored ~50 million parameter sets and identified 4,302 distinct viable parameter sets within parameter space that yielded higher/much higher scores than the default parameter set. In addition, 5- and 2-conductance grid searches identified significant correlations between maximum conductances for pairs of ion channels.
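The "~30% improvement" quoted in this abstract can be checked with a short calculation. The sketch assumes the figure is a relative gain over the default score of 0.658; the helper name is hypothetical, and only the scores themselves come from the abstract.

```python
def relative_improvement(new_score, baseline):
    """Fractional gain of new_score over baseline."""
    return (new_score - baseline) / baseline


default_score = 0.658  # score of the manually adjusted default parameter set

for score in (0.85, 0.87):
    gain = relative_improvement(score, default_score)
    print(f"{score:.2f}: {gain:.1%}")
# 0.85: 29.2%
# 0.87: 32.2%
```

Both ends of the reported range bracket 30%, so the stated figure is consistent with a relative gain measured against the default parameter set.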

Subject(s)

Action Potentials/physiology , Computer Simulation , Lampreys/physiology , Locomotion/physiology , Models, Biological , Nerve Net/physiology , Neurons/physiology , Spinal Cord/physiology , Animals , Behavior, Animal/physiology , Potassium Channels/physiology , Sodium Channels/physiology , Spinal Cord/cytology
