Search | VHL Regional Portal

A universal deep-learning model for zinc finger design enables transcription factor reprogramming.

Ichikawa, David M; Abdin, Osama; Alerasool, Nader; Kogenaru, Manjunatha; Mueller, April L; Wen, Han; Giganti, David O; Goldberg, Gregory W; Adams, Samantha; Spencer, Jeffrey M; Razavi, Rozita; Nim, Satra; Zheng, Hong; Gionco, Courtney; Clark, Finnegan T; Strokach, Alexey; Hughes, Timothy R; Lionnet, Timothee; Taipale, Mikko; Kim, Philip M; Noyes, Marcus B.

Nat Biotechnol ; 41(8): 1117-1129, 2023 08.

Article in English | MEDLINE | ID: mdl-36702896

ABSTRACT

Cys2His2 zinc finger (ZF) domains engineered to bind specific target sequences in the genome provide an effective strategy for programmable regulation of gene expression, with many potential therapeutic applications. However, the structurally intricate engagement of ZF domains with DNA has made their design challenging. Here we describe the screening of 49 billion protein-DNA interactions and the development of a deep-learning model, ZFDesign, that solves ZF design for any genomic target. ZFDesign is a modern machine learning method that models global and target-specific differences induced by a range of library environments and specifically takes into account compatibility of neighboring fingers using a novel hierarchical transformer architecture. We demonstrate the versatility of designed ZFs as nucleases as well as activators and repressors by seamless reprogramming of human transcription factors. These factors could be used to upregulate an allele of haploinsufficiency, downregulate a gain-of-function mutation or test the consequence of regulation of a single gene as opposed to the many genes that a transcription factor would normally influence.

Subject(s)

Deep Learning , Transcription Factors , Humans , Transcription Factors/genetics , Transcription Factors/metabolism , Zinc Fingers/genetics , Gene Expression Regulation , DNA/genetics

Deep generative modeling for protein design.

Strokach, Alexey; Kim, Philip M.

Curr Opin Struct Biol ; 72: 226-236, 2022 02.

Article in English | MEDLINE | ID: mdl-34963082

ABSTRACT

Deep learning approaches have produced substantial breakthroughs in fields such as image classification and natural language processing and are making rapid inroads in the area of protein design. Many generative models of proteins have been developed that encompass all known protein sequences, model specific protein families, or extrapolate the dynamics of individual proteins. Those generative models can learn protein representations that are often more informative of protein structure and function than hand-engineered features. Furthermore, they can be used to quickly propose millions of novel proteins that resemble the native counterparts in terms of expression level, stability, or other attributes. The protein design process can further be guided by discriminative oracles to select candidates with the highest probability of having the desired properties. In this review, we discuss five classes of generative models that have been most successful at modeling proteins and provide a framework for model guided protein design.

Subject(s)

Neural Networks, Computer , Proteins

Computational generation of proteins with predetermined three-dimensional shapes using ProteinSolver.

Strokach, Alexey; Becerra, David; Corbi-Verge, Carles; Perez-Riba, Albert; Kim, Philip M.

STAR Protoc ; 2(2): 100505, 2021 06 18.

Article in English | MEDLINE | ID: mdl-33997819

ABSTRACT

Computational generation of new proteins with a predetermined three-dimensional shape and computational optimization of existing proteins while maintaining their shape are challenging problems in structural biology. Here, we present a protocol that uses ProteinSolver, a pre-trained graph convolutional neural network, to quickly generate thousands of sequences matching a specific protein topology. We describe computational approaches that can be used to evaluate the generated sequences, and we show how select sequences can be validated experimentally. For complete details on the use and execution of this protocol, please refer to Strokach et al. (2020).

Subject(s)

Computational Biology , Databases, Protein , Neural Networks, Computer , Proteins , Software , Proteins/chemistry , Proteins/genetics

ELASPIC2 (EL2): Combining Contextualized Language Models and Graph Neural Networks to Predict Effects of Mutations.

Strokach, Alexey; Lu, Tian Yu; Kim, Philip M.

J Mol Biol ; 433(11): 166810, 2021 05 28.

Article in English | MEDLINE | ID: mdl-33450251

ABSTRACT

The ELASPIC web server allows users to evaluate the effect of mutations on protein folding and protein-protein interaction on a proteome-wide scale. It uses homology models of proteins and protein-protein interactions, which have been precalculated for several proteomes, and machine learning models, which integrate structural information with sequence conservation scores, in order to make its predictions. Since the original publication of the ELASPIC web server, several advances have motivated a revisiting of the problem of mutation effect prediction. First, progress in neural network architectures and self-supervised pre-training has resulted in models which provide more informative embeddings of protein sequence and structure than those used by the original version of ELASPIC. Second, the amount of training data has increased several-fold, largely driven by advances in deep mutation scanning and other multiplexed assays of variant effect. Here, we describe two machine learning models which leverage the recent advances in order to achieve superior accuracy in predicting the effect of mutation on protein folding and protein-protein interaction. The models incorporate features generated using pre-trained transformer- and graph convolution-based neural networks, and are trained to optimize a ranking objective function, which permits the use of heterogeneous training data. The outputs from the new models have been incorporated into the ELASPIC web server, available at http://elaspic.kimlab.org.
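The abstract does not specify the ranking objective, so the following is only a minimal sketch of one common choice: a pairwise logistic ranking loss. It illustrates why a ranking objective tolerates heterogeneous data: pairs are compared only within the same assay, so measurements on different scales never need to be reconciled. The function name `pairwise_ranking_loss` and the `groups` argument are hypothetical and are not taken from ELASPIC2.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(scores, targets, groups):
    """Mean logistic loss over pairs from the same group with different targets."""
    losses = []
    for i, j in combinations(range(len(scores)), 2):
        if groups[i] != groups[j] or targets[i] == targets[j]:
            continue  # skip pairs from different assays and tied measurements
        # Orient the pair so that `hi` has the larger measured value.
        hi, lo = (i, j) if targets[i] > targets[j] else (j, i)
        margin = scores[hi] - scores[lo]
        # Small when the model ranks `hi` above `lo`, large otherwise.
        losses.append(math.log1p(math.exp(-margin)))
    return sum(losses) / len(losses) if losses else 0.0


# Two assays ("a" and "b") with very different measurement scales.
targets = [1.0, 0.0, 10.0, 50.0]
groups = ["a", "a", "b", "b"]
well_ranked = pairwise_ranking_loss([2.0, 0.0, 1.0, 3.0], targets, groups)
mis_ranked = pairwise_ranking_loss([0.0, 2.0, 3.0, 1.0], targets, groups)
```

In this sketch `well_ranked` is smaller than `mis_ranked`, because only the relative order of predictions within each assay contributes to the loss.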

Subject(s)

Computational Biology/methods , Language , Mutation/genetics , Neural Networks, Computer , Software , Algorithms , Databases, Protein , Internet , Protein Folding , Reproducibility of Results , User-Computer Interface

Fast and Flexible Protein Design Using Deep Graph Neural Networks.

Strokach, Alexey; Becerra, David; Corbi-Verge, Carles; Perez-Riba, Albert; Kim, Philip M.

Cell Syst ; 11(4): 402-411.e4, 2020 10 21.

Article in English | MEDLINE | ID: mdl-32971019

ABSTRACT

Protein structure and function are determined by the arrangement of the linear sequence of amino acids in 3D space. We show that a deep graph neural network, ProteinSolver, can precisely design sequences that fold into a predetermined shape by phrasing this challenge as a constraint satisfaction problem (CSP), akin to Sudoku puzzles. We trained ProteinSolver on over 70,000,000 real protein sequences corresponding to over 80,000 structures. We show that our method rapidly designs new protein sequences, and we benchmark them in silico using energy-based scores, molecular dynamics, and structure prediction methods. As a proof-of-principle validation, we use ProteinSolver to generate sequences that match the structure of serum albumin, then synthesize the top-scoring design and validate it in vitro using circular dichroism. ProteinSolver is freely available at http://design.proteinsolver.org and https://gitlab.com/ostrokach/proteinsolver. A record of this paper's transparent peer review process is included in the Supplemental Information.

Subject(s)

Protein Engineering/methods , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence/genetics , Computer Simulation , Databases, Protein , Neural Networks, Computer , Proteins/metabolism , Software

Evaluating the predictions of the protein stability change upon single amino acid substitutions for the FXN CAGI5 challenge.

Savojardo, Castrense; Petrosino, Maria; Babbi, Giulia; Bovo, Samuele; Corbi-Verge, Carles; Casadio, Rita; Fariselli, Piero; Folkman, Lukas; Garg, Aditi; Karimi, Mostafa; Katsonis, Panagiotis; Kim, Philip M; Lichtarge, Olivier; Martelli, Pier Luigi; Pasquo, Alessandra; Pal, Debnath; Shen, Yang; Strokach, Alexey V; Turina, Paola; Zhou, Yaoqi; Andreoletti, Gaia; Brenner, Steven E; Chiaraluce, Roberta; Consalvi, Valerio; Capriotti, Emidio.

Hum Mutat ; 40(9): 1392-1399, 2019 09.

Article in English | MEDLINE | ID: mdl-31209948

ABSTRACT

Frataxin (FXN) is a highly conserved protein found in prokaryotes and eukaryotes that is required for efficient regulation of cellular iron homeostasis. Experimental evidence associates amino acid substitutions of the FXN to Friedreich Ataxia, a neurodegenerative disorder. Recently, new thermodynamic experiments have been performed to study the impact of somatic variations identified in cancer tissues on protein stability. The Critical Assessment of Genome Interpretation (CAGI) data provider at the University of Rome measured the unfolding free energy of a set of variants (FXN challenge data set) with far-UV circular dichroism and intrinsic fluorescence spectra. These values have been used to calculate the change in unfolding free energy between the variant and wild-type proteins at zero concentration of denaturant (ΔΔGH2O). The FXN challenge data set, composed of eight amino acid substitutions, was used to evaluate the performance of the current computational methods for predicting the ΔΔGH2O value associated with the variants and to classify them as destabilizing or not destabilizing. For the fifth edition of CAGI, six independent research groups from Asia, Australia, Europe, and North America submitted 12 sets of predictions from different approaches. In this paper, we report the results of our assessment and discuss the limitations of the tested algorithms.
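The quantity assessed in this challenge can be illustrated with a short sketch. It assumes the convention ΔΔG = ΔG_unfold(variant) − ΔG_unfold(wild type), so that negative values indicate destabilization, and a hypothetical cutoff of −1.0 kcal/mol. Neither the sign convention nor the cutoff is stated in the abstract, and the numbers below are made up for illustration.

```python
def ddg_unfolding(dg_variant, dg_wild_type):
    """Change in unfolding free energy (kcal/mol): variant minus wild type."""
    return dg_variant - dg_wild_type


def classify(ddg, cutoff=-1.0):
    """Label a variant as destabilizing when ddg is at or below the assumed cutoff."""
    return "destabilizing" if ddg <= cutoff else "not destabilizing"


# Illustrative unfolding free energies, not data from the FXN challenge.
wild_type_dg = 8.0
variant_dg = {"variant_A": 5.5, "variant_B": 7.75}

labels = {
    name: classify(ddg_unfolding(dg, wild_type_dg))
    for name, dg in variant_dg.items()
}
```

Under these assumptions `variant_A` (ΔΔG = −2.5) is labeled destabilizing and `variant_B` (ΔΔG = −0.25) is not.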

Subject(s)

Amino Acid Substitution , Iron-Binding Proteins/chemistry , Iron-Binding Proteins/genetics , Algorithms , Circular Dichroism , Humans , Models, Molecular , Protein Conformation , Protein Folding , Protein Stability , Frataxin

Predicting changes in protein stability caused by mutation using sequence- and structure-based methods in a CAGI5 blind challenge.

Strokach, Alexey; Corbi-Verge, Carles; Kim, Philip M.

Hum Mutat ; 40(9): 1414-1423, 2019 09.

Article in English | MEDLINE | ID: mdl-31243847

ABSTRACT

Predicting the impact of mutations on proteins remains an important problem. As part of the CAGI5 frataxin challenge, we evaluate the accuracy with which Provean, FoldX, and ELASPIC can predict changes in the Gibbs free energy of a protein using a limited data set of eight mutations. We find that different methods have distinct strengths and limitations, with no method being strictly superior to other methods on all metrics. ELASPIC achieves the highest accuracy while also providing a web interface which simplifies the evaluation and analysis of mutations. FoldX is slightly less accurate than ELASPIC but is easier to run locally, as it does not depend on external tools or datasets. Provean achieves reasonable results while being computationally less expensive than the other methods and not requiring a structure of the protein. In addition to methods submitted to the CAGI5 community experiment, and with the aim to inform about other methods with high accuracy, we also evaluate predictions made by Rosetta's ddg_monomer protocol, Rosetta's cartesian_ddg protocol, and thermodynamic integration calculations using the Amber package. ELASPIC still achieves the highest accuracy, while Rosetta's cartesian_ddg protocol appears to perform best in capturing the overall trend in the data.

Subject(s)

Computational Biology/methods , Iron-Binding Proteins/chemistry , Iron-Binding Proteins/genetics , Mutation , Humans , Models, Molecular , Protein Conformation , Protein Folding , Protein Stability , Thermodynamics , Frataxin

Predicting the Effect of Mutations on Protein Folding and Protein-Protein Interactions.

Strokach, Alexey; Corbi-Verge, Carles; Teyra, Joan; Kim, Philip M.

Methods Mol Biol ; 1851: 1-17, 2019.

Article in English | MEDLINE | ID: mdl-30298389

ABSTRACT

The function of a protein is largely determined by its three-dimensional structure and its interactions with other proteins. Changes to a protein's amino acid sequence can alter its function by perturbing the energy landscapes of protein folding and binding. Many tools have been developed to predict the energetic effect of amino acid changes, utilizing features describing the sequence of a protein, the structure of a protein, or both. Those tools can have many applications, such as distinguishing between deleterious and benign mutations and designing proteins and peptides with attractive properties. In this chapter, we describe how to use one such tool, ELASPIC, to predict the effect of mutations on the stability of proteins and the affinity between proteins, in the context of a human protein-protein interaction network. ELASPIC uses a wide range of sequential and structural features to predict the change in the Gibbs free energy for protein folding and protein-protein interactions. It can be used both through a web server and as a stand-alone application. Since ELASPIC was trained using homology models and not crystal structures, it can be applied to a much broader range of proteins than traditional methods. It can leverage precalculated sequence alignments, homology models, and other features, in order to drastically lower the amount of time required to evaluate individual mutations and make tractable the analysis of millions of mutations affecting the majority of proteins in a genome.

Subject(s)

Computational Biology/methods , Mutation/genetics , Proteins/metabolism , Protein Binding , Protein Engineering , Protein Folding , Protein Stability , Proteins/genetics , Thermodynamics

ELASPIC web-server: proteome-wide structure-based prediction of mutation effects on protein stability and binding affinity.

Witvliet, Daniel K; Strokach, Alexey; Giraldo-Forero, Andrés Felipe; Teyra, Joan; Colak, Recep; Kim, Philip M.

Bioinformatics ; 32(10): 1589-91, 2016 05 15.

Article in English | MEDLINE | ID: mdl-26801957

ABSTRACT

ELASPIC is a novel ensemble machine-learning approach that predicts the effects of mutations on protein folding and protein-protein interactions. Here, we present the ELASPIC webserver, which makes the ELASPIC pipeline available through a fast and intuitive interface. The webserver can be used to evaluate the effect of mutations on any protein in the Uniprot database, and allows all predicted results, including modeled wild-type and mutated structures, to be managed and viewed online and downloaded if needed. It is backed by a database which contains improved structural domain definitions, and a list of curated domain-domain interactions for all known proteins, as well as homology models of domains and domain-domain interactions for the human proteome. Homology models for proteins of other organisms are calculated on the fly, and mutations are evaluated within minutes once the homology model is available. AVAILABILITY AND IMPLEMENTATION: The ELASPIC webserver is available online at http://elaspic.kimlab.org. CONTACT: pm.kim@utoronto.ca or pi@kimlab.org. SUPPLEMENTARY DATA: Supplementary data are available at Bioinformatics online.

Subject(s)

Proteome , Humans , Mutation , Protein Binding , Protein Folding , Protein Stability , Software
