1.
Proteins; 91(11): 1471-1486, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37337902

ABSTRACT

Protein engineers aim to discover and design novel sequences with targeted, desirable properties. Given the near-limitless size of the protein sequence landscape, it is no surprise that such desirable sequences are rare, which makes identifying them a costly and time-consuming endeavor. In this work, we show how to use a deep transformer protein language model to identify the sequences that hold the most promise. Specifically, we use the model's self-attention map to calculate a Promise Score that weights the relative importance of a given sequence according to predicted interactions with a specified binding partner. This Promise Score can then be used to identify strong binders worthy of further study and experimentation. We use the Promise Score in two protein engineering contexts: Nanobody (Nb) discovery and protein optimization. For Nb discovery, we show how the Promise Score provides an effective way to select lead sequences from Nb repertoires. For protein optimization, we show how to use the Promise Score to select site-specific mutagenesis experiments that identify a high percentage of improved sequences. In both cases, we also show how the self-attention map used to calculate the Promise Score can indicate which regions of a protein are involved in the intermolecular interactions that drive the targeted property. Finally, we describe how to fine-tune the transformer protein language model to learn a predictive model for the targeted property, and we discuss the capabilities and limitations of fine-tuning with and without knowledge transfer in the context of protein engineering.
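
A minimal sketch of the core idea (not the paper's exact Promise Score): score a candidate by the attention a pretrained protein language model directs from the candidate's residues toward a specified binding partner. The checkpoint name, the naive sequence concatenation, and the averaging over layers and heads are assumptions made for illustration.

import torch
from transformers import AutoTokenizer, EsmModel

MODEL = "facebook/esm2_t6_8M_UR50D"  # assumed small ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = EsmModel.from_pretrained(MODEL, output_attentions=True).eval()

def promise_like_score(candidate: str, partner: str) -> float:
    """Mean attention flowing from candidate residues to partner residues."""
    inputs = tokenizer(candidate + partner, return_tensors="pt")  # naive concatenation
    with torch.no_grad():
        out = model(**inputs)
    # out.attentions: tuple of (1, heads, seq, seq) tensors; average over layers and heads
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]
    n = len(candidate)  # token 0 is the BOS token, so candidate residues start at index 1
    cross = attn[1:n + 1, n + 1:n + 1 + len(partner)]
    return cross.mean().item()

Ranking a pool of candidate sequences by such a score, highest first, mirrors the abstract's use of an attention-derived score to pick leads worth experimental follow-up.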


Subject(s)
Language; Protein Engineering; Mutagenesis, Site-Directed; Amino Acid Sequence; Research Design
2.
Algorithms Mol Biol; 16(1): 13, 2021 Jul 01.
Article in English | MEDLINE | ID: mdl-34210336

ABSTRACT

BACKGROUND: Directed evolution (DE) is a protein engineering technique that uses iterative rounds of mutagenesis and screening to search for sequences that optimize a given property, such as binding affinity to a specified target. Unfortunately, the underlying optimization problem is under-determined, so mutations introduced to improve the specified property may come at the expense of unmeasured but nevertheless important properties (e.g., solubility or thermostability). We address this issue by formulating DE as a regularized Bayesian optimization problem in which the regularization term reflects evolutionary or structure-based constraints. RESULTS: We applied our regularized DE approach to three representative proteins (GB1, BRCA1, and the SARS-CoV-2 Spike) and evaluated both evolutionary and structure-based regularization terms. The results of these experiments demonstrate that: (i) structure-based regularization usually leads to better designs than the unregularized setting, and never hurts; (ii) evolutionary regularization tends to be the least effective; and (iii) regularization leads to better designs because it focuses the search on certain areas of sequence space, making better use of the experimental budget. Additionally, as in previous work on machine learning-assisted DE, we find that our approach significantly reduces the experimental burden of DE relative to model-free methods. CONCLUSION: Introducing regularization into a Bayesian, ML-assisted DE framework alters the exploratory behavior of the underlying optimization routine and can shift variant selection toward variants with a range of targeted, desirable properties. In particular, we find that structure-based regularization often improves variant selection compared to unregularized approaches, and never hurts.
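
A rough sketch of the regularized-acquisition idea described above: a standard upper-confidence-bound score from a surrogate model, penalized by a constraint term (for example, a structure- or evolution-based cost). The function names, the penalty weight lam, and the batch-selection rule are hypothetical placeholders, not the authors' exact formulation.

import numpy as np

def regularized_acquisition(mu, sigma, penalty, lam=0.5, beta=2.0):
    """mu, sigma: surrogate mean/std per variant; penalty: constraint cost per variant."""
    ucb = mu + beta * sigma        # usual explore/exploit trade-off
    return ucb - lam * penalty     # down-weight variants that violate the constraint

def select_batch(variants, mu, sigma, penalty, k=10):
    """Pick the k variants to screen in the next DE round."""
    scores = regularized_acquisition(np.asarray(mu), np.asarray(sigma), np.asarray(penalty))
    return [variants[i] for i in np.argsort(-scores)[:k]]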

3.
Bioinformatics; 37(Suppl_1): i451-i459, 2021 Jul 12.
Article in English | MEDLINE | ID: mdl-34252975

ABSTRACT

MOTIVATION: The recent emergence of cloud laboratories (collections of automated wet-lab instruments that are accessed remotely) presents new opportunities to apply artificial intelligence and machine learning in scientific research. Among these is the challenge of automatically optimizing experimental protocols to maximize data quality. RESULTS: We introduce a new deterministic algorithm, called PaRallel OptimizaTiOn for ClOud Laboratories (PROTOCOL), that improves experimental protocols via asynchronous, parallel Bayesian optimization. The algorithm achieves exponential convergence with respect to simple regret. We demonstrate PROTOCOL in both simulated and real-world cloud labs. In the simulated lab, it outperforms alternative Bayesian optimization approaches in both its ability to find optimal configurations and the number of experiments required to find the optimum. In the real-world lab, the algorithm makes progress toward the optimal setting. DATA AVAILABILITY AND IMPLEMENTATION: PROTOCOL is available both as a stand-alone Python library and as part of an R Shiny application at https://github.com/clangmead/PROTOCOL. Data are available at the same repository. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
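
A toy sketch of the underlying loop (a surrogate model proposing the next protocol setting to run), written with scikit-learn rather than the PROTOCOL package itself; the parameter grid, observations, and function name are invented for illustration. See the linked repository for the actual library and its API.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def propose_next(observed_x, observed_y, grid, beta=2.0):
    """Pick the next protocol setting via an upper-confidence-bound rule on a GP surrogate."""
    gp = GaussianProcessRegressor(normalize_y=True)
    gp.fit(np.asarray(observed_x).reshape(-1, 1), np.asarray(observed_y))
    mu, sigma = gp.predict(np.asarray(grid).reshape(-1, 1), return_std=True)
    return grid[int(np.argmax(mu + beta * sigma))]

# Example: tune an incubation temperature over a discrete grid.
grid = np.linspace(20.0, 45.0, 26)
tried, quality = [25.0, 37.0], [0.4, 0.9]   # settings run so far and their data-quality scores
next_setting = propose_next(tried, quality, grid)

In an asynchronous, parallel setting, each instrument would call a rule like this as soon as it frees up, using whatever observations have been reported so far, rather than waiting for a full batch to complete.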


Subject(s)
Artificial Intelligence; Software; Algorithms; Bayes Theorem; Laboratories
4.
BMC Bioinformatics; 22(1): 174, 2021 Apr 01.
Article in English | MEDLINE | ID: mdl-33794760

ABSTRACT

BACKGROUND: Supervised learning from high-throughput sequencing data presents many challenges. For one, the curse of dimensionality often leads to overfitting as well as issues with scalability, which can produce inaccurate models or models that require extensive compute time and resources. Additionally, variant calls may not be the optimal encoding for a given learning task, which also contributes to poor predictive capabilities. To address these issues, we present HARVESTMAN, a method that takes advantage of hierarchical relationships among the possible biological interpretations and representations of genomic variants to perform automatic feature learning, feature selection, and model building. RESULTS: We demonstrate that HARVESTMAN scales to thousands of genomes comprising more than 84 million variants by processing phase 3 data from the 1000 Genomes Project, one of the largest publicly available collections of whole-genome sequences. Using breast cancer data from The Cancer Genome Atlas, we show that HARVESTMAN selects a rich combination of representations that are adapted to the learning task and performs better than a binary representation of SNPs alone. We compare HARVESTMAN to existing feature selection methods and demonstrate that our method is more parsimonious: it selects smaller and less redundant feature subsets while maintaining the accuracy of the resulting classifier. CONCLUSION: HARVESTMAN is a hierarchical feature selection approach for supervised model building from variant call data. By building a knowledge graph over genomic variants and solving an integer linear program, HARVESTMAN automatically and optimally finds the right encoding for genomic variants. Compared to other hierarchical feature selection methods, HARVESTMAN is faster and selects features more parsimoniously.
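
A toy illustration of selecting variant encodings with an integer linear program, in the spirit of the description above: choose at most one representation per variant, subject to a parsimony budget, maximizing an assumed relevance score. The scores, constraints, and budget are invented stand-ins, not HARVESTMAN's actual knowledge graph or objective.

import pulp

# Candidate encodings per variant with assumed relevance scores.
encodings = {
    "var1": {"binary_snp": 0.3, "gene_level": 0.6},
    "var2": {"binary_snp": 0.5, "pathway_level": 0.4},
}
budget = 2  # maximum number of features to keep

prob = pulp.LpProblem("feature_selection", pulp.LpMaximize)
x = {(v, e): pulp.LpVariable(f"x_{v}_{e}", cat="Binary")
     for v, encs in encodings.items() for e in encs}

prob += pulp.lpSum(encodings[v][e] * x[v, e] for v, e in x)   # maximize total relevance
for v, encs in encodings.items():                             # at most one encoding per variant
    prob += pulp.lpSum(x[v, e] for e in encs) <= 1
prob += pulp.lpSum(x.values()) <= budget                      # parsimony budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [(v, e) for (v, e), var in x.items() if var.value() > 0.5]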


Subject(s)
Breast Neoplasms; Deep Learning; Whole Genome Sequencing; Breast Neoplasms/genetics; Genome; Genomics; Humans