Search | VHL Regional Portal

Current genomic deep learning models display decreased performance in cell type specific accessible regions.

Kathail, Pooja; Shuai, Richard W; Chung, Ryan; Ye, Chun Jimmie; Loeb, Gabriel; Ioannidis, Nilah M.

bioRxiv ; 2024 Jul 10.

Article in English | MEDLINE | ID: mdl-39026761

ABSTRACT

Background: A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type specific CREs contain a large proportion of complex disease heritability. Results: We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks), and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models - Enformer and Sei - varies across the genome and is reduced in cell type specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type specific regulatory syntax - through single-task learning or high capacity multi-task models - can improve performance in cell type specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants. Conclusions: Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type specific accessible regions. We also identify strategies to maximize performance in cell type specific accessible regions.
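The stratified evaluation described in this abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that cell type specificity is summarized as the number of cell types in which a region is accessible and that accuracy is measured as Pearson correlation between predicted and observed accessibility; the abstract does not state the exact metric or binning, so the function name, arguments, and bin edges below are illustrative only.

```python
import numpy as np

def pearson_by_specificity(pred, obs, n_accessible, edges):
    """Pearson r between predicted and observed accessibility, per specificity bin.

    pred, obs    : 1-D arrays with one value per genomic region
    n_accessible : number of cell types in which each region is accessible
                   (an assumed proxy for cell type specificity)
    edges        : increasing bin edges applied to n_accessible (hypothetical)
    """
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    bins = np.digitize(np.asarray(n_accessible), edges)
    results = {}
    for b in np.unique(bins):
        mask = bins == b
        # Correlation is undefined for fewer than two points or constant inputs.
        if mask.sum() < 2 or pred[mask].std() == 0 or obs[mask].std() == 0:
            continue
        results[int(b)] = float(np.corrcoef(pred[mask], obs[mask])[0, 1])
    return results
```

Reporting one correlation per bin, rather than a single genome-wide value, is what would let a drop in accuracy for the most cell type specific regions become visible.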

An all-atom protein generative model.

Chu, Alexander E; Kim, Jinho; Cheng, Lucy; El Nesr, Gina; Xu, Minkai; Shuai, Richard W; Huang, Po-Ssu.

Proc Natl Acad Sci U S A ; 121(27): e2311500121, 2024 Jul 02.

Article in English | MEDLINE | ID: mdl-38916999

ABSTRACT

Proteins mediate their functions through chemical interactions; modeling these interactions, which are typically through sidechains, is an important need in protein design. However, constructing an all-atom generative model requires an appropriate scheme for managing the jointly continuous and discrete nature of proteins encoded in the structure and sequence. We describe an all-atom diffusion model of protein structure, Protpardelle, which represents all sidechain states at once as a "superposition" state; superpositions defining a protein are collapsed into individual residue types and conformations during sample generation. When combined with sequence design methods, our model is able to codesign all-atom protein structure and sequence. Generated proteins are of good quality under the typical quality, diversity, and novelty metrics, and sidechains reproduce the chemical features and behavior of natural proteins. Finally, we explore the potential of our model to conduct all-atom protein design and scaffold functional motifs in a backbone- and rotamer-free way.

Subject(s)

Models, Molecular; Protein Conformation; Proteins; Proteins/chemistry; Amino Acid Sequence

IgLM: Infilling language modeling for antibody sequence design.

Shuai, Richard W; Ruffolo, Jeffrey A; Gray, Jeffrey J.

Cell Syst ; 14(11): 979-989.e4, 2023 Nov 15.

Article in English | MEDLINE | ID: mdl-37909045

ABSTRACT

Discovery and optimization of monoclonal antibodies for therapeutic applications relies on large sequence libraries but is hindered by developability issues such as low solubility, high aggregation, and high immunogenicity. Generative language models, trained on millions of protein sequences, are a powerful tool for the on-demand generation of realistic, diverse sequences. We present the Immunoglobulin Language Model (IgLM), a deep generative language model for creating synthetic antibody libraries. Compared with prior methods that leverage unidirectional context for sequence generation, IgLM formulates antibody design based on text-infilling in natural language, allowing it to re-design variable-length spans within antibody sequences using bidirectional context. We trained IgLM on 558 million (M) antibody heavy- and light-chain variable sequences, conditioning on each sequence's chain type and species of origin. We demonstrate that IgLM can generate full-length antibody sequences from a variety of species and its infilling formulation allows it to generate infilled complementarity-determining region (CDR) loop libraries with improved in silico developability profiles. A record of this paper's transparent peer review process is included in the supplemental information.

Subject(s)

Complementarity Determining Regions; Peptide Library; Amino Acid Sequence; Complementarity Determining Regions/genetics; Antibodies, Monoclonal
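The infilling formulation mentioned in the IgLM abstract can be sketched as a sequence rearrangement: the span to be redesigned is replaced by a mask token and moved to the end, so a left-to-right model sees the context on both sides before generating the span. The token strings ("[MASK]", "[SEP]", "[ANS]") and the conditioning tags below are assumptions for illustration and are not taken from the abstract.

```python
def make_infilling_example(seq, start, end, chain_tag, species_tag):
    """Rearrange a sequence so that seq[start:end] is generated last.

    All special tokens and tags here are hypothetical placeholders.
    Returns a list of tokens: conditioning tags, the masked sequence,
    a separator, the removed span, and an end-of-answer marker.
    """
    if not 0 <= start < end <= len(seq):
        raise ValueError("span must satisfy 0 <= start < end <= len(seq)")
    span = seq[start:end]
    return [
        chain_tag,            # e.g. heavy or light chain (assumed tag)
        species_tag,          # e.g. species of origin (assumed tag)
        *seq[:start],         # left context
        "[MASK]",             # placeholder where the span was removed
        *seq[end:],           # right context
        "[SEP]",              # marks the start of the infilled span
        *span,                # tokens the model learns to generate
        "[ANS]",              # marks the end of the span
    ]
```

For example, `make_infilling_example("ABCDEF", 2, 4, "[HEAVY]", "[HUMAN]")` places `C` and `D` after the separator, with `A B [MASK] E F` preceding it. Because the span length is not fixed in advance, a model trained on such examples could in principle generate spans of variable length.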

Personal transcriptome variation is poorly explained by current genomic deep learning models.

Huang, Connie; Shuai, Richard W; Baokar, Parth; Chung, Ryan; Rastogi, Ruchir; Kathail, Pooja; Ioannidis, Nilah M.

Nat Genet ; 55(12): 2056-2059, 2023 Dec.

Article in English | MEDLINE | ID: mdl-38036790

ABSTRACT

Genomic deep learning models can predict genome-wide epigenetic features and gene expression levels directly from DNA sequence. While current models perform well at predicting gene expression levels across genes in different cell types from the reference genome, their ability to explain expression variation between individuals due to cis-regulatory genetic variants remains largely unexplored. Here, we evaluate four state-of-the-art models on paired personal genome and transcriptome data and find limited performance when explaining variation in expression across individuals. In addition, models often fail to predict the correct direction of effect of cis-regulatory genetic variation on expression.

Subject(s)

Deep Learning; Transcriptome; Humans; Transcriptome/genetics; Genetic Variation/genetics; Genome; Genomics

Characterizing uncertainty in predictions of genomic sequence-to-activity models.

Bajwa, Ayesha; Rastogi, Ruchir; Kathail, Pooja; Shuai, Richard W; Ioannidis, Nilah M.

bioRxiv ; 2023 Dec 23.

Article in English | MEDLINE | ID: mdl-38187742

ABSTRACT

Genomic sequence-to-activity models are increasingly utilized to understand gene regulatory syntax and probe the functional consequences of regulatory variation. Current models make accurate predictions of relative activity levels across the human reference genome, but their performance is more limited for predicting the effects of genetic variants, such as explaining gene expression variation across individuals. To better understand the causes of these shortcomings, we examine the uncertainty in predictions of genomic sequence-to-activity models using an ensemble of Basenji2 model replicates. We characterize prediction consistency on four types of sequences: reference genome sequences, reference genome sequences perturbed with TF motifs, eQTLs, and personal genome sequences. We observe that models tend to make high-confidence predictions on reference sequences, even when incorrect, and low-confidence predictions on sequences with variants. For eQTLs and personal genome sequences, we find that model replicates make inconsistent predictions in >50% of cases. Our findings suggest strategies to improve performance of these models.
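One way to quantify the replicate inconsistency described in this abstract is sketched below. It assumes, hypothetically, that a prediction counts as consistent when every replicate in the ensemble agrees on the sign of the predicted variant effect (alternate minus reference); the abstract does not give the exact criterion, so this definition and the function names are illustrative.

```python
import numpy as np

def sign_consistency(effects):
    """Return a boolean array, True where all replicates agree on a nonzero sign.

    effects : array of shape (n_replicates, n_variants) holding each
              replicate's predicted effect (alt minus ref) per variant.
    """
    signs = np.sign(np.asarray(effects, dtype=float))
    # Compare every replicate's sign to the first replicate's sign.
    all_agree = np.all(signs == signs[0], axis=0)
    return all_agree & (signs[0] != 0)

def fraction_inconsistent(effects):
    """Fraction of variants on which the replicates disagree in sign."""
    return float(1.0 - sign_consistency(effects).mean())

if __name__ == "__main__":
    # Toy values only: 3 replicates, 4 variants.
    toy = np.array([
        [0.20, -0.10, 0.30, 0.05],
        [0.10, 0.20, 0.40, -0.02],
        [0.30, -0.30, 0.10, 0.01],
    ])
    print(fraction_inconsistent(toy))
```

A sign-based criterion ignores effect magnitude, so it is a coarse measure; a spread-based measure such as the standard deviation across replicates would be an alternative.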
