Search | VHL Regional Portal

Contrastive pre-training for sequence based genomics models.

Sokolova, Ksenia; Chen, Kathleen M; Troyanskaya, Olga.

bioRxiv ; 2024 Jun 12.

Article in English | MEDLINE | ID: mdl-38915667

ABSTRACT

MOTIVATION: In recent years deep learning has become one of the central approaches in a number of applications, including many tasks in genomics. However, as models grow in depth and complexity, they either require more data or a strategic initialization technique to improve performance. RESULTS: In this project, we introduce cGen, a novel unsupervised, model-agnostic contrastive pre-training method for sequence-based models. cGen can be used before training to initialize weights, reducing the size of the dataset needed. It works through learning the intrinsic features of the reference genome and makes no assumptions on the underlying structure. We show that the embeddings produced by the unsupervised model are already informative for gene expression prediction and that the sequence features provide a meaningful clustering. We demonstrate that cGen improves model performance in various sequence-based deep learning applications, such as chromatin profiling prediction and gene expression. Our findings suggest that using cGen, particularly in areas constrained by data availability, could improve the performance of deep learning genomic models without the need to modify the model architecture.

Deep Learning Sequence Models for Transcriptional Regulation.

Sokolova, Ksenia; Chen, Kathleen M; Hao, Yun; Zhou, Jian; Troyanskaya, Olga G.

Annu Rev Genomics Hum Genet ; 2024 Apr 09.

Article in English | MEDLINE | ID: mdl-38594933

ABSTRACT

Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.

A sequence-based global map of regulatory activity for deciphering human genetics.

Chen, Kathleen M; Wong, Aaron K; Troyanskaya, Olga G; Zhou, Jian.

Nat Genet ; 54(7): 940-949, 2022 07.

Article in English | MEDLINE | ID: mdl-35817977

ABSTRACT

Epigenomic profiling has enabled large-scale identification of regulatory elements, yet we still lack a systematic mapping from any sequence or variant to regulatory activities. We address this challenge with Sei, a framework for integrating human genetics data with sequence information to discover the regulatory basis of traits and diseases. Sei learns a vocabulary of regulatory activities, called sequence classes, using a deep learning model that predicts 21,907 chromatin profiles across >1,300 cell lines and tissues. Sequence classes provide a global classification and quantification of sequence and variant effects based on diverse regulatory activities, such as cell type-specific enhancer functions. These predictions are supported by tissue-specific expression, expression quantitative trait loci and evolutionary constraint data. Furthermore, sequence classes enable characterization of the tissue-specific, regulatory architecture of complex traits and generate mechanistic hypotheses for individual regulatory pathogenic mutations. We provide Sei as a resource to elucidate the regulatory basis of human health and disease.

Subject(s)

Quantitative Trait Loci , Regulatory Sequences, Nucleic Acid , Chromatin/genetics , Epigenomics , Human Genetics , Humans , Quantitative Trait Loci/genetics , Regulatory Sequences, Nucleic Acid/genetics

Genome-wide landscape of RNA-binding protein target site dysregulation reveals a major impact on psychiatric disorder risk.

Park, Christopher Y; Zhou, Jian; Wong, Aaron K; Chen, Kathleen M; Theesfeld, Chandra L; Darnell, Robert B; Troyanskaya, Olga G.

Nat Genet ; 53(2): 166-173, 2021 02.

Article in English | MEDLINE | ID: mdl-33462483

ABSTRACT

Despite the strong genetic basis of psychiatric disorders, the underlying molecular mechanisms are largely unmapped. RNA-binding proteins (RBPs) are responsible for most post-transcriptional regulation, from splicing to translation to localization. RBPs thus act as key gatekeepers of cellular homeostasis, especially in the brain. However, quantifying the pathogenic contribution of noncoding variants impacting RBP target sites is challenging. Here, we leverage a deep learning approach that can accurately predict the RBP target site dysregulation effects of mutations and discover that RBP dysregulation is a principal contributor to psychiatric disorder risk. RBP dysregulation explains a substantial amount of heritability not captured by large-scale molecular quantitative trait loci studies and has a stronger impact than common coding region variants. We share the genome-wide profiles of RBP dysregulation, which we use to identify DDHD2 as a candidate schizophrenia risk gene. This resource provides a new analytical framework to connect the full range of RNA regulation to complex disease.

Subject(s)

Mental Disorders/genetics , Phospholipases/genetics , RNA-Binding Proteins/genetics , 3' Untranslated Regions , Deep Learning , Gene Expression Regulation , Gene Frequency , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Mutation , Nuclear Factor 90 Proteins/genetics , Peptide Elongation Factors/genetics , Polymorphism, Single Nucleotide , Quantitative Trait Loci , RNA Helicases/genetics , RNA Processing, Post-Transcriptional , Ribonucleoprotein, U5 Small Nuclear/genetics , Schizophrenia/genetics , Trans-Activators/genetics

Genomic analyses implicate noncoding de novo variants in congenital heart disease.

Richter, Felix; Morton, Sarah U; Kim, Seong Won; Kitaygorodsky, Alexander; Wasson, Lauren K; Chen, Kathleen M; Zhou, Jian; Qi, Hongjian; Patel, Nihir; DePalma, Steven R; Parfenov, Michael; Homsy, Jason; Gorham, Joshua M; Manheimer, Kathryn B; Velinder, Matthew; Farrell, Andrew; Marth, Gabor; Schadt, Eric E; Kaltman, Jonathan R; Newburger, Jane W; Giardini, Alessandro; Goldmuntz, Elizabeth; Brueckner, Martina; Kim, Richard; Porter, George A; Bernstein, Daniel; Chung, Wendy K; Srivastava, Deepak; Tristani-Firouzi, Martin; Troyanskaya, Olga G; Dickel, Diane E; Shen, Yufeng; Seidman, Jonathan G; Seidman, Christine E; Gelb, Bruce D.

Nat Genet ; 52(8): 769-777, 2020 08.

Article in English | MEDLINE | ID: mdl-32601476

ABSTRACT

A genetic etiology is identified for one-third of patients with congenital heart disease (CHD), with 8% of cases attributable to coding de novo variants (DNVs). To assess the contribution of noncoding DNVs to CHD, we compared genome sequences from 749 CHD probands and their parents with those from 1,611 unaffected trios. Neural network prediction of noncoding DNV transcriptional impact identified a burden of DNVs in individuals with CHD (n = 2,238 DNVs) compared to controls (n = 4,177; P = 8.7 × 10^-4). Independent analyses of enhancers showed an excess of DNVs in associated genes (27 genes versus 3.7 expected, P = 1 × 10^-5). We observed significant overlap between these transcription-based approaches (odds ratio (OR) = 2.5, 95% confidence interval (CI) 1.1-5.0, P = 5.4 × 10^-3). CHD DNVs altered transcription levels in 5 of 31 enhancers assayed. Finally, we observed a DNV burden in RNA-binding-protein regulatory sites (OR = 1.13, 95% CI 1.1-1.2, P = 8.8 × 10^-5). Our findings demonstrate an enrichment of potentially disruptive regulatory noncoding DNVs in a fraction of CHD at least as high as that observed for damaging coding DNVs.

Subject(s)

Genetic Variation/genetics , Heart Defects, Congenital/genetics , RNA, Untranslated/genetics , Adolescent , Adult , Animals , Female , Genetic Predisposition to Disease/genetics , Genomics , Heart/physiology , Humans , Male , Mice , Middle Aged , Open Reading Frames/genetics , RNA-Binding Proteins/genetics , Transcription, Genetic/genetics , Young Adult

Selene: a PyTorch-based deep learning library for sequence data.

Chen, Kathleen M; Cofer, Evan M; Zhou, Jian; Troyanskaya, Olga G.

Nat Methods ; 16(4): 315-318, 2019 04.

Article in English | MEDLINE | ID: mdl-30923381

ABSTRACT

To enable the application of deep learning in biology, we present Selene (https://selene.flatironinstitute.org/), a PyTorch-based deep learning library for fast and easy development, training, and application of deep learning model architectures for any biological sequence data. We demonstrate on DNA sequences how Selene allows researchers to easily train a published architecture on new data, develop and evaluate a new architecture, and use a trained model to answer biological questions of interest.

Subject(s)

Computational Biology/methods , Deep Learning , Neural Networks, Computer , Sequence Analysis, DNA , Algorithms , Alzheimer Disease/metabolism , Area Under Curve , Gene Library , Genomics , Humans , Models, Statistical , Mutagenesis , Mutation , Normal Distribution , Programming Languages , Software

PathCORE-T: identifying and visualizing globally co-occurring pathways in large transcriptomic compendia.

Chen, Kathleen M; Tan, Jie; Way, Gregory P; Doing, Georgia; Hogan, Deborah A; Greene, Casey S.

BioData Min ; 11: 14, 2018.

Article in English | MEDLINE | ID: mdl-29988723

ABSTRACT

BACKGROUND: Investigators often interpret genome-wide data by analyzing the expression levels of genes within pathways. While this within-pathway analysis is routine, the products of any one pathway can affect the activity of other pathways. Past efforts to identify relationships between biological processes have evaluated overlap in knowledge bases or evaluated changes that occur after specific treatments. Individual experiments can highlight condition-specific pathway-pathway relationships; however, constructing a complete network of such relationships across many conditions requires analyzing results from many studies. RESULTS: We developed the PathCORE-T framework by implementing existing methods to identify pathway-pathway transcriptional relationships evident across a broad data compendium. PathCORE-T is applied to the output of feature construction algorithms; it identifies pairs of pathways observed in features more than expected by chance as functionally co-occurring. We demonstrate PathCORE-T by analyzing an existing eADAGE model of a microbial compendium and building and analyzing NMF features from the TCGA dataset of 33 cancer types. The PathCORE-T framework includes a demonstration web interface, with source code, that users can launch to (1) visualize the network and (2) review the expression levels of associated genes in the original data. PathCORE-T creates and displays the network of globally co-occurring pathways based on features observed in a machine learning analysis of gene expression data. CONCLUSIONS: The PathCORE-T framework identifies transcriptionally co-occurring pathways from the results of unsupervised analysis of gene expression data and visualizes the relationships between pathways as a network. PathCORE-T recapitulated previously described pathway-pathway relationships and suggested experimentally testable additional hypotheses that remain to be explored.

Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk.

Zhou, Jian; Theesfeld, Chandra L; Yao, Kevin; Chen, Kathleen M; Wong, Aaron K; Troyanskaya, Olga G.

Nat Genet ; 50(8): 1171-1179, 2018 08.

Article in English | MEDLINE | ID: mdl-30013180

ABSTRACT

Key challenges for human genetics, precision medicine and evolutionary biology include deciphering the regulatory code of gene expression and understanding the transcriptional effects of genome variation. However, this is extremely difficult because of the enormous scale of the noncoding mutation space. We developed a deep learning-based framework, ExPecto, that can accurately predict, ab initio from a DNA sequence, the tissue-specific transcriptional effects of mutations, including those that are rare or that have not been observed. We prioritized causal variants within disease- or trait-associated loci from all publicly available genome-wide association studies and experimentally validated predictions for four immune-related diseases. By exploiting the scalability of ExPecto, we characterized the regulatory mutation space for human RNA polymerase II-transcribed genes by in silico saturation mutagenesis and profiled >140 million promoter-proximal mutations. This enables probing of evolutionary constraints on gene expression and ab initio prediction of mutation disease effects, making ExPecto an end-to-end computational framework for the in silico prediction of expression and disease risk.

Subject(s)

Deep Learning , Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Mutation , Algorithms , Computer Simulation , Gene Expression , Humans , Models, Genetic , Polymorphism, Single Nucleotide , Promoter Regions, Genetic , Quantitative Trait Loci/genetics
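
The ExPecto abstract above mentions in silico saturation mutagenesis. As a rough illustration of what that term means, the sketch below enumerates every single-nucleotide substitution of a DNA sequence, which is the set of variants a sequence model would then be asked to score. This is not ExPecto's code: the function name `saturation_mutagenesis`, the handling of ambiguous bases, and the toy input are assumptions made for this example, and the model-scoring step is omitted entirely.

```python
from typing import Iterator, Tuple

BASES = "ACGT"


def saturation_mutagenesis(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref_base, alt_base, mutated_sequence) for every
    single-nucleotide substitution of `seq` (hypothetical helper)."""
    seq = seq.upper()
    for pos, ref in enumerate(seq):
        if ref not in BASES:
            continue  # assumption: skip ambiguous bases such as N
        for alt in BASES:
            if alt != ref:
                # Replace exactly one base, leaving the rest unchanged.
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


# Toy usage: a 4-base sequence has 4 positions x 3 alternative bases.
variants = list(saturation_mutagenesis("ACGT"))
print(len(variants))  # 12
print(variants[0])    # (0, 'A', 'C', 'CCGT')
```

In practice each mutated sequence would be passed to a trained model alongside the reference sequence, and the difference in predictions taken as the estimated variant effect.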

Unsupervised Extraction of Stable Expression Signatures from Public Compendia with an Ensemble of Neural Networks.

Tan, Jie; Doing, Georgia; Lewis, Kimberley A; Price, Courtney E; Chen, Kathleen M; Cady, Kyle C; Perchuk, Barret; Laub, Michael T; Hogan, Deborah A; Greene, Casey S.

Cell Syst ; 5(1): 63-71.e6, 2017 07 26.

Article in English | MEDLINE | ID: mdl-28711280

ABSTRACT

Cross-experiment comparisons in public data compendia are challenged by unmatched conditions and technical noise. The ADAGE method, which performs unsupervised integration with denoising autoencoder neural networks, can identify biological patterns, but because ADAGE models, like many neural networks, are over-parameterized, different ADAGE models perform equally well. To enhance model robustness and better build signatures consistent with biological pathways, we developed an ensemble ADAGE (eADAGE) that integrated stable signatures across models. We applied eADAGE to a compendium of Pseudomonas aeruginosa gene expression profiling experiments performed in 78 media. eADAGE revealed a phosphate starvation response controlled by PhoB in media with moderate phosphate and predicted that a second stimulus provided by the sensor kinase, KinB, is required for this PhoB activation. We validated this relationship using both targeted and unbiased genetic approaches. eADAGE, which captures stable biological patterns, enables cross-experiment comparisons that can highlight measured but undiscovered relationships.

Subject(s)

Bacterial Proteins/metabolism , Neural Networks, Computer , Pseudomonas aeruginosa/physiology , Gene Expression Profiling , Gene Expression Regulation , Health Knowledge, Attitudes, Practice , Humans , Information Storage and Retrieval/trends , Public Sector , Starvation , Systems Integration , Transcriptome
