1.
Proc Natl Acad Sci U S A ; 121(23): e2322376121, 2024 Jun 04.
Article in English | MEDLINE | ID: mdl-38809705

ABSTRACT

In this article, we develop CausalEGM, a deep learning framework for nonlinear dimension reduction and generative modeling of the dependency among covariate features affecting treatment and response. CausalEGM can be used for estimating causal effects in both binary and continuous treatment settings. By learning a bidirectional transformation between the high-dimensional covariate space and a low-dimensional latent space and then modeling the dependencies of different subsets of the latent variables on the treatment and response, CausalEGM can extract the latent covariate features that affect both treatment and response. By conditioning on these features, one can mitigate the confounding effect of the high-dimensional covariate on the estimation of the causal relation between treatment and response. In a series of experiments, the proposed method is shown to achieve superior performance over existing methods in both binary and continuous treatment settings. The improvement is substantial when the sample size is large and the covariate is of high dimension. Finally, we establish excess risk bounds and consistency results for our method and discuss how our approach is related to and improves upon other dimension reduction approaches in causal inference.

2.
Res Sq ; 2024 May 10.
Article in English | MEDLINE | ID: mdl-38766095

ABSTRACT

Rare variants, comprising a vast majority of human genetic variations, are likely to have more deleterious impact on human diseases compared to common variants. Here we present carrier statistic, a statistical framework to prioritize disease-related rare variants by integrating gene expression data. By quantifying the impact of rare variants on gene expression, carrier statistic can prioritize those rare variants that have large functional consequences in the diseased patients. Through simulation studies and analysis of a real multi-omics dataset, we demonstrated that carrier statistic is applicable in studies with limited sample size (a few hundred) and achieves substantially higher sensitivity than existing rare-variant association methods. Application to Alzheimer's disease reveals 16 rare variants within 15 genes with extreme carrier statistics. We also found a strong excess of rare variants among the top prioritized genes in diseased patients compared to that in healthy individuals. The carrier statistic method can be applied to various rare variant types and is adaptable to other omics data modalities, offering a powerful tool for investigating the molecular mechanisms underlying complex diseases.

4.
bioRxiv ; 2024 Mar 21.
Article in English | MEDLINE | ID: mdl-38562756

ABSTRACT

Rare variants, comprising a vast majority of human genetic variations, are likely to have more deleterious impact on human diseases compared to common variants. Here we present carrier statistic, a statistical framework to prioritize disease-related rare variants by integrating gene expression data. By quantifying the impact of rare variants on gene expression, carrier statistic can prioritize those rare variants that have large functional consequences in the diseased patients. Through simulation studies and analysis of a real multi-omics dataset, we demonstrated that carrier statistic is applicable in studies with limited sample size (a few hundred) and achieves substantially higher sensitivity than existing rare-variant association methods. Application to Alzheimer's disease reveals 16 rare variants within 15 genes with extreme carrier statistics. The carrier statistic method can be applied to various rare variant types and is adaptable to other omics data modalities, offering a powerful tool for investigating the molecular mechanisms underlying complex diseases.

5.
Genome Biol ; 25(1): 1, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38167462

ABSTRACT

BACKGROUND: The vast majority of findings from human genome-wide association studies (GWAS) map to non-coding sequences, complicating their mechanistic interpretations and clinical translations. Non-coding sequences that are evolutionarily conserved and biochemically active could offer clues to the mechanisms underpinning GWAS discoveries. However, genetic effects of such sequences have not been systematically examined across a wide range of human tissues and traits, hampering progress to fully understand regulatory causes of human complex traits. RESULTS: Here we develop a simple yet effective strategy to identify functional elements exhibiting high levels of human-mouse sequence conservation and enhancer-like biochemical activity, which scales well to 313 epigenomic datasets across 106 human tissues and cell types. Combined with 468 GWAS of European (EUR) and East Asian (EAS) ancestries, these elements show tissue-specific enrichments of heritability and causal variants for many traits, which are significantly stronger than enrichments based on enhancers without sequence conservation. These elements also help prioritize candidate genes that are functionally relevant to body mass index (BMI) and schizophrenia but were not reported in previous GWAS with large sample sizes. CONCLUSIONS: Our findings provide a comprehensive assessment of how sequence-conserved enhancer-like elements affect complex traits in diverse tissues and demonstrate a generalizable strategy of integrating evolutionary and biochemical data to elucidate human disease genetics.


Subject(s)
Genome-Wide Association Study , Multifactorial Inheritance , Humans , Mice , Animals , Epigenomics , Phenotype , Enhancer Elements, Genetic , Polymorphism, Single Nucleotide
6.
bioRxiv ; 2024 Feb 03.
Article in English | MEDLINE | ID: mdl-37502861

ABSTRACT

The inherent similarities between natural language and biological sequences have given rise to great interest in adapting the transformer-based large language models (LLMs) underlying recent breakthroughs in natural language processing (references), for applications in genomics. However, current LLMs for genomics suffer from several limitations, such as the inability to include chromatin interactions in the training data and the inability to make predictions in new cellular contexts not represented in the training data. To mitigate these problems, we propose EpiGePT, a transformer-based pretrained language model for predicting context-specific epigenomic signals and chromatin contacts. By taking the context-specific activities of transcription factors (TFs) and 3D genome interactions into consideration, EpiGePT offers wider applicability and deeper biological insights than models trained on DNA sequence only. In a series of experiments, EpiGePT demonstrates superior performance in a diverse set of epigenomic signal prediction tasks when compared to existing methods. In particular, our model enables cross-cell-type prediction of long-range interactions and offers insight into the functional impact of genetic variants under different cellular contexts. These new capabilities will enhance the usefulness of LLMs in the study of gene regulatory mechanisms. We provide a free online prediction service for EpiGePT at http://health.tsinghua.edu.cn/epigept/.

7.
Hum Mol Genet ; 32(21): 3105-3120, 2023 10 17.
Article in English | MEDLINE | ID: mdl-37584462

ABSTRACT

DNA methyltransferase type 1 (DNMT1) is a major enzyme involved in maintaining the methylation pattern after DNA replication. Mutations in DNMT1 have been associated with autosomal dominant cerebellar ataxia, deafness and narcolepsy (ADCA-DN). We used fibroblasts, induced pluripotent stem cells (iPSCs) and induced neurons (iNs) generated from patients with ADCA-DN and controls, to explore the epigenomic and transcriptomic effects of mutations in DNMT1. We show cell type-specific changes in gene expression and DNA methylation patterns. DNA methylation and gene expression changes were negatively correlated in iPSCs and iNs. In addition, we identified a group of genes associated with clinical phenotypes of ADCA-DN, including PDGFB and PRDM8 for cerebellar ataxia, psychosis and dementia and NR2F1 for deafness and optic atrophy. Furthermore, ZFP57, which is required to maintain gene imprinting through DNA methylation during early development, was hypomethylated in promoters and exhibited upregulated expression in patients with ADCA-DN in both iPSCs and iNs. Our results provide insight into the functions of DNMT1 and the molecular changes associated with ADCA-DN, with potential implications for genes associated with related phenotypes.


Subject(s)
Cerebellar Ataxia , Deafness , Humans , Cerebellar Ataxia/genetics , DNA (Cytosine-5-)-Methyltransferases/genetics , Transcriptome/genetics , Epigenomics , DNA (Cytosine-5-)-Methyltransferase 1/genetics , DNA Methylation/genetics , Deafness/genetics , Mutation , DNA
8.
Proc Natl Acad Sci U S A ; 120(28): e2305236120, 2023 07 11.
Article in English | MEDLINE | ID: mdl-37399400

ABSTRACT

Plasma cell-free DNA (cfDNA) is a noninvasive biomarker for cell death of all organs. Deciphering the tissue origin of cfDNA can reveal abnormal cell death because of diseases, which has great clinical potential in disease detection and monitoring. Despite the great promise, the sensitive and accurate quantification of tissue-derived cfDNA remains challenging for existing methods due to the limited characterization of tissue methylation and the reliance on unsupervised methods. To fully exploit the clinical potential of tissue-derived cfDNA, here we present one of the largest comprehensive and high-resolution methylation atlases, based on 521 noncancer tissue samples spanning 29 major types of human tissues. We systematically identified fragment-level tissue-specific methylation patterns and extensively validated them in orthogonal datasets. Based on the rich tissue methylation atlas, we develop the first supervised tissue deconvolution approach, a deep-learning-powered model, cfSort, for sensitive and accurate tissue deconvolution in cfDNA. On the benchmarking data, cfSort showed superior sensitivity and accuracy compared to the existing methods. We further demonstrated the clinical utilities of cfSort with two potential applications: aiding disease diagnosis and monitoring treatment side effects. The tissue-derived cfDNA fraction estimated from cfSort reflected the clinical outcomes of the patients. In summary, the tissue methylation atlas and cfSort enhanced the performance of tissue deconvolution in cfDNA, thus facilitating cfDNA-based disease detection and longitudinal treatment monitoring.


Subject(s)
Cell-Free Nucleic Acids , Deep Learning , Humans , Cell-Free Nucleic Acids/genetics , DNA Methylation , Biomarkers , Promoter Regions, Genetic , Biomarkers, Tumor/genetics
9.
IEEE/ACM Trans Comput Biol Bioinform ; 20(2): 1384-1394, 2023.
Article in English | MEDLINE | ID: mdl-35503836

ABSTRACT

Deciphering the free energy landscape of biomolecular structure space is crucial for understanding many complex molecular processes, such as protein-protein interaction, RNA folding, and protein folding. A major source of current dynamic structure data is Molecular Dynamics (MD) simulations. Several methods have been proposed to investigate the free energy landscape from MD data, but all of them rely on the assumption that kinetic similarity is associated with global geometric similarity, which may lead to unsatisfactory results. In this paper, we propose a new method called Conditional Angle Partition Tree to reveal the hierarchical free energy landscape by correlating local geometric similarity with kinetic similarity. Its application to the benchmark alanine dipeptide MD data showed much better performance than existing methods in exploring and understanding the free energy landscape. We also applied it to the MD data of Villin HP35. Our results are more reasonable in various aspects than those from other methods and very informative on the hierarchical structure of its energy landscape.


Subject(s)
Benchmarking , Trees , Dipeptides , Kinetics , Molecular Dynamics Simulation , Protein Folding , Thermodynamics
10.
Nucleic Acids Res ; 51(D1): D159-D166, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36215037

ABSTRACT

Elucidating the role of 3D architecture of DNA in gene regulation is crucial for understanding cell differentiation, tissue homeostasis and disease development. Among various chromatin conformation capture methods, HiChIP has received increasing attention for its significant improvement over other methods in profiling of regulatory (e.g. H3K27ac) and structural (e.g. cohesin) interactions. To facilitate the studies of 3D regulatory interactions, we developed a HiChIP interactions database, HiChIPdb (http://health.tsinghua.edu.cn/hichipdb/). The current version of HiChIPdb contains ∼262M annotated HiChIP interactions from 200 high-throughput HiChIP samples across 108 cell types. The functionalities of HiChIPdb include: (i) standardized categorization of HiChIP interactions in a hierarchical structure based on organ, tissue and cell line and (ii) comprehensive annotations of HiChIP interactions with regulatory genes and GWAS Catalog SNPs. To the best of our knowledge, HiChIPdb is the first comprehensive database that utilizes a unified pipeline to map the functional interactions across diverse cell types and tissues in different resolutions. We believe this database has the potential to advance cutting-edge research in regulatory mechanisms in development and disease by removing the barrier in data aggregation, preprocessing, and analysis.


Subject(s)
Chromatin , DNA , Cell Line , Chromatin/genetics , Gene Expression Regulation , Sequence Analysis, DNA/methods , Databases, Genetic
11.
Elife ; 11, 2022 12 16.
Article in English | MEDLINE | ID: mdl-36525361

ABSTRACT

Systems genetics holds the promise to decipher complex traits by interpreting their associated SNPs through gene regulatory networks derived from comprehensive multi-omics data of cell types, tissues, and organs. Here, we propose SpecVar to integrate paired chromatin accessibility and gene expression data into a context-specific regulatory network atlas and regulatory categories, conduct heritability enrichment analysis with genome-wide association studies (GWAS) summary statistics, identify relevant tissues, and estimate relevance correlation to depict common genetic factors acting in the shared regulatory networks between traits. Our method improves power upon existing approaches by associating SNPs with context-specific regulatory elements to assess heritability enrichments and by explicitly prioritizing gene regulations underlying relevant tissues. Ablation studies, independent data validation, and comparison experiments with existing methods on GWAS of six phenotypes show that SpecVar can improve heritability enrichment, accurately detect relevant tissues, and reveal causal regulations. Furthermore, SpecVar correlates the relevance patterns for pairs of phenotypes and better reveals shared SNP-associated regulations of phenotypes than existing methods. Studying GWAS of 206 phenotypes in UK Biobank demonstrates that SpecVar leverages the context-specific regulatory network atlas to prioritize phenotypes' relevant tissues and shared heritability for biological and therapeutic insights. SpecVar provides a powerful way to interpret SNPs via context-specific regulatory networks and is available at https://github.com/AMSSwanglab/SpecVar, copy archived at swh:1:rev:cf27438d3f8245c34c357ec5f077528e6befe829.


Subject(s)
Gene Regulatory Networks , Genome-Wide Association Study , Phenotype , Gene Expression Regulation , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide
13.
Nat Commun ; 13(1): 5566, 2022 09 29.
Article in English | MEDLINE | ID: mdl-36175411

ABSTRACT

Early cancer detection by cell-free DNA faces multiple challenges: low fraction of tumor cell-free DNA, molecular heterogeneity of cancer, and sample sizes that are not sufficient to reflect diverse patient populations. Here, we develop a cancer detection approach to address these challenges. It consists of an assay, cfMethyl-Seq, for cost-effective sequencing of the cell-free DNA methylome (with > 12-fold enrichment over whole genome bisulfite sequencing in CpG islands), and a computational method to extract methylation information and diagnose patients. Applying our approach to 408 colon, liver, lung, and stomach cancer patients and controls, at 97.9% specificity we achieve 80.7% and 74.5% sensitivity in detecting all-stage and early-stage cancer, and 89.1% and 85.0% accuracy for locating tissue-of-origin of all-stage and early-stage cancer, respectively. Our approach cost-effectively retains methylome profiles of cancer abnormalities, allowing us to learn new features and expand to other cancer types as training cohorts grow.


Subject(s)
Cell-Free Nucleic Acids , Stomach Neoplasms , Cell-Free Nucleic Acids/genetics , Cost-Benefit Analysis , Early Detection of Cancer , Epigenome , Humans , Stomach Neoplasms/diagnosis , Stomach Neoplasms/genetics
14.
iScience ; 25(8): 104790, 2022 Aug 19.
Article in English | MEDLINE | ID: mdl-35992073

ABSTRACT

Complex traits such as cardiovascular diseases (CVD) are the results of complicated processes jointly affected by genetic and environmental factors. Genome-wide association studies (GWAS) identified genetic variants associated with diseases but usually did not reveal the underlying mechanisms. There could be many intermediate steps at epigenetic, transcriptomic, and cellular scales inside the black box of genotype-phenotype associations. In this article, we present a machine-learning-based cross-scale framework GRPath to decipher putative causal paths (pcPaths) from genetic variants to disease phenotypes by integrating multiple omics data. Applying GRPath on CVD, we identified 646 and 549 pcPaths linking putative causal regions, variants, and gene expressions in specific cell types for two types of heart failure, respectively. The findings suggest new understandings of coronary heart disease. Our work promoted the modeling of tissue- and cell type-specific cross-scale regulation to uncover mechanisms behind disease-associated variants, and provided new findings on the molecular mechanisms of CVD.

15.
Science ; 377(6610): 1077-1085, 2022 09 02.
Article in English | MEDLINE | ID: mdl-35951677

ABSTRACT

Mammalian genomes have multiple enhancers spanning an ultralong distance (>megabases) to modulate important genes, but it is unclear how these enhancers coordinate to achieve this task. We combine multiplexed CRISPRi screening with machine learning to define quantitative enhancer-enhancer interactions. We find that the ultralong distance enhancer network has a nested multilayer architecture that confers functional robustness of gene expression. Experimental characterization reveals that enhancer epistasis is maintained by three-dimensional chromosomal interactions and BRD4 condensation. Machine learning prediction of synergistic enhancers provides an effective strategy to identify noncoding variant pairs associated with pathogenic genes in diseases beyond genome-wide association studies analysis. Our work unveils nested epistasis enhancer networks, which can better explain enhancer functions within cells and in diseases.


Subject(s)
Disease , Enhancer Elements, Genetic , Epistasis, Genetic , Machine Learning , Cell Cycle Proteins , Disease/genetics , Genome-Wide Association Study , Humans , K562 Cells , Nuclear Proteins/genetics , Transcription Factors/genetics
16.
Genome Biol ; 23(1): 114, 2022 05 16.
Article in English | MEDLINE | ID: mdl-35578363

ABSTRACT

Technological development has enabled the profiling of gene expression and chromatin accessibility from the same cell. We develop scREG, a dimension reduction methodology, based on the concept of cis-regulatory potential, for single cell multiome data. This concept is further used for the construction of subpopulation-specific cis-regulatory networks. The capability of inferring useful regulatory networks is demonstrated by the two-fold increase in network inference accuracy compared to the Pearson correlation-based method and the 27-fold enrichment of GWAS variants for inflammatory bowel disease in the cis-regulatory elements. The R package scREG provides comprehensive functions for single cell multiome data analysis.


Subject(s)
Chromatin , Regulatory Sequences, Nucleic Acid , Chromatin/genetics , Gene Expression , Gene Regulatory Networks , Single-Cell Analysis
17.
Genomics Proteomics Bioinformatics ; 20(3): 496-507, 2022 06.
Article in English | MEDLINE | ID: mdl-35293310

ABSTRACT

Although computational approaches have been complementing high-throughput biological experiments for the identification of functional regions in the human genome, it remains a great challenge to systematically decipher interactions between transcription factors (TFs) and regulatory elements to achieve interpretable annotations of chromatin accessibility across diverse cellular contexts. To solve this problem, we propose DeepCAGE, a deep learning framework that integrates sequence information and binding statuses of TFs, for the accurate prediction of chromatin accessible regions at a genome-wide scale in a variety of cell types. DeepCAGE takes advantage of a densely connected deep convolutional neural network architecture to automatically learn sequence signatures of known chromatin accessible regions and then incorporates such features with expression levels and binding activities of human core TFs to predict novel chromatin accessible regions. In a series of systematic comparisons with existing methods, DeepCAGE exhibits superior performance in not only the classification but also the regression of chromatin accessibility signals. In a detailed analysis of TF activities, DeepCAGE successfully extracts novel binding motifs and measures the contribution of a TF to the regulation with respect to a specific locus in a certain cell type. When applied to whole-genome sequencing data analysis, our method successfully prioritizes putative deleterious variants underlying a human complex trait and thus provides insights into the understanding of disease-associated genetic variants. DeepCAGE can be downloaded from https://github.com/kimmo1019/DeepCAGE.


Subject(s)
Chromatin Assembly and Disassembly , Chromatin , Deep Learning , Transcription Factors , Humans , Binding Sites , Chromatin/genetics , Chromatin/metabolism , Genome, Human , Protein Binding , Regulatory Sequences, Nucleic Acid , Transcription Factors/genetics , Transcription Factors/metabolism
18.
Proc Natl Acad Sci U S A ; 119(1), 2022 01 04.
Article in English | MEDLINE | ID: mdl-34930827

ABSTRACT

Abdominal aortic aneurysm (AAA) is a common degenerative cardiovascular disease whose pathobiology is not clearly understood. The cellular heterogeneity and cell-type-specific gene regulation of vascular cells in human AAA have not been well-characterized. Here, we performed analysis of whole-genome sequencing data in AAA patients versus controls with the aim of detecting disease-associated variants that may affect gene regulation in human aortic smooth muscle cells (AoSMC) and human aortic endothelial cells (HAEC), two cell types of high relevance to AAA disease. To support this analysis, we generated H3K27ac HiChIP data for these cell types and inferred cell-type-specific gene regulatory networks. We observed that AAA-associated variants were most enriched in regulatory regions in AoSMC, compared with HAEC and CD4+ cells. The cell-type-specific regulation defined by this HiChIP data supported the importance of ERG and the KLF family of transcription factors in AAA disease. The analysis of regulatory elements that contain noncoding variants and also are differentially open between AAA patients and controls revealed the significance of the interleukin-6-mediated signaling pathway. This finding was further validated by including information from the deleteriousness effect of nonsynonymous single-nucleotide variants in AAA patients and additional control data from the Medical Genome Reference Bank dataset. These results shed important insights into AAA pathogenesis and provide a model for cell-type-specific analysis of disease-associated variants.


Subject(s)
Aortic Aneurysm, Abdominal/genetics , Gene Regulatory Networks , Case-Control Studies , Cells, Cultured , Down-Regulation , Humans , Interleukin-6/metabolism , Kruppel-Like Transcription Factors/genetics , Transcriptional Regulator ERG/genetics
19.
Nat Commun ; 12(1): 4763, 2021 08 06.
Article in English | MEDLINE | ID: mdl-34362918

ABSTRACT

The comparison of gene regulatory networks between diseased and healthy individuals or between two different treatments is an important scientific problem. Here, we propose sc-compReg as a method for the comparative analysis of gene expression regulatory networks between two conditions using single cell gene expression (scRNA-seq) and single cell chromatin accessibility data (scATAC-seq). Our software, sc-compReg, can be used as a stand-alone package that provides joint clustering and embedding of the cells from both scRNA-seq and scATAC-seq, and the construction of differential regulatory networks across two conditions. We apply the method to compare the gene regulatory networks of an individual with chronic lymphocytic leukemia (CLL) versus a healthy control. The analysis reveals a tumor-specific B cell subpopulation in the CLL patient and identifies TOX2 as a potential regulator of this subpopulation.


Subject(s)
Gene Regulatory Networks , Leukemia, Lymphocytic, Chronic, B-Cell/genetics , Single-Cell Analysis/methods , B-Lymphocytes , Chromatin , Gene Expression Regulation, Neoplastic , HMGB Proteins , Humans , RNA, Small Cytoplasmic , Software
20.
Nat Commun ; 12(1): 4172, 2021 07 07.
Article in English | MEDLINE | ID: mdl-34234141

ABSTRACT

Cell-free DNA (cfDNA) is attractive for many applications, including detecting cancer, identifying the tissue of origin, and monitoring. A fundamental task underlying these applications is SNV calling from cfDNA, which is hindered by the very low tumor content. Thus sensitive and accurate detection of low-frequency mutations (<5%) remains challenging for existing SNV callers. Here we present cfSNV, a method incorporating multi-layer error suppression and hierarchical mutation calling, to address this challenge. Furthermore, by leveraging cfDNA's comprehensive coverage of tumor clonal landscape, cfSNV can profile mutations in subclones. In both simulated and real patient data, cfSNV outperforms existing tools in sensitivity while maintaining high precision. cfSNV enhances the clinical utilities of cfDNA by improving mutation detection performance in medium-depth sequencing data, therefore making Whole-Exome Sequencing a viable option. As an example, we demonstrate that the tumor mutation profile from cfDNA WES data can provide an effective biomarker to predict immunotherapy outcomes.


Subject(s)
Circulating Tumor DNA/genetics , DNA Mutational Analysis/methods , Exome Sequencing/methods , Immune Checkpoint Inhibitors/pharmacology , Neoplasms/genetics , Adult , Antibodies, Monoclonal, Humanized/pharmacology , Antibodies, Monoclonal, Humanized/therapeutic use , Biomarkers, Tumor/blood , Biomarkers, Tumor/genetics , Biopsy , Circulating Tumor DNA/blood , Computer Simulation , Datasets as Topic , Drug Resistance, Neoplasm/genetics , Female , Humans , Immune Checkpoint Inhibitors/therapeutic use , Male , Middle Aged , Mutation , Neoplasms/blood , Neoplasms/drug therapy , Neoplasms/mortality , Polymorphism, Single Nucleotide , Prognosis , Programmed Cell Death 1 Receptor/antagonists & inhibitors , Progression-Free Survival , Sensitivity and Specificity