1.
bioRxiv ; 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-37066352

ABSTRACT

Knowledge of locations and activities of cis-regulatory elements (CREs) is needed to decipher basic mechanisms of gene regulation and to understand the impact of genetic variants on complex traits. Previous studies identified candidate CREs (cCREs) using epigenetic features in one species, making comparisons difficult between species. In contrast, we conducted an interspecies study defining epigenetic states and identifying cCREs in blood cell types to generate regulatory maps that are comparable between species, using integrative modeling of eight epigenetic features jointly in human and mouse in our ValIdated Systematic IntegratiON (VISION) Project. The resulting catalogs of cCREs are useful resources for further studies of gene regulation in blood cells, indicated by high overlap with known functional elements and strong enrichment for human genetic variants associated with blood cell phenotypes. The contribution of each epigenetic state in cCREs to gene regulation, inferred from a multivariate regression, was used to estimate epigenetic state Regulatory Potential (esRP) scores for each cCRE in each cell type, which were used to categorize dynamic changes in cCREs. Groups of cCREs displaying similar patterns of regulatory activity in human and mouse cell types, obtained by joint clustering on esRP scores, harbored distinctive transcription factor binding motifs that were similar between species. An interspecies comparison of cCREs revealed both conserved and species-specific patterns of epigenetic evolution. Finally, we showed that comparisons of the epigenetic landscape between species can reveal elements with similar roles in regulation, even in the absence of genomic sequence alignment.

2.
bioRxiv ; 2023 Jun 09.
Article in English | MEDLINE | ID: mdl-37333291

ABSTRACT

Spatial transcriptomics (ST) profiles gene expression in intact tissues. However, ST data measured at each spatial location may represent gene expression of multiple cell types, making it difficult to identify cell-type-specific transcriptional variation across spatial contexts. Existing cell-type deconvolutions of ST data often require single-cell transcriptomic references, which can be limited by availability, completeness and platform effect of such references. We present RETROFIT, a reference-free Bayesian method that produces sparse and interpretable solutions to deconvolve cell types underlying each location independent of single-cell transcriptomic references. Results from synthetic and real ST datasets acquired by Slide-seq and Visium platforms demonstrate that RETROFIT outperforms existing reference-based and reference-free methods in estimating cell-type composition and reconstructing gene expression. Applying RETROFIT to human intestinal development ST data reveals spatiotemporal patterns of cellular composition and transcriptional specificity. RETROFIT is available at https://bioconductor.org/packages/release/bioc/html/retrofit.html.

3.
PLoS Comput Biol ; 19(1): e1010758, 2023 01.
Article in English | MEDLINE | ID: mdl-36607897

ABSTRACT

Inferring gene co-expression networks is a useful process for understanding gene regulation and pathway activity. The networks are usually undirected graphs where genes are represented as nodes and an edge represents a significant co-expression relationship. When expression data of multiple (p) genes in multiple (K) conditions (e.g., treatments, tissues, strains) are available, joint estimation of networks harnessing shared information across them can significantly increase the power of analysis. In addition, examining condition-specific patterns of co-expression can provide insights into the underlying cellular processes activated in a particular condition. Condition adaptive fused graphical lasso (CFGL) is an existing method that incorporates condition specificity in a fused graphical lasso (FGL) model for estimating multiple co-expression networks. However, with computational complexity of O(p²K log K), the current implementation of CFGL is prohibitively slow even for a moderate number of genes and can only be used for a maximum of three conditions. In this paper, we propose a faster alternative to CFGL named rapid condition adaptive fused graphical lasso (RCFGL). In RCFGL, we incorporate the condition specificity into another popular model for joint network estimation, known as fused multiple graphical lasso (FMGL). We use a more efficient algorithm in the iterative steps compared to CFGL, enabling faster computation with complexity of O(p²K) and making it easily generalizable for more than three conditions. We also present a novel screening rule to determine if the full network estimation problem can be broken down into estimation of smaller disjoint sub-networks, thereby reducing the complexity further. We demonstrate the computational advantage and superior performance of our method compared to two non-condition adaptive methods, FGL and FMGL, and one condition adaptive method, CFGL, in both simulation study and real data analysis. We used RCFGL to jointly estimate the gene co-expression networks in different brain regions (conditions) using a cohort of heterogeneous stock rats. We also provide an accommodating C- and Python-based package that implements RCFGL.


Subject(s)
Algorithms , Brain , Animals , Rats , Computer Simulation , Gene Regulatory Networks/genetics
4.
Biometrics ; 79(3): 2272-2285, 2023 09.
Article in English | MEDLINE | ID: mdl-36056911

ABSTRACT

High-throughput biological experiments are essential tools for identifying biologically interesting candidates in large-scale omics studies. The results of a high-throughput biological experiment rely heavily on the operational factors chosen in its experimental and data-analytic procedures. Understanding how these operational factors influence the reproducibility of the experimental outcome is critical for selecting the optimal parameter settings and designing reliable high-throughput workflows. However, the influence of an operational factor may differ between strong and weak candidates in a high-throughput experiment, complicating the selection of parameter settings. To address this issue, we propose a novel segmented regression model, called segmented correspondence curve regression, to assess the influence of operational factors on the reproducibility of high-throughput experiments. Our model dissects the heterogeneous effects of operational factors on strong and weak candidates, providing a principled way to select operational parameters. Based on this framework, we also develop a sup-likelihood ratio test for the existence of heterogeneity. Simulation studies show that our estimation and testing procedures yield well-calibrated type I errors and are substantially more powerful in detecting and locating the differences in reproducibility across workflows than the existing method. Using this model, we investigated an important design question for ChIP-seq experiments: How many reads should one sequence to obtain reliable results in a cost-effective way? Our results reveal new insights into the impact of sequencing depth on the binding-site identification reproducibility, helping biologists determine the most cost-effective sequencing depth to achieve sufficient reproducibility for their study goals.


Subject(s)
High-Throughput Nucleotide Sequencing , Reproducibility of Results , Computer Simulation , High-Throughput Nucleotide Sequencing/methods
5.
Nat Commun ; 13(1): 6874, 2022 11 12.
Article in English | MEDLINE | ID: mdl-36371401

ABSTRACT

Joint analyses of genomic datasets obtained in multiple different conditions are essential for understanding the biological mechanism that drives tissue-specificity and cell differentiation, but they still remain computationally challenging. To address this we introduce CLIMB (Composite LIkelihood eMpirical Bayes), a statistical methodology that learns patterns of condition-specificity present in genomic data. CLIMB provides a generic framework facilitating a host of analyses, such as clustering genomic features sharing similar condition-specific patterns and identifying which of these features are involved in cell fate commitment. We apply CLIMB to three sets of hematopoietic data, which examine CTCF ChIP-seq measured in 17 different cell populations, RNA-seq measured across constituent cell populations in three committed lineages, and DNase-seq in 38 cell populations. Our results show that CLIMB improves upon existing alternatives in statistical precision, while capturing interpretable and biologically relevant clusters in the data.


Subject(s)
Genome , Genomics , Bayes Theorem , Cluster Analysis , Sequence Analysis, DNA/methods
6.
Tohoku J Exp Med ; 258(3): 225-236, 2022 Oct 26.
Article in English | MEDLINE | ID: mdl-36047132

ABSTRACT

The therapeutic effects and mechanisms of action of total glucosides of paeony (TGP) in treating ulcerative colitis remain to be clarified. A mouse model of ulcerative colitis was treated with TGP, and indexes including scores of disease activity index, gross morphologic damage and histological damage, and inflammatory and oxidative stress markers were determined. Patients with ulcerative colitis received TGP capsule therapy, and indexes including efficacy of colonoscopy and histology, scores of Ulcerative Colitis Activity Index (UCAI) and Short Inflammatory Bowel Disease Questionnaire (SIBDQ), and inflammatory parameters were assessed. The expressions of toll-like receptor 4 (TLR4) and nuclear factor-kappa B (NF-κB) were measured in colonic tissues of mice and patients. TGP treatment significantly increased weight, decreased scores of disease activity index, gross morphologic damage and histological damage, and reduced the levels of tumor necrosis factor-α, interleukin-1β, malondialdehyde and myeloperoxidase in the mouse model. Patients treated with TGP capsule had significantly higher relief rates of diarrhea, abdominal pain, and bloody purulent stool, decreased UCAI and increased SIBDQ scores, and lower levels of erythrocyte sedimentation rate, C-reactive protein and CD4+/CD8+ T-cell ratio than those patients with routine therapy. The overall response rate of TGP capsule was significantly higher than that of routine therapy. TGP treatment significantly suppressed the expressions of TLR4 and NF-κB in colonic tissues of both the mouse model and patients with UC. TGP shows a good therapeutic effect on ulcerative colitis in animals and human patients, and the underlying mechanisms may be related to the inhibition of TLR4/NF-κB signaling by TGP.


Subject(s)
Colitis, Ulcerative , Glucosides , Paeonia , Animals , Humans , C-Reactive Protein , Colitis, Ulcerative/drug therapy , Glucosides/pharmacology , Glucosides/therapeutic use , Interleukin-1beta , Malondialdehyde , NF-kappa B/metabolism , Paeonia/chemistry , Peroxidase/metabolism , Signal Transduction , Toll-Like Receptor 4/metabolism , Tumor Necrosis Factor-alpha/metabolism , Mice
7.
Nutrients ; 14(8)2022 Apr 08.
Article in English | MEDLINE | ID: mdl-35458125

ABSTRACT

Vitamin A (VA) deficiency and diarrheal diseases are both serious public health issues worldwide. VA deficiency is associated with impaired intestinal barrier function and increased risk of mucosal infection-related mortality. The bioactive form of VA, retinoic acid, is a well-known regulator of mucosal integrity. Using Citrobacter rodentium-infected mice as a model for diarrheal diseases in humans, previous studies showed that VA-deficient (VAD) mice failed to clear C. rodentium as compared to their VA-sufficient (VAS) counterparts. However, the distinct intestinal gene responses that are dependent on the host's VA status still need to be discovered. The mRNAs extracted from the small intestine (SI) and the colon were sequenced and analyzed on three levels: differential gene expression, enrichment, and co-expression. C. rodentium infection interacted differentially with VA status to alter colon gene expression. Novel functional categories downregulated by this pathogen were identified, highlighted by genes related to the metabolism of VA, vitamin D, and ion transport, including improper upregulation of Cl⁻ secretion and disrupted HCO₃⁻ metabolism. Our results suggest that derangement of micronutrient metabolism and ion transport, together with the compromised immune responses in VAD hosts, may be responsible for the higher mortality to C. rodentium under conditions of inadequate VA.


Subject(s)
Enterobacteriaceae Infections , Vitamin A Deficiency , Animals , Citrobacter rodentium , Colon/metabolism , Diarrhea/complications , Intestinal Mucosa/metabolism , Intestine, Small/metabolism , Mice , Mice, Inbred C57BL , Vitamin A/metabolism , Vitamin A Deficiency/complications
8.
Stat Med ; 41(10): 1884-1899, 2022 05 10.
Article in English | MEDLINE | ID: mdl-35178743

ABSTRACT

High-throughput experiments are an essential part of modern biological and biomedical research. The outcomes of high-throughput biological experiments often have a lot of missing observations due to signals below detection levels. For example, most single-cell RNA-seq (scRNA-seq) protocols experience high levels of dropout due to the small amount of starting material, leading to a majority of reported expression levels being zero. Though missing data contain information about reproducibility, they are often excluded in the reproducibility assessment, potentially generating misleading assessments. In this article, we develop a regression model to assess how the reproducibility of high-throughput experiments is affected by the choices of operational factors (eg, platform or sequencing depth) when a large number of measurements are missing. Using a latent variable approach, we extend correspondence curve regression, a recently proposed method for assessing the effects of operational factors on reproducibility, to incorporate missing values. Using simulations, we show that our method is more accurate in detecting differences in reproducibility than existing measures of reproducibility. We illustrate the usefulness of our method using a single-cell RNA-seq dataset collected on HCT116 cells. We compare the reproducibility of different library preparation platforms and study the effect of sequencing depth on reproducibility, thereby determining the cost-effective sequencing depth that is required to achieve sufficient reproducibility.


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Humans , Reproducibility of Results , Sequence Analysis, RNA , Single-Cell Analysis/methods
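
Entry 8 extends correspondence curve regression, which is built on the correspondence curve of two replicate experiments. The abstract does not define that curve, so the sketch below uses a definition assumed from the general reproducibility literature: for a fraction t, the curve value is the share of all candidates that rank in the top t of both replicates. The function names are hypothetical, and the sketch handles complete data only; it does not implement the regression or the missing-value extension described in the abstract.

```python
import math


def ranks_descending(scores):
    """Rank 1 goes to the largest score; ties are broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def correspondence_curve(x, y, ts):
    """Empirical correspondence curve for paired replicate scores x and y.

    For each fraction t in ts, returns the proportion of candidates that
    fall in the top floor(t * n) ranks of BOTH replicates (assumed definition).
    """
    if len(x) != len(y):
        raise ValueError("replicates must score the same candidates")
    n = len(x)
    rank_x = ranks_descending(x)
    rank_y = ranks_descending(y)
    curve = []
    for t in ts:
        # Small epsilon guards against floating-point error in t * n.
        cutoff = math.floor(t * n + 1e-9)
        shared = sum(
            1 for rx, ry in zip(rank_x, rank_y) if rx <= cutoff and ry <= cutoff
        )
        curve.append(shared / n)
    return curve
```

Under this definition, perfectly concordant replicates give a curve equal to t, while discordant replicates fall below it; a regression on such curves could then relate the shortfall to operational factors such as sequencing depth.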
9.
Methods Mol Biol ; 2301: 17-37, 2022.
Article in English | MEDLINE | ID: mdl-34415529

ABSTRACT

Hi-C experiments are costly to perform and involve multiple complex experimental steps. Reproducibility of Hi-C data is essential for ensuring the validity of the scientific conclusions drawn from the data. In this chapter, we describe several recently developed computational methods for assessing reproducibility of Hi-C replicate experiments. These methods can also be used to assess the similarity between any two Hi-C samples.


Subject(s)
Reproducibility of Results , Software
10.
J Nutr Biochem ; 98: 108814, 2021 12.
Article in English | MEDLINE | ID: mdl-34242724

ABSTRACT

Vitamin A (VA) deficiency remains prevalent in resource-limited areas. Using Citrobacter rodentium infection in mice as a model for diarrheal diseases, previous reports showed reduced pathogen clearance and survival due to vitamin A deficient (VAD) status. To characterize the impact of preexisting VA deficiency on gene expression patterns in the intestines, and to discover novel target genes in VA-related biological pathways, VA deficiency in mice was induced by diet. Total mRNAs were extracted from small intestine (SI) and colon, and sequenced. Differentially Expressed Gene (DEG), Gene Ontology (GO) enrichment, and co-expression network analyses were performed. DEGs compared between VAS and VAD groups detected 49 SI and 94 colon genes. By GO information, SI DEGs were significantly enriched in categories relevant to retinoid metabolic process, molecule binding, and immune function. Three co-expression modules showed significant correlation with VA status in SI; these modules contained four known retinoic acid targets. In addition, other SI genes of interest (e.g., Mbl2, Cxcl14, and Nr0b2) in these modules were suggested as new candidate genes regulated by VA. Furthermore, our analysis showed that markers of two cell types in SI, mast cells and Tuft cells, were significantly altered by VA status. In colon, "cell division" was the only enriched category and was negatively associated with VA. Thus, these data suggested that SI and colon have distinct networks under the regulation of dietary VA, and that preexisting VA deficiency could have a significant impact on the host response to a variety of disease conditions.


Subject(s)
Colon/metabolism , Intestine, Small/metabolism , RNA-Seq/methods , Vitamin A Deficiency/genetics , Animals , Citrobacter rodentium , Enterobacteriaceae Infections/genetics , Enterobacteriaceae Infections/microbiology , Gene Expression Profiling/methods , Gene Ontology , Mice , Mice, Inbred C57BL , RNA, Messenger/genetics , Transcriptome , Tretinoin/metabolism , Vitamin A/genetics , Vitamin A/metabolism
11.
Nat Commun ; 12(1): 1964, 2021 03 30.
Article in English | MEDLINE | ID: mdl-33785739

ABSTRACT

Genome-wide association meta-analysis (GWAMA) is an effective approach to enlarge sample sizes and empower the discovery of novel associations between genotype and phenotype. Independent replication has been used as a gold-standard for validating genetic associations. However, as current GWAMA often seeks to aggregate all available datasets, it becomes impossible to find a large enough independent dataset to replicate new discoveries. Here we introduce a method, MAMBA (Meta-Analysis Model-based Assessment of replicability), for assessing the "posterior-probability-of-replicability" for identified associations by leveraging the strength and consistency of association signals between contributing studies. We demonstrate using simulations that MAMBA is more powerful and robust than existing methods, and produces more accurate genetic effects estimates. We apply MAMBA to a large-scale meta-analysis of addiction phenotypes with 1.2 million individuals. In addition to accurately identifying replicable common variant associations, MAMBA also pinpoints novel replicable rare variant associations from imputation-based GWAMA and hence greatly expands the set of analyzable variants.


Subject(s)
Algorithms , Computational Biology/methods , Genome-Wide Association Study/methods , Meta-Analysis as Topic , Models, Genetic , Polymorphism, Single Nucleotide , Genetic Association Studies/methods , Genotype , Phenotype , Reproducibility of Results , Sample Size , Software
12.
Genome Res ; 30(3): 472-484, 2020 03.
Article in English | MEDLINE | ID: mdl-32132109

ABSTRACT

Thousands of epigenomic data sets have been generated in the past decade, but it is difficult for researchers to effectively use all the data relevant to their projects. Systematic integrative analysis can help meet this need, and the VISION project was established for validated systematic integration of epigenomic data in hematopoiesis. Here, we systematically integrated extensive data recording epigenetic features and transcriptomes from many sources, including individual laboratories and consortia, to produce a comprehensive view of the regulatory landscape of differentiating hematopoietic cell types in mouse. By using IDEAS as our integrative and discriminative epigenome annotation system, we identified and assigned epigenetic states simultaneously along chromosomes and across cell types, precisely and comprehensively. Combining nuclease accessibility and epigenetic states produced a set of more than 200,000 candidate cis-regulatory elements (cCREs) that efficiently capture enhancers and promoters. The transitions in epigenetic states of these cCREs across cell types provided insights into mechanisms of regulation, including decreases in numbers of active cCREs during differentiation of most lineages, transitions from poised to active or inactive states, and shifts in nuclease accessibility of CTCF-bound elements. Regression modeling of epigenetic states at cCREs and gene expression produced a versatile resource to improve selection of cCREs potentially regulating target genes. These resources are available from our VISION website to aid research in genomics and hematopoiesis.


Subject(s)
Epigenesis, Genetic , Hematopoiesis/genetics , Hematopoietic Stem Cells/metabolism , Animals , Mice , Regulatory Elements, Transcriptional , Transcriptome
13.
Nucleic Acids Res ; 48(8): e43, 2020 05 07.
Article in English | MEDLINE | ID: mdl-32086521

ABSTRACT

Quantitative comparison of epigenomic data across multiple cell types or experimental conditions is a promising way to understand the biological functions of epigenetic modifications. However, differences in sequencing depth and signal-to-noise ratios in the data from different experiments can hinder our ability to identify real biological variation from raw epigenomic data. Proper normalization is required prior to data analysis to gain meaningful insights. Most existing methods for data normalization standardize signals by rescaling either background regions or peak regions, assuming that the same scale factor is applicable to both background and peak regions. While such methods adjust for differences in sequencing depths, they do not address differences in the signal-to-noise ratios across different experiments. We developed a new data normalization method, called S3norm, that normalizes the sequencing depths and signal-to-noise ratios across different data sets simultaneously by a monotonic nonlinear transformation. We show empirically that the epigenomic data normalized by our method, compared to existing methods, can better capture real biological variation, such as impact on gene expression regulation.


Subject(s)
Epigenomics/methods , Sequence Analysis, DNA/methods , Gene Expression , Histone Code , RNA-Seq , Software
14.
J Healthc Inform Res ; 4(1): 91-109, 2020 Mar.
Article in English | MEDLINE | ID: mdl-35415437

ABSTRACT

With wearable, relatively unobtrusive health monitors and smartphone sensors, it is increasingly easy to collect continuously streaming physiological data in a passive mode without placing much burden on participants. At the same time, smartphones provide the ability to survey participants to provide "ground-truth" reporting on psychological states, although this comes at an increased cost in participant burden. In this paper, we examined how analytical approaches from the field of machine learning could allow us to distill the collected physiological data into actionable decision rules about each individual's psychological state, with the eventual goal of identifying important psychological states (e.g., risk moments) without the need for ongoing burdensome active assessment (e.g., self-report). As a first step towards this goal, we compared two methods for predicting low and high states of affective arousal: (1) a k-nearest neighbor classifier that uses dynamic time warping distance, and (2) a random forests classifier based on features extracted using the tsfresh python package. Then, we compared random-forest-based predictive models tailored for the individual with individual-general models. Results showed that the individual-specific model outperformed the general one. Our results support the feasibility of using passively collected wearable data to predict psychological states, suggesting that by relying on both types of data, the active collection can be reduced or eliminated.
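
The first method named in entry 14, a k-nearest neighbor classifier with dynamic time warping (DTW) distance, can be sketched in a few lines. This is a generic textbook version, not the authors' implementation: the absolute-difference cost, the unconstrained warping window, and the function names `dtw_distance` and `knn_predict` are all assumptions made for illustration.

```python
from collections import Counter
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Uses the classic O(len(a) * len(b)) recurrence with |a_i - b_j| as the
    local cost and no warping-window constraint (both are assumptions).
    """
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] aligned to an already-used b[j-1]
                cost[i][j - 1],      # b[j-1] aligned to an already-used a[i-1]
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


def knn_predict(train, query, k=1):
    """Label a query series by majority vote among its k DTW-nearest neighbors.

    train is a list of (series, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: dtw_distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Because DTW tolerates stretching along the time axis, two arousal traces of different lengths can still be compared directly, which is one plausible reason to pair it with a nearest-neighbor rule on raw sensor streams.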

15.
IUBMB Life ; 72(1): 27-38, 2020 01.
Article in English | MEDLINE | ID: mdl-31769130

ABSTRACT

Members of the GATA family of transcription factors play key roles in the differentiation of specific cell lineages by regulating the expression of target genes. Three GATA factors play distinct roles in hematopoietic differentiation. In order to better understand how these GATA factors function to regulate genes throughout the genome, we are studying the epigenomic and transcriptional landscapes of hematopoietic cells in a model-driven, integrative fashion. We have formed the collaborative multi-lab VISION project to conduct ValIdated Systematic IntegratiON of epigenomic data in mouse and human hematopoiesis. The epigenomic data included nuclease accessibility in chromatin, CTCF occupancy, and histone H3 modifications for 20 cell types covering hematopoietic stem cells, multilineage progenitor cells, and mature cells across the blood cell lineages of mouse. The analysis used the Integrative and Discriminative Epigenome Annotation System (IDEAS), which learns all common combinations of features (epigenetic states) simultaneously in two dimensions: along chromosomes and across cell types. The result is a segmentation that effectively paints the regulatory landscape in readily interpretable views, revealing constitutively active or silent loci as well as the loci specifically induced or repressed in each stage and lineage. Nuclease accessible DNA segments in active chromatin states were designated candidate cis-regulatory elements in each cell type, providing one of the most comprehensive registries of candidate hematopoietic regulatory elements to date. Applications of VISION resources are illustrated for the regulation of genes encoding GATA1, GATA2, GATA3, and Ikaros. VISION resources are freely available from our website http://usevision.org.


Subject(s)
Chromatin/metabolism , Epigenome , GATA Transcription Factors/metabolism , Gene Expression Regulation , Hematopoiesis , Hematopoietic Stem Cells/cytology , Hematopoietic Stem Cells/metabolism , Animals , Cell Differentiation , Chromatin/genetics , GATA Transcription Factors/genetics , Humans
16.
Genome Biol ; 20(1): 282, 2019 12 18.
Article in English | MEDLINE | ID: mdl-31847870

ABSTRACT

The spatial organization of chromatin in the nucleus has been implicated in regulating gene expression. Maps of high-frequency interactions between different segments of chromatin have revealed topologically associating domains (TADs), within which most of the regulatory interactions are thought to occur. TADs are not homogeneous structural units but appear to be organized into a hierarchy. We present OnTAD, an optimized nested TAD caller from Hi-C data, to identify hierarchical TADs. OnTAD reveals new biological insights into the role of different TAD levels, boundary usage in gene regulation, the loop extrusion model, and compartmental domains. OnTAD is available at https://github.com/anlin00007/OnTAD.


Subject(s)
Chromatin Assembly and Disassembly , Chromatin/metabolism , Algorithms , Epigenesis, Genetic , Genomics , Software
17.
Genome Biol ; 20(1): 57, 2019 03 19.
Article in English | MEDLINE | ID: mdl-30890172

ABSTRACT

BACKGROUND: Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study. RESULTS: Using real and simulated data, we profile the performance of several recently proposed methods for assessing reproducibility of population Hi-C data, including HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep. By explicitly controlling noise and sparsity through simulations, we demonstrate the deficiencies of performing simple correlation analysis on pairs of matrices, and we show that methods developed specifically for Hi-C data produce better measures of reproducibility. We also show how to use established measures, such as the ratio of intra- to interchromosomal interactions, and novel ones, such as QuASAR-QC, to identify low-quality experiments. CONCLUSIONS: In this work, we assess reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices. Through this extensive validation and benchmarking of Hi-C data, we describe best practices for reproducibility and quality assessment of Hi-C experiments. We make all software publicly available at http://github.com/kundajelab/3DChromatin_ReplicateQC to facilitate adoption in the community.


Subject(s)
Genomics/standards , High-Throughput Nucleotide Sequencing/standards , Neoplasms/genetics , Quality Control , Software , Humans , Reproducibility of Results , Tumor Cells, Cultured
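
Entry 17 lists the ratio of intra- to interchromosomal interactions as an established Hi-C quality measure. The sketch below shows the arithmetic only; the input layout (a list of `(chrom1, chrom2, count)` tuples) and the function name are assumptions, and no pass or fail threshold is implied because the abstract does not give one.

```python
def intra_inter_ratio(contacts):
    """Ratio of intrachromosomal to interchromosomal contact counts.

    contacts is an iterable of (chrom1, chrom2, count) tuples; this layout
    is a hypothetical input format chosen for the example.
    """
    intra = 0
    inter = 0
    for chrom1, chrom2, count in contacts:
        if chrom1 == chrom2:
            intra += count  # both read ends map to the same chromosome
        else:
            inter += count  # read ends map to different chromosomes
    if inter == 0:
        raise ValueError("no interchromosomal contacts; ratio is undefined")
    return intra / inter
```

For example, `intra_inter_ratio([("chr1", "chr1", 80), ("chr1", "chr2", 15), ("chr2", "chr2", 20), ("chr2", "chrX", 5)])` sums 100 intrachromosomal and 20 interchromosomal contacts. A higher ratio is generally read as a sign of less random ligation noise, though the cutoff that counts as acceptable depends on the protocol.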
18.
PLoS Comput Biol ; 14(11): e1006571, 2018 11.
Article in English | MEDLINE | ID: mdl-30485278

ABSTRACT

Sequencing of the T cell receptor (TCR) repertoire is a powerful tool for deeper study of immune response, but the unique structure of this type of data makes its meaningful quantification challenging. We introduce a new method, the Gamma-GPD spliced threshold model, to address this difficulty. This biologically interpretable model captures the distribution of the TCR repertoire, demonstrates stability across varying sequencing depths, and permits comparative analysis across any number of sampled individuals. We apply our method to several datasets and obtain insights regarding the differentiating features in the T cell receptor repertoire among sampled individuals across conditions. We have implemented our method in the open-source R package powerTCR.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Immune System , Receptors, Antigen, T-Cell/genetics , Alternative Splicing , Animals , Brain Neoplasms/metabolism , CD4-Positive T-Lymphocytes/cytology , Clone Cells , Cluster Analysis , Computer Simulation , Glioblastoma/metabolism , Humans , Likelihood Functions , Lung/metabolism , Mice , Programming Languages , Receptors, Antigen, T-Cell/chemistry , Sarcoidosis/metabolism , Software
19.
PLoS Comput Biol ; 14(9): e1006436, 2018 09.
Article in English | MEDLINE | ID: mdl-30240439

ABSTRACT

Co-expression network analysis provides useful information for studying gene regulation in biological processes. Examining condition-specific patterns of co-expression can provide insights into the underlying cellular processes activated in a particular condition. One challenge in this type of analysis is that the sample sizes in each condition are usually small, making the statistical inference of co-expression patterns highly underpowered. A joint network construction that borrows information from related structures across conditions has the potential to improve the power of the analysis. One possible approach to constructing the co-expression network is to use the Gaussian graphical model. Though several methods are available for joint estimation of multiple graphical models, they do not fully account for the heterogeneity between samples and between co-expression patterns introduced by condition specificity. Here we develop the condition-adaptive fused graphical lasso (CFGL), a data-driven approach to incorporate condition specificity in the estimation of co-expression networks. We show that this method improves the accuracy with which networks are learned. The application of this method on a rat multi-tissue dataset and The Cancer Genome Atlas (TCGA) breast cancer dataset provides interesting biological insights. In both analyses, we identify numerous modules enriched for Gene Ontology functions and observe that the modules that are upregulated in a particular condition are often involved in condition-specific activities. Interestingly, we observe that the genes strongly associated with survival time in the TCGA dataset are less likely to be network hubs, suggesting that genes associated with cancer progression are likely to govern specific functions or execute final biological functions in pathways, rather than regulating a large number of biological processes. Additionally, we observed that the tumor-specific hub genes tend to have few shared edges with normal tissue, revealing a tumor-specific regulatory mechanism.


Subject(s)
Brain/metabolism , Breast Neoplasms/metabolism , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Myocardium/metabolism , Algorithms , Animals , Area Under Curve , Breast Neoplasms/genetics , Computer Graphics , Computer Simulation , Databases, Factual , Female , Heart , Humans , Male , Neoplasms/metabolism , Normal Distribution , Rats , Software
20.
Biometrics ; 74(3): 803-813, 2018 09.
Article in English | MEDLINE | ID: mdl-29192968

ABSTRACT

The outcome of high-throughput biological experiments is affected by many operational factors in the experimental and data-analytical procedures. Understanding how these factors affect the reproducibility of the outcome is critical for establishing workflows that produce replicable discoveries. In this article, we propose a regression framework, based on a novel cumulative link model, to assess the covariate effects of operational factors on the reproducibility of findings from high-throughput experiments. In contrast to existing graphical approaches, our method allows one to succinctly characterize the simultaneous and independent effects of covariates on reproducibility and to compare reproducibility while controlling for potential confounding variables. We also establish a connection between our model and certain Archimedean copula models. This connection not only offers our regression framework an interpretation in copula models, but also provides guidance on choosing the functional forms of the regression. Furthermore, it also opens a new way to interpret and utilize these copulas in the context of reproducibility. Using simulations, we show that our method produces calibrated type I error and is more powerful in detecting difference in reproducibility than existing measures of agreement. We illustrate the usefulness of our method using a ChIP-seq study and a microarray study.


Subject(s)
Confounding Factors, Epidemiologic , High-Throughput Screening Assays/statistics & numerical data , Regression Analysis , Algorithms , Binding Sites , CCCTC-Binding Factor/chemistry , Calibration , Computer Simulation , Gene Expression Profiling/statistics & numerical data , High-Throughput Screening Assays/standards , Humans , Microarray Analysis/statistics & numerical data , Models, Statistical , Reproducibility of Results