Search | VHL Regional Portal

A Bayesian model based computational analysis of the relationship between bisulfite accessible single-stranded DNA in chromatin and somatic hypermutation of immunoglobulin genes.

Yu, Guojun; Wu, Yingru; Duan, Zhi; Tang, Catherine; Xing, Haipeng; Scharff, Matthew D; MacCarthy, Thomas.

PLoS Comput Biol ; 17(9): e1009323, 2021 Sep.

Article in English | MEDLINE | ID: mdl-34491985

ABSTRACT

The B cells in our body generate protective antibodies by introducing somatic hypermutations (SHM) into the variable region of immunoglobulin genes (IgVs). The mutations are generated by activation induced deaminase (AID) that converts cytosine to uracil in single stranded DNA (ssDNA) generated during transcription. Attempts have been made to correlate SHM with ssDNA using bisulfite to chemically convert cytosines that are accessible in the intact chromatin of mutating B cells. These studies have been complicated by the use of different definitions of "bisulfite accessible regions" (BARs). Recently, deep-sequencing has provided much larger datasets of such regions but computational methods are needed to enable this analysis. Here we leveraged the deep-sequencing approach with unique molecular identifiers and developed a novel Hidden Markov Model based Bayesian Segmentation algorithm to characterize the ssDNA regions in the IGHV4-34 gene of the human Ramos B cell line. Combining hierarchical clustering and our new Bayesian model, we identified recurrent BARs in certain subregions of both top and bottom strands of this gene. With this new system, the average size of BARs is about 15 bp. We also identified potential G-quadruplex DNA structures in this gene and found that the BARs co-locate with G-quadruplex structures in the opposite strand. Various correlation analyses show no direct site-to-site relationship between the bisulfite accessible ssDNA and all sites of SHM, but most of the highly AID mutated sites are within 15 bp of a BAR. In summary, we developed a novel platform to study single stranded DNA in chromatin at a base pair resolution that reveals potential relationships among BARs, SHM and G-quadruplexes. This platform could be applied to genome wide studies in the future.

Subjects

Bayes Theorem , Chromatin/chemistry , Computational Biology/methods , DNA, Single-Stranded/chemistry , Genes, Immunoglobulin , Mutation , Sulfites/chemistry , Cell Line , G-Quadruplexes , Humans

Deciphering hierarchical organization of topologically associated domains through change-point testing.

Xing, Haipeng; Wu, Yingru; Zhang, Michael Q; Chen, Yong.

BMC Bioinformatics ; 22(1): 183, 2021 Apr 10.

Article in English | MEDLINE | ID: mdl-33838653

ABSTRACT

BACKGROUND: The nucleus of eukaryotic cells spatially packages chromosomes into a hierarchical and distinct segregation that plays critical roles in maintaining transcription regulation. High-throughput methods of chromosome conformation capture, such as Hi-C, have revealed topologically associating domains (TADs) that are defined by biased chromatin interactions within them. RESULTS: We introduce a novel method, HiCKey, to decipher hierarchical TAD structures in Hi-C data and compare them across samples. We first derive a generalized likelihood-ratio (GLR) test for detecting change-points in an interaction matrix that follows a negative binomial distribution or general mixture distribution. We then employ several optimal search strategies to decipher hierarchical TADs with p values calculated by the GLR test. Large-scale validations of simulation data show that HiCKey has good precision in recalling known TADs and is robust against random collisions of chromatin interactions. By applying HiCKey to Hi-C data of seven human cell lines, we identified multiple layers of TAD organization among them, but the vast majority had no more than four layers. In particular, we found that TAD boundaries are significantly enriched in active chromosomal regions compared to repressed regions. CONCLUSIONS: HiCKey is optimized for processing large matrices constructed from high-resolution Hi-C experiments. The method and theoretical result of the GLR test provide a general framework for significance testing of similar experimental chromatin interaction data that may not fully follow negative binomial distributions but rather more general mixture distributions.

Subjects

Chromatin , Chromosomes , Cell Nucleus , Chromatin/genetics , Computer Simulation , Gene Expression Regulation , Humans
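
The generalized likelihood-ratio idea described in the abstract above can be illustrated with a deliberately simplified sketch. It is not the HiCKey method: it assumes a one-dimensional sequence of Poisson counts rather than a negative binomial or mixture-distributed interaction matrix, looks for a single change-point only, and does not compute p values. The function names and the `min_seg` parameter are hypothetical.

```python
import math

def poisson_loglik(total, n):
    """Maximised Poisson log-likelihood of n counts summing to `total`
    (the factorial term is dropped because it cancels in the ratio)."""
    if total == 0:
        return 0.0
    rate = total / n
    return total * math.log(rate) - n * rate

def glr_change_point(counts, min_seg=2):
    """Scan every admissible split k (left = counts[:k], right = counts[k:])
    and return (best_k, statistic), where the statistic is
    2 * (log-lik of two segments - log-lik of one segment)."""
    n = len(counts)
    total = sum(counts)
    null_ll = poisson_loglik(total, n)
    best_stat, best_k = 0.0, None
    left = sum(counts[:min_seg - 1])  # running sum of the left segment
    for k in range(min_seg, n - min_seg + 1):
        left += counts[k - 1]
        alt_ll = poisson_loglik(left, k) + poisson_loglik(total - left, n - k)
        stat = 2.0 * (alt_ll - null_ll)
        if stat > best_stat:
            best_stat, best_k = stat, k
    return best_k, best_stat

# Toy example: low counts followed by high counts.
split, statistic = glr_change_point([2, 3, 2, 3, 2, 20, 22, 19, 21, 20])
print(split, round(statistic, 1))
```

Because the statistic is maximised over all splits, its null distribution is not a plain chi-square, which is one reason a dedicated significance test is needed in practice.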

Statistical Surveillance of Structural Breaks in Credit Rating Dynamics.

Xing, Haipeng; Wang, Ke; Li, Zhi; Chen, Ying.

Entropy (Basel) ; 22(10), 2020 Sep 24.

Article in English | MEDLINE | ID: mdl-33286841

ABSTRACT

The 2007-2008 financial crisis had severe consequences on the global economy and an intriguing question related to the crisis is whether structural breaks in the credit market can be detected. To address this issue, we chose firms' credit rating transition dynamics as a proxy of the credit market and discuss how statistical process control tools can be used to surveil structural breaks in firms' rating transition dynamics. After reviewing some commonly used Markovian models for firms' rating transition dynamics, we present several surveillance rules for detecting changes in generators of firms' rating migration matrices, including the likelihood ratio rule, the generalized likelihood ratio rule, the extended Shiryaev's detection rule, and a Bayesian detection rule for piecewise homogeneous Markovian models. The effectiveness of these rules was analyzed on the basis of Monte Carlo simulations. We also provide a real example that used the surveillance rules to analyze and detect structural breaks in the monthly credit rating migration of U.S. firms from January 1986 to February 2017.
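
As a rough illustration of a likelihood-ratio surveillance rule of the kind listed in the abstract above, the sketch below runs a CUSUM-style recursion on observed rating transitions. It rests on simplifying assumptions that are not taken from the paper: discrete-time transition matrices instead of continuous-time generators, and pre-change and post-change matrices (`p0`, `p1`) that are fully known. All names are hypothetical.

```python
import math

def cusum_alarm(transitions, p0, p1, threshold):
    """Return the index of the first transition at which the CUSUM
    statistic reaches `threshold`, or None if it never does.

    transitions: iterable of (from_state, to_state) pairs
    p0, p1: pre-change and post-change transition matrices; every entry
            that can be observed must be strictly positive in both.
    """
    stat = 0.0
    for t, (i, j) in enumerate(transitions):
        # Add the log-likelihood ratio of this transition, reset at zero.
        stat = max(0.0, stat + math.log(p1[i][j] / p0[i][j]))
        if stat >= threshold:
            return t
    return None

# Two rating classes: ratings are sticky before the break, erratic after it.
p0 = [[0.9, 0.1], [0.1, 0.9]]
p1 = [[0.5, 0.5], [0.5, 0.5]]
observed = [(0, 0)] * 20 + [(0, 1), (1, 0), (0, 1), (1, 0)]
print(cusum_alarm(observed, p0, p1, threshold=4.0))
```

The threshold trades detection delay against false alarms; in a study such as the one described it would be calibrated by simulation rather than fixed by hand.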

A novel Bayesian change-point algorithm for genome-wide analysis of diverse ChIPseq data types.

Xing, Haipeng; Liao, Willey; Mo, Yifan; Zhang, Michael Q.

J Vis Exp ; (70): e4273, 2012 Dec 10.

Article in English | MEDLINE | ID: mdl-23271069

ABSTRACT

ChIPseq is a widely used technique for investigating protein-DNA interactions. Read density profiles are generated by using next-generation sequencing of protein-bound DNA and aligning the short reads to a reference genome. Enriched regions are revealed as peaks, which often differ dramatically in shape, depending on the target protein(1). For example, transcription factors often bind in a site- and sequence-specific manner and tend to produce punctate peaks, while histone modifications are more pervasive and are characterized by broad, diffuse islands of enrichment(2). Reliably identifying these regions was the focus of our work. Algorithms for analyzing ChIPseq data have employed various methodologies, from heuristics(3-5) to more rigorous statistical models, e.g. Hidden Markov Models (HMMs)(6-8). We sought a solution that minimized the necessity for difficult-to-define, ad hoc parameters that often compromise resolution and lessen the intuitive usability of the tool. With respect to HMM-based methods, we aimed to curtail parameter estimation procedures and simple, finite state classifications that are often utilized. Additionally, conventional ChIPseq data analysis involves categorization of the expected read density profiles as either punctate or diffuse followed by subsequent application of the appropriate tool. We further aimed to replace the need for these two distinct models with a single, more versatile model, which can capably address the entire spectrum of data types. To meet these objectives, we first constructed a statistical framework that naturally modeled ChIPseq data structures using a cutting edge advance in HMMs(9), which utilizes only explicit formulas, an innovation crucial to its performance advantages. More sophisticated than heuristic models, our HMM accommodates infinite hidden states through a Bayesian model. We applied it to identifying reasonable change points in read density, which further define segments of enrichment.
Our analysis revealed how our Bayesian Change Point (BCP) algorithm had a reduced computational complexity, evidenced by an abridged run time and memory footprint. The BCP algorithm was successfully applied to both punctate peak and diffuse island identification with robust accuracy and limited user-defined parameters. This illustrated both its versatility and ease of use. Consequently, we believe it can be implemented readily across broad ranges of data types and end users in a manner that is easily compared and contrasted, making it a great tool for ChIPseq data analysis that can aid in collaboration and corroboration between research groups. Here, we demonstrate the application of BCP to existing transcription factor(10,11) and epigenetic data(12) to illustrate its usefulness.

Subjects

Algorithms , Bayes Theorem , Genome-Wide Association Study/methods , Oligonucleotide Array Sequence Analysis/methods , DNA/chemistry , DNA/genetics , Data Interpretation, Statistical

Genome-wide localization of protein-DNA binding and histone modification by a Bayesian change-point method with ChIP-seq data.

Xing, Haipeng; Mo, Yifan; Liao, Will; Zhang, Michael Q.

PLoS Comput Biol ; 8(7): e1002613, 2012.

Article in English | MEDLINE | ID: mdl-22844240

ABSTRACT

Next-generation sequencing (NGS) technologies have matured considerably since their introduction and a focus has been placed on developing sophisticated analytical tools to deal with the amassing volumes of data. Chromatin immunoprecipitation sequencing (ChIP-seq), a major application of NGS, is a widely adopted technique for examining protein-DNA interactions and is commonly used to investigate epigenetic signatures of diffuse histone marks. These datasets have notoriously high variance and subtle levels of enrichment across large expanses, making them exceedingly difficult to define. Windows-based, heuristic models and finite-state hidden Markov models (HMMs) have been used with some success in analyzing ChIP-seq data but with lingering limitations. To improve the ability to detect broad regions of enrichment, we developed a stochastic Bayesian Change-Point (BCP) method, which addresses some of these unresolved issues. BCP makes use of recent advances in infinite-state HMMs by obtaining explicit formulas for posterior means of read densities. These posterior means can be used to categorize the genome into enriched and unenriched segments, as is customarily done, or examined for more detailed relationships since the underlying subpeaks are preserved rather than simplified into a binary classification. BCP performs a near exhaustive search of all possible change points between different posterior means at high-resolution to minimize the subjectivity of window sizes and is computationally efficient, due to a speed-up algorithm and the explicit formulas it employs. In the absence of a well-established "gold standard" for diffuse histone mark enrichment, we corroborated BCP's island detection accuracy and reproducibility using various forms of empirical evidence. 
We show that BCP is especially suited for analysis of diffuse histone ChIP-seq data but also effective in analyzing punctate transcription factor ChIP datasets, making it widely applicable for numerous experiment types.

Subjects

Chromatin Immunoprecipitation/methods , DNA/genetics , DNA/metabolism , Genome, Human , Histones/genetics , Histones/metabolism , Sequence Analysis, DNA/methods , Algorithms , Bayes Theorem , Binding Sites , DNA/chemistry , Epigenomics/methods , Histones/chemistry , Humans , Markov Chains , Reproducibility of Results , Transcription Factors/chemistry , Transcription Factors/genetics , Transcription Factors/metabolism

A Semiparametric Change-Point Regression Model for Longitudinal Observations.

Xing, Haipeng; Ying, Zhiliang.

J Am Stat Assoc ; 107(500), 2012 Dec 1.

Article in English | MEDLINE | ID: mdl-24288420

ABSTRACT

Many longitudinal studies involve relating an outcome process to a set of possibly time-varying covariates, giving rise to the usual regression models for longitudinal data. When the purpose of the study is to investigate the covariate effects when the experimental environment undergoes abrupt changes or to locate the periods with different levels of covariate effects, a simple and easy-to-interpret approach is to introduce change-points in regression coefficients. In this connection, we propose a semiparametric change-point regression model, in which the error process (stochastic component) is nonparametric and the baseline mean function (functional part) is completely unspecified, the observation times are allowed to be subject-specific, and the number, locations and magnitudes of change-points are unknown and need to be estimated. We further develop an estimation procedure which combines recent advances in semiparametric analysis based on counting process arguments with multiple change-point inference, and discuss its large sample properties, including consistency and asymptotic normality, under suitable regularity conditions. Simulation results show that the proposed methods work well under a variety of scenarios. An application to a real data set is also given.

Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays.

Chen, Hao; Xing, Haipeng; Zhang, Nancy R.

PLoS Comput Biol ; 7(1): e1001060, 2011 Jan 27.

Article in English | MEDLINE | ID: mdl-21298078

ABSTRACT

Chromosomal gains and losses comprise an important type of genetic change in tumors, and can now be assayed using microarray hybridization-based experiments. Most current statistical models for DNA copy number estimate total copy number, which does not distinguish between the underlying quantities of the two inherited chromosomes. This latter information, sometimes called parent specific copy number, is important for identifying allele-specific amplifications and deletions, for quantifying normal cell contamination, and for giving a more complete molecular portrait of the tumor. We propose a stochastic segmentation model for parent-specific DNA copy number in tumor samples, and give an estimation procedure that is computationally efficient and can be applied to data from the current high density genotyping platforms. The proposed method does not require matched normal samples, and can estimate the unknown genotypes simultaneously with the parent specific copy number. The new method is used to analyze 223 glioblastoma samples from the Cancer Genome Atlas (TCGA) project, giving a more comprehensive summary of the copy number events in these samples. Detailed case studies on these samples reveal the additional insights that can be gained from an allele-specific copy number analysis, such as the quantification of fractional gains and losses, the identification of copy neutral loss of heterozygosity, and the characterization of regions of simultaneous changes of both inherited chromosomes.

Subjects

DNA, Neoplasm/genetics , Gene Dosage , Parents , Alleles , Genotype , Humans , Markov Chains , Polymorphism, Single Nucleotide

Stochastic segmentation models for array-based comparative genomic hybridization data analysis.

Lai, Tze Leung; Xing, Haipeng; Zhang, Nancy.

Biostatistics ; 9(2): 290-307, 2008 Apr.

Article in English | MEDLINE | ID: mdl-17855472

ABSTRACT

Array-based comparative genomic hybridization (array-CGH) is a high throughput, high resolution technique for studying the genetics of cancer. Analysis of array-CGH data typically involves estimation of the underlying chromosome copy numbers from the log fluorescence ratios and segmenting the chromosome into regions with the same copy number at each location. For the analysis of array-CGH data, we propose a new stochastic segmentation model and an associated estimation procedure that has attractive statistical and computational properties. An important benefit of this Bayesian segmentation model is that it yields explicit formulas for posterior means, which can be used to estimate the signal directly without performing segmentation. Other quantities relating to the posterior distribution that are useful for providing confidence assessments of any given segmentation can also be estimated by using our method. We propose an approximation method whose computation time is linear in sequence length, which makes our method practically applicable to the new higher density arrays. Simulation studies and applications to real array-CGH data illustrate the advantages of the proposed approach.

Subjects

Bayes Theorem , Computational Biology/methods , Cytogenetic Analysis/methods , Oligonucleotide Array Sequence Analysis/methods , Oligonucleotide Probes/analysis , Chromosome Mapping/methods , Gene Dosage , Gene Expression Profiling/methods , Genomics/methods , Humans , Neoplasms/genetics , Sequence Analysis, DNA/methods
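
To make the phrase "explicit formulas for posterior means" in the abstract above more concrete, here is a minimal sketch of a forward filter for a piecewise-constant Gaussian signal. It is an illustration under stated assumptions, not the authors' estimator: at each position the level is assumed to jump with probability `p` to a fresh draw from N(`mu`, `v`), observations carry Gaussian noise of variance `sigma2`, all hyperparameters are taken as known, only the forward (filtering) pass is shown, and the `max_components` pruning is a hypothetical device for keeping the work per observation bounded.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def filtered_means(y, p, mu, v, sigma2, max_components=20):
    """Posterior mean of the current level given the data seen so far.

    Each mixture component (weight, mean, var) corresponds to one
    hypothesis about where the current segment started.
    """
    comps, out = [], []
    for obs in y:
        # Either an existing segment continues, or a new one starts from the prior.
        cands = [((1.0 - p) * w, m, s) for (w, m, s) in comps]
        cands.append((p if comps else 1.0, mu, v))
        updated = []
        for w, m, s in cands:
            like = normal_pdf(obs, m, s + sigma2)   # predictive density of obs
            gain = s / (s + sigma2)                 # conjugate normal update
            updated.append((w * like, m + gain * (obs - m), s * (1.0 - gain)))
        # Keep only the heaviest hypotheses, then renormalise the weights.
        updated.sort(key=lambda c: c[0], reverse=True)
        updated = updated[:max_components]
        total = sum(w for w, _, _ in updated)
        comps = [(w / total, m, s) for w, m, s in updated]
        out.append(sum(w * m for w, m, _ in comps))
    return out

# Toy log-ratio profile: a level near 0 followed by a level near 2.
signal = [0.1, -0.1] * 10 + [2.1, 1.9] * 10
estimates = filtered_means(signal, p=0.05, mu=0.0, v=4.0, sigma2=0.04)
print(round(estimates[19], 2), round(estimates[39], 2))
```

The estimate is a weighted average of closed-form segment means, so no explicit segmentation step is required to read off the signal.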
