Results 1 - 20 of 29
1.
Elife ; 11, 2022 Mar 14.
Article in English | MEDLINE | ID: mdl-35285799

ABSTRACT

The mammalian circadian clock exerts control of daily gene expression through cycles of DNA binding. Here, we develop a quantitative model of how a finite pool of BMAL1 protein can regulate thousands of target sites over daily time scales. We used quantitative imaging to track dynamic changes in endogenous labelled proteins across peripheral tissues and the SCN. We determine the contribution of multiple rhythmic processes coordinating BMAL1 DNA binding, including cycling molecular abundance, binding affinities, and repression. We find nuclear BMAL1 concentration determines corresponding CLOCK through heterodimerisation and define a DNA residence time of this complex. Repression of CLOCK:BMAL1 is achieved through rhythmic changes to BMAL1:CRY1 association and high-affinity interactions between PER2:CRY1 which mediates CLOCK:BMAL1 displacement from DNA. Finally, stochastic modelling reveals a dual role for PER:CRY complexes in which increasing concentrations of PER2:CRY1 promotes removal of BMAL1:CLOCK from genes consequently enhancing ability to move to new target sites.


Subjects
Circadian Clocks , ARNTL Transcription Factors/genetics , ARNTL Transcription Factors/metabolism , Animals , CLOCK Proteins/genetics , CLOCK Proteins/metabolism , Circadian Clocks/genetics , Circadian Rhythm/genetics , Mammals/metabolism
2.
BMC Bioinformatics ; 20(1): 15, 2019 Jan 09.
Article in English | MEDLINE | ID: mdl-30626338

ABSTRACT

BACKGROUND: Canonical correlation analysis (CCA) is a classic statistical tool for investigating complex multivariate data. Correspondingly, it has found many diverse applications, ranging from molecular biology and medicine to social science and finance. Intriguingly, despite the importance and pervasiveness of CCA, only recently a probabilistic understanding of CCA is developing, moving from an algorithmic to a model-based perspective and enabling its application to large-scale settings. RESULTS: Here, we revisit CCA from the perspective of statistical whitening of random variables and propose a simple yet flexible probabilistic model for CCA in the form of a two-layer latent variable generative model. The advantages of this variant of probabilistic CCA include non-ambiguity of the latent variables, provisions for negative canonical correlations, possibility of non-normal generative variables, as well as ease of interpretation on all levels of the model. In addition, we show that it lends itself to computationally efficient estimation in high-dimensional settings using regularized inference. We test our approach to CCA analysis in simulations and apply it to two omics data sets illustrating the integration of gene expression data, lipid concentrations and methylation levels. CONCLUSIONS: Our whitening approach to CCA provides a unifying perspective on CCA, linking together sphering procedures, multivariate regression and corresponding probabilistic generative models. Furthermore, we offer an efficient computer implementation in the "whitening" R package available at https://CRAN.R-project.org/package=whitening .


Subjects
Databases, Genetic/statistics & numerical data , Principal Component Analysis/methods , Algorithms , Humans , Multivariate Analysis
3.
Bioinformatics ; 31(19): 3156-62, 2015 Oct 01.
Article in English | MEDLINE | ID: mdl-26026136

ABSTRACT

MOTIVATION: Proteomic mass spectrometry analysis is becoming routine in clinical diagnostics, for example to monitor cancer biomarkers using blood samples. However, differential proteomics and identification of peaks relevant for class separation remain challenging. RESULTS: Here, we introduce a simple yet effective approach for identifying differentially expressed proteins using binary discriminant analysis. This approach works by data-adaptive thresholding of protein expression values and subsequent ranking of the dichotomized features using a relative entropy measure. Our framework may be viewed as a generalization of the 'peak probability contrast' approach of Tibshirani et al. (2004) and can be applied both in the two-group and the multi-group setting. Our approach is computationally inexpensive and, in the analysis of a large-scale drug discovery test dataset, achieves prediction accuracy equivalent to that of a random forest. Furthermore, in the analysis of mass spectrometry data from a pancreas cancer study, we were able to identify biologically relevant and statistically predictive marker peaks unrecognized in the original study. AVAILABILITY AND IMPLEMENTATION: The methodology for binary discriminant analysis is implemented in the R package binda, which is freely available under the GNU General Public License (version 3 or later) from CRAN at URL http://cran.r-project.org/web/packages/binda/. R scripts reproducing all described analyses are available from the web page http://strimmerlab.org/software/binda/. CONTACT: k.strimmer@imperial.ac.uk.


Subjects
Biomarkers, Tumor/metabolism , Data Interpretation, Statistical , Discriminant Analysis , Mass Spectrometry/methods , Pancreatic Neoplasms/metabolism , Proteomics/methods , Software , Humans , Pancreatic Neoplasms/diagnosis
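
The thresholding-and-ranking idea in the abstract above can be illustrated with a minimal sketch. It is not the binda package interface: the function names are hypothetical, the thresholds are passed in by hand rather than chosen data-adaptively, and the score used here (a class-weighted Kullback-Leibler divergence of each class frequency from the pooled frequency) is only one possible relative entropy measure, assumed for illustration.

```python
import math

def bernoulli_kl(p, q, eps=1e-9):
    """Relative entropy KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    # Clamp away from 0 and 1 so the logarithms stay finite.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def rank_features(X, labels, thresholds):
    """Dichotomize each column of X and rank columns by class separation.

    X is a list of rows, labels gives the class of each row, and
    thresholds holds one hand-picked cut-off per column (an assumption;
    the abstract describes data-adaptive thresholding).
    Returns column indices, most discriminative first.
    """
    classes = sorted(set(labels))
    n = len(labels)
    scores = []
    for j, thr in enumerate(thresholds):
        bits = [1 if row[j] > thr else 0 for row in X]
        pooled = sum(bits) / n
        score = 0.0
        for c in classes:
            members = [b for b, lab in zip(bits, labels) if lab == c]
            freq = sum(members) / len(members)
            score += len(members) / n * bernoulli_kl(freq, pooled)
        scores.append((score, j))
    return [j for _, j in sorted(scores, reverse=True)]
```

For example, with four samples in two classes, a column whose dichotomized values match the class labels exactly is ranked ahead of a column that is "on" equally often in both classes.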
4.
Algorithms Mol Biol ; 10: 10, 2015.
Article in English | MEDLINE | ID: mdl-25788974

ABSTRACT

BACKGROUND: Several sources of noise obfuscate the identification of single nucleotide variation (SNV) in next generation sequencing data. For instance, errors may be introduced during library construction and sequencing steps. In addition, the reference genome and the algorithms used for the alignment of the reads are further critical factors determining the efficacy of variant calling methods. It is crucial to account for these factors in individual sequencing experiments. RESULTS: We introduce a simple data-adaptive model for variant calling. This model automatically adjusts to specific factors such as alignment errors. To achieve this, several characteristics are sampled from sites with low mismatch rates, and these are used to estimate empirical log-likelihoods. The likelihoods are then combined to a score that typically gives rise to a mixture distribution. From this we determine a decision threshold to separate potentially variant sites from the noisy background. CONCLUSIONS: In simulations we show that our simple model is competitive with frequently used much more complex SNV calling algorithms in terms of sensitivity and specificity. It performs specifically well in cases with low allele frequencies. The application to next-generation sequencing data reveals stark differences of the score distributions indicating a strong influence of data specific sources of noise. The proposed model is specifically designed to adjust to these differences.

5.
Biostatistics ; 14(1): 129-43, 2013 Jan.
Article in English | MEDLINE | ID: mdl-22962499

ABSTRACT

Signal identification in large-dimensional settings is a challenging problem in biostatistics. Recently, the method of higher criticism (HC) was shown to be an effective means for determining appropriate decision thresholds. Here, we study HC from a false discovery rate (FDR) perspective. We show that the HC threshold may be viewed as an approximation to a natural class boundary (CB) in two-class discriminant analysis which in turn is expressible as the FDR threshold. We demonstrate that in a rare-weak setting in the region of the phase space where signal identification is possible, both thresholds are practicably indistinguishable, and thus HC thresholding is identical to using a simple local FDR cutoff. The relationship of the HC and CB thresholds and their properties are investigated both analytically and by simulations, and are further compared by the application to four cancer gene expression data sets.


Subjects
Biometry/methods , Gene Expression Profiling/methods , Models, Statistical , Computer Simulation , False Positive Reactions , Humans , Research Design
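
For readers unfamiliar with higher criticism, the sketch below computes the standard HC objective from a list of p-values and returns the p-value at which it is maximized, which serves as the HC decision threshold. It follows the textbook definition of the statistic and is not code from the paper; the function name and the default search fraction `alpha0` are assumptions.

```python
import math

def higher_criticism(pvalues, alpha0=0.5):
    """Return (max HC value, p-value at which it is attained).

    HC_i = sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    maximized over the smallest alpha0 * n ordered p-values.
    """
    n = len(pvalues)
    ordered = sorted(pvalues)
    best_hc, best_p = float("-inf"), None
    for i, p in enumerate(ordered, start=1):
        # Skip ranks outside the search range and degenerate p-values.
        if i > alpha0 * n or p <= 0.0 or p >= 1.0:
            continue
        hc = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1 - p))
        if hc > best_hc:
            best_hc, best_p = hc, p
    return best_hc, best_p
```

Features with p-values at or below the returned threshold would be declared signals; the abstract's point is that, in the rare-weak setting, this cut-off is practically indistinguishable from a local FDR cut-off.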
6.
BMC Bioinformatics ; 13: 284, 2012 Oct 31.
Article in English | MEDLINE | ID: mdl-23113980

ABSTRACT

BACKGROUND: Identification of causal SNPs in most genome wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, increasingly modern computationally expensive regression methods are employed for SNP selection that consider all markers simultaneously and thus incorporate dependencies among SNPs. RESULTS: We develop a novel multivariate algorithm for large scale SNP selection using CAR score regression, a promising new approach for prioritizing biomarkers. Specifically, we propose a computationally efficient procedure for shrinkage estimation of CAR scores from high-dimensional data. Subsequently, we conduct a comprehensive comparison study including five advanced regression approaches (boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal correlation) to determine the effectiveness in finding true causal SNPs. CONCLUSIONS: Simultaneous SNP selection is a challenging task. We demonstrate that our CAR score-based algorithm consistently outperforms all competing approaches, both uni- and multivariate, in terms of correctly recovered causal SNPs and SNP ranking. An R package implementing the approach as well as R code to reproduce the complete study presented here is available from http://strimmerlab.org/software/care/.


Subjects
Algorithms , Genome-Wide Association Study/methods , Genome-Wide Association Study/statistics & numerical data , Polymorphism, Single Nucleotide , Humans , Regression Analysis , Software
7.
Bioinformatics ; 28(17): 2270-1, 2012 Sep 01.
Article in English | MEDLINE | ID: mdl-22796955

ABSTRACT

UNLABELLED: MALDIquant is an R package providing a complete and modular analysis pipeline for quantitative analysis of mass spectrometry data. MALDIquant is specifically designed with application in clinical diagnostics in mind and implements sophisticated routines for importing raw data, preprocessing, non-linear peak alignment and calibration. It also handles technical replicates as well as spectra with unequal resolution. AVAILABILITY: MALDIquant and its associated R packages readBrukerFlexData and readMzXmlData are freely available from the R archive CRAN (http://cran.r-project.org). The software is distributed under the GNU General Public License (version 3 or later) and is accompanied by example files and data. Additional documentation is available from http://strimmerlab.org/software/maldiquant/.


Subjects
Data Interpretation, Statistical , Software , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization/methods , Humans
8.
Bioinformatics ; 26(16): 1990-8, 2010 Aug 15.
Article in English | MEDLINE | ID: mdl-20581402

ABSTRACT

MOTIVATION: In statistical bioinformatics research, different optimization mechanisms potentially lead to 'over-optimism' in published papers. So far, however, a systematic critical study concerning the various sources underlying this over-optimism is lacking. RESULTS: We present an empirical study on over-optimism using high-dimensional classification as example. Specifically, we consider a 'promising' new classification algorithm, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. While this approach yields poor results in terms of error rate, we quantitatively demonstrate that it can artificially seem superior to existing approaches if we 'fish for significance'. The investigated sources of over-optimism include the optimization of datasets, of settings, of competing methods and, most importantly, of the method's characteristics. We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should always be demonstrated on independent validation data. AVAILABILITY: The R codes and relevant data can be downloaded from http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/overoptimism/, such that the study is completely reproducible.


Subjects
Algorithms , Computational Biology/methods , Discriminant Analysis , Gene Expression Profiling/methods
9.
Bioinformatics ; 25(20): 2700-7, 2009 Oct 15.
Article in English | MEDLINE | ID: mdl-19648135

ABSTRACT

MOTIVATION: Biomarker discovery and gene ranking is a standard task in genomic high-throughput analysis. Typically, the ordering of markers is based on a stabilized variant of the t-score, such as the moderated t or the SAM statistic. However, these procedures ignore gene-gene correlations, which may have a profound impact on the gene orderings and on the power of the subsequent tests. RESULTS: We propose a simple procedure that adjusts gene-wise t-statistics to take account of correlations among genes. The resulting correlation-adjusted t-scores ('cat' scores) are derived from a predictive perspective, i.e. as a score for variable selection to discriminate group membership in two-class linear discriminant analysis. In the absence of correlation the cat score reduces to the standard t-score. Moreover, using the cat score it is straightforward to evaluate groups of features (i.e. gene sets). For computation of the cat score from small sample data, we propose a shrinkage procedure. In a comparative study comprising six different synthetic and empirical correlation structures, we show that the cat score improves estimation of gene orderings and leads to higher power for fixed true discovery rate, and vice versa. Finally, we also illustrate the cat score by analyzing metabolomic data. AVAILABILITY: The shrinkage cat score is implemented in the R package 'st', which is freely available under the terms of the GNU General Public License (version 3 or later) from CRAN (http://cran.r-project.org/web/packages/st/).


Subjects
Computational Biology/methods , Gene Expression Profiling/methods , Genetic Markers , Models, Statistical , Oligonucleotide Array Sequence Analysis
10.
BMC Bioinformatics ; 10: 47, 2009 Feb 03.
Article in English | MEDLINE | ID: mdl-19192285

ABSTRACT

BACKGROUND: Analysis of microarray and other high-throughput data on the basis of gene sets, rather than individual genes, is becoming more important in genomic studies. Correspondingly, a large number of statistical approaches for detecting gene set enrichment have been proposed, but both the interrelations and the relative performance of the various methods are still very much unclear. RESULTS: We conduct an extensive survey of statistical approaches for gene set analysis and identify a common modular structure underlying most published methods. Based on this finding we propose a general framework for detecting gene set enrichment. This framework provides a meta-theory of gene set analysis that not only helps to gain a better understanding of the relative merits of each embedded approach but also facilitates a principled comparison and offers insights into the relative interplay of the methods. CONCLUSION: We use this framework to conduct a computer simulation comparing 261 different variants of gene set enrichment procedures and to analyze two experimental data sets. Based on the results we offer recommendations for best practices regarding the choice of effective procedures for gene set enrichment analysis.


Subjects
Computer Simulation , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Animals , Databases, Genetic , Humans , Models, Statistical
11.
BMC Bioinformatics ; 9: 303, 2008 Jul 09.
Article in English | MEDLINE | ID: mdl-18613966

ABSTRACT

BACKGROUND: False discovery rate (FDR) methods play an important role in analyzing high-dimensional data. There are two types of FDR, tail area-based FDR and local FDR, as well as numerous statistical algorithms for estimating or controlling FDR. These differ in terms of underlying test statistics and procedures employed for statistical learning. RESULTS: A unifying algorithm for simultaneous estimation of both local FDR and tail area-based FDR is presented that can be applied to a diverse range of test statistics, including p-values, correlations, z- and t-scores. This approach is semiparametric and is based on a modified Grenander density estimator. For test statistics other than p-values it allows for empirical null modeling, so that dependencies among tests can be taken into account. The inference of the underlying model employs truncated maximum-likelihood estimation, with the cut-off point chosen according to the false non-discovery rate. CONCLUSION: The proposed procedure generalizes a number of more specialized algorithms and thus offers a common framework for FDR estimation consistent across test statistics and types of FDR. In a comparative study the unified approach performs on par with the best competing yet more specialized alternatives. The algorithm is implemented in R in the "fdrtool" package, available under the GNU GPL from http://strimmerlab.org/software/fdrtool/ and from the R package archive CRAN.


Subjects
Models, Statistical , Predictive Value of Tests , Software , Algorithms , Biometry/methods , Breast Neoplasms/genetics , Confidence Intervals , Female , HIV/genetics , Humans , Likelihood Functions , Oligonucleotide Array Sequence Analysis , Sample Size
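
As a point of reference for tail area-based FDR, a minimal sketch of the Benjamini-Hochberg adjustment is given below. This is a deliberately simpler, swapped-in technique: it is neither the modified Grenander approach described in the abstract nor the fdrtool interface, and the function name `bh_qvalues` is hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order.

    For the p-value of rank r (1 = smallest) out of n, the raw estimate
    is p * n / r; a running minimum from the largest rank downwards
    makes the adjusted values monotone in the p-values.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    q = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        q[i] = running_min
    return q
```

Calling `bh_qvalues([0.01, 0.04, 0.03, 0.5])` gives one adjusted value per hypothesis; rejecting those below a chosen level controls the tail area-based FDR under the usual Benjamini-Hochberg assumptions. Local FDR, by contrast, requires a density estimate and is not covered by this sketch.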
12.
Bioinformatics ; 24(12): 1461-2, 2008 Jun 15.
Article in English | MEDLINE | ID: mdl-18441000

ABSTRACT

UNLABELLED: False discovery rate (FDR) methodologies are essential in the study of high-dimensional genomic and proteomic data. The R package 'fdrtool' facilitates such analyses by offering a comprehensive set of procedures for FDR estimation. Its distinctive features include: (i) many different types of test statistics are allowed as input data, such as P-values, z-scores, correlations and t-scores; (ii) simultaneously, both local FDR and tail area-based FDR values are estimated for all test statistics and (iii) empirical null models are fit where possible, thereby taking account of potential over- or underdispersion of the theoretical null. In addition, 'fdrtool' provides readily interpretable graphical output, and can be applied to very large scale (in the order of millions of hypotheses) multiple testing problems. Consequently, 'fdrtool' implements a flexible FDR estimation scheme that is unified across different test statistics and variants of FDR. AVAILABILITY: The program is freely available from the Comprehensive R Archive Network (http://cran.r-project.org/) under the terms of the GNU General Public License (version 3 or later). CONTACT: strimmer@uni-leipzig.de.


Subjects
Algorithms , Confidence Intervals , Data Interpretation, Statistical , False Positive Reactions , Genomics/methods , Programming Languages , Software
13.
FASEB J ; 22(2): 437-44, 2008 Feb.
Article in English | MEDLINE | ID: mdl-17932027

ABSTRACT

With HIV persisting lifelong in infected persons, therapeutic vaccination is a novel alternative concept to control virus replication. Even though CD8 and CD4 cell responses to such immunizations have been demonstrated, their effects on virus replication are still unclear. In view of this fact, we studied the impact of a therapeutic vaccination with HIV nef delivered by a recombinant modified vaccinia Ankara vector on viral diversity. We investigated HIV sequences derived from chronically infected persons before and after therapeutic vaccination. Before immunization the mean +/- se pairwise variability of patient-derived Nef protein sequences was 0.1527 +/- 0.0041. After vaccination the respective value was 0.1249 +/- 0.0042, resulting in a significant (P<0.0001) difference between the two time points. The genes vif and 5'gag tested in parallel and nef sequences in control persons yielded a constant amino acid sequence variation. The data presented suggest that Nef immunization induced a selective pressure, limiting HIV sequence variability. To our knowledge this is the first report directly linking therapeutic HIV vaccination to decreasing diversity in patient-derived virus isolates.


Subjects
AIDS Vaccines/immunology , AIDS Vaccines/therapeutic use , Acquired Immunodeficiency Syndrome/genetics , Acquired Immunodeficiency Syndrome/immunology , Acquired Immunodeficiency Syndrome/metabolism , Acquired Immunodeficiency Syndrome/therapy , Base Sequence , Gene Products, nef/genetics , Gene Products, nef/metabolism , Genetic Variation/genetics , Humans , Phylogeny , Sequence Analysis, DNA , T-Lymphocytes/immunology
14.
BMC Syst Biol ; 1: 37, 2007 Aug 06.
Article in English | MEDLINE | ID: mdl-17683609

ABSTRACT

BACKGROUND: The use of correlation networks is widespread in the analysis of gene expression and proteomics data, even though it is known that correlations not only confound direct and indirect associations but also provide no means to distinguish between cause and effect. For "causal" analysis typically the inference of a directed graphical model is required. However, this is rather difficult due to the curse of dimensionality. RESULTS: We propose a simple heuristic for the statistical learning of a high-dimensional "causal" network. The method first converts a correlation network into a partial correlation graph. Subsequently, a partial ordering of the nodes is established by multiple testing of the log-ratio of standardized partial variances. This allows identifying a directed acyclic causal network as a subgraph of the partial correlation network. We illustrate the approach by analyzing a large Arabidopsis thaliana expression data set. CONCLUSION: The proposed approach is a heuristic algorithm that is based on a number of approximations, such as substituting lower order partial correlations by full order partial correlations. Nevertheless, for small samples and for sparse networks the algorithm not only yields sensible first order approximations of the causal structure in high-dimensional genomic data but is also computationally highly efficient. AVAILABILITY AND REQUIREMENTS: The method is implemented in the "GeneNet" R package (version 1.2.0), available from CRAN and from http://strimmerlab.org/software/genets/. The software includes an R script for reproducing the network analysis of the Arabidopsis thaliana data.


Subjects
Algorithms , Arabidopsis/genetics , Gene Expression Profiling , Gene Expression Regulation, Plant , Genes, Plant/genetics , Internet , Software
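
The first step described in the abstract above, turning a correlation network into a partial correlation graph, can be shown on three variables, where the full-order partial correlation has a simple closed form. This sketch is not the GeneNet interface: the function names are hypothetical, and the fixed cut-off stands in for the multiple testing procedure used in the paper.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def partial_edges(r_ab, r_ac, r_bc, cutoff=0.1):
    """Edges of the 3-node partial correlation graph with |pcor| > cutoff.

    The cutoff is an illustrative assumption, not a significance test.
    """
    pcor = {
        ("a", "b"): partial_corr(r_ab, r_ac, r_bc),
        ("a", "c"): partial_corr(r_ac, r_ab, r_bc),
        ("b", "c"): partial_corr(r_bc, r_ab, r_ac),
    }
    return {edge: value for edge, value in pcor.items() if abs(value) > cutoff}
```

With `r_ac = r_bc = 0.8` and `r_ab = 0.64`, all three marginal correlations are large, yet the partial correlation between a and b is essentially zero, so only the edges a-c and b-c survive. This is the sense in which partial correlations separate direct from indirect associations.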
15.
BMC Bioinformatics ; 8 Suppl 2: S3, 2007 May 03.
Article in English | MEDLINE | ID: mdl-17493252

ABSTRACT

BACKGROUND: Causal networks based on the vector autoregressive (VAR) process are a promising statistical tool for modeling regulatory interactions in a cell. However, learning these networks is challenging due to the low sample size and high dimensionality of genomic data. RESULTS: We present a novel and highly efficient approach to estimate a VAR network. This proceeds in two steps: (i) improved estimation of VAR regression coefficients using an analytic shrinkage approach, and (ii) subsequent model selection by testing the associated partial correlations. In simulations this approach outperformed for small sample size all other considered approaches in terms of true discovery rate (number of correctly identified edges relative to the significant edges). Moreover, the analysis of expression time series data from Arabidopsis thaliana resulted in a biologically sensible network. CONCLUSION: Statistical learning of large-scale VAR causal models can be done efficiently by the proposed procedure, even in the difficult data situations prevalent in genomics and proteomics. AVAILABILITY: The method is implemented in R code that is available from the authors on request.


Subjects
Algorithms , Artificial Intelligence , Gene Expression Regulation/physiology , Models, Biological , Proteome/metabolism , Signal Transduction/physiology , Computer Simulation , Pattern Recognition, Automated/methods , Regression Analysis , Systems Biology/methods , Time Factors
16.
Stat Appl Genet Mol Biol ; 6: Article9, 2007.
Article in English | MEDLINE | ID: mdl-17402924

ABSTRACT

High-dimensional case-control analysis is encountered in many different settings in genomics. In order to rank genes accordingly, many different scores have been proposed, ranging from ad hoc modifications of the ordinary t statistic to complicated hierarchical Bayesian models. Here, we introduce the "shrinkage t" statistic that is based on a novel and model-free shrinkage estimate of the variance vector across genes. This is derived in a quasi-empirical Bayes setting. The new rank score is fully automatic and requires no specification of parameters or distributions. It is computationally inexpensive and can be written analytically in closed form. Using a series of synthetic and three real expression data we studied the quality of gene rankings produced by the "shrinkage t" statistic. The new score consistently leads to highly accurate rankings for the complete range of investigated data sets and all considered scenarios for across-gene variance structures.


Subjects
Gene Expression , Bayes Theorem , Likelihood Functions
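
The general idea of a t statistic with shrunken variances can be sketched as follows. This is an illustration under stated assumptions, not the published estimator: each gene's pooled variance is pulled towards the median variance across genes, and the shrinkage intensity `lam` is supplied by the caller, whereas the abstract describes an automatic, analytically determined estimate. All function names are hypothetical.

```python
import math
from statistics import mean, median

def pooled_variance(a, b):
    """Pooled within-group variance of two samples."""
    mean_a, mean_b = mean(a), mean(b)
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    return ss / (len(a) + len(b) - 2)

def shrinkage_t(group1, group2, lam):
    """t-like scores using variances shrunk towards their median.

    group1 and group2 are lists with one sample list per gene;
    lam in [0, 1] is the shrinkage intensity (0 gives the ordinary
    equal-variance t statistic). Choosing lam by hand is an assumption.
    """
    variances = [pooled_variance(a, b) for a, b in zip(group1, group2)]
    target = median(variances)
    scores = []
    for a, b, v in zip(group1, group2, variances):
        v_shrunk = lam * target + (1 - lam) * v
        se = math.sqrt(v_shrunk * (1 / len(a) + 1 / len(b)))
        scores.append((mean(a) - mean(b)) / se)
    return scores
```

Shrinking towards a common target stabilizes scores for genes whose variance happens to be estimated as very small or very large from few samples, which is the motivation given in the abstract for regularized rank scores.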
17.
Brief Bioinform ; 8(1): 32-44, 2007 Jan.
Article in English | MEDLINE | ID: mdl-16772269

ABSTRACT

Partial least squares (PLS) is an efficient statistical regression technique that is highly suited for the analysis of genomic and proteomic data. In this article, we review both the theory underlying PLS as well as a host of bioinformatics applications of PLS. In particular, we provide a systematic comparison of the PLS approaches currently employed, and discuss analysis problems as diverse as, e.g. tumor classification from transcriptome data, identification of relevant genes, survival analysis and modeling of gene networks and transcription factor activities.


Subjects
Computational Biology/methods , Genomics/statistics & numerical data , Least-Squares Analysis , Animals , Data Interpretation, Statistical , Humans , Models, Statistical , Multivariate Analysis , Software , Survival Analysis
18.
Theor Biol Med Model ; 2: 23, 2005 Jun 24.
Article in English | MEDLINE | ID: mdl-15978125

ABSTRACT

BACKGROUND: The study of the network between transcription factors and their targets is important for understanding the complex regulatory mechanisms in a cell. Unfortunately, with standard microarray experiments it is not possible to measure the transcription factor activities (TFAs) directly, as their own transcription levels are subject to post-translational modifications. RESULTS: Here we propose a statistical approach based on partial least squares (PLS) regression to infer the true TFAs from a combination of mRNA expression and DNA-protein binding measurements. This method is also statistically sound for small samples and allows the detection of functional interactions among the transcription factors via the notion of "meta"-transcription factors. In addition, it enables false positives to be identified in ChIP data and activation and suppression activities to be distinguished. CONCLUSION: The proposed method performs very well both for simulated data and for real expression and ChIP data from yeast and E. coli experiments. It overcomes the limitations of previously used approaches to estimating TFAs. The estimated profiles may also serve as input for further studies, such as tests of periodicity or differential regulation. An R package "plsgenomics" implementing the proposed methods is available for download from the CRAN archive.


Subjects
Microarray Analysis , Models, Genetic , Oligonucleotide Array Sequence Analysis , Transcription Factors/physiology , Transcriptional Activation , Algorithms , DNA/metabolism , DNA-Binding Proteins/metabolism , Escherichia coli/genetics , Least-Squares Analysis , RNA, Messenger/metabolism , Saccharomyces cerevisiae/genetics
19.
BMC Evol Biol ; 5: 6, 2005 Jan 21.
Article in English | MEDLINE | ID: mdl-15663782

ABSTRACT

BACKGROUND: Coalescent theory is a general framework to model genetic variation in a population. Specifically, it allows inference about population parameters from sampled DNA sequences. However, most currently employed variants of coalescent theory only consider very simple demographic scenarios of population size changes, such as exponential growth. RESULTS: Here we develop a coalescent approach that allows Bayesian non-parametric estimation of the demographic history using genealogies reconstructed from sampled DNA sequences. In this framework inference and model selection is done using reversible jump Markov chain Monte Carlo (MCMC). This method is computationally efficient and overcomes the limitations of related non-parametric approaches such as the skyline plot. We validate the approach using simulated data. Subsequently, we reanalyze HIV-1 sequence data from Central Africa and Hepatitis C virus (HCV) data from Egypt. CONCLUSIONS: The new method provides a Bayesian procedure for non-parametric estimation of the demographic history. By construction it additionally provides confidence limits and may be used jointly with other MCMC-based coalescent approaches.


Subjects
Evolution, Molecular , Models, Genetic , Models, Statistical , Algorithms , Bayes Theorem , DNA/genetics , Data Interpretation, Statistical , Genes, Viral , Genetic Variation , Genetics, Population , HIV-1/genetics , Hepacivirus/genetics , Markov Chains , Monte Carlo Method , Software
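
For context, the classic skyline plot that the abstract above names as a related non-parametric approach can be sketched in a few lines. This is the simple baseline, not the Bayesian reversible jump method of the paper; the function name and the input layout (one pair of lineage count and interval duration per coalescent interval) are assumptions.

```python
def classic_skyline(intervals):
    """Classic skyline estimates from coalescent intervals.

    intervals is a list of (k, t) pairs, where t is the duration of the
    genealogy interval during which k lineages exist. The estimate for
    each interval is t * k * (k - 1) / 2, in the same time units as t.
    """
    estimates = []
    for k, t in intervals:
        if k < 2:
            raise ValueError("each interval needs at least two lineages")
        estimates.append((k, t * k * (k - 1) / 2))
    return estimates
```

Because each estimate rests on a single interval, the resulting curve is very noisy, which is the kind of limitation the abstract refers to.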
20.
Stat Appl Genet Mol Biol ; 4: Article32, 2005.
Article in English | MEDLINE | ID: mdl-16646851

ABSTRACT

Inferring large-scale covariance matrices from sparse genomic data is a ubiquitous problem in bioinformatics. Clearly, the widely used standard covariance and correlation estimators are ill-suited for this purpose. As a statistically efficient and computationally fast alternative we propose a novel shrinkage covariance estimator that exploits the Ledoit-Wolf (2003) lemma for analytic calculation of the optimal shrinkage intensity. Subsequently, we apply this improved covariance estimator (which has guaranteed minimum mean squared error, is well-conditioned, and is always positive definite even for small sample sizes) to the problem of inferring large-scale gene association networks. We show that it performs very favorably compared to competing approaches both in simulations as well as in application to real expression data.
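
Why shrinkage yields a well-conditioned, positive definite estimate can be seen in a minimal sketch. It assumes the identity matrix as the shrinkage target for correlations and takes the intensity `lam` as a given input; the abstract instead describes computing the optimal intensity analytically, which is not reproduced here. The function name is hypothetical.

```python
def shrink_correlation(R, lam):
    """Shrink a correlation matrix towards the identity.

    Off-diagonal entries are multiplied by (1 - lam); the diagonal
    stays at 1. lam in [0, 1] is supplied by the caller (an assumption).
    """
    p = len(R)
    return [
        [1.0 if i == j else (1 - lam) * R[i][j] for j in range(p)]
        for i in range(p)
    ]
```

For instance, the singular matrix `[[1, 1], [1, 1]]`, which can arise when two variables are estimated from very few samples, becomes `[[1, 0.8], [0.8, 1]]` at `lam = 0.2`, with determinant 0.36 and therefore invertible.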
