Results 1 - 20 of 23
1.
Mol Phylogenet Evol ; 180: 107689, 2023 03.
Article in English | MEDLINE | ID: mdl-36587884

ABSTRACT

Phylogenetic trees constructed from molecular sequence data rely on largely arbitrary assumptions about the substitution model, the distribution of substitution rates across sites, the version of the molecular clock, and, in the case of Bayesian inference, the prior distribution. Those assumptions affect results reported in the form of clade probabilities and error bars on divergence times and substitution rates. Overlooking the uncertainty in the assumptions leads to overly confident conclusions in the form of inflated clade probabilities and short confidence intervals or credible intervals. This paper demonstrates how to propagate that uncertainty by combining the models considered along with all of their assumptions, including their prior distributions. The combined models incorporate much more of the uncertainty than Bayesian model averages since the latter tend to settle on a single model due to the higher-level assumption that one of the models is true. The proposed model combination method is illustrated with nucleotide sequence data.
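A minimal sketch of the combination idea under simplifying assumptions: posterior samples of a divergence time are drawn under several substitution-model and clock combinations (the model names and numbers below are hypothetical, and the pooling rule is a generic equal-weight mixture rather than necessarily the paper's exact procedure). Pooling draws across models keeps the disagreement among models in the reported interval instead of letting the analysis settle on a single model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws of a divergence time (Ma) under three
# substitution/clock model combinations fitted to the same alignment.
draws_by_model = {
    "JC69_strict":   rng.normal(42.0, 1.5, 10_000),
    "HKY_relaxed":   rng.normal(45.0, 2.5, 10_000),
    "GTR_G_relaxed": rng.normal(47.0, 3.0, 10_000),
}

def credible_interval(x, level=0.95):
    lo, hi = np.quantile(x, [(1 - level) / 2, 1 - (1 - level) / 2])
    return lo, hi

# Per-model intervals: each one understates the model uncertainty.
for name, x in draws_by_model.items():
    print(name, credible_interval(x))

# Equal-weight combination: pool the draws from all models so the reported
# interval reflects disagreement among models as well as within-model noise.
pooled = np.concatenate(list(draws_by_model.values()))
print("combined", credible_interval(pooled))
```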


Subject(s)
Evolution, Molecular , Models, Genetic , Phylogeny , Uncertainty , Bayes Theorem , Probability
2.
Mol Phylogenet Evol ; 167: 107357, 2022 02.
Article in English | MEDLINE | ID: mdl-34785383

ABSTRACT

Confidence intervals of divergence times and branch lengths do not reflect uncertainty about their clades or about the prior distributions and other model assumptions on which they are based. Uncertainty about the clade may be propagated to a confidence interval by multiplying its confidence level by the bootstrap proportion of its clade or by another probability that the clade is correct. (If the confidence level is 95% and the bootstrap proportion is 90%, then the uncertainty-adjusted confidence level is (0.95)(0.90) = 86%.) Uncertainty about the model can be propagated to the confidence interval by reporting the union of the confidence intervals from all the plausible models. Unless there is no overlap between the confidence intervals, that results in an uncertainty-adjusted interval that has as its lower and upper limits the most extreme limits of the models. The proposed methods of uncertainty quantification may be used together.
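Both adjustments are simple enough to sketch in a few lines; the confidence level and bootstrap proportion reproduce the worked example in the abstract, while the per-model intervals are hypothetical.

```python
# Uncertainty about the clade: multiply the nominal confidence level by the
# bootstrap proportion (or another probability that the clade is correct).
confidence_level = 0.95
clade_probability = 0.90
adjusted_level = confidence_level * clade_probability
print(f"adjusted confidence level: {adjusted_level:.3f}")  # 0.855, i.e. about 86%

# Uncertainty about the model: report the union of the confidence intervals
# from all plausible models (hypothetical divergence-time intervals, in Ma).
# Provided the intervals overlap, the union runs from the smallest lower
# limit to the largest upper limit.
intervals_by_model = [(40.1, 44.8), (41.5, 47.2), (43.0, 49.6)]
lower = min(lo for lo, _ in intervals_by_model)
upper = max(hi for _, hi in intervals_by_model)
print(f"uncertainty-adjusted interval: ({lower}, {upper})")
```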


Subject(s)
Models, Statistical , Confidence Intervals , Phylogeny , Probability , Uncertainty
3.
Article in English | MEDLINE | ID: mdl-30113898

ABSTRACT

In a genome-wide association study (GWAS), the probability that a single nucleotide polymorphism (SNP) is not associated with a disease is its local false discovery rate (LFDR). The LFDR for each SNP is relative to a reference class of SNPs. For example, the LFDR of an exonic SNP can vary widely depending on whether it is considered relative to the separate reference class of other exonic SNPs or relative to the combined reference class of all SNPs in the data set. As a result, the analysis of the data based on the combined reference class might indicate that a specific exonic SNP is associated with the disease, while using the separate reference class indicates that it is not associated, or vice versa. To address that, we introduce empirical Bayes methods that simultaneously consider a combined reference class and a separate reference class. Our simulation studies indicate that the proposed methods lead to improved performance. The new maximum entropy method achieves that by depending on the separate class when it has enough SNPs for reliable LFDR estimation and depending solely on the combined class otherwise. We used the new methods to analyze data from a GWAS of 2,000 cases and 3,000 controls. R functions implementing the proposed methods are available on CRAN and Shiny.
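A minimal sketch of why the reference class matters, assuming z-statistics for each SNP and the two-groups form lfdr(z) = π0·f0(z)/f(z) with the marginal density f estimated by a kernel density; the simulated classes and the π0 values are assumptions chosen for illustration, not the paper's maximum entropy estimator.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(1)

# Hypothetical GWAS z-statistics: a small exonic class enriched for
# association, embedded in a much larger, mostly null combined class.
z_exonic = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 100)])
z_other = rng.normal(0, 1, 20_000)
z_all = np.concatenate([z_exonic, z_other])

def lfdr(z_class, z_query, pi0):
    """Two-groups local FDR: pi0 * f0(z) / f(z), with f estimated by a KDE."""
    z_query = np.atleast_1d(z_query)
    f = gaussian_kde(z_class)(z_query)
    f0 = norm.pdf(z_query)
    return np.minimum(1.0, pi0 * f0 / f)

z_snp = 2.8  # z-statistic of one exonic SNP of interest

# Separate reference class (other exonic SNPs) vs. combined class (all SNPs);
# the pi0 values are assumed for illustration rather than estimated.
print("separate class:", lfdr(z_exonic, z_snp, pi0=0.75))
print("combined class:", lfdr(z_all, z_snp, pi0=0.99))
```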


Subject(s)
Computational Biology/methods , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide/genetics , Bayes Theorem , Coronary Artery Disease/genetics , Databases, Genetic , Entropy , Humans
4.
PLoS One ; 13(11): e0206902, 2018.
Article in English | MEDLINE | ID: mdl-30475807

ABSTRACT

Methods of estimating the local false discovery rate (LFDR) have been applied to different types of datasets such as high-throughput biological data, diffusion tensor imaging (DTI), and genome-wide association (GWA) studies. We present a model for LFDR estimation that incorporates a covariate into each test. Incorporating the covariate may improve the performance of testing procedures because it carries additional information from the biological context of the corresponding test. This method provides different estimates depending on a tuning parameter. We estimate the optimal value of that parameter by choosing the one that minimizes the estimated error of the LFDR, assessed through its bias and variance in a bootstrap approach. This estimation method is called an adaptive reference class (ARC) method. In this study, we consider the performance of the ARC method under certain assumptions on the prior probability of each hypothesis test as a function of the covariate. We prove that, under these assumptions and given a large covariate effect, the ARC method has an asymptotic mean squared error no greater than that of the method that uses the entire set of hypotheses. In addition, we conduct a simulation study to evaluate the performance of the estimator associated with the ARC method for a finite number of hypotheses. Here, we apply the proposed method to coronary artery disease (CAD) data taken from a GWA study and diffusion tensor imaging (DTI) data.
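A hedged sketch of the generic bootstrap criterion the abstract alludes to: each candidate value of the tuning parameter is scored by an estimated mean squared error, bias squared plus variance over bootstrap resamples, and the minimizer is selected. The estimator, the data, and the candidate values below are placeholders rather than the ARC method itself.

```python
import numpy as np

rng = np.random.default_rng(2)

def estimator(sample, tau):
    """Placeholder tuning-dependent estimator: a mean shrunk toward zero
    by the tuning parameter tau (purely illustrative)."""
    return np.mean(sample) / (1.0 + tau)

def bootstrap_mse(sample, tau, n_boot=2000):
    """Bootstrap estimate of bias^2 + variance for a given tau."""
    theta_hat = estimator(sample, tau=0.0)  # reference estimate
    boot = np.array([
        estimator(rng.choice(sample, size=sample.size, replace=True), tau)
        for _ in range(n_boot)
    ])
    bias = boot.mean() - theta_hat
    return bias**2 + boot.var()

data = rng.normal(0.3, 1.0, 50)
candidates = [0.0, 0.1, 0.25, 0.5, 1.0]
mse = {tau: bootstrap_mse(data, tau) for tau in candidates}
best_tau = min(mse, key=mse.get)
print(mse, "-> selected tau:", best_tau)
```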


Subject(s)
Data Interpretation, Statistical , Datasets as Topic , Bias , Coronary Artery Disease/diagnostic imaging , Coronary Artery Disease/genetics , Diffusion Tensor Imaging/statistics & numerical data , Genome-Wide Association Study/statistics & numerical data , High-Throughput Screening Assays/statistics & numerical data , Humans , Probability
5.
PLoS One ; 12(9): e0185174, 2017.
Article in English | MEDLINE | ID: mdl-28931044

ABSTRACT

The maximum entropy (ME) method is a recently-developed approach for estimating local false discovery rates (LFDR) that incorporates external information allowing assignment of a subset of tests to a category with a different prior probability of following the null hypothesis. Using this ME method, we have reanalyzed the findings from a recent large genome-wide association study of coronary artery disease (CAD), incorporating biologic annotations. Our revised LFDR estimates show many large reductions in LFDR, particularly among the genetic variants belonging to annotation categories that were known to be of particular interest for CAD. However, among SNPs with rare minor allele frequencies, the reductions in LFDR were modest in size.


Subject(s)
Coronary Artery Disease/genetics , Gene Frequency , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide , Genetic Predisposition to Disease , Genome-Wide Association Study/statistics & numerical data , Humans , Models, Genetic , Probability
7.
FASEB J ; 27(10): 4213-25, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23825224

ABSTRACT

Exercise substantially improves metabolic health, making the elicited mechanisms important targets for novel therapeutic strategies. Uncoupling protein 3 (UCP3) is a mitochondrial inner membrane protein highly selectively expressed in skeletal muscle. Here we report that moderate UCP3 overexpression (roughly 3-fold) in muscles of UCP3 transgenic (UCP3 Tg) mice acts as an exercise mimetic in many ways. UCP3 overexpression increased spontaneous activity (∼40%) and energy expenditure (∼5-10%) and decreased oxidative stress (∼15-20%), similar to exercise training in wild-type (WT) mice. The increase in complete fatty acid oxidation (FAO; ∼30% for WT and ∼70% for UCP3 Tg) and energy expenditure (∼8% for WT and 15% for UCP3 Tg) in response to endurance training was higher in UCP3 Tg than in WT mice, showing an additive effect of UCP3 and endurance training on these two parameters. Moreover, increases in circulating short-chain acylcarnitines in response to acute exercise in untrained WT mice were absent with training or in UCP3 Tg mice. UCP3 overexpression had the same effect as training in decreasing long-chain acylcarnitines. Outcomes coincided with a reduction in muscle carnitine acetyltransferase activity that catalyzes the formation of acylcarnitines. Overall, results are consistent with the conclusions that circulating acylcarnitines could be used as a marker of incomplete muscle FAO and that UCP3 is a potential target for the treatment of prevalent metabolic diseases in which muscle FAO is affected.


Subject(s)
Gene Expression Regulation/physiology , Ion Channels/metabolism , Mitochondrial Proteins/metabolism , Physical Endurance , Animals , Biomarkers , Eating , Energy Metabolism , Ion Channels/genetics , Male , Mice , Mice, Transgenic , Mitochondrial Proteins/genetics , Muscle, Skeletal/metabolism , Oxidation-Reduction , Oxidative Stress , Physical Conditioning, Animal , Uncoupling Protein 3
8.
Stat Appl Genet Mol Biol ; 12(4): 529-43, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23798617

ABSTRACT

Multiple comparison procedures that control a family-wise error rate or false discovery rate provide an achieved error rate as the adjusted p-value or q-value for each hypothesis tested. However, since achieved error rates are not understood as probabilities that the null hypotheses are true, empirical Bayes methods have been employed to estimate such posterior probabilities, called local false discovery rates (LFDRs) to emphasize that their priors are unknown and of the frequency type. The main approaches to LFDR estimation, relying either on fully parametric models to maximize likelihood or on the presence of enough hypotheses for nonparametric density estimation, lack the simplicity and generality of adjusted p-values. To begin filling the gap, this paper introduces simple methods of LFDR estimation with proven asymptotic conservatism without assuming the parameter distribution is in a parametric family. Simulations indicate that they remain conservative even for very small numbers of hypotheses. One of the proposed procedures enables interpreting the original FDR control rule in terms of LFDR estimation, thereby facilitating practical use of the former. The most conservative of the new procedures is applied to measured abundance levels of 20 proteins.


Subject(s)
Models, Genetic , Algorithms , Bayes Theorem , Case-Control Studies , Computer Simulation , Data Interpretation, Statistical , False Positive Reactions , Gene Expression , Humans , Models, Statistical , Monte Carlo Method , Proteomics
9.
Article in English | MEDLINE | ID: mdl-23702547

ABSTRACT

Many genome-wide association studies have been conducted to identify single nucleotide polymorphisms (SNPs) that are associated with particular diseases or other traits. The local false discovery rate (LFDR) estimated using semiparametric models has enjoyed success in simultaneous inference. However, semiparametric LFDR estimators can be biased because they tend to overestimate the proportion of the nonassociated SNPs. We address the problem by adapting a simple parametric mixture model (PMM) and by comparing this model to the semiparametric mixture model (SMM) behind an LFDR estimator that is known to be conservatively biased. We also compare the PMM with a parametric nonmixture model (PNM). In our simulation studies, we thoroughly analyze the performances of the three models under different values of p1, a prior probability that is approximately equal to the proportion of SNPs that are associated with the disease. When p1 > 10%, the PMM generally performs better than the SMM. When p1 < 0.1%, the SMM outperforms the PMM. When p1 lies between 0.1% and 10%, both methods have about the same performance. In that setting, the PMM may be preferred since it has the advantage of supplying an estimate of the detectability level of the nonassociated SNPs.
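A minimal sketch of a two-component parametric mixture of the kind described, fitted to simulated GWAS z-statistics by EM, with each SNP's LFDR taken as the posterior probability of the null component; the simulated data, the normal alternative component, and the starting values are assumptions, so this is a generic PMM rather than the authors' exact model.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Simulated z-statistics: a proportion p1 of SNPs associated with the trait.
p1_true = 0.05
n = 50_000
assoc = rng.random(n) < p1_true
z = np.where(assoc, rng.normal(3.0, 1.0, n), rng.normal(0.0, 1.0, n))

# EM for the mixture p0*N(0,1) + p1*N(mu, sigma^2); the null component is fixed.
p0, mu, sigma = 0.9, 2.0, 1.0  # starting values (assumed)
for _ in range(200):
    f0 = norm.pdf(z)                  # null density, N(0, 1)
    f1 = norm.pdf(z, mu, sigma)       # alternative density
    lfdr = p0 * f0 / (p0 * f0 + (1 - p0) * f1)  # E-step: P(null | z)
    w = 1 - lfdr                      # weight on the alternative component
    p0 = lfdr.mean()                  # M-step updates
    mu = np.average(z, weights=w)
    sigma = np.sqrt(np.average((z - mu) ** 2, weights=w))

print("estimated p1:", 1 - p0, " mu:", mu, " sigma:", sigma)
print("SNPs with LFDR < 0.2:", int((lfdr < 0.2).sum()))
```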


Subject(s)
Computational Biology/methods , Genome-Wide Association Study/methods , Models, Genetic , Models, Statistical , Bayes Theorem , Computer Simulation , Humans , Polymorphism, Single Nucleotide
10.
BMC Bioinformatics ; 14: 87, 2013 Mar 06.
Article in English | MEDLINE | ID: mdl-23497228

ABSTRACT

BACKGROUND: In investigating differentially expressed genes or other selected features, researchers conduct hypothesis tests to determine which biological categories, such as those of the Gene Ontology (GO), are enriched for the selected features. Multiple comparison procedures (MCPs) are commonly used to prevent excessive false positive rates. Traditional MCPs, e.g., the Bonferroni method, go to the opposite extreme: strictly controlling a family-wise error rate, resulting in excessive false negative rates. Researchers generally prefer the more balanced approach of instead controlling the false discovery rate (FDR). However, the q-values that methods of FDR control assign to biological categories tend to be too low to reliably estimate the probability that a biological category is not enriched for the preselected features. Thus, we study an application of the other estimators of that probability, which is called the local FDR (LFDR). RESULTS: We considered five LFDR estimators for detecting enriched GO terms: a binomial-based estimator (BBE), a maximum likelihood estimator (MLE), a normalized MLE (NMLE), a histogram-based estimator assuming a theoretical null hypothesis (HBE), and a histogram-based estimator assuming an empirical null hypothesis (HBE-EN). Since NMLE depends not only on the data but also on the specified value of π0, the proportion of non-enriched GO terms, it is only advantageous when either π0 is already known with sufficient accuracy or there are data for only 1 GO term. By contrast, the other estimators work without specifying π0 but require data for at least 2 GO terms. Our simulation studies yielded the following summaries of the relative performance of each of those four estimators. HBE and HBE-EN produced larger biases for 2, 4, 8, 32, and 100 GO terms than BBE and MLE. BBE has the lowest bias if π0 is 1 and if the number of GO terms is between 2 and 32. The bias of MLE is no worse than that of BBE for 100 GO terms even when the ideal number of components in its underlying mixture model is unknown, but has high bias when the number of GO terms is small compared to the number of estimated parameters. For unknown values of π0, BBE has the lowest bias for a small number of GO terms (2-32 GO terms), and MLE has the lowest bias for a medium number of GO terms (100 GO terms). CONCLUSIONS: For enrichment detection, we recommend estimating the LFDR by MLE given at least a medium number of GO terms, by BBE given a small number of GO terms, and by NMLE given either only 1 GO term or precise knowledge of π0.
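For orientation, a small sketch of the per-category enrichment test that produces the p-values such LFDR estimators take as input, using a standard hypergeometric (Fisher-type) over-representation test; the gene counts are hypothetical, and the BBE, MLE, NMLE, and HBE estimators studied in the paper are not reproduced here.

```python
from scipy.stats import hypergeom

# Hypothetical counts for one GO term:
N = 20_000  # genes on the array
K = 150     # genes annotated to this GO term
n = 400     # differentially expressed (selected) genes
k = 12      # selected genes annotated to this GO term

# P(at least k annotated genes among the n selected), under no enrichment.
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value for this GO term: {p_value:.3g}")

# Repeating this over all GO terms yields the p-values (or z-scores) to
# which an LFDR estimator such as BBE or MLE would then be applied.
```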


Subject(s)
Gene Expression Profiling/methods , Vocabulary, Controlled , Bayes Theorem , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Female , Genes , Humans , Likelihood Functions , Oligonucleotide Array Sequence Analysis , Probability
11.
Stat Appl Genet Mol Biol ; 11(5): 4, 2012 Oct 12.
Article in English | MEDLINE | ID: mdl-23079518

ABSTRACT

Histogram-based empirical Bayes methods developed for analyzing data for large numbers of genes, SNPs, or other biological features tend to have large biases when applied to data with a smaller number of features such as genes with expression measured conventionally, proteins, and metabolites. To analyze such small-scale and medium-scale data in an empirical Bayes framework, we introduce corrections of maximum likelihood estimators (MLEs) of the local false discovery rate (LFDR). In this context, the MLE estimates the LFDR, which is a posterior probability of null hypothesis truth, by estimating the prior distribution. The corrections lie in excluding each feature when estimating one or more parameters on which the prior depends. In addition, we propose the expected LFDR (ELFDR) in order to propagate the uncertainty involved in estimating the prior. We also introduce an optimally weighted combination of the best of the corrected MLEs with a previous estimator that, being based on a binomial distribution, does not require a parametric model of the data distribution across features. An application of the new estimators and previous estimators to protein abundance data illustrates the extent to which different estimators lead to different conclusions about which proteins are affected by cancer. A simulation study was conducted to approximate the bias of the new estimators relative to previous LFDR estimators. Data were simulated for two different numbers of features (N), two different noncentrality parameter values or detectability levels (d_alt), and several proportions of unaffected features (p0). One of these previous estimators is a histogram-based estimator (HBE) designed for a large number of features. The simulations show that some of the corrected MLEs and the ELFDR that corrects the HBE reduce the negative bias relative to the MLE and the HBE, respectively. For every method, we defined the worst-case performance as the maximum of the absolute value of the bias over the two different d_alt and over various p0. The best worst-case methods represent the safest methods to be used under given conditions. This analysis indicates that the binomial-based method has the lowest worst-case absolute bias for high p0 and for N = 3, 12. However, the corrected MLE that is based on the minimum description length (MDL) principle is the best worst-case method when the value of p0 is more uncertain since it has one of the lowest worst-case biases over all possible values of p0 and for N = 3, 12. Therefore, the safest estimator considered is the binomial-based method when a high proportion of unaffected features can be assumed and the MDL-based method otherwise. A second simulation study was conducted with additional values of N. We found that HBE requires N to be at least 6-12 features to perform as well as the estimators proposed here, with the precise minimum N depending on p0 and d_alt.
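A minimal sketch of the leave-one-out style of correction described here: when computing the LFDR of feature i, the parameter on which the prior depends (a crude null-proportion estimate p0 in this sketch) is re-estimated with feature i excluded. The toy data, the p0 estimator, and the fixed alternative density are assumptions, not the paper's corrected MLEs or the ELFDR.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Small-scale data: z-statistics for N proteins, a few of them affected.
z = np.concatenate([rng.normal(0, 1, 15), rng.normal(3, 1, 3)])
N = z.size

def lfdr_two_groups(z_i, p0, mu1=3.0, sigma1=1.0):
    """Two-groups LFDR with a fixed (assumed) alternative component."""
    f0 = norm.pdf(z_i)
    f1 = norm.pdf(z_i, mu1, sigma1)
    return min(1.0, p0 * f0 / (p0 * f0 + (1 - p0) * f1))

def estimate_p0(z_subset):
    """Crude null-proportion estimate from the share of small |z| values."""
    return min(1.0, np.mean(np.abs(z_subset) < 1.96) / 0.95)

# Uncorrected: p0 estimated once from all features, including feature i.
p0_all = estimate_p0(z)
lfdr_plain = [lfdr_two_groups(zi, p0_all) for zi in z]

# Corrected: feature i is excluded when estimating the prior parameter p0.
lfdr_loo = [
    lfdr_two_groups(z[i], estimate_p0(np.delete(z, i))) for i in range(N)
]

print("plain    :", np.round(lfdr_plain[-3:], 3))  # the three affected features
print("corrected:", np.round(lfdr_loo[-3:], 3))
```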


Subject(s)
Data Interpretation, Statistical , Likelihood Functions , Models, Statistical , Bayes Theorem , Biology , Genes/genetics , Probability
12.
Stat Appl Genet Mol Biol ; 11(3): Article 7, 2012 Feb 21.
Article in English | MEDLINE | ID: mdl-22499708

ABSTRACT

Problems involving thousands of null hypotheses have been addressed by estimating the local false discovery rate (LFDR). A previous LFDR approach to reporting point and interval estimates of an effect-size parameter uses an estimate of the prior distribution of the parameter conditional on the alternative hypothesis. That estimated prior is often unreliable, and yet strongly influences the posterior intervals and point estimates, causing the posterior intervals to differ from fixed-parameter confidence intervals, even for arbitrarily small estimates of the LFDR. That influence of the estimated prior manifests the failure of the conditional posterior intervals, given the truth of the alternative hypothesis, to match the confidence intervals. Those problems are overcome by changing the posterior distribution conditional on the alternative hypothesis from a Bayesian posterior to a confidence posterior. Unlike the Bayesian posterior, the confidence posterior equates the posterior probability that the parameter lies in a fixed interval with the coverage rate of the coinciding confidence interval. The resulting confidence-Bayes hybrid posterior supplies interval and point estimates that shrink toward the null hypothesis value. The confidence intervals tend to be much shorter than their fixed-parameter counterparts, as illustrated with gene expression data. Simulations nonetheless confirm that the shrunken confidence intervals cover the parameter more frequently than stated. Generally applicable sufficient conditions for correct coverage are given. In addition to having those frequentist properties, the hybrid posterior can also be motivated from an objective Bayesian perspective by requiring coherence with some default prior conditional on the alternative hypothesis. That requirement generates a new class of approximate posteriors that supplement Bayes factors modified for improper priors and that dampen the influence of proper priors on the credibility intervals. While that class of posteriors intersects the class of confidence-Bayes posteriors, neither class is a subset of the other. In short, two first principles generate both classes of posteriors: a coherence principle and a relevance principle. The coherence principle requires that all effect size estimates comply with the same probability distribution. The relevance principle means effect size estimates given the truth of an alternative hypothesis cannot depend on whether that truth was known prior to observing the data or whether it was learned from the data.


Subject(s)
Gene Expression Profiling/methods , Models, Statistical , Algorithms , Bayes Theorem , Computer Simulation , Confidence Intervals
13.
Biometrics ; 67(2): 363-70, 2011 Jun.
Article in English | MEDLINE | ID: mdl-20880014

ABSTRACT

In a novel approach to the multiple testing problem, Efron (2004, Journal of the American Statistical Association 99, 96-104; 2007a, Journal of the American Statistical Association 102, 93-103; 2007b, Annals of Statistics 35, 1351-1377) formulated estimators of the distribution of test statistics or nominal p-values under a null distribution suitable for modeling the data of thousands of unaffected genes, nonassociated single-nucleotide polymorphisms, or other biological features. Estimators of the null distribution can improve not only the empirical Bayes procedure for which it was originally intended, but also many other multiple-comparison procedures. Such estimators in some cases improve the proposed multiple-comparison procedure (MCP) based on a recent non-Bayesian framework of minimizing expected loss with respect to a confidence posterior, a probability distribution of confidence levels. The flexibility of that MCP is illustrated with a nonadditive loss function designed for genomic screening rather than for validation. The merit of estimating the null distribution is examined from the vantage point of the confidence-posterior MCP (CPMCP). In a generic simulation study of genome-scale multiple testing, conditioning the observed confidence level on the estimated null distribution as an approximate ancillary statistic markedly improved conditional inference. Specifically simulating gene expression data, however, indicates that estimation of the null distribution tends to exacerbate the conservative bias that results from modeling heavy-tailed data distributions with the normal family. To enable researchers to determine whether to rely on a particular estimated null distribution for inference or decision making, an information-theoretic score is provided. As the sum of the degree of ancillarity and the degree of inferential relevance, the score reflects the balance conditioning would strike between the two conflicting terms. The CPMCP and other methods introduced are applied to gene expression microarray data.
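A hedged sketch of the empirical-null idea the abstract builds on: rather than assuming the z-scores of unaffected features follow N(0, 1), the null mean and standard deviation are estimated from the central bulk of the observed z-scores (here by a simple quantile match; Efron's maximum likelihood fit and the confidence-posterior MCP itself are not reproduced).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

# Simulated z-scores whose "null" genes are overdispersed (sd 1.3, mean 0.2)
# because of correlation or unmodeled variation, plus a few affected genes.
z = np.concatenate([rng.normal(0.2, 1.3, 9_500), rng.normal(4.0, 1.0, 500)])

# Central matching: the median and central interquartile spread of z are
# dominated by unaffected genes, so they estimate the empirical null.
q25, q50, q75 = np.quantile(z, [0.25, 0.50, 0.75])
null_mean = q50
null_sd = (q75 - q25) / (norm.ppf(0.75) - norm.ppf(0.25))
print(f"empirical null: N({null_mean:.2f}, {null_sd:.2f}^2) vs assumed N(0, 1)")

# Tail areas change accordingly, e.g. for a gene with z = 3:
print("theoretical-null p:", 2 * norm.sf(3))
print("empirical-null  p:", 2 * norm.sf(abs(3 - null_mean) / null_sd))
```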


Subject(s)
Genomics/methods , Models, Statistical , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Genomics/statistics & numerical data , Oligonucleotide Array Sequence Analysis/methods , Statistical Distributions
14.
Stat Appl Genet Mol Biol ; 9: Article23, 2010.
Article in English | MEDLINE | ID: mdl-20597849

ABSTRACT

Research on analyzing microarray data has focused on the problem of identifying differentially expressed genes to the neglect of the problem of how to integrate evidence that a gene is differentially expressed with information on the extent of its differential expression. Consequently, researchers currently prioritize genes for further study either on the basis of volcano plots or, more commonly, according to simple estimates of the fold change after filtering the genes with an arbitrary statistical significance threshold. While the subjective and informal nature of the former practice precludes quantification of its reliability, the latter practice is equivalent to using a hard-threshold estimator of the expression ratio that is not known to perform well in terms of mean-squared error, the sum of estimator variance and squared estimator bias. On the basis of two distinct simulation studies and data from different microarray studies, we systematically compared the performance of several estimators representing both current practice and shrinkage. We find that the threshold-based estimators usually perform worse than the maximum-likelihood estimator (MLE) and they often perform far worse as quantified by estimated mean-squared risk. By contrast, the shrinkage estimators tend to perform as well as or better than the MLE and never much worse than the MLE, as expected from what is known about shrinkage. However, a Bayesian measure of performance based on the prior information that few genes are differentially expressed indicates that hard-threshold estimators perform about as well as the local false discovery rate (FDR), the best of the shrinkage estimators studied. Based on the ability of the latter to leverage information across genes, we conclude that the use of the local-FDR estimator of the fold change instead of informal or threshold-based combinations of statistical tests and non-shrinkage estimators can be expected to substantially improve the reliability of gene prioritization at very little risk of doing so less reliably. Since the proposed replacement of post-selection estimates with shrunken estimates applies as well to other types of high-dimensional data, it could also improve the analysis of SNP data from genome-wide association studies.
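A minimal sketch contrasting the two practices compared here: a hard-threshold estimator that reports the raw log fold change only when a significance filter is passed, and an LFDR-weighted shrinkage estimator that pulls each gene's estimate toward zero in proportion to the evidence against differential expression. The per-gene numbers are placeholders; in practice the LFDRs would come from an empirical Bayes fit.

```python
import numpy as np

# Hypothetical per-gene observed log2 fold changes, p-values, and LFDRs.
log2_fc = np.array([2.1, 1.4, 0.9, 0.3, -1.8])
p_value = np.array([0.001, 0.03, 0.08, 0.60, 0.004])
lfdr = np.array([0.02, 0.20, 0.45, 0.95, 0.05])

# Hard-threshold estimator: keep the raw estimate if p < 0.05, else 0.
hard = np.where(p_value < 0.05, log2_fc, 0.0)

# Shrinkage estimator: posterior-mean style weighting toward the null value 0.
shrunk = (1 - lfdr) * log2_fc

print("hard-threshold:", hard)
print("LFDR-shrunken :", shrunk)
```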


Subject(s)
Gene Expression Profiling/statistics & numerical data , Likelihood Functions , Models, Statistical , False Positive Reactions , Gene Expression Regulation , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Polymorphism, Single Nucleotide
15.
PLoS One ; 5(3): e9834, 2010 Mar 24.
Article in English | MEDLINE | ID: mdl-20352092

ABSTRACT

BACKGROUND/AIM: Incomplete or limited long-chain fatty acid (LCFA) combustion in skeletal muscle has been associated with insulin resistance. Signals that are responsive to shifts in LCFA beta-oxidation rate or degree of intramitochondrial catabolism are hypothesized to regulate second messenger systems downstream of the insulin receptor. Recent evidence supports a causal link between mitochondrial LCFA combustion in skeletal muscle and insulin resistance. We have used unbiased metabolite profiling of mouse muscle mitochondria with the aim of identifying candidate metabolites, within or effluxed from mitochondria, that shift with LCFA combustion rate. METHODOLOGY/PRINCIPAL FINDINGS: Large-scale unbiased metabolomics analysis was performed using GC/TOF-MS on buffer and mitochondrial matrix fractions obtained prior to and after 20 min of palmitate catabolism (n = 7 mice/condition). Three palmitate concentrations (2, 9 and 19 microM; corresponding to low, intermediate and high oxidation rates) and 9 microM palmitate plus tricarboxylic acid (TCA) cycle and electron transport chain inhibitors were each tested and compared to zero palmitate control incubations. Paired comparisons of the 0 and 20 min samples were made by Student's t-test. False discovery rates were estimated and Type I error rates assigned. Major metabolite groups were organic acids, amines and amino acids, free fatty acids and sugar phosphates. Palmitate oxidation was associated with unique profiles of metabolites, a subset of which correlated to palmitate oxidation rate. In particular, palmitate oxidation rate was associated with distinct changes in the levels of TCA cycle intermediates within and effluxed from mitochondria. CONCLUSIONS/SIGNIFICANCE: This proof-of-principle study establishes that large-scale metabolomics methods can be applied to organelle-level models to discover metabolite patterns reflective of LCFA combustion, which may lead to identification of molecules linking muscle fat metabolism and insulin signaling. Our results suggest that future studies should focus on the fate of effluxed TCA cycle intermediates and on mechanisms ensuring their replenishment during LCFA metabolism in skeletal muscle.


Subject(s)
Fatty Acids/metabolism , Mitochondria, Muscle/metabolism , Muscle, Skeletal/metabolism , 3-Hydroxybutyric Acid/pharmacology , Animals , Chromatography, Gas/methods , Female , Insulin Resistance , Ketones/metabolism , Mass Spectrometry/methods , Metabolomics/methods , Mice , Mice, Inbred C57BL , Mitochondria/metabolism , Oxygen/metabolism
16.
BMC Bioinformatics ; 11: 63, 2010 Jan 28.
Article in English | MEDLINE | ID: mdl-20109217

ABSTRACT

BACKGROUND: Sustained research on the problem of determining which genes are differentially expressed on the basis of microarray data has yielded a plethora of statistical algorithms, each justified by theory, simulation, or ad hoc validation and yet differing in practical results from equally justified algorithms. Recently, a concordance method that measures agreement among gene lists has been introduced to assess various aspects of differential gene expression detection. This method has the advantage of basing its assessment solely on the results of real data analyses, but as it requires examining gene lists of given sizes, it may be unstable. RESULTS: Two methodologies for assessing predictive error are described: a cross-validation method and a posterior predictive method. As a nonparametric method of estimating prediction error from observed expression levels, cross validation provides an empirical approach to assessing algorithms for detecting differential gene expression that is fully justified for large numbers of biological replicates. Because it leverages the knowledge that only a small portion of genes are differentially expressed, the posterior predictive method is expected to provide more reliable estimates of algorithm performance, allaying concerns about limited biological replication. In practice, the posterior predictive method can assess when its approximations are valid and when they are inaccurate. Under conditions in which its approximations are valid, it corroborates the results of cross validation. Both comparison methodologies are applicable to both single-channel and dual-channel microarrays. For the data sets considered, estimating prediction error by cross validation demonstrates that empirical Bayes methods based on hierarchical models tend to outperform algorithms based on selecting genes by their fold changes or by non-hierarchical model-selection criteria. (The latter two approaches have comparable performance.) The posterior predictive assessment corroborates these findings. CONCLUSIONS: Algorithms for detecting differential gene expression may be compared by estimating each algorithm's error in predicting expression ratios, whether such ratios are defined across microarray channels or between two independent groups. According to two distinct estimators of prediction error, algorithms using hierarchical models outperform the other algorithms of the study. The fact that fold-change shrinkage performed as well as conventional model selection criteria calls for investigating algorithms that combine the strengths of significance testing and fold-change estimation.


Subject(s)
Algorithms , Data Interpretation, Statistical , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods
17.
Bioinformatics ; 25(6): 772-9, 2009 Mar 15.
Article in English | MEDLINE | ID: mdl-19218351

ABSTRACT

MOTIVATION: Measurements of gene expression over time enable the reconstruction of transcriptional networks. However, Bayesian networks and many other current reconstruction methods rely on assumptions that conflict with the differential equations that describe transcriptional kinetics. Practical approximations of kinetic models would enable inferring causal relationships between genes from expression data of microarray, tag-based and conventional platforms, but conclusions are sensitive to the assumptions made. RESULTS: The representation of a sufficiently large portion of the genome enables computation of an upper bound on how much confidence one may place in influences between genes on the basis of expression data. Information about which genes encode transcription factors is not necessary but may be incorporated if available. The methodology is generalized to cover cases in which expression measurements are missing for many of the genes that might control the transcription of the genes of interest. The assumption that the gene expression level is roughly proportional to the rate of translation led to better empirical performance than did either the assumption that the gene expression level is roughly proportional to the protein level or the Bayesian model average of both assumptions. AVAILABILITY: http://www.oisb.ca points to R code implementing the methods (R Development Core Team 2004). SUPPLEMENTARY INFORMATION: http://www.davidbickel.com.


Subject(s)
Gene Regulatory Networks , Transcription, Genetic , Bayes Theorem , Kinetics , Models, Genetic , Oligonucleotide Array Sequence Analysis
18.
Stat Appl Genet Mol Biol ; 7(1): Article10, 2008.
Article in English | MEDLINE | ID: mdl-18384263

ABSTRACT

The level of differential gene expression may be defined as a fold change, a frequency of upregulation, or some other measure of the degree or extent of a difference in expression across groups of interest. On the basis of expression data for hundreds or thousands of genes, inferring which genes are differentially expressed or ranking genes in order of priority introduces a bias in estimates of their differential expression levels. A previous correction of this feature selection bias suffers from a lack of generality in the method of ranking genes, from requiring many biological replicates, and from unnecessarily overcompensating for the bias. For any method of ranking genes on the basis of gene expression measured for as few as three biological replicates, a simple leave-one-out algorithm corrects, with less overcompensation, the bias in estimates of the level of differential gene expression. In a microarray data set, the bias correction reduces estimates of the probability of upregulation or downregulation from 100% to as low as 60%, even for genes with estimated local false discovery rates close to 0. A simulation study quantifies both the advantage of smoothing estimates of bias before correction and the degree of overcompensation.
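A hedged sketch of the leave-one-out idea: each biological replicate is held out in turn, genes are ranked and selected on the remaining replicates, and the held-out replicate supplies an estimate for the selected genes that is free of selection bias, so the gap between the two averages estimates the bias to subtract. The simulated data, the ranking rule, and the number of selected genes are assumptions rather than the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical log-ratios: genes x biological replicates, mostly null genes.
n_genes, n_reps, n_top = 1_000, 4, 20
true_effect = np.where(rng.random(n_genes) < 0.05, 1.0, 0.0)
data = true_effect[:, None] + rng.normal(0, 1, (n_genes, n_reps))

bias_estimates = []
for r in range(n_reps):
    rest = np.delete(data, r, axis=1)         # replicates used for ranking
    ranking_means = rest.mean(axis=1)
    top = np.argsort(ranking_means)[-n_top:]  # genes selected on these data
    # Selection-biased estimate vs. the held-out replicate for the same genes.
    bias_estimates.append(ranking_means[top].mean() - data[top, r].mean())

bias = np.mean(bias_estimates)
naive_top = np.argsort(data.mean(axis=1))[-n_top:]
naive_estimate = data[naive_top].mean()
print(f"estimated selection bias: {bias:.2f}")
print(f"naive top-gene mean: {naive_estimate:.2f},"
      f" corrected: {naive_estimate - bias:.2f}")
```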


Subject(s)
Gene Expression , Oligonucleotide Array Sequence Analysis , Selection Bias , Algorithms
19.
Plant Mol Biol ; 66(5): 551-63, 2008 Mar.
Article in English | MEDLINE | ID: mdl-18224447

ABSTRACT

Allelic differences in expression are important genetic factors contributing to quantitative trait variation in various organisms. However, the extent of genome-wide allele-specific expression by different modes of gene regulation has not been well characterized in plants. In this study we developed a new methodology for allele-specific expression analysis by applying Massively Parallel Signature Sequencing (MPSS), an open ended and sequencing based mRNA profiling technology. This methodology enabled a genome-wide evaluation of cis- and trans-effects on allelic expression in six meristem stages of the maize hybrid. Summarization of data from nearly 400 pairs of MPSS allelic signature tags showed that 60% of the genes in the hybrid meristems exhibited differential allelic expression. Because both alleles are subjected to the same trans-acting factors in the hybrid, the data suggest the abundance of cis-regulatory differences in the genome. Comparing the same allele expressed in the hybrid versus its inbred parents showed that 40% of the genes were differentially expressed, suggesting different trans-acting effects present in different genotypes. Such trans-acting effects may result in gene expression in the hybrid different from allelic additive expression. With this approach we quantified gene expression in the hybrid relative to its inbred parents at the allele-specific level. As compared to measuring total transcript levels, this study provides a new level of understanding of different modes of gene regulation in the hybrid and the molecular basis of heterosis.


Subject(s)
Alleles , Gene Expression Regulation, Plant/genetics , Genome, Plant/genetics , Meristem/genetics , Software , Zea mays/growth & development , Zea mays/genetics , Gene Expression Profiling , Plants, Genetically Modified , RNA, Messenger/genetics
20.
Bioinformatics ; 21(7): 1121-8, 2005 Apr 01.
Article in English | MEDLINE | ID: mdl-15546939

ABSTRACT

MOTIVATION: The reconstruction of gene networks from gene-expression microarrays is gaining popularity as methods improve and as more data become available. The reliability of such networks could be judged by the probability that a connection between genes is spurious, resulting from chance fluctuations rather than from a true biological relationship. RESULTS: Unlike the false discovery rate and positive false discovery rate, the decisive false discovery rate (dFDR) is exactly equal to a conditional probability without assuming independence or the randomness of hypothesis truth values. This property is useful not only in the common application to the detection of differential gene expression, but also in determining the probability of a spurious connection in a reconstructed gene network. Estimators of the dFDR can estimate each of three probabilities: (1) The probability that two genes that appear to be associated with each other lack such association. (2) The probability that a time ordering observed for two associated genes is misleading. (3) The probability that a time ordering observed for two genes is misleading, either because they are not associated or because they are associated without a lag in time. The first probability applies to both static and dynamic gene networks, and the other two only apply to dynamic gene networks.


Subject(s)
Algorithms , Gene Expression Profiling/methods , Gene Expression Regulation/physiology , Models, Genetic , Oligonucleotide Array Sequence Analysis/methods , Protein Interaction Mapping/methods , Signal Transduction/physiology , Transcription Factors/metabolism , Models, Statistical , Software , Time Factors , Transcription Factors/genetics