1.
ArXiv ; 2023 Sep 26.
Article in English | MEDLINE | ID: mdl-37744467

ABSTRACT

Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Subsequently, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish the identification conditions of various effects and non-asymptotic error bounds. We show effective Type-I error control of asymptotic $z$-tests as sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate by the Benjamini-Hochberg procedure and is more powerful than alternative methods. By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting confounding effects when significant covariates are absent from the model.
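
The paper's three-stage projection-and-correction estimator is specific to that work; as a loose illustration of the general workflow it addresses (adjust for latent confounders, test each response, control FDR), the following Python sketch uses an SVD-based factor estimate as a simplified stand-in. The data-generating setup, dimensions, and variable names are invented for illustration.

```python
# Simplified stand-in (not the paper's estimator): estimate latent confounders
# from residuals, re-test each gene with the factors included, then apply BH.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n, p, k = 200, 500, 2                      # samples, genes, assumed latent factors
x = rng.integers(0, 2, n).astype(float)    # primary covariate (e.g., group label)
u = rng.normal(size=(n, k))                # unobserved confounders
beta = np.zeros(p); beta[:25] = 1.0        # a few truly affected genes
y = np.outer(x, beta) + u @ rng.normal(size=(k, p)) + rng.normal(size=(n, p))

# Stage A: residualize out the observed covariate, estimate latent factors by SVD.
x1 = sm.add_constant(x)
resid = y - x1 @ np.linalg.lstsq(x1, y, rcond=None)[0]
factors = np.linalg.svd(resid, full_matrices=False)[0][:, :k]

# Stage B: per-gene regression on covariate + estimated factors, collect p-values.
design = sm.add_constant(np.column_stack([x, factors]))
pvals = np.array([sm.OLS(y[:, j], design).fit().pvalues[1] for j in range(p)])

# Stage C: Benjamini-Hochberg control of the false discovery rate at 10%.
reject = multipletests(pvals, alpha=0.10, method="fdr_bh")[0]
print("discoveries:", reject.sum())
```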

2.
Proc Natl Acad Sci U S A ; 118(51), 2021 12 21.
Article in English | MEDLINE | ID: mdl-34903655

ABSTRACT

Short-term forecasts of traditional streams from public health reporting (such as cases, hospitalizations, and deaths) are a key input to public health decision-making during a pandemic. Since early 2020, our research group has worked with data partners to collect, curate, and make publicly available numerous real-time COVID-19 indicators, providing multiple views of pandemic activity in the United States. This paper studies the utility of five such indicators-derived from deidentified medical insurance claims, self-reported symptoms from online surveys, and COVID-related Google search activity-from a forecasting perspective. For each indicator, we ask whether its inclusion in an autoregressive (AR) model leads to improved predictive accuracy relative to the same model excluding it. Such an AR model, without external features, is already competitive with many top COVID-19 forecasting models in use today. Our analysis reveals that 1) inclusion of each of these five indicators improves on the overall predictive accuracy of the AR model; 2) predictive gains are in general most pronounced during times in which COVID cases are trending in "flat" or "down" directions; and 3) one indicator, based on Google searches, seems to be particularly helpful during "up" trends.
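
As a toy illustration of the comparison described above (not the authors' forecasting pipeline), the sketch below fits a lagged autoregressive regression with and without one auxiliary indicator and compares rolling one-step-ahead errors. Both series and all settings are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
T, lags = 300, 3
indicator = np.cumsum(rng.normal(size=T))              # stand-in for a claims/survey/search signal
cases = np.zeros(T)
for t in range(1, T):
    cases[t] = 0.8 * cases[t - 1] + 0.3 * indicator[t - 1] + rng.normal()

def lagged_design(series_list, lags):
    # rows correspond to targets cases[lags:]; features are the `lags` previous values of each series
    return np.column_stack([s[lags - j - 1: len(s) - j - 1]
                            for s in series_list for j in range(lags)])

Y = cases[lags:]
X_ar = lagged_design([cases], lags)                    # AR model features
X_arx = lagged_design([cases, indicator], lags)        # AR model plus the indicator

def rolling_mae(X, Y, train=200):
    errs = [abs(LinearRegression().fit(X[:t], Y[:t]).predict(X[t:t + 1])[0] - Y[t])
            for t in range(train, len(Y))]
    return np.mean(errs)

print("AR only:       ", rolling_mae(X_ar, Y))
print("AR + indicator:", rolling_mae(X_arx, Y))
```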


Subject(s)
COVID-19/epidemiology , Health Status Indicators , Models, Statistical , Epidemiologic Methods , Forecasting , Humans , Internet/statistics & numerical data , Surveys and Questionnaires , United States/epidemiology
3.
Proc Natl Acad Sci U S A ; 117(29): 16880-16890, 2020 07 21.
Article in English | MEDLINE | ID: mdl-32631986

ABSTRACT

We propose a general method for constructing confidence sets and hypothesis tests that have finite-sample guarantees without regularity conditions. We refer to such procedures as "universal." The method is very simple and is based on a modified version of the usual likelihood-ratio statistic that we call "the split likelihood-ratio test" (split LRT) statistic. The (limiting) null distribution of the classical likelihood-ratio statistic is often intractable when used to test composite null hypotheses in irregular statistical models. Our method is especially appealing for statistical inference in these complex setups. The method we suggest works for any parametric model and also for some nonparametric models, as long as computing a maximum-likelihood estimator (MLE) is feasible under the null. Canonical examples arise in mixture modeling and shape-constrained inference, for which constructing tests and confidence sets has been notoriously difficult. We also develop various extensions of our basic methods. We show that in settings when computing the MLE is hard, for the purpose of constructing valid tests and intervals, it is sufficient to upper bound the maximum likelihood. We investigate some conditions under which our methods yield valid inferences under model misspecification. Further, the split LRT can be used with profile likelihoods to deal with nuisance parameters, and it can also be run sequentially to yield anytime-valid P values and confidence sequences. Finally, when combined with the method of sieves, it can be used to perform model selection with nested model classes.
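
A minimal sketch of the split LRT in the simplest possible setting, testing H0: mu = 0 in a N(mu, 1) model: split the data at random, compute the MLE on one half, evaluate the likelihood ratio on the other half, and reject when it exceeds 1/alpha. The Gaussian example and all numbers are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

def split_lrt_reject(x, mu0=0.0, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    d0, d1 = x[idx[: len(x) // 2]], x[idx[len(x) // 2:]]   # random split
    mu_hat1 = d1.mean()                                     # MLE from the second half
    # Likelihood ratio evaluated on the first half: L0(mu_hat1) / L0(mu0).
    # Under H0 its expectation is 1, so rejecting when it exceeds 1/alpha
    # controls the Type-I error at alpha by Markov's inequality.
    log_t = norm.logpdf(d0, mu_hat1).sum() - norm.logpdf(d0, mu0).sum()
    return log_t >= np.log(1.0 / alpha)

x = np.random.default_rng(1).normal(loc=0.4, scale=1.0, size=200)
print("reject H0: mu = 0 ->", split_lrt_reject(x))
```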

4.
Elife ; 7, 2018 03 29.
Article in English | MEDLINE | ID: mdl-29595474

ABSTRACT

Animal cells within a tissue typically display a striking regularity in their size. To date, the molecular mechanisms that control this uniformity are still unknown. We have previously shown that size uniformity in animal cells is promoted, in part, by size-dependent regulation of G1 length. To identify the molecular mechanisms underlying this process, we performed a large-scale small molecule screen and found that the p38 MAPK pathway is involved in coordinating cell size and cell cycle progression. Small cells display higher p38 activity and spend more time in G1 than larger cells. Inhibition of p38 MAPK leads to loss of the compensatory G1 length extension in small cells, resulting in faster proliferation, smaller cell size and increased size heterogeneity. We propose a model wherein the p38 pathway responds to changes in cell size and regulates G1 exit accordingly, to increase cell size uniformity.


Subject(s)
Cell Size , Epithelial Cells/physiology , G1 Phase , Signal Transduction , p38 Mitogen-Activated Protein Kinases/metabolism , Cell Line , Humans , Social Control, Formal
5.
Proc Natl Acad Sci U S A ; 114(12): 3002-3003, 2017 03 21.
Article in English | MEDLINE | ID: mdl-28265072
7.
BMC Genomics ; 15 Suppl 6: S9, 2014.
Article in English | MEDLINE | ID: mdl-25572914

ABSTRACT

BACKGROUND: Phylogenetic birth-death models are opening a new window on the processes of genome evolution in studies of the evolution of gene and protein families, protein-protein interaction networks, microRNAs, and copy number variation. Given a species tree and a set of genomic characters in present-day species, the birth-death approach estimates the most likely rates required to explain the observed data and returns the expected ancestral character states and the history of character state changes. Achieving a balance between model complexity and generalizability is a fundamental challenge in the application of birth-death models. While more parameters promise greater accuracy and more biologically realistic models, increasing model complexity can lead to overfitting and a heavy computational cost. RESULTS: Here we present a systematic, empirical investigation of these tradeoffs, using protein domain families in six metazoan genomes as a case study. We compared models of increasing complexity, implemented in the Count program, with respect to model fit, robustness, and stability. In addition, we used a bootstrapping procedure to assess estimator variability. The results show that the most complex model, which allows for both branch-specific and family-specific rate variation, achieves the best fit, without overfitting. Variance remains low with increasing complexity, except for family-specific loss rates. This variance is reduced when the number of discrete rate categories is increased. CONCLUSIONS: The work presented here evaluates model choice for genomic birth-death models in a systematic way and presents the first use of bootstrapping to assess estimator variance in birth-death models. We find that a model incorporating both lineage and family rate variation yields more accurate estimators without sacrificing generality. Our results indicate that model choice can lead to fundamentally different evolutionary conclusions, emphasizing the importance of more biologically realistic and complex models.


Subject(s)
Evolution, Molecular , Genome , Genomics/methods , Models, Genetic , Phylogeny
8.
AMIA Annu Symp Proc ; 2014: 1980-9, 2014.
Article in English | MEDLINE | ID: mdl-25954471

ABSTRACT

Chronic Kidney Disease (CKD) is a costly and complex disease affecting 20 million US adults. Recent studies suggest that care delivery changes may improve clinical outcomes and quality of patient experience while reducing costs. This study analyzes the treatment data of 8,553 CKD patients to learn practice-based clinical pathways. Patients' visit history is modeled as sequences of visits containing information on visit type, date, procedures and diagnoses. We use hierarchical clustering based on longest common subsequence (LCS) distance to discover six patient subgroups, with each subgroup differing in the distribution of demographics and health conditions. Transitions of visits with high probabilities are elicited from each patient subgroup to learn common clinical pathways and treatment durations. Insights from this study can potentially result in new evidence to support patient-centered treatment approaches, empower CKD patients to better manage their disease and its complications, and provide a review guide for clinicians.
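
A minimal sketch of the clustering step described above: compute pairwise longest-common-subsequence (LCS) distances between visit sequences, then apply hierarchical clustering. The toy visit sequences and codes are invented, not the study's data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def lcs_len(a, b):
    # classic dynamic-programming LCS length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ai in enumerate(a, 1):
        for j, bj in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ai == bj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def lcs_dist(a, b):
    # one common normalization: 1 - |LCS| / max(|a|, |b|)
    return 1.0 - lcs_len(a, b) / max(len(a), len(b))

visits = [["PCP", "LAB", "NEPH"], ["PCP", "LAB", "NEPH", "NEPH"],
          ["ER", "DIAL", "DIAL"], ["ER", "DIAL"]]
n = len(visits)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = lcs_dist(visits[i], visits[j])

labels = fcluster(linkage(squareform(D), method="average"), t=2, criterion="maxclust")
print(labels)   # subgroup assignment for each patient sequence
```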


Subject(s)
Critical Pathways , Data Mining/methods , Electronic Health Records , Renal Insufficiency, Chronic/therapy , Adult , Aged , Datasets as Topic , Female , Humans , Male , Middle Aged , Pennsylvania
9.
J Am Stat Assoc ; 108(501): 278-287, 2013.
Article in English | MEDLINE | ID: mdl-25237208

ABSTRACT

This paper introduces a new approach to prediction by bringing together two different nonparametric ideas: distribution-free inference and nonparametric smoothing. Specifically, we consider the problem of constructing nonparametric tolerance/prediction sets. We start from the general conformal prediction approach and we use a kernel density estimator as a measure of agreement between a sample point and the underlying distribution. The resulting prediction set is shown to be closely related to plug-in density level sets with carefully chosen cut-off values. Under standard smoothness conditions, we get an asymptotic efficiency result that is near optimal for a wide range of function classes. But the coverage is guaranteed whether or not the smoothness conditions hold and regardless of the sample size. The performance of our method is investigated through simulation studies and illustrated in a real data example.
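
A minimal sketch of a prediction set built from kernel density scores, using a split-conformal calibration step for simplicity rather than the paper's full conformal construction; the data and the 10% level are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=500)
train, calib = data[:250], data[250:]

kde = gaussian_kde(train)
scores = kde(calib)                               # conformity score = estimated density
alpha = 0.10
cutoff = np.quantile(scores, alpha)               # lower alpha-quantile of calibration scores

grid = np.linspace(-4, 4, 801)
pred_set = grid[kde(grid) >= cutoff]              # prediction set = level set of the KDE
print("prediction interval approx. [%.2f, %.2f]" % (pred_set.min(), pred_set.max()))
```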

10.
J Mach Learn Res ; 13: 1059-1062, 2012 Apr.
Article in English | MEDLINE | ID: mdl-26834510

ABSTRACT

We describe an R package named huge which provides easy-to-use functions for estimating high dimensional undirected graphs from data. This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010). Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortran, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high dimensional semiparametric Gaussian copula models; (3) it includes additional functions for data-dependent model selection, data generation, and graph visualization; (4) a minor convergence problem of the graphical lasso algorithm is corrected; (5) the package allows the user to apply both lossless and lossy screening rules to scale up large-scale problems, making a tradeoff between computational and statistical efficiency.
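
The huge package itself is an R library; as a rough Python analogue of its core task (estimating a sparse Gaussian graphical model with a data-driven penalty), here is a sketch using scikit-learn's GraphicalLassoCV. The covariance structure and dimensions are invented for illustration.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
# Tridiagonal covariance: each variable is correlated with its neighbors only.
cov = np.eye(5) + 0.4 * np.eye(5, k=1) + 0.4 * np.eye(5, k=-1)
X = rng.multivariate_normal(mean=np.zeros(5), cov=cov, size=400)

model = GraphicalLassoCV().fit(X)                 # penalty chosen by cross-validation
edges = np.abs(model.precision_) > 1e-6           # nonzero precision entries = graph edges
np.fill_diagonal(edges, False)
print("estimated edges:\n", edges.astype(int))
```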

11.
Ann Appl Stat ; 5(2A): 628-644, 2011 Jun 01.
Article in English | MEDLINE | ID: mdl-21892380

ABSTRACT

We introduce a new version of forward stepwise regression. Our modification finds solutions to regression problems where the selected predictors appear in a structured pattern, with respect to a predefined distance measure over the candidate predictors. Our method is motivated by the problem of predicting HIV-1 drug resistance from protein sequences. We find that our method improves the interpretability of drug resistance while producing comparable predictive accuracy to standard methods. We also demonstrate our method in a simulation study and present some theoretical results and connections.

12.
Adv Neural Inf Process Syst ; 24(2): 1432-1440, 2010 Dec 31.
Article in English | MEDLINE | ID: mdl-25152607

ABSTRACT

A challenging problem in estimating high-dimensional graphical models is to choose the regularization parameter in a data-dependent way. The standard techniques include K-fold cross-validation (K-CV), Akaike information criterion (AIC), and Bayesian information criterion (BIC). Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size diverges with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all these competing procedures.
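
A condensed sketch of the StARS idea: for each penalty value, estimate the graph on many subsamples and measure edge instability, then keep the smallest penalty whose instability stays below a threshold. This is a simplified rendering, not the authors' implementation; the data, penalty grid, and threshold are illustrative.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def estimate_edges(X, alpha):
    prec = GraphicalLasso(alpha=alpha, max_iter=200).fit(X).precision_
    E = np.abs(prec) > 1e-6
    np.fill_diagonal(E, False)
    return E

def stars_select(X, alphas, n_sub=20, beta=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b = min(int(10 * np.sqrt(n)), n)                   # subsample size suggested in the paper
    prev = max(alphas)
    for alpha in sorted(alphas, reverse=True):         # scan from most to least regularization
        freq = np.zeros((p, p))
        for _ in range(n_sub):
            idx = rng.choice(n, size=b, replace=False)
            freq += estimate_edges(X[idx], alpha)
        theta = freq / n_sub                           # edge selection frequency
        instability = (2 * theta * (1 - theta))[np.triu_indices(p, 1)].mean()
        if instability > beta:
            return prev                                # last penalty whose graph was still stable
        prev = alpha
    return prev

X = np.random.default_rng(1).normal(size=(200, 8))
print("selected penalty:", stars_select(X, alphas=[0.6, 0.4, 0.2, 0.1, 0.05]))
```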

13.
Ann Stat ; 37(5A): 2178-2201, 2009 Jan 01.
Article in English | MEDLINE | ID: mdl-19784398

ABSTRACT

This paper explores the following question: what kind of statistical guarantees can be given when doing variable selection in high dimensional models? In particular, we look at the error rates and power of some multi-stage regression methods. In the first stage we fit a set of candidate models. In the second stage we select one model by cross-validation. In the third stage we use hypothesis testing to eliminate some variables. We refer to the first two stages as "screening" and the last stage as "cleaning." We consider three screening methods: the lasso, marginal regression, and forward stepwise regression. Our method gives consistent variable selection under certain conditions.
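
A compact sketch of the screen-and-clean idea in one of its variants: screen with the lasso on one half of the data, then "clean" by testing the screened variables with least squares on the other half, keeping Bonferroni-significant coefficients. This is a simplified illustration, not the paper's exact procedure; the simulated design is invented.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 400, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = 2.0                     # three truly relevant variables
y = X @ beta + rng.normal(size=n)

half = n // 2
lasso = LassoCV(cv=5).fit(X[:half], y[:half])
screened = np.flatnonzero(lasso.coef_ != 0)            # screening stage (with CV model choice)

ols = sm.OLS(y[half:], sm.add_constant(X[half:, screened])).fit()
threshold = 0.05 / max(len(screened), 1)               # cleaning stage: Bonferroni over survivors
kept = screened[ols.pvalues[1:] < threshold]
print("selected variables:", kept)
```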

14.
Stat Sci ; 24(4): 398-413, 2009 Nov.
Article in English | MEDLINE | ID: mdl-20711421

ABSTRACT

Genetic investigations often involve the testing of vast numbers of related hypotheses simultaneously. To control the overall error rate, a substantial penalty is required, making it difficult to detect signals of moderate strength. To improve the power in this setting, a number of authors have considered using weighted p-values, with the motivation often based upon the scientific plausibility of the hypotheses. We review this literature, derive optimal weights and show that the power is remarkably robust to misspecification of these weights. We consider two methods for choosing weights in practice. The first, external weighting, is based on prior information. The second, estimated weighting, uses the data to choose weights.
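
A minimal sketch of weighted multiple testing in its simplest form: divide each p-value by its weight (weights averaging one) and apply an ordinary Bonferroni cutoff, so that a priori plausible hypotheses face a less stringent threshold. The weights below are arbitrary illustrative numbers, not derived from real prior information.

```python
import numpy as np

def weighted_bonferroni(pvals, weights, alpha=0.05):
    w = np.asarray(weights, dtype=float)
    w = w / w.mean()                        # normalize so the weights average 1
    return pvals / w <= alpha / len(pvals)  # reject where the weighted p-value clears the cutoff

pvals = np.array([1e-4, 0.003, 0.02, 0.40, 0.70])
weights = np.array([3.0, 3.0, 1.0, 0.5, 0.5])   # e.g., external prior plausibility
print(weighted_bonferroni(pvals, weights))
```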

15.
Genet Epidemiol ; 31(7): 741-7, 2007 Nov.
Article in English | MEDLINE | ID: mdl-17549760

ABSTRACT

The potential of genome-wide association analyses can only be realized when they have power to detect signals despite the detrimental effect of multiple testing. We develop a weighted multiple testing procedure that facilitates the input of prior information in the form of groupings of tests. For each group a weight is estimated from the observed test statistics within the group. Differentially weighting groups improves the power to detect signals in likely groupings. The advantage of the grouped-weighting concept, over fixed weights based on prior information, is that it often leads to an increase in power even if many of the groupings are not correlated with the signal. Being data dependent, the procedure is remarkably robust to poor choices in groupings. Power is typically improved if one (or more) of the groups clusters multiple tests with signals, yet little power is lost when the groupings are totally random. If there is no apparent signal in a group, relative to a group that appears to have several tests with signals, the former group will be down-weighted relative to the latter. If no groups show apparent signals, then the weights will be approximately equal. The only restriction on the procedure is that the number of groups be small, relative to the total number of tests performed.


Subject(s)
Genome, Human/genetics , Models, Genetic , Humans
16.
Am J Hum Genet ; 78(2): 243-52, 2006 Feb.
Article in English | MEDLINE | ID: mdl-16400608

ABSTRACT

Scanning the genome for association between markers and complex diseases typically requires testing hundreds of thousands of genetic polymorphisms. Testing such a large number of hypotheses exacerbates the trade-off between power to detect meaningful associations and the chance of making false discoveries. Even before the full genome is scanned, investigators often favor certain regions on the basis of the results of prior investigations, such as previous linkage scans. The remaining regions of the genome are investigated simultaneously because genotyping is relatively inexpensive compared with the cost of recruiting participants for a genetic study and because prior evidence is rarely sufficient to rule out these regions as harboring genes with variation conferring liability (liability genes). However, the multiple testing inherent in broad genomic searches diminishes power to detect association, even for genes falling in regions of the genome favored a priori. Multiple testing problems of this nature are well suited for application of the false-discovery rate (FDR) principle, which can improve power. To enhance power further, a new FDR approach is proposed that involves weighting the hypotheses on the basis of prior data. We present a method for using linkage data to weight the association P values. Our investigations reveal that if the linkage study is informative, the procedure improves power considerably. Remarkably, the loss in power is small, even when the linkage study is uninformative. For a class of genetic models, we calculate the sample size required to obtain useful prior information from a linkage study. This inquiry reveals that, among genetic models that are seemingly equal in genetic information, some are much more promising than others for this mode of analysis.
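
A small sketch of the general weighted-FDR idea: rescale association p-values by weights that up-weight regions favored by prior (e.g., linkage) evidence, then apply a Benjamini-Hochberg step. The weights and p-values below are invented for illustration and are not the paper's weighting rule.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([2e-6, 5e-5, 8e-4, 0.01, 0.2, 0.5, 0.8])
weights = np.array([4.0, 4.0, 1.0, 1.0, 0.5, 0.5, 0.5])   # up-weight linkage-supported regions
weights = weights / weights.mean()                        # weights average to 1

weighted_p = np.clip(pvals / weights, 0, 1)
reject = multipletests(weighted_p, alpha=0.05, method="fdr_bh")[0]
print(reject)
```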


Subject(s)
Genetic Linkage , Genetic Predisposition to Disease , Genetic Testing/methods , Genome, Human/genetics , Humans
17.
Genet Epidemiol ; 28(3): 193-206, 2005 Apr.
Article in English | MEDLINE | ID: mdl-15637716

ABSTRACT

Linkage disequilibrium (LD) in the human genome, often measured as pairwise correlation between adjacent markers, shows substantial spatial heterogeneity. Congruent with these results, studies have found that certain regions of the genome have far less haplotype diversity than expected if the alleles at multiple markers were independent, while other sets of adjacent markers behave almost independently. Regions with limited haplotype diversity have been described as "blocked" or "haplotype blocks." In this article, we propose a new method that aims to distinguish between blocked and unblocked regions in the genome. Like some other approaches, the method analyses haplotype diversity. Unlike other methods, it allows for adjacent, distinct blocks and also multiple, independent single nucleotide polymorphisms (SNPs) separating blocks. Based on an approximate likelihood model and a parsimony criterion to penalize for model complexity, the method partitions a genomic region into blocks relatively quickly, and simulations suggest that its partitions are accurate. We also propose a new, efficient method to select SNPs for association analysis, namely tag SNPs. These methods compare favorably to similar blocking and tagging methods using simulations.


Subject(s)
Genome, Human , Linkage Disequilibrium , Models, Genetic , Algorithms , Alleles , Genetic Markers , Haplotypes , Humans , Polymorphism, Single Nucleotide
18.
Genomics ; 83(6): 1169-75, 2004 Jun.
Article in English | MEDLINE | ID: mdl-15177570

ABSTRACT

We present evidence of a potentially serious source of error intrinsic to all spotted cDNA microarrays that use IMAGE clones of expressed sequence tags (ESTs). We found that a high proportion of these EST sequences contain 5'-end poly(dT) sequences that are remnants from the oligo(dT)-primed reverse transcription of polyadenylated mRNA templates used to generate EST cDNA for sequence clone libraries. Analysis of expression data from two single-dye cDNA microarray experiments showed that ESTs whose sequences contain repeats of consecutive 5'-end dT residues appeared to be strongly coexpressed, while expression data of all other sequences exhibited no such pattern. Our analysis suggests that expression data from sequences containing 5' poly(dT) tracts are more likely to be due to systematic cross-hybridization of these poly(dT) tracts than to true mRNA coexpression. This indicates that existing data generated by cDNA microarrays containing IMAGE clone ESTs should be filtered to remove expression data containing significant 5' poly(dT) tracts.
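
A minimal sketch of the filtering step suggested above: flag EST probe sequences whose 5' end begins with a long run of T residues. The probe names, sequences, and the 15-base cutoff are arbitrary illustrative choices.

```python
import re

def has_5prime_polyT(seq, min_run=15):
    # True if the sequence starts with at least `min_run` consecutive T residues
    return re.match(r"^T{%d,}" % min_run, seq.upper()) is not None

probes = {
    "IMAGE_clone_A": "TTTTTTTTTTTTTTTTTTGCAGGTCCATAG",   # made-up sequence with a 5' poly(dT) tract
    "IMAGE_clone_B": "ATGGCGTCTAAGGCTTACGATCCA",          # made-up sequence without one
}
flagged = [name for name, seq in probes.items() if has_5prime_polyT(seq)]
print("filter out:", flagged)
```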


Subject(s)
Artifacts , Expressed Sequence Tags , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Adipocytes/drug effects , Animals , Chromans/pharmacology , Humans , Mice , Poly T/analysis , RNA, Messenger/analysis , Thiazolidinediones/pharmacology , Troglitazone
19.
Genet Epidemiol ; 25(1): 36-47, 2003 Jul.
Article in English | MEDLINE | ID: mdl-12813725

ABSTRACT

It is increasingly recognized that multiple genetic variants, within the same or different genes, combine to affect liability for many common diseases. Indeed, the variants may interact among themselves and with environmental factors. Thus realistic genetic/statistical models can include an extremely large number of parameters, and it is by no means obvious how to find the variants contributing to liability. For models of multiple candidate genes and their interactions, we prove that statistical inference can be based on controlling the false discovery rate (FDR), which is defined as the expected proportion of false rejections among all rejections. Controlling the FDR automatically controls the overall error rate in the special case that all the null hypotheses are true. So do more standard methods such as Bonferroni correction. However, when some null hypotheses are false, the goals of Bonferroni and FDR differ, and FDR will have better power. Model selection procedures, such as forward stepwise regression, are often used to choose important predictors for complex models. By analysis of simulations of such models, we compare a computationally efficient form of forward stepwise regression against the FDR methods. We show that model selection includes numerous genetic variants having no impact on the trait, whereas FDR maintains a false-positive rate very close to the nominal rate. With good control over false positives and better power than Bonferroni, the FDR-based methods we introduce present a viable means of evaluating complex, multivariate genetic models. Naturally, as for any method seeking to explore complex genetic models, the power of the methods is limited by sample size and model complexity.
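
A toy version of the comparison described above: simulate a trait driven by a few variants, then contrast forward stepwise selection with per-variant tests controlled by Benjamini-Hochberg. Purely illustrative; the simulation design, effect sizes, and number of stepwise steps are not the paper's.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 300, 40
G = rng.integers(0, 3, size=(n, p)).astype(float)        # genotypes coded 0/1/2
y = 0.8 * G[:, 0] + 0.8 * G[:, 1] + rng.normal(size=n)   # only the first two variants matter

# Forward stepwise selection (a fixed number of steps, for simplicity)
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                                direction="forward").fit(G, y)
print("stepwise picks:", np.flatnonzero(sfs.get_support()))

# Marginal per-variant tests with BH control of the FDR
pvals = [sm.OLS(y, sm.add_constant(G[:, [j]])).fit().pvalues[1] for j in range(p)]
reject = multipletests(np.array(pvals), alpha=0.05, method="fdr_bh")[0]
print("FDR picks:     ", np.flatnonzero(reject))
```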


Subject(s)
Chromosome Mapping/methods , Genetic Predisposition to Disease/genetics , Models, Genetic , Data Interpretation, Statistical , Humans
20.
Am J Hum Genet ; 72(4): 891-902, 2003 Apr.
Article in English | MEDLINE | ID: mdl-12610778

ABSTRACT

The observation that haplotypes from a particular region of the genome differ between affected and unaffected individuals or between chromosomes transmitted to affected individuals versus those not transmitted is sound evidence for a disease-liability mutation in the region. Tests for differentiation of haplotype distributions often take the form of either Pearson's chi-squared statistic or tests based on the similarity among haplotypes in the different populations. In this article, we show that many measures of haplotype similarity can be expressed in the same quadratic form, and we give the general form of the variance. As we describe, these methods can be applied to either phase-known or phase-unknown data. We investigate the performance of Pearson's chi-squared statistic and haplotype similarity tests through use of evolutionary simulations. We show that both approaches can be powerful, but under quite different conditions. Moreover, we show that the power of both approaches can be enhanced by clustering rare haplotypes from the distributions before performing a test.
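
A small sketch of the chi-squared comparison of haplotype distributions between cases and controls, with rare haplotypes pooled into one category first (a simplified count-pooling stand-in for the similarity-based clustering the authors describe). The counts and the pooling threshold are invented.

```python
import numpy as np
from scipy.stats import chi2_contingency

# rows: haplotypes; columns: counts in cases, controls (illustrative numbers)
counts = np.array([[120, 90], [60, 75], [15, 20], [3, 2], [2, 3]])

def pool_rare(counts, min_total=10):
    keep = counts.sum(axis=1) >= min_total
    pooled = counts[~keep].sum(axis=0)       # combine all rare haplotypes into one row
    table = counts[keep]
    if pooled.sum() > 0:
        table = np.vstack([table, pooled])
    return table

stat, pval, dof, _ = chi2_contingency(pool_rare(counts))
print(f"chi2 = {stat:.2f}, df = {dof}, p = {pval:.3f}")
```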


Subject(s)
Genetic Diseases, Inborn/genetics , Models, Genetic , Mutation , Chromosomes, Human/genetics , Genetic Markers , Haplotypes , Humans , Models, Statistical , Reproducibility of Results