1.
J Am Stat Assoc ; 119(545): 715-729, 2024.
Article in English | MEDLINE | ID: mdl-38818252

ABSTRACT

It is important to develop statistical techniques to analyze high-dimensional data in the presence of both complex dependence and possible heavy tails and outliers in real-world applications such as imaging data analyses. We propose a new robust high-dimensional regression method with coefficient thresholding, in which an efficient nonconvex estimation procedure combines a thresholding function with the robust Huber loss. The proposed regularization method accounts for complex dependence structures in predictors and is robust against heavy tails and outliers in outcomes. Theoretically, we rigorously analyze the landscape of the population and empirical risk functions for the proposed method. This favorable landscape enables us to establish both statistical consistency and computational convergence under the high-dimensional setting. We also present an extension to incorporate spatial information into the proposed method. Finite-sample properties of the proposed methods are examined by extensive simulation studies. An application concerns a scalar-on-image regression analysis of the association between a psychiatric disorder, measured by the general factor of psychopathology, and features extracted from task functional MRI data in the Adolescent Brain Cognitive Development (ABCD) study.
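The robustness above comes from the Huber loss, whose gradient is bounded so outlying outcomes cannot dominate the fit. The following is a minimal sketch of that component only, not the authors' coefficient-thresholding estimator (which adds a nonconvex thresholded penalty); `delta`, `lr`, and the toy data are illustrative assumptions.

```python
import numpy as np

def huber_grad(r, delta=1.0):
    # Gradient of the Huber loss in the residual: identity near zero,
    # clipped to +/- delta in the tails, so outliers have bounded influence.
    return np.clip(r, -delta, delta)

def robust_regression(X, y, delta=1.0, lr=0.1, n_iter=500):
    # Plain gradient descent on the average Huber loss (no penalty term).
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        r = X @ beta - y
        beta -= lr * (X.T @ huber_grad(r, delta)) / n
    return beta

# Toy data with 5% gross outliers in the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=200)
y[:10] += 20.0
beta_hat = robust_regression(X, y)
```

Despite the contaminated responses, the clipped gradient keeps the estimate near the true coefficients, which is the behavior a squared-error fit would lose.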

2.
J Multivar Anal ; 202, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38525479

ABSTRACT

We introduce a new approach to nonlinear sufficient dimension reduction in cases where both the predictor and the response are distributional data, modeled as members of a metric space. Our key step is to build universal kernels (cc-universal) on the metric spaces, which results in reproducing kernel Hilbert spaces for the predictor and response that are rich enough to characterize the conditional independence that determines sufficient dimension reduction. For univariate distributions, we construct the universal kernel using the Wasserstein distance, while for multivariate distributions, we resort to the sliced Wasserstein distance. The sliced Wasserstein distance ensures that the metric space possesses similar topological properties to the Wasserstein space, while also offering significant computation benefits. Numerical results based on synthetic data show that our method outperforms possible competing methods. The method is also applied to several data sets, including fertility and mortality data and Calgary temperature data.
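The sliced Wasserstein distance used above reduces a multivariate comparison to one-dimensional Wasserstein distances averaged over projection directions, each of which has a closed form via sorted samples. A minimal sketch follows; the Monte Carlo slicing with random directions is one common approximation and is an assumption here, not the paper's kernel construction itself.

```python
import numpy as np

def wasserstein_1d(u, v):
    # For two equal-size 1-D samples, the empirical 2-Wasserstein distance
    # is the L2 distance between the sorted samples (quantile coupling).
    return np.sqrt(np.mean((np.sort(u) - np.sort(v)) ** 2))

def sliced_wasserstein(X, Y, n_proj=200, seed=0):
    # Monte Carlo average of squared 1-D distances over random directions.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)
        total += wasserstein_1d(X @ theta, Y @ theta) ** 2
    return np.sqrt(total / n_proj)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
d_same = sliced_wasserstein(X, X)          # identical samples
d_shift = sliced_wasserstein(X, X + 2.0)   # both coordinates shifted by 2
```

Because each slice only needs a sort, the cost grows mildly with dimension, which is the computational benefit the abstract alludes to.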

3.
J Bus Econ Stat ; 41(4): 1090-1100, 2023.
Article in English | MEDLINE | ID: mdl-38125739

ABSTRACT

Compositional data arises in a wide variety of research areas when some form of standardization and composition is necessary. Estimating covariance matrices is of fundamental importance for high-dimensional compositional data analysis. However, existing methods require the restrictive Gaussian or sub-Gaussian assumption, which may not hold in practice. We propose a robust composition adjusted thresholding covariance procedure based on Huber-type M-estimation to estimate the sparse covariance structure of high-dimensional compositional data. We introduce a cross-validation procedure to choose the tuning parameters of the proposed method. Theoretically, by assuming a bounded fourth moment condition, we obtain the rates of convergence and signal recovery property for the proposed method and provide the theoretical guarantees for the cross-validation procedure under the high-dimensional setting. Numerically, we demonstrate the effectiveness of the proposed method in simulation studies and in a real application to sales data analysis.
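A rough sketch of the two ingredients named above: Huber-type M-estimation of each covariance entry, followed by hard thresholding of the off-diagonal entries. This omits the composition adjustment (the log-ratio step) and the cross-validated tuning the paper develops; `tau` and `thresh` are illustrative values, not the paper's choices.

```python
import numpy as np

def huber_mean(x, tau):
    # Huber M-estimator of a mean: iterate m <- m + mean(clip(x - m, -tau, tau));
    # the fixed point solves the Huber estimating equation.
    m = np.median(x)
    for _ in range(50):
        m += np.mean(np.clip(x - m, -tau, tau))
    return m

def robust_thresholded_cov(X, tau=5.0, thresh=0.2):
    # Entry-wise Huber estimate of the covariance, then hard thresholding
    # of the off-diagonal entries to induce sparsity.
    n, p = X.shape
    Xc = X - np.median(X, axis=0)   # robust centering
    S = np.empty((p, p))
    for j in range(p):
        for k in range(j, p):
            S[j, k] = S[k, j] = huber_mean(Xc[:, j] * Xc[:, k], tau)
    S_thr = S * (np.abs(S) >= thresh)
    np.fill_diagonal(S_thr, np.diag(S))  # never threshold the variances
    return S_thr

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 4))       # independent coordinates, unit variance
S = robust_thresholded_cov(X)
```

On independent data the off-diagonal estimates fall below the threshold and are set exactly to zero, while the variances on the diagonal are left untouched.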

4.
Psychometrika ; 87(1): 83-106, 2022 03.
Article in English | MEDLINE | ID: mdl-34191228

ABSTRACT

Graphical models have received an increasing amount of attention in network psychometrics as a promising probabilistic approach to study the conditional relations among variables using graph theory. Despite recent advances, existing methods on graphical models usually assume a homogeneous population and focus on binary or continuous variables. However, ordinal variables are very popular in many areas of psychological science, and the population often consists of several different groups based on the heterogeneity in ordinal data. Driven by these needs, we introduce the finite mixture of ordinal graphical models to effectively study the heterogeneous conditional dependence relationships of ordinal data. We develop a penalized likelihood approach for model estimation, and design a generalized expectation-maximization (EM) algorithm to address the significant computational challenges. We examine the performance of the proposed method and algorithm in simulation studies. Moreover, we demonstrate the potential usefulness of the proposed method in psychological science through a real application concerning the interests and attitudes related to fan avidity for students in a large public university in the United States.


Subject(s)
Algorithms , Computer Simulation , Humans , Likelihood Functions , Psychometrics
5.
Biometrics ; 77(3): 984-995, 2021 09.
Article in English | MEDLINE | ID: mdl-32683674

ABSTRACT

A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine-mapping of the microbiome, we propose a two-step compositional knockoff filter to provide the effective finite-sample false discovery rate (FDR) control in high-dimensional linear log-contrast regression analysis of microbiome compositional data. In the first step, we propose a new compositional screening procedure to remove insignificant microbial taxa while retaining the essential sum-to-zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. Thereby, a subset of the microbes is selected from the high-dimensional microbial taxa as related to the response under a prespecified FDR threshold. We study the theoretical properties of the proposed two-step procedure, including both sure screening and effective false discovery control. We demonstrate these properties in numerical simulation studies to compare our methods to some existing ones and show power gain of the new method while controlling the nominal FDR. The potential usefulness of the proposed method is also illustrated with application to an inflammatory bowel disease data set to identify microbial taxa that influence host gene expressions.


Subject(s)
Microbiota , Computer Simulation , Data Analysis , Microbiota/genetics , Regression Analysis , Research Design
6.
J Multivar Anal ; 175, 2020 Jan.
Article in English | MEDLINE | ID: mdl-32863458

ABSTRACT

Dynamic networks are a general language for describing time-evolving complex systems, and discrete time network models provide an emerging statistical technique for various applications. It is a fundamental research question to detect a set of nodes sharing similar connectivity patterns in time-evolving networks. Our work is primarily motivated by detecting groups based on interesting features of the time-evolving networks (e.g., stability). In this work, we propose a model-based clustering framework for time-evolving networks based on discrete time exponential-family random graph models, which simultaneously allows both modeling and detecting group structure. To choose the number of groups, we use the conditional likelihood to construct an effective model selection criterion. Furthermore, we propose an efficient variational expectation-maximization (EM) algorithm to find approximate maximum likelihood estimates of network parameters and mixing proportions. The power of our method is demonstrated in simulation studies and empirical applications to international trade networks and the collaboration networks of a large research university.

7.
Environ Sci Technol ; 54(14): 8632-8639, 2020 07 21.
Article in English | MEDLINE | ID: mdl-32603095

ABSTRACT

Chemical spills in streams can impact ecosystem or human health. Typically, the public learns of spills from reports from industry, media, or government rather than monitoring data. For example, ∼1300 spills (76 ≥ 400 gallons or ∼1500 L) were reported from 2007 to 2014 by the regulator for natural gas wellpads in the Marcellus shale region of Pennsylvania (U.S.), a region of extensive drilling and hydraulic fracturing. Only one such incident of stream contamination in Pennsylvania has been documented with water quality data in peer-reviewed literature. This could indicate that spills (1) were small or contained on wellpads, (2) were diluted, biodegraded, or obscured by other contaminants, (3) were not detected because of sparse monitoring, or (4) were not detected because of the difficulties of inspecting data for complex stream networks. As a first step in addressing the last problem, we developed a geospatial-analysis tool, GeoNet, that analyzes stream networks to detect statistically significant changes between background and potentially impacted sites. GeoNet was used on data in the Water Quality Portal for the Pennsylvania Marcellus region. With the most stringent statistical tests, GeoNet detected 0.2% to 2% of the known contamination incidents (Na ± Cl) in streams. With denser sensor networks, tools like GeoNet could allow real-time detection of polluting events.
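The statistical tests inside GeoNet are not specified in this abstract, so the following is only a generic sketch of the background-versus-impacted comparison such a tool automates: a Welch-type two-sample test with a normal approximation to the p-value. The function name, solute values, and sample sizes are all illustrative assumptions.

```python
import numpy as np
from math import erfc, sqrt

def welch_z_test(background, downstream):
    # Welch two-sample statistic with a normal approximation for the
    # two-sided p-value (adequate for large monitoring records).
    m1, m2 = np.mean(background), np.mean(downstream)
    v1, v2 = np.var(background, ddof=1), np.var(downstream, ddof=1)
    n1, n2 = len(background), len(downstream)
    z = (m2 - m1) / sqrt(v1 / n1 + v2 / n2)
    p = erfc(abs(z) / sqrt(2.0))
    return z, p

rng = np.random.default_rng(3)
background = rng.normal(10.0, 1.0, size=200)   # hypothetical upstream Cl (mg/L)
downstream = rng.normal(12.0, 1.0, size=200)   # hypothetical impacted site
z, p = welch_z_test(background, downstream)
```

A real network analysis would additionally respect the stream topology when pairing background and impacted stations, which is the harder problem GeoNet addresses.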


Subject(s)
Natural Gas , Water Pollutants, Chemical , Ecosystem , Environmental Monitoring , Humans , Oil and Gas Fields , Pennsylvania , Rivers , Water Pollutants, Chemical/analysis
8.
Technometrics ; 62(2): 161-172, 2020.
Article in English | MEDLINE | ID: mdl-33716325

ABSTRACT

Water pollution is a major global environmental problem, and it poses a great environmental risk to public health and biological diversity. This work is motivated by assessing the potential environmental threat of coal mining through increased sulfate concentrations in river networks, which do not belong to any simple parametric distribution. However, existing network models mainly focus on binary or discrete networks and weighted networks with known parametric weight distributions. We propose a principled nonparametric weighted network model based on exponential-family random graph models and local likelihood estimation, and study its model-based clustering with application to large-scale water pollution network analysis. We do not require any parametric distribution assumption on network weights. The proposed method greatly extends the methodology and applicability of statistical network models. Furthermore, it is scalable to large and complex networks in large-scale environmental studies. The power of our proposed methods is demonstrated in simulation studies and in a real application to sulfate pollution network analysis in the Ohio watershed located in Pennsylvania, United States.

9.
Front Genet ; 10: 350, 2019.
Article in English | MEDLINE | ID: mdl-31068967

ABSTRACT

Differential abundance analysis is a crucial task in many microbiome studies, where the central goal is to identify microbiome taxa associated with certain biological or clinical conditions. There are two different modes of microbiome differential abundance analysis: the individual-based univariate differential abundance analysis and the group-based multivariate differential abundance analysis. The univariate analysis identifies differentially abundant microbiome taxa subject to multiple testing correction under certain statistical error measurements such as the false discovery rate, which is typically complicated by the high-dimensionality of taxa and the complex correlation structure among taxa. The multivariate analysis evaluates the overall shift in the abundance of microbiome composition between two conditions, which provides useful preliminary differential information for the necessity of follow-up validation studies. In this paper, we present a novel Adaptive multivariate two-sample test for Microbiome Differential Analysis (AMDA) to examine whether the composition of a taxa-set is different between two conditions. Our simulation studies and real data applications demonstrated that the AMDA test was often more powerful than several competing methods while preserving the correct type I error rate. A free implementation of our AMDA method in R software is available at https://github.com/xyz5074/AMDA.

10.
Nat Mater ; 18(7): 760-769, 2019 07.
Article in English | MEDLINE | ID: mdl-30911119

ABSTRACT

Integrins are membrane receptors that mediate cell adhesion and mechanosensing. The structure-function relationship of integrins remains incompletely understood, despite the extensive studies carried out because of its importance to basic cell biology and translational medicine. Using a fluorescence dual biomembrane force probe, microfluidics and cone-and-plate rheometry, we applied precisely controlled mechanical stimulations to platelets and identified an intermediate state of integrin αIIbβ3 that is characterized by an ectodomain conformation, ligand affinity and bond lifetimes that are all intermediate between the well-known inactive and active states. This intermediate state is induced by ligand engagement of glycoprotein (GP) Ibα via a mechanosignalling pathway and potentiates the outside-in mechanosignalling of αIIbβ3 for further transition to the active state during integrin mechanical affinity maturation. Our work reveals distinct αIIbβ3 state transitions in response to biomechanical and biochemical stimuli, and identifies a role for the αIIbβ3 intermediate state in promoting biomechanical platelet aggregation.


Subject(s)
Mechanical Phenomena , Platelet Aggregation , Platelet Glycoprotein GPIIb-IIIa Complex/metabolism , Biomechanical Phenomena , Humans , Ligands , Signal Transduction
11.
Environ Sci Process Impacts ; 21(2): 384-396, 2019 Feb 21.
Article in English | MEDLINE | ID: mdl-30608109

ABSTRACT

With recent improvements in high-volume hydraulic fracturing (HVHF, known to the public as fracking), vast new reservoirs of natural gas and oil are now being tapped. As HVHF has expanded into the populous northeastern USA, some residents have become concerned about impacts on water quality. Scientists have addressed this concern by investigating individual case studies or by statistically assessing the rate of problems. In general, however, lack of access to new or historical water quality data hinders the latter assessments. We introduce a new statistical approach to assess water quality datasets - especially sets that differ in data volume and variance - and apply the technique to one region of intense shale gas development in northeastern Pennsylvania (PA) and one with fewer shale gas wells in northwestern PA. The new analysis for the intensely developed region corroborates an earlier analysis based on a different statistical test: in that area, changes in groundwater chemistry show no degradation despite that area's dense development of shale gas. In contrast, in the region with fewer shale gas wells, we observe slight but statistically significant increases in the concentrations of some solutes in groundwater. One potential explanation for the slight changes in groundwater chemistry in that area (northwestern PA) is that it is the regional focus of the earliest commercial development of conventional oil and gas (O&G) in the USA. Alternate explanations include the use of brines from conventional O&G wells as well as other salt mixtures on roads in that area for dust abatement or de-icing, respectively.


Subject(s)
Groundwater/chemistry , Hydraulic Fracking , Natural Gas/analysis , Petroleum/analysis , Water Pollutants, Chemical/analysis , Water/analysis , Oil and Gas Fields , Pennsylvania , Water Quality
12.
PLoS Comput Biol ; 14(9): e1006436, 2018 09.
Article in English | MEDLINE | ID: mdl-30240439

ABSTRACT

Co-expression network analysis provides useful information for studying gene regulation in biological processes. Examining condition-specific patterns of co-expression can provide insights into the underlying cellular processes activated in a particular condition. One challenge in this type of analysis is that the sample sizes in each condition are usually small, making the statistical inference of co-expression patterns highly underpowered. A joint network construction that borrows information from related structures across conditions has the potential to improve the power of the analysis. One possible approach to constructing the co-expression network is to use the Gaussian graphical model. Though several methods are available for joint estimation of multiple graphical models, they do not fully account for the heterogeneity between samples and between co-expression patterns introduced by condition specificity. Here we develop the condition-adaptive fused graphical lasso (CFGL), a data-driven approach to incorporate condition specificity in the estimation of co-expression networks. We show that this method improves the accuracy with which networks are learned. The application of this method on a rat multi-tissue dataset and The Cancer Genome Atlas (TCGA) breast cancer dataset provides interesting biological insights. In both analyses, we identify numerous modules enriched for Gene Ontology functions and observe that the modules that are upregulated in a particular condition are often involved in condition-specific activities. Interestingly, we observe that the genes strongly associated with survival time in the TCGA dataset are less likely to be network hubs, suggesting that genes associated with cancer progression are likely to govern specific functions or execute final biological functions in pathways, rather than regulating a large number of biological processes. Additionally, we observe that the tumor-specific hub genes tend to have few shared edges with normal tissue, revealing tumor-specific regulatory mechanisms.


Subject(s)
Brain/metabolism , Breast Neoplasms/metabolism , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Myocardium/metabolism , Algorithms , Animals , Area Under Curve , Breast Neoplasms/genetics , Computer Graphics , Computer Simulation , Databases, Factual , Female , Heart , Humans , Male , Neoplasms/metabolism , Normal Distribution , Rats , Software
13.
Genet Epidemiol ; 42(8): 772-782, 2018 12.
Article in English | MEDLINE | ID: mdl-30218543

ABSTRACT

Recent research has highlighted the importance of the human microbiome in many human diseases and health conditions. Most current microbiome association analyses focus on unrelated samples; such methods are not appropriate for analysis of data collected from more advanced study designs such as longitudinal and pedigree studies, where outcomes can be correlated. Ignoring such correlations can sometimes lead to suboptimal results or even possibly biased conclusions. Thus, new methods to handle correlated outcome data in microbiome association studies are needed. In this paper, we propose the correlated sequence kernel association test (CSKAT) to address such correlations using the linear mixed model. Specifically, random effects are used to account for the outcome correlations and a variance component test is used to examine the microbiome effect. Compared to existing genetic association tests for longitudinal and family samples, we implement a correction procedure to better calibrate the null distribution of the score test statistic to accommodate the small sample size nature of data collected from a typical microbiome study. Comprehensive simulation studies are conducted to demonstrate the validity and efficiency of our method, and we show that CSKAT achieves a higher power than existing methods while correctly controlling the Type I error rate. We also apply our method to a microbiome data set collected from a UK twin study to illustrate its potential usefulness. A free implementation of our method in R software is available at https://github.com/jchen1981/SSKAT.


Subject(s)
Algorithms , Microbiota , Computer Simulation , Humans , Linear Models , Microbiota/genetics , Models, Genetic , Sample Size , Twins , United Kingdom
14.
Stat Surv ; 12: 105-135, 2018.
Article in English | MEDLINE | ID: mdl-31428219

ABSTRACT

We present a selective review of statistical modeling of dynamic networks. We focus on models with latent variables, specifically, the latent space models and the latent class models (or stochastic blockmodels), which investigate both the observed features and the unobserved structure of networks. We begin with an overview of the static models, and then we introduce the dynamic extensions. For each dynamic model, we also discuss its applications that have been studied in the literature, with the data sources listed in the Appendix. Based on the review, we summarize a list of open problems and challenges in dynamic network modeling with latent variables.

15.
Environ Geochem Health ; 40(2): 865-885, 2018 Apr.
Article in English | MEDLINE | ID: mdl-29027593

ABSTRACT

To understand how extraction of different energy sources impacts water resources requires assessment of how water chemistry has changed in comparison with the background values of pristine streams. With such understanding, we can develop better water quality standards and ecological interpretations. However, determination of pristine background chemistry is difficult in areas with heavy human impact. To learn to do this, we compiled a master dataset of sulfate and barium concentrations ([SO4], [Ba]) in Pennsylvania (PA, USA) streams from publicly available sources. These elements were chosen because they can represent contamination related to oil/gas and coal, respectively. We applied changepoint analysis (i.e., likelihood ratio test) to identify pristine streams, which we defined as streams with a low variability in concentrations as measured over years. From these pristine streams, we estimated the baseline concentrations for major bedrock types in PA. Overall, we found that 48,471 data values are available for [SO4] from 1904 to 2014 and 3243 data values for [Ba] from 1963 to 2014. Statewide [SO4] baseline was estimated to be 15.8 ± 9.6 mg/L, but values range from 12.4 to 26.7 mg/L for different bedrock types. The statewide [Ba] baseline is 27.7 ± 10.6 µg/L and values range from 25.8 to 38.7 µg/L. Results show that most increases in [SO4] from the baseline occurred in areas with intensive coal mining activities, confirming previous studies. Sulfate inputs from acid rain were also documented. Slight increases in [Ba] since 2007 and higher [Ba] in areas with higher densities of gas wells when compared to other areas could document impacts from shale gas development, the prevalence of basin brines, or decreases in acid rain and its coupled effects on [Ba] related to barite solubility. The largest impacts on PA stream [Ba] and [SO4] are related to releases from coal mining or burning rather than oil and gas development.
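The changepoint analysis mentioned above can be illustrated with a single-changepoint likelihood-ratio scan for a mean shift in a concentration series; the paper's actual screening for low-variability streams is more involved, so treat this as a sketch of the test statistic only, with the toy series as an assumption.

```python
import numpy as np

def mean_shift_changepoint(x):
    # Scan all interior split points; the statistic is the reduction in the
    # residual sum of squares from allowing separate means before and after
    # the split, scaled by the overall variance (a Gaussian working-model
    # likelihood ratio).
    n = len(x)
    ss0 = np.sum((x - x.mean()) ** 2)
    scale = np.var(x)
    best, k_best = -np.inf, None
    for k in range(2, n - 2):
        left, right = x[:k], x[k:]
        ss1 = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        stat = (ss0 - ss1) / scale
        if stat > best:
            best, k_best = stat, k
    return best, k_best

rng = np.random.default_rng(4)
series = np.concatenate([rng.normal(0.0, 1.0, 50),   # background level
                         rng.normal(3.0, 1.0, 50)])  # shifted regime
stat, k = mean_shift_changepoint(series)
```

A stream whose maximal statistic stays small over the record would be a candidate "pristine" series in the sense used above; a large statistic flags a shift in baseline chemistry.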


Subject(s)
Acid Rain , Barium/analysis , Coal Mining , Hydraulic Fracking , Natural Gas , Rivers , Sulfates/analysis , Water Pollutants, Chemical/analysis , Appalachian Region , Datasets as Topic , Geology , Human Activities , Humans , Pennsylvania , Time Factors
16.
J Econom ; 201(2): 292-306, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29731537

ABSTRACT

We consider forecasting a single time series when there is a large number of predictors and a possible nonlinear effect. The dimensionality is first reduced via a high-dimensional (approximate) factor model implemented by principal component analysis. Using the extracted factors, we develop a novel forecasting method called the sufficient forecasting, which provides a set of sufficient predictive indices, inferred from high-dimensional predictors, to deliver additional predictive power. Projected principal component analysis is employed to enhance the accuracy of inferred factors when a semi-parametric (approximate) factor model is assumed. Our method is also applicable to cross-sectional sufficient regression using extracted factors. The connection between the sufficient forecasting and the deep learning architecture is explicitly stated. The sufficient forecasting correctly estimates projection indices of the underlying factors even in the presence of a nonparametric forecasting function. The proposed method extends the sufficient dimension reduction to high-dimensional regimes by condensing the cross-sectional information through factor models. We derive asymptotic properties for the estimate of the central subspace spanned by these projection directions as well as the estimates of the sufficient predictive indices. We further show that the natural method of running multiple regression of target on estimated factors yields a linear estimate that actually falls into this central subspace. Our method and theory allow the number of predictors to be larger than the number of observations. We finally demonstrate that the sufficient forecasting improves upon the linear forecasting in both simulation studies and an empirical study of forecasting macroeconomic variables.
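The first stage above, factor extraction by principal components followed by a regression of the target on the estimated factors, can be sketched as follows. This is the linear baseline that the sufficient forecasting generalizes, not the multi-index procedure itself; the toy factor model is an assumption for illustration.

```python
import numpy as np

def extract_factors(X, K):
    # Principal-component factors: leading left singular vectors of the
    # centered data, scaled so the factors have roughly unit variance.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :K] * np.sqrt(X.shape[0])

def factor_forecast(X, y, K):
    # Regress the target on an intercept plus the K estimated factors.
    F = extract_factors(X, K)
    design = np.column_stack([np.ones(len(y)), F])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef

# Toy factor model: 50 predictors driven by 2 latent factors.
rng = np.random.default_rng(5)
n, p, K = 300, 50, 2
F_true = rng.normal(size=(n, K))
B = rng.normal(size=(p, K))
X = F_true @ B.T + rng.normal(size=(n, p))
y = F_true @ np.array([1.0, -1.0]) + 0.5 * rng.normal(size=n)
y_hat = factor_forecast(X, y, K)
```

Because the regression is invariant to rotations of the estimated factor space, recovering the factors only up to rotation, as PCA does, is enough for the fitted forecast.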

17.
Elife ; 5, 2016 07 19.
Article in English | MEDLINE | ID: mdl-27434669

ABSTRACT

How cells sense their mechanical environment and transduce forces into biochemical signals is a crucial yet unresolved question in mechanobiology. Platelets use receptor glycoprotein Ib (GPIb), specifically its α subunit (GPIbα), to signal as they tether and translocate on von Willebrand factor (VWF) of injured arterial surfaces against blood flow. Force elicits catch bonds to slow VWF-GPIbα dissociation and unfolds the GPIbα leucine-rich repeat domain (LRRD) and juxtamembrane mechanosensitive domain (MSD). How these mechanical processes trigger biochemical signals remains unknown. Here we analyze these extracellular events and the resulting intracellular Ca(2+) on a single platelet in real time, revealing that LRRD unfolding intensifies Ca(2+) signal whereas MSD unfolding affects the type of Ca(2+) signal. Therefore, LRRD and MSD are analog and digital force transducers, respectively. The >30 nm macroglycopeptide separating the two domains transmits force on the VWF-GPIbα bond (whose lifetime is prolonged by LRRD unfolding) to the MSD to enhance its unfolding, resulting in unfolding cooperativity at an optimal force. These elements may provide design principles for a generic mechanosensory protein machine.


Subject(s)
Blood Platelets/physiology , Calcium/metabolism , Mechanoreceptors/metabolism , Platelet Glycoprotein GPIb-IX Complex/metabolism , von Willebrand Factor/metabolism , Humans , Protein Binding , Protein Folding
18.
J Am Stat Assoc ; 111(516): 1726-1735, 2016.
Article in English | MEDLINE | ID: mdl-29097827

ABSTRACT

We consider estimating multi-task quantile regression under the transnormal model, with a focus on the high-dimensional setting. We derive a surprisingly simple closed-form solution through rank-based covariance regularization. In particular, we propose the rank-based ℓ1 penalization with positive definite constraints for estimating sparse covariance matrices, and the rank-based banded Cholesky decomposition regularization for estimating banded precision matrices. By taking advantage of the alternating direction method of multipliers, a nearest correlation matrix projection is introduced that inherits the sampling properties of the unprojected estimator. Our work combines strengths of quantile regression and rank-based covariance regularization to simultaneously deal with nonlinearity and nonnormality for high-dimensional regression. Furthermore, the proposed method strikes a good balance between robustness and efficiency, achieves the "oracle"-like convergence rate, and provides the provable prediction interval under the high-dimensional setting. The finite-sample performance of the proposed method is also examined. The performance of our proposed rank-based method is demonstrated in a real application to analyze the protein mass spectroscopy data.
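The rank-based covariance regularization above starts from a rank correlation. Under the transnormal (Gaussian copula) model, the latent correlation can be recovered from Kendall's tau via the sine transform sin(pi/2 · tau), which is invariant to monotone transformations of the margins. A minimal sketch of that estimator, before any penalization or Cholesky regularization is applied:

```python
import numpy as np

def kendall_tau(x, y):
    # O(n^2) Kendall's tau; fine for a sketch (use an O(n log n)
    # implementation for large n).
    n = len(x)
    s = 0.0
    for i in range(n - 1):
        s += np.sum(np.sign(x[i + 1:] - x[i]) * np.sign(y[i + 1:] - y[i]))
    return 2.0 * s / (n * (n - 1))

def transnormal_corr(X):
    # sin(pi/2 * tau) maps Kendall's tau to the latent Gaussian correlation
    # under the transnormal model; ranks make it robust to nonnormal margins.
    p = X.shape[1]
    R = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            r = np.sin(0.5 * np.pi * kendall_tau(X[:, j], X[:, k]))
            R[j, k] = R[k, j] = r
    return R

rng = np.random.default_rng(6)
z1 = rng.normal(size=500)
z2 = 0.7 * z1 + np.sqrt(1.0 - 0.49) * rng.normal(size=500)
X = np.column_stack([z1, np.exp(z2)])   # monotone-transform one margin
R = transnormal_corr(X)
```

The exponential transform of the second margin leaves the estimate near the latent correlation of 0.7, which a sample Pearson correlation would not achieve.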

19.
Ann Stat ; 42(3): 819-849, 2014 Jun.
Article in English | MEDLINE | ID: mdl-25598560

ABSTRACT

Folded concave penalization methods have been shown to enjoy the strong oracle property for high-dimensional sparse estimation. However, a folded concave penalization problem usually has multiple local solutions and the oracle property is established only for one of the unknown local solutions. A challenging fundamental issue still remains that it is not clear whether the local optimum computed by a given optimization algorithm possesses those nice theoretical properties. To close this important theoretical gap, which has remained open for over a decade, we provide a unified theory to show explicitly how to obtain the oracle solution via the local linear approximation algorithm. For a folded concave penalized estimation problem, we show that as long as the problem is localizable and the oracle estimator is well behaved, we can obtain the oracle estimator by using the one-step local linear approximation. In addition, once the oracle estimator is obtained, the local linear approximation algorithm converges, namely it produces the same estimator in the next iteration. The general theory is demonstrated by using four classical sparse estimation problems, i.e., sparse linear regression, sparse logistic regression, sparse precision matrix estimation and sparse quantile regression.

20.
Neural Comput ; 25(8): 2172-98, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23607561

ABSTRACT

Chandrasekaran, Parrilo, and Willsky (2012) proposed a convex optimization problem for graphical model selection in the presence of unobserved variables. This convex optimization problem aims to estimate an inverse covariance matrix that can be decomposed into a sparse matrix minus a low-rank matrix from sample data. Solving this convex optimization problem is very challenging, especially for large problems. In this letter, we propose two alternating direction methods for solving this problem. The first method is to apply the classic alternating direction method of multipliers to solve the problem as a consensus problem. The second method is a proximal gradient-based alternating direction method of multipliers. Our methods take advantage of the special structure of the problem and thus can solve large problems very efficiently. A global convergence result is established for the proposed methods. Numerical results on both synthetic data and gene expression data show that our methods usually solve problems with 1 million variables in 1 to 2 minutes and are usually 5 to 35 times faster than a state-of-the-art Newton-CG proximal point algorithm.
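Both ADMM variants above repeatedly apply two proximal operators: element-wise soft thresholding for the sparse component and singular-value shrinkage for the low-rank component. A minimal sketch of these building blocks (the full consensus splitting and dual updates are omitted, so this is not the complete algorithm):

```python
import numpy as np

def soft_threshold(A, lam):
    # Proximal operator of lam * ||A||_1: element-wise shrinkage toward zero.
    return np.sign(A) * np.maximum(np.abs(A) - lam, 0.0)

def svd_shrink(A, lam):
    # Proximal operator of lam * ||A||_* (nuclear norm): soft-threshold the
    # singular values, which also reduces rank once they fall below lam.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

S = soft_threshold(np.array([[2.0, -0.5], [0.1, 3.0]]), 1.0)
L = svd_shrink(np.diag([3.0, 1.0]), 2.0)
```

Each operator has a closed form, which is why the ADMM subproblems are cheap even when the matrices are large; the cost is dominated by the SVD in the low-rank step.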


Subject(s)
Algorithms , Models, Theoretical , Normal Distribution , Computer Simulation , Humans , Pattern Recognition, Automated