Search | VHL Regional Portal

Cross-Validated Loss-Based Covariance Matrix Estimator Selection in High Dimensions.

Boileau, Philippe; Hejazi, Nima S; van der Laan, Mark J; Dudoit, Sandrine.

J Comput Graph Stat ; 32(2): 601-612, 2023.

Article in English | MEDLINE | ID: mdl-37273839

ABSTRACT

The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience. Thus, a variety of estimators have been derived to overcome the shortcomings of the canonical estimator in such settings. Yet, selecting an optimal estimator from among the plethora available remains an open challenge. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. We propose a general class of loss functions for covariance matrix estimation and establish accompanying finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validation selector. In numerical experiments, we demonstrate the optimality of our proposed selector in moderate sample sizes and across diverse data-generating processes. The practical benefits of our procedure are highlighted in a dimension reduction application to single-cell transcriptome sequencing data.
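As a rough illustration of the idea of cross-validated estimator selection described in this abstract, the following minimal Python sketch picks among a few candidate covariance estimators by their average validation-fold loss. It is not the authors' implementation: the squared Frobenius distance to the validation fold's sample covariance, the linear shrinkage candidates, and the names `sample_covariance`, `make_linear_shrinkage`, and `cv_select` are all assumptions made for illustration.

```python
import numpy as np


def sample_covariance(x):
    """Sample covariance of an (n, p) data matrix."""
    return np.cov(x, rowvar=False)


def make_linear_shrinkage(alpha):
    """Hypothetical candidate: shrink the sample covariance toward a scaled identity."""
    def estimator(x):
        s = np.cov(x, rowvar=False)
        p = s.shape[0]
        target = np.eye(p) * np.trace(s) / p
        return (1.0 - alpha) * s + alpha * target
    return estimator


def cv_select(x, estimators, n_folds=5, seed=0):
    """Return the name of the candidate with the smallest cross-validated risk.

    Assumed loss: squared Frobenius distance between the estimate fitted on
    the training folds and the sample covariance of the validation fold.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.shape[0]), n_folds)
    risks = {}
    for name, fit in estimators.items():
        fold_losses = []
        for k, valid_idx in enumerate(folds):
            train_idx = np.concatenate(
                [f for j, f in enumerate(folds) if j != k]
            )
            estimate = fit(x[train_idx])
            reference = np.cov(x[valid_idx], rowvar=False)
            fold_losses.append(np.sum((estimate - reference) ** 2))
        risks[name] = float(np.mean(fold_losses))
    best = min(risks, key=risks.get)
    return best, risks


# Simulated data with few observations relative to the number of variables.
rng = np.random.default_rng(1)
data = rng.standard_normal((40, 20))
candidates = {
    "sample": sample_covariance,
    "shrink_0.3": make_linear_shrinkage(0.3),
    "shrink_0.7": make_linear_shrinkage(0.7),
}
best_name, cv_risks = cv_select(data, candidates)
print(best_name, cv_risks)
```

The selector simply returns whichever candidate minimizes the estimated risk, so the library of candidates and the loss can be swapped without changing `cv_select`.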

A flexible approach for predictive biomarker discovery.

Boileau, Philippe; Qi, Nina Ting; van der Laan, Mark J; Dudoit, Sandrine; Leng, Ning.

Biostatistics ; 24(4): 1085-1105, 2023 10 18.

Article in English | MEDLINE | ID: mdl-35861622

ABSTRACT

An endeavor central to precision medicine is predictive biomarker discovery; these biomarkers define patient subpopulations which stand to benefit most, or least, from a given treatment. The identification of these biomarkers is often the byproduct of the related but fundamentally different task of treatment rule estimation. Using treatment rule estimation methods to identify predictive biomarkers in clinical trials where the number of covariates exceeds the number of participants often results in high false discovery rates. The higher than expected number of false positives translates to wasted resources when conducting follow-up experiments for drug target identification and diagnostic assay development. Patient outcomes are in turn negatively affected. We propose a variable importance parameter for directly assessing the importance of potentially predictive biomarkers and develop a flexible nonparametric inference procedure for this estimand. We prove that our estimator is double robust and asymptotically linear under loose conditions in the data-generating process, permitting valid inference about the importance metric. The statistical guarantees of the method are verified in a thorough simulation study representative of randomized controlled trials with moderate and high-dimensional covariate vectors. Our procedure is then used to discover predictive biomarkers from among the tumor gene expression data of metastatic renal cell carcinoma patients enrolled in recently completed clinical trials. We find that our approach more readily discerns predictive from nonpredictive biomarkers than procedures whose primary purpose is treatment rule estimation. An open-source software implementation of the methodology, the uniCATE R package, is briefly introduced.

Subject(s)

Biomedical Research , Carcinoma, Renal Cell , Kidney Neoplasms , Humans , Carcinoma, Renal Cell/diagnosis , Carcinoma, Renal Cell/genetics , Kidney Neoplasms/diagnosis , Kidney Neoplasms/genetics , Biomarkers , Computer Simulation

A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology.

Hejazi, Nima S; Boileau, Philippe; van der Laan, Mark J; Hubbard, Alan E.

Stat Methods Med Res ; 32(3): 539-554, 2023 03.

Article in English | MEDLINE | ID: mdl-36573044

ABSTRACT

The widespread availability of high-dimensional biological data has made the simultaneous screening of many biological characteristics a central problem in computational and high-dimensional biology. As the dimensionality of datasets continues to grow, so too does the complexity of identifying biomarkers linked to exposure patterns. The statistical analysis of such data often relies upon parametric modeling assumptions motivated by convenience, inviting opportunities for model misspecification. While estimation frameworks incorporating flexible, data adaptive regression strategies can mitigate this, their standard variance estimators are often unstable in high-dimensional settings, resulting in inflated Type-I error even after standard multiple testing corrections. We adapt a shrinkage approach compatible with parametric modeling strategies to semiparametric variance estimators of a family of efficient, asymptotically linear estimators of causal effects, defined by counterfactual exposure contrasts. Augmenting the inferential stability of these estimators in high-dimensional settings yields a data adaptive approach for robustly uncovering stable causal associations, even when sample sizes are limited. Our generalized variance estimator is evaluated against appropriate alternatives in numerical experiments, and an open source R/Bioconductor package, biotmle, is introduced. The proposal is demonstrated in an analysis of high-dimensional DNA methylation data from an observational study on the epigenetic effects of tobacco smoking.

Subject(s)

Biology , Research Design , Sample Size , Causality

The impact of prenatal and early-life arsenic exposure on epigenetic age acceleration among adults in Northern Chile.

Bozack, Anne K; Boileau, Philippe; Hubbard, Alan E; Sillé, Fenna C M; Ferreccio, Catterina; Steinmaus, Craig M; Smith, Martyn T; Cardenas, Andres.

Environ Epigenet ; 8(1): dvac014, 2022.

Article in English | MEDLINE | ID: mdl-35769198

ABSTRACT

Exposure to arsenic affects millions of people globally. Changes in the epigenome may be involved in pathways linking arsenic to health or serve as biomarkers of exposure. This study investigated associations between prenatal and early-life arsenic exposure and epigenetic age acceleration (EAA) in adults, a biomarker of morbidity and mortality. DNA methylation was measured in peripheral blood mononuclear cells (PBMCs) and buccal cells from 40 adults (median age = 49 years) in Chile with and without high prenatal and early-life arsenic exposure. EAA was calculated using the Horvath, Hannum, PhenoAge, skin and blood, GrimAge, and DNA methylation telomere length clocks. We evaluated associations between arsenic exposure and EAA using robust linear models. Participants classified as with and without arsenic exposure had a median drinking water arsenic concentration at birth of 555 and 2 µg/l, respectively. In PBMCs, adjusting for sex and smoking, exposure was associated with a 6-year PhenoAge acceleration [B (95% CI) = 6.01 (2.60, 9.42)]. After adjusting for cell-type composition, we found positive associations with Hannum EAA [B (95% CI) = 3.11 (0.13, 6.10)], skin and blood EAA [B (95% CI) = 1.77 (0.51, 3.03)], and extrinsic EAA [B (95% CI) = 4.90 (1.22, 8.57)]. The association with PhenoAge acceleration in buccal cells was positive but not statistically significant [B (95% CI) = 4.88 (-1.60, 11.36)]. Arsenic exposure limited to early-life stages may be associated with biological aging in adulthood. Future research may provide information on how EAA programmed in early life is related to health.

Development and Characterization of MYB-NFIB Fusion Expression in Adenoid Cystic Carcinoma.

Humtsoe, Joseph O; Kim, Hyun-Su; Jones, Leilani; Cevallos, James; Boileau, Philippe; Kuo, Fengshen; Morris, Luc G T; Ha, Patrick.

Cancers (Basel) ; 14(9), 2022 Apr 30.

Article in English | MEDLINE | ID: mdl-35565392

ABSTRACT

Adenoid cystic carcinoma (ACC) is the second most common cancer type arising from the salivary gland. The frequent occurrence of chromosome t(6;9) translocation leading to the fusion of MYB and NFIB transcription factor genes is considered a genetic hallmark of ACC. This inter-chromosomal rearrangement may encode multiple variants of functional MYB-NFIB fusion in ACC. However, the lack of an ACC model that harbors the t(6;9) translocation has limited studies on defining the potential function and implication of chimeric MYB-NFIB protein in ACC. This report aims to establish a MYB-NFIB fusion protein expressing system in ACC cells for in vitro and in vivo studies. RNA-seq data from MYB-NFIB translocation positive ACC patients' tumors and MYB-NFIB fusion transcripts in ACC patient-derived xenografts (ACCX) were analyzed to identify MYB breakpoints and their frequency of occurrence. Based on the MYB breakpoints identified, variants of the MYB-NFIB fusion expression system were developed in MYB-NFIB deficient ACC cell lines. Analysis confirmed MYB-NFIB fusion protein expression in ACC cells and ACCXs. Furthermore, recombinant MYB-NFIB fusion displayed sustained protein stability and impacted the transcriptional activity of an interferon-associated gene set as compared to wild type MYB. In vivo tumor formation analysis indicated the capacity of MYB-NFIB fusion cells to grow as implanted tumors, although there were no fusion-mediated growth advantages. This expression system may be useful not only in studies to determine the functional aspects of MYB-NFIB fusion but also in evaluating effective drug response in in vitro and in vivo settings.

In vitro relationships of galactic cosmic radiation and epigenetic clocks in human bronchial epithelial cells.

Nwanaji-Enwerem, Jamaji C; Boileau, Philippe; Galazka, Jonathan M; Cardenas, Andres.

Environ Mol Mutagen ; 63(4): 184-189, 2022 04.

Article in English | MEDLINE | ID: mdl-35470505

ABSTRACT

Ionizing radiation is a well-appreciated health risk, precipitant of DNA damage, and contributor to DNA methylation variability. Nevertheless, relationships of ionizing radiation with DNA methylation-based markers of biological age (i.e. epigenetic clocks) remain poorly understood. Using existing data from human bronchial epithelial cells, we examined in vitro relationships of three epigenetic clock measures (Horvath DNAmAge, MiAge, and epiTOC2) with galactic cosmic radiation (GCR), which is particularly hazardous due to its high linear energy transfer (LET) heavy-ion components. High-LET 56Fe was significantly associated with accelerations in epiTOC2 (β = 192 cell divisions, 95% CI: 71, 313, p-value = .003). We also observed a significant, positive interaction of 56Fe ions and time-in-culture with epiTOC2 (95% CI: 42, 441, p-value = .019). However, only the direct 56Fe ion association remained statistically significant after adjusting for multiple hypothesis testing. Epigenetic clocks were not significantly associated with high-LET 28Si and low-LET X-rays. Our results demonstrate sensitivities of specific epigenetic clock measures to certain forms of GCR. These findings suggest that epigenetic clocks may have some utility for monitoring and better understanding the health impacts of GCR.

Subject(s)

Cosmic Radiation , Cosmic Radiation/adverse effects , Epigenesis, Genetic , Epigenomics , Epithelial Cells , Humans , Linear Energy Transfer

Exposure to arsenic at different life-stages and DNA methylation meta-analysis in buccal cells and leukocytes.

Bozack, Anne K; Boileau, Philippe; Wei, Linqing; Hubbard, Alan E; Sillé, Fenna C M; Ferreccio, Catterina; Acevedo, Johanna; Hou, Lifang; Ilievski, Vesna; Steinmaus, Craig M; Smith, Martyn T; Navas-Acien, Ana; Gamble, Mary V; Cardenas, Andres.

Environ Health ; 20(1): 79, 2021 07 09.

Article in English | MEDLINE | ID: mdl-34243768

ABSTRACT

BACKGROUND: Arsenic (As) exposure through drinking water is a global public health concern. Epigenetic dysregulation, including changes in DNA methylation (DNAm), may be involved in arsenic toxicity. Epigenome-wide association studies (EWAS) of arsenic exposure have been restricted to single populations and comparison across EWAS has been limited by methodological differences. Leveraging data from epidemiological studies conducted in Chile and Bangladesh, we use a harmonized data processing and analysis pipeline and meta-analysis to combine results from four EWAS. METHODS: DNAm was measured among adults in Chile with and without prenatal and early-life As exposure in PBMCs and buccal cells (N = 40, 850K array) and among men in Bangladesh with high and low As exposure in PBMCs (N = 32, 850K array; N = 48, 450K array). Linear models were used to identify differentially methylated positions (DMPs) and differentially variable positions (DVPs) adjusting for age, smoking, cell type, and sex in the Chile cohort. Probes common across EWAS were meta-analyzed using METAL, and differentially methylated and variable regions (DMRs and DVRs, respectively) were identified using comb-p. KEGG pathway analysis was used to understand biological functions of DMPs and DVPs. RESULTS: In a meta-analysis restricted to PBMCs, we identified one DMP and 23 DVPs associated with arsenic exposure; including buccal cells, we identified 3 DMPs and 19 DVPs (FDR < 0.05). Using meta-analyzed results, we identified 11 DMRs and 11 DVRs in PBMC samples, and 16 DMRs and 19 DVRs in PBMC and buccal cell samples. One region annotated to LRRC27 was identified as a DMR and DVR. Arsenic-associated KEGG pathways included lysosome, autophagy, mTOR signaling, AMPK signaling, and one carbon pool by folate.
CONCLUSIONS: Using a two-step process of (1) harmonized data processing and analysis and (2) meta-analysis, we leverage four DNAm datasets from two continents of individuals exposed to high levels of As prenatally and during adulthood to identify DMPs and DVPs associated with arsenic exposure. Our approach suggests that standardizing analytical pipelines can aid in identifying biologically meaningful signals.

Subject(s)

Arsenic/adverse effects , DNA Methylation/drug effects , Leukocytes/metabolism , Mouth Mucosa/cytology , Prenatal Exposure Delayed Effects/genetics , Water Pollutants, Chemical/adverse effects , Adult , Female , Genome-Wide Association Study , Humans , Male , Middle Aged , Pregnancy , Prenatal Exposure Delayed Effects/epidemiology

Exploring high-dimensional biological data with sparse contrastive principal component analysis.

Boileau, Philippe; Hejazi, Nima S; Dudoit, Sandrine.

Bioinformatics ; 36(11): 3422-3430, 2020 06 01.

Article in English | MEDLINE | ID: mdl-32176249

ABSTRACT

MOTIVATION: Statistical analyses of high-throughput sequencing data have re-shaped the biological sciences. In spite of myriad advances, recovering interpretable biological signal from data corrupted by technical noise remains a prevalent open problem. Several classes of procedures, among them classical dimensionality reduction techniques and others incorporating subject-matter knowledge, have provided effective advances. However, no procedure currently satisfies the dual objectives of recovering stable and relevant features simultaneously. RESULTS: Inspired by recent proposals for making use of control data in the removal of unwanted variation, we propose a variant of principal component analysis (PCA), sparse contrastive PCA, which extracts sparse, stable, interpretable and relevant biological signal. The new methodology is compared to competing dimensionality reduction approaches through a simulation study and via analyses of several publicly available protein expression, microarray gene expression and single-cell transcriptome sequencing datasets. AVAILABILITY AND IMPLEMENTATION: A free and open-source software implementation of the methodology, the scPCA R package, is made available via the Bioconductor Project. Code for all analyses presented in this article is also available via GitHub. CONTACT: philippe_boileau@berkeley.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

High-Throughput Nucleotide Sequencing , Software , Principal Component Analysis
