Results 1 - 8 of 8
1.
Front Genet ; 14: 1166404, 2023.
Article in English | MEDLINE | ID: mdl-37287536

ABSTRACT

Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of significant interest. In recent years, many approaches have been proposed, among which unsupervised deep learning (UDL) has received much attention. Two interesting questions are 1) how to combine the strengths of UDL and other approaches and 2) how these approaches compare to each other. We combine the variational auto-encoder (VAE), a popular UDL approach, with the recent idea of influential feature principal component analysis (IF-PCA) and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on 10 gene microarray data sets and 8 single-cell RNA-seq data sets. We find that IF-VAE improves significantly over VAE but still underperforms IF-PCA. We also find that IF-PCA is quite competitive, slightly outperforming Seurat and SC3 over the 8 single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA achieves the phase transition in a rare/weak model. Comparatively, Seurat and SC3 are more complex and theoretically difficult to analyze (for these reasons, their optimality remains unclear).
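A minimal sketch of the screening-then-embedding idea behind IF-PCA (and, by extension, IF-VAE, which would replace the PCA step with a VAE encoder). The KS-based screening score, the fixed screening fraction, and the k-means step are illustrative assumptions rather than the authors' exact procedure.

```python
# Sketch of IF-PCA-style subject clustering (illustrative assumptions:
# KS screening score, fixed keep fraction, k-means on top PCs).
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def if_pca_cluster(X, n_clusters, keep_frac=0.05, n_pcs=None):
    """X: (n_subjects, n_features) expression matrix."""
    # Standardize each feature across subjects.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

    # IF step: score each feature by how far its empirical distribution
    # departs from the standard normal null (Kolmogorov-Smirnov statistic).
    ks_scores = np.array([stats.kstest(Z[:, j], "norm").statistic
                          for j in range(Z.shape[1])])

    # Keep the highest-scoring features (fixed fraction here; the paper
    # uses a data-driven higher-criticism threshold instead).
    n_keep = max(n_clusters, int(keep_frac * Z.shape[1]))
    selected = np.argsort(ks_scores)[-n_keep:]

    # PCA step (a VAE encoder would replace this in IF-VAE), then k-means.
    n_pcs = n_pcs or (n_clusters - 1)
    pcs = PCA(n_components=n_pcs).fit_transform(Z[:, selected])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(pcs)

# Toy usage: 100 subjects, 2000 features, 2 planted groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
X[:50, :30] += 1.5                       # a few informative features
labels = if_pca_cluster(X, n_clusters=2)
```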

2.
Bioinformatics ; 32(1): 50-7, 2016 Jan 01.
Article in English | MEDLINE | ID: mdl-26382192

ABSTRACT

MOTIVATION: Technological advances that allow routine identification of high-dimensional risk factors have created high demand for statistical techniques that enable full use of these rich sources of information in genetics studies. Variable selection for censored outcome data, as well as control of false discoveries (i.e., inclusion of irrelevant variables) in the presence of high-dimensional predictors, presents serious challenges. This article develops a computationally feasible method based on boosting and stability selection. Specifically, we modify component-wise gradient boosting to improve computational feasibility and introduce random permutation in stability selection to control false discoveries. RESULTS: We propose a high-dimensional variable selection method that incorporates stability selection to control false discoveries. Comparisons between the proposed method and the commonly used univariate and Lasso approaches for variable selection reveal that the proposed method yields fewer false discoveries. The proposed method is applied to study the associations of 2339 common single-nucleotide polymorphisms (SNPs) with overall survival among cutaneous melanoma (CM) patients. The results confirm that BRCA2 pathway SNPs are likely to be associated with overall survival, as reported in previous literature. Moreover, we identify several new Fanconi anemia (FA) pathway SNPs that are likely to modulate survival of CM patients. AVAILABILITY AND IMPLEMENTATION: The related source code and documents are freely available at https://sites.google.com/site/bestumich/issues. CONTACT: yili@umich.edu.
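A hedged sketch of stability selection wrapped around component-wise boosting. It uses a continuous outcome and plain L2 boosting in place of the paper's censored-survival setting, and it omits the permutation-based false-discovery calibration; the subsample count and selection threshold are illustrative assumptions.

```python
# Sketch of stability selection around component-wise L2 boosting
# (continuous outcome; no permutation-based false-discovery calibration).
import numpy as np

def componentwise_boost(X, y, n_steps=100, nu=0.1):
    """Return the set of predictors picked by component-wise L2 boosting."""
    n, _ = X.shape
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    resid = y - y.mean()
    chosen = set()
    for _ in range(n_steps):
        # Pick the predictor most correlated with the current residual.
        corr = Xs.T @ resid / n
        j = int(np.argmax(np.abs(corr)))
        chosen.add(j)
        resid = resid - nu * corr[j] * Xs[:, j]   # shrunken update
    return chosen

def stability_selection(X, y, n_subsamples=50, threshold=0.6, rng=None):
    """Selection frequency over half-subsamples; keep frequently picked predictors."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        for j in componentwise_boost(X[idx], y[idx]):
            freq[j] += 1
    freq /= n_subsamples
    return np.where(freq >= threshold)[0], freq

# Toy usage: 200 samples, 500 SNP-like predictors, 3 true signals.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 500))
y = X[:, 0] - 0.8 * X[:, 1] + 0.6 * X[:, 2] + rng.normal(size=200)
stable, freq = stability_selection(X, y, rng=rng)
```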


Subject(s)
Algorithms , Melanoma/genetics , BRCA2 Protein/genetics , Computer Simulation , Humans , Polymorphism, Single Nucleotide/genetics , Risk Factors , Skin Neoplasms , Survival Analysis , Time Factors , Melanoma, Cutaneous Malignant
3.
Ann Stat ; 42(6): 2202-2242, 2014 Nov 01.
Article in English | MEDLINE | ID: mdl-25541567

ABSTRACT

Consider a linear model Y = Xβ + z, where X = X_{n,p} and z ~ N(0, I_n). The vector β is unknown and it is of interest to separate its nonzero coordinates from the zero ones (i.e., variable selection). Motivated by examples in long-memory time series (Fan and Yao, 2003) and the change-point problem (Bhattacharya, 1994), we are primarily interested in the case where the Gram matrix G = X'X is non-sparse but sparsifiable by a finite-order linear filter. We focus on the regime where signals are both rare and weak, so that successful variable selection is very challenging but still possible. We approach this problem with a new procedure called Covariance Assisted Screening and Estimation (CASE). CASE first uses linear filtering to reduce the original setting to a new regression model in which the corresponding Gram (covariance) matrix is sparse. The new covariance matrix induces a sparse graph, which guides us to conduct multivariate screening without visiting all the submodels. By interacting with the signal sparsity, the graph enables us to decompose the original problem into many separate small-size subproblems (if only we knew where they are!). Linear filtering also induces a so-called problem of information leakage, which can be overcome by the newly introduced patching technique. Together, these give rise to CASE, a two-stage Screen and Clean (Fan and Song, 2010; Wasserman and Roeder, 2009) procedure, in which we first identify candidate submodels by patching and screening, and then re-examine each candidate to remove false positives. For any variable selection procedure β̂, we measure performance by the minimax Hamming distance between the sign vectors of β̂ and β. We show that, in a broad class of situations where the Gram matrix is non-sparse but sparsifiable, CASE achieves the optimal rate of convergence. The results are successfully applied to long-memory time series and the change-point model.
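A small numerical illustration of the sparsification step only: a non-sparse, long-memory-like Gram matrix becomes effectively banded after a first-order difference filter. The synthetic correlation decay, the filter order, and the two-sided filtering used for display are assumptions; the screening, patching, and cleaning stages of CASE are not shown.

```python
# Illustration of sparsification by a finite-order linear filter
# (synthetic long-memory-like Gram matrix, first-difference filter).
import numpy as np

p = 400
idx = np.arange(p)
# Non-sparse Gram matrix with slowly (polynomially) decaying entries.
G = 1.0 / (1.0 + np.abs(idx[:, None] - idx[None, :])) ** 0.4

# First-difference filter as a banded (p-1) x p matrix.
D = np.eye(p - 1, p) - np.eye(p - 1, p, k=1)

# Two-sided filtering for a symmetric display; CASE itself filters the
# marginal statistics X'Y so that the effective Gram matrix is sparse.
G_filt = D @ G @ D.T

def far_offdiag_mass(M, band=5):
    """Fraction of squared mass more than `band` entries off the diagonal."""
    i, j = np.indices(M.shape)
    far = np.abs(i - j) > band
    return float(np.sum(M[far] ** 2) / np.sum(M ** 2))

print("far-off-diagonal mass, original:", round(far_offdiag_mass(G), 4))
print("far-off-diagonal mass, filtered:", round(far_offdiag_mass(G_filt), 4))
```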

4.
J Stat Res ; 44(1): 103-107, 2010 Jan 01.
Article in English | MEDLINE | ID: mdl-24563569

ABSTRACT

In a recent paper [4], Efron pointed out that an important issue in large-scale multiple hypothesis testing is that the null distribution may be unknown and need to be estimated. Consider a Gaussian mixture model in which the null distribution is known to be normal but both null parameters, the mean and the variance, are unknown. We address the problem with a method based on the Fourier transform. The Fourier approach was first studied by Jin and Cai [9], which focuses on the scenario where any non-null effect has either the same or a larger variance than the null effects. In this paper, we review the main ideas in [9] and propose a generalized Fourier approach to tackle the problem under another scenario: any non-null effect has a larger mean than the null effects, but no constraint is imposed on the variance. This approach and that of [9] complement each other: each is successful in a wide class of situations where the other fails. We also extend the Fourier approach to estimate the proportion of non-null effects. The proposed procedures perform well both in theory and on simulated data.
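A minimal sketch of the Fourier idea: match the empirical characteristic function of the observed Z-scores to a Gaussian null component over a grid of frequencies. The frequency grid, the least-squares fit, and the toy mixture are illustrative assumptions, not the estimator of [9] or of this paper.

```python
# Sketch: fit the null proportion, mean and variance by matching the
# empirical characteristic function (ECF) to a Gaussian component.
import numpy as np
from scipy.optimize import least_squares

def ecf(x, t):
    """Empirical characteristic function of sample x at frequencies t."""
    return np.exp(1j * np.outer(t, x)).mean(axis=1)

def fit_null(x, t=None):
    t = np.linspace(0.2, 2.0, 30) if t is None else t
    phi = ecf(x, t)

    def resid(params):
        pi0, mu0, sig0 = params
        model = pi0 * np.exp(1j * mu0 * t - 0.5 * (sig0 * t) ** 2)
        return np.concatenate([(phi - model).real, (phi - model).imag])

    fit = least_squares(resid, x0=[0.9, 0.0, 1.0],
                        bounds=([0.0, -5.0, 0.1], [1.0, 5.0, 5.0]))
    return dict(zip(["pi0", "mu0", "sigma0"], fit.x))

# Toy usage: 90% null N(0.3, 1.2^2), 10% non-null effects with a larger mean.
rng = np.random.default_rng(2)
z = np.concatenate([rng.normal(0.3, 1.2, 9000), rng.normal(3.0, 1.0, 1000)])
print(fit_null(z))
```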

5.
Philos Trans A Math Phys Eng Sci ; 367(1906): 4449-70, 2009 Nov 13.
Article in English | MEDLINE | ID: mdl-19805453

ABSTRACT

We consider two-class linear classification in a high-dimensional, small-sample-size setting. Only a small fraction of the features are useful, these being unknown to us, and each useful feature contributes weakly to the classification decision. This was called the rare/weak (RW) model in our previous study (Donoho, D. & Jin, J. 2008 Proc. Natl Acad. Sci. USA 105, 14790-14795). We select features by thresholding feature Z-scores. The threshold is set by higher criticism (HC). ...

6.
Proc Natl Acad Sci U S A ; 106(22): 8859-64, 2009 Jun 02.
Article in English | MEDLINE | ID: mdl-19447927

ABSTRACT

We study a two-class classification problem with a large number of features, out of which many are useless and only a few are useful, but we do not know which ones they are. The number of features is large compared with the number of training observations. Calibrating the model with four key parameters (the number of features, the size of the training sample, and the fraction and strength of useful features), we identify a region in parameter space where no trained classifier can reliably separate the two classes on fresh data. The complement of this region, where successful classification is possible, is also briefly discussed.

7.
Proc Natl Acad Sci U S A ; 105(39): 14790-5, 2008 Sep 30.
Article in English | MEDLINE | ID: mdl-18815365

ABSTRACT

In important application fields today, such as genomics and proteomics, selecting a small subset of useful features is crucial for the success of Linear Classification Analysis. We study feature selection by thresholding of feature Z-scores and introduce a principle of threshold selection based on the notion of higher criticism (HC). For i = 1, 2, ..., p, let π_i denote the two-sided P-value associated with the ith feature Z-score and π_(i) denote the ith order statistic of the collection of P-values. The HC threshold is the absolute Z-score corresponding to the P-value maximizing the HC objective (i/p - π_(i)) / sqrt{(i/p)(1 - i/p)}. We consider a rare/weak (RW) feature model, where the fraction of useful features is small and the useful features are each too weak to be of much use on their own. HC thresholding (HCT) has interesting behavior in this setting, with an intimate link between maximizing the HC objective and minimizing the error rate of the designed classifier, and it behaves very differently from popular threshold selection procedures such as false discovery rate thresholding (FDRT). In the most challenging RW settings, HCT uses an unconventionally low threshold; this keeps the missed-feature detection rate under better control than FDRT does and yields a classifier with improved misclassification performance. Replacing cross-validated threshold selection in the popular Shrunken Centroid classifier with the computationally less expensive and simpler HCT reduces both the variance of the selected threshold and the error rate of the constructed classifier. Results on standard real datasets and in asymptotic theory confirm the advantages of HCT.
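A short sketch that transcribes the HC objective from this abstract into code and uses its maximizer to set a feature-selection threshold. The restriction of the search to the smallest half of the P-values and the toy rare/weak data are assumptions.

```python
# Higher-criticism thresholding: maximize (i/p - pi_(i)) / sqrt((i/p)(1 - i/p))
# over the smallest half of the P-values and threshold |Z| there.
import numpy as np
from scipy.stats import norm

def hc_threshold(z):
    p = len(z)
    pvals = 2.0 * norm.sf(np.abs(z))            # two-sided P-values
    order = np.argsort(pvals)
    p_sorted = pvals[order]
    frac = np.arange(1, p + 1) / p              # i/p
    hc = (frac - p_sorted) / np.sqrt(frac * (1.0 - frac))
    i_star = int(np.argmax(hc[: p // 2]))       # assumed stability safeguard
    return np.abs(z[order[i_star]])             # |Z| at the maximizing P-value

# Toy usage in a rare/weak setting: a few weakly elevated feature Z-scores.
rng = np.random.default_rng(3)
z = rng.normal(size=5000)
z[:50] += 2.0                                   # rare, weak useful features
t_hc = hc_threshold(z)
selected = np.where(np.abs(z) >= t_hc)[0]
print(f"HC threshold = {t_hc:.2f}, features kept = {len(selected)}")
```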


Subject(s)
Bias , Data Collection/statistics & numerical data , Genomics/statistics & numerical data , Linear Models , Proteomics/statistics & numerical data
8.
EMBO J ; 26(2): 527-37, 2007 Jan 24.
Article in English | MEDLINE | ID: mdl-17245435

ABSTRACT

Direct imaging or counting of RNA molecules has been difficult owing to RNA's relatively low electron density for EM and the insufficient resolution of AFM. The bacteriophage phi29 DNA-packaging motor is geared by a packaging RNA (pRNA) ring. Currently, whether the ring is a pentagon or a hexagon is under fervent debate. We report here the assembly of a highly sensitive imaging system for direct counting of the copy number of pRNA within this 20-nm motor. Single-fluorophore imaging clearly identified the quantized photobleaching steps from pRNA labeled with a single fluorophore and established its stoichiometry within the motor. Almost all of the motors contained six copies of pRNA before and during DNA translocation, as identified by dual-color detection of the stalled intermediates of motors containing Cy3-pRNA and Cy5-DNA. The stalled motors were restarted to observe the motion of DNA packaging in real time. Heat-denaturation analysis confirmed that the stoichiometry of pRNA is a common multiple of 2 and 3. EM imaging of procapsid/pRNA complexes clearly revealed six ferritin particles conjugated to each pRNA ring.
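A hedged sketch of the step-counting idea: estimate the fluorophore copy number on a single motor from quantized photobleaching drops in its intensity trace. The smoothing window, step-size threshold, and synthetic trace are illustrative assumptions, not the authors' analysis pipeline.

```python
# Sketch: count quantized photobleaching steps in a single-molecule
# intensity trace to estimate fluorophore copy number.
import numpy as np
from scipy.signal import medfilt

def count_photobleach_steps(trace, window=11, min_drop=0.6):
    """Count abrupt downward intensity steps larger than `min_drop`."""
    smooth = medfilt(trace, kernel_size=window)
    drops = smooth[:-1] - smooth[1:]            # positive where intensity falls
    return int(np.sum(drops > min_drop))

# Toy trace: 6 fluorophores bleaching one by one (unit steps) plus noise.
rng = np.random.default_rng(4)
bleach_times = np.sort(rng.choice(np.arange(50, 950, 30), size=6, replace=False))
t = np.arange(1000)
trace = 6.0 - np.searchsorted(bleach_times, t).astype(float)
trace += rng.normal(0.0, 0.15, size=t.size)
print("estimated copies:", count_photobleach_steps(trace))
```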


Subject(s)
Bacillus Phages/ultrastructure , DNA Packaging , DNA, Viral , Microscopy, Electron/methods , RNA, Viral/physiology , Capsid/chemistry , Dimerization , Ferritins/chemistry , Hot Temperature , Microscopy, Fluorescence/methods , Models, Biological , Multiprotein Complexes/chemistry