Results 1 - 20 of 78
1.
J Comput Graph Stat ; 33(2): 736-748, 2024.
Article in English | MEDLINE | ID: mdl-39170642

ABSTRACT

We propose the Population Difference Criterion to assess the statistical significance of visually observed subpopulation differences. It addresses two challenges: in high-dimensional contexts, distributional models can be dubious, and in high-signal contexts, conventional permutation tests give poor pairwise comparisons. We make two further contributions. First, a careful analysis shows that a balanced permutation approach is more powerful than conventional permutations in high-signal contexts. Second, we quantify the uncertainty due to permutation variation via a bootstrap confidence interval. The practical usefulness of these ideas is illustrated by comparing subpopulations in modern cancer data.
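As a toy illustration of the resampling machinery this abstract discusses, the sketch below implements a plain two-sample permutation test for a difference in means. It is not the paper's Population Difference Criterion or its balanced-permutation variant, which are more elaborate; the function name and add-one p-value correction are illustrative choices.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=2000, rng=None):
    """Two-sample permutation test for a difference in means (a generic
    sketch; the Population Difference Criterion and its balanced
    permutations are more elaborate)."""
    rng = np.random.default_rng(rng)
    pooled = np.concatenate([x, y])
    n_x = len(x)
    observed = abs(x.mean() - y.mean())
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = abs(perm[:n_x].mean() - perm[n_x:].mean())
        hits += stat >= observed
    # add-one correction keeps the p-value away from exactly zero
    return (hits + 1) / (n_perm + 1)
```

The paper's point about high-signal settings is that permutations mixing two very different groups rarely reproduce the observed statistic, which is what balanced permutations are designed to address.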

2.
Cell Rep Methods ; 4(7): 100810, 2024 Jul 15.
Article in English | MEDLINE | ID: mdl-38981475

ABSTRACT

In single-cell RNA sequencing (scRNA-seq) studies, cell types and their marker genes are often identified by clustering and differentially expressed gene (DEG) analysis. A common practice is to select genes using surrogate criteria such as variance or deviance, cluster cells using the selected genes, and then detect markers by DEG analysis assuming known cell types. The surrogate criteria can miss important genes or select unimportant ones, while DEG analysis suffers from selection bias. We present Festem, a statistical method that directly selects cell-type markers for downstream clustering. Festem distinguishes cluster-informative marker genes by their heterogeneous distribution across cells. Simulations and scRNA-seq applications demonstrate that Festem sensitively selects markers with high precision and enables the identification of cell types often missed by other methods. In a large intrahepatic cholangiocarcinoma dataset, we identify diverse CD8+ T cell types and potential prognostic marker genes.


Subject(s)
Single-Cell Analysis, Single-Cell Analysis/methods, Humans, Cluster Analysis, Gene Expression Profiling/methods, Sequence Analysis, RNA/methods, Biomarkers, Tumor/genetics, Biomarkers, Tumor/metabolism, CD8-Positive T-Lymphocytes/metabolism, Cholangiocarcinoma/genetics, Cholangiocarcinoma/pathology, Genetic Markers/genetics
3.
Struct Multidiscipl Optim ; 67(7): 122, 2024.
Article in English | MEDLINE | ID: mdl-39006128

ABSTRACT

This paper investigates a novel approach to efficiently construct and improve surrogate models in problems with high-dimensional input and output. In this approach, the principal components and corresponding features of the high-dimensional output are first identified. For each feature, the active subspace technique is used to identify a corresponding low-dimensional subspace of the input domain; then a surrogate model is built for each feature in its corresponding active subspace. A low-dimensional adaptive learning strategy is proposed to identify training samples to improve the surrogate model. In contrast to existing adaptive learning methods that focus on a scalar output or a small number of outputs, this paper addresses adaptive learning with high-dimensional input and output, with a novel learning function that balances exploration and exploitation, i.e., considering unexplored regions and high-error regions, respectively. The adaptive learning is in terms of the active variables in the low-dimensional space, and the newly added training samples can be easily mapped back to the original space for running the expensive physics model. The proposed method is demonstrated for the numerical simulation of an additive manufacturing part, with a high-dimensional field output quantity of interest (residual stress) in the component that has spatial variability due to the stochastic nature of multiple input variables (including process variables and material properties). Various factors in the adaptive learning process are investigated, including the number of training samples, range and distribution of the adaptive training samples, contributions of various errors, and the importance of exploration versus exploitation in the learning function.
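The decomposition step described above, principal components of a high-dimensional output with one low-dimensional surrogate per component, can be sketched as follows. This toy version uses an ordinary least-squares surrogate per component score and omits the paper's active-subspace identification and adaptive sampling; the helper name is hypothetical.

```python
import numpy as np

def fit_pca_surrogates(X, Y, n_components=2):
    """PCA of the high-dimensional output Y, then one linear surrogate
    per principal-component score. A sketch only: the paper pairs each
    component with an active subspace of the input and adds adaptive
    sampling, both omitted here."""
    Y_mean = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - Y_mean, full_matrices=False)
    components = Vt[:n_components]             # output principal directions
    scores = (Y - Y_mean) @ components.T       # low-dimensional output features
    Xa = np.hstack([X, np.ones((len(X), 1))])  # add intercept column
    W, *_ = np.linalg.lstsq(Xa, scores, rcond=None)

    def predict(X_new):
        Xn = np.hstack([X_new, np.ones((len(X_new), 1))])
        return Xn @ W @ components + Y_mean    # scores -> full output field
    return predict
```

The design choice mirrors the abstract: learning happens in the low-dimensional score space, and predictions are mapped back to the full field through the principal directions.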

4.
Front Immunol ; 15: 1285215, 2024.
Article in English | MEDLINE | ID: mdl-38629063

ABSTRACT

The analytical capability of flow cytometry is crucial for differentiating the growing number of cell subsets found in human blood. This is important for accurate immunophenotyping of patients with few cells and a large number of parameters to monitor. Here, we present a 43-parameter panel to analyze peripheral blood mononuclear cells from healthy individuals using 41 fluorescence-labelled monoclonal antibodies, an autofluorescence channel, and a viability dye. We demonstrate minimal population distortions, which lead to optimized population identification and reproducible results. We applied an advanced approach to panel design, to the selection of sample acquisition parameters, and to data analysis. Appropriate identification of autofluorescence and its integration into the unmixing matrix resolved unspecific signals and increased dimensionality. Adding one laser without an assigned fluorochrome decreased fluorescence spillover and improved discrimination of cell subsets; it also increased the staining index when autofluorescence was integrated into the matrix. We conclude that spectral flow cytometry is a highly valuable tool for high-end immunophenotyping and that fine-tuning of the major experimental steps is key to exploiting its full capacity.


Subject(s)
Fluorescent Dyes, Leukocytes, Mononuclear, Humans, Antibodies, Monoclonal, Leukocyte Count, Light
5.
Stat Med ; 43(10): 2007-2042, 2024 May 10.
Article in English | MEDLINE | ID: mdl-38634309

ABSTRACT

Quantile regression, known as a robust alternative to linear regression, has been widely used in statistical modeling and inference. In this paper, we propose a penalized, weighted, convolution-type smoothed method for variable selection and robust parameter estimation in quantile regression with high-dimensional longitudinal data. The proposed method uses a twice-differentiable smoothed loss function in place of the non-differentiable check function of standard quantile regression, and can consistently select the important covariates via efficient gradient-based iterative algorithms even when the dimension of the covariates exceeds the sample size. Moreover, the proposed method is resistant to outliers in the response variable and/or the covariates. To incorporate the within-subject correlation and enhance estimation accuracy, a two-step weighted estimation method is also established. Furthermore, we prove the oracle properties of the proposed method under some regularity conditions. Finally, the performance of the proposed method is demonstrated by simulation studies and two real examples.


Subject(s)
Algorithms, Models, Statistical, Humans, Computer Simulation, Linear Models, Sample Size
6.
J Pharm Biomed Anal ; 242: 116031, 2024 May 15.
Article in English | MEDLINE | ID: mdl-38382317

ABSTRACT

Robust classification algorithms for high-dimensional, small-sample datasets are valuable in practical applications. Faced with an infrared spectroscopy dataset of 568 samples and 3448 wavelengths (features) for identifying the origins of Chinese medicinal materials, this paper proposes a novel embedded multiclass classification algorithm, ITabNet, derived from the TabNet framework. First, a refined data pre-processing (DP) mechanism was designed to efficiently find the best-adapted method among 50 DP methods with the help of a Support Vector Machine (SVM). Next, an innovative focal loss function was designed and combined with a cross-validation strategy to mitigate the impact of sample imbalance on the algorithm. Detailed investigations of ITabNet were conducted, including comparisons with SVM under DP and non-DP conditions and on GPU and CPU settings, as well as against XGBT (Extreme Gradient Boosting). The numerical results demonstrate that ITabNet can significantly improve prediction effectiveness: the best accuracy score is 1.0000 and the best Area Under the Curve (AUC) score is 1.0000. Suggestions on how to use the models effectively are given. Furthermore, ITabNet shows potential for analyzing the medicinal efficacy and chemical composition of medicinal materials. The paper also provides ideas for multiclass modeling of data with small sample sizes and high-dimensional features.
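For reference, the standard focal loss that an "innovative focal loss" would build on (Lin et al.'s form) down-weights well-classified samples by a factor (1 - p_t)^gamma, which is one common way to counter class imbalance. The abstract does not specify ITabNet's exact variant, so the sketch below shows only the standard multiclass form:

```python
import numpy as np

def focal_loss(probs, labels, gamma=2.0):
    """Mean multiclass focal loss: -(1 - p_t)^gamma * log(p_t), where p_t
    is the predicted probability of the true class. gamma=0 recovers
    plain cross-entropy; larger gamma down-weights easy samples."""
    p_t = probs[np.arange(len(labels)), labels]
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```

With gamma > 0, confidently correct predictions contribute almost nothing to the loss, so gradient signal concentrates on hard or minority-class samples.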


Subject(s)
Drugs, Chinese Herbal, Algorithms, Spectrophotometry, Infrared, Support Vector Machine
7.
Diagnostics (Basel) ; 14(3)2024 Feb 04.
Article in English | MEDLINE | ID: mdl-38337853

ABSTRACT

Given the pronounced impact COVID-19 continues to have on society (700 million reported infections and 6.96 million deaths), many recent deep learning works have focused on diagnosing the virus. However, assessing severity has remained an open and challenging problem due to a lack of large datasets, the high dimensionality of the images (and hence of the weights to be learned), and the compute limitations of modern graphics processing units (GPUs). In this paper, a new, iterative application of transfer learning is demonstrated on the understudied field of 3D CT scans for COVID-19 severity analysis. This methodology improves performance on the MosMed dataset, a small and challenging dataset of 1130 patient images across five levels of COVID-19 severity (Zero, Mild, Moderate, Severe, and Critical). Specifically, given the high dimensionality of the input images, we create several custom shallow convolutional neural network (CNN) architectures and iteratively refine and optimize them, paying attention to learning rates, layer types, normalization types, filter sizes, dropout values, and more. After a preliminary architecture design, the models are systematically trained on simplified versions of the dataset, building models for two-class, then three-class, then four-class, and finally five-class classification. The simplified problem structure allows the model to learn preliminary features first, which can then be refined for the harder classification tasks. Our final model, CoSev, boosts classification accuracy from below 60% initially to 81.57% with the optimizations, matching state-of-the-art performance on the dataset with much simpler setup procedures. Beyond COVID-19 severity diagnosis, the explored methodology can be applied to general image-based disease detection. Overall, this work highlights methodologies that advance computer vision practice for high-dimension, low-sample data, as well as the practicality of data-driven machine learning and the importance of feature design for training, which can be implemented to improve clinical practice.

8.
IEEE Trans Inf Theory ; 69(3): 1695-1738, 2023 Mar.
Article in English | MEDLINE | ID: mdl-37842015

ABSTRACT

In this paper, we consider asymptotically exact support recovery in the context of high dimensional and sparse Canonical Correlation Analysis (CCA). Our main results describe four regimes of interest based on information theoretic and computational considerations. In regimes of "low" sparsity we describe a simple, general, and computationally easy method for support recovery, whereas in a regime of "high" sparsity, it turns out that support recovery is information theoretically impossible. For the sake of information theoretic lower bounds, our results also demonstrate a non-trivial requirement on the "minimal" size of the nonzero elements of the canonical vectors that is required for asymptotically consistent support recovery. Subsequently, the regime of "moderate" sparsity is further divided into two subregimes. In the lower of the two sparsity regimes, we show that polynomial time support recovery is possible by using a sharp analysis of a co-ordinate thresholding [1] type method. In contrast, in the higher end of the moderate sparsity regime, appealing to the "Low Degree Polynomial" Conjecture [2], we provide evidence that polynomial time support recovery methods are inconsistent. Finally, we carry out numerical experiments to compare the efficacy of various methods discussed.

9.
Entropy (Basel) ; 25(9)2023 Sep 15.
Article in English | MEDLINE | ID: mdl-37761640

ABSTRACT

Clustering is used to analyze the intrinsic structure of a dataset based on the similarity of datapoints. Its widespread use, from image segmentation to object recognition and information retrieval, requires great robustness in the clustering process. In this paper, a novel clustering method based on adjacent grid searching (CAGS) is proposed. CAGS consists of two steps: an adaptive grid-space construction strategy and a clustering strategy based on adjacent grid searching. In the first step, a multidimensional grid space is constructed to provide a quantization structure for the input dataset. Noise and cluster halos are automatically distinguished according to grid density. Moreover, the adaptive grid-generating process solves a common problem of grid clustering, in which the number of cells increases sharply with the dimension. In the second step, a two-stage traversal process accomplishes cluster recognition. Cluster cores with arbitrary shapes can be found by concealing the halo points, so the number of clusters is easily identified by CAGS. CAGS therefore has the potential to be widely used for clustering datasets with different characteristics. We test the clustering performance of CAGS on six types of datasets: datasets with noise, large-scale datasets, high-dimensional datasets, datasets with arbitrary shapes, datasets with large between-class density differences, and datasets with high between-class overlap. Experimental results show that CAGS, which performed best on 10 out of 11 tests, outperforms state-of-the-art clustering methods across these datasets.
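The two steps described, grid-space construction followed by adjacent-grid searching, can be sketched with a fixed (non-adaptive) grid and a flood-fill over dense cells. CAGS's adaptive grid generation and halo handling are omitted, and the density threshold below is an arbitrary illustrative choice:

```python
import numpy as np
from collections import deque

def grid_cluster(points, n_bins=10, min_density=3):
    """Fixed-grid clustering: quantize points into cells, keep cells with
    at least min_density points, then flood-fill over adjacent (including
    diagonal) dense cells. Points in sparse cells get label -1 (noise/halo).
    A toy version of the two-step scheme; CAGS adapts the grid instead."""
    points = np.asarray(points, dtype=float)
    mins, maxs = points.min(axis=0), points.max(axis=0)
    cells = ((points - mins) / (maxs - mins + 1e-12) * n_bins).astype(int)
    cells = np.clip(cells, 0, n_bins - 1)
    keys = [tuple(int(c) for c in row) for row in cells]
    counts = {}
    for k in keys:
        counts[k] = counts.get(k, 0) + 1
    dense = {k for k, c in counts.items() if c >= min_density}
    labels, next_label = {}, 0
    dim = points.shape[1]
    for start in dense:
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:
            cell = queue.popleft()
            for offset in np.ndindex(*(3,) * dim):   # 3^dim neighbourhood
                nb = tuple(cell[i] + offset[i] - 1 for i in range(dim))
                if nb in dense and nb not in labels:
                    labels[nb] = next_label
                    queue.append(nb)
        next_label += 1
    return np.array([labels.get(k, -1) for k in keys])
```

The fixed grid also exposes the dimensionality problem the abstract mentions: the 3^dim neighbourhood and the n_bins^dim cell space grow exponentially with the dimension, which is what the adaptive grid construction is meant to tame.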

10.
Ann Stat ; 51(1): 233-259, 2023 Feb.
Article in English | MEDLINE | ID: mdl-37602147

ABSTRACT

We study estimation and testing in the Poisson regression model with noisy high-dimensional covariates, which has wide applications in analyzing noisy big data. Correcting the estimation bias due to covariate noise leads to a non-convex target function to minimize. Handling the high dimensionality further leads us to augment the target function with an amenable penalty term. We propose to estimate the regression parameter by minimizing the penalized target function. We derive the L1 and L2 convergence rates of the estimator and prove variable selection consistency. We further establish the asymptotic normality of any subset of the parameters, where the subset can have infinitely many components as long as its cardinality grows sufficiently slowly. We develop Wald and score tests based on the asymptotic normality of the estimator, which permit testing linear functions of the members of the subset. We examine the finite-sample performance of the proposed tests by extensive simulation. Finally, the proposed method is successfully applied to the Alzheimer's Disease Neuroimaging Initiative study, which originally motivated this work.

11.
Lifetime Data Anal ; 29(4): 769-806, 2023 10.
Article in English | MEDLINE | ID: mdl-37393569

ABSTRACT

Despite the urgent need for an effective prediction model tailored to individual interests, existing models have mainly been developed for the mean outcome, targeting average people. Additionally, the direction and magnitude of covariates' effects on the mean outcome may not hold across different quantiles of the outcome distribution. To accommodate the heterogeneous characteristics of covariates and provide a flexible risk model, we propose a quantile forward regression model for high-dimensional survival data. Our method selects variables by maximizing the likelihood of the asymmetric Laplace distribution (ALD) and derives the final model based on the extended Bayesian Information Criterion (EBIC). We demonstrate that the proposed method enjoys a sure screening property and selection consistency. We apply it to the national health survey dataset to show the advantages of a quantile-specific prediction model. Finally, we discuss potential extensions of our approach, including the nonlinear model and the globally concerned quantile regression coefficients model.
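The working likelihood here is worth spelling out: maximizing the asymmetric Laplace likelihood at quantile level tau is, up to constants, equivalent to minimizing the check loss rho_tau(u) = u*(tau - 1{u<0}). A small sketch of that negative log-likelihood (the forward-selection and EBIC machinery of the paper are not shown):

```python
import numpy as np

def ald_negloglik(residuals, tau, sigma=1.0):
    """Negative log-likelihood of the asymmetric Laplace distribution:
    the check loss rho_tau(u)/sigma plus a term constant in u. Minimizing
    it over a location parameter reproduces quantile fitting at level tau."""
    u = np.asarray(residuals, dtype=float)
    check = u * (tau - (u < 0))                  # rho_tau(u)
    return float(np.sum(check / sigma + np.log(sigma / (tau * (1.0 - tau)))))
```

Because only the check-loss term depends on the residuals, likelihood comparisons between candidate models (as in forward selection) reduce to comparisons of quantile loss.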


Subject(s)
Models, Statistical, Humans, Computer Simulation, Regression Analysis, Bayes Theorem
12.
Biometrics ; 79(2): 841-853, 2023 06.
Article in English | MEDLINE | ID: mdl-35278218

ABSTRACT

In the era of big data, univariate models have been widely used as a workhorse tool for quickly producing marginal estimators, even in high-dimensional dense settings in which many features are "true" but weak signals. Genome-wide association studies (GWAS) epitomize this type of setting. Although the GWAS marginal estimator is popular, it has long been criticized for ignoring the correlation structure of genetic variants (i.e., the linkage disequilibrium [LD] pattern). In this paper, we study the effects of the LD pattern on the GWAS marginal estimator and investigate whether additionally accounting for the LD can improve the prediction accuracy of complex traits. We consider a general high-dimensional dense setting for GWAS and study a class of ridge-type estimators, including the popular marginal estimator and the best linear unbiased prediction (BLUP) estimator as two special cases. We show that the performance of the GWAS marginal estimator depends on the LD pattern through the first three moments of its eigenvalue distribution. Furthermore, we uncover that the relative performance of the GWAS marginal and BLUP estimators depends strongly on the ratio of the GWAS sample size to the number of genetic variants. In particular, the marginal estimator can easily become near-optimal within this class when the sample size is relatively small, even though it ignores the LD pattern. On the other hand, the BLUP estimator performs substantially better than the marginal estimator as the sample size increases toward the number of genetic variants, which is typically in the millions. Therefore, adjusting for the LD (as in the BLUP) is most needed when the GWAS sample size is large. We illustrate the importance of our results using simulated data and real GWAS.
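One way to see the ridge-type family containing both special cases is the estimator (X'X/n + lam*I)^{-1} X'y/n: the marginal estimator is proportional to its lam -> infinity limit X'y/n, while BLUP corresponds to a particular finite lam. A minimal numerical illustration, assuming standardized genotype columns (a simplification of the paper's setting):

```python
import numpy as np

def marginal_estimator(X, y):
    """Per-variant marginal (GWAS-style) slopes, assuming standardized X."""
    return X.T @ y / len(y)

def ridge_estimator(X, y, lam):
    """Ridge-type estimator (X'X/n + lam*I)^{-1} X'y/n. BLUP is a member
    of this family for a particular lam tied to heritability; lam -> inf
    recovers the marginal estimator up to the scale factor lam."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
```

Since prediction accuracy is invariant to a common rescaling of the coefficients, "marginal vs. BLUP" amounts to choosing where along this lam path to sit, which is exactly the sample-size-dependent trade-off the abstract describes.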


Asunto(s)
Estudio de Asociación del Genoma Completo , Desequilibrio de Ligamiento , Herencia Multifactorial , Estudio de Asociación del Genoma Completo/estadística & datos numéricos , Humanos , Exactitud de los Datos , Tamaño de la Muestra , Simulación por Computador
13.
Neural Netw ; 157: 147-159, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36334536

ABSTRACT

Compared with the relatively easy creation or generation of features in data analysis, manual data labeling requires a great deal of time and effort in most cases. Even where automated data labeling helps, the labeling results still need to be checked and verified manually. High Dimension and Low Sample Size (HDLSS) data are therefore very common in data mining and machine learning. For classification problems with HDLSS data, data piling and the approximate equidistance between any two input points in high-dimensional space often cause traditional classifiers to give poor predictive performance. In this paper, we propose a Maximum Decentral Projection Margin Classifier (MDPMC) in the framework of a Support Vector Classifier (SVC). In the MDPMC model, constraints maximizing the projection distance between decentralized input points and their supporting hyperplane are integrated into the SVC model, in addition to maximizing the margin between the two supporting hyperplanes. On ten real HDLSS datasets, experimental results show that the proposed MDPMC approach deals well with the data-piling and approximate-equidistance problems. Compared with SVC with a Linear Kernel (SVC-LK) and with a Radial Basis Function Kernel (SVC-RBFK), Distance Weighted Discrimination (DWD), weighted DWD (wDWD), Distance-Weighted Support Vector Machine (DWSVM), Population-Guided Large Margin Classifier (PGLMC), and Data Maximum Dispersion Classifier (DMDC), MDPMC obtains better predictive accuracy and lower classification errors on the HDLSS data.


Subject(s)
Artificial Intelligence, Support Vector Machine, Sample Size, Machine Learning
14.
Nanomicro Lett ; 14(1): 221, 2022 Nov 14.
Article in English | MEDLINE | ID: mdl-36374430

ABSTRACT

Parallel multi-thread processing in advanced intelligent processors is the core to realize high-speed and high-capacity signal processing systems. Optical neural network (ONN) has the native advantages of high parallelization, large bandwidth, and low power consumption to meet the demand of big data. Here, we demonstrate the dual-layer ONN with Mach-Zehnder interferometer (MZI) network and nonlinear layer, while the nonlinear activation function is achieved by optical-electronic signal conversion. Two frequency components from the microcomb source carrying digit datasets are simultaneously imposed and intelligently recognized through the ONN. We successfully achieve the digit classification of different frequency components by demultiplexing the output signal and testing power distribution. Efficient parallelization feasibility with wavelength division multiplexing is demonstrated in our high-dimensional ONN. This work provides a high-performance architecture for future parallel high-capacity optical analog computing.

15.
J Appl Stat ; 49(15): 3889-3907, 2022.
Article in English | MEDLINE | ID: mdl-36324486

ABSTRACT

Many research proposals involve collecting multiple sources of information from a set of common samples, with the goal of performing an integrative analysis describing the associations between sources. We propose a method that characterizes the dominant modes of co-variation between the variables in two datasets while simultaneously performing variable selection. Our method relies on a sparse, low rank approximation of a matrix containing pairwise measures of association between the two sets of variables. We show that the proposed method shares a close connection with another group of methods for integrative data analysis - sparse canonical correlation analysis (CCA). Under some assumptions, the proposed method and sparse CCA aim to select the same subsets of variables. We show through simulation that the proposed method can achieve better variable selection accuracies than two state-of-the-art sparse CCA algorithms. Empirically, we demonstrate through the analysis of DNA methylation and gene expression data that the proposed method selects variables that have as high or higher canonical correlation than the variables selected by sparse CCA methods, which is a rather surprising finding given that the objective function of the proposed method does not actually maximize the canonical correlation.
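One standard way to obtain a sparse, low-rank approximation of an association matrix is alternating soft-thresholded power iteration, in the spirit of penalized matrix decomposition; the paper's exact association measure, penalties, and initialization may differ, so the rank-1 sketch below is illustrative only:

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_rank1(A, lam_u=0.1, lam_v=0.1, n_iter=100):
    """Sparse rank-1 approximation of an association matrix A by
    alternating soft-thresholded power iterations. The nonzero entries
    of u and v indicate the selected rows and columns (variables)."""
    u = A[:, 0] / (np.linalg.norm(A[:, 0]) + 1e-12)  # crude initialization
    v = np.zeros(A.shape[1])
    for _ in range(n_iter):
        v = soft_threshold(A.T @ u, lam_v)
        nv = np.linalg.norm(v)
        if nv > 0:
            v = v / nv
        u = soft_threshold(A @ v, lam_u)
        nu = np.linalg.norm(u)
        if nu > 0:
            u = u / nu
    return u, v
```

The selection interpretation matches the abstract: variables whose rows or columns of A carry little association mass are thresholded to zero rather than merely down-weighted.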

16.
Sensors (Basel) ; 22(19)2022 Sep 20.
Article in English | MEDLINE | ID: mdl-36236225

ABSTRACT

With the development and integration of GNSS systems worldwide, the positioning accuracy and reliability of GNSS navigation services are increasing across many fields. Multisystem fusion, however, increases the ambiguity dimension, and because the ambiguity parameters are discrete, conventional search algorithms become inefficient when the ambiguity dimension is large. This paper therefore proposes a new lattice ambiguity search algorithm based on breadth-first search: it searches for the optimal lattice points using lattice theory and reduces the ambiguity search space by computing and comparing the Euclidean distance between each search candidate and the target. The experimental results show that this method can effectively improve the efficiency of ambiguity search in high-dimensional settings.
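A toy stand-in for the search idea, best-first expansion of integer candidates with Euclidean-distance pruning, is sketched below. Real GNSS ambiguity resolution searches in the metric of the ambiguity covariance matrix rather than the identity, which this sketch ignores, and the pruning slack is an arbitrary illustrative parameter:

```python
import numpy as np
from heapq import heappush, heappop

def nearest_integer_vector(a_hat, slack=2.0):
    """Best-first search over integer vectors near a float estimate,
    pruning candidates whose squared Euclidean distance exceeds the
    current best by more than `slack`. A toy stand-in for lattice-based
    ambiguity search; the covariance metric is omitted."""
    start = tuple(int(round(x)) for x in a_hat)
    d0 = sum((s - x) ** 2 for s, x in zip(start, a_hat))
    best, best_d = start, d0
    frontier = [(d0, start)]
    seen = {start}
    while frontier:
        d, cand = heappop(frontier)
        if d > best_d + slack:        # prune: too far to matter
            continue
        if d < best_d:
            best, best_d = cand, d
        for i in range(len(cand)):    # expand axis neighbours
            for step in (-1, 1):
                nb = cand[:i] + (cand[i] + step,) + cand[i + 1:]
                if nb not in seen:
                    seen.add(nb)
                    nd = sum((n - x) ** 2 for n, x in zip(nb, a_hat))
                    heappush(frontier, (nd, nb))
    return np.array(best)
```

The distance-based pruning is what keeps the explored region finite: candidates beyond the slack bound are never expanded, so the frontier cannot grow without limit even though the integer lattice is infinite.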

17.
Front Genet ; 13: 859462, 2022.
Article in English | MEDLINE | ID: mdl-35734430

ABSTRACT

Motivation: Identifying new genetic associations in non-Mendelian complex diseases is an increasingly difficult challenge. These diseases sometimes appear to have a significant component of heritability requiring explanation, and this missing heritability may be due to the existence of subtypes involving different genetic factors. Taking genetic information into account in clinical trials might potentially have a role in guiding the process of subtyping a complex disease. Most methods dealing with multiple sources of information rely on data transformation, and in disease subtyping, the two main strategies used are 1) the clustering of clinical data followed by posterior genetic analysis and 2) the concomitant clustering of clinical and genetic variables. Both of these strategies have limitations that we propose to address. Contribution: This work proposes an original method for disease subtyping on the basis of both longitudinal clinical variables and high-dimensional genetic markers via a sparse mixture-of-regressions model. The added value of our approach lies in its interpretability in relation to two aspects. First, our model links both clinical and genetic data with regard to their initial nature (i.e., without transformation) and does not require post-processing where the original information is accessed a second time to interpret the subtypes. Second, it can address large-scale problems because of a variable selection step that is used to discard genetic variables that may not be relevant for subtyping. Results: The proposed method was validated on simulations. A dataset from a cohort of Parkinson's disease patients was also analyzed. Several subtypes of the disease and genetic variants that potentially have a role in this typology were identified. Software availability: The R code for the proposed method, named DiSuGen, and a tutorial are available for download (see the references).

18.
J Appl Stat ; 49(3): 764-781, 2022.
Article in English | MEDLINE | ID: mdl-35706767

ABSTRACT

We propose a new methodology for selecting and ranking covariates associated with a variable of interest in a high-dimensional setting with dependent covariates and few observations. The methodology successively intertwines the clustering of covariates, their decorrelation using Factor Latent Analysis, selection by aggregation of adapted methods, and finally ranking. A simulation study shows the benefit of decorrelation within the different clusters of covariates. We first apply our method to transcriptomic data from 37 patients with advanced non-small-cell lung cancer who received chemotherapy, to select the transcriptomic covariates that explain the survival outcome of the treatment. Second, we apply it to 79 breast tumor samples to define patient profiles for a new metastatic biomarker and the associated gene network, in order to personalize treatments.

19.
Biom J ; 64(6): 1007-1022, 2022 08.
Article in English | MEDLINE | ID: mdl-35524713

ABSTRACT

We propose a two-way additive model with group-specific interactions, where the group information is unknown. We treat the group membership as latent information and propose an EM algorithm for estimation. With a single observation matrix and under the situation of diverging row and column numbers, we rigorously establish the estimation consistency and asymptotic normality of our estimator. Extensive simulation studies are conducted to demonstrate the finite sample performance. We apply the model to the triple negative breast cancer (TNBC) gene expression data and provide a new way to classify patients into different subtypes. Our analysis detects the potential genes that may be associated with TNBC.


Subject(s)
Triple Negative Breast Neoplasms, Algorithms, Computer Simulation, Gene Expression, Humans, Triple Negative Breast Neoplasms/genetics