ABSTRACT
We applied two state-of-the-art, knowledge-independent data-mining methods - Dynamic Quantum Clustering (DQC) and t-Distributed Stochastic Neighbor Embedding (t-SNE) - to data from The Cancer Genome Atlas (TCGA). We showed that the RNA expression patterns for a mixture of 2,016 samples from five tumor types can sort the tumors into groups enriched for relevant annotations including tumor type, gender, tumor stage, and ethnicity. DQC feature selection analysis discovered 48 core biomarker transcripts that clustered tumors by tumor type. When these transcripts were removed, the geometry of tumor relationships changed, but it was still possible to classify the tumors using the RNA expression profiles of the remaining transcripts. We continued to remove the top biomarkers for several iterations, performing cluster analysis after each removal. Even though the most informative transcripts were removed from the cluster analysis, the sorting ability of the remaining transcripts remained strong after each iteration. Further, in some iterations we detected a repeating pattern of biological function that was not detectable with the core biomarker transcripts present. This suggests the existence of a "background classification" potential, in which the pattern of gene expression after continued removal of "biomarker" transcripts could still classify tumors in agreement with the tumor type.
Subjects
Biomarkers, Tumor/genetics , Computational Biology , Neoplasms/classification , Neoplasms/genetics , Cluster Analysis , Female , Gene Expression Profiling , Humans , Male , Neoplasm Staging , Neoplasms/pathology
ABSTRACT
We present a modified Lanczos algorithm to diagonalize lattice Hamiltonians with dramatically reduced memory requirements, without restricting to variational ansatzes. The lattice of size $N$ is partitioned into two subclusters. At each iteration the Lanczos vector is projected into two sets of $n_{svd}$ smaller subcluster vectors using singular value decomposition. For low entanglement entropy $S_{ee}$ (satisfied by short-range Hamiltonians), the truncation error is expected to vanish as $\exp(-n_{svd}^{1/S_{ee}})$. Convergence is tested for the Heisenberg model on Kagomé clusters of 24, 30, and 36 sites, with no lattice symmetries exploited, using less than 15 GB of dynamical memory. Generalization of the Lanczos-SVD algorithm to multiple partitioning is discussed, and comparisons to other techniques are given.
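The compression step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows how a vector on a two-subcluster system may be reshaped into a matrix and truncated by singular value decomposition, keeping a chosen number of singular values. The function names `truncate_bipartite` and `reconstruct`, and the parameters `dim_a`, `dim_b`, and `n_svd`, are hypothetical and chosen for illustration.

```python
import numpy as np

def truncate_bipartite(psi, dim_a, dim_b, n_svd):
    """Keep the n_svd largest singular triplets of a vector on a bipartite system.

    psi has length dim_a * dim_b; dim_a and dim_b are the (assumed) Hilbert-space
    dimensions of the two subclusters. All names here are illustrative.
    """
    # View the vector as a dim_a x dim_b matrix of amplitudes
    m = np.asarray(psi).reshape(dim_a, dim_b)
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    # Retain only the leading n_svd singular values and vectors
    u_k, s_k, vh_k = u[:, :n_svd], s[:n_svd], vh[:n_svd, :]
    # Squared norm of the discarded part (sum of dropped squared singular values)
    err = float(np.sum(s[n_svd:] ** 2))
    return u_k, s_k, vh_k, err

def reconstruct(u_k, s_k, vh_k):
    """Rebuild the approximate full vector from the retained subcluster vectors."""
    return ((u_k * s_k) @ vh_k).reshape(-1)
```

In this sketch the retained data occupy roughly `n_svd * (dim_a + dim_b)` numbers instead of `dim_a * dim_b`, which is the source of the memory saving when few singular values carry most of the weight.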
ABSTRACT
A given set of data points in some feature space may be associated with a Schrödinger equation whose potential is determined by the data. This is known to lead to good clustering solutions. Here we extend this approach into a full-fledged dynamical scheme using a time-dependent Schrödinger equation. Moreover, we approximate this Hamiltonian formalism by a truncated calculation within a set of Gaussian wave functions (coherent states) centered around the original points. This allows for analytic evaluation of the time evolution of all such states, opening up the possibility of exploring relationships among data points by observing how the dynamical distances among points vary and how points converge into clusters. This formalism may be further supplemented by preprocessing such as dimensional reduction through singular-value decomposition or feature filtering.
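As a rough illustration of the first step only (the data-determined potential, not the time-dependent evolution), the sketch below assumes the common construction in which the wave function is a sum of Gaussians of width sigma centered on the data points, and the potential follows from requiring that this wave function solve the Schrödinger equation with Hamiltonian H = -(sigma^2/2) ∇^2 + V. The function name `quantum_potential` and its parameters are hypothetical, and the additive constant E is dropped.

```python
import numpy as np

def quantum_potential(x, data, sigma):
    """Return V(x) - E for psi(x) = sum_i exp(-|x - x_i|^2 / (2 sigma^2)).

    x:    (m, d) array of evaluation points
    data: (n, d) array of data points
    Assumes H = -(sigma^2 / 2) * Laplacian + V and H psi = E psi.
    """
    x = np.atleast_2d(x)
    data = np.atleast_2d(data)
    d = data.shape[1]
    # Squared distances between every evaluation point and every data point
    sq = np.sum((x[:, None, :] - data[None, :, :]) ** 2, axis=2)
    w = np.exp(-sq / (2.0 * sigma ** 2))
    # psi can underflow to zero far from all data; callers should stay near the data
    psi = w.sum(axis=1)
    return -d / 2.0 + (w * sq).sum(axis=1) / (2.0 * sigma ** 2 * psi)
```

Under these assumptions a single data point yields a harmonic-oscillator potential centered on that point, and well-separated groups of points produce separate minima, which is what makes the potential usable for clustering.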