1.
BMC Bioinformatics ; 25(1): 209, 2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38867193

ABSTRACT

BACKGROUND: Single-cell RNA sequencing (sc-RNASeq) data illuminate transcriptomic heterogeneity but also possess a high level of noise, abundant missing entries and sometimes inadequate or no cell type annotations at all. Bulk-level gene expression data lack direct information of cell population composition but are more robust and complete and often better annotated. We propose a modeling framework to integrate bulk-level and single-cell RNASeq data to address the deficiencies and leverage the mutual strengths of each type of data and enable a more comprehensive inference of their transcriptomic heterogeneity. Contrary to the standard approaches of factorizing the bulk-level data with one algorithm and (for some methods) treating single-cell RNASeq data as references to decompose bulk-level data, we employed multiple deconvolution algorithms to factorize the bulk-level data, constructed the probabilistic graphical models of cell-level gene expressions from the decomposition outcomes, and compared the log-likelihood scores of these models in single-cell data. We term this framework backward deconvolution as inference operates from coarse-grained bulk-level data to fine-grained single-cell data. As the abundant missing entries in sc-RNASeq data have a significant effect on log-likelihood scores, we also developed a criterion for inclusion or exclusion of zero entries in log-likelihood score computation. RESULTS: We selected nine deconvolution algorithms and validated backward deconvolution in five datasets. In the in-silico mixtures of mouse sc-RNASeq data, the log-likelihood scores of the deconvolution algorithms were strongly anticorrelated with their errors of mixture coefficients and cell type specific gene expression signatures. In the true bulk-level mouse data, the sample mixture coefficients were unknown but the log-likelihood scores were strongly correlated with accuracy rates of inferred cell types. 
In the data of autism spectrum disorder (ASD) and normal controls, we found that ASD brains possessed higher fractions of astrocytes and lower fractions of NRGN-expressing neurons than normal controls. In datasets of breast cancer and low-grade gliomas (LGG), we compared the log-likelihood scores of three simple hypotheses about the gene expression patterns of the cell types underlying the tumor subtypes. The model that tumors of each subtype were dominated by one cell type persistently outperformed an alternative model that each cell type had elevated expression in one gene group and tumors were mixtures of those cell types. Superiority of the former model is also supported by comparing the real breast cancer sc-RNASeq clusters with those generated by simulated sc-RNASeq data. CONCLUSIONS: The results indicate that backward deconvolution serves as a sensible model selection tool for deconvolution algorithms and facilitates discerning hypotheses about cell type compositions underlying heterogeneous specimens such as tumors.


Subject(s)
Algorithms , Sequence Analysis, RNA , Single-Cell Analysis , Transcriptome , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Transcriptome/genetics , Humans , Gene Expression Profiling/methods , Animals , Mice , Single-Cell Gene Expression Analysis
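The model-comparison idea in the abstract above (score each candidate decomposition by its log-likelihood in single-cell data, with a rule for including or excluding zero entries) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' model: a Gaussian mixture stands in for their probabilistic graphical models, and the names `cell_log_likelihood`, `score_model`, `sigma`, and `skip_zeros` are hypothetical.

```python
import math


def cell_log_likelihood(cell, signatures, fractions, sigma=1.0, skip_zeros=True):
    """Log-likelihood of one cell under a mixture of cell-type signatures.

    Assumption (not from the paper): each cell type k emits expression values
    that are independent Gaussians centered on signatures[k], and the cell
    belongs to type k with probability fractions[k]. At least one fraction
    must be positive.
    """
    log_norm = -0.5 * math.log(2 * math.pi * sigma ** 2)
    per_type = []
    for mu, pi in zip(signatures, fractions):
        if pi <= 0:
            continue
        ll = math.log(pi)
        for x, m in zip(cell, mu):
            if skip_zeros and x == 0:
                continue  # treat the zero as a missing entry, not a measurement
            ll += log_norm - (x - m) ** 2 / (2 * sigma ** 2)
        per_type.append(ll)
    # log-sum-exp over cell types for numerical stability
    top = max(per_type)
    return top + math.log(sum(math.exp(v - top) for v in per_type))


def score_model(cells, signatures, fractions, **kwargs):
    """Total log-likelihood of all cells under one candidate decomposition."""
    return sum(cell_log_likelihood(c, signatures, fractions, **kwargs) for c in cells)


# Toy data: two cells, three genes, zeros standing in for dropouts.
cells = [[5, 0, 1], [0, 4, 0]]
model_a = [[5, 1, 1], [1, 4, 1]]  # hypothetical output of one deconvolution algorithm
model_b = [[3, 3, 3], [3, 3, 3]]  # hypothetical output of another
fractions = [0.5, 0.5]
better = "A" if score_model(cells, model_a, fractions) > score_model(cells, model_b, fractions) else "B"
```

In this toy setting the decomposition whose signatures sit closer to the observed nonzero values receives the higher score; switching `skip_zeros` off penalizes every dropout as if it were a true measurement.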
2.
Biol Open ; 11(6), 2022 Jun 15.
Article in English | MEDLINE | ID: mdl-35665803

ABSTRACT

Despite the remarkable progress in probing tumor transcriptomic heterogeneity by single-cell RNA sequencing (sc-RNAseq) data, several gaps exist in prior studies. Tumor heterogeneity is frequently mentioned but not quantified. Clustering analyses typically target cells rather than genes, and differential levels of transcriptomic heterogeneity of gene clusters are not characterized. Relations between gene clusters inferred from multiple datasets remain less explored. We provided a series of quantitative methods to analyze cancer sc-RNAseq data. First, we proposed two quantitative measures to assess intra-tumoral heterogeneity/homogeneity. Second, we established a hierarchy of gene clusters from sc-RNAseq data, devised an algorithm to reduce the gene cluster hierarchy to a compact structure, and characterized the gene clusters with functional enrichment and heterogeneity. Third, we developed an algorithm to align the gene cluster hierarchies from multiple datasets to a small number of meta gene clusters. By applying these methods to nine cancer sc-RNAseq datasets, we discovered that cancer cell transcriptomes were more homogeneous within tumors than the accompanying normal cells. Furthermore, many gene clusters from the nine datasets were aligned to two large meta gene clusters, which had high and low heterogeneity and were enriched with distinct functions. Finally, we found the homogeneous meta gene cluster retained stronger expression coherence and associations with survival times in bulk level RNAseq data than the heterogeneous meta gene cluster, yet the combinatorial expression patterns of breast cancer subtypes in bulk level data were not preserved in single-cell data. The inference outcomes derived from nine cancer sc-RNAseq datasets provide insights about the contributing factors for transcriptomic heterogeneity of cancer cells and complex relations between bulk level and single-cell RNAseq data. 
They demonstrate the utility of our methods to enable a comprehensive characterization of co-expressed gene clusters in a wide range of sc-RNAseq data in cancers and beyond.


Subject(s)
Breast Neoplasms , Transcriptome , Algorithms , Breast Neoplasms/genetics , Cluster Analysis , Female , Humans , Multigene Family
3.
PLOS Digit Health ; 1(12): e0000151, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36812605

ABSTRACT

Cancer cells harbor molecular alterations at all levels of information processing. Genomic/epigenomic and transcriptomic alterations are inter-related between genes, within and across cancer types and may affect clinical phenotypes. Despite abundant prior studies integrating cancer multi-omics data, none of them organizes these associations in a hierarchical structure and validates the discoveries in extensive external data. We infer this Integrated Hierarchical Association Structure (IHAS) from the complete data of The Cancer Genome Atlas (TCGA) and compile a compendium of cancer multi-omics associations. Intriguingly, diverse alterations on genomes/epigenomes from multiple cancer types impact the transcription of 18 Gene Groups. Half of them are further reduced to three Meta Gene Groups enriched with (1) immune and inflammatory responses, (2) embryonic development and neurogenesis, (3) cell cycle process and DNA repair. Over 80% of the clinical/molecular phenotypes reported in TCGA are aligned with the combinatorial expressions of Meta Gene Groups, Gene Groups, and other IHAS subunits. Furthermore, IHAS derived from TCGA is validated in more than 300 external datasets including multi-omics measurements and cellular responses upon drug treatments and gene perturbations in tumors, cancer cell lines, and normal tissues. To sum up, IHAS stratifies patients in terms of molecular signatures of its subunits, selects targeted genes or drugs for precision cancer therapy, and demonstrates that associations between survival times and transcriptional biomarkers may vary with cancer types. This rich information is critical for the diagnosis and treatment of cancers.

4.
Sci Rep ; 11(1): 17741, 2021 Sep 07.
Article in English | MEDLINE | ID: mdl-34493766

ABSTRACT

Principal Component Analysis (PCA) projects high-dimensional genotype data into a few components that discern populations. Ancestry Informative Markers (AIMs) are a small subset of SNPs capable of distinguishing populations. We integrate these two approaches by proposing an algorithm to identify necessary informative loci whose removal from the data deteriorates the PCA structure. Unlike classical AIMs, necessary informative loci densely cover the genome, hence can illuminate the evolution and mixing history of populations. We conduct a comprehensive analysis of the genotype data of the 1000 Genomes Project using necessary informative loci. Projections along the top seven principal components demarcate populations at distinct geographic levels. Millions of necessary informative loci along each PC are identified. Population identities along each PC are approximately determined by weighted sums of minor (or major) alleles over the informative loci. Variations of allele frequencies are aligned with the history and direction of population evolution. The population distribution of projections along the top three PCs is recapitulated by a simple demographic model based on several waves of founder population separation and mixing. Informative loci possess locational concentration in the genome and functional enrichment. Genes at two hot spots encompassing dense PC 7 informative loci exhibit differential expression among European populations. The mosaic of local ancestry in the genome of a mixed descendant from multiple populations can be inferred from partial PCA projections of informative loci. Finally, informative loci derived from the 1000 Genomes data predict well the projections of an independent genotype dataset of South Asians. These results demonstrate the utility and relevance of informative loci to investigate human evolution.


Subject(s)
Evolution, Molecular , Genome, Human , Genotype , Human Migration , Algorithms , Datasets as Topic , Gene Expression , Genetics, Population , Humans , Polymorphism, Single Nucleotide/genetics , Principal Component Analysis , Racial Groups/genetics
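The abstract above describes projecting genotype data onto principal components and reading population identity from weighted sums over loci. A minimal sketch of that projection, using power iteration on a tiny toy matrix, might look as follows. The function name `top_pc`, the toy genotypes, and the choice of power iteration are assumptions for illustration; inspecting the absolute loadings is only a crude proxy and is not the paper's algorithm for identifying necessary informative loci.

```python
import random


def top_pc(genotypes, iters=200, seed=0):
    """Return (loadings, projections) for the first principal component.

    genotypes: list of samples, each a list of allele counts (0, 1 or 2) per locus.
    Power iteration on X^T X is used here only to keep the sketch
    dependency-free; real analyses would use a linear-algebra library.
    """
    n, m = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]

    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(m)]
    for _ in range(iters):
        xv = [sum(r[j] * v[j] for j in range(m)) for r in centered]               # X v
        w = [sum(centered[i][j] * xv[i] for i in range(n)) for j in range(m)]     # X^T (X v)
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]

    # Each sample's projection is a weighted sum of its centered allele counts.
    projections = [sum(r[j] * v[j] for j in range(m)) for r in centered]
    return v, projections


# Toy data: two samples from each of two populations, four loci.
# The last locus is identical in every sample and carries no information.
genotypes = [
    [2, 2, 0, 1],
    [2, 1, 0, 1],
    [0, 0, 2, 1],
    [0, 1, 2, 1],
]
loadings, projections = top_pc(genotypes)
```

In this toy example the two populations fall on opposite sides of zero along the first component, and the uninformative locus receives a loading of zero.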
5.
BMC Bioinformatics ; 20(1): 145, 2019 Mar 18.
Article in English | MEDLINE | ID: mdl-30885118

ABSTRACT

BACKGROUND: Gene Set Enrichment Analysis (GSEA) is a powerful tool to identify enriched functional categories of informative biomarkers. Canonical GSEA takes one-dimensional feature scores derived from the data of one platform as inputs. Numerous extensions of GSEA handling multimodal OMIC data have been proposed, yet none of them explicitly captures combinatorial relations of feature scores from multiple platforms. RESULTS: We propose multivariate GSEA (MGSEA) to capture combinatorial relations of gene set enrichment among multiple platform features. MGSEA successfully captures designed feature relations from simulated data. By applying it to the scores of delineating breast cancer and glioblastoma multiforme (GBM) subtypes from The Cancer Genome Atlas (TCGA) datasets of CNV, DNA methylation and mRNA expressions, we find that breast cancer and GBM data yield both similar and distinct outcomes. Among the enriched functional categories, subtype-specific biomarkers are dominated by mRNA expression in many functional categories in both cancer types and also by CNV in many functional categories in breast cancer. The enriched functional categories belonging to distinct combinatorial patterns are involved in different oncogenic processes: cell proliferation (such as cell cycle control, estrogen responses, MYC and E2F targets) for mRNA expression in breast cancer, invasion and metastasis (such as cell adhesion and epithelial-mesenchymal transition (EMT)) for CNV in breast cancer, and diverse processes (such as immune and inflammatory responses, cell adhesion, angiogenesis, and EMT) for mRNA expression in GBM. These observations persist in two external datasets (Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) for breast cancer and Repository for Molecular Brain Neoplasia Data (REMBRANDT) for GBM) and are consistent with knowledge of cancer subtypes. 
We further compare the characteristics of MGSEA with several extensions of GSEA and point out the pros and cons of each method. CONCLUSIONS: We demonstrated the utility of MGSEA by inferring the combinatorial relations of multiple platforms for cancer subtype delineation in three multi-OMIC datasets: TCGA, METABRIC and REMBRANDT. The inferred combinatorial patterns are consistent with the current knowledge and also reveal novel insights about cancer subtypes. MGSEA can be further applied to any genotype-phenotype association problems with multimodal OMIC data.


Subject(s)
Brain Neoplasms/genetics , Breast Neoplasms/genetics , Glioblastoma/genetics , Biomarkers, Tumor/genetics , Cell Proliferation , DNA Methylation , Databases, Genetic , Epithelial-Mesenchymal Transition , Gene Expression Regulation, Neoplastic , Humans , Models, Theoretical , Multivariate Analysis
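For readers unfamiliar with the canonical, one-dimensional GSEA that the abstract above takes as its starting point, the weighted running-sum enrichment score can be sketched as below. This shows only the standard single-platform statistic, not the multivariate MGSEA extension; the function name `enrichment_score` and the toy gene labels are assumptions for illustration.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Weighted Kolmogorov-Smirnov-like running-sum statistic of canonical GSEA.

    ranked_genes: gene names sorted by decreasing feature score.
    scores: dict mapping gene name -> feature score.
    gene_set: set of gene names whose enrichment is tested.
    p: exponent weighting each hit by |score| ** p.
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    hit_total = sum(abs(scores[g]) ** p for g, h in zip(ranked_genes, hits) if h)
    if hit_total == 0:
        return 0.0

    running, best = 0.0, 0.0
    for gene, is_hit in zip(ranked_genes, hits):
        if is_hit:
            running += abs(scores[gene]) ** p / hit_total  # step up at set members
        else:
            running -= 1.0 / n_miss                        # step down elsewhere
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking of six hypothetical genes.
scores = {"A": 3, "B": 2, "C": 1, "D": -1, "E": -2, "F": -3}
ranked = sorted(scores, key=scores.get, reverse=True)
es_top = enrichment_score(ranked, scores, {"A", "B"})     # set at the top of the ranking
es_bottom = enrichment_score(ranked, scores, {"E", "F"})  # set at the bottom of the ranking
```

A gene set concentrated at the top of the ranking yields a positive score and one concentrated at the bottom yields a negative score. In practice, significance is then assessed by permutation, which this sketch omits.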
6.
Sci Rep ; 8(1): 11456, 2018 Jul 30.
Article in English | MEDLINE | ID: mdl-30061703

ABSTRACT

Most cancer driver genes are involved in generic cellular processes such as DNA repair, cell proliferation and cell adhesion, yet their mutations are often confined to specific cancer types. To resolve this paradox, we explained mutation frequencies of selected genes across tumor types with four features in the corresponding normal tissues from cancer-free subjects: mRNA expression and chromatin accessibility of mutated genes, mRNA expressions of their neighbors in curated pathways and the protein-protein interaction network. Encouragingly, these transcriptomic/epigenomic features in normal tissues were closely associated with mutational/functional characteristics in tumors. First, chromatin accessibility was a necessary but not sufficient condition for frequent mutations. Second, variations of mutation frequencies in selected genes across tissue types were significantly associated with all four features. Third, the genes possessing significant associations between mutation frequency variations and pathway gene expression were enriched with documented cancer genes. We further proposed a novel bivariate gene set enrichment analysis and confirmed that the pathway gene expression was the dominant factor in cancer gene enrichment. These findings shed light on the functional roles of genes in normal tissues in shaping the mutational landscape during tumor genome evolution.


Subject(s)
Epigenesis, Genetic , Mutation/genetics , Neoplasms/genetics , Transcriptome/genetics , Chromatin/metabolism , Genes, Neoplasm , Humans , Mutation Rate , Organ Specificity/genetics
7.
Neoplasia ; 16(5): 441-50, 2014 May.
Article in English | MEDLINE | ID: mdl-24947187

ABSTRACT

Two genes are called synthetic lethal (SL) if their simultaneous mutations lead to cell death, but each individual mutation does not. Targeting SL partners of mutated cancer genes can kill cancer cells specifically, but leave normal cells intact. We present an integrated approach to uncovering SL pairs in colorectal cancer (CRC). Screening verified SL pairs using microarray gene expression data of cancerous and normal tissues, we first identified potential functionally relevant (simultaneously differentially expressed) gene pairs. From the top-ranked pairs, ~20 genes were chosen for immunohistochemistry (IHC) staining in 171 CRC patients. To find novel SL pairs, all 169 combined pairs from the individual IHC were synergistically correlated to five clinicopathological features, e.g. overall survival. Of the 11 predicted SL pairs, MSH2-POLB and CSNK1E-MYC were consistent with literature, and we validated the top two pairs, CSNK1E-TP53 and CTNNB1-TP53, using RNAi knockdown and small molecule inhibitors of CSNK1E in isogenic HCT-116 and RKO cells. Furthermore, synthetic lethality of CSNK1E and TP53 was verified in a mouse model. Importantly, multivariate analysis revealed that CSNK1E-P53, CTNNB1-P53, MSH2-RB1, and BRCA1-WNT5A were prognostic markers independent of stage, with CSNK1E-P53 applicable to early-stage and the remaining three throughout all stages. Our findings suggest that CSNK1E is a promising target for TP53-mutant CRC patients, who constitute ~40% to 50% of patients, while to date safety regarding inhibition of TP53 is controversial. Thus the integrated approach is useful in finding novel SL pairs for cancer therapeutics, and it is readily accessible and applicable to other cancers.


Subject(s)
Biomarkers, Tumor/genetics , Casein Kinase 1 epsilon/genetics , Colorectal Neoplasms/genetics , Tumor Suppressor Protein p53/genetics , beta Catenin/genetics , Animals , Colorectal Neoplasms/mortality , Humans , Immunohistochemistry , Kaplan-Meier Estimate , Mice , Oligonucleotide Array Sequence Analysis , Prognosis , Proportional Hazards Models , Real-Time Polymerase Chain Reaction , Tissue Array Analysis