Results 1 - 20 of 116
1.
Adv Sci (Weinh) ; : e2401815, 2024 Jun 17.
Article in English | MEDLINE | ID: mdl-38887194

ABSTRACT

In recent years, the integration of single-cell multi-omics data has provided a more comprehensive understanding of cell functions and internal regulatory mechanisms from a non-single omics perspective, but it still suffers from many challenges, such as omics variance, sparsity, cell heterogeneity, and confounding factors. The cell cycle is a known confounder when analyzing other factors in single-cell RNA-seq data, but how it affects integrated single-cell multi-omics data remains unclear. Here, a cell cycle-aware network (CCAN) is developed to remove cell cycle effects from integrated single-cell multi-omics data while preserving cell type-specific variations. This is the first computational model to study cell cycle effects in the integration of single-cell multi-omics data. Validations on several benchmark datasets show the outstanding performance of CCAN in a variety of downstream analyses and applications, including removing cell cycle and batch effects from scRNA-seq datasets generated by different protocols, integrating paired and unpaired scRNA-seq and scATAC-seq data, accurately transferring cell type labels from scRNA-seq to scATAC-seq data, and characterizing the differentiation of hematopoietic stem cells into different lineages in integrated differentiation data.
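As context for the kind of correction CCAN automates, the sketch below shows a common baseline in Python: scoring cells for cell cycle phase and regressing the scores out with Scanpy. The input file and truncated gene lists are placeholders; this is not the CCAN model.

```python
# Baseline cell cycle correction for scRNA-seq with Scanpy (not CCAN itself):
# score cells against S/G2M gene lists, then regress the scores out.
import scanpy as sc

adata = sc.read_h5ad("pbmc.h5ad")        # hypothetical input file
s_genes = ["MCM5", "PCNA", "TYMS"]       # truncated example lists; use the
g2m_genes = ["HMGB2", "CDK1", "NUSAP1"]  # full Tirosh et al. gene sets

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)
sc.pp.regress_out(adata, ["S_score", "G2M_score"])  # removes linear cell cycle effects
```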

2.
BMC Bioinformatics ; 25(1): 181, 2024 May 08.
Article in English | MEDLINE | ID: mdl-38720247

ABSTRACT

BACKGROUND: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using gene expression measurements from cancer patients. One challenge for current cancer predictors is that their performance estimates often degrade when integrating molecular datasets generated by different labs. The data are often of variable quality, procured differently, and contaminated with unwanted noise that hampers a predictive model's ability to extract useful information. Data preprocessing methods can be applied to reduce these systematic variations and harmonize datasets before they are used to build a machine learning model for resolving tissue of origin. RESULTS: We investigated the impact of data preprocessing steps, focusing on normalization, batch effect correction, and data scaling, through trial and comparison. Our goal was to improve cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the classifier models constructed for tissue of origin prediction in cancer. CONCLUSION: Using TCGA as a training set, we demonstrated that batch effect correction improved performance, measured by weighted F1-score, in resolving tissue of origin against an independent GTEx test dataset. In contrast, data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, applying data preprocessing techniques to a machine learning pipeline is not always appropriate.
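A minimal sketch of the cross-study setup described above, with synthetic data standing in for the TCGA/GTEx cohorts: the scaler is fit on the training cohort only, and performance is reported as weighted F1-score. All names and dimensions are illustrative.

```python
# Cross-study evaluation sketch: train on one cohort, test on an independent
# one, fitting preprocessing on the training data only to avoid leakage.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 2000)), rng.integers(0, 12, 500)  # "TCGA"
X_test, y_test = rng.normal(size=(200, 2000)), rng.integers(0, 12, 200)    # "GTEx"

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("weighted F1:", f1_score(y_test, clf.predict(X_test), average="weighted"))
```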


Subject(s)
Machine Learning , Neoplasms , RNA-Seq , Humans , RNA-Seq/methods , Neoplasms/genetics , Transcriptome/genetics , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods , Computational Biology/methods
3.
Adv Sci (Weinh) ; : e2308934, 2024 May 22.
Article in English | MEDLINE | ID: mdl-38778573

ABSTRACT

Numerous single-cell transcriptomic datasets from identical tissues or cell lines are generated by different laboratories or single-cell RNA sequencing (scRNA-seq) protocols. Denoising these datasets to eliminate batch effects is crucial for data integration, ensuring accurate interpretation and comprehensive analysis of biological questions. Although many scRNA-seq data integration methods exist, most are inefficient and/or not conducive to downstream analysis. Here, DeepBID, a novel deep learning-based method that performs batch effect correction, non-linear dimensionality reduction, embedding, and cell clustering concurrently, is introduced. DeepBID utilizes a negative binomial-based autoencoder with dual Kullback-Leibler divergence loss functions, aligning cells from different batches within a consistent low-dimensional latent space and progressively mitigating batch effects through iterative clustering. Extensive validation on multi-batch scRNA-seq datasets demonstrates that DeepBID surpasses existing tools in removing batch effects and achieves superior clustering accuracy. When integrating multiple scRNA-seq datasets from patients with Alzheimer's disease, DeepBID significantly improves cell clustering, effectively annotates unidentified cells, and detects cell-specific differentially expressed genes.
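The abstract does not spell out DeepBID's dual-KL objective, so the sketch below shows only the standard DEC-style clustering KL term that iterative-clustering autoencoders commonly minimize, as a hedged illustration of the idea rather than the published loss.

```python
# DEC-style clustering loss sketch (PyTorch): soft assignments to cluster
# centroids are sharpened by minimizing KL(P || Q) against a self-training
# target distribution. Not DeepBID's actual objective.
import torch

def soft_assign(z, centroids, alpha=1.0):
    # Student's t kernel: q_ij ∝ (1 + ||z_i - mu_j||^2 / alpha)^-((alpha+1)/2)
    d2 = torch.cdist(z, centroids).pow(2)
    q = (1.0 + d2 / alpha).pow(-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    # Self-training target: p_ij = (q_ij^2 / f_j) normalized, with f_j = sum_i q_ij
    w = q.pow(2) / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

z = torch.randn(128, 10)                 # latent embeddings from the autoencoder
centroids = torch.randn(7, 10, requires_grad=True)
q = soft_assign(z, centroids)
p = target_distribution(q).detach()
kl = torch.sum(p * torch.log(p / q))     # minimized jointly with the reconstruction loss
```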

4.
Front Genet ; 15: 1381917, 2024.
Article in English | MEDLINE | ID: mdl-38746057

ABSTRACT

MicroRNAs (miRNAs) are promising biomarkers for the early detection of disease, and many miRNA-based diagnostic models have been constructed to distinguish patients from healthy individuals. To make full use of miRNA-profiling data across different sequencing platforms and multiple centers, models that account for batch effects are needed for generalizable medical application. We conducted transcription factor (TF)-mediated miRNA-miRNA interaction network analysis and adopted the within-sample expression ratios of miRNA pairs as predictive markers. The ratio of expression values between each miRNA pair proved stable across multiple data sources. A genetic algorithm-based classifier was constructed to quantify risk scores for the probability of disease and to discriminate disease states from normal states in discovery and validation datasets for COVID-19, renal cell carcinoma, and lung adenocarcinoma. The predictive models based on the expression ratios of interacting miRNA pairs performed well in the discovery and validation datasets, and the classifier may be used for accurate early detection of disease.
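The core feature construction is simple to illustrate: within-sample ratios of miRNA pairs cancel sample- and platform-level scaling factors. The miRNA names and pairing below are hypothetical.

```python
# Within-sample expression ratios for miRNA pairs: both members of a pair are
# measured in the same sample, so any multiplicative sample/platform factor
# cancels, making the feature robust across batches.
import pandas as pd

expr = pd.DataFrame(
    {"miR_21": [120.0, 300.0], "miR_155": [60.0, 100.0]},
    index=["sample_1", "sample_2"],
)
pairs = [("miR_21", "miR_155")]  # e.g., pairs linked by a shared TF in the network
features = pd.DataFrame({f"{a}/{b}": expr[a] / expr[b] for a, b in pairs})
print(features)  # sample_1 -> 2.0, sample_2 -> 3.0
```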

5.
Methods Mol Biol ; 2757: 383-445, 2024.
Article in English | MEDLINE | ID: mdl-38668977

ABSTRACT

The emergence and development of single-cell RNA sequencing (scRNA-seq) techniques enable researchers to perform large-scale analysis of transcriptomic profiles at cell-specific resolution. Unsupervised clustering of scRNA-seq data is central to most studies and is essential for identifying novel cell types and their gene expression programs. Although an increasing number of algorithms and tools are available for scRNA-seq analysis, a practical guide for navigating this landscape is still lacking. This chapter presents an overview of the scRNA-seq data analysis pipeline: quality control, batch effect correction, data standardization, cell clustering and visualization, cluster correlation analysis, and marker gene identification. Taking two broadly used analysis packages, Scanpy and MetaCell, as examples, we provide a hands-on guideline and comparison of best practices for these essential analysis steps and for data visualization. Additionally, we compare both packages and their algorithms using a scRNA-seq dataset of the ctenophore Mnemiopsis leidyi, a representative of one of the earliest-branching animal lineages and critical to understanding the origin and evolution of animal novelties. The pipeline can also be helpful for analyses of other taxa, especially prebilaterian animals such as placozoans and sponges (Porifera), for which these tools are still under development.
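A condensed Scanpy version of the steps this chapter covers might look as follows; parameter values are illustrative defaults rather than the chapter's exact settings, and the batch correction step assumes the harmonypy package is installed.

```python
# Condensed scRNA-seq workflow with Scanpy: QC, normalization, batch
# correction, clustering, visualization, and marker gene identification.
import scanpy as sc

adata = sc.read_h5ad("mnemiopsis.h5ad")                 # hypothetical input
sc.pp.filter_cells(adata, min_genes=200)                # quality control
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)            # standardization
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var["highly_variable"]].copy()
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)
sc.external.pp.harmony_integrate(adata, key="batch")    # batch effect correction
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
sc.tl.leiden(adata)                                     # cell clustering
sc.tl.umap(adata)                                       # visualization
sc.tl.rank_genes_groups(adata, "leiden")                # marker genes per cluster
```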


Subject(s)
Algorithms , Gene Expression Profiling , Single-Cell Analysis , Software , Single-Cell Analysis/methods , Animals , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Computational Biology/methods , Cluster Analysis , Transcriptome/genetics
6.
Med Image Anal ; 94: 103123, 2024 May.
Article in English | MEDLINE | ID: mdl-38430651

ABSTRACT

Cell line authentication plays a crucial role in the biomedical field, ensuring that researchers work with accurately identified cells. Supervised deep learning has made remarkable strides in cell line identification by learning cell morphological features from cell imaging. However, biological batch (bio-batch) effects, a significant issue stemming from the different times at which data are generated, lead to substantial shifts in the underlying data distribution, complicating reliable differentiation between cell lines from distinct batch cultures. To address this challenge, we introduce CLANet, a pioneering framework for cross-batch cell line identification using brightfield images, specifically designed to tackle three distinct bio-batch effects. We propose a cell cluster-level selection method to efficiently capture cell density variations and a self-supervised learning strategy to manage image quality variations, producing reliable patch representations. Additionally, we adopt multiple instance learning (MIL) for effective aggregation of instance-level features for cell line identification. An innovative time-series segment sampling module further enhances MIL's feature-learning capability, mitigating biases from varying incubation times across batches. We validate CLANet using data from 32 cell lines across 93 experimental bio-batches from the AstraZeneca Global Cell Bank. Our results show that CLANet outperforms related approaches (e.g., domain adaptation, MIL), demonstrating its effectiveness in addressing bio-batch effects in cell line identification.
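The MIL aggregation step can be sketched generically; below is an attention-based pooling module in the style of Ilse et al., offered as an illustration of instance aggregation rather than CLANet's actual architecture.

```python
# Attention-based MIL pooling sketch (PyTorch): instance features from one bag
# (e.g., image patches from one bio-batch) are combined into a single
# bag-level embedding via learned attention weights.
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, instances):                       # (n_instances, feat_dim)
        w = torch.softmax(self.attn(instances), dim=0)  # attention over instances
        return (w * instances).sum(dim=0)               # bag-level embedding

patches = torch.randn(40, 512)                   # patch features from one bag
bag_embedding = AttentionMILPooling()(patches)   # fed to the classifier head
```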


Subject(s)
Cell Line Authentication , Humans , Pancreas , Time Factors
7.
Front Public Health ; 12: 1328089, 2024.
Article in English | MEDLINE | ID: mdl-38444441

ABSTRACT

Background: Ultraviolet B (UVB) radiation from sunlight is a major environmental factor whose toxic effects cause structural and functional cutaneous abnormalities in most living organisms. Although numerous studies have described biological mechanisms linking UVB exposure to cutaneous manifestations, they have typically originated from single studies performed under limited conditions. Methods: We collected all publicly available expression data for various skin cell types exposed to UVB, including skin biopsies, keratinocytes, and fibroblasts. We performed biological network analysis to identify molecular mechanisms and genetic biomarkers. Results: We identified the inflammatory response and carcinogenesis as the major UVB-induced signaling alterations and identified three candidate biomarkers (IL1B, CCL2, and LIF). Moreover, we confirmed that these three biomarkers are associated with the survival probability of patients with cutaneous melanoma, the most aggressive and lethal form of skin cancer. Conclusion: Our findings will aid the understanding of UVB-induced cutaneous toxicity and the accompanying molecular mechanisms. In addition, the three candidate biomarkers whose molecular signals change upon UVB exposure of the skin may be related to the survival of patients with cutaneous melanoma.


Subject(s)
Melanoma , Skin Neoplasms , Humans , Melanoma/genetics , Skin Neoplasms/genetics , Base Sequence , Biomarkers , RNA
8.
Genes (Basel) ; 15(1)2024 01 03.
Article in English | MEDLINE | ID: mdl-38254957

ABSTRACT

Genome-wide association studies (GWAS) have successfully revealed many disease-associated genetic variants. For a case-control study, adequate power for an association test can be achieved with a large sample size, although genotyping large samples is expensive. A cost-effective strategy to boost power is to integrate external control samples from publicly available genotyped data. However, naive integration of external controls may inflate type I error rates if the systematic differences between studies (batch effects), such as differences in sequencing platforms, genotype-calling procedures, and population stratification, are ignored. To account for the batch effect, we propose an approach that integrates External Controls into the Association Test by Regression Calibration (iECAT-RC) in case-control association studies. Extensive simulation studies show that iECAT-RC not only controls type I error rates but also boosts statistical power in all models. We also apply iECAT-RC to UK Biobank data for M72 fibroblastic disorders, treating genotype calling as the batch effect. Four SNPs associated with fibroblastic disorders were detected by iECAT-RC and the two comparison methods, iECAT-Score and Internal. However, our method has a higher probability of identifying these significant SNPs in unbalanced case-control association studies.
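The problem iECAT-RC addresses is easy to demonstrate in miniature: naive pooling of batch-shifted external controls produces a spurious association even when the SNP is null. The simulation below illustrates only the inflation; it is not the iECAT-RC calibration itself.

```python
# Naive integration of external controls under the null: the genotype
# frequency differs between studies for technical reasons (e.g., genotype
# calling), so pooling without adjustment creates a spurious association.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
g_cases = rng.binomial(2, 0.30, 300)      # internal cases, truly null SNP
g_int_ctrl = rng.binomial(2, 0.30, 300)   # internal controls, same frequency
g_ext_ctrl = rng.binomial(2, 0.25, 900)   # external controls, batch-shifted calls

genotype = np.r_[g_cases, g_int_ctrl, g_ext_ctrl]
case = np.r_[np.ones(300), np.zeros(1200)]
fit = sm.Logit(case, sm.add_constant(genotype)).fit(disp=0)
print("naive pooled p-value:", fit.pvalues[1])  # often spuriously small
```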


Subject(s)
Genome-Wide Association Study , Calibration , Case-Control Studies , Computer Simulation , Genotype
9.
Diab Vasc Dis Res ; 20(6): 14791641231218453, 2023.
Article in English | MEDLINE | ID: mdl-38059349

ABSTRACT

INTRODUCTION: The Singapore Study of Macro-Angiopathy and Microvascular Reactivity in Type 2 Diabetes (SMART2D) is a prospective cohort study started in 2011 to investigate the effect of risk factors on vascular function and diabetes-related complications in Asians. We aimed to compare longitudinal changes in risk factors while accounting for batch effects and to assess the tracking stability of risk factors over time in patients recruited for SMART2D. In this study, we (1) described the batch effect and its extent across a heterogeneous range of longitudinal parameters, (2) mitigated the batch effect through a statistical approach, and (3) assessed the tracking stability of the risk factors over time. METHODS: A total of 2258 patients with type 2 diabetes mellitus (T2DM) were recruited at baseline. The study adopted a three-wave longitudinal design with intervals of 3 years between consecutive waves. Changes in selected risk factors were assessed after calibration, assuming that patients with similar demographic and anthropometric profiles had similar physiology. The tracking pattern of the risk factors was determined with stability coefficients derived from generalised estimating equations. RESULTS: The medians of the longitudinal differences in risk factors between waves were mostly modest at <10%. Larger increases in augmentation index (AI), aortic systolic blood pressure (BP), and aortic mean BP were consistently observed after calibration. The medians of the longitudinal differences in AI, aortic systolic BP, and aortic mean BP between waves were <2% before calibration but increased slightly to <5% after calibration. Most of the risk factors had moderate to high tracking stability; muscle mass and serum creatinine were among those with relatively high tracking stability. CONCLUSIONS: The longitudinal differences in parameters between waves were overall modest after calibration, suggesting that calibration may attenuate longitudinal differences inflated by non-biological factors such as systematic drift due to batch effects. Changes in the hemodynamic parameters are robust over time and not entirely attributable to age. Our study also demonstrated moderate to high tracking stability for most of the parameters.
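A sketch of how a tracking-stability coefficient can be derived from generalised estimating equations, using simulated long-format data and an exchangeable working correlation; this is an assumed approximation of the paper's approach, not its exact method.

```python
# GEE tracking sketch: with an exchangeable working correlation, the single
# estimated within-subject correlation serves as a crude stability coefficient
# for a risk factor measured across three waves.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 100
subj_effect = rng.normal(0, 8, n)          # stable between-patient differences
df = pd.DataFrame({
    "id": np.repeat(np.arange(n), 3),
    "wave": np.tile([0, 1, 2], n),
    "sbp": 130 + np.repeat(subj_effect, 3) + rng.normal(0, 5, 3 * n),
})

cov = sm.cov_struct.Exchangeable()
fit = smf.gee("sbp ~ wave", groups="id", data=df,
              cov_struct=cov, family=sm.families.Gaussian()).fit()
print(cov.dep_params)   # estimated within-subject correlation (~0.7 here)
```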


Subject(s)
Diabetes Mellitus, Type 2 , Hypertension , Humans , Diabetes Mellitus, Type 2/complications , Diabetes Mellitus, Type 2/diagnosis , Prospective Studies , Singapore/epidemiology , Risk Factors , Hypertension/complications , Blood Pressure/physiology , Longitudinal Studies
10.
Brief Bioinform ; 25(1)2023 11 22.
Article in English | MEDLINE | ID: mdl-37991248

ABSTRACT

The high dimensionality and sparsity of the gene expression matrix in single-cell RNA-sequencing (scRNA-seq) data, coupled with the significant noise generated by shallow sequencing, pose a great challenge for cell clustering methods. While numerous computational methods have been proposed, the majority of existing approaches center on processing the target dataset itself, disregarding the wealth of knowledge present in scRNA-seq data from other species and batches. In light of this, our paper proposes a novel method named graph-based deep embedding clustering (GDEC) that leverages transfer learning across species and batches. GDEC integrates graph convolutional networks, effectively overcoming the challenges posed by sparse gene expression matrices. Additionally, the incorporation of DEC in GDEC enables the partitioning of cell clusters within a lower-dimensional space, mitigating the adverse effects of noise on clustering outcomes. GDEC constructs a model based on existing scRNA-seq datasets and then applies transfer learning to fine-tune the model using a limited amount of prior knowledge gleaned from the target dataset. This empowers GDEC to adeptly cluster scRNA-seq data across different species and batches. Through cross-species and cross-batch clustering experiments, we conducted a comparative analysis between GDEC and conventional packages. Furthermore, we applied GDEC to scRNA-seq data of uterine fibroids. Compared with results obtained from the Seurat package, GDEC unveiled a novel cell type (epithelial cells) and identified a notable number of new pathways among various cell types, underscoring its enhanced analytical capabilities. Availability and implementation: https://github.com/YuzhiSun/GDEC/tree/main.


Subject(s)
Gene Expression Profiling , Leiomyoma , Humans , Gene Expression Profiling/methods , Algorithms , Sequence Analysis, RNA/methods , Single-Cell Gene Expression Analysis , Single-Cell Analysis/methods , Cluster Analysis , Machine Learning
11.
Methods ; 220: 61-68, 2023 12.
Article in English | MEDLINE | ID: mdl-37931852

ABSTRACT

Spatial transcriptomics is a rapidly evolving field that enables researchers to capture comprehensive molecular profiles while preserving information about physical location. One major challenge in this area is the identification of spatial domains, distinct regions characterized by unique gene expression patterns. However, current unsupervised methods have struggled to perform well here due to the high levels of noise and dropout events in spatial transcriptomic profiles. In this paper, we propose a novel hexagonal convolutional neural network (hexCNN) for hexagonal image segmentation on spatially resolved transcriptomics. To address noise and dropout in spatial transcriptomics data, we first extend an unsupervised algorithm to a supervised learning method that can identify useful features and reduce the hindrance of noise. Then, inspired by the classical convolution of convolutional neural networks (CNNs), we designed a regular hexagonal convolution to compensate for missing gene expression patterns using adjacent spots. We evaluated the performance of hexCNN by applying it to the DLPFC dataset. The results show that hexCNN achieves a classification accuracy of 86.8% and an adjusted Rand index (ARI) of 77.1% (1.4% and 2.5% higher, respectively, than those of GNNs). The results also demonstrate that hexCNN is capable of removing noise caused by batch effects while preserving biological signal differences.
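The intuition behind a hexagonal convolution can be sketched without a deep learning framework: each spot aggregates expression from its six neighbors on an axial hexagonal grid. Uniform weights are used below, whereas hexCNN learns its kernel weights; coordinates and values are illustrative.

```python
# Hexagonal neighborhood aggregation on axial (q, r) coordinates: each
# Visium-style spot pools expression from its six neighbors, compensating for
# dropout. A learned hexagonal convolution would replace the uniform mean.
import numpy as np

HEX_NEIGHBORS = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1), (-1, 1)]

def hex_smooth(spots):
    """spots: dict mapping axial (q, r) coordinates to expression vectors."""
    out = {}
    for (q, r), x in spots.items():
        neigh = [spots[(q + dq, r + dr)] for dq, dr in HEX_NEIGHBORS
                 if (q + dq, r + dr) in spots]
        out[(q, r)] = np.mean([x] + neigh, axis=0)  # fill dropout from neighbors
    return out

spots = {(0, 0): np.array([0.0, 5.0]), (1, 0): np.array([2.0, 3.0]),
         (0, 1): np.array([4.0, 1.0])}
print(hex_smooth(spots)[(0, 0)])  # -> [2. 3.]
```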


Subject(s)
Algorithms , Gene Expression Profiling , Neural Networks, Computer , Image Processing, Computer-Assisted
12.
J Magn Reson Imaging ; 2023 Oct 25.
Article in English | MEDLINE | ID: mdl-37877463

ABSTRACT

BACKGROUND: The "batch effect" in MR images, arising from vendor-specific features, MR machine generations, and imaging parameters, challenges image quality and hinders the generalizability of deep learning (DL) models. PURPOSE: To develop a DL model using contrast adjustment and super-resolution to reduce the diversity of diffusion-weighted images (DWIs) across magnetic field strengths and imaging parameters. STUDY TYPE: Retrospective. SUBJECTS: The DL model was built using an open dataset from one individual. The MR machine identification model was trained and validated on a dataset of 1134 adults (54% female, 46% male), with 1050 subjects showing no DWI abnormalities and 84 with conditions such as stroke and tumors. The 21,000 images were divided into 80% for training, 20% for validation, and 3500 for testing. FIELD STRENGTH/SEQUENCE: Seven MR scanners from four manufacturers with 1.5 T and 3 T field strengths. DWIs were acquired using spin-echo sequences, and high-resolution T2WIs were acquired using the T2-SPACE sequence. ASSESSMENT: An experienced, board-certified radiologist evaluated the effectiveness of restoring high-resolution T2WI and harmonizing diverse DWI using metrics such as PSNR and SSIM; texture and frequency attributes were further analyzed using the gray-level co-occurrence matrix and 1-dimensional power spectral density. The model's impact on machine-specific characteristics was gauged through the performance metrics of a ResNet-50 model. Comprehensive statistical tests, including McNemar's test and the Dice index, were employed for statistical robustness. RESULTS: Our DL protocol reduced DWI contrast and resolution variation. The ResNet-50 model's accuracy decreased from 0.9443 to 0.5786, precision from 0.9442 to 0.6494, recall from 0.9443 to 0.5786, and F1 score from 0.9438 to 0.5587, indicating that machine-specific signatures were suppressed. t-SNE visualization showed more consistent image features across the MR devices. The autoencoder halved the learning iterations, and a Dice coefficient >0.74 confirmed signal reproducibility in the 84 lesions. CONCLUSION: This study presents a DL strategy to mitigate batch effects in diffusion MR images, improving their quality and generalizability. EVIDENCE LEVEL: 3. TECHNICAL EFFICACY: Stage 1.

13.
Comput Struct Biotechnol J ; 21: 4804-4815, 2023.
Article in English | MEDLINE | ID: mdl-37841330

ABSTRACT

The human microbiome is an emerging research frontier due to its profound impacts on health. High-throughput microbiome sequencing enables studying microbial communities but suffers from analytical challenges. In particular, the lack of dedicated preprocessing methods to improve data quality impedes effective minimization of biases prior to downstream analysis. This review aims to address this gap by providing a comprehensive overview of preprocessing techniques relevant to microbiome research. We outline a typical workflow for microbiome data analysis. Preprocessing methods discussed include quality filtering, batch effect correction, imputation of missing values, normalization, and data transformation. We highlight strengths and limitations of each technique to serve as a practical guide for researchers and identify areas needing further methodological development. Establishing robust, standardized preprocessing will be essential for drawing valid biological conclusions from microbiome studies.
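As one concrete example of the transformations such a workflow includes, a centered log-ratio (CLR) transform for compositional microbiome counts can be sketched as follows; the pseudocount value is a common but arbitrary choice.

```python
# Centered log-ratio (CLR) transform: maps compositional counts to real space
# so standard statistics apply; a pseudocount handles zeros before the log.
import numpy as np

def clr(counts, pseudocount=0.5):
    x = counts + pseudocount                           # avoid log(0)
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)   # subtract per-sample geometric mean (log scale)

otu_table = np.array([[10, 0, 90], [5, 20, 75]])       # samples x taxa
print(clr(otu_table))
```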

14.
Quant Imaging Med Surg ; 13(9): 6139-6151, 2023 Sep 01.
Article in English | MEDLINE | ID: mdl-37711807

ABSTRACT

Background: Broad generalization of radiomics-assisted models may be impeded by concerns about variability. This study aimed to evaluate the merit of combating batch effects (ComBat) harmonization in reducing the variability of voxel size-related radiomics features in both phantom and clinical studies, in comparison with an image resampling correction method. Methods: A pulmonary phantom with 22 types of nodules was scanned by computed tomography (CT) at different voxel sizes. The variability of voxel size-related radiomics features was evaluated using the concordance correlation coefficient (CCC), dynamic range (DR), and intraclass correlation coefficient (ICC). ComBat and image resampling compensation methods were used to reduce the variability of voxel size-related radiomics features, and the percentage of robust features was compared before and after optimization. Pathological differential diagnosis of invasive adenocarcinoma (IAC) from adenocarcinoma in situ (AIS) and minimally invasive adenocarcinoma (MIA) (the AIS-MIA group) was used for clinical validation in 134 patients. Results: Before optimization, the proportion of excellent features in the phantom and clinical data was 26.12% and 32.31%, respectively. The proportion of excellent features increased after image resampling and ComBat correction. For the clinical data, the ComBat compensation method performed significantly better than image resampling, with excellent features reaching 90.96% and poor features amounting to only 4.96%. In addition, hierarchical clustering analysis showed that first-order and shape features were more robust than texture features. In clinical validation, the area under the curve (AUC) of the testing set was 0.865 after ComBat correction. Conclusions: ComBat harmonization can reduce voxel size-related variability of CT radiomics features in pulmonary nodules more efficiently than image resampling harmonization.
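For intuition, a simplified location/scale harmonization in the spirit of ComBat can be sketched as below; the real method additionally applies empirical Bayes shrinkage and preserves biological covariates, which this sketch omits.

```python
# Simplified ComBat-style harmonization: align each feature's per-batch mean
# and variance to the pooled values. Illustration only; real ComBat shrinks
# batch estimates via empirical Bayes and protects covariates of interest.
import numpy as np

def locscale_harmonize(X, batch):
    """X: samples x features; batch: integer batch label per sample."""
    Xh = X.astype(float).copy()
    grand_mean, grand_std = X.mean(axis=0), X.std(axis=0)
    for b in np.unique(batch):
        idx = batch == b
        mu, sd = X[idx].mean(axis=0), X[idx].std(axis=0)
        Xh[idx] = (X[idx] - mu) / np.where(sd > 0, sd, 1) * grand_std + grand_mean
    return Xh

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(2, 3, (30, 5))])
batch = np.r_[np.zeros(30, int), np.ones(30, int)]
X_harmonized = locscale_harmonize(X, batch)
```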

15.
Genome Biol ; 24(1): 212, 2023 09 20.
Article in English | MEDLINE | ID: mdl-37730638

ABSTRACT

BACKGROUND: Single-cell sequencing provides detailed insights into biological processes, including cell differentiation and identity. While providing deep cell-specific information, the method suffers from technical constraints, most notably a limited number of expressed genes per cell, which leads to suboptimal clustering and cell type identification. RESULTS: Here, we present DISCERN, a novel deep generative network that precisely reconstructs missing single-cell gene expression using a reference dataset. DISCERN outperforms competing algorithms in expression inference, resulting in greatly improved cell clustering, cell type and activity detection, and insights into the cellular regulation of disease. We show that DISCERN is robust against differences between batches while retaining the biological differences between them, a common failure mode of imputation and batch correction algorithms. We use DISCERN to detect two unseen COVID-19-associated T cell types, cytotoxic CD4+ and CD8+ Tc2 T helper cells, with a potential role in adverse disease outcomes. We utilize T cell fraction information from patient blood to classify mild or severe COVID-19 with an AUROC of 80%, which can serve as a biomarker of disease stage. DISCERN can be easily integrated into existing single-cell sequencing workflows. CONCLUSIONS: DISCERN is a flexible tool for reconstructing missing single-cell gene expression using a reference dataset and can be applied to a variety of datasets, yielding novel insights, e.g., into disease mechanisms.


Subject(s)
COVID-19 , Humans , COVID-19/genetics , Algorithms , Cell Cycle , Cell Differentiation , Cluster Analysis
16.
Crit Rev Biotechnol ; : 1-19, 2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37731336

ABSTRACT

Shotgun metagenomics is an increasingly cost-effective approach for profiling environmental and host-associated microbial communities. However, due to the complexity of both microbiomes and the molecular techniques required to analyze them, the reliability and representativeness of the results are contingent upon the field, laboratory, and bioinformatic procedures employed. Here, we consider 15 field and laboratory issues that critically impact downstream bioinformatic and statistical data processing, as well as result interpretation, in bacterial shotgun metagenomic studies. The issues we consider encompass intrinsic properties of samples, study design, and laboratory-processing strategies. We identify the links of field and laboratory steps with downstream analytical procedures, explain the means for detecting potential pitfalls, and propose mitigation measures to overcome or minimize their impact in metagenomic studies. We anticipate that our guidelines will assist data scientists in appropriately processing and interpreting their data, while aiding field and laboratory researchers to implement strategies for improving the quality of the generated results.

17.
Genome Biol ; 24(1): 201, 2023 09 07.
Article in English | MEDLINE | ID: mdl-37674217

ABSTRACT

BACKGROUND: Batch effects are notoriously common technical variations in multiomics data and may result in misleading outcomes if uncorrected or over-corrected. A plethora of batch-effect correction algorithms have been proposed to facilitate data integration, but their respective advantages and limitations have not been adequately assessed in terms of omics types, performance metrics, and application scenarios. RESULTS: As part of the Quartet Project for quality control and data integration of multiomics profiling, we comprehensively assess the performance of seven batch-effect correction algorithms based on performance metrics of clinical relevance: the accuracy of identifying differentially expressed features, the robustness of predictive models, and the ability to accurately cluster cross-batch samples by donor. The ratio-based method, which scales absolute feature values of study samples relative to those of concurrently profiled reference material(s), is found to be much more effective and broadly applicable than the others, especially when batch effects are completely confounded with the biological factors of interest. We further provide practical guidelines for implementing the ratio-based approach in increasingly large-scale multiomics studies. CONCLUSIONS: Multiomics measurements are prone to batch effects, which can be effectively corrected using ratio-based scaling of the multiomics data. Our study lays the foundation for eliminating batch effects at the ratio scale.
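The ratio-based approach is straightforward to sketch: each study sample's features are divided by those of the reference material profiled in the same batch, so multiplicative batch effects cancel. The values below are illustrative.

```python
# Ratio-based scaling: express each study sample's features relative to a
# reference material profiled in the same batch, cancelling batch-specific
# multiplicative effects.
import numpy as np

def ratio_scale(samples, reference, eps=1e-9):
    """samples: samples x features from one batch; reference: the feature
    vector of the concurrently profiled reference material from that batch."""
    return samples / (reference + eps)

batch_a = np.array([[200.0, 50.0], [400.0, 100.0]])   # study samples, batch A
ref_a = np.array([100.0, 25.0])                       # reference, batch A
print(ratio_scale(batch_a, ref_a))                    # ratios comparable across batches
```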


Subject(s)
Algorithms , Multiomics , Base Composition , Benchmarking , Clinical Relevance
18.
Front Bioinform ; 3: 1191961, 2023.
Article in English | MEDLINE | ID: mdl-37600970

ABSTRACT

Large quantities of biological data can now be acquired to characterize cell types and states, from various sources and using a wide diversity of methods, providing scientists with ever more information to answer challenging biological questions. Unfortunately, working with this amount of data comes at the price of ever-increasing data complexity, caused by the multiplication of data types and batch effects, which hinders the joint use of all available data within common analyses. Data integration describes a set of tasks geared towards embedding several datasets of different origins or modalities into a joint representation that can then be used for downstream analyses. In the last decade, dozens of methods relying on various paradigms have been proposed to tackle the different facets of the data integration problem. This review introduces the most common data types encountered in computational biology and provides systematic definitions of the data integration problems. We then present how machine learning innovations were leveraged to build the effective data integration algorithms that are widely used today by computational biologists. We discuss the current state of data integration and important pitfalls to consider when working with data integration tools. Finally, we detail a set of challenges the field will have to overcome in the coming years.

19.
Cancer Inform ; 22: 11769351231190477, 2023.
Article in English | MEDLINE | ID: mdl-37577174

ABSTRACT

Hepatocellular carcinoma (HCC) is one of the most fatal cancers in the world. There is an urgent need to understand the molecular background of HCC to facilitate the identification of biomarkers and the discovery of effective therapeutic targets. Published transcriptomic studies have reported a large number of genes that are individually significant for HCC, but reliable biomarkers remain to be determined. In this study, building on max-linear competing risk factor models, we developed a machine learning analytical framework to analyze transcriptomic data and identify the smallest set of differentially expressed genes (DEGs). By analyzing 9 public whole-transcriptome datasets (containing 1184 HCC samples and 672 nontumor controls), we identified 5 critical DEGs (CCDC107, CXCL12, GIGYF1, GMNN, and IFFO1) between HCC and control samples. Classifiers built on these 5 DEGs reached nearly perfect performance in identifying HCC, and their performance was further validated in a US Caucasian cohort that we collected (17 HCC samples with paired nontumor tissue). The conceptual advance of our work lies in modeling gene-gene interactions and correcting batch effects within the analytic framework. The classifiers built on the 5 DEGs demonstrated clear signature patterns for HCC. The results are interpretable, robust, and reproducible across diverse cohorts and populations with various disease etiologies, indicating that the 5 DEGs are intrinsic variables that describe the overall features of HCC at the genomic level. The analytical framework applied in this study may pave a new way for improving transcriptome profiling analysis of human cancers.

20.
bioRxiv ; 2023 Jun 01.
Article in English | MEDLINE | ID: mdl-37398321

ABSTRACT

Increasingly, scRNA-seq studies explore the heterogeneity of cell populations across different samples and its effect on an organism's phenotype. However, relatively few bioinformatic methods have been developed that adequately address the variation between samples in such population-level analyses. We propose a framework for representing the entire single-cell profile of a sample, which we call its GloScope representation. We apply GloScope to scRNA-seq datasets from study designs ranging from 12 to over 300 samples. These examples demonstrate how GloScope allows researchers to perform essential bioinformatic tasks at the sample level, in particular visualization and quality control assessment.
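The abstract does not detail the representation itself; a plausible sketch in its spirit, assuming each sample is summarized by a density fitted to its cells in a shared latent space and samples are compared via a divergence, follows. All specifics here (the GMM, the Monte Carlo symmetrized KL) are assumptions, not the package's implementation.

```python
# Sample-level representation sketch: fit a Gaussian mixture to each sample's
# cells in a shared PCA space, then compare samples by a Monte Carlo
# symmetrized KL divergence between the fitted densities.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_sample_density(cells, k=3, seed=0):
    return GaussianMixture(n_components=k, random_state=seed).fit(cells)

def sym_kl(gm_a, gm_b, n=5000):
    xa, _ = gm_a.sample(n)
    xb, _ = gm_b.sample(n)
    kl_ab = np.mean(gm_a.score_samples(xa) - gm_b.score_samples(xa))
    kl_ba = np.mean(gm_b.score_samples(xb) - gm_a.score_samples(xb))
    return kl_ab + kl_ba

rng = np.random.default_rng(4)
sample_1 = rng.normal(0.0, 1, (500, 10))   # cells x latent dims, per sample
sample_2 = rng.normal(0.5, 1, (500, 10))
d = sym_kl(fit_sample_density(sample_1), fit_sample_density(sample_2))
print(d)  # one entry of a sample-by-sample divergence matrix for QC/visualization
```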
