Search | VHL Regional Portal

A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies.

Van, Richard; Alvarez, Daniel; Mize, Travis; Gannavarapu, Sravani; Chintham Reddy, Lohitha; Nasoz, Fatma; Han, Mira V.

BMC Bioinformatics ; 25(1): 181, 2024 May 08.

Article in English | MEDLINE | ID: mdl-38720247

ABSTRACT

BACKGROUND: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated from different labs. Often, the data vary in quality, are procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in attempts to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin. RESULTS: We aimed to investigate the impact of data preprocessing steps, focusing on normalization, batch effect correction, and data scaling, through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer. CONCLUSION: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance measured by weighted F1-score in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
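
The abstract reports performance as a weighted F1-score. As a minimal sketch of that metric (not the authors' code), the per-class F1 values can be averaged with weights equal to each class's share of the true labels. The function name `weighted_f1` and the toy tissue labels are assumptions made up for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its share of true labels."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: samples of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical labels, not taken from TCGA, GTEx, ICGC, or GEO.
y_true = ["lung", "lung", "breast", "breast", "breast", "colon"]
y_pred = ["lung", "breast", "breast", "breast", "breast", "colon"]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Weighting by class support keeps rare tumor types from dominating the average, at the cost of masking poor performance on small classes.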

Subject(s)

Machine Learning; Neoplasms; RNA-Seq; Humans; RNA-Seq/methods; Neoplasms/genetics; Transcriptome/genetics; Sequence Analysis, RNA/methods; Gene Expression Profiling/methods; Computational Biology/methods

Machine learning approaches for the prediction of bone mineral density by using genomic and phenotypic data of 5130 older men.

Wu, Qing; Nasoz, Fatma; Jung, Jongyun; Bhattarai, Bibek; Han, Mira V; Greenes, Robert A; Saag, Kenneth G.

Sci Rep ; 11(1): 4482, 2021 02 24.

Article in English | MEDLINE | ID: mdl-33627720

ABSTRACT

The study aimed to use machine learning (ML) approaches and genomic data to develop a prediction model for bone mineral density (BMD) and to identify the best modeling approach for BMD prediction. The genomic and phenotypic data of the Osteoporotic Fractures in Men Study (n = 5130) were analyzed. A genetic risk score (GRS) was calculated for each participant from 1103 associated SNPs after a comprehensive genotype imputation. Data were normalized and divided into a training set (80%) and a validation set (20%) for analysis. Random forest, gradient boosting, neural network, and linear regression were used to develop BMD prediction models separately. Ten-fold cross-validation was used for hyperparameter optimization. Mean square error and mean absolute error were used to assess model performance. When GRS and phenotypic covariates were used as the predictors, the ML models and linear regression performed similarly in BMD prediction. However, when the GRS was replaced with the 1103 individual SNPs in the model, the ML models performed significantly better than linear regression (with lasso regularization), and the gradient boosting model performed best. Our study suggested that ML models, especially gradient boosting, can improve BMD prediction using genomic data.
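
The two error metrics named in the abstract can be written out directly. This is a minimal sketch with made-up BMD values, not data from the study; the function names are chosen here for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Mean of absolute differences between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical BMD values, for illustration only.
observed = [0.90, 1.00, 1.10, 0.80]
predicted = [1.00, 1.00, 0.90, 0.90]

mse = mean_squared_error(observed, predicted)
mae = mean_absolute_error(observed, predicted)
print(f"MSE={mse:.3f}, MAE={mae:.3f}")
```

Mean square error penalizes large misses more heavily than mean absolute error, so reporting both gives a fuller picture of how a model's errors are distributed.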

Subject(s)

Bone Density/genetics; Bone Density/physiology; Aged; Fractures, Bone/genetics; Fractures, Bone/pathology; Genomics/methods; Genotype; Humans; Linear Models; Machine Learning; Male; Polymorphism, Single Nucleotide/genetics; Risk Assessment; Risk Factors

Machine Learning Approaches for Fracture Risk Assessment: A Comparative Analysis of Genomic and Phenotypic Data in 5130 Older Men.

Wu, Qing; Nasoz, Fatma; Jung, Jongyun; Bhattarai, Bibek; Han, Mira V.

Calcif Tissue Int ; 107(4): 353-361, 2020 10.

Article in English | MEDLINE | ID: mdl-32728911

ABSTRACT

The study aimed to develop fracture prediction models by using machine learning approaches and genomic data, and to identify the best modeling approach for fracture prediction. The genomic data of the Osteoporotic Fractures in Men cohort study (n = 5130) were analyzed. After a comprehensive genotype imputation, a genetic risk score (GRS) was calculated for each participant from 1103 associated single nucleotide polymorphisms. Data were normalized and split into a training set (80%) and a validation set (20%) for analysis. Random forest, gradient boosting, neural network, and logistic regression were used to develop prediction models for major osteoporotic fractures separately, with GRS, bone density, and other risk factors as predictors. In model training, the synthetic minority oversampling technique was used to account for the low fracture rate, and tenfold cross-validation was employed for hyperparameter optimization. In testing, the area under the curve (AUC) and accuracy were used to assess model performance. The McNemar test was employed to examine the accuracy difference between models. The results showed that the prediction performance of gradient boosting was the best, with an AUC of 0.71 and an accuracy of 0.88, and the GRS ranked as the 7th most important variable in the model. The performance of random forest and neural network was also significantly better than that of logistic regression. This study suggested that improving fracture prediction in older men can be achieved by incorporating genetic profiling and by utilizing the gradient boosting approach. This result should not be extrapolated to women or young individuals.
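
The McNemar test compares two classifiers on the same test set by looking only at the cases where exactly one of them is correct. The abstract does not say which variant was used; the sketch below assumes the chi-square form with continuity correction, and the function name `mcnemar` and the toy predictions are hypothetical.

```python
import math


def mcnemar(y_true, pred_a, pred_b):
    """Return (statistic, p_value) for the continuity-corrected McNemar test."""
    # b: model A correct, model B wrong; c: model A wrong, model B correct.
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree on correctness
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical outcomes (1 = fracture) for 20 test participants.
y_true = [1] * 20
pred_a = [1] * 10 + [0] * 2 + [1] * 8  # A correct on the first 10, wrong on the next 2
pred_b = [0] * 10 + [1] * 2 + [1] * 8  # B wrong on the first 10, correct on the next 2
stat, p_value = mcnemar(y_true, pred_a, pred_b)
print(f"statistic={stat:.3f}, p={p_value:.4f}")
```

For small numbers of discordant pairs, an exact binomial version of the test is often preferred over the chi-square approximation.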

Subject(s)

Bone Density; Fractures, Bone/diagnosis; Machine Learning; Risk Assessment; Activities of Daily Living; Aged; Aged, 80 and over; Cohort Studies; Genomics; Humans; Male; Phenotype
