
A novel gene selection method for gene expression data for the task of cancer type classification.

Özcan Simsek, N Özlem; Özgür, Arzucan; Gürgen, Fikret.

Biol Direct ; 16(1): 7, 2021 Feb 08.

Article in English | MEDLINE | ID: mdl-33557857

ABSTRACT

Cancer is a polygenic disease, with each cancer type having a different mutation profile. Genomic data can be used to detect these profiles and to diagnose and differentiate cancer types. Variant calling provides mutation information, while gene expression data reveal the altered cell behaviour. Combining the mutation and expression information can lead to accurate discrimination between cancer types. In this study, we transferred the information of existing mutations into a novel gene selection method for gene expression data. We tested the proposed method on the task of diagnosing and differentiating cancer types. It is a disease-specific method, as both the mutations and the expressions are filtered according to the selected cancer types. Our experimental results show that the proposed gene selection method leads to similar or improved performance metrics compared to classical feature selection methods and curated gene sets.
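The abstract does not spell out the selection procedure, so the following is only a minimal sketch of the general idea it describes: use mutation data from the selected cancer types to choose genes, then keep only those genes in the expression matrix. The ranking rule (number of patients carrying a mutation in the gene), the function names, and the data layout are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def select_genes_by_mutation(mutations, labels, cancer_types, top_k):
    """Rank genes by how many patients of the selected cancer types carry
    a mutation in them, and return the top_k gene names.

    mutations    : list of sets, mutated gene names per patient (assumed layout)
    labels       : list of cancer-type labels, one per patient
    cancer_types : set of cancer types of interest (disease-specific filter)
    """
    counts = Counter()
    for genes, label in zip(mutations, labels):
        if label in cancer_types:
            counts.update(genes)
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:top_k]]


def filter_expression(expression, gene_names, selected):
    """Keep only the expression columns whose gene is in `selected`.

    expression : list of rows (one per sample), columns aligned with gene_names
    Returns the kept gene names and the reduced rows.
    """
    selected_set = set(selected)
    keep = [i for i, gene in enumerate(gene_names) if gene in selected_set]
    kept_names = [gene_names[i] for i in keep]
    reduced = [[row[i] for i in keep] for row in expression]
    return kept_names, reduced
```

The reduced matrix can then be passed to any classifier. Comparing its performance against classical feature selection on the same samples would mirror the kind of evaluation the abstract reports.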

Subject(s)

Gene Expression Profiling/methods , Genomics/statistics & numerical data , Machine Learning , Neoplasms/classification , Algorithms , Neoplasms/genetics

Statistical representation models for mutation information within genomic data.

Özcan Simsek, N Özlem; Özgür, Arzucan; Gürgen, Fikret.

BMC Bioinformatics ; 20(1): 324, 2019 Jun 13.

Article in English | MEDLINE | ID: mdl-31195961

ABSTRACT

BACKGROUND: As DNA sequencing technologies improve and become cheaper, genomic data can be utilized for the diagnosis of many diseases such as cancer. Raw human genome data is very large for computational systems to process. Therefore, there is a need for a compact and accurate representation of the valuable information in DNA. Complex genetic disorders often result from multiple gene mutations, and not every mutation contributes equally to the development of a disease. Inspired by the field of information retrieval, we propose using the term frequency (tf) and BM25 term weighting measures with the inverse document frequency (idf) and relevance frequency (rf) measures to weight genes based on their mutations. The underlying assumption is that the more mutations a gene has in patients with a certain disease and the fewer mutations it has in other patients, the more discriminative that gene is. RESULTS: We evaluated the proposed representations on the task of cancer type classification. We applied various machine learning techniques using the tf-idf and tf-rf schemes and their BM25 versions. Our results show that the BM25-tf-rf representation leads to improved classification accuracy and f-score values compared to the other representations. The highest accuracy (76.44%) and f-score (76.95%) are achieved with the BM25-tf-rf based data representation. CONCLUSIONS: In our experiments, the BM25-tf-rf scheme combined with the proposed neural network model is shown to be the best performing classification system for our case study of cancer type classification. This system is further utilized for causal gene analysis. Several of the most influential genes used for decision making are reported in the literature as target or causal genes.
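The weighting idea in this abstract can be sketched by treating each patient as a "document" and each gene as a "term" whose frequency is its mutation count. The abstract does not give the exact formulas, so the sketch below assumes the standard information-retrieval definitions: the BM25 saturating term-frequency component with the common defaults k1 = 1.2 and b = 0.75, and relevance frequency rf = log2(2 + a / max(1, c)), where a and c are the numbers of patients with and without the target disease who carry a mutation in the gene. Function names and the data layout are illustrative only.

```python
import math


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating BM25 term-frequency component (assumed standard form)."""
    if tf == 0:
        return 0.0
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def relevance_frequency(pos_with_gene, neg_with_gene):
    """rf = log2(2 + a / max(1, c)): larger when the gene is mutated mostly
    in patients of the target class."""
    return math.log2(2.0 + pos_with_gene / max(1, neg_with_gene))


def bm25_tf_rf(counts, labels, target):
    """Weight each patient's genes by BM25-tf times rf.

    counts : list of dicts, gene -> mutation count, one dict per patient
    labels : list of cancer-type labels, one per patient
    target : the cancer type treated as the positive class
    Returns (weighted rows, rf value per gene).
    """
    genes = sorted({gene for row in counts for gene in row})
    doc_lens = [sum(row.values()) for row in counts]
    avg_len = sum(doc_lens) / len(doc_lens)

    rf = {}
    for gene in genes:
        a = sum(1 for row, y in zip(counts, labels)
                if y == target and row.get(gene, 0) > 0)
        c = sum(1 for row, y in zip(counts, labels)
                if y != target and row.get(gene, 0) > 0)
        rf[gene] = relevance_frequency(a, c)

    weighted = [
        {gene: bm25_tf(row.get(gene, 0), dl, avg_len) * rf[gene] for gene in genes}
        for row, dl in zip(counts, doc_lens)
    ]
    return weighted, rf
```

With a multi-class problem, one straightforward option is to compute the weights once per cancer type in a one-vs-rest fashion. Whether the authors did this is not stated in the abstract.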

Subject(s)

Genomics/methods , Models, Genetic , Models, Statistical , Mutation/genetics , Databases, Genetic , Exons/genetics , Humans , Introns/genetics , Machine Learning , Neoplasms/genetics , Neural Networks, Computer
