Search | VHL Regional Portal

A novel gene selection method for gene expression data for the task of cancer type classification.

Özcan Simsek, N Özlem; Özgür, Arzucan; Gürgen, Fikret.

Biol Direct ; 16(1): 7, 2021 02 08.

Article in English | MEDLINE | ID: mdl-33557857

ABSTRACT

Cancer is a polygenic disease, with each cancer type having a different mutation profile. Genomic data can be utilized to detect these profiles and to diagnose and differentiate cancer types. Variant calling provides mutation information, while gene expression data reveal altered cell behaviour. Combining the mutation and expression information can lead to accurate discrimination of different cancer types. In this study, we transferred the information of existing mutations into a novel gene selection method for gene expression data. We tested the proposed method on the task of diagnosing and differentiating cancer types. It is a disease-specific method, as both the mutations and the expressions are filtered according to the selected cancer types. Our experimental results show that the proposed gene selection method leads to similar or improved performance metrics compared to classical feature selection methods and curated gene sets.
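The abstract does not spell out the selection rule, so the following is only a minimal sketch of the general idea of using mutation information to choose which expression features to keep. The function names, the `min_count` threshold, the gene and cancer-type labels, and all numbers are hypothetical illustrations, not the authors' method or data.

```python
def select_genes_by_mutation(mutation_counts, cancer_types, min_count=1):
    """Return genes mutated at least min_count times in any selected cancer type."""
    selected = set()
    for cancer_type in cancer_types:
        for gene, count in mutation_counts.get(cancer_type, {}).items():
            if count >= min_count:
                selected.add(gene)
    return sorted(selected)


def filter_expression(samples, genes):
    """Reduce each expression profile (gene -> value) to the selected genes."""
    return [[sample[g] for g in genes] for sample in samples]


# Toy mutation counts per cancer type (invented placeholder values).
mutation_counts = {
    "type_1": {"GENE_A": 12, "GENE_B": 9, "GENE_C": 0},
    "type_2": {"GENE_A": 15, "GENE_D": 7},
    "type_3": {"GENE_E": 20},
}
# Toy expression profiles for two samples (invented placeholder values).
samples = [
    {"GENE_A": 2.1, "GENE_B": 0.4, "GENE_C": 5.0, "GENE_D": 1.3, "GENE_E": 0.9},
    {"GENE_A": 0.3, "GENE_B": 1.8, "GENE_C": 4.7, "GENE_D": 0.2, "GENE_E": 1.1},
]

# Disease-specific filtering: only mutations from the selected types are used.
genes = select_genes_by_mutation(mutation_counts, ["type_1", "type_2"], min_count=5)
reduced = filter_expression(samples, genes)
print(genes)
print(reduced)
```

The reduced matrix would then be passed to whatever classifier is used for cancer type classification.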

Subject(s)

Gene Expression Profiling/methods , Genomics/statistics & numerical data , Machine Learning , Neoplasms/classification , Algorithms , Neoplasms/genetics

Estimation of Parkinson's disease severity using speech features and extreme gradient boosting.

Tunc, Hunkar C; Sakar, C Okan; Apaydin, Hulya; Serbes, Gorkem; Gunduz, Aysegul; Tutuncu, Melih; Gurgen, Fikret.

Med Biol Eng Comput ; 58(11): 2757-2773, 2020 Nov.

Article in English | MEDLINE | ID: mdl-32910301

ABSTRACT

In recent years, there has been increasing interest in building e-health systems. Systems built to deliver health services using internet and communication technologies aim to reduce the costs arising from outpatient visits. Some recent related studies propose machine learning-based telediagnosis and telemonitoring systems for Parkinson's disease (PD). Motivated by studies showing the potential of speech disorders in PD telemonitoring systems, in this study we aim to estimate the severity of PD from patients' voice recordings, using the motor Unified Parkinson's Disease Rating Scale (UPDRS) as the evaluation metric. For this purpose, we apply various speech processing algorithms to the voice signals of the patients and then use the extracted features as input to a two-stage estimation model. The first step applies a wrapper-based feature selection algorithm, Boruta, to select the most informative speech features. The second step feeds the selected features to a decision tree-based boosting algorithm, extreme gradient boosting, which has recently been applied successfully in many machine learning tasks owing to its generalization ability and speed. The feature selection analysis showed that the vibration pattern of the vocal fold is an important indicator of PD severity. We also investigate the effectiveness of using age and years since diagnosis as covariates together with speech features. The lowest mean absolute error, 3.87, was obtained by combining these covariates and speech features with prediction-level fusion. Graphical abstract: framework for the proposed UPDRS estimation model.
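To make the evaluation terms concrete, here is a minimal sketch of mean absolute error and of prediction-level fusion. It assumes fusion means a weighted average of two models' outputs, which the abstract does not confirm. The function names, the `weight_a` parameter, and the toy scores are hypothetical and are not taken from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fuse_predictions(pred_a, pred_b, weight_a=0.5):
    """Prediction-level fusion, assumed here to be a weighted average."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(pred_a, pred_b)]


# Toy motor-UPDRS values and model outputs (invented for illustration).
updrs_true = [10.0, 20.0, 30.0]
speech_pred = [14.0, 18.0, 33.0]      # model using speech features only
covariate_pred = [8.0, 24.0, 29.0]    # model using age / years since diagnosis

fused = fuse_predictions(speech_pred, covariate_pred)
print(mean_absolute_error(updrs_true, speech_pred))
print(mean_absolute_error(updrs_true, covariate_pred))
print(mean_absolute_error(updrs_true, fused))
```

In this toy case the fused predictions have a lower error than either model alone because the two models err in opposite directions; that is not guaranteed in general.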

Subject(s)

Algorithms , Diagnosis, Computer-Assisted , Parkinson Disease/diagnosis , Speech , Age Factors , Aged , Female , Humans , Machine Learning , Male , Middle Aged , Self-Assessment , Severity of Illness Index , Signal Processing, Computer-Assisted , Tape Recording , Telemedicine/methods

Statistical representation models for mutation information within genomic data.

Özcan Simsek, N Özlem; Özgür, Arzucan; Gürgen, Fikret.

BMC Bioinformatics ; 20(1): 324, 2019 Jun 13.

Article in English | MEDLINE | ID: mdl-31195961

ABSTRACT

BACKGROUND: As DNA sequencing technologies improve and become cheaper, genomic data can be utilized for the diagnosis of many diseases such as cancer. Raw human genome data is very large for computational systems to handle. Therefore, there is a need for a compact and accurate representation of the valuable information in DNA. Complex genetic disorders often result from multiple gene mutations, and the mutations do not contribute equally to the development of a disease. Inspired by the field of information retrieval, we propose using the term frequency (tf) and BM25 term weighting measures with the inverse document frequency (idf) and relevance frequency (rf) measures to weight genes based on their mutations. The underlying assumption is that the more mutations a gene has in patients with a certain disease and the fewer mutations it has in other patients, the more discriminative that gene is. RESULTS: We evaluated the proposed representations on the task of cancer type classification. We applied various machine learning techniques using the tf-idf and tf-rf schemes and their BM25 versions. Our results show that the BM25-tf-rf representation leads to improved classification accuracy and f-score values compared to the other representations. The highest accuracy (76.44%) and f-score (76.95%) are achieved with the BM25-tf-rf based data representation. CONCLUSIONS: In our experiments, the BM25-tf-rf scheme combined with the proposed neural network model is shown to be the best performing classification system for our case study of cancer type classification. This system is further utilized for causal gene analysis. Examples of the most effective genes used for decision making are found in the literature as target or causal genes.
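The weighting idea can be sketched with the standard information-retrieval definitions, treating each patient as a "document" and each gene as a "term" whose count is its number of mutations. This is an illustration under assumptions: the exact formulas, the constants `k1` and `b`, and the use of a patient's total mutation count as document length may differ from the paper, and the toy patients are invented.

```python
import math


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating BM25 transform of a raw mutation count."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))


def inverse_document_frequency(n_patients, n_with_gene):
    """idf: genes mutated in few patients overall get a higher weight."""
    return math.log(n_patients / n_with_gene)


def relevance_frequency(pos_with_gene, neg_with_gene):
    """rf: ratio of target-class to other-class patients carrying the gene."""
    return math.log2(2 + pos_with_gene / max(1, neg_with_gene))


def bm25_tf_rf(patient, gene, patients, labels, target):
    """BM25-tf-rf weight of one gene in one patient for a target class."""
    tf = patient.get(gene, 0)
    doc_len = sum(patient.values())  # assumption: total mutations per patient
    avg_len = sum(sum(p.values()) for p in patients) / len(patients)
    pos = sum(1 for p, y in zip(patients, labels) if y == target and p.get(gene, 0) > 0)
    neg = sum(1 for p, y in zip(patients, labels) if y != target and p.get(gene, 0) > 0)
    return bm25_tf(tf, doc_len, avg_len) * relevance_frequency(pos, neg)


# Toy data: gene -> mutation count per patient, with class labels (invented).
patients = [{"g1": 3, "g2": 1}, {"g1": 2, "g2": 1}, {"g2": 1, "g3": 2}, {"g2": 2, "g3": 1}]
labels = ["A", "A", "B", "B"]

w_g1 = bm25_tf_rf(patients[0], "g1", patients, labels, "A")
w_g2 = bm25_tf_rf(patients[0], "g2", patients, labels, "A")
print(w_g1, w_g2)
```

Here `g1` is mutated only in class A patients, so it receives a larger weight than `g2`, which is mutated in every patient; this matches the stated assumption about discriminative genes.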

Subject(s)

Genomics/methods , Models, Genetic , Models, Statistical , Mutation/genetics , Databases, Genetic , Exons/genetics , Humans , Introns/genetics , Machine Learning , Neoplasms/genetics , Neural Networks, Computer

Combining multiple clusterings for protein structure prediction.

Sakar, C Okan; Kursun, Olcay; Seker, Huseyin; Gurgen, Fikret.

Int J Data Min Bioinform ; 10(2): 162-74, 2014.

Article in English | MEDLINE | ID: mdl-25796736

ABSTRACT

Computational annotation and prediction of protein structure is very important in the post-genome era due to the existence of many different proteins, most of which are yet to be verified. Mutual information based feature selection methods can be used to select minimal yet predictive subsets of features. However, protein features are organised into natural partitions (views). Individual feature selection ignores the presence of these views, dismantles them, and intermixes their variables with those of other views, which at best results in a complex, uninterpretable predictive system for such multi-view datasets. In this paper, instead of selecting a subset of individual features, each feature subset is passed through a clustering step so that it is represented in discrete form using the cluster indices; this makes mutual information based methods applicable to view selection. We present our experimental results on a multi-view protein dataset that is used to predict protein structure.
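Once each view has been reduced to one cluster index per sample, scoring a view against the class labels is a plain discrete mutual information computation. The sketch below assumes the clustering step has already been run and that views are ranked by mutual information with the labels; the view names, the ranking rule, and the toy indices are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def rank_views(view_clusters, labels):
    """Sort view names by mutual information with the labels, highest first."""
    scores = {name: mutual_information(idx, labels) for name, idx in view_clusters.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Toy cluster indices for two views over four proteins (invented).
view_clusters = {
    "view_a": [0, 0, 1, 1],  # clusters line up with the class labels
    "view_b": [0, 1, 0, 1],  # clusters are independent of the labels
}
labels = [0, 0, 1, 1]

order, scores = rank_views(view_clusters, labels)
print(order, scores)
```

Because a whole view is summarised by a single discrete variable, a view is kept or discarded as a unit, which is what keeps the resulting model interpretable at the view level.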

Subject(s)

Algorithms , Databases, Protein , Models, Chemical , Proteins/chemistry , Proteins/ultrastructure , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Computer Simulation , Data Mining/methods , Models, Molecular , Molecular Sequence Data , Pattern Recognition, Automated/methods , Protein Conformation

Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings.

Sakar, Betul Erdogdu; Isenkul, M Erdem; Sakar, C Okan; Sertbas, Ahmet; Gurgen, Fikret; Delil, Sakir; Apaydin, Hulya; Kursun, Olcay.

IEEE J Biomed Health Inform ; 17(4): 828-34, 2013 Jul.

Article in English | MEDLINE | ID: mdl-25055311

ABSTRACT

There has been increased interest in speech pattern analysis applications for Parkinsonism, aimed at building predictive telediagnosis and telemonitoring models. For this purpose, we have collected a wide variety of voice samples, including sustained vowels, words, and sentences compiled from a set of speaking exercises for people with Parkinson's disease. There are two main issues in learning from such a dataset, which consists of multiple speech recordings per subject: 1) How predictive are these various types of voice samples, e.g., sustained vowels versus words, in Parkinson's disease (PD) diagnosis? 2) How well do central tendency and dispersion metrics serve as representatives of all sample recordings of a subject? Investigating our Parkinson dataset with well-known machine learning tools, we find that, as reported in the literature, sustained vowels carry more PD-discriminative information. We have also found that, rather than using each voice recording of each subject as an independent data sample, representing the samples of a subject with central tendency and dispersion metrics improves the generalization of the predictive model.
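The per-subject representation can be illustrated as follows. The sketch assumes the mean as the central tendency metric and the population standard deviation as the dispersion metric; the abstract does not say which metrics were used. The function name, subject IDs, and feature values are hypothetical.

```python
from statistics import mean, pstdev


def summarize_subjects(recordings):
    """Collapse (subject_id, feature_vector) recordings to one row per subject.

    Each output row is [mean of every feature..., std of every feature...].
    """
    by_subject = {}
    for subject_id, features in recordings:
        by_subject.setdefault(subject_id, []).append(features)

    summary = {}
    for subject_id, rows in by_subject.items():
        columns = list(zip(*rows))  # one tuple of values per feature
        summary[subject_id] = [mean(c) for c in columns] + [pstdev(c) for c in columns]
    return summary


# Toy recordings: two features per recording (invented values).
recordings = [
    ("s1", [1.0, 10.0]),
    ("s1", [3.0, 14.0]),
    ("s2", [5.0, 5.0]),
]
summary = summarize_subjects(recordings)
print(summary)
```

Aggregating this way also keeps all recordings of one subject on the same side of any train/test split, since each subject contributes exactly one row.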

Subject(s)

Parkinson Disease/physiopathology , Pattern Recognition, Automated/methods , Sound Spectrography/methods , Speech/physiology , Voice/physiology , Adult , Aged , Databases, Factual , Female , Humans , Male , Middle Aged , Support Vector Machine

Intelligent data analysis to interpret major risk factors for diabetic patients with and without ischemic stroke in a small population.

Gürgen, Fikret; Gürgen, Nurgül.

Biomed Eng Online ; 2: 5, 2003 Mar 04.

Article in English | MEDLINE | ID: mdl-12685939

ABSTRACT

This study proposes an intelligent data analysis approach to investigate and interpret the distinctive factors of diabetes mellitus patients with and without ischemic (non-embolic type) stroke in a small population. The database consists of a total of 16 features collected from 44 diabetic patients. Features include age, gender, duration of diabetes, cholesterol, high density lipoprotein, triglyceride levels, neuropathy, nephropathy, retinopathy, peripheral vascular disease, myocardial infarction rate, glucose level, medication and blood pressure. Metric and non-metric features are distinguished. First, the mean and covariance of the data are estimated and the correlated components are observed. Second, major components are extracted by principal component analysis. Finally, as common examples of local and global classification approaches, a k-nearest neighbor classifier and a high-degree polynomial classifier such as a multilayer perceptron are employed for classification, using either all components or only the major components. Macrovascular changes emerged as the principal distinctive factors of ischemic stroke in diabetes mellitus. Microvascular changes were generally ineffective discriminators. Recommendations were made according to the rules of evidence-based medicine. Briefly, this case study, based on a small population, supports theories of stroke in diabetes mellitus patients and also concludes that the use of intelligent data analysis improves personalized preventive intervention.
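As an illustration of the local classification approach mentioned above, here is a minimal k-nearest neighbor sketch. It assumes Euclidean distance and majority voting, neither of which the abstract specifies. The feature vectors and class labels are invented toy values, not patient data.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy two-dimensional points standing in for extracted major components.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["no_stroke"] * 3 + ["stroke"] * 3

print(knn_predict(train_x, train_y, (0.15, 0.1)))
print(knn_predict(train_x, train_y, (1.0, 0.95)))
```

With only 44 patients, a small odd `k` is a natural choice for a two-class problem, since it avoids tied votes.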

Subject(s)

Brain Infarction/epidemiology , Diabetes Mellitus/epidemiology , Models, Statistical , Brain Ischemia/epidemiology , Comorbidity , Factor Analysis, Statistical , Humans , Risk Factors
