Search | VHL Regional Portal

A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data.

Shirkhorshidi, Ali Seyed; Aghabozorgi, Saeed; Wah, Teh Ying.

PLoS One ; 10(12): e0144059, 2015.

Article in English | MEDLINE | ID: mdl-26658987

ABSTRACT

Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures is mostly addressed in two- or three-dimensional spaces, beyond which, to the best of our knowledge, there is no empirical study that has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility purposes, fifteen publicly available datasets were used for this study, and consequently, future distance measures can be evaluated and compared with the results of the measures discussed in this work. These datasets were classified into low- and high-dimensional categories to study the performance of each measure against each category. This research should help the research community to identify suitable distance measures for datasets and also to facilitate a comparison and evaluation of newly proposed similarity or distance measures with traditional ones.

Subject(s)

Algorithms , Data Mining/statistics & numerical data , Datasets as Topic , Analysis of Variance , Benchmarking , Cluster Analysis
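The abstract above does not name the specific measures it compares, so the following is only a minimal sketch of a few commonly used distance measures for continuous data. The choice of Euclidean, Manhattan, Chebyshev, and cosine distance, and the sample vectors `p` and `q`, are assumptions made for illustration and are not taken from the study.

```python
import math

def euclidean(a, b):
    # Straight-line (L2) distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block (L1) distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Maximum absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # One minus the cosine of the angle between the vectors.
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical sample points, chosen only to make the arithmetic easy to follow.
p = [1.0, 2.0, 3.0]
q = [4.0, 6.0, 3.0]

for name, fn in [("euclidean", euclidean), ("manhattan", manhattan),
                 ("chebyshev", chebyshev), ("cosine", cosine_distance)]:
    print(name, fn(p, q))
```

Any of these functions could be passed to a distance-based clustering routine in place of another, which is the kind of substitution a comparison of measures relies on.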

Diagnosing tuberculosis with a novel support vector machine-based artificial immune recognition system.

Saybani, Mahmoud Reza; Shamshirband, Shahaboddin; Golzari Hormozi, Shahram; Wah, Teh Ying; Aghabozorgi, Saeed; Pourhoseingholi, Mohamad Amin; Olariu, Teodora.

Iran Red Crescent Med J ; 17(4): e24557, 2015 Apr.

Article in English | MEDLINE | ID: mdl-26023340

ABSTRACT

BACKGROUND: Tuberculosis (TB) is a major global health problem, which has been ranked as the second leading cause of death from an infectious disease worldwide. Diagnosis based on cultured specimens is the reference standard; however, results take weeks to process. Scientists are looking for early detection strategies, which remain the cornerstone of tuberculosis control. Consequently, there is a need to develop an expert system that helps medical professionals to accurately and quickly diagnose the disease. Artificial Immune Recognition System (AIRS) has been used successfully for diagnosing various diseases. However, little effort has been undertaken to improve its classification accuracy. OBJECTIVES: In order to increase the classification accuracy of AIRS, this study introduces a new hybrid system that incorporates a support vector machine into AIRS for diagnosing tuberculosis. PATIENTS AND METHODS: Patient epicrisis reports obtained from the Pasteur laboratory of Iran were used as the benchmark data set, with a sample size of 175 (114 positive samples for TB and 60 samples in the negative group). The strategy of this study was to ensure representativeness; thus, it was important to have an adequate number of instances for both TB and non-TB cases. The classification performance was measured through 10-fold cross-validation, Root Mean Squared Error (RMSE), sensitivity and specificity, Youden's Index, and Area Under the Curve (AUC). Statistical analysis was done using the Waikato Environment for Knowledge Analysis (WEKA), a machine learning program for Windows. RESULTS: With an accuracy of 100%, sensitivity of 100%, specificity of 100%, Youden's Index of 1, Area Under the Curve of 1, and RMSE of 0, the proposed method was able to successfully classify tuberculosis patients. CONCLUSIONS: There have been many studies that aimed at diagnosing tuberculosis faster and more accurately. Our results described a model for diagnosing tuberculosis with 100% sensitivity and 100% specificity. This model can be used as an additional tool for experts in medicine to diagnose TB more accurately and quickly.
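For readers unfamiliar with the evaluation measures named in the abstract above, the sketch below shows how sensitivity, specificity, accuracy, Youden's Index, and RMSE are conventionally computed. The function names and all counts are hypothetical and are not the study's data; the study itself reports using WEKA, not this code.

```python
import math

def diagnostic_metrics(tp, fn, tn, fp):
    # tp/fn: diseased cases classified correctly/incorrectly.
    # tn/fp: non-diseased cases classified correctly/incorrectly.
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Youden's Index J = sensitivity + specificity - 1.
    youden = sensitivity + specificity - 1
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "youden": youden}

def rmse(y_true, y_pred):
    # Root mean squared error between true labels and predictions.
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Hypothetical confusion-matrix counts for an imperfect classifier.
imperfect = diagnostic_metrics(tp=90, fn=10, tn=40, fp=10)

# Hypothetical counts for a classifier that makes no errors.
perfect = diagnostic_metrics(tp=5, fn=0, tn=5, fp=0)

print(imperfect)
print(perfect)
print(rmse([1, 0, 1, 1], [1, 0, 1, 1]))
```

A classifier with no false positives and no false negatives yields sensitivity and specificity of 1, a Youden's Index of 1, and an RMSE of 0, which is the pattern of values the abstract reports.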

Improving RLRN image splicing detection with the Use of PCA and kernel PCA.

Moghaddasi, Zahra; Jalab, Hamid A; Md Noor, Rafidah; Aghabozorgi, Saeed.

ScientificWorldJournal ; 2014: 606570, 2014.

Article in English | MEDLINE | ID: mdl-25295304

ABSTRACT

Digital image forgery is becoming easier to perform because of the rapid development of various manipulation tools. Image splicing is one of the most prevalent techniques. Digital images have lost their trustworthiness, and researchers have exerted considerable effort to regain it by focusing mostly on algorithms. However, most of the proposed algorithms are incapable of handling high dimensionality and redundancy in the extracted features. Moreover, existing algorithms are limited by high computational time. This study focuses on improving one of the image splicing detection algorithms, that is, the run length run number algorithm (RLRN), by applying two dimension reduction methods, namely, principal component analysis (PCA) and kernel PCA. Support vector machine is used to distinguish between authentic and spliced images. Results show that kernel PCA is a nonlinear dimension reduction method that has the best effect on R, G, B, and Y channels and gray-scale images.

Subject(s)

Pattern Recognition, Automated/methods , Photography/methods , Principal Component Analysis/methods , Signal Processing, Computer-Assisted , Humans , Image Interpretation, Computer-Assisted , Pattern Recognition, Automated/trends , Photography/trends

A review of subsequence time series clustering.

Zolhavarieh, Seyedjamal; Aghabozorgi, Saeed; Teh, Ying Wah.

ScientificWorldJournal ; 2014: 312521, 2014.

Article in English | MEDLINE | ID: mdl-25140332

ABSTRACT

Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in different fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. One of the useful fields in the domain of subsequence time series clustering is pattern recognition. To improve this field, a sequence of time series data is used. This paper reviews some definitions and backgrounds related to subsequence time series clustering. The literature is categorized into three groups: the preproof, interproof, and postproof periods. Moreover, various state-of-the-art approaches in performing subsequence time series clustering are discussed under each of these categories. The strengths and weaknesses of the employed methods are evaluated as potential issues for future studies.

Subject(s)

Algorithms , Cluster Analysis , Data Mining , Models, Theoretical
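The abstract above does not say how subsequences are obtained. A sliding window is one common way to do it, and the sketch below assumes that approach purely for illustration; the function name `sliding_subsequences` and its `window` and `step` parameters are hypothetical.

```python
def sliding_subsequences(series, window, step=1):
    # Return every contiguous subsequence of length `window`,
    # moving the start position forward by `step` each time.
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(series) - window
    return [series[i:i + window] for i in range(0, last_start + 1, step)]

# Hypothetical series, used only to show the shape of the output.
series = [1, 2, 3, 4, 5]
subsequences = sliding_subsequences(series, window=3)
print(subsequences)
```

The resulting subsequences are what a clustering algorithm would then group; a series shorter than the window produces an empty list.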

A hybrid algorithm for clustering of time series data based on affinity search technique.

Aghabozorgi, Saeed; Ying Wah, Teh; Herawan, Tutut; Jalab, Hamid A; Shaygan, Mohammad Amin; Jalali, Alireza.

ScientificWorldJournal ; 2014: 562194, 2014.

Article in English | MEDLINE | ID: mdl-24982966

ABSTRACT

Time series clustering is an important solution to various problems in numerous fields of research, including business, medical science, and finance. However, conventional clustering algorithms are not practical for time series data because they are essentially designed for static data. This impracticality results in poor clustering accuracy in several systems. In this paper, a new hybrid clustering algorithm is proposed based on the similarity in shape of time series data. Time series data are first grouped as subclusters based on similarity in time. The subclusters are then merged using the k-Medoids algorithm based on similarity in shape. This model has two contributions: (1) it is more accurate than other conventional and hybrid approaches and (2) it determines the similarity in shape among time series data with a low complexity. To evaluate the accuracy of the proposed model, the model is tested extensively using synthetic and real-world time series datasets.

Subject(s)

Algorithms , Cluster Analysis , Pattern Recognition, Automated
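The abstract above mentions k-Medoids as the merging step. The sketch below is a plain k-medoids loop (alternating assignment and medoid update), not the authors' hybrid algorithm: it skips the subclustering stage, and it uses Euclidean distance as a stand-in where the paper uses a shape-based similarity. The initial medoids (the first `k` points) and the sample series are arbitrary, hypothetical choices.

```python
import math

def euclidean(a, b):
    # Stand-in distance; the paper's shape-based measure is not reproduced here.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_medoids(points, k, dist, max_iter=100):
    # Medoids are stored as indices into `points`.
    medoids = list(range(k))  # assumption: start from the first k points
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: within each cluster, pick the member with the
        # smallest total distance to all other members.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)  # keep the medoid of an empty cluster
                continue
            best = min(members,
                       key=lambda c: sum(dist(points[c], points[j]) for j in members))
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break  # medoids stopped changing
        medoids = new_medoids
    # Label each point with the position of its nearest medoid.
    labels = [min(range(len(medoids)), key=lambda c: dist(p, points[medoids[c]]))
              for p in points]
    return medoids, labels

# Hypothetical short time series: two rising, two falling.
series = [[1, 2, 3], [1, 2, 4], [9, 8, 7], [10, 8, 7]]
medoids, labels = k_medoids(series, k=2, dist=euclidean)
print(medoids, labels)
```

Because the distance function is passed in as `dist`, a shape-based measure could be substituted without changing the loop itself.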
