Search | VHL Regional Portal

On the evaluation of outlier detection and one-class classification: a comparative study of algorithms, model selection, and ensembles.

Marques, Henrique O; Swersky, Lorne; Sander, Jörg; Campello, Ricardo J G B; Zimek, Arthur.

Data Min Knowl Discov ; 37(4): 1473-1517, 2023.

Article in English | MEDLINE | ID: mdl-37424877

ABSTRACT

It has been shown that unsupervised outlier detection methods can be adapted to the one-class classification problem (Janssens and Postma, in: Proceedings of the 18th annual Belgian-Dutch on machine learning, pp 56-64, 2009; Janssens et al. in: Proceedings of the 2009 ICMLA international conference on machine learning and applications, IEEE Computer Society, pp 147-153, 2009. 10.1109/ICMLA.2009.16). In this paper, we focus on the comparison of one-class classification algorithms with such adapted unsupervised outlier detection methods, improving on previous comparison studies in several important aspects. We study a number of one-class classification and unsupervised outlier detection methods in a rigorous experimental setup, comparing them on a large number of datasets with different characteristics, using different performance measures. In contrast to previous comparison studies, where the models (algorithms, parameters) are selected by using examples from both classes (outlier and inlier), here we also study and compare different approaches for model selection in the absence of examples from the outlier class, which is more realistic for practical applications since labeled outliers are rarely available. Our results showed that, overall, SVDD and GMM are top-performers, regardless of whether the ground truth is used for parameter selection or not. However, in specific application scenarios, other methods exhibited better performance. Combining one-class classifiers into ensembles showed better performance than individual methods in terms of accuracy, as long as the ensemble members are properly selected. Supplementary Information: The online version contains supplementary material available at 10.1007/s10618-023-00931-x.

A unified view of density-based methods for semi-supervised clustering and classification.

Castro Gertrudes, Jadson; Zimek, Arthur; Sander, Jörg; Campello, Ricardo J G B.

Data Min Knowl Discov ; 33(6): 1894-1952, 2019.

Article in English | MEDLINE | ID: mdl-32831623

ABSTRACT

Semi-supervised learning is drawing increasing attention in the era of big data, as the gap between the abundance of cheap, automatically collected unlabeled data and the scarcity of labeled data that are laborious and expensive to obtain is dramatically increasing. In this paper, we first introduce a unified view of density-based clustering algorithms. We then build upon this view and bridge the areas of semi-supervised clustering and classification under a common umbrella of density-based techniques. We show that there are close relations between density-based clustering algorithms and the graph-based approach for transductive classification. These relations are then used as a basis for a new framework for semi-supervised classification based on building-blocks from density-based clustering. This framework is not only efficient and effective, but it is also statistically sound. In addition, we generalize the core algorithm in our framework, HDBSCAN*, so that it can also perform semi-supervised clustering by directly taking advantage of any fraction of labeled data that may be available. Experimental results on a large collection of datasets show the advantages of the proposed approach both for semi-supervised classification as well as for semi-supervised clustering.

Clustering of RNA-Seq samples: Comparison study on cancer data.

Jaskowiak, Pablo Andretta; Costa, Ivan G; Campello, Ricardo J G B.

Methods ; 132: 42-49, 2018 01 01.

Article in English | MEDLINE | ID: mdl-28778489

ABSTRACT

RNA-Seq is becoming the standard technology for large-scale gene expression level measurements, as it offers a number of advantages over microarrays. Standards for RNA-Seq data analysis are, however, in its infancy when compared to those of microarrays. Clustering, which is essential for understanding gene expression data, has been widely investigated w.r.t. microarrays. In what concerns the clustering of RNA-Seq data, however, a number of questions remain open, resulting in a lack of guidelines to practitioners. Here we evaluate computational steps relevant for clustering cancer samples via an empirical analysis of 15mRNA-seq datasets. Our evaluation considers strategies regarding expression estimates, number of genes after non-specific filtering and data transformations. We evaluate the performance of four clustering algorithms and twelve distance measures, which are commonly used for gene expression analysis. Results support that clustering cancer samples based on a gene quantification should be preferred. The use of non-specific filtering leading to a small number of features (1,000) presents, in general, superior results. Data should be log-transformed previously to cluster analysis. Regarding the choice of clustering algorithms, Average-Linkage and k-medoids provide, in general, superior recoveries. Although specific cases can benefit from a careful selection of a distance measure, Symmetric Rank-Magnitude correlation provides consistent and sound results in different scenarios.

Subject(s)

Proteogenomics/methods , RNA, Messenger/genetics , Cluster Analysis , Gene Expression Profiling , Humans , Neoplasms/genetics , Neoplasms/metabolism , RNA, Messenger/metabolism , Sequence Analysis, RNA , Transcriptome

A systematic comparative evaluation of biclustering techniques.

Padilha, Victor A; Campello, Ricardo J G B.

BMC Bioinformatics ; 18(1): 55, 2017 Jan 23.

Article in English | MEDLINE | ID: mdl-28114903

ABSTRACT

BACKGROUND: Biclustering techniques are capable of simultaneously clustering rows and columns of a data matrix. These techniques became very popular for the analysis of gene expression data, since a gene can take part of multiple biological pathways which in turn can be active only under specific experimental conditions. Several biclustering algorithms have been developed in the past recent years. In order to provide guidance regarding their choice, a few comparative studies were conducted and reported in the literature. In these studies, however, the performances of the methods were evaluated through external measures that have more recently been shown to have undesirable properties. Furthermore, they considered a limited number of algorithms and datasets. RESULTS: We conducted a broader comparative study involving seventeen algorithms, which were run on three synthetic data collections and two real data collections with a more representative number of datasets. For the experiments with synthetic data, five different experimental scenarios were studied: different levels of noise, different numbers of implanted biclusters, different levels of symmetric bicluster overlap, different levels of asymmetric bicluster overlap and different bicluster sizes, for which the results were assessed with more suitable external measures. For the experiments with real datasets, the results were assessed by gene set enrichment and clustering accuracy. CONCLUSIONS: We observed that each algorithm achieved satisfactory results in part of the biclustering tasks in which they were investigated. The choice of the best algorithm for some application thus depends on the task at hand and the types of patterns that one wants to detect.

Subject(s)

Algorithms , Computational Biology/methods , Software , Cluster Analysis , Computer Simulation , Databases, Genetic , Gene Expression Regulation , Humans , Neoplasms/genetics

On the selection of appropriate distances for gene expression data clustering.

Jaskowiak, Pablo A; Campello, Ricardo J G B; Costa, Ivan G.

BMC Bioinformatics ; 15 Suppl 2: S2, 2014.

Article in English | MEDLINE | ID: mdl-24564555

ABSTRACT

BACKGROUND: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure its results can help researchers to gain insights and formulate new hypothesis about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself as a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance is going to be employed between data objects is probably one of the most difficult decisions. RESULTS AND CONCLUSIONS: We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression, i.e., microarray data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated on a number of different scenarios including clustering of cancer tissues and genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario in hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.

Subject(s)

Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Cluster Analysis , Humans , Neoplasms/genetics

Similarity Measures for Comparing Biclusterings.

Horta, Danilo; Campello, Ricardo J G B.

IEEE/ACM Trans Comput Biol Bioinform ; 11(5): 942-54, 2014.

Article in English | MEDLINE | ID: mdl-26356865

ABSTRACT

The comparison of ordinary partitions of a set of objects is well established in the clustering literature, which comprehends several studies on the analysis of the properties of similarity measures for comparing partitions. However, similarity measures for clusterings are not readily applicable to biclusterings, since each bicluster is a tuple of two sets (of rows and columns), whereas a cluster is only a single set (of rows). Some biclustering similarity measures have been defined as minor contributions in papers which primarily report on proposals and evaluation of biclustering algorithms or comparative analyses of biclustering algorithms. The consequence is that some desirable properties of such measures have been overlooked in the literature. We review 14 biclustering similarity measures. We define eight desirable properties of a biclustering measure, discuss their importance, and prove which properties each of the reviewed measures has. We show examples drawn and inspired from important studies in which several biclustering measures convey misleading evaluations due to the absence of one or more of the discussed properties. We also advocate the use of a more general comparison approach that is based on the idea of transforming the original problem of comparing biclusterings into an equivalent problem of comparing clustering partitions with overlapping clusters.

Subject(s)

Cluster Analysis , Computational Biology/methods , Gene Expression Profiling/methods , Algorithms , Models, Genetic , Reproducibility of Results

Proximity measures for clustering gene expression microarray data: a validation methodology and a comparative analysis.

Jaskowiak, Pablo A; Campello, Ricardo J G B; Costa, Ivan G.

IEEE/ACM Trans Comput Biol Bioinform ; 10(4): 845-57, 2013.

Article in English | MEDLINE | ID: mdl-24334380

ABSTRACT

Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, up to date, there are no comprehensive guidelines concerning how to choose proximity measures for clustering microarray data. Pearson is the most used proximity measure, whereas characteristics of other ones remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures in 52 data sets from time course and cancer experiments. Our results support that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and euclidean distance. Given that different measures stood out for time course and cancer data evaluations, their choice should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature in a benchmark along with a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.

Subject(s)

Cluster Analysis , Computational Biology/methods , Databases, Genetic , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Humans , Neoplasms/genetics , Neoplasms/metabolism , Reproducibility of Results , Statistics, Nonparametric

Takagi-Sugeno fuzzy models in the framework of orthonormal basis functions.

Machado, Jeremias B; Campello, Ricardo J G B; Amaral, Wagner Caradori.

IEEE Trans Cybern ; 43(3): 858-70, 2013 Jun.

Article in English | MEDLINE | ID: mdl-23096073

ABSTRACT

An approach to obtain Takagi-Sugeno (TS) fuzzy models of nonlinear dynamic systems using the framework of orthonormal basis functions (OBFs) is presented in this paper. This approach is based on an architecture in which local linear models with ladder-structured generalized OBFs (GOBFs) constitute the fuzzy rule consequents and the outputs of the corresponding GOBF filters are input variables for the rule antecedents. The resulting GOBF-TS model is characterized by having only real-valued parameters that do not depend on any user specification about particular types of functions to be used in the orthonormal basis. The fuzzy rules of the model are initially obtained by means of a well-known technique based on fuzzy clustering and least squares. Those rules are then simplified, and the model parameters (GOBF poles, GOBF expansion coefficients, and fuzzy membership functions) are subsequently adjusted by using a nonlinear optimization algorithm. The exact gradients of an error functional with respect to the parameters to be optimized are computed analytically. Those gradients provide exact search directions for the optimization process, which relies solely on input-output data measured from the system to be modeled. An example is presented to illustrate the performance of this approach in the modeling of a complex nonlinear dynamic system.

Subject(s)

Algorithms , Fuzzy Logic , Models, Statistical , Nonlinear Dynamics , Computer Simulation

A simpler and more accurate AUTO-HDS framework for clustering and visualization of biological data.

Campello, Ricardo J G B; Moulavi, Davoud; Sander, Jörg.

IEEE/ACM Trans Comput Biol Bioinform ; 9(6): 1850-2, 2012.

Article in English | MEDLINE | ID: mdl-23221094

ABSTRACT

In [1], the authors proposed a framework for automated clustering and visualization of biological data sets named AUTO-HDS. This letter is intended to complement that framework by showing that it is possible to get rid of a user-defined parameter in a way that the clustering stage can be implemented more accurately while having reduced computational complexity.

Subject(s)

Cluster Analysis , Computational Biology/methods , Data Mining/methods , Software , Algorithms , Databases, Factual

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL