Impact of missing data imputation methods on gene expression clustering and classification.

de Souto, Marcilio C P; Jaskowiak, Pablo A; Costa, Ivan G.

BMC Bioinformatics ; 16: 64, 2015 Feb 26.

Article in English | MEDLINE | ID: mdl-25888091

ABSTRACT

BACKGROUND: Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. RESULTS AND CONCLUSIONS: We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods, such as replacing the missing values with the mean or median value, performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/ .

Subject(s)

Algorithms , Cluster Analysis , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Data Interpretation, Statistical , Humans
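
The simple strategies mentioned in the abstract above (replacing missing values with the mean or the median) can be illustrated with a minimal sketch. It is not the authors' code: the function name `impute_rows`, the use of `None` to mark a missing value, the choice to impute per gene (row) rather than per sample, and the toy matrix are all assumptions made for illustration.

```python
from statistics import mean, median


def impute_rows(matrix, strategy="mean"):
    """Fill None entries in each row (gene) with that row's mean or median.

    Hypothetical helper: imputing per row is an assumption, not something
    the abstract specifies.
    """
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    fill = mean if strategy == "mean" else median

    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            raise ValueError("row has no observed values to impute from")
        value = fill(observed)
        # Keep observed values, substitute the summary statistic elsewhere.
        imputed.append([value if v is None else v for v in row])
    return imputed


# Toy expression matrix: rows are genes, columns are samples (made-up values).
expression = [
    [1.0, None, 3.0, 8.0],
    [2.0, 2.0, None, 8.0],
]

mean_filled = impute_rows(expression, "mean")
median_filled = impute_rows(expression, "median")
```

On skewed rows such as these, the two strategies give different fill values, which is one reason to compare them on downstream clustering or classification quality rather than on imputation error alone.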

On the selection of appropriate distances for gene expression data clustering.

Jaskowiak, Pablo A; Campello, Ricardo J G B; Costa, Ivan G.

BMC Bioinformatics ; 15 Suppl 2: S2, 2014.

Article in English | MEDLINE | ID: mdl-24564555

ABSTRACT

BACKGROUND: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure its results can help researchers to gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself as a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance is going to be employed between data objects is probably one of the most difficult decisions. RESULTS AND CONCLUSIONS: We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression, i.e., microarray data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated in a number of different scenarios including clustering of cancer tissues and genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.

Subject(s)

Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Cluster Analysis , Humans , Neoplasms/genetics
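
To make concrete why the choice of distance can change a clustering outcome, the sketch below compares Euclidean distance with a correlation-based distance on three made-up expression profiles. It is only an illustration under stated assumptions: the form `1 - r` is one common way to turn Pearson correlation into a distance and is not necessarily the definition used in the paper, and the function names and profile values are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance; sensitive to differences in magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation; compares profile shape, not magnitude."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1 - cov / (norm_x * norm_y)


# Made-up profiles: gene_b has the same trend as gene_a at ten times the
# scale; gene_c has the opposite trend at the same scale.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
gene_c = [4.0, 3.0, 2.0, 1.0]

# Which profile is "nearest" to gene_a depends on the distance chosen.
nearest_by_euclidean = min(
    ("gene_b", "gene_c"),
    key=lambda name: euclidean_distance(gene_a, {"gene_b": gene_b, "gene_c": gene_c}[name]),
)
nearest_by_pearson = min(
    ("gene_b", "gene_c"),
    key=lambda name: pearson_distance(gene_a, {"gene_b": gene_b, "gene_c": gene_c}[name]),
)
```

Here Euclidean distance places `gene_a` next to the anti-correlated `gene_c`, while the correlation-based distance places it next to the co-varying `gene_b`, so the same clustering method would group these profiles differently depending on the distance.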

Proximity measures for clustering gene expression microarray data: a validation methodology and a comparative analysis.

Jaskowiak, Pablo A; Campello, Ricardo J G B; Costa, Ivan G.

IEEE/ACM Trans Comput Biol Bioinform ; 10(4): 845-57, 2013.

Article in English | MEDLINE | ID: mdl-24334380

ABSTRACT

Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, to date, there are no comprehensive guidelines concerning how to choose proximity measures for clustering microarray data. Pearson is the most widely used proximity measure, whereas the characteristics of other ones remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures in 52 data sets from time course and cancer experiments. Our results support that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out for time course and cancer data evaluations, their choice should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature in a benchmark along with a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.

Subject(s)

Cluster Analysis , Computational Biology/methods , Databases, Genetic , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Humans , Neoplasms/genetics , Neoplasms/metabolism , Reproducibility of Results , Statistics, Nonparametric
