Search | BVS Regional Portal

Pooled variable scaling for cluster analysis.

Raymaekers, Jakob; Zamar, Ruben H.

Bioinformatics ; 36(12): 3849-3855, 2020 06 01.

Article in English | MEDLINE | ID: mdl-32282889

ABSTRACT

MOTIVATION: Many popular clustering methods are not scale-invariant because they are based on Euclidean distances. Even methods using scale-invariant distances, such as the Mahalanobis distance, lose their scale invariance when combined with regularization and/or variable selection. Therefore, the results from these methods are very sensitive to the measurement units of the clustering variables. A simple way to achieve scale invariance is to scale the variables before clustering. However, scaling variables is a very delicate issue in cluster analysis: a bad choice of scaling can adversely affect the clustering results. On the other hand, reporting clustering results that depend on measurement units is not satisfactory. Hence, a safe and efficient scaling procedure is needed for applications in bioinformatics and medical sciences research.

RESULTS: We propose a new approach for scaling prior to cluster analysis based on the concept of pooled variance. Unlike available scaling procedures, such as the SD and the range, our proposed scale avoids dampening the beneficial effect of informative clustering variables. We confirm through an extensive simulation study and applications to well-known real-data examples that the proposed scaling method is safe and generally useful. Finally, we use our approach to cluster a high-dimensional genomic dataset consisting of gene expression data for several specimens of breast cancer tissue obtained from human patients.

AVAILABILITY AND IMPLEMENTATION: An R implementation of the algorithms presented is available at https://wis.kuleuven.be/statdatascience/robust/software.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
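
To make the idea of scaling by a pooled variance concrete, the following is a minimal Python sketch and not the authors' R implementation. The abstract does not say how groups are found within each variable, so the sketch assumes a fixed univariate 2-means split; the function names `two_means_1d`, `pooled_sd` and `pooled_scale` are hypothetical.

```python
import numpy as np

def two_means_1d(x, n_iter=100):
    """Lloyd's algorithm with k=2 on a single variable; returns 0/1 labels.
    Assumption of this sketch: exactly two groups per variable."""
    x = np.asarray(x, dtype=float)
    centers = np.array([x.min(), x.max()])
    labels = np.zeros(x.shape[0], dtype=int)
    for _ in range(n_iter):
        # Assign each point to the nearer of the two centers.
        labels = (np.abs(x - centers[1]) < np.abs(x - centers[0])).astype(int)
        new_centers = centers.copy()
        for k in (0, 1):
            if np.any(labels == k):
                new_centers[k] = x[labels == k].mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def pooled_sd(x, labels):
    """Pooled within-group standard deviation of one variable."""
    x = np.asarray(x, dtype=float)
    groups = np.unique(labels)
    ss = sum(((x[labels == g] - x[labels == g].mean()) ** 2).sum() for g in groups)
    return np.sqrt(ss / (x.shape[0] - groups.size))

def pooled_scale(X):
    """Divide each column of X by its pooled within-group SD."""
    X = np.asarray(X, dtype=float)
    scales = np.array([pooled_sd(X[:, j], two_means_1d(X[:, j]))
                       for j in range(X.shape[1])])
    scales[scales == 0] = 1.0  # leave constant columns unchanged
    return X / scales, scales

# Column 0 has two well-separated groups; column 1 is the same in other units.
X = np.array([[0.0, 0.0], [1.0, 1000.0], [2.0, 2000.0],
              [10.0, 10000.0], [11.0, 11000.0], [12.0, 12000.0]])
Z, scales = pooled_scale(X)
print(scales)              # pooled within-group scales per column
print(X.std(axis=0, ddof=1))  # ordinary SDs, inflated by the group separation
```

In this toy example the pooled scale of the first column is the within-group spread, whereas the ordinary SD also absorbs the separation between the groups, so dividing by the SD would shrink exactly the variable that carries the cluster structure. Forcing two groups also splits a variable that has no cluster structure at all, so a practical procedure would need some way to choose the number of groups per variable.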

Subjects

Algorithms , Genomics , Cluster Analysis , Humans , Software

Comparing Reverse Complementary Genomic Words Based on Their Distance Distributions and Frequencies.

Tavares, Ana Helena; Raymaekers, Jakob; Rousseeuw, Peter J; Silva, Raquel M; Bastos, Carlos A C; Pinho, Armando; Brito, Paula; Afreixo, Vera.

Interdiscip Sci ; 10(1): 1-11, 2018 Mar.

Article in English | MEDLINE | ID: mdl-29214497

ABSTRACT

In this work, we study reverse complementary genomic word pairs in human DNA by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance distributions are considered, and it is found that the peak dissimilarity works best in this setting. We report the existence of reverse complementary word pairs with very dissimilar distance distributions, as well as word pairs with very similar distance distributions even when both distributions are irregular and contain strong peaks. The association between distribution dissimilarity and frequency discrepancy is also explored, and it is speculated that symmetric pairs combining low and high values of each measure may uncover features of interest. Taken together, our results suggest that some asymmetries in the human genome go far beyond Chargaff's rules. This study uses both the complete human genome and its repeat-masked version.
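
The comparison described above can be illustrated with a small Python sketch. It rests on two assumptions not stated in the abstract: the distance distribution is taken to be the relative frequency of gaps between consecutive occurrences of a word, and, because the peak dissimilarity is not defined here, the total variation distance is used as a plainly different stand-in. All function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Complement each base, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]

def occurrences(seq, word):
    """Start positions of all occurrences of word, overlaps included."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions

def distance_distribution(seq, word):
    """Relative frequencies of gaps between consecutive occurrences
    (assumed definition of the distance distribution)."""
    pos = occurrences(seq, word)
    gaps = Counter(b - a for a, b in zip(pos, pos[1:]))
    total = sum(gaps.values())
    return {d: c / total for d, c in gaps.items()} if total else {}

def total_variation(p, q):
    """Total variation distance; a stand-in, not the peak dissimilarity."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

seq = "ACGACGTTCGT"  # toy sequence, not real genomic data
word = "ACG"
rc = reverse_complement(word)
freq_word = len(occurrences(seq, word))
freq_rc = len(occurrences(seq, rc))
dissimilarity = total_variation(distance_distribution(seq, word),
                                distance_distribution(seq, rc))
print(word, rc, freq_word, freq_rc, dissimilarity)
```

In this toy sequence the word and its reverse complement occur equally often, yet their gap distributions do not overlap, which is the kind of pair the abstract singles out: similar frequency but dissimilar distance distribution.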

Subjects

Complementary DNA/genetics , Genomics , Human Genome , Humans , Molecular Sequence Annotation
