Search | VHL Regional Portal

Multivariate binary classification of imbalanced datasets - A case study based on high-dimensional multiplex autoimmune assay data.

Schlieker, Laura; Telaar, Anna; Lueking, Angelika; Schulz-Knappe, Peter; Theek, Carmen; Ickstadt, Katja.

Biom J ; 59(5): 948-966, 2017 Sep.

Article in English | MEDLINE | ID: mdl-28626952

ABSTRACT

The classification of a population by a specific trait is a major task in medicine, for example when groups of patients with specific diseases are identified in a diagnostic setting, but also when, in predictive medicine, a group of patients is classified into specific disease severity classes that might profit from different treatments. When the sizes of those subgroups become small, for example in rare diseases, imbalances between the classes are more the rule than the exception and make statistical classification problematic when the error rate of the minority class is high. Many observations are classified as belonging to the majority class, while the error rate of the majority class is low. This case study aims to investigate class imbalance for Random Forests and Powered Partial Least Squares Discriminant Analysis (PPLS-DA) and to evaluate the performance of these classifiers when they are combined with methods to compensate for imbalance (sampling methods, cost-sensitive learning approaches). We evaluate all approaches with a scoring system taking the classification results into consideration. This case study is based on one high-dimensional multiplex autoimmune assay dataset describing immune response to antigens and consisting of two classes of patients: Rheumatoid Arthritis (RA) and Systemic Lupus Erythematosus (SLE). Datasets with varying degrees of imbalance are created by successively reducing the class of RA patients. Our results indicate a possible benefit of cost-sensitive learning approaches for Random Forests. Although further research is needed to verify our findings by investigating other datasets or large-scale simulation studies, we claim that this work has the potential to increase practitioners' awareness of the problem of class imbalance, and it stresses the importance of considering methods to compensate for class imbalance.
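As an illustration of the imbalance problem described in this abstract, the following minimal sketch shows how the overall error rate can look good while the minority-class error rate stays high. It is not the authors' code or scoring system: the toy labels, the trivial "always predict the majority class" classifier, and the helper names `per_class_error`, `overall_error`, and `reduce_class` are all assumptions made for illustration.

```python
from collections import Counter


def per_class_error(y_true, y_pred):
    """Return {label: fraction of that class's samples that were misclassified}."""
    totals = Counter(y_true)
    wrong = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {label: wrong[label] / n for label, n in totals.items()}


def overall_error(y_true, y_pred):
    """Return the fraction of all samples that were misclassified."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def reduce_class(labels, label, keep):
    """Return indices of all samples, keeping only the first `keep` of class `label`."""
    kept = 0
    indices = []
    for i, y in enumerate(labels):
        if y != label:
            indices.append(i)
        elif kept < keep:
            indices.append(i)
            kept += 1
    return indices


# Hypothetical toy labels (not the study data): 50 samples per class.
labels = ["RA"] * 50 + ["SLE"] * 50

# Successively shrink the RA class, mirroring the design described above.
for keep in (50, 25, 10, 5):
    idx = reduce_class(labels, "RA", keep)
    y_true = [labels[i] for i in idx]
    # Trivial classifier (an assumption): always predict the majority class.
    y_pred = ["SLE"] * len(y_true)
    print(keep, round(overall_error(y_true, y_pred), 3), per_class_error(y_true, y_pred))
```

As the RA class shrinks, the overall error of this trivial classifier falls while every RA sample is still misclassified, which is why per-class error rates are reported separately in this sketch.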

Subject(s)

Biometry/methods , Algorithms , Arthritis, Rheumatoid/diagnosis , Biological Assay/standards , Computer Simulation , Discriminant Analysis , Humans , Lupus Erythematosus, Systemic/diagnosis

An extension of PPLS-DA for classification and comparison to ordinary PLS-DA.

Telaar, Anna; Liland, Kristian Hovde; Repsilber, Dirk; Nürnberg, Gerd.

PLoS One ; 8(2): e55267, 2013.

Article in English | MEDLINE | ID: mdl-23408965

ABSTRACT

Classification studies are widely applied, e.g. in biomedical research to classify objects/patients into predefined groups. The goal is to find a classification function/rule which assigns each object/patient to a unique group with the greatest possible accuracy (i.e., the lowest classification error). Especially in gene expression experiments, many variables (genes) are often measured for only a few objects/patients. A suitable approach is the well-known method PLS-DA, which searches for a transformation to a lower dimensional space. The resulting new components are linear combinations of the original variables. An advancement of PLS-DA leads to PPLS-DA, which introduces a so-called 'power parameter' that is optimized to maximize the correlation between the components and the group membership. We introduce an extension of PPLS-DA for optimizing this power parameter towards the final aim, namely towards a minimal classification error. We compare this new extension with the original PPLS-DA and also with the ordinary PLS-DA using simulated and experimental datasets. For the investigated data sets with weak linear dependency between features/variables, no improvement is shown for PPLS-DA and for the extensions compared to PLS-DA. A very weak linear dependency, a low proportion of differentially expressed genes for simulated data, does not lead to an improvement of PPLS-DA over PLS-DA, but our extension shows a lower prediction error. On the contrary, for the data set with strong between-feature collinearity and a low proportion of differentially expressed genes and a large total number of genes, the prediction error of PPLS-DA and the extensions is clearly lower than for PLS-DA. Moreover, we compare these prediction results with results of support vector machines with linear kernel and linear discriminant analysis.

Subject(s)

Discriminant Analysis , Least-Squares Analysis , Gene Expression Profiling

Biomarkers of inflammation, immunosuppression and stress with active disease are revealed by metabolomic profiling of tuberculosis patients.

Weiner, January; Parida, Shreemanta K; Maertzdorf, Jeroen; Black, Gillian F; Repsilber, Dirk; Telaar, Anna; Mohney, Robert P; Arndt-Sullivan, Cordelia; Ganoza, Christian A; Faé, Kellen C; Walzl, Gerhard; Kaufmann, Stefan H E.

PLoS One ; 7(7): e40221, 2012.

Article in English | MEDLINE | ID: mdl-22844400

ABSTRACT

Although tuberculosis (TB) causes more deaths than any other pathogen, most infected individuals harbor the pathogen without signs of disease. We explored the metabolome of >400 small molecules in serum of uninfected individuals, latently infected healthy individuals and patients with active TB. We identified changes in amino acid, lipid and nucleotide metabolism pathways, providing evidence for anti-inflammatory metabolomic changes in TB. Metabolic profiles indicate increased activity of indoleamine 2,3 dioxygenase 1 (IDO1), decreased phospholipase activity, increased abundance of adenosine metabolism products, as well as indicators of fibrotic lesions in active disease as compared to latent infection. Consistent with our predictions, we experimentally demonstrate TB-induced IDO1 activity. Furthermore, we demonstrate a link between metabolic profiles and cytokine signaling. Finally, we show that 20 metabolites are sufficient for robust discrimination of TB patients from healthy individuals. Our results provide specific insights into the biology of TB and pave the way for the rational development of metabolic biomarkers for TB.

Subject(s)

Immune Tolerance , Metabolomics , Stress, Physiological , Tuberculosis, Pulmonary/immunology , Tuberculosis, Pulmonary/metabolism , Biomarkers/metabolism , Case-Control Studies , Cluster Analysis , Female , Humans , Indoleamine-Pyrrole 2,3,-Dioxygenase/metabolism , Inflammation/metabolism , Kynurenine/biosynthesis , Male , Tuberculosis, Pulmonary/enzymology , Tuberculosis, Pulmonary/physiopathology

Finding biomarker signatures in pooled sample designs: a simulation framework for methodological comparisons.

Telaar, Anna; Nürnberg, Gerd; Repsilber, Dirk.

Adv Bioinformatics ; : 318573, 2010.

Article in English | MEDLINE | ID: mdl-20671968

ABSTRACT

Detection of discriminating patterns in gene expression data can be accomplished by using various methods of statistical learning. It has been proposed that sample pooling in this context would have negative effects; however, pooling cannot always be avoided. We propose a simulation framework to explicitly investigate the parameters of patterns, experimental design, noise, and choice of method in order to find out which effects on classification performance are to be expected. We use a two-group classification task and simulated gene expression data with independent differentially expressed genes as well as bivariate linear patterns and the combination of both. Our results show a clear increase of prediction error with pool size. For pooled training sets, powered partial least squares discriminant analysis outperforms discriminant analysis, random forests, and support vector machines with linear or radial kernel for two of three simulated scenarios. The proposed simulation approach can be implemented to systematically investigate a number of additional scenarios of practical interest.

Biomarker discovery in heterogeneous tissue samples - taking the in-silico deconfounding approach.

Repsilber, Dirk; Kern, Sabine; Telaar, Anna; Walzl, Gerhard; Black, Gillian F; Selbig, Joachim; Parida, Shreemanta K; Kaufmann, Stefan H E; Jacobsen, Marc.

BMC Bioinformatics ; 11: 27, 2010 Jan 14.

Article in English | MEDLINE | ID: mdl-20070912

ABSTRACT

BACKGROUND: For heterogeneous tissues, such as blood, measurements of gene expression are confounded by relative proportions of cell types involved. Conclusions have to rely on estimation of gene expression signals for homogeneous cell populations, e.g. by applying micro-dissection, fluorescence activated cell sorting, or in-silico deconfounding. We studied feasibility and validity of a non-negative matrix decomposition algorithm using experimental gene expression data for blood and sorted cells from the same donor samples. Our objective was to optimize the algorithm regarding detection of differentially expressed genes and to enable its use for classification in the difficult scenario of reversely regulated genes. This would be of importance for the identification of candidate biomarkers in heterogeneous tissues. RESULTS: Experimental data and simulation studies involving noise parameters estimated from these data revealed that for valid detection of differential gene expression, quantile normalization and use of non-log data are optimal. We demonstrate the feasibility of predicting proportions of constituting cell types from gene expression data of single samples, as a prerequisite for a deconfounding-based classification approach. Classification cross-validation errors with and without using deconfounding results are reported as well as sample-size dependencies. Implementation of the algorithm, simulation and analysis scripts are available. CONCLUSIONS: The deconfounding algorithm without decorrelation using quantile normalization on non-log data is proposed for biomarkers that are difficult to detect, and for cases where confounding by varying proportions of cell types is the suspected reason. In this case, a deconfounding ranking approach can be used as a powerful alternative to, or complement of, other statistical learning approaches to define candidate biomarkers for molecular diagnosis and prediction in biomedicine, in realistically noisy conditions and with moderate sample sizes.
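The abstract above names quantile normalization as the preferred preprocessing step. The following is a generic sketch of that technique only, not the authors' published scripts: the function name `quantile_normalize`, the list-of-columns input layout, and the tie handling (ties keep their input order instead of being averaged) are assumptions.

```python
def quantile_normalize(columns):
    """Quantile-normalize samples so that all share the same value distribution.

    `columns` is a list of equal-length lists, one list per sample.
    Each value is replaced by the mean, across samples, of the values
    that hold the same rank. Ties are broken by input order (an assumption).
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must have the same number of values")

    # Mean of the r-th smallest value across all samples.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(sc[r] for sc in sorted_cols) / len(columns) for r in range(n)]

    normalized = []
    for col in columns:
        # Indices of this sample's values in ascending order (stable sort).
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical toy data: two samples with three expression values each.
print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After normalization every sample contains the same set of values, so between-sample differences in overall intensity no longer affect the comparison of individual genes.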

Subject(s)

Algorithms , Biomarkers/chemistry , Computational Biology/methods , Gene Expression Profiling , Oligonucleotide Array Sequence Analysis
