Search | VHL Regional Portal

Supervised learning and model analysis with compositional data.

Huang, Shimeng; Ailer, Elisabeth; Kilbertus, Niki; Pfister, Niklas.

PLoS Comput Biol ; 19(6): e1011240, 2023 Jun.

Article in English | MEDLINE | ID: mdl-37390111

ABSTRACT

Supervised learning, such as regression and classification, is an essential tool for analyzing modern high-throughput sequencing data, for example in microbiome research. However, due to the compositionality and sparsity, existing techniques are often inadequate. Either they rely on extensions of the linear log-contrast model (which adjust for compositionality but cannot account for complex signals or sparsity) or they are based on black-box machine learning methods (which may capture useful signals, but lack interpretability due to the compositionality). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on par or improved predictive performance compared with state-of-the-art machine learning methods on 33 publicly available microbiome datasets. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast coefficients to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. KernelBiome is available as an open-source Python package on PyPI and at https://github.com/shimenghuang/KernelBiome.

Subject(s)

Algorithms, Machine Learning, Phylogeny, Linear Models, Supervised Machine Learning

Invariant Policy Learning: A Causal Perspective.

Saengkyongam, Sorawit; Thams, Nikolaj; Peters, Jonas; Pfister, Niklas.

IEEE Trans Pattern Anal Mach Intell ; 45(7): 8606-8620, 2023 Jul.

Article in English | MEDLINE | ID: mdl-37018267

ABSTRACT

Contextual bandit and reinforcement learning algorithms have been successfully used in various interactive learning systems such as online advertising, recommender systems, and dynamic pricing. However, they have yet to be widely adopted in high-stakes application domains, such as healthcare. One reason may be that existing approaches assume that the underlying mechanisms are static in the sense that they do not change over different environments. In many real-world systems, however, the mechanisms are subject to shifts across environments which may invalidate the static environment assumption. In this paper, we take a step toward tackling the problem of environmental shifts considering the framework of offline contextual bandits. We view the environmental shift problem through the lens of causality and propose multi-environment contextual bandits that allow for changes in the underlying mechanisms. We adopt the concept of invariance from the causality literature and introduce the notion of policy invariance. We argue that policy invariance is only relevant if unobserved variables are present and show that, in that case, an optimal invariant policy is guaranteed to generalize across environments under suitable assumptions.

Interpreting tree ensemble machine learning models with endoR.

Ruaud, Albane; Pfister, Niklas; Ley, Ruth E; Youngblut, Nicholas D.

PLoS Comput Biol ; 18(12): e1010714, 2022 Dec.

Article in English | MEDLINE | ID: mdl-36516158

ABSTRACT

Tree ensemble machine learning models are increasingly used in microbiome science as they are compatible with the compositional, high-dimensional, and sparse structure of sequence-based microbiome data. While such models are often good at predicting phenotypes based on microbiome data, they only yield limited insights into how microbial taxa may be associated. We developed endoR, a method to interpret tree ensemble models. First, endoR simplifies the fitted model into a decision ensemble. Then, it extracts information on the importance of individual features and their pairwise interactions, displaying them as an interpretable network. Both the endoR network and importance scores provide insights into how features, and interactions between them, contribute to the predictive performance of the fitted model. Adjustable regularization and bootstrapping help reduce the complexity and ensure that only essential parts of the model are retained. We assessed endoR on both simulated and real metagenomic data. We found endoR to have comparable accuracy to other common approaches while easing and enhancing model interpretation. Using endoR, we also confirmed published results on gut microbiome differences between cirrhotic and healthy individuals. Finally, we utilized endoR to explore associations between human gut methanogens and microbiome components. Indeed, these hydrogen consumers are expected to interact with fermenting bacteria in a complex syntrophic network. Specifically, we analyzed a global metagenome dataset of 2203 individuals and confirmed the previously reported association between Methanobacteriaceae and Christensenellales. Additionally, we observed that Methanobacteriaceae are associated with a network of hydrogen-producing bacteria. Our method accurately captures how tree ensembles use features and interactions between them to predict a response. As demonstrated by our applications, the resultant visualizations and summary outputs facilitate model interpretation and enable the generation of novel hypotheses about complex systems.

Subject(s)

Gastrointestinal Microbiome, Microbiota, Humans, Bacteria/genetics, Gastrointestinal Microbiome/genetics, Machine Learning, Metagenome

Multiomic profiling of the liver across diets and age in a diverse mouse population.

Williams, Evan G; Pfister, Niklas; Roy, Suheeta; Statzer, Cyril; Haverty, Jack; Ingels, Jesse; Bohl, Casey; Hasan, Moaraj; Cuklina, Jelena; Bühlmann, Peter; Zamboni, Nicola; Lu, Lu; Ewald, Collin Y; Williams, Robert W; Aebersold, Ruedi.

Cell Syst ; 13(1): 43-57.e6, 2022 Jan 19.

Article in English | MEDLINE | ID: mdl-34666007

ABSTRACT

We profiled the liver transcriptome, proteome, and metabolome in 347 individuals from 58 isogenic strains of the BXD mouse population across age (7 to 24 months) and diet (low or high fat) to link molecular variations to metabolic traits. Several hundred genes are affected by diet and/or age at the transcript and protein levels. Orthologs of two aging-associated genes, St7 and Ctsd, were knocked down in C. elegans, reducing longevity in wild-type and mutant long-lived strains. The multiomics data were analyzed as segregating gene networks according to each independent variable, providing causal insight into dietary and aging effects. Candidates were cross-examined in an independent diversity outbred mouse liver dataset segregating for similar diets, with ~80%-90% of diet-related candidate genes found in common across datasets. Together, we have developed a large multiomics resource for multivariate analysis of complex traits and demonstrate a methodology for moving from observational associations to causal connections.

Subject(s)

Caenorhabditis elegans, Liver, Animals, Caenorhabditis elegans/genetics, Diet, Gene Regulatory Networks, Liver/metabolism, Mice, Transcriptome/genetics

A Causal Framework for Distribution Generalization.

Christiansen, Rune; Pfister, Niklas; Jakobsen, Martin Emil; Gnecco, Nicola; Peters, Jonas.

IEEE Trans Pattern Anal Mach Intell ; 44(10): 6614-6630, 2022 Oct.

Article in English | MEDLINE | ID: mdl-34232865

ABSTRACT

We consider the problem of predicting a response Y from a set of covariates X when test- and training distributions differ. Since such differences may have causal explanations, we consider test distributions that emerge from interventions in a structural causal model, and focus on minimizing the worst-case risk. Causal regression models, which regress the response on its direct causes, remain unchanged under arbitrary interventions on the covariates, but they are not always optimal in the above sense. For example, for linear models and bounded interventions, alternative solutions have been shown to be minimax prediction optimal. We introduce the formal framework of distribution generalization that allows us to analyze the above problem in partially observed nonlinear models for both direct interventions on X and interventions that occur indirectly via exogenous variables A. It takes into account that, in practice, minimax solutions need to be identified from data. Our framework allows us to characterize under which class of interventions the causal function is minimax optimal. We prove sufficient conditions for distribution generalization and present corresponding impossibility results. We propose a practical method, NILE, that achieves distribution generalization in a nonlinear IV setting with linear extrapolation. We prove consistency and present empirical results.

Subject(s)

Algorithms, Models, Theoretical, Linear Models

Learning stable and predictive structures in kinetic systems.

Pfister, Niklas; Bauer, Stefan; Peters, Jonas.

Proc Natl Acad Sci U S A ; 116(51): 25405-25411, 2019 Dec 17.

Article in English | MEDLINE | ID: mdl-31776252

ABSTRACT

Learning kinetic systems from data is one of the core challenges in many fields. Identifying stable models is essential for the generalization capabilities of data-driven inference. We introduce a computationally efficient framework, called CausalKinetiX, that identifies structure from discrete time, noisy observations, generated from heterogeneous experiments. The algorithm assumes the existence of an underlying, invariant kinetic model, a key criterion for reproducible research. Results on both simulated and real-world examples suggest that learning the structure of kinetic systems benefits from a causal perspective. The identified variables and models allow for a concise description of the dynamics across multiple experimental settings and can be used for prediction in unseen experiments. We observe significant improvements compared to well-established approaches focusing solely on predictive performance, especially for out-of-sample generalization.
