Integrating biological knowledge and gene expression data using pathway-guided random forests: a benchmarking study.

Seifert, Stephan; Gundlach, Sven; Junge, Olaf; Szymczak, Silke.

Bioinformatics ; 36(15): 4301-4308, 2020 08 01.

Article in English | MEDLINE | ID: mdl-32399562

ABSTRACT

MOTIVATION: High-throughput technologies allow comprehensive characterization of individuals on many molecular levels. However, training computational models to predict disease status based on omics data is challenging. A promising solution is the integration of external knowledge about structural and functional relationships into the modeling process. We compared four published random forest-based approaches using two simulation studies and nine experimental datasets.

RESULTS: The self-sufficient "prediction error" approach should be applied when large numbers of relevant pathways are expected. The competing methods "hunting" and "learner of functional enrichment" should be used when low numbers of relevant pathways are expected or when the most strongly associated pathways are of interest. The hybrid approach "synthetic features" is not recommended because of its high false discovery rate.

AVAILABILITY AND IMPLEMENTATION: An R package providing functions for data analysis and simulation is available at GitHub (https://github.com/szymczak-lab/PathwayGuidedRF). An accompanying R data package (https://github.com/szymczak-lab/DataPathwayGuidedRF) stores the processed and quality-controlled experimental datasets downloaded from Gene Expression Omnibus (GEO).

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Benchmarking; Software; Gene Expression; Humans

Comparison of Markov Chain Monte Carlo Software for the Evolutionary Analysis of Y-Chromosomal Microsatellite Data.

Gundlach, Sven; Junge, Olaf; Wienbrandt, Lars; Krawczak, Michael; Caliebe, Amke.

Comput Struct Biotechnol J ; 17: 1082-1090, 2019.

Article in English | MEDLINE | ID: mdl-31452861

ABSTRACT

The evolutionary analysis of genetic data is an important subject of modern bioscience, with practical applications in diverse fields. Parameters of interest in this context include effective population sizes, mutation rates, population growth rates and the times to most recent common ancestors. Studying Y-chromosomal microsatellite data, in particular, has proven useful to unravel the recent patrilineal history of Homo sapiens populations. We compared the individual analysis options and technical details of four software tools that are widely used for this purpose, namely BATWING, BEAST, IMa2 and LAMARC, all of which use Bayesian coalescent-based Markov chain Monte Carlo (MCMC) methods for parameter estimation. More specifically, we simulated datasets for either eight or 20 hypothetical Y-chromosomal microsatellites, assuming a mutation rate of 0.0030 per generation and a constant or exponentially increasing population size, and used these data to evaluate the parameter estimation capacity of each tool. The datasets comprised between 100 and 1000 samples. In addition to runtime, the practical utility of the tools can be expected to depend critically upon the convergence behavior of the actual MCMC implementation. As expected, we found that runtime increased and the convergence rate decreased with increasing sample size. BATWING performed best with respect to runtime and convergence behavior, but only supports simple evolutionary models. As regards the spectrum of evolutionary models covered, and also in terms of cross-platform usability, BEAST provided the greatest flexibility. Finally, IMa2 and LAMARC proved best suited to incorporating elaborate migration models in the analysis process.

VarWatch - A stand-alone software tool for variant matching.

Fredrich, Broder; Schmöhl, Marcus; Junge, Olaf; Gundlach, Sven; Ellinghaus, David; Pfeufer, Arne; Bettecken, Thomas; Siddiqui, Roman; Franke, Andre; Wienker, Thomas F; Hoeppner, Marc P; Krawczak, Michael.

PLoS One ; 14(4): e0215618, 2019.

Article in English | MEDLINE | ID: mdl-31022234

ABSTRACT

Massively parallel DNA sequencing of clinical samples holds great promise for the gene-based diagnosis of human inherited diseases because it allows rapid detection of putatively causative mutations at genome-wide level. Without additional evidence complementing their initial bioinformatics evaluation, however, the clinical relevance of such candidate genetic variants often remains unclear. In consequence, dedicated 'matching' services have been established in recent years that aim at the discovery of other, comparable case reports to facilitate individual diagnoses. However, legal concerns have been raised about the global sharing of genetic data, particularly in Europe where the recently enacted General Data Protection Regulation EU-2016/679 classifies genetic data as highly sensitive. Hence, unrestricted sharing of genetic data from clinical cases on platforms outside the national jurisdiction increasingly may be perceived as problematic. To allow collaborative data producers, particularly large consortia of diagnostic laboratories, to acknowledge these concerns while still practicing efficient case matching internally, novel tools are required. To this end, we developed VarWatch, an easy-to-deploy and highly scalable case matching software that provides users with comprehensive programmatic tools and a user-friendly interface to fulfil said purpose.

Subject(s)

Computational Biology/instrumentation; Genetic Diseases, Inborn/diagnosis; Genetic Testing/instrumentation; Genomics/instrumentation; Software; Datasets as Topic; Genetic Diseases, Inborn/genetics; Genetic Variation; High-Throughput Nucleotide Sequencing; Humans; Sequence Analysis, DNA

Surrogate minimal depth as an importance measure for variables in random forests.

Seifert, Stephan; Gundlach, Sven; Szymczak, Silke.

Bioinformatics ; 35(19): 3663-3671, 2019 10 01.

Article in English | MEDLINE | ID: mdl-30824905

ABSTRACT

MOTIVATION: It has been shown that the machine learning approach random forest can be successfully applied to omics data, such as gene expression data, for classification or regression and to select variables that are important for prediction. However, the complex relationships between predictor variables, in particular between causal predictor variables, make the interpretation of currently applied variable selection techniques difficult.

RESULTS: Here we propose a new variable selection approach called surrogate minimal depth (SMD) that incorporates surrogate variables into the concept of minimal depth (MD) variable importance. Applying SMD, we show that simulated correlation patterns can be reconstructed and that the increased consideration of variable relationships improves variable selection. When compared with existing state-of-the-art methods and MD, SMD has higher empirical power to identify causal variables while the resulting variable lists are equally stable. In conclusion, SMD is a promising approach to get more insight into the complex interplay of predictor variables and outcome in a high-dimensional data setting.

AVAILABILITY AND IMPLEMENTATION: https://github.com/StephanSeifert/SurrogateMinimalDepth.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Machine Learning
