1.
Bioinformatics ; 39(12)2023 12 01.
Article in English | MEDLINE | ID: mdl-38039146

ABSTRACT

SUMMARY: Due to their flexibility and superior performance, machine learning models frequently complement and outperform traditional statistical survival models. However, their widespread adoption is hindered by a lack of user-friendly tools to explain their internal operations and prediction rationales. To tackle this issue, we introduce the survex R package, which provides a cohesive framework for explaining any survival model by applying explainable artificial intelligence techniques. The capabilities of the proposed software encompass understanding and diagnosing survival models, which can lead to their improvement. By revealing insights into the decision-making process, such as variable effects and importances, survex enables the assessment of model reliability and the detection of biases. Thus, transparency and responsibility may be promoted in sensitive areas, such as biomedical research and healthcare applications. AVAILABILITY AND IMPLEMENTATION: survex is available under the GPL3 public license at https://github.com/modeloriented/survex and on CRAN with documentation available at https://modeloriented.github.io/survex.


Subject(s)
Artificial Intelligence , Biomedical Research , Reproducibility of Results , Software , Machine Learning
2.
Diabetologia ; 66(10): 1914-1924, 2023 10.
Article in English | MEDLINE | ID: mdl-37420130

ABSTRACT

AIMS/HYPOTHESIS: There is increasing evidence for the existence of shared genetic predictors of metabolic traits and neurodegenerative disease. We previously observed a U-shaped association between fasting insulin in middle-aged women and dementia up to 34 years later. In the present study, we performed genome-wide association (GWA) analyses for fasting serum insulin in European children with a focus on variants associated with the tails of the insulin distribution. METHODS: Genotyping was successful in 2825 children aged 2-14 years at the time of insulin measurement. Because insulin levels vary during childhood, GWA analyses were based on age- and sex-specific z scores. Five percentile ranks of z-insulin were selected and modelled using logistic regression, i.e. the 15th, 25th, 50th, 75th and 85th percentile ranks (P15-P85). Additive genetic models were adjusted for age, sex, BMI, survey year, survey country and principal components derived from genetic data to account for ethnic heterogeneity. Quantile regression was used to determine whether associations with variants identified by GWA analyses differed across quantiles of log-insulin. RESULTS: A variant in the SLC28A1 gene (rs2122859) was associated with the 85th percentile rank of the insulin z score (P85, p value = 3×10⁻⁸). Two variants associated with low z-insulin (P15, p value < 5×10⁻⁶) were located on the RBFOX1 and SH3RF3 genes. These genes have previously been associated with both metabolic traits and dementia phenotypes. While variants associated with P50 showed stable associations across the insulin spectrum, we found that associations with variants identified through GWA analyses of P15 and P85 varied across quantiles of log-insulin. CONCLUSIONS/INTERPRETATION: The above results support the notion of a shared genetic architecture for dementia and metabolic traits. Our approach identified genetic variants that were associated with the tails of the insulin spectrum only. 
Because traditional heritability estimates assume that genetic effects are constant throughout the phenotype distribution, the new findings may have implications for understanding the discrepancy in heritability estimates from GWA and family studies and for the study of U-shaped biomarker-disease associations.


Subject(s)
Dementia , Neurodegenerative Diseases , Male , Female , Humans , Genome-Wide Association Study , Insulin , Fasting , Polymorphism, Single Nucleotide , Ubiquitin-Protein Ligases
3.
PeerJ ; 10: e13728, 2022.
Article in English | MEDLINE | ID: mdl-35910765

ABSTRACT

This article describes a data-driven framework based on spatiotemporal machine learning to produce distribution maps for 16 tree species (Abies alba Mill., Castanea sativa Mill., Corylus avellana L., Fagus sylvatica L., Olea europaea L., Picea abies L. H. Karst., Pinus halepensis Mill., Pinus nigra J. F. Arnold, Pinus pinea L., Pinus sylvestris L., Prunus avium L., Quercus cerris L., Quercus ilex L., Quercus robur L., Quercus suber L. and Salix caprea L.) at high spatial resolution (30 m). Tree occurrence data for a total of three million points was used to train different algorithms: random forest, gradient-boosted trees, generalized linear models, k-nearest neighbors, CART and an artificial neural network. A stack of 305 coarse and high resolution covariates representing spectral reflectance, different biophysical conditions and biotic competition was used as predictors for realized distributions, while potential distribution was modelled with environmental predictors only. Logloss and computing time were used to select the three best algorithms to tune and train an ensemble model based on stacking with a logistic regressor as a meta-learner. An ensemble model was trained for each species: probability and model uncertainty maps of realized distribution were produced for each species using a time window of 4 years for a total of six distribution maps per species, while for potential distributions only one map per species was produced. Results of spatial cross validation show that the ensemble model consistently outperformed or performed as well as the best individual model in both potential and realized distribution tasks, with potential distribution models achieving higher predictive performances (TSS = 0.898, R² logloss = 0.857) than realized distribution ones on average (TSS = 0.874, R² logloss = 0.839). 
Ensemble models for Q. suber achieved the best performances in both potential (TSS = 0.968, R² logloss = 0.952) and realized (TSS = 0.959, R² logloss = 0.949) distribution, while P. sylvestris (TSS = 0.731, 0.785, R² logloss = 0.585, 0.670, respectively, for potential and realized distribution) and P. nigra (TSS = 0.658, 0.686, R² logloss = 0.623, 0.664) achieved the worst. Importance of predictor variables differed across species and models, with the green band for summer and the Normalized Difference Vegetation Index (NDVI) for fall for realized distribution and the diffuse irradiation and precipitation of the driest quarter (BIO17) being the most frequent and important for potential distribution. On average, fine-resolution models outperformed coarse resolution models (250 m) for realized distribution (TSS = +6.5%, R² logloss = +7.5%). The framework shows how combining continuous and consistent Earth Observation time series data with state-of-the-art machine learning can be used to derive dynamic distribution maps. The produced predictions can be used to quantify temporal trends of potential forest degradation and species composition change.
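
The True Skill Statistic (TSS) reported in this abstract is conventionally defined as sensitivity plus specificity minus one. The following is a minimal Python sketch of that standard definition only; the function name and the toy presence/absence labels are illustrative assumptions, and the abstract does not say how predicted probabilities were thresholded.

```python
def true_skill_statistic(observed, predicted):
    """TSS = sensitivity + specificity - 1 for binary presence/absence labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)  # true presences
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)  # missed presences
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)  # true absences
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)  # false presences
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Hypothetical toy labels (not data from the article).
observed = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 1]
tss = true_skill_statistic(observed, predicted)
```

A TSS of 1 indicates perfect discrimination, and values near 0 indicate performance no better than chance.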


Subject(s)
Abies , Fagus , Pinus , Quercus , Europe
4.
Genet Epidemiol ; 45(5): 485-536, 2021 07.
Article in English | MEDLINE | ID: mdl-33942369

ABSTRACT

The Translational Machine (TM) is a machine learning (ML)-based analytic pipeline that translates genotypic/variant call data into biologically contextualized features that richly characterize complex variant architectures and permit greater interpretability and biological replication. It also reduces potentially confounding effects of population substructure on outcome prediction. The TM consists of three main components. First, replicable but flexible feature engineering procedures translate genome-scale data into biologically informative features that appropriately contextualize simple variant calls/genotypes within biological and functional contexts. Second, model-free, nonparametric ML-based feature filtering procedures empirically reduce dimensionality and noise of both original genotype calls and engineered features. Third, a powerful ML algorithm for feature selection is used to differentiate risk variant contributions across variant frequency and functional prediction spectra. The TM simultaneously evaluates potential contributions of variants operative under polygenic and heterogeneous models of genetic architecture. Our TM enables integration of biological information (e.g., genomic annotations) within conceptual frameworks akin to geneset-/pathways-based and collapsing methods, but overcomes some of these methods' limitations. The full TM pipeline is executed in R. Our approach and initial findings from its application to a whole-exome schizophrenia case-control data set are presented. These TM procedures extend the findings of the primary investigation and yield novel results.


Subject(s)
Machine Learning , Models, Genetic , Algorithms , Genomics , Genotype , Humans
5.
Int J Obes (Lond) ; 45(6): 1321-1330, 2021 06.
Article in English | MEDLINE | ID: mdl-33753884

ABSTRACT

BACKGROUND: Childhood obesity is a complex multifaceted condition, which is influenced by genetics, environmental factors, and their interaction. However, these interactions have mainly been studied in twin studies and evidence from population-based cohorts is limited. Here, we analyze the interaction of an obesity-related genome-wide polygenic risk score (PRS) with sociodemographic and lifestyle factors for BMI and waist circumference (WC) in European children and adolescents. METHODS: The analyses are based on 8609 repeated observations from 3098 participants aged 2-16 years from the IDEFICS/I.Family cohort. A genome-wide polygenic risk score (PRS) was calculated using summary statistics from independent genome-wide association studies of BMI. Associations were estimated using generalized linear mixed models adjusted for sex, age, region of residence, parental education, dietary intake, relatedness, and population stratification. RESULTS: The PRS was associated with BMI (beta estimate [95% confidence interval (95%-CI)] = 0.33 [0.30, 0.37], r² = 0.11, p value = 7.9 × 10⁻⁸¹) and WC (beta [95%-CI] = 0.36 [0.32, 0.40], r² = 0.09, p value = 1.8 × 10⁻⁷¹). We observed significant interactions with demographic and lifestyle factors for BMI as well as WC. Children from Southern Europe showed increased genetic liability to obesity (BMI: beta [95%-CI] = 0.40 [0.34, 0.45]) in comparison to children from central Europe (beta [95%-CI] = 0.29 [0.23, 0.34]; p-interaction = 0.0066). Children of parents with a low level of education showed an increased genetic liability to obesity (BMI: beta [95%-CI] = 0.48 [0.38, 0.59]) in comparison to children of parents with a high level of education (beta [95%-CI] = 0.30 [0.26, 0.34]; p-interaction = 0.0012). Furthermore, the genetic liability to obesity was attenuated by a higher intake of fiber (BMI: beta [95%-CI] interaction = -0.02 [-0.04, -0.01]) and shorter screen times (beta [95%-CI] interaction = 0.02 [0.00, 0.03]). 
CONCLUSIONS: Our results highlight that a healthy childhood environment might partly offset a genetic predisposition to obesity during childhood and adolescence.


Subject(s)
Life Style , Pediatric Obesity/epidemiology , Pediatric Obesity/genetics , Adolescent , Child , Child, Preschool , Cohort Studies , Europe/epidemiology , Female , Genome-Wide Association Study , Humans , Male , Social Factors
6.
J Phys Act Health ; 17(10): 1025-1033, 2020 08 28.
Article in English | MEDLINE | ID: mdl-32858522

ABSTRACT

BACKGROUND: To evaluate a multicomponent health promotion program targeting preschoolers' physical activity (PA). METHODS: PA of children from 23 German daycare facilities (DFs; 13 intervention DFs, 10 control DFs) was measured via accelerometry at baseline and after 12 months. Children's sedentary time, light PA, and moderate to vigorous PA were estimated. Adherence was tracked with paper-and-pencil calendars. Mixed-model regression analyses were used to assess intervention effects. RESULTS: PA data were analyzed from 183 (4.2 [0.8] y, 48.1% boys) children. At follow-up, children in DF groups with more than 50% adherence to PA intervention components showed an increase of 9 minutes of moderate to vigorous PA per day (β = 9.28; 95% confidence interval [CI], -0.16 to 18.72) and a 19-minute decrease in sedentary time (β = -19.25; 95% CI, -43.66 to 5.16) compared with the control group, whereas children's PA of those who were exposed to no or less than 50% adherence remained unchanged (moderate to vigorous PA: β = 0.34; 95% CI, -13.73 to 14.41; sedentary time: β = 1.78; 95% CI, -26.54 to 30.09). Notable effects were found in children with migration background. CONCLUSIONS: Only small benefits in PA outcomes were observed after 1 year. A minimum of 50% adherence to the intervention seems to be crucial for facilitating intervention effects.


Subject(s)
Child Day Care Centers , Exercise , Accelerometry , Child , Female , Health Promotion , Humans , Male , Sedentary Behavior
7.
J Comput Graph Stat ; 29(3): 639-658, 2020.
Article in English | MEDLINE | ID: mdl-34121830

ABSTRACT

Random forests have become an established tool for classification and regression, in particular in high-dimensional settings and in the presence of non-additive predictor-response relationships. For bounded outcome variables restricted to the unit interval, however, classical modeling approaches based on mean squared error loss may severely suffer as they do not account for heteroscedasticity in the data. To address this issue, we propose a random forest approach for relating a beta distributed outcome to a set of explanatory variables. Our approach explicitly makes use of the likelihood function of the beta distribution for the selection of splits during the tree-building procedure. In each iteration of the tree-building algorithm it chooses one explanatory variable in combination with a split point that maximizes the log-likelihood function of the beta distribution with the parameter estimates derived from the nodes of the currently built tree. Results of several simulation studies and an application using data from the U.S.A. National Lakes Assessment Survey demonstrate the properties and usefulness of the method, in particular when compared to random forest approaches based on mean squared error loss and parametric regression models.
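
The split rule described in this abstract (pick the split point that maximizes the beta log-likelihood summed over the child nodes) can be illustrated for a single numeric predictor. This is a rough Python sketch, not the authors' implementation: it assumes method-of-moments parameter estimates where the paper may well use maximum likelihood, and the function names, the `min_node` parameter and the toy data are all hypothetical.

```python
import math


def beta_loglik(ys):
    """Log-likelihood of ys (all strictly inside (0, 1)) under a fitted beta."""
    n = len(ys)
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    if var <= 0.0:
        raise ValueError("need at least two distinct values in a node")
    # Method-of-moments estimates: an assumption made for this sketch.
    common = mean * (1.0 - mean) / var - 1.0
    a, b = mean * common, (1.0 - mean) * common
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return sum(
        log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)
        for y in ys
    )


def best_split(x, y, min_node=2):
    """Return (split_point, loglik) maximizing the summed child log-likelihoods."""
    pairs = sorted(zip(x, y))
    best = (None, -math.inf)
    for i in range(min_node, len(pairs) - min_node + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal x values
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        loglik = beta_loglik(left) + beta_loglik(right)
        if loglik > best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2.0, loglik)
    return best


# Hypothetical toy data: the outcome jumps from about 0.1 to about 0.8 after x = 4.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0.10, 0.12, 0.08, 0.11, 0.80, 0.85, 0.78, 0.83]
split_point, loglik = best_split(x, y)
```

In a full forest, this search would be repeated over a random subset of explanatory variables at every node; the sketch covers only the per-variable search.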

8.
Hum Genet ; 139(1): 73-84, 2020 Jan.
Article in English | MEDLINE | ID: mdl-31049651

ABSTRACT

In this paper, we give an overview of methodological issues related to the use of statistical learning approaches when analyzing high-dimensional genetic data. The focus is set on regression models and machine learning algorithms taking genetic variables as input and returning a classification or a prediction for the target variable of interest; for example, the present or future disease status, or the future course of a disease. After briefly explaining the basic motivation and principle of these methods, we review different procedures that can be used to evaluate the accuracy of the obtained models and discuss common flaws that may lead to over-optimistic conclusions with respect to their prediction performance and usefulness.


Subject(s)
Algorithms , Disease/genetics , Machine Learning , Models, Statistical , Molecular Epidemiology , Artificial Intelligence , Humans
9.
BMC Bioinformatics ; 20(1): 358, 2019 Jun 27.
Article in English | MEDLINE | ID: mdl-31248362

ABSTRACT

BACKGROUND: In the last years more and more multi-omics data are becoming available, that is, data featuring measurements of several types of omics data for each patient. Using multi-omics data as covariate data in outcome prediction is both promising and challenging due to the complex structure of such data. Random forest is a prediction method known for its ability to render complex dependency patterns between the outcome and the covariates. Against this background we developed five candidate random forest variants tailored to multi-omics covariate data. These variants modify the split point selection of random forest to incorporate the block structure of multi-omics data and can be applied to any outcome type for which a random forest variant exists, such as categorical, continuous and survival outcomes. Using 20 publicly available multi-omics data sets with survival outcome we compared the prediction performances of the block forest variants with alternatives. We also considered the common special case of having clinical covariates and measurements of a single omics data type available. RESULTS: We identify one variant termed "block forest" that outperformed all other approaches in the comparison study. In particular, it performed significantly better than standard random survival forest (adjusted p-value: 0.027). The two best performing variants have in common that the block choice is randomized in the split point selection procedure. In the case of having clinical covariates and a single omics data type available, the improvements of the variants over random survival forest were larger than in the case of the multi-omics data. The degrees of improvements over random survival forest varied strongly across data sets. Moreover, considering all clinical covariates mandatorily improved the performance. 
This result should however be interpreted with caution, because the level of predictive information contained in clinical covariates depends on the specific application. CONCLUSIONS: The new prediction method block forest for multi-omics data can significantly improve the prediction performance of random forest and outperformed alternatives in the comparison. Block forest is particularly effective for the special case of using clinical covariates in combination with measurements of a single omics data type.


Subject(s)
Machine Learning , Genomics , Humans , Survival Analysis
10.
PeerJ ; 7: e6339, 2019.
Article in English | MEDLINE | ID: mdl-30746306

ABSTRACT

One reason for the widespread success of random forests (RFs) is their ability to analyze most datasets without preprocessing. For example, in contrast to many other statistical methods and machine learning approaches, no recoding such as dummy coding is required to handle ordinal and nominal predictors. The standard approach for nominal predictors is to consider all 2^(k-1) - 1 2-partitions of the k predictor categories. However, this exponential relationship produces a large number of potential splits to be evaluated, increasing computational complexity and restricting the possible number of categories in most implementations. For binary classification and regression, it was shown that ordering the predictor categories in each split leads to exactly the same splits as the standard approach. This reduces computational complexity because only k - 1 splits have to be considered for a nominal predictor with k categories. For multiclass classification and survival prediction no ordering method producing equivalent splits exists. We therefore propose to use a heuristic which orders the categories according to the first principal component of the weighted covariance matrix in multiclass classification and by log-rank scores in survival prediction. This ordering of categories can be done either in every split or a priori, that is, just once before growing the forest. With this approach, the nominal predictor can be treated as ordinal in the entire RF procedure, speeding up the computation and avoiding category limits. We compare the proposed methods with the standard approach, dummy coding and simply ignoring the nominal nature of the predictors in several simulation settings and on real data in terms of prediction performance and computational efficiency. We show that ordering the categories a priori is at least as good as the standard approach of considering all 2-partitions in all datasets considered, while being computationally faster. 
We recommend using this approach as the default in RFs.
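
The counting argument in this abstract can be checked on a small regression example: with k categories there are 2^(k-1) - 1 distinct 2-partitions, but after ordering the categories by mean response only k - 1 candidate splits remain, and the best of them matches the exhaustive optimum under squared-error loss. The Python sketch below is illustrative only; the helper names and toy data are assumptions, and it does not cover the multiclass or survival heuristics proposed in the paper.

```python
from itertools import combinations


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_cost(data, left_cats):
    """Total SSE when categories in left_cats go left and the rest go right."""
    left = [y for c, ys in data.items() if c in left_cats for y in ys]
    right = [y for c, ys in data.items() if c not in left_cats for y in ys]
    return sse(left) + sse(right)


def all_two_partitions(cats):
    """Enumerate every 2-partition exactly once: 2**(k-1) - 1 candidates."""
    first, rest = cats[0], cats[1:]
    # Fixing the first category on the left avoids counting mirror images.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            if len(left) < len(cats):  # the right side must be non-empty
                yield left


def ordered_splits(data):
    """Order categories by mean response, leaving only k - 1 candidates."""
    order = sorted(data, key=lambda c: sum(data[c]) / len(data[c]))
    for i in range(1, len(order)):
        yield set(order[:i])


# Hypothetical toy data: response values grouped by a nominal predictor.
data = {
    "a": [1.0, 1.2, 0.9],
    "b": [3.1, 2.8],
    "c": [2.0, 2.2, 1.9],
    "d": [4.0, 4.3],
    "e": [0.5, 0.7],
}
exhaustive = list(all_two_partitions(list(data)))
ordered = list(ordered_splits(data))
best_exhaustive = min(split_cost(data, s) for s in exhaustive)
best_ordered = min(split_cost(data, s) for s in ordered)
```

For k = 5 the exhaustive search evaluates 15 partitions and the ordered search 4. The gap grows exponentially with k, which is why implementations that enumerate all partitions cap the number of categories.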

11.
PeerJ ; 6: e5518, 2018.
Article in English | MEDLINE | ID: mdl-30186691

ABSTRACT

Random forest and similar Machine Learning techniques are already used to generate spatial predictions, but spatial location of points (geography) is often ignored in the modeling process. Spatial auto-correlation, especially if still existent in the cross-validation residuals, indicates that the predictions may be biased, and this is suboptimal. This paper presents a random forest for spatial predictions framework (RFsp) where buffer distances from observation points are used as explanatory variables, thus incorporating geographical proximity effects into the prediction process. The RFsp framework is illustrated with examples that use textbook datasets and apply spatial and spatio-temporal prediction to numeric, binary, categorical, multivariate and spatiotemporal variables. Performance of the RFsp framework is compared with the state-of-the-art kriging techniques using fivefold cross-validation with refitting. The results show that RFsp can obtain equally accurate and unbiased predictions as different versions of kriging. Advantages of using RFsp over kriging are that it needs no rigid statistical assumptions about the distribution and stationarity of the target variable, it is more flexible towards incorporating, combining and extending covariates of different types, and it possibly yields more informative maps characterizing the prediction error. RFsp appears to be especially attractive for building multivariate spatial prediction models that can be used as "knowledge engines" in various geoscience fields. Some disadvantages of RFsp are the exponentially growing computational intensity with increase of calibration data and covariates and the high sensitivity of predictions to input data quality. 
The key to the success of the RFsp framework might be the training data quality, especially quality of spatial sampling (to minimize extrapolation problems and any type of bias in data), and quality of model validation (to ensure that accuracy is not affected by overfitting). For many data sets, especially those with lower number of points and covariates and close-to-linear relationships, model-based geostatistics can still lead to more accurate predictions than RFsp.
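
The central idea of RFsp, using distances from every observation point as explanatory variables, amounts to building one feature column per observation point. The Python sketch below assumes plain Euclidean distances in a projected coordinate system; the function name and coordinates are hypothetical, and the original framework works with gridded buffer-distance maps rather than Python lists.

```python
import math


def buffer_distance_features(locations, observation_points):
    """One row per location, one column per observation point (Euclidean)."""
    return [
        [math.dist(loc, obs) for obs in observation_points]
        for loc in locations
    ]


# Hypothetical observation points in projected (x, y) coordinates.
observations = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Training rows: each observation's distance to every observation point.
train_features = buffer_distance_features(observations, observations)

# Prediction rows: the same columns evaluated at new grid locations.
grid = [(0.0, 4.0), (3.0, 0.0)]
grid_features = buffer_distance_features(grid, observations)
```

These columns would then be passed, alone or alongside other covariates, to any random forest implementation. The number of columns grows with the number of observation points, which is consistent with the computational cost the abstract lists as a disadvantage.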

12.
Article in German | MEDLINE | ID: mdl-30027343

ABSTRACT

Adverse drug reactions are among the leading causes of death. Pharmacovigilance aims to monitor drugs after they have been released to the market in order to detect potential risks. Data sources commonly used to this end are spontaneous reports sent in by doctors or pharmaceutical companies. Reports alone are rather limited when it comes to detecting potential health risks. Routine statutory health insurance data, however, are a richer source since they not only provide a detailed picture of the patients' wellbeing over time, but also contain information on concomitant medication and comorbidities. To take advantage of their potential and to increase drug safety, we will further develop statistical methods that have shown their merit in other fields as a source of inspiration. A plethora of methods have been proposed over the years for spontaneous reporting data: a comprehensive comparison of these methods and their potential use for longitudinal data should be explored. In addition, we show how methods from machine learning could aid in identifying rare risks. We discuss these so-called enrichment analyses and how utilizing pharmaceutical similarities between drugs and similarities between comorbidities could help to construct risk profiles of the patients prone to experience an adverse drug event. Summarizing these methods will further push drug safety research based on healthcare claim data from German health insurances which form, due to their size, longitudinal coverage, and timeliness, an excellent basis for investigating adverse effects of drugs.


Subject(s)
Adverse Drug Reaction Reporting Systems , Drug-Related Side Effects and Adverse Reactions , Insurance, Health , Pharmacovigilance , Germany , Humans , Insurance, Health/statistics & numerical data
13.
Bioinformatics ; 34(21): 3711-3718, 2018 11 01.
Article in English | MEDLINE | ID: mdl-29757357

ABSTRACT

Motivation: Random forests are fast, flexible and represent a robust approach to analyze high dimensional data. A key advantage over alternative machine learning algorithms is the availability of variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency. Results: We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead to the creation of the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient. Availability and implementation: The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger. Supplementary information: Supplementary data are available at Bioinformatics online.
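
For readers unfamiliar with impurity-based importance, the quantity that is accumulated per variable is the impurity reduction of each split on that variable. The Python sketch below computes the standard (uncorrected) Gini impurity reduction for a single split; it does not reproduce the debiasing procedure of the paper or the ranger implementation, and the function names and labels are illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_reduction(parent, left, right):
    """Decrease in Gini impurity achieved by splitting parent into left/right."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini(left) + len(right) / n * gini(right)
    )
    return gini(parent) - weighted_children


# Hypothetical class labels at one node and two candidate splits of that node.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
perfect = impurity_reduction(parent, [0, 0, 0, 0], [1, 1, 1, 1])
useless = impurity_reduction(parent, [0, 0, 1, 1], [0, 0, 1, 1])
```

Summing these reductions over all splits that use a given variable yields the uncorrected Gini importance. A variable with many candidate split points has more chances to produce a large reduction by chance, which is the bias the abstract describes.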


Subject(s)
Algorithms , Genome-Wide Association Study , Gene Frequency , Genome , Machine Learning , Software
14.
Sci Rep ; 8(1): 5872, 2018 04 12.
Article in English | MEDLINE | ID: mdl-29651131

ABSTRACT

Mutations in mitochondrial DNA (mtDNA) lead to heteroplasmy, i.e., the intracellular coexistence of wild-type and mutant mtDNA strands, which impact a wide spectrum of diseases but also physiological processes, including endurance exercise performance in athletes. However, the phenotypic consequences of limited levels of naturally arising heteroplasmy have not been experimentally studied to date. We hence generated a conplastic mouse strain carrying the mitochondrial genome of an AKR/J mouse strain (B6-mtAKR) in a C57BL/6 J nuclear genomic background, leading to >20% heteroplasmy in the origin of light-strand DNA replication (OriL). These conplastic mice demonstrate a shorter lifespan as well as dysregulation of multiple metabolic pathways, culminating in impaired glucose metabolism, compared to that of wild-type C57BL/6 J mice carrying lower levels of heteroplasmy. Our results indicate that physiologically relevant differences in mtDNA heteroplasmy levels at a single, functionally important site impair the metabolic health and lifespan in mice.


Subject(s)
DNA Replication/genetics , DNA, Mitochondrial/genetics , Longevity/genetics , Mitochondria/genetics , Animals , Glucose/genetics , Glucose/metabolism , Humans , Metabolic Networks and Pathways/genetics , Mice , Mitochondria/pathology , Mutation
15.
Methods Mol Biol ; 1666: 629-647, 2017.
Article in English | MEDLINE | ID: mdl-28980267

ABSTRACT

The advancement of high-throughput sequencing technologies enables sequencing of human genomes at steadily decreasing costs and increasing quality. Before variants can be analyzed, e.g., in association studies, the raw data obtained from the sequencer need to be preprocessed. These preprocessing steps include the removal of adapters, duplicates, and contaminations, alignment to a reference genome and the postprocessing of the alignment. All later steps, such as variant discovery, rely on high data quality and proper preprocessing, emphasizing the great importance of quality control. This chapter presents a workflow for preprocessing Illumina HiSeq X sequencing data. Code snippets are provided for illustrating all necessary steps, along with a brief description of the tools and underlying methods.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Whole Genome Sequencing/methods , Genome, Human , Humans , INDEL Mutation , Quality Control , Software , Workflow
16.
PLoS One ; 12(2): e0169748, 2017.
Article in English | MEDLINE | ID: mdl-28207752

ABSTRACT

This paper describes the technical development and accuracy assessment of the most recent and improved version of the SoilGrids system at 250 m resolution (June 2016 update). SoilGrids provides global predictions for standard numeric soil properties (organic carbon, bulk density, Cation Exchange Capacity (CEC), pH, soil texture fractions and coarse fragments) at seven standard depths (0, 5, 15, 30, 60, 100 and 200 cm), in addition to predictions of depth to bedrock and distribution of soil classes based on the World Reference Base (WRB) and USDA classification systems (ca. 280 raster layers in total). Predictions were based on ca. 150,000 soil profiles used for training and a stack of 158 remote sensing-based soil covariates (primarily derived from MODIS land products, SRTM DEM derivatives, climatic images and global landform and lithology maps), which were used to fit an ensemble of machine learning methods (random forest and gradient boosting and/or multinomial logistic regression) as implemented in the R packages ranger, xgboost, nnet and caret. The results of 10-fold cross-validation show that the ensemble models explain between 56% (coarse fragments) and 83% (pH) of variation with an overall average of 61%. Improvements in the relative accuracy considering the amount of variation explained, in comparison to the previous version of SoilGrids at 1 km spatial resolution, range from 60 to 230%. Improvements can be attributed to: (1) the use of machine learning instead of linear regression, (2) considerable investments in preparing finer resolution covariate layers and (3) the insertion of additional soil profiles. Further development of SoilGrids could include refinement of methods to incorporate input uncertainties and derivation of posterior probability distributions (per pixel), and further automation of spatial modeling so that soil maps can be generated for potentially hundreds of soil variables. 
Another area of future research is the development of methods for multiscale merging of SoilGrids predictions with local and/or national gridded soil products (e.g. up to 50 m spatial resolution) so that increasingly more accurate, complete and consistent global soil information can be produced. SoilGrids are available under the Open Data Base License.


Subject(s)
Environmental Monitoring , Geographic Information Systems , Machine Learning , Models, Theoretical , Soil/chemistry , Algorithms , Conservation of Natural Resources , Humans
17.
Stat Med ; 36(8): 1272-1284, 2017 04 15.
Article in English | MEDLINE | ID: mdl-28088842

ABSTRACT

The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption may not always be fulfilled. An alternative approach for survival prediction is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistic, which favors splitting variables with many possible split points. Conditional inference forests avoid this split variable selection bias. However, linear rank statistics are utilized by default in conditional inference forests to select the optimal splitting variable, which cannot detect non-linear effects in the independent variables. An alternative is to use maximally selected rank statistics for the split point selection. As in conditional inference forests, splitting variables are compared on the p-value scale. However, instead of the conditional Monte-Carlo approach used in conditional inference forests, p-value approximations are employed. We describe several p-value approximations and the implementation of the proposed random forest approach. A simulation study demonstrates that unbiased split variable selection is possible. However, there is a trade-off between unbiased split variable selection and runtime. In benchmark studies of prediction performance on simulated and real datasets, the new method performs better than random survival forests if informative dichotomous variables are combined with uninformative variables with more categories and better than conditional inference forests if non-linear covariate effects are included. In a runtime comparison, the method proves to be computationally faster than both alternatives, if a simple p-value approximation is used. Copyright © 2017 John Wiley & Sons, Ltd.


Subject(s)
Data Interpretation, Statistical , Models, Statistical , Survival Analysis , Effect Modifier, Epidemiologic , Humans , Monte Carlo Method , Proportional Hazards Models
19.
Arthritis Rheumatol ; 68(12): 2953-2963, 2016 12.
Article in English | MEDLINE | ID: mdl-27333332

ABSTRACT

OBJECTIVE: To compare the phenotype, clinical course, and outcome of myeloperoxidase (MPO)-antineutrophil cytoplasmic antibody (ANCA)-positive granulomatosis with polyangiitis (Wegener's) (GPA) to proteinase 3 (PR3)-ANCA-positive GPA and to MPO-ANCA-positive microscopic polyangiitis (MPA). METHODS: We characterized all MPO-ANCA-positive patients classified as having GPA by the European Medicines Agency algorithm who attended our center, in a retrospective chart review. A second cohort of patients with PR3-ANCA-positive GPA matched for age and sex was characterized. Patients with MPO-ANCA-positive MPA from a recently published cohort were also included in the analysis. All patients were diagnosed and treated according to a standardized interdisciplinary approach at a vasculitis referral center. RESULTS: Comprehensive data were available for 59 patients with MPO-ANCA-positive GPA, and they were compared to 118 patients with PR3-ANCA-positive GPA and 138 patients with MPO-ANCA-positive MPA. We observed a distinct phenotype in MPO-ANCA-positive GPA as compared to the other 2 cohorts. Patients with MPO-ANCA-positive GPA frequently had limited disease without severe organ involvement, had a high prevalence of subglottic stenosis, and had less need for aggressive immunosuppressive therapy (cyclophosphamide/rituximab). The patients with MPO-ANCA-positive GPA were also younger than the MPA patients and were predominantly female (significantly different than the MPA cohort). While GPA patients had higher survival rates compared to MPA patients (due to a high prevalence of pulmonary fibrosis in MPA), patients with MPO-ANCA had significantly lower relapse rates than those with PR3-ANCA. CONCLUSION: Patients with MPO-ANCA-positive GPA show significantly different clinical courses compared to those with PR3-ANCA-positive GPA or MPO-ANCA-positive MPA, which should be considered in their clinical management. Classification according to ANCA specificity may improve the evaluation of relapse risk.


Subject(s)
Antibodies, Antineutrophil Cytoplasmic/immunology , Granulomatosis with Polyangiitis/immunology , Peroxidase/immunology , Adolescent , Adult , Age Distribution , Aged , Case-Control Studies , Cyclophosphamide/therapeutic use , Eye Diseases/epidemiology , Eye Diseases/etiology , Female , Germany , Granulomatosis with Polyangiitis/complications , Granulomatosis with Polyangiitis/drug therapy , Granulomatosis with Polyangiitis/epidemiology , Humans , Immunologic Factors/therapeutic use , Immunosuppressive Agents/therapeutic use , Kidney Diseases/epidemiology , Kidney Diseases/etiology , Laryngostenosis/epidemiology , Laryngostenosis/etiology , Male , Middle Aged , Myeloblastin/immunology , Otorhinolaryngologic Diseases/epidemiology , Otorhinolaryngologic Diseases/etiology , Peripheral Nervous System Diseases/epidemiology , Peripheral Nervous System Diseases/etiology , Proportional Hazards Models , Recurrence , Retrospective Studies , Rituximab/therapeutic use , Survival Rate , Young Adult
20.
BMC Bioinformatics ; 17: 145, 2016 Mar 31.
Article in English | MEDLINE | ID: mdl-27029549

ABSTRACT

BACKGROUND: Random forests have often been claimed to uncover interaction effects. However, if and how interaction effects can be differentiated from marginal effects remains unclear. In extensive simulation studies, we investigate whether random forest variable importance measures capture or detect gene-gene interactions. We define capturing an interaction as the ability to identify a variable that acts through an interaction with another one, whereas detection is the ability to identify an interaction effect as such. RESULTS: Of the single importance measures, the Gini importance captured interaction effects in most of the simulated scenarios; however, these effects were masked by marginal effects in other variables. With the permutation importance, the proportion of captured interactions was lower in all cases. Pairwise importance measures performed about equally, with a slight advantage for the joint variable importance method. However, the overall fraction of detected interactions was low. In almost all scenarios, the detection fraction in a model with only marginal effects was larger than in a model with an interaction effect only. CONCLUSIONS: Random forests are generally capable of capturing gene-gene interactions, but current variable importance measures are unable to detect them as interactions. In most cases, interactions are masked by marginal effects and cannot be differentiated from them. Consequently, caution is warranted when claiming that random forests uncover interactions.
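
To make the difference between a marginal effect and an interaction effect concrete, the toy sketch below builds a balanced dataset in which the outcome depends on two binary variables only through their interaction (an exclusive-or pattern). This is an illustrative assumption and not one of the simulation scenarios from the study; the function name and the `replicates` parameter are hypothetical.

```python
from itertools import product


def xor_effects(replicates=25):
    """Marginal effects and interaction contrast for y = x1 XOR x2."""
    # Balanced design: every (x1, x2) combination appears equally often.
    rows = [(a, b, a ^ b)
            for a, b in product((0, 1), repeat=2)
            for _ in range(replicates)]

    def mean_y(cond):
        ys = [y for a, b, y in rows if cond(a, b)]
        return sum(ys) / len(ys)

    # Marginal effect: difference in mean outcome between the two
    # levels of one variable, averaging over the other variable.
    marginal_x1 = mean_y(lambda a, b: a == 1) - mean_y(lambda a, b: a == 0)
    marginal_x2 = mean_y(lambda a, b: b == 1) - mean_y(lambda a, b: b == 0)

    # Interaction contrast: difference of differences across the four cells.
    interaction = ((mean_y(lambda a, b: (a, b) == (1, 1))
                    - mean_y(lambda a, b: (a, b) == (1, 0)))
                   - (mean_y(lambda a, b: (a, b) == (0, 1))
                      - mean_y(lambda a, b: (a, b) == (0, 0))))
    return marginal_x1, marginal_x2, interaction


print(xor_effects())  # → (0.0, 0.0, -2.0)
```

Both marginal effects are zero while the interaction contrast is not. A single-variable importance score that flags either variable here would capture the interaction in the sense used above, but the score alone would not show that the effect is an interaction rather than a marginal one.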


Subject(s)
Models, Genetic , Epistasis, Genetic , Linkage Disequilibrium , Polymorphism, Single Nucleotide