1.
N Engl J Stat Data Sci ; 1(1): 46-61, 2023 Apr.
Article in English | MEDLINE | ID: mdl-37986713

ABSTRACT

Random forests are a powerful machine learning tool that captures complex relationships between independent variables and an outcome of interest. Trees built in a random forest depend on several hyperparameters, one of the more critical being the node size. Breiman's original algorithm controls node size by limiting the size of the parent node, so that a node cannot be split if it has fewer than a specified number of observations. We propose that this hyperparameter should instead be defined as the minimum number of observations in each terminal node. The two random forest approaches are compared in the regression context based on estimated generalization error, squared bias, and variance of the resulting predictions in a number of simulated datasets. Additionally, the two approaches are applied to type 2 diabetes data obtained from the National Health and Nutrition Examination Survey. We have also developed a straightforward method for incorporating weights into the random forest analysis of survey data. Our results demonstrate that generalization error under the proposed approach is competitive with that attained from the original random forest approach when the data have large random error variability. The R code created from this work is available and includes an illustration.
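The contrast between the two node-size controls can be sketched outside the authors' R code. As an assumption for illustration, scikit-learn's RandomForestRegressor exposes both rules: min_samples_split limits the size of a parent node (Breiman's rule), while min_samples_leaf enforces a minimum terminal-node size (the proposed rule). A minimal sketch on simulated data, not the authors' implementation:

```python
# Sketch contrasting the two node-size controls on simulated regression data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Breiman-style control: a node with fewer than 10 observations is not split.
rf_parent = RandomForestRegressor(min_samples_split=10, random_state=0).fit(X_tr, y_tr)

# Terminal-node control: every leaf must retain at least 5 observations.
rf_leaf = RandomForestRegressor(min_samples_leaf=5, random_state=0).fit(X_tr, y_tr)

for name, rf in [("parent-node rule", rf_parent), ("terminal-node rule", rf_leaf)]:
    mse = mean_squared_error(y_te, rf.predict(X_te))
    print(f"{name}: test MSE = {mse:.1f}")
```

The abstract's comparison is richer (generalization error, squared bias, and variance across many simulated datasets), but the two hyperparameters above are the levers being compared.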

2.
Stat Methods Med Res ; 32(9): 1799-1810, 2023 09.
Article in English | MEDLINE | ID: mdl-37621099

ABSTRACT

Lexis diagrams are rectangular arrays of event rates indexed by age and period. Analysis of Lexis diagrams is a cornerstone of cancer surveillance research. Typically, population-based descriptive studies analyze multiple Lexis diagrams defined by sex, tumor characteristics, race/ethnicity, geographic region, etc. Inevitably, the amount of information per Lexis diagram diminishes with increasing stratification. Several methods have been proposed to smooth observed Lexis diagrams up front to clarify salient patterns and improve summary estimates of averages, gradients, and trends. In this article, we develop a novel bivariate kernel-based smoother that incorporates two key innovations. First, for any given kernel, we calculate its singular value decomposition and select an optimal truncation point (the number of leading singular vectors to retain) based on the bias-corrected Akaike information criterion. Second, we model-average over a panel of candidate kernels with diverse shapes and bandwidths. The truncated model averaging approach is fast, automatic, has excellent performance, and provides a variance-covariance matrix that takes model selection into account. We present an in-depth case study (invasive estrogen receptor-negative breast cancer incidence among non-Hispanic white women in the United States) and simulate operating characteristics for 20 representative cancers. The truncated model averaging approach consistently outperforms any fixed kernel. Our results support the routine use of the truncated model averaging approach in descriptive studies of cancer.
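The SVD-truncation idea can be illustrated with a toy numpy sketch. Note the hedge: the paper truncates the SVD of the kernel (and model-averages over kernels with an AICc-chosen truncation point); the sketch below applies truncation directly to a noisy rate surface with a hand-picked rank, purely to show why keeping only the leading singular vectors suppresses noise. All names and data here are illustrative.

```python
# Toy "Lexis diagram": a smooth age-by-period rate surface plus noise,
# denoised by keeping only the k leading singular components.
import numpy as np

rng = np.random.default_rng(0)
age = np.linspace(0, 1, 40)[:, None]
period = np.linspace(0, 1, 30)[None, :]
truth = np.exp(2 * age + period)                 # smooth underlying rates
observed = truth + rng.normal(0, 0.3, truth.shape)

# Truncated SVD: retain the k leading singular vectors.
U, s, Vt = np.linalg.svd(observed, full_matrices=False)
k = 2                                            # hand-picked here; AICc-chosen in the paper
smoothed = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rmse_raw = np.sqrt(np.mean((observed - truth) ** 2))
rmse_smooth = np.sqrt(np.mean((smoothed - truth) ** 2))
print(f"RMSE raw: {rmse_raw:.3f}, RMSE truncated: {rmse_smooth:.3f}")
```

Because the smooth surface here is low-rank while the noise spreads across all singular directions, the truncated reconstruction sits much closer to the truth than the raw observations.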


Subject(s)
Breast Neoplasms , Humans , Female , United States , Incidence
3.
Stat Med ; 42(22): 3936-3955, 2023 09 30.
Article in English | MEDLINE | ID: mdl-37401188

ABSTRACT

Probability-based criteria are proposed for assessing the cost-effectiveness of a new treatment relative to a standard treatment when there are multiple effectiveness measures. Depending on the preferences of a policy maker, there are several options for defining such criteria. Two such metrics are investigated in detail. The first is the conditional probability that the new treatment is more effective with respect to the multiple effectiveness measures among patients who incur lower costs under the new treatment. The second is the conditional probability that the new treatment is less costly among patients who obtain greater health benefits under the new treatment. The metrics offer considerable flexibility to a policy maker, as thresholds for cost and effectiveness can be incorporated into them. Parametric confidence limits are developed using a percentile bootstrap approach assuming multivariate normality for the joint distribution of log(cost) and the effectiveness measures. A non-parametric estimation procedure is also developed using the theory of U-statistics. Numerical results indicate that the proposed confidence limits accurately maintain coverage probabilities. The methodologies are illustrated using a study on the treatment of type 2 diabetes. Code implementing the proposed methods is provided in the supporting information.
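The first metric and its percentile bootstrap interval can be sketched with a simple pairwise (U-statistic style) proportion. This is a hedged illustration on simulated two-arm data, not the authors' estimator: the exact conditioning, thresholds, and bootstrap scheme in the paper may differ, and all variable names are invented for the sketch.

```python
# Conditional probability that the new treatment is more effective on both
# measures, given that it is less costly, estimated over all cross-arm pairs.
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Simulated arms: columns are (cost, effectiveness 1, effectiveness 2).
new = np.column_stack([rng.lognormal(1.0, 0.3, n),
                       rng.normal(1.2, 0.5, n), rng.normal(1.1, 0.5, n)])
std = np.column_stack([rng.lognormal(1.2, 0.3, n),
                       rng.normal(1.0, 0.5, n), rng.normal(1.0, 0.5, n)])

def cond_prob(new, std):
    # Among pairs where the new-arm patient costs less, the fraction where
    # the new-arm patient is also more effective on both measures.
    lower_cost = new[:, None, 0] < std[None, :, 0]
    more_eff = (new[:, None, 1] > std[None, :, 1]) & (new[:, None, 2] > std[None, :, 2])
    return (lower_cost & more_eff).sum() / lower_cost.sum()

est = cond_prob(new, std)
boot = [cond_prob(new[rng.integers(0, n, n)], std[rng.integers(0, n, n)])
        for _ in range(500)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"estimate {est:.3f}, 95% percentile bootstrap CI ({lo:.3f}, {hi:.3f})")
```

Cost or effectiveness thresholds, as mentioned in the abstract, would enter as offsets in the pairwise comparisons inside cond_prob.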


Subject(s)
Cost-Effectiveness Analysis , Humans , Cost-Benefit Analysis
4.
Sci Rep ; 12(1): 15113, 2022 09 06.
Article in English | MEDLINE | ID: mdl-36068261

ABSTRACT

Random forests are a popular type of machine learning model. They are relatively robust to overfitting, unlike some other machine learning models, and adequately capture non-linear relationships between an outcome of interest and multiple independent variables. Standard random forest models have relatively few adjustable hyperparameters, among them the minimum size of the terminal nodes on each tree. The usual stopping rule, as proposed by Breiman, stops tree expansion by limiting the size of the parent nodes, so that a node cannot be split if it has fewer than a specified number of observations. Recently, an alternative stopping criterion has been proposed that stops tree expansion so that all terminal nodes have at least a minimum number of observations. The present paper proposes three generalisations of this idea that limit growth in regression random forests based on the variance, range, or inter-centile range of the outcome within a node. The new approaches are applied to diabetes data obtained from the National Health and Nutrition Examination Survey and four other datasets (Tasmanian Abalone data, Boston Housing crime rate data, Los Angeles ozone concentration data, MIT servo data). The empirical analysis presented herein demonstrates that the new stopping rules yield mean square prediction error competitive with that of standard random forest models. In general, using the inter-centile range statistic to control tree expansion yields much less variation in mean square prediction error, and the mean square prediction error is also closer to optimal. The Fortran code developed is provided in the Supplementary Material.
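One of the generalised stopping rules can be sketched with a toy single-feature regression tree: splitting stops once the inter-centile range (here taken as the 10th-to-90th percentile spread) of the outcome within a node falls below a threshold. This is an illustrative Python sketch, not the authors' Fortran implementation; the threshold, split rule, and centiles are assumptions for the example.

```python
# Toy regression tree that stops splitting when a node's outcome
# inter-centile range drops below a threshold.
import numpy as np

def grow(x, y, icr_threshold=0.5, depth=0, max_depth=10):
    # Stopping rule: the node is a leaf once its outcome is homogeneous
    # enough, measured by the 10th-90th percentile spread.
    icr = np.percentile(y, 90) - np.percentile(y, 10)
    if icr < icr_threshold or depth >= max_depth or len(y) < 4:
        return {"pred": float(y.mean())}
    cut = float(np.median(x))                 # simple median split for the sketch
    left = x <= cut
    if left.all() or (~left).all():
        return {"pred": float(y.mean())}
    return {"cut": cut,
            "lo": grow(x[left], y[left], icr_threshold, depth + 1, max_depth),
            "hi": grow(x[~left], y[~left], icr_threshold, depth + 1, max_depth)}

def predict(node, xi):
    while "cut" in node:
        node = node["lo"] if xi <= node["cut"] else node["hi"]
    return node["pred"]

rng = np.random.default_rng(2)
x = rng.uniform(0, 4, 300)
y = np.sin(x) + rng.normal(0, 0.1, 300)
tree = grow(x, y)
print(predict(tree, 1.5))
```

Swapping the inter-centile range for the node variance or the full range of the outcome gives the other two stopping rules described in the abstract.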


Subject(s)
Machine Learning , Boston , Los Angeles , Nutrition Surveys