Results 1 - 20 of 54
1.
Biom J ; 66(3): e2300094, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38581099

ABSTRACT

Conditional power (CP) serves as a widely utilized approach for futility monitoring in group sequential designs. However, adopting CP methods may lead to inadequate control of the type II error rate at the desired level. In this study, we introduce a flexible beta spending function tailored to regulate the type II error rate while employing CP based on a predetermined standardized effect size for futility monitoring (a so-called CP-beta spending function). This function delineates the expenditure of the type II error rate across the entirety of the trial. Unlike other existing beta spending functions, the CP-beta spending function seamlessly incorporates the beta spending concept into the CP framework, facilitating precise stagewise control of the type II error rate during futility monitoring. In addition, the stopping boundaries derived from the CP-beta spending function can be calculated via integration, akin to other traditional beta spending function methods. Furthermore, the proposed CP-beta spending function accommodates various thresholds on the CP scale at different stages of the trial, ensuring its adaptability across different information time scenarios. These attributes make the CP-beta spending function competitive among other forms of beta spending functions and applicable, with straightforward implementation, to any trial in a group sequential design. Both a simulation study and an example from an acute ischemic stroke trial demonstrate that the proposed method accurately captures the expected power, even when the initially determined sample size does not account for futility stopping, and that it maintains the overall type I error rate well in the presence of evident futility.


Subject(s)
Ischemic Stroke; Research Design; Humans; Sample Size; Computer Simulation; Medical Futility
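
For orientation, the sketch below computes the conditional power quantity that the CP-beta spending function above is built around: one-sided CP at an interim look under the protocol-specified (design) effect. It is a minimal illustration, not the authors' CP-beta spending function; the drift is expressed through the design alpha and beta, and the futility threshold `cp_threshold` is a hypothetical choice.

```python
# Minimal sketch: conditional power (CP) at an interim look of a group
# sequential trial, computed under the protocol-specified (design) effect.
# This is NOT the CP-beta spending function from the paper; it only shows
# the CP quantity that the proposed spending function is built around.
from scipy.stats import norm

def conditional_power(z_interim, info_frac, alpha=0.025, beta=0.10):
    """One-sided CP given the interim z-value and information fraction t.

    Under the design alternative, the expected final z-statistic (drift)
    is z_{1-alpha} + z_{1-beta}. Given the interim value Z_t at information
    fraction t, the final statistic is normal with mean
    Z_t*sqrt(t) + drift*(1 - t) and variance (1 - t).
    """
    drift = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    t = info_frac
    mean_final = z_interim * t**0.5 + drift * (1 - t)
    crit = norm.ppf(1 - alpha)
    return 1 - norm.cdf((crit - mean_final) / (1 - t) ** 0.5)

# Example: halfway through the trial with a weak interim signal.
cp = conditional_power(z_interim=0.5, info_frac=0.5)
cp_threshold = 0.20        # hypothetical futility threshold on the CP scale
print(f"CP = {cp:.3f}; stop for futility: {cp < cp_threshold}")
```
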
2.
Br J Math Stat Psychol ; 77(3): 651-671, 2024 Nov.
Article in English | MEDLINE | ID: mdl-38623032

ABSTRACT

Inter-rater reliability (IRR) is one of the commonly used tools for assessing the quality of ratings from multiple raters. However, applicant selection procedures based on ratings from multiple raters usually result in a binary outcome: the applicant is either selected or not. This final outcome is not considered in IRR, which instead focuses on the ratings of the individual subjects or objects. We outline the connection between the ratings' measurement model (used for IRR) and a binary classification framework. We develop a simple way of approximating the probability of correctly selecting the best applicants, which allows us to compute the error probabilities of the selection procedure (i.e., the false positive and false negative rates) or their lower bounds. We draw connections between IRR and binary classification metrics, showing that the binary classification metrics depend solely on the IRR coefficient and the proportion of selected applicants. We assess the performance of the approximation in a simulation study and apply it in an example comparing the reliability of multiple grant peer review selection procedures. We also discuss other possible uses of the explored connections in contexts such as educational testing, psychological assessment, and health-related measurement, and implement the computations in the R package IRR2FPR.


Subject(s)
Models, Statistical; Humans; Reproducibility of Results; False Positive Reactions; Computer Simulation; Observer Variation; Probability; Peer Review/methods
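
The link described in the abstract above can be illustrated by brute force: simulate true applicant quality, add rater noise scaled to a chosen reliability, and compare selection by the noisy composite with selection by true quality. This is a Monte Carlo sketch, not the analytic approximation implemented in the authors' R package IRR2FPR; the reliability value and selection proportion are arbitrary.

```python
# Monte Carlo sketch: how inter-rater reliability (IRR) of a rating
# composite translates into false positive / false negative rates of a
# top-k selection procedure. Simulation illustration only, not the analytic
# approximation implemented in the R package IRR2FPR.
import numpy as np

rng = np.random.default_rng(1)
n_applicants, n_sim = 200, 2_000
irr = 0.70           # reliability of the composite rating (arbitrary value)
prop_selected = 0.2  # proportion of applicants selected (arbitrary value)

k = int(prop_selected * n_applicants)
fp = fn = 0.0
for _ in range(n_sim):
    true_score = rng.standard_normal(n_applicants)
    # Observed composite = true score + noise, with noise variance chosen
    # so that the reliability of the composite equals `irr`.
    noise = rng.standard_normal(n_applicants) * np.sqrt((1 - irr) / irr)
    observed = true_score + noise
    best = set(np.argsort(true_score)[-k:])    # truly best applicants
    chosen = set(np.argsort(observed)[-k:])    # applicants actually selected
    fp += len(chosen - best) / k               # selected but not truly best
    fn += len(best - chosen) / k               # truly best but not selected

# Note: with equally sized "best" and "chosen" sets the two proportions
# coincide by construction; they are reported separately for clarity.
print(f"false positive rate ~ {fp / n_sim:.3f}, false negative rate ~ {fn / n_sim:.3f}")
```
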
3.
Front Med (Lausanne) ; 10: 1215927, 2023.
Article in English | MEDLINE | ID: mdl-37663663

ABSTRACT

One of the most important statistical analyses when designing animal and human studies is the calculation of the required sample size. In this review, we define central terms in the context of sample size determination, including mean, standard deviation, statistical hypothesis testing, type I/II error, power, direction of effect, effect size, expected attrition, corrected sample size, and allocation ratio. We also provide practical examples of sample size calculations for animal and human studies based on pilot studies, on larger studies similar to the proposed study, or, if no previous studies are available, on estimated magnitudes of the effect size per Cohen and Sawilowsky.
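
Several of the terms defined in this review come together in the standard two-sample calculation sketched below: the normal-approximation formula for a standardized effect size, with the allocation ratio and expected-attrition correction the review mentions. It is a minimal sketch under the usual large-sample assumptions, not a reproduction of the review's worked examples, and the input values are hypothetical.

```python
# Minimal sketch: two-sample sample size for a continuous outcome using the
# normal approximation, with an allocation ratio and an attrition correction.
# Input values are hypothetical.
import math
from scipy.stats import norm

def two_sample_n(effect_size, alpha=0.05, power=0.80, ratio=1.0, attrition=0.0):
    """n1, n2 needed to detect a standardized effect size (Cohen's d).

    ratio     -- allocation ratio n2/n1
    attrition -- expected dropout proportion; sample size is inflated by
                 1 / (1 - attrition)
    """
    z_a = norm.ppf(1 - alpha / 2)      # two-sided test
    z_b = norm.ppf(power)
    n1 = (1 + 1 / ratio) * ((z_a + z_b) / effect_size) ** 2
    n1 = n1 / (1 - attrition)
    n2 = ratio * n1
    return math.ceil(n1), math.ceil(n2)

# Medium effect (d = 0.5), 90% power, 2:1 allocation, 10% expected attrition.
print(two_sample_n(effect_size=0.5, power=0.90, ratio=2.0, attrition=0.10))
```
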

4.
Int J Biostat ; 19(1): 1-19, 2023 05 01.
Article in English | MEDLINE | ID: mdl-35749155

ABSTRACT

It has been reported that about half of biological discoveries are irreproducible, and this irreproducibility has been partly attributed to poor statistical power, which in turn is largely due to small sample sizes. However, in molecular biology and medicine, limited biological resources and budgets mean that most experiments are conducted with small samples. The two-sample t-test controls bias through its degrees of freedom, but this also implies that the t-test has low power in small samples. A discovery made with low statistical power is likely to have poor reproducibility, so increasing statistical power is not a feasible way to enhance reproducibility in small-sample experiments. An alternative is to reduce the type I error rate. To this end, a so-called tα-test was developed. Both theoretical analysis and a simulation study demonstrate that the tα-test substantially outperforms the t-test; however, the tα-test reduces to the t-test when sample sizes exceed 15. Large-scale simulation studies and real experimental data show that, in small-sample experiments, the tα-test markedly reduced the type I error rate compared with the t-test and the Wilcoxon test, while retaining almost the same empirical power as the t-test. The null p-value density distribution explains why the tα-test has a much lower type I error rate than the t-test. One real experimental dataset provides a typical example in which the tα-test outperforms the t-test, and a microarray dataset showed that the tα-test had the best performance among five statistical methods. In addition, the density and cumulative distribution functions of the tα-statistic were derived mathematically, and the theoretical and observed distributions match well.


Subject(s)
Models, Statistical; Reproducibility of Results; Computer Simulation; Likelihood Functions; Sample Size
5.
Behav Res Methods ; 55(7): 3494-3503, 2023 10.
Article in English | MEDLINE | ID: mdl-36223007

ABSTRACT

Currently, the design standards for single-case experimental designs (SCEDs) are based on validity considerations as prescribed by the What Works Clearinghouse. However, there is a need for design considerations such as power based on statistical analyses. We derive and compute power for (AB)k designs with multiple cases, which are common in SCEDs. Our computations show that effect size has the largest impact on power, followed by the number of subjects and then the number of phase reversals. An effect size of 0.75 or higher, at least one set of phase reversals (i.e., k > 1), and at least three subjects yielded high power. The latter two conditions agree with current standards, which require either at least an ABAB design or a multiple baseline design with three subjects. An effect size of 0.75 or higher is not uncommon in SCEDs either. Autocorrelations, the number of time points per phase, and intraclass correlations had a smaller but non-negligible impact on power. In sum, the power analyses in the present study show that the conditions needed to meet power requirements are not unreasonable in SCEDs. The software code to compute power is available on GitHub for the use of the reader.


Subject(s)
Research Design; Humans
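
The roles of effect size, number of subjects, phase reversals, and autocorrelation described above can also be explored by simulation. The sketch below is a deliberately simple stand-in, not the authors' analytic power computations: each subject contributes a B-minus-A mean contrast from an (AB)^k series with AR(1) errors, and the contrasts are tested with a one-sample t-test. All parameter values are illustrative.

```python
# Simulation sketch of power for an (AB)^k single-case design with several
# subjects, under a deliberately simple model (AR(1) errors, subject-level
# B-minus-A contrasts tested with a one-sample t-test). The paper derives
# power analytically under its own model; treat this only as an illustration
# of how effect size, subjects, and reversals enter power.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)

def simulate_power(n_subjects=3, k=2, points_per_phase=5, effect=0.75,
                   rho=0.2, alpha=0.05, n_sim=2_000):
    rejections = 0
    for _ in range(n_sim):
        contrasts = []
        for _subj in range(n_subjects):
            n_obs = 2 * k * points_per_phase          # 2*k alternating phases
            e = np.empty(n_obs)
            e[0] = rng.standard_normal()
            for i in range(1, n_obs):                 # AR(1) errors, unit variance
                e[i] = rho * e[i - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()
            phase_is_b = (np.arange(n_obs) // points_per_phase) % 2 == 1
            y = e + effect * phase_is_b
            contrasts.append(y[phase_is_b].mean() - y[~phase_is_b].mean())
        c = np.asarray(contrasts)
        t_stat = c.mean() / (c.std(ddof=1) / np.sqrt(n_subjects))
        if t_stat > t.ppf(1 - alpha, df=n_subjects - 1):   # one-sided test
            rejections += 1
    return rejections / n_sim

print(f"estimated power ~ {simulate_power():.2f}")
```
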
6.
BMC Med Res Methodol ; 22(1): 244, 2022 Sep 19.
Article in English | MEDLINE | ID: mdl-36123631

ABSTRACT

BACKGROUND: Null Hypothesis Significance Testing (NHST) has been well criticised over the years yet remains a pillar of statistical inference. Although NHST is well described in terms of statistical models, most textbooks for non-statisticians present the null and alternative hypotheses (H0 and HA, respectively) in terms of differences between groups such as (µ1 = µ2) and (µ1 ≠ µ2), and HA is often stated to be the research hypothesis. Here we use propositional calculus to analyse the internal logic of NHST when couched in this popular terminology. The testable H0 is determined by analysing the scope and limits of the P-value and the test statistic's probability distribution curve. RESULTS: We propose a minimum axiom set for NHST in which it is taken as axiomatic that H0 is rejected if P-value < α. Using the common scenario of the comparison of the means of two sample groups as an example, the testable H0 is {(µ1 = µ2) ∧ [(x̄1 ≠ x̄2) due to chance alone]}. The H0 and HA pair should be exhaustive to avoid false dichotomies. This entails that HA is ¬{(µ1 = µ2) ∧ [(x̄1 ≠ x̄2) due to chance alone]}, rather than the research hypothesis (HT). To see the relationship between HA and HT, HA can be rewritten as the disjunction HA: {(µ1 = µ2) ∧ [(x̄1 ≠ x̄2) not due to chance alone]} ∨ {(µ1 ≠ µ2) ∧ [(x̄1 ≠ x̄2) not due to (µ1 ≠ µ2) alone]} ∨ {(µ1 ≠ µ2) ∧ [(x̄1 ≠ x̄2) due to (µ1 ≠ µ2) alone]}. This reveals that HT (the last disjunct) is just one possibility within HA. It is only by adding premises to NHST that HT or other conclusions can be reached. CONCLUSIONS: Using this popular terminology for NHST, analysis shows that the definitions of H0 and HA differ from those found in textbooks. In this framework, achieving a statistically significant result only justifies the broad conclusion that the results are not due to chance alone, not that the research hypothesis is true. More transparency is needed concerning the premises added to NHST to rig particular conclusions such as HT. There are also ramifications for the interpretation of Type I and II errors, as well as power, which do not specifically refer to HT as claimed by texts.

7.
Ecology ; 103(10): e3780, 2022 10.
Article in English | MEDLINE | ID: mdl-35657174

ABSTRACT

The Mantel test has been widely used in ecology and evolution, but over the last two decades it has been frequently critiqued because results were inconsistent with expectations and there were issues with Type I (false-positive) and Type II (false-negative) error rates. Three-matrix extensions of the Mantel test have been challenged for similar reasons. Even the null hypotheses underlying the Mantel test have been questioned. As a result, use of the Mantel test and its variants has been discouraged or limited to special situations. Here, we examine Mantel test criticisms, including the lack of agreement between traditional variable-based Pearson correlations (r) and observation-based Mantel correlations (rm), and the unusual Type I and Type II error rates. We propose an alternative proximity measure that resolves these issues. We use simulations and examples to contrast Mantel results based on Euclidean distance, squared Euclidean distance, and the simple difference (Diff) with traditional bivariate Pearson correlations. We demonstrate that using the simple difference in Mantel tests resolves the poor agreement between bivariate Pearson and Mantel correlations and restores appropriate Type I and Type II error rates (i.e., with r = cor(x, y) and rm = cor(dx, dy), if dx = Diff(x) and dy = Diff(y), then r = rm). We also show that the simple difference can provide solutions to issues with partial Mantel tests and distance-based MANOVA. Because our results resolve many of the issues with Mantel tests, we hope that these findings will restore the popularity of the Mantel test.


Subject(s)
Ecology; Ecology/methods
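
The identity r = rm quoted in the abstract is easy to verify numerically: when the pairwise "distance" is the signed simple difference Diff, the Mantel correlation over all off-diagonal pairs equals the ordinary Pearson correlation. The sketch below checks this on synthetic data; the permutation test normally attached to a Mantel statistic is omitted.

```python
# Numerical check of the identity described in the abstract: if the pairwise
# "distance" is the signed simple difference d_ij = x_i - x_j, the Mantel
# correlation (Pearson correlation over off-diagonal pairs) equals the
# ordinary bivariate Pearson correlation r = cor(x, y).
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.standard_normal(n)
y = 0.6 * x + 0.8 * rng.standard_normal(n)

def diff_matrix(v):
    return v[:, None] - v[None, :]          # signed simple difference

mask = ~np.eye(n, dtype=bool)               # off-diagonal pairs only
dx, dy = diff_matrix(x)[mask], diff_matrix(y)[mask]

r = np.corrcoef(x, y)[0, 1]                 # ordinary Pearson correlation
rm = np.corrcoef(dx, dy)[0, 1]              # Mantel correlation with Diff
print(f"r = {r:.6f}, rm = {rm:.6f}")        # the two agree

# For contrast, absolute (Euclidean) distances break the agreement:
rm_abs = np.corrcoef(np.abs(dx), np.abs(dy))[0, 1]
print(f"rm with absolute distances = {rm_abs:.6f}")
```
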
8.
J Rheumatol ; 49(8): 867-870, 2022 08.
Article in English | MEDLINE | ID: mdl-35105710

ABSTRACT

Power calculations are a key step in the design of research studies. However, in the medical literature power analysis is often performed inappropriately, as an attempt to help interpret the findings of a completed study rather than to choose an optimal sample size for a future study. The aim of this article is to provide a brief discussion of the drawbacks of performing these post hoc power calculations and to suggest best practices regarding the use of statistical power and the interpretation of study results. Specifically, power analysis should always be considered before any research study in order to choose an ideal sample size and/or to examine the feasibility of properly evaluating the study aims, but it should never be used to help interpret the results of an already completed study. Instead, 95% confidence intervals for effect sizes (eg, odds ratio, hazard ratio, mean difference) or other relevant parameter estimates should be used when attempting to draw conclusions from results, such as assessing the likelihood of a type II error (ie, a false negative finding).


Subject(s)
Research Design; Humans; Odds Ratio; Sample Size
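
The alternative the article recommends, reporting the effect estimate with a confidence interval rather than a post hoc power figure, is straightforward to compute. Below is a minimal sketch for an odds ratio from a hypothetical 2x2 table, using the standard log-odds-ratio normal approximation.

```python
# Minimal sketch of the recommended practice: report the effect estimate
# with a 95% confidence interval instead of a post hoc power calculation.
# The 2x2 counts below are hypothetical.
import math

a, b = 20, 80      # events / non-events in the treatment group
c, d = 30, 70      # events / non-events in the control group

or_hat = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo = math.exp(math.log(or_hat) - 1.96 * se_log_or)
hi = math.exp(math.log(or_hat) + 1.96 * se_log_or)
print(f"OR = {or_hat:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
# A CI that includes 1 but is narrow argues against a large missed effect,
# which is the question post hoc power is often (mis)used to address.
```
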
9.
Global Spine J ; 12(5): 1027-1028, 2022 Jun.
Article in English | MEDLINE | ID: mdl-34865556
10.
Methods Mol Biol ; 2249: 281-305, 2021.
Article in English | MEDLINE | ID: mdl-33871850

ABSTRACT

Performing well-powered, randomized, controlled trials is of fundamental importance in clinical research. The goal of sample size calculations is to ensure that statistical power is sufficiently high while the probability of falsely rejecting a true null hypothesis (type I error) is kept acceptably small. This chapter overviews the fundamentals of sample size calculation for standard types of outcomes in 2-group studies. It also considers (1) the problem of determining the size of the treatment effect that a study should be designed to detect, (2) modifications to sample size calculations to account for loss to follow-up and nonadherence, (3) options that can be used when initial calculations indicate that the feasible sample size is insufficient to provide adequate power, and (4) implications of using multiple primary end points. In addition, a discussion of cluster randomized trials is provided. Sample size estimates for longitudinal cohort studies must take account of confounding by baseline factors.


Subject(s)
Randomized Controlled Trials as Topic/methods; Research Design; Cohort Studies; Data Interpretation, Statistical; Humans; Models, Statistical; Sample Size
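
Two of the adjustments the chapter discusses are easy to illustrate: inflating a two-proportion sample size for loss to follow-up, and applying the usual design effect 1 + (m - 1)·ICC for a cluster randomized trial. The sketch below uses standard textbook formulas with illustrative values; the chapter's exact formulas may differ in detail.

```python
# Sketch of sample size for comparing two proportions, with the inflation
# factors the chapter discusses: loss to follow-up and (for a cluster
# randomized trial) the design effect 1 + (m - 1) * ICC. Values are
# illustrative.
import math
from scipy.stats import norm

def n_two_proportions(p1, p2, alpha=0.05, power=0.80):
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p1 - p2) ** 2               # per group, before adjustments

n = n_two_proportions(p1=0.30, p2=0.20)

loss_to_follow_up = 0.15                      # hypothetical dropout proportion
n_adj = n / (1 - loss_to_follow_up)

cluster_size, icc = 20, 0.02                  # hypothetical cluster-trial inputs
design_effect = 1 + (cluster_size - 1) * icc
n_cluster_trial = n_adj * design_effect

print(f"per group: {math.ceil(n)}, after dropout: {math.ceil(n_adj)}, "
      f"cluster-randomized: {math.ceil(n_cluster_trial)}")
```
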
11.
J Neurosci Methods ; 357: 109155, 2021 06 01.
Article in English | MEDLINE | ID: mdl-33781790

ABSTRACT

BACKGROUND: Methods for p-value correction are criticized for either increasing Type II error or improperly reducing Type I error in large exploratory data analyses. This text considers patterns in the probability vectors resulting from mass univariate analysis to correct p-values, where clusters of significant p-values may indicate true H0 rejection. NEW METHOD: We used ERP experimental data from control and ADHD boys to test the method. The log10 of the p-vector was convolved with a Gaussian window whose length was set to the shortest lag above which the autocorrelation of each ERP wave can be assumed to have vanished. We ran Monte Carlo (MC) simulations to (1) evaluate confidence intervals of the rejected and non-rejected areas of our data, (2) evaluate differences between corrected, uncorrected, and simulated p-vectors in terms of the distribution of significant p-values, and (3) empirically verify the Type I error rate (comparing 10,000 pairs of mixed samples with control and ADHD subjects). RESULTS: The differences of the corrected p-vector from the simulated and from the raw p-vector were, respectively, minimal and maximal when the convolution window length was set by the autocorrelation. COMPARISON WITH EXISTING METHODS: Our method was less conservative, while FDR methods rejected essentially all significant p-values. The MC simulations showed a difference of 2.78 ± 4.83% (20 channels) from the corrected p-vector, while the difference from the raw p-vector was 596 ± 5.00% (p = 0.0003). CONCLUSION: As a cluster-based correction that adopts adaptive parameters, the present method appears biologically and statistically suitable for correcting p-values in mass univariate analyses of ERP waves.


Subject(s)
Monte Carlo Method; Computer Simulation; Humans; Male; Probability
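
The core operation in the abstract, convolving the log10 p-vector with a Gaussian window so that isolated small p-values are down-weighted while clusters survive, can be sketched schematically as below. The window length here is fixed and the decision rule is a simplified assumption; the paper derives the window from the ERP autocorrelation and uses its own thresholding, neither of which is reproduced.

```python
# Schematic of the core operation described in the abstract: smooth the
# log10 of a vector of p-values from a mass univariate analysis with a
# Gaussian window, so isolated "significant" p-values are attenuated and
# clusters survive. Window length and decision rule are simplified here.
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(3)

# Fake p-vector over 300 time points: mostly null, plus a cluster of small
# p-values between samples 120 and 160 (purely synthetic data).
p = rng.uniform(size=300)
p[120:160] = rng.uniform(0.0005, 0.02, size=40)

log_p = np.log10(p)
smoothed = gaussian_filter1d(log_p, sigma=5)   # Gaussian window, sigma in samples

alpha = 0.05
cluster_flags = smoothed < np.log10(alpha)     # assumed (simplified) decision rule
runs = np.flatnonzero(cluster_flags)
if runs.size:
    print(f"{runs.size} time points flagged; first/last flagged sample: "
          f"{runs[0]} / {runs[-1]}")
else:
    print("no time points flagged")
```
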
12.
Twin Res Hum Genet ; 23(2): 87-89, 2020 04.
Article in English | MEDLINE | ID: mdl-32638684

ABSTRACT

Dr Nick Martin has made enormous contributions to the field of behavior genetics over the past 50 years. Of his many seminal papers that have had a profound impact, we focus on his early work on the power of twin studies. He was among the first to recognize the importance of sample size calculation before conducting a study to ensure sufficient power to detect the effects of interest. The elegant approach he developed, based on the noncentral chi-squared distribution, has been adopted by subsequent researchers for other genetic study designs, and today remains a standard tool for power calculations in structural equation modeling and other areas of statistical analysis. The present brief article discusses the main aspects of his seminal paper, and how it led to subsequent developments, by him and others, as the field of behavior genetics evolved into the present era.


Subject(s)
Genetics, Behavioral/history; Twin Studies as Topic/history; Twins/genetics; Genetics, Behavioral/statistics & numerical data; History, 20th Century; History, 21st Century; Humans; Sample Size; Twin Studies as Topic/statistics & numerical data; Twins/statistics & numerical data
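
The noncentral chi-squared approach to power described in the tribute above can be sketched briefly: for a likelihood-ratio test, the statistic under the alternative is approximately noncentral chi-squared, and the noncentrality parameter scales with sample size. The per-pair noncentrality used below is a hypothetical value, not one taken from the paper.

```python
# Sketch of the noncentral chi-squared power calculation described in the
# article: for a likelihood-ratio test with `df` degrees of freedom, power
# follows from the noncentral chi-squared distribution, with a noncentrality
# parameter proportional to sample size. The per-pair noncentrality below
# is hypothetical.
from scipy.stats import chi2, ncx2

def power(ncp, df=1, alpha=0.05):
    crit = chi2.ppf(1 - alpha, df)
    return ncx2.sf(crit, df, ncp)

def pairs_needed(ncp_per_pair, target_power=0.80, df=1, alpha=0.05):
    n = 1
    while power(n * ncp_per_pair, df, alpha) < target_power:
        n += 1
    return n

ncp_per_pair = 0.02   # hypothetical noncentrality contributed by one twin pair
print(f"power with 300 pairs: {power(300 * ncp_per_pair):.2f}")
print(f"pairs needed for 80% power: {pairs_needed(ncp_per_pair)}")
```
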
13.
Am J Phys Anthropol ; 172(4): 521-527, 2020 08.
Article in English | MEDLINE | ID: mdl-32570289

ABSTRACT

Statistically nonsignificant (p > .05) results from a null hypothesis significance test (NHST) are often mistakenly interpreted as evidence that the null hypothesis is true, that is, that there is "no effect" or "no difference." However, many of these results occur because the study had low statistical power to detect an effect. Power below 50% is common, in which case a result of no statistical significance is more likely to be incorrect than correct. The inference of "no effect" is not valid even if power is high. NHST assumes that the null hypothesis is true; p is the probability of the data under the assumption that there is no effect. A statistical test cannot confirm what it assumes. These incorrect statistical inferences could be eliminated if decisions based on p values were replaced by a biological evaluation of effect sizes and their confidence intervals. For a single study, the observed effect size is the best estimate of the population effect size, regardless of the p value. Unlike p values, confidence intervals provide information about the precision of the observed effect. In the biomedical and pharmacology literature, methods have been developed to evaluate whether effects are "equivalent," rather than zero, as tested with NHST. These methods could be used by biological anthropologists to evaluate the presence or absence of meaningful biological effects. Most of what appears to be known about no difference or no effect between sexes, between populations, between treatments, and other circumstances in the biological anthropology literature is based on invalid statistical inference.


Subject(s)
Anthropology, Physical; Data Interpretation, Statistical; Models, Statistical; Anthropology, Physical/standards; Anthropology, Physical/statistics & numerical data; Humans
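
The equivalence methods the abstract points to can be illustrated with a minimal two one-sided tests (TOST) sketch on summary statistics. TOST is one common equivalence procedure, not necessarily the specific method the authors recommend, and the summary values and equivalence margin below are hypothetical.

```python
# Minimal sketch of an equivalence test (two one-sided tests, TOST) of the
# kind the abstract points to as an alternative to interpreting p > .05 as
# "no effect". Summary statistics and the equivalence margin are hypothetical.
from scipy.stats import t

def tost(mean_diff, se, df, margin):
    """Equivalence is claimed only if BOTH one-sided tests reject."""
    t_lower = (mean_diff + margin) / se     # H0: diff <= -margin
    t_upper = (mean_diff - margin) / se     # H0: diff >= +margin
    p_lower = t.sf(t_lower, df)
    p_upper = t.cdf(t_upper, df)
    return max(p_lower, p_upper)

# Hypothetical example: observed difference 0.1 (SE 0.15, df 58), with
# effects inside +/- 0.5 considered biologically negligible.
p_tost = tost(mean_diff=0.1, se=0.15, df=58, margin=0.5)
print(f"TOST p = {p_tost:.4f} -> equivalence "
      f"{'supported' if p_tost < 0.05 else 'not supported'}")
```
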
15.
Pharm Stat ; 19(5): 720-732, 2020 09.
Article in English | MEDLINE | ID: mdl-32338443

ABSTRACT

In monitoring clinical trials, the question of futility, or whether the data thus far suggest that the results at the final analysis are unlikely to be statistically successful, is regularly of interest over the course of a study. However, the opposite viewpoint of whether the study is sufficiently demonstrating proof of concept (POC) and should continue is a valuable consideration and ultimately should be addressed with high POC power so that a promising study is not prematurely terminated. Conditional power is often used to assess futility, and this article interconnects the ideas of assessing POC for the purpose of study continuation with conditional power, while highlighting the importance of the POC type I error and the POC type II error for study continuation or not at the interim analysis. Methods for analyzing subgroups motivate the interim analyses to maintain high POC power via an adjusted interim POC significance level criterion for study continuation or testing against an inferiority margin. Furthermore, two versions of conditional power based on the assumed effect size or the observed interim effect size are considered. Graphical displays illustrate the relationship of the POC type II error for premature study termination to the POC type I error for study continuation and the associated conditional power criteria.


Subject(s)
Clinical Trials as Topic/methods; Research Design; Humans; Medical Futility; Proof of Concept Study; Sample Size
16.
Int J Epidemiol ; 49(3): 968-978, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32176282

ABSTRACT

BACKGROUND: It is unclear how multiple treatment comparisons are managed in the analysis of multi-arm trials, particularly with regard to reducing type I (false positive) and type II (false negative) errors. METHODS: We conducted a cohort study of clinical-trial protocols that were approved by research ethics committees in the UK, Switzerland, Germany and Canada in 2012. We examined the use of multiple-testing procedures to control the overall type I error rate. We created a decision tool to determine the need for multiple-testing procedures. We compared the result of the decision tool to the analysis plan in the protocol. We also compared the pre-specified analysis plans in trial protocols to their publications. RESULTS: Sixty-four protocols for multi-arm trials were identified, of which 50 involved multiple testing. Nine of the 50 trials (18%) used a single-step multiple-testing procedure such as a Bonferroni correction, and 17 (38%) used an ordered sequence of primary comparisons to control the overall type I error. Based on our decision tool, 45 of the 50 protocols (90%) required use of a multiple-testing procedure, but only 28 of the 45 (62%) accounted for multiplicity in their analysis or provided a rationale if no multiple-testing procedure was used. We identified 32 protocol-publication pairs, of which 8 planned a global-comparison test and 20 planned a multiple-testing procedure in their trial protocol. However, four of these eight trials (50%) did not use the planned global-comparison test. Likewise, 3 of the 20 trials (15%) did not perform the planned multiple-testing procedure in the publication. The sample size of our study was small and we did not have access to the statistical-analysis plans of the included trials. CONCLUSIONS: Strategies to reduce type I and type II errors are inconsistently employed in multi-arm trials. Important analytical differences exist between planned analyses in clinical-trial protocols and subsequent publications, which may suggest selective reporting of analyses.


Subject(s)
Clinical Trials as Topic; Clinical Trials as Topic/methods; Cohort Studies; Humans; Multilevel Analysis; Research Design
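
Two procedures of the kind the surveyed protocols reported, the single-step Bonferroni correction and the step-down Holm procedure, are sketched below on made-up p-values for a hypothetical three-arm trial. This is a generic illustration, not the study's decision tool.

```python
# Sketch of two multiple-testing procedures of the kind surveyed in the
# study: the single-step Bonferroni correction and the step-down Holm
# procedure. The p-values for a hypothetical three-arm trial are made up.
p_values = {"arm A vs control": 0.012, "arm B vs control": 0.030, "A vs B": 0.20}
alpha = 0.05
m = len(p_values)

# Bonferroni: compare each p-value with alpha / m.
bonferroni = {k: p <= alpha / m for k, p in p_values.items()}

# Holm: order the p-values, compare the i-th smallest with alpha / (m - i),
# and stop rejecting at the first non-rejection.
holm, rejected = {}, True
for i, (k, p) in enumerate(sorted(p_values.items(), key=lambda kv: kv[1])):
    rejected = rejected and (p <= alpha / (m - i))
    holm[k] = rejected

print("Bonferroni:", bonferroni)
print("Holm      :", holm)
```
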
17.
Pharm Stat ; 19(4): 454-467, 2020 07.
Article in English | MEDLINE | ID: mdl-32061188

ABSTRACT

Phase II clinical trials designed for evaluating a drug's treatment effect can be either single-arm or double-arm. A single-arm design tests the null hypothesis that the response rate of a new drug is lower than a fixed threshold, whereas a double-arm scheme takes a more objective comparison of the response rate between the new treatment and the standard of care through randomization. Although the randomized design is the gold standard for efficacy assessment, various situations may arise where a single-arm pilot study prior to a randomized trial is necessary. To combine the single- and double-arm phases and pool the information together for better decision making, we propose a Single-To-double ARm Transition design (START) with switching hypotheses tests, where the first stage compares the new drug's response rate with a minimum required level and imposes a continuation criterion, and the second stage utilizes randomization to determine the treatment's superiority. We develop a software package in R to calibrate the frequentist error rates and perform simulation studies to assess the trial characteristics. Finally, a metastatic pancreatic cancer trial is used for illustrating the decision rules under the proposed START design.


Subject(s)
Clinical Trials, Phase II as Topic/methods; Research Design; Computer Simulation; Humans
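
A simplified two-stage sketch in the spirit of the single-to-double-arm transition described above is given below: stage 1 checks the new drug's response rate against a minimum required level as a continuation criterion, and stage 2 compares the randomized arms. It does not reproduce the paper's calibration of the frequentist error rates (done in the authors' R software), and all counts and thresholds are hypothetical.

```python
# Simplified sketch in the spirit of the START single-to-double-arm idea:
# stage 1 is a single-arm check against a minimum required response rate,
# stage 2 a randomized two-proportion comparison. Not the calibrated START
# design; counts and thresholds are hypothetical.
import math
from scipy.stats import binom, norm

# ---- Stage 1: continue only if the data are inconsistent with a response
# rate at or below the minimum required level p0.
p0, n1, responses1 = 0.15, 25, 8
p_stage1 = binom.sf(responses1 - 1, n1, p0)   # P(X >= responses1 | p0)
continue_to_stage2 = p_stage1 < 0.10          # hypothetical continuation cutoff
print(f"stage 1 p = {p_stage1:.3f}, continue: {continue_to_stage2}")

# ---- Stage 2: randomized comparison of response rates (two-proportion z-test).
if continue_to_stage2:
    x_t, n_t = 22, 60        # treatment responses / arm size (hypothetical)
    x_c, n_c = 12, 60        # control responses / arm size (hypothetical)
    p_t, p_c = x_t / n_t, x_c / n_c
    p_pool = (x_t + x_c) / (n_t + n_c)
    z = (p_t - p_c) / math.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    print(f"stage 2 z = {z:.2f}, one-sided p = {norm.sf(z):.4f}")
```
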
18.
Ecology ; 101(3): e02945, 2020 03.
Article in English | MEDLINE | ID: mdl-31834622

ABSTRACT

Identifying species interactions and detecting when ecological communities are structured by them is an important problem in ecology and biogeography. Ecologists have developed specialized statistical hypothesis tests to detect patterns indicative of community-wide processes in their field data. In this respect, null model approaches have proved particularly popular. The freedom allowed in choosing the null model and statistic to construct a hypothesis test leads to a proliferation of possible hypothesis tests from which ecologists can choose to detect these processes. Here, we point out some serious shortcomings of a popular approach to choosing the best hypothesis for the ecological problem at hand that involves benchmarking different hypothesis tests by assessing their performance on artificially constructed data sets. Terminological errors concerning the use of Type I and Type II errors that underlie these approaches are discussed. We argue that the key benchmarking methods proposed in the literature are not a sound guide for selecting null hypothesis tests, and further, that there is no simple way to benchmark null hypothesis tests. Surprisingly, the basic problems identified here do not appear to have been addressed previously, and these methods are still being used to develop and test new null models and summary statistics, from quantifying community structure (e.g., nestedness and modularity) to analyzing ecological networks.


Subject(s)
Benchmarking; Biota
19.
Ecology ; 100(4): e02640, 2019 04.
Article in English | MEDLINE | ID: mdl-30712257

ABSTRACT

Researchers have long viewed patterns of species association as key to understanding the processes that structure communities. Community-level tests of species association have received the most attention; however, pairwise species associations may offer greater opportunity for linking patterns to specific mechanisms. Although several tests of pairwise association have been developed, there remain gaps in our understanding of their performance. Consequently, it is unclear whether these methods reliably detect patterns of association, or if any one method is superior. We maximized association patterns for single species pairs in synthetic community matrices and examined how accurately five pairwise association tests found that pair, while not finding others (i.e., type I and II error rates). All tests are more likely to miss patterns of association than to falsely detect them. When we maximized association for a species pair that included one or more rare or common species, tests were frequently unable to identify that pair as significantly associated. Consequently, these tests are best suited for identifying significant associations between pairs of species that occur in an intermediate number of samples; for such pairs, three of the five tests considered here detected 100% of the pairs for which we maximized associations.
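
To make the setting concrete, the sketch below applies one common pairwise association test, Fisher's exact test on the 2x2 co-occurrence table of two species across sites, to a synthetic presence/absence matrix. The abstract does not name the five tests the authors benchmarked, so this is purely illustrative rather than a reproduction of their comparison.

```python
# Sketch of one common pairwise species-association test: Fisher's exact
# test on the 2x2 co-occurrence table of two species across sites, applied
# to a synthetic presence/absence matrix. Illustrative only; the five tests
# benchmarked in the paper are not named in the abstract.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(11)
n_sites = 60

# Synthetic community: species A and B tend to co-occur through a shared
# latent site suitability; species C occurs independently.
suitability = rng.uniform(size=n_sites)
sp_a = (rng.uniform(size=n_sites) < 0.2 + 0.6 * suitability).astype(int)
sp_b = (rng.uniform(size=n_sites) < 0.2 + 0.6 * suitability).astype(int)
sp_c = (rng.uniform(size=n_sites) < 0.5).astype(int)

def pairwise_association(x, y):
    table = [[np.sum((x == 1) & (y == 1)), np.sum((x == 1) & (y == 0))],
             [np.sum((x == 0) & (y == 1)), np.sum((x == 0) & (y == 0))]]
    return fisher_exact(table)          # odds ratio and two-sided p-value

print("A-B:", pairwise_association(sp_a, sp_b))
print("A-C:", pairwise_association(sp_a, sp_c))
```
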

20.
Nutr Clin Pract ; 34(1): 60-72, 2019 Feb.
Article in English | MEDLINE | ID: mdl-30570169

ABSTRACT

Evidence-based medicine (EBM) has become a fixture in today's medical practice. Evidence consists of memorialized observations and should be contrasted with dogmatic pronouncements and/or hypotheses. Evidence has varying degrees of reliability. The randomized clinical trial (RCT) or a systematic review of RCTs is accorded the highest level of credibility and expert opinion the lowest. This ranking reflects the internal validity (degree to which factors in the study interfere with the gathering or interpretation of the observations) of the study design; more valid designs are more credible. The provision of healthcare requires an almost constant assessment of evidence. In so doing, there are a number of principles of EBM that need to be kept in mind: Association can never prove causation. Various methodologic biases can influence conclusions made in both RCTs and observational studies. The strength of RCTs is in the elimination of confounding bias. Surrogate outcomes must be validated in RCTs assessing how they are changed compared with the clinical outcomes. Subgroup analyses cannot prove hypotheses although they can generate them. P < 0.05 is not the same as truth. Type I errors are more likely to occur when multiple analyses are performed, when trials are prematurely stopped for perceived benefit when there was no a priori plan to do so, or in small papers with dramatic results that are selectively published. The failure to find a difference does not mean that no difference exists (type II error).


Subject(s)
Evidence-Based Medicine; Humans; Reproducibility of Results; Research Design; Thinking
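
The multiplicity point in the abstract, that type I errors become more likely when multiple analyses are performed, is captured by the family-wise error rate for independent tests, 1 - (1 - alpha)^m. A tiny illustration:

```python
# Illustration of the multiplicity point made in the abstract: with m
# independent comparisons each tested at alpha = 0.05, the chance of at
# least one type I error is 1 - (1 - alpha)^m.
alpha = 0.05
for m in (1, 5, 10, 20):
    print(f"{m:>2} tests: P(at least one false positive) = {1 - (1 - alpha) ** m:.2f}")
```
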