1.
Biom J ; 66(1): e2200237, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38285404

ABSTRACT

The two-sample problem is one of the earliest problems in statistics: given two samples, the question is whether or not the observations were sampled from the same distribution. Many statistical tests have been developed for this problem and evaluated in simulation studies, but hardly any study has attempted a neutral comparison. In this paper, we introduce an open science initiative that potentially allows for neutral comparisons of two-sample tests. It is designed as an open-source R package, a repository, and an online R Shiny app. This paper describes the principles and design of the system and illustrates its use.
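As a minimal illustration of what such a comparison involves (a sketch, not the package's actual interface, which the abstract does not describe), the following R snippet runs several standard two-sample tests on the same pair of simulated samples; the sample sizes and distributions are arbitrary choices for the example.

    set.seed(1)
    x <- rnorm(50, mean = 0,   sd = 1.0)              # sample 1
    y <- rnorm(60, mean = 0.3, sd = 1.2)              # sample 2: shifted mean, larger spread

    # The same pair of samples run through several classical two-sample tests
    tests <- list(
      welch_t  = t.test(x, y),                        # difference in means
      wilcoxon = wilcox.test(x, y),                   # location shift (rank-sum)
      ks       = ks.test(x, y),                       # any distributional difference
      ansari   = ansari.test(x, y)                    # difference in scale
    )
    sapply(tests, function(tt) tt$p.value)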


Subject(s)
Computer Simulation
2.
Rev. peru. med. exp. salud publica ; 40(2): 220-228, Apr-Jun 2023. tab
Article in Spanish | LILACS, INS-PERU | ID: biblio-1509038

ABSTRACT

This article introduces randomized clinical trials and basic concepts of statistical inference. We present methods for calculating the sample size by outcome type and by the hypothesis to be tested, together with R code for their application. We also describe four methods for adjusting the original sample size when interim analyses are planned. Our aim is to introduce these topics in a simple and concrete way, presenting the mathematical expressions that support the results and their implementation in available statistical software, thereby bringing health sciences students closer to statistics and to the use of statistical software, aspects rarely considered in their training.
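The paper's own R code is not reproduced in the abstract; as a rough sketch, base R's power functions already cover sample size calculation for a continuous and a binary endpoint, and a Bonferroni-style split of alpha illustrates one conservative way to adjust for interim analyses (the effect sizes and the number of looks below are invented for the example).

    # Continuous outcome: detect a mean difference of 5 units (SD 10), 80% power, two-sided alpha 0.05
    power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)

    # Binary outcome: detect an improvement in the event proportion from 20% to 35%
    power.prop.test(p1 = 0.20, p2 = 0.35, sig.level = 0.05, power = 0.80)

    # Crude adjustment for K planned looks (one interim plus the final analysis):
    # split alpha Bonferroni-style; group-sequential boundaries are less conservative.
    K <- 2
    power.t.test(delta = 5, sd = 10, sig.level = 0.05 / K, power = 0.80)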

3.
Entropy (Basel) ; 25(5)2023 Apr 28.
Article in English | MEDLINE | ID: mdl-37238489

ABSTRACT

We obtain expressions for the asymptotic distributions of the Rényi and Tsallis entropies of order q, and of the Fisher information, when computed on the maximum likelihood estimator of probabilities from multinomial random samples. We verify that these asymptotic models, two of which (Tsallis and Fisher) are normal, describe a variety of simulated data well. In addition, we obtain test statistics for comparing (possibly different types of) entropies from two samples without requiring the same number of categories. Finally, we apply these tests to social survey data and verify that the results are consistent with, but more general than, those obtained with a χ2 test.
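For concreteness, a short R sketch of the plug-in (maximum likelihood) estimators the abstract refers to: the Rényi and Tsallis entropies of order q evaluated at the observed multinomial proportions. The category counts are invented; the asymptotic distributions and two-sample tests derived in the paper are not reproduced.

    renyi   <- function(p, q) log(sum(p^q)) / (1 - q)     # Rényi entropy of order q (q != 1)
    tsallis <- function(p, q) (1 - sum(p^q)) / (q - 1)    # Tsallis entropy of order q (q != 1)

    counts <- c(120, 80, 55, 30, 15)                      # hypothetical multinomial counts
    p.hat  <- counts / sum(counts)                        # maximum likelihood estimate of the cell probabilities

    renyi(p.hat, q = 2)
    tsallis(p.hat, q = 2)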

4.
Psychometrika ; 88(2): 636-655, 2023 06.
Article in English | MEDLINE | ID: mdl-36892727

ABSTRACT

Research questions in the human sciences often ask whether and when a process changes across time. In functional MRI studies, for instance, researchers may seek to assess the onset of a shift in brain state. In daily diary studies, the researcher may seek to identify when a person's psychological process shifts following treatment. The timing and presence of such a change may be meaningful in terms of understanding state changes. Currently, dynamic processes are typically quantified as static networks where edges indicate temporal relations among nodes, which may be variables reflecting emotions, behaviors, or brain activity. Here we describe three methods for detecting changes in such correlation networks from a data-driven perspective. Networks here are quantified using the lag-0 pairwise correlation (or covariance) estimates as the representation of the dynamic relations among variables. We present three methods for change point detection: dynamic connectivity regression, a max-type method, and a PCA-based method. The change point detection methods each include different ways to test whether two given correlation network patterns from different segments in time are significantly different. These tests can also be used outside of the change point detection approaches to test any two given blocks of data. We compare the three methods for change point detection, as well as the complementary significance testing approaches, on simulated and empirical functional connectivity fMRI data examples.
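The abstract gives no formulas, so the sketch below is only a generic max-type comparison loosely in the spirit of the second method: the largest absolute difference between Fisher-z-transformed lag-0 correlations of two segments is the statistic, with a permutation null obtained by reshuffling time points between the segments (which assumes exchangeable observations and ignores temporal dependence; the data are simulated).

    set.seed(2)
    p    <- 6
    seg1 <- matrix(rnorm(100 * p), ncol = p)                 # segment before the candidate change point
    seg2 <- matrix(rnorm(120 * p), ncol = p)
    seg2[, 1] <- seg2[, 1] + 0.6 * seg2[, 2]                 # altered dependence after the change point

    # Max-type statistic: largest |difference| between Fisher-z-transformed correlations
    fisher.z <- function(r) atanh(r)
    max.stat <- function(a, b) {
      lower <- lower.tri(diag(ncol(a)))
      max(abs(fisher.z(cor(a))[lower] - fisher.z(cor(b))[lower]))
    }

    obs <- max.stat(seg1, seg2)

    # Permutation null: reassign rows (time points) to the two segments
    pooled <- rbind(seg1, seg2)
    n1     <- nrow(seg1)
    perm   <- replicate(999, {
      idx <- sample(nrow(pooled))
      max.stat(pooled[idx[1:n1], ], pooled[idx[-(1:n1)], ])
    })
    mean(c(perm, obs) >= obs)                                # permutation p-value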


Subject(s)
Brain Mapping , Magnetic Resonance Imaging , Humans , Magnetic Resonance Imaging/methods , Brain Mapping/methods , Neural Pathways , Psychometrics , Brain/diagnostic imaging
5.
J Appl Stat ; 49(14): 3659-3676, 2022.
Article in English | MEDLINE | ID: mdl-36246862

ABSTRACT

The problem of testing the intercept and slope parameters of doubly multivariate linear models with site-dependent covariates using Rao's score test (RST) is studied. The RST statistic is developed for a block exchangeable covariance structure on the error vector under the assumption of multivariate normality. We compare the developed RST statistic with the likelihood ratio test (LRT) statistic. Monte Carlo simulations indicate that the RST statistic is much more accurate than the LRT statistic and requires significantly less computation time. The proposed method is illustrated with an example of multiple response variables measured on multiple trees in a single plot in an agricultural study.

8.
Ann Appl Probab ; 32(4): 2967-3003, 2022 Aug.
Article in English | MEDLINE | ID: mdl-36034074

ABSTRACT

We study the sample covariance matrix for real-valued data with general population covariance, as well as MANOVA-type covariance estimators in variance components models under null hypotheses of global sphericity. In the limit as matrix dimensions increase proportionally, the asymptotic spectra of such estimators may have multiple disjoint intervals of support, possibly intersecting the negative half line. We show that the distribution of the extremal eigenvalue at each regular edge of the support has a GOE Tracy-Widom limit. Our proof extends a comparison argument of Ji Oon Lee and Kevin Schnelli, replacing a continuous Green function flow by a discrete Lindeberg swapping scheme.

9.
Int J Mol Sci ; 23(14)2022 Jul 10.
Article in English | MEDLINE | ID: mdl-35886973

ABSTRACT

Making statistical inference on quantities defining various characteristics of a temporally measured biochemical process and analyzing its variability across different experimental conditions is a core challenge in various branches of science. This problem is particularly difficult when the amount of data that can be collected is limited in terms of both the number of replicates and the number of time points per process trajectory. We propose a method for analyzing the variability of smooth functionals of the growth or production trajectories associated with such processes across different experimental conditions. Our modeling approach is based on a spline representation of the mean trajectories. We also develop a bootstrap-based inference procedure for the parameters while accounting for possible multiple comparisons. This methodology is applied to study two types of quantities, the "time to harvest" and the "maximal productivity", in the context of an experiment on the production of recombinant proteins. We complement the findings with extensive numerical experiments comparing the effectiveness of different types of bootstrap procedures for various tests of hypotheses. These numerical experiments convincingly demonstrate that the proposed method yields reliable inference on complex characteristics of the processes even in a data-limited environment where more traditional methods for statistical inference are typically not reliable.
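As a rough sketch of the kind of analysis described (not the authors' implementation), the R code below fits a smoothing spline to each replicate trajectory, takes the maximum of the fitted curve as a stand-in for "maximal productivity", and bootstraps over replicates for an interval on its mean; the trajectories, noise level, and choice of functional are all assumptions made for the example.

    set.seed(3)
    times <- seq(0, 48, by = 6)                       # sparse sampling grid (hours), assumed for the example
    n.rep <- 5                                        # few replicates, as in a data-limited setting
    # hypothetical production trajectories: logistic-shaped mean plus noise
    traj  <- replicate(n.rep,
                       10 / (1 + exp(-(times - 24) / 5)) + rnorm(length(times), sd = 0.4))

    # Smooth functional of one replicate: "maximal productivity" = maximum of the fitted spline
    max.prod <- function(y) {
      fit  <- smooth.spline(times, y)
      grid <- seq(min(times), max(times), length.out = 200)
      max(predict(fit, grid)$y)
    }

    est <- mean(apply(traj, 2, max.prod))

    # Nonparametric bootstrap over replicates; with so few replicates the interval is wide
    boot <- replicate(1000, {
      idx <- sample(n.rep, replace = TRUE)
      mean(apply(traj[, idx, drop = FALSE], 2, max.prod))
    })
    c(estimate = est, quantile(boot, c(0.025, 0.975)))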


Subject(s)
Research Design , Recombinant Proteins/genetics
10.
MethodsX ; 9: 101660, 2022.
Article in English | MEDLINE | ID: mdl-35345788

ABSTRACT

Large sets of autocorrelated data are common in fields such as remote sensing and genomics. For example, remote sensing can produce maps of information for millions of pixels, and the information from nearby pixels will likely be spatially autocorrelated. Although there are well-established statistical methods for testing hypotheses using autocorrelated data, these methods become computationally impractical for large datasets.
• The method developed here makes it feasible to perform F-tests, likelihood ratio tests, and t-tests for large autocorrelated datasets. The method involves subsetting the dataset into partitions, analyzing each partition separately, and then combining the separate tests to give an overall test.
• The separate statistical tests on partitions are non-independent, because the points in different partitions are not independent. Therefore, combining separate analyses of partitions requires accounting for the non-independence of the test statistics among partitions.
• The methods can be applied to a wide range of data, including not only purely spatial data but also spatiotemporal data. For spatiotemporal data, it is possible to estimate coefficients from time-series models at different spatial locations and then analyze the spatial distribution of the estimates. The spatial analysis can be simplified by estimating spatial autocorrelation directly from the spatial autocorrelation among time series.
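The R sketch below only illustrates the partition-and-combine structure on simulated (non-autocorrelated) data: each partition is analyzed separately, the per-partition statistics are turned into z-scores, and the combined test inflates the variance of their mean by a common correlation among partitions. That correlation is simply assumed here; estimating it from the spatial autocorrelation is the substance of the method.

    set.seed(4)
    n.part <- 20; n.per <- 500
    # hypothetical large dataset: response and covariate within each partition
    # (spatial autocorrelation is not simulated here, only the workflow is shown)
    dat   <- data.frame(part = rep(1:n.part, each = n.per), x = rnorm(n.part * n.per))
    dat$y <- 0.05 * dat$x + rnorm(nrow(dat))

    # Analyze each partition separately: t-statistic of the slope, converted to a z-score
    z <- sapply(split(dat, dat$part), function(d) {
      tval <- summary(lm(y ~ x, data = d))$coefficients["x", "t value"]
      unname(qnorm(pt(tval, df = nrow(d) - 2)))
    })

    # Combine: the mean of m equicorrelated z-scores has variance (1 + (m - 1) * rho) / m.
    # rho is ASSUMED here; estimating it from the spatial autocorrelation is the paper's contribution.
    rho    <- 0.10
    m      <- length(z)
    z.comb <- mean(z) / sqrt((1 + (m - 1) * rho) / m)
    2 * pnorm(-abs(z.comb))                           # combined two-sided p-value

With the assumed correlation set to zero this reduces to Stouffer's combination, which is exactly what the non-independence of partitions makes invalid for autocorrelated data.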

11.
Genes (Basel) ; 13(1)2022 01 14.
Article in English | MEDLINE | ID: mdl-35052480

ABSTRACT

The inference of ancestry has become part of the services many forensic genetic laboratories provide. Interest in ancestry may arise from the need to provide investigative leads or to identify the region of origin in cases of unidentified missing persons. Many biostatistical methods have been developed for the study of population structure in the field of population genetics. However, the challenges and questions are slightly different in the context of forensic genetics, where the origin of a specific sample is of interest rather than the understanding of population histories and genealogies. In this paper, the methodologies for modelling population admixture and inferring ancestral populations are reviewed, with a focus on their strengths and weaknesses in relation to ancestry inference in the forensic context.


Subject(s)
Ethnicity/genetics , Forensic Genetics/methods , Genetic Markers , Genetics, Population , Polymorphism, Single Nucleotide , Racial Groups/genetics , Humans
12.
Preprint in Portuguese | SciELO Preprints | ID: pps-3389

ABSTRACT

Data analysis is a fundamental step in the development of scientific projects. Before starting a project, the researcher needs to plan their experiments and analyses clearly, ensuring a robust approach that is protected from the most elementary biases. This document reports the creation of the "heRcules" repository, which will provide public access to script templates in the R language for the analysis of scientific data, with an emphasis on disciplines of the Biological and Health Sciences. The model presented here provides scripts for essential tasks in planning, analysis, visualization, and hypothesis testing, including sample size calculation, statistical power calculation, spreadsheet import, vector and data frame creation, descriptive statistics, file export, plot creation (base R and ggplot2), outlier tests, normality tests, hypothesis tests, and notebook creation with R Markdown. The "heRcules" repository is hosted on GitHub, which will ensure efficient, free, and collaborative access to these resources.
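A few of the listed tasks can be sketched in a handful of lines of base R and ggplot2; this is not the repository's actual script, only an indication of the kind of template it describes (the built-in iris data set stands in for an imported spreadsheet).

    library(ggplot2)

    dat <- iris[iris$Species != "setosa", ]           # two groups to compare
    dat$Species <- droplevels(dat$Species)

    # Descriptive statistics and a per-group normality check
    aggregate(Sepal.Length ~ Species, data = dat,
              FUN = function(v) c(mean = mean(v), sd = sd(v)))
    by(dat$Sepal.Length, dat$Species, shapiro.test)

    # Outlier screen (boxplot rule) and a plot
    boxplot.stats(dat$Sepal.Length)$out
    ggplot(dat, aes(Species, Sepal.Length)) + geom_boxplot() + theme_minimal()

    # Hypothesis test comparing the two groups
    t.test(Sepal.Length ~ Species, data = dat)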



13.
Econom J ; 24(2): C1-C39, 2021 May.
Article in English | MEDLINE | ID: mdl-34594155

ABSTRACT

This paper presents a simple decision-theoretic economic approach for analyzing social experiments with compromised random assignment protocols that are only partially documented. We model administratively constrained experimenters who satisfice in seeking covariate balance. We develop design-based small-sample hypothesis tests that use worst-case (least favorable) randomization null distributions. Our approach accommodates a variety of compromised experiments, including imperfectly documented re-randomization designs. To make our analysis concrete, we focus much of our discussion on the influential Perry Preschool Project. We reexamine previous estimates of program effectiveness using our methods. The choice of how to model reassignment vitally affects inference.
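The worst-case randomization null distributions of the paper depend on the partially documented assignment protocol and are not reconstructed here; the R sketch below shows only the basic design-based ingredient they build on, a randomization (permutation) test of a treatment effect under a fully documented complete randomization, with simulated data.

    set.seed(5)
    n     <- 60
    treat <- sample(rep(c(0, 1), n / 2))              # a fully documented complete randomization
    y     <- 0.4 * treat + rnorm(n)                   # hypothetical outcome

    obs <- mean(y[treat == 1]) - mean(y[treat == 0])

    # Re-randomize assignments under the sharp null of no treatment effect for any unit
    null <- replicate(5000, {
      t.star <- sample(treat)
      mean(y[t.star == 1]) - mean(y[t.star == 0])
    })
    mean(abs(null) >= abs(obs))                       # design-based (randomization) p-value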

14.
Biometrics ; 77(3): 1037-1049, 2021 09.
Article in English | MEDLINE | ID: mdl-33434289

ABSTRACT

Changepoint detection methods are used in many areas of science and engineering, for example, in the analysis of copy number variation data to detect abnormalities in copy numbers along the genome. Despite the broad array of available tools, methodology for quantifying our uncertainty in the strength (or the presence) of given changepoints post-selection is lacking. Post-selection inference offers a framework to fill this gap, but the most straightforward application of these methods results in low-powered hypothesis tests and leaves open several important questions about practical usability. In this work, we carefully tailor post-selection inference methods toward changepoint detection, focusing on copy number variation data. To accomplish this, we study commonly used changepoint algorithms: binary segmentation, two of its most popular variants (wild and circular binary segmentation), and the fused lasso. We implement some of the latest developments in post-selection inference theory, mainly auxiliary randomization, which improves power but requires Markov chain Monte Carlo algorithms (importance sampling and hit-and-run sampling) to carry out the tests. We also provide recommendations for improving practical usability, detailed simulations, and example analyses on array comparative genomic hybridization as well as sequencing data.
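Of the changepoint algorithms named, plain binary segmentation is simple enough to sketch; the R function below recursively splits a series at the maximizer of the CUSUM statistic for a change in mean (the threshold and noise-scale estimate are common textbook choices, not the paper's). It covers only the detection step; the post-selection inference machinery (auxiliary randomization, importance sampling, hit-and-run sampling) is not reproduced.

    binseg <- function(x, thresh, lo = 1, hi = length(x)) {
      n <- hi - lo + 1
      if (n < 4) return(integer(0))
      y <- x[lo:hi]
      k <- 1:(n - 1)
      # CUSUM statistic for a mean change after position k within this segment
      stat <- sqrt(k * (n - k) / n) *
        abs(cumsum(y)[k] / k - (sum(y) - cumsum(y)[k]) / (n - k))
      k.hat <- which.max(stat)
      if (stat[k.hat] < thresh) return(integer(0))
      cp <- lo + k.hat - 1
      sort(c(binseg(x, thresh, lo, cp), cp, binseg(x, thresh, cp + 1, hi)))
    }

    set.seed(6)
    x <- c(rnorm(100, 0), rnorm(80, 1.5), rnorm(120, 0.5))   # two true changepoints, at 100 and 180
    sigma <- mad(diff(x)) / sqrt(2)                          # robust noise-scale estimate
    binseg(x, thresh = sigma * sqrt(2 * log(length(x))))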


Subject(s)
Algorithms , DNA Copy Number Variations , Comparative Genomic Hybridization , DNA Copy Number Variations/genetics , Markov Chains , Monte Carlo Method
15.
BMC Med Res Methodol ; 20(1): 244, 2020 09 30.
Article in English | MEDLINE | ID: mdl-32998683

ABSTRACT

BACKGROUND: Researchers often misinterpret and misrepresent statistical outputs. This abuse has led to a large literature on modification or replacement of testing thresholds and P-values with confidence intervals, Bayes factors, and other devices. Because the core problems appear cognitive rather than statistical, we review some simple methods to aid researchers in interpreting statistical outputs. These methods emphasize logical and information concepts over probability, and thus may be more robust to common misinterpretations than are traditional descriptions. METHODS: We use the Shannon transform of the P-value p, also known as the binary surprisal or S-value s = -log2(p), to provide a measure of the information supplied by the testing procedure, and to help calibrate intuitions against simple physical experiments like coin tossing. We also use tables or graphs of test statistics for alternative hypotheses, and interval estimates for different percentile levels, to thwart fallacies arising from arbitrary dichotomies. Finally, we reinterpret P-values and interval estimates in unconditional terms, which describe compatibility of data with the entire set of analysis assumptions. We illustrate these methods with a reanalysis of data from an existing record-based cohort study. CONCLUSIONS: In line with other recent recommendations, we advise that teaching materials and research reports discuss P-values as measures of compatibility rather than significance, compute P-values for alternative hypotheses whenever they are computed for null hypotheses, and interpret interval estimates as showing values of high compatibility with data, rather than regions of confidence. Our recommendations emphasize cognitive devices for displaying the compatibility of the observed data with various hypotheses of interest, rather than focusing on single hypothesis tests or interval estimates. We believe these simple reforms are well worth the minor effort they require.
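The S-value transformation quoted in the abstract is a one-liner; the R snippet below computes it for a few arbitrary p-values and shows the coin-tossing calibration: an S-value of s bits conveys roughly the same surprise as seeing s heads in a row from a fair coin.

    s.value <- function(p) -log2(p)                   # binary surprisal (S-value) of a p-value

    p <- c(0.25, 0.05, 0.01, 0.005)
    data.frame(p = p,
               s.bits = round(s.value(p), 2),
               heads.in.a.row = round(s.value(p)))    # surprise of roughly s consecutive heads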


Subject(s)
Cognition , Semantics , Bayes Theorem , Cohort Studies , Confidence Intervals , Humans , Probability
16.
BMC Med Res Methodol ; 20(1): 197, 2020 07 25.
Article in English | MEDLINE | ID: mdl-32711456

ABSTRACT

BACKGROUND: Under competing risks, the commonly used sub-distribution hazard ratio (SHR) is not easy to interpret clinically and is valid only under the proportional sub-distribution hazard (SDH) assumption. This paper introduces an alternative statistical measure: the restricted mean time lost (RMTL). METHODS: First, the definition and estimation methods of the measures are introduced. Second, based on the differences in RMTLs, a basic difference test (Diff) and a supremum difference test (sDiff) are constructed. Then, the corresponding sample size estimation method is proposed. The statistical properties of the methods and the estimated sample size are evaluated using Monte Carlo simulations, and these methods are also applied to two real examples. RESULTS: The simulation results show that sDiff performs well and has relatively high test efficiency in most situations. Regarding sample size calculation, sDiff exhibits good performance in various situations. The methods are illustrated using two examples. CONCLUSIONS: RMTL can meaningfully summarize treatment effects for clinical decision making, which can then be reported with the SDH ratio for competing risks data. The proposed sDiff test and the two calculated sample size formulas have wide applicability and can be considered in real data analysis and trial design.
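The restricted mean time lost for a given cause is the area under that cause's cumulative incidence function up to a horizon tau; with fully observed (uncensored) competing-risks data it reduces to the simple average computed in the R sketch below on simulated data. The censored-data estimator, the Diff and sDiff tests, and the sample size formulas of the paper are not reproduced.

    set.seed(7)
    n     <- 300
    T1    <- rexp(n, rate = 0.10)                     # latent time to the event of interest
    T2    <- rexp(n, rate = 0.05)                     # latent time to the competing event
    time  <- pmin(T1, T2)
    cause <- ifelse(T1 <= T2, 1, 2)                   # 1 = event of interest, 2 = competing risk

    tau <- 10                                         # restriction time (horizon)
    # RMTL for cause 1 up to tau = area under its cumulative incidence function
    # = expected time lost to that event within [0, tau] (no censoring assumed here)
    rmtl1 <- mean(pmax(tau - time, 0) * (cause == 1))
    rmtl1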


Subject(s)
Proportional Hazards Models , Computer Simulation , Humans , Monte Carlo Method , Sample Size , Time Factors
17.
Stat Med ; 39(17): 2291-2307, 2020 07 30.
Article in English | MEDLINE | ID: mdl-32478440

ABSTRACT

In lifetime data, such as those arising in cancer studies, there may be long-term survivors, which leads to heavy censoring at the end of the follow-up period. Since a standard survival model is not appropriate for these data, a cure model is needed. In the literature, covariate hypothesis tests for cure models are limited to parametric and semiparametric methods. We fill this important gap by proposing a nonparametric covariate hypothesis test for the probability of cure in mixture cure models. A bootstrap method is proposed to approximate the null distribution of the test statistic. The procedure can be applied to any type of covariate and could be extended to the multivariate setting. Its efficiency is evaluated in a Monte Carlo simulation study. Finally, the method is applied to a colorectal cancer dataset.


Subject(s)
Models, Statistical , Survivors , Computer Simulation , Humans , Monte Carlo Method , Probability
18.
J Res Natl Inst Stand Technol ; 125: 125003, 2020.
Article in English | MEDLINE | ID: mdl-38343525

ABSTRACT

Given a composite null hypothesis ℋ0, test supermartingales are non-negative supermartingales with respect to ℋ0 with an initial value of 1. Large values of test supermartingales provide evidence against ℋ0. As a result, test supermartingales are an effective tool for rejecting ℋ0, particularly when the p-values obtained are very small and serve as certificates against the null hypothesis. Examples include the rejection of local realism as an explanation of Bell test experiments in the foundations of physics and the certification of entanglement in quantum information science. Test supermartingales have the advantage of being adaptable during an experiment and allowing for arbitrary stopping rules. By inversion of acceptance regions, they can also be used to determine confidence sets. We used an example to compare the performance of test supermartingales for computing p-values and confidence intervals to Chernoff-Hoeffding bounds and the "exact" p-value. The example is the problem of inferring the probability of success in a sequence of Bernoulli trials. There is a cost in using a technique that has no restriction on stopping rules, and, for a particular test supermartingale, our study quantifies this cost.
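For the Bernoulli example mentioned, one concrete test supermartingale under H0: p <= p0 is the running likelihood ratio that reweights each trial toward an alternative success probability q > p0; by Ville's inequality, one over its running maximum is a p-value that remains valid under any stopping rule. The R sketch below uses arbitrary p0, q, and simulated trials; the paper's comparison with Chernoff-Hoeffding bounds and exact p-values is not reproduced.

    set.seed(8)
    p0 <- 0.5                                         # null success probability (H0: p <= p0)
    q  <- 0.7                                         # alternative the martingale "bets" on (any q > p0 works)
    x  <- rbinom(200, 1, 0.65)                        # simulated Bernoulli trials

    # Likelihood-ratio test supermartingale: M_n = prod_i f_q(x_i) / f_p0(x_i), with M_0 = 1
    lr <- ifelse(x == 1, q / p0, (1 - q) / (1 - p0))
    M  <- cumprod(lr)

    # Ville's inequality: P(max_n M_n >= 1/alpha) <= alpha under H0, for any stopping rule
    min(1, 1 / max(M))                                # anytime-valid p-value after 200 trials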

19.
Biom J ; 61(1): 162-165, 2019 01.
Article in English | MEDLINE | ID: mdl-30417414

ABSTRACT

A well-known problem in classical two-tailed hypothesis testing is that P-values go to zero when the sample size goes to infinity, irrespective of the effect size. This pitfall can make tests based on very large samples potentially unreliable. In this note, we propose to test for relevant differences to overcome this issue. We illustrate the proposed test on a real data set of about 40 million privately insured patients.
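One elementary version of such a test moves the null hypothesis to H0: |mu1 - mu2| <= Delta and rejects only when the confidence interval for the difference lies entirely outside [-Delta, Delta]; the R sketch below contrasts this with the classical test on two large simulated samples, using an arbitrary relevance margin Delta. The test actually proposed in the note may differ in its details.

    set.seed(9)
    n <- 1e6                                          # very large samples
    x <- rnorm(n, mean = 0.000, sd = 1)
    y <- rnorm(n, mean = 0.005, sd = 1)               # tiny, practically irrelevant shift

    t.test(x, y)$p.value                              # classical test: small p-value despite irrelevance

    Delta <- 0.1                                      # smallest difference considered relevant
    ci <- t.test(x, y, conf.level = 0.95)$conf.int
    ci[1] > Delta | ci[2] < -Delta                    # relevant-difference test: reject H0: |diff| <= Delta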


Subject(s)
Biometry/methods , Emergency Service, Hospital/statistics & numerical data , Humans , Sample Size , Virus Diseases/epidemiology
20.
Cancer Epidemiol ; 56: 83-89, 2018 10.
Article in English | MEDLINE | ID: mdl-30099328

ABSTRACT

BACKGROUND: Biomarker candidates are often ranked using P-values. Standard P-value calculations use normal or logit-normal approximations, which may not be correct for the small P-values and small sample sizes common in discovery research. METHODS: We compared exact P-values, correct by definition, with logit-normal approximations in a simulated study of 40 cases and 160 controls. The key measure of biomarker performance was sensitivity at 90% specificity. Data for 3000 uninformative false markers and 30 informative true markers were generated randomly. We also analyzed real data for 2371 plasma protein markers measured in 121 breast cancer cases and 121 controls. RESULTS: In our simulation, using the same discovery criterion, exact P-values led to the discovery of 24 true and 82 false biomarkers, while logit-normal approximate P-values yielded 20 true and 106 false biomarkers. The estimated number of true discoveries was substantially off for approximate P-values: the logit-normal method estimated 42 but actually yielded 20. The exact method estimated 22, very close to the 24 true discoveries actually made. Although these results are based on one specific simulation, qualitatively similar results were obtained from 10 random repetitions. With real data, ranking candidate biomarkers by exact P-values rather than approximate P-values resulted in a very different ordering of the markers. CONCLUSIONS: Exact P-values, which correspond to permutation tests with non-parametric rank statistics such as empirical ROC statistics, are preferred over approximate P-values. Approximate P-values can lead to inappropriate biomarker selection rules and incorrect conclusions. IMPACT: Using exact P-values in place of approximate P-values in discovery research may improve the yield of biomarkers that validate clinically.
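The performance measure and the permutation construction can be made concrete in a few lines of R: sensitivity at 90% specificity is read off the empirical ROC curve (threshold at the 90th percentile of the controls), and a permutation p-value is obtained by shuffling case/control labels. The case and control counts mirror the simulation in the abstract, but the marker values are invented, and full exact enumeration is replaced by a large number of random permutations.

    set.seed(10)
    n.case <- 40; n.ctrl <- 160                        # counts as in the simulated study
    marker <- c(rnorm(n.case, mean = 0.5), rnorm(n.ctrl, mean = 0))
    case   <- rep(c(1, 0), c(n.case, n.ctrl))

    # Sensitivity at 90% specificity from the empirical ROC curve
    sens90 <- function(m, y) {
      thr <- quantile(m[y == 0], probs = 0.90, type = 1)  # threshold: 90th percentile of controls
      mean(m[y == 1] > thr)
    }

    obs <- sens90(marker, case)

    # Permutation null: shuffle case/control labels (Monte Carlo stand-in for full enumeration)
    perm <- replicate(9999, sens90(marker, sample(case)))
    c(sensitivity.at.90.spec = obs,
      p.value = mean(c(perm, obs) >= obs))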


Subject(s)
Biomarkers/analysis , Computational Biology/methods , Computational Biology/standards , Data Interpretation, Statistical , Models, Statistical , Humans