Results 1 - 20 of 86
1.
Biom J ; 66(7): e202400033, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39377280

ABSTRACT

In survival analysis, it often happens that some individuals, referred to as cured individuals, never experience the event of interest. When analyzing time-to-event data with a cure fraction, it is crucial to check the assumption of "sufficient follow-up," which means that the right extreme of the censoring time distribution is larger than that of the survival time distribution for the noncured individuals. However, the available methods to test this assumption are limited in the literature. In this article, we study the problem of testing whether follow-up is sufficient for light-tailed distributions and develop a simple novel test. The proposed test statistic compares an estimator of the noncure proportion under sufficient follow-up to one without the assumption of sufficient follow-up. A bootstrap procedure is employed to approximate the critical values of the test. We also carry out extensive simulations to evaluate the finite sample performance of the test and illustrate the practical use with applications to leukemia and breast cancer data sets.


Subject(s)
Breast Neoplasms , Humans , Survival Analysis , Breast Neoplasms/mortality , Leukemia/mortality , Follow-Up Studies , Models, Statistical , Biometry/methods , Data Interpretation, Statistical , Female , Computer Simulation
2.
Multivariate Behav Res ; 59(5): 957-977, 2024.
Article in English | MEDLINE | ID: mdl-39097830

ABSTRACT

When examining whether two continuous variables are associated, tests based on Pearson's, Kendall's, and Spearman's correlation coefficients are typically used. This paper explores modern nonparametric independence tests as an alternative, which, unlike traditional tests, have the ability to potentially detect any type of relationship. In addition to existing modern nonparametric independence tests, we developed and considered two novel variants of existing tests, most notably the Heller-Heller-Gorfine-Pearson (HHG-Pearson) test. We conducted a simulation study to compare traditional independence tests, such as Pearson's correlation, and the modern nonparametric independence tests in situations commonly encountered in psychological research. As expected, no test had the highest power across all relationships. However, the distance correlation and the HHG-Pearson tests were found to have substantially greater power than all traditional tests for many relationships and only slightly less power in the worst case. A similar pattern was found in favor of the HHG-Pearson test compared to the distance correlation test. However, given that distance correlation performed better for linear relationships and is more widely accepted, we suggest considering its use in place of, or in addition to, traditional methods when there is no prior knowledge of the relationship type, as is often the case in psychological research.
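For readers unfamiliar with these alternatives, the sketch below contrasts Pearson's test with a permutation test based on the (biased, V-statistic) sample distance correlation applied to a nonlinear relationship; it is a simplified stand-in for the tests compared in the paper, not the authors' code, and the example data are synthetic.

```python
import numpy as np
from scipy import stats

def distance_correlation(x, y):
    """Sample distance correlation (biased V-statistic version)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])                 # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()   # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

def dcor_permutation_test(x, y, n_perm=999, seed=None):
    """Permutation p-value for independence based on distance correlation."""
    rng = np.random.default_rng(seed)
    observed = distance_correlation(x, y)
    null = [distance_correlation(x, rng.permutation(y)) for _ in range(n_perm)]
    p = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, p

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = x**2 + 0.5 * rng.normal(size=200)          # nonlinear, non-monotone relationship

r, p_pearson = stats.pearsonr(x, y)            # Pearson's test has low power here
dcor, p_dcor = dcor_permutation_test(x, y, seed=2)
print(f"Pearson r={r:.3f} (p={p_pearson:.3f});  dCor={dcor:.3f} (p={p_dcor:.3f})")
```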


Subject(s)
Computer Simulation , Humans , Statistics, Nonparametric , Data Interpretation, Statistical , Psychology/methods , Behavioral Research/methods , Models, Statistical
3.
Heliyon ; 10(9): e30470, 2024 May 15.
Article in English | MEDLINE | ID: mdl-38726202

ABSTRACT

Coastal terrestrial-aquatic interfaces (TAIs) are crucial contributors to global biogeochemical cycles and carbon exchange. The soil carbon dioxide (CO2) efflux in these transition zones is, however, poorly understood because of the high spatiotemporal dynamics of TAIs, as the various sub-ecosystems in this region are compressed and expanded by the complex influences of tides, changes in river levels, climate, and land use. We focus on the Chesapeake Bay region to (i) investigate the spatial heterogeneity of the coastal ecosystem and identify spatial zones with similar environmental characteristics based on spatial data layers, including vegetation phenology, climate, land cover, diversity, topography, soil properties, and relative tidal elevation; and (ii) understand the primary driving factors affecting soil respiration within sub-ecosystems of the coastal ecosystem. Specifically, we employed hierarchical clustering analysis to identify spatial regions with distinct environmental characteristics, followed by the determination of main driving factors using Random Forest regression and SHapley Additive exPlanations. Maximum and minimum temperature are the main drivers common to all sub-ecosystems, while each region also has additional unique major drivers that differentiate it from the others. Precipitation exerts an influence on vegetated lands, while soil pH is important specifically in forested lands. In croplands characterized by high clay content and low sand content, bulk density plays the most significant role. Wetlands demonstrate the importance of both elevation and sand content, with clay content being more relevant in non-inundated wetlands than in inundated wetlands. The topographic wetness index contributes significantly in mixed vegetation areas, including shrub, grass, pasture, and forest. Additionally, our research reveals that dense vegetation land covers and urban/developed areas exhibit distinct soil property drivers. Overall, our research demonstrates an efficient method of employing various open-source remote sensing and GIS datasets to comprehend the spatial variability and soil respiration mechanisms in coastal TAIs. There is no one-size-fits-all approach to modeling carbon fluxes released by soil respiration in coastal TAIs, and our study highlights the importance of further research and monitoring practices to improve our understanding of carbon dynamics and promote the sustainable management of coastal TAIs.
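A rough sketch of the described workflow (hierarchical clustering of environmental layers into zones, then Random Forest regression with SHAP attribution within each zone) is given below; the feature names and synthetic data are placeholders rather than the study's actual layers, and the shap package is assumed to be available.

```python
import numpy as np
import pandas as pd
import shap
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Placeholder environmental layers for 500 grid cells (stand-ins for the study's data).
X = pd.DataFrame({
    "tmax": rng.normal(25, 3, 500),
    "tmin": rng.normal(12, 3, 500),
    "precip": rng.gamma(2.0, 40.0, 500),
    "elevation": rng.uniform(0, 30, 500),
    "clay_frac": rng.uniform(0, 0.6, 500),
    "sand_frac": rng.uniform(0, 0.8, 500),
})
# Synthetic soil CO2 efflux driven mostly by temperature, for illustration only.
y = 0.6 * X["tmax"] - 0.4 * X["tmin"] + 0.01 * X["precip"] + rng.normal(0, 1, 500)

# (i) Delineate zones with similar environmental characteristics (standardized features).
Z = linkage((X - X.mean()) / X.std(), method="ward")
X["zone"] = fcluster(Z, t=4, criterion="maxclust")

# (ii) For each zone, fit a Random Forest and rank drivers by mean |SHAP value|.
for zone, Xz in X.groupby("zone"):
    feats = Xz.drop(columns="zone")
    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(feats, y.loc[Xz.index])
    shap_values = shap.TreeExplainer(model).shap_values(feats)
    importance = pd.Series(np.abs(shap_values).mean(axis=0), index=feats.columns)
    print(f"zone {zone}: top drivers ->", importance.sort_values(ascending=False).head(3).to_dict())
```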

4.
Entropy (Basel) ; 26(1)2024 Jan 04.
Article in English | MEDLINE | ID: mdl-38248176

ABSTRACT

Change points indicate significant shifts in the statistical properties of a data stream at particular time points. Detecting change points efficiently and effectively is essential for understanding the underlying data-generating mechanism in modern data streams with versatile parameter-varying patterns. However, locating multiple change points in noisy data is a highly challenging problem. Although the Bayesian information criterion has been proven to be an effective way of selecting multiple change points in an asymptotic sense, its finite-sample performance can be deficient. In this article, we review a range of information criterion-based methods for multiple change point detection, including the Akaike information criterion, the Bayesian information criterion, minimum description length, and their variants, with an emphasis on their practical applications. Simulation studies are conducted to investigate the actual performance of different information criteria in detecting multiple change points under possible model misspecification, as a guide for practitioners. A case study on SCADA signals from wind turbines demonstrates the practical change point detection power of the different information criteria. Finally, some key challenges in the development and application of multiple change point detection are presented for future research.
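To make the BIC-based selection concrete, the following minimal sketch (not taken from the article) uses greedy binary segmentation to propose candidate change points in a piecewise-constant-mean signal and then chooses the number of change points that minimizes a BIC-type criterion.

```python
import numpy as np

def best_split(x, min_len=5):
    """Split point minimizing the two-segment residual sum of squares."""
    best_rss, best_k = np.inf, None
    for k in range(min_len, len(x) - min_len + 1):
        rss = ((x[:k] - x[:k].mean())**2).sum() + ((x[k:] - x[k:].mean())**2).sum()
        if rss < best_rss:
            best_rss, best_k = rss, k
    return best_k, best_rss

def binary_segmentation(x, max_cp=5, min_len=5):
    """Greedy binary segmentation; returns candidate change points in discovery order."""
    cps, segments = [], [(0, len(x))]
    while len(cps) < max_cp:
        gains = []
        for (s, e) in segments:
            if e - s >= 2 * min_len:
                k, split_rss = best_split(x[s:e], min_len)
                full_rss = ((x[s:e] - x[s:e].mean())**2).sum()
                gains.append((full_rss - split_rss, s + k, (s, e)))
        if not gains:
            break
        _, cp, (s, e) = max(gains)
        cps.append(cp)
        segments.remove((s, e))
        segments += [(s, cp), (cp, e)]
    return cps

def bic(x, cps):
    """BIC-type criterion for a piecewise-constant Gaussian mean model."""
    n, bounds = len(x), [0] + sorted(cps) + [len(x)]
    rss = sum(((x[s:e] - x[s:e].mean())**2).sum() for s, e in zip(bounds, bounds[1:]))
    return n * np.log(rss / n) + (2 * len(cps) + 1) * np.log(n)  # means + locations + variance

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100), rng.normal(-1, 1, 100)])
candidates = binary_segmentation(x)
scores = {m: bic(x, candidates[:m]) for m in range(len(candidates) + 1)}
m_hat = min(scores, key=scores.get)
print("selected change points:", sorted(candidates[:m_hat]))
```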

5.
BMC Bioinformatics ; 24(1): 322, 2023 Aug 26.
Article in English | MEDLINE | ID: mdl-37633901

ABSTRACT

BACKGROUND: The identification of genomic regions affected by selection is one of the most important goals in population genetics. If temporal data are available, allele frequency changes at SNP positions are often used for this purpose. Here we provide a new testing approach that uses haplotype frequencies instead of allele frequencies. RESULTS: Using simulated data, we show that, compared to SNP-based tests, our approach has higher power, especially when the number of candidate haplotypes is small or moderate. To improve power when the number of haplotypes is large, we investigate methods to combine them into a moderate number of haplotype subsets. Haplotype frequencies can often be recovered with less noise than SNP frequencies, especially under pool sequencing, giving our test an additional advantage. Furthermore, spurious outlier SNPs may lead to false positives, a problem usually not encountered when working with haplotypes. Post hoc tests for the number of selected haplotypes and for differences between their selection coefficients are also provided for a better understanding of the underlying selection dynamics. An application to a real data set further illustrates the performance benefits. CONCLUSIONS: Owing to a reduced multiple-testing burden and lower noise, haplotype-based testing is able to outperform SNP-based tests in terms of power in most scenarios.
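As a toy illustration of why pooling signal over haplotypes can help, the sketch below applies a chi-square test to haplotype-count changes between two time points and contrasts it with Bonferroni-corrected per-SNP tests; the counts are made up and the test is a deliberately simplified stand-in for the one proposed in the paper.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical haplotype counts over three SNPs (rows: generation 0 and generation 60).
haplotypes = ["ACG", "ACT", "GCT", "GTT"]
counts = np.array([[120, 80, 60, 40],     # before selection
                   [ 90, 70, 65, 75]])    # after selection

chi2, p_hap, dof, _ = chi2_contingency(counts)
print(f"haplotype-based test: chi2={chi2:.2f}, df={dof}, p={p_hap:.4f}")

# Per-SNP alternative: collapse haplotype counts to allele counts at each site,
# then apply a Bonferroni correction over the three SNPs.
p_snps = []
for site in range(3):
    allele_counts = {}
    for hap, c0, c1 in zip(haplotypes, counts[0], counts[1]):
        before, after = allele_counts.setdefault(hap[site], [0, 0])
        allele_counts[hap[site]] = [before + c0, after + c1]
    table = np.array(list(allele_counts.values())).T
    p_snps.append(chi2_contingency(table)[1])
print("per-SNP p-values (Bonferroni-adjusted):",
      [round(min(1.0, p * len(p_snps)), 4) for p in p_snps])
```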


Subject(s)
Genomics , Polymorphism, Single Nucleotide , Haplotypes , Gene Frequency
6.
Foodborne Pathog Dis ; 20(9): 414-418, 2023 09.
Article in English | MEDLINE | ID: mdl-37578455

ABSTRACT

CDC and health departments investigate foodborne disease outbreaks to identify a source. To generate and test hypotheses about vehicles, investigators typically compare exposure prevalence among case-patients with the general population using a one-sample binomial test. We propose a Bayesian alternative that also accounts for uncertainty in the estimate of exposure prevalence in the reference population. We compared exposure prevalence in a 2020 outbreak of Escherichia coli O157:H7 illnesses linked to leafy greens with 2018-2019 FoodNet Population Survey estimates. We ran prospective simulations using our Bayesian approach at three time points during the investigation. The posterior probability that leafy green consumption prevalence was higher than the general population prevalence increased as additional case-patients were interviewed. Probabilities were >0.70 for multiple leafy green items 2 weeks before the exact binomial p-value was statistically significant. A Bayesian approach to assessing exposure prevalence among cases could be superior to the one-sample binomial test typically used during foodborne outbreak investigations.
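A minimal sketch of this kind of Bayesian comparison (with made-up counts and flat Beta(1, 1) priors, not the investigation's actual data or priors): each exposure prevalence gets a Beta posterior, and the posterior probability that case-patient exposure exceeds the reference prevalence is approximated by Monte Carlo.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(42)

# Hypothetical counts: case-patients reporting a leafy green exposure, and
# respondents in a reference population survey reporting the same exposure.
cases_exposed, cases_total = 22, 30
ref_exposed, ref_total = 470, 1000

# Beta(1, 1) priors; posteriors are Beta(successes + 1, failures + 1).
post_cases = rng.beta(cases_exposed + 1, cases_total - cases_exposed + 1, size=100_000)
post_ref = rng.beta(ref_exposed + 1, ref_total - ref_exposed + 1, size=100_000)

prob_higher = (post_cases > post_ref).mean()
print(f"P(case exposure prevalence > reference prevalence | data) = {prob_higher:.3f}")

# Contrast with the usual one-sample exact binomial test, which treats the
# reference prevalence as a fixed, known constant.
p_exact = binomtest(cases_exposed, cases_total, p=ref_exposed / ref_total,
                    alternative="greater").pvalue
print(f"one-sample exact binomial p-value = {p_exact:.4f}")
```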


Subject(s)
Escherichia coli O157 , Foodborne Diseases , Humans , Bayes Theorem , Prevalence , Foodborne Diseases/epidemiology , Disease Outbreaks
7.
Stat Methods Med Res ; 32(8): 1559-1575, 2023 08.
Article in English | MEDLINE | ID: mdl-37325816

ABSTRACT

Nonlinear mixed effects models have been widely applied to analyses of data that arise from biological, agricultural, and environmental sciences. Estimation of and inference on parameters in nonlinear mixed effects models are often based on the specification of a likelihood function. Maximizing this likelihood function can be complicated by the specification of the random effects distribution, especially in the presence of multiple random effects. The implementation of nonlinear mixed effects models can be further complicated by left-censored responses, representing measurements from bioassays where the exact quantification below a certain threshold is not possible. Motivated by the need to characterize the nonlinear human immunodeficiency virus RNA viral load trajectories after the interruption of antiretroviral therapy, we propose a smoothed simulated pseudo-maximum likelihood estimation approach to fit nonlinear mixed effects models in the presence of left-censored observations. We establish the consistency and asymptotic normality of the resulting estimators. We develop testing procedures for the correlation among random effects and for testing the distributional assumptions on random effects against a specific alternative. In contrast to the existing variants of expectation-maximization approaches, the proposed methods offer flexibility in the specification of the random effects distribution and convenience in making inference about higher-order correlation parameters. We evaluate the finite-sample performance of the proposed methods through extensive simulation studies and illustrate them on a combined dataset from six AIDS Clinical Trials Group treatment interruption studies.


Subject(s)
HIV Infections , Humans , Likelihood Functions , Computer Simulation , HIV Infections/drug therapy , Nonlinear Dynamics , Models, Statistical
8.
Front Psychiatry ; 14: 1102811, 2023.
Article in English | MEDLINE | ID: mdl-36970281

ABSTRACT

Background: A rapidly growing body of literature has revealed a mediating role of DNA methylation in the path from childhood maltreatment to psychiatric disorders such as post-traumatic stress disorder (PTSD) in adulthood. However, such analyses are statistically challenging, and powerful mediation methods for this issue are lacking. Methods: To study how maltreatment in childhood induces long-lasting DNA methylation changes that in turn affect PTSD in adulthood, we carried out a gene-based mediation analysis from the perspective of a composite null hypothesis in the Grady Trauma Project (352 participants and 16,565 genes), with childhood maltreatment as exposure, multiple DNA methylation sites as mediators, and PTSD or its relevant scores as outcome. We addressed the challenges of gene-based mediation analysis by taking its composite null hypothesis testing nature into consideration and fitting a weighted test statistic. Results: We found that childhood maltreatment substantially affected PTSD and PTSD-related scores, and that childhood maltreatment was associated with DNA methylation, which in turn had significant effects on PTSD and these scores. Furthermore, using the proposed mediation method, we identified multiple genes whose DNA methylation sites exhibited mediating roles in the path from childhood maltreatment to PTSD-relevant scores in adulthood, with 13 genes for the Beck Depression Inventory and 6 for the modified PTSD Symptom Scale. Conclusion: Our results have the potential to offer meaningful insights into the biological mechanism underlying the impact of early adverse experience on adult disease, and the proposed mediation methods can be applied to other similar analysis settings.

9.
J Appl Stat ; 50(3): 495-511, 2023.
Article in English | MEDLINE | ID: mdl-36819081

ABSTRACT

Network (graph) data analysis is a popular research topic in statistics and machine learning. In applications, one is frequently confronted with the graph two-sample hypothesis testing problem, where the goal is to test for a difference between two graph populations. Several statistical tests have been devised for this purpose in the context of binary graphs. However, many networks encountered in practice are weighted, and existing procedures cannot be applied directly to weighted graphs. In this paper, we study the weighted graph two-sample hypothesis testing problem and propose a practical test statistic. We prove that the proposed test statistic converges in distribution to the standard normal distribution under the null hypothesis and analyze its power theoretically. A simulation study shows that the proposed test has satisfactory performance and that it substantially outperforms the existing counterpart in the binary graph case. A real data application is provided to illustrate the method.

10.
Entropy (Basel) ; 25(2)2023 Jan 28.
Article in English | MEDLINE | ID: mdl-36832605

ABSTRACT

In this paper, we focus on the homogeneity test that evaluates whether two multivariate samples come from the same distribution. This problem arises naturally in various applications, and there are many methods available in the literature. Several data depth-based tests have been proposed for this problem, but they may not be very powerful. In light of the recent development of data depth as an important measure in quality assurance, we propose two new test statistics for the multivariate two-sample homogeneity test. The proposed test statistics have the same χ2(1) asymptotic null distribution. The generalization of the proposed tests to the multivariate multisample situation is discussed as well. Simulation studies demonstrate the superior performance of the proposed tests. The test procedure is illustrated through two real data examples.

11.
Int J Biostat ; 19(1): 1-19, 2023 05 01.
Article in English | MEDLINE | ID: mdl-35749155

ABSTRACT

It has been reported that about half of biological discoveries are irreproducible. These irreproducible discoveries have been partially attributed to poor statistical power, which is largely due to small sample sizes. However, in molecular biology and medicine, because of limited biological resources and budgets, most molecular biology experiments are conducted with small samples. The two-sample t-test controls bias through its degrees of freedom, but this also implies that the t-test has low power in small samples. A discovery made with low statistical power is likely to have poor reproducibility, so raising statistical power is not a feasible way to enhance reproducibility in small-sample experiments. An alternative is to reduce the type I error rate. To this end, a so-called tα-test was developed. Both theoretical analysis and simulation studies demonstrate that the tα-test substantially outperforms the t-test, although it reduces to the t-test when sample sizes exceed 15. Large-scale simulation studies and real experimental data show that the tα-test significantly reduced the type I error rate compared with the t-test and the Wilcoxon test in small-sample experiments, while having almost the same empirical power as the t-test. The null p-value density distribution explains why the tα-test has a much lower type I error rate than the t-test. One real experimental dataset provides a typical example showing that the tα-test outperforms the t-test, and a microarray dataset showed that the tα-test had the best performance among five statistical methods. In addition, the density and cumulative distribution functions of the tα-statistic are derived mathematically, and the theoretical and observed distributions match well.
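The tα-test itself is specific to this article, but the small-sample problem it targets is easy to reproduce. The sketch below estimates empirical type I error rates of the standard t-test and the Wilcoxon (Mann-Whitney) test under the null with three observations per group; it simulates the problem, not the authors' test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_sim, alpha = 3, 20_000, 0.05

rejections_t = rejections_w = 0
for _ in range(n_sim):
    # Both groups drawn from the same distribution: the null hypothesis is true.
    a = rng.normal(0, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections_t += 1
    if stats.mannwhitneyu(a, b, alternative="two-sided").pvalue < alpha:
        rejections_w += 1

print(f"empirical type I error, t-test:        {rejections_t / n_sim:.4f}")
print(f"empirical type I error, Wilcoxon test: {rejections_w / n_sim:.4f}")
```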


Subject(s)
Models, Statistical , Reproducibility of Results , Computer Simulation , Likelihood Functions , Sample Size
12.
Stat Med ; 42(1): 68-88, 2023 01 15.
Article in English | MEDLINE | ID: mdl-36372072

ABSTRACT

The primary benefit of identifying a valid surrogate marker is the ability to use it in a future trial to test for a treatment effect with shorter follow-up time or less cost. However, previous work has demonstrated potential heterogeneity in the utility of a surrogate marker. When such heterogeneity exists, existing methods that use the surrogate to test for a treatment effect while ignoring this heterogeneity may lead to inaccurate conclusions about the treatment effect, particularly when the patient population in the new study has a different mix of characteristics than the study used to evaluate the utility of the surrogate marker. In this article, we develop a novel test for a treatment effect using surrogate marker information that accounts for heterogeneity in the utility of the surrogate. We compare our testing procedure to a test that uses primary outcome information (gold standard) and a test that uses surrogate marker information, but ignores heterogeneity. We demonstrate the validity of our approach and derive the asymptotic properties of our estimator and variance estimates. Simulation studies examine the finite sample properties of our testing procedure and demonstrate when our proposed approach can outperform the testing approach that ignores heterogeneity. We illustrate our methods using data from an AIDS clinical trial to test for a treatment effect using CD4 count as a surrogate marker for RNA.


Subject(s)
Computer Simulation , Humans , Biomarkers , CD4 Lymphocyte Count
13.
Stat Probab Lett ; 193, 2023 Feb.
Article in English | MEDLINE | ID: mdl-38584807

ABSTRACT

This work defines a new correction for the likelihood ratio test for a two-sample problem within the multivariate normal context. This correction applies to decomposable graphical models, where testing equality of distributions can be decomposed into lower dimensional problems.

14.
Front Genet ; 13: 1009428, 2022.
Article in English | MEDLINE | ID: mdl-36468009

ABSTRACT

Combining SNP p-values from GWAS summary data is a promising strategy for detecting novel genetic factors. Existing statistical methods for p-value-based SNP-set testing confront two challenges. First, the statistical power of different methods depends on unknown patterns of genetic effects that can vary drastically over different SNP sets. Second, they do not identify which SNPs primarily contribute to the global association of the whole set. We propose a new signal-adaptive analysis pipeline to address these challenges using the omnibus thresholding Fisher's method (oTFisher). The oTFisher remains robustly powerful over various patterns of genetic effects. Its adaptive thresholding can be applied to estimate the important SNPs contributing to the overall significance of the given SNP set. We develop efficient calculation algorithms to control the type I error rate, accounting for the linkage disequilibrium among SNPs. Extensive simulations show that the oTFisher has robustly high power and provides higher balanced accuracy in screening SNPs than the traditional Bonferroni and FDR procedures. We applied the oTFisher to study the genetic association of genes and haplotype blocks with bone density-related traits using the summary data of the Genetic Factors for Osteoporosis Consortium. The oTFisher identified more novel and literature-reported genetic factors than existing p-value combination methods. The relevant computation has been implemented in the R package TFisher to support similar data analyses.
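A stripped-down illustration of thresholded Fisher combination (assuming independent p-values, so the null distribution is simulated rather than computed with the article's LD-aware algorithms) might look like the following; the actual implementation is in the R package TFisher.

```python
import numpy as np

def truncated_fisher(pvals, tau=0.05):
    """Fisher combination restricted to p-values below the truncation threshold tau."""
    pvals = np.asarray(pvals)
    kept = pvals[pvals <= tau]
    return -2.0 * np.log(kept).sum() if kept.size else 0.0

def monte_carlo_pvalue(pvals, tau=0.05, n_sim=20_000, seed=None):
    """Null distribution simulated under independent uniform p-values (no LD)."""
    rng = np.random.default_rng(seed)
    observed = truncated_fisher(pvals, tau)
    null = np.array([truncated_fisher(rng.uniform(size=len(pvals)), tau)
                     for _ in range(n_sim)])
    return (1 + (null >= observed).sum()) / (n_sim + 1)

# Hypothetical SNP-set p-values: a few strong signals among mostly null SNPs.
snp_pvals = np.array([0.001, 0.004, 0.03, 0.22, 0.41, 0.56, 0.68, 0.79, 0.88, 0.95])
print("set-level p-value:", monte_carlo_pvalue(snp_pvals, tau=0.05, seed=0))
print("contributing SNPs (p <= tau):", np.where(snp_pvals <= 0.05)[0].tolist())
```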

15.
Front Psychol ; 13: 980261, 2022.
Article in English | MEDLINE | ID: mdl-36533060

ABSTRACT

The identification of an empirically adequate theoretical construct requires determining whether a theoretically predicted effect is sufficiently similar to an observed effect. To this end, we propose a simple similarity measure, describe its application in different research designs, and use computer simulations to estimate the necessary sample size for a given observed effect. As our main example, we apply this measure to recent meta-analytical research on precognition. Results suggest that the evidential basis is too weak for a predicted precognition effect of d = 0.20 to be considered empirically adequate. As additional examples, we apply this measure to object-level experimental data from dissonance theory and a recent crowdsourcing hypothesis test, as well as to meta-analytical data on the correlation of personality traits and life outcomes.

16.
Prev Med ; 164: 107127, 2022 11.
Article in English | MEDLINE | ID: mdl-35787846

ABSTRACT

It is well known that the statistical analyses in health-science and medical journals are frequently misleading or even wrong. Despite many decades of reform efforts by hundreds of scientists and statisticians, attempts to fix the problem by avoiding obvious error and encouraging good practice have not altered this basic situation. Statistical teaching and reporting remain mired in damaging yet editorially enforced jargon of "significance", "confidence", and imbalanced focus on null (no-effect or "nil") hypotheses, leading to flawed attempts to simplify descriptions of results in ordinary terms. A positive development amidst all this has been the introduction of interval estimates alongside or in place of significance tests and P-values, but intervals have been beset by similar misinterpretations. Attempts to remedy this situation by calling for replacement of traditional statistics with competitors (such as pure-likelihood or Bayesian methods) have had little impact. Thus, rather than ban or replace P-values or confidence intervals, we propose to replace traditional jargon with more accurate and modest ordinary-language labels that describe these statistics as measures of compatibility between data and hypotheses or models, which have long been in use in the statistical modeling literature. Such descriptions emphasize the full range of possibilities compatible with observations. Additionally, a simple transform of the P-value called the surprisal or S-value provides a sense of how much or how little information the data supply against those possibilities. We illustrate these reforms using some examples from a highly charged topic: trials of ivermectin treatment for Covid-19.
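The surprisal transform mentioned above is simply S = -log2(p), interpretable as the number of consecutive heads in fair coin tosses that would be equally surprising. A tiny sketch with illustrative p-values (not taken from any ivermectin trial):

```python
import math

def s_value(p):
    """Surprisal (S-value): bits of information against the test model, S = -log2(p)."""
    return -math.log2(p)

# Illustrative p-values only.
for p in (0.50, 0.05, 0.005):
    print(f"p = {p:<5} ->  S = {s_value(p):.1f} bits "
          f"(about as surprising as {round(s_value(p))} fair coin tosses all landing heads)")
```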


Subject(s)
COVID-19 , Humans , Data Interpretation, Statistical , Bayes Theorem , COVID-19/prevention & control , Probability , Models, Statistical , Confidence Intervals
17.
Stat Med ; 41(17): 3349-3364, 2022 07 30.
Article in English | MEDLINE | ID: mdl-35491388

ABSTRACT

We propose an inferential framework for fixed effects in longitudinal functional models and introduce tests for the correlation structures induced by the longitudinal sampling procedure. The framework provides a natural extension of standard longitudinal correlation models for scalar observations to functional observations. Using simulation studies, we compare fixed effects estimation under correctly and incorrectly specified correlation structures and also test the longitudinal correlation structure. Finally, we apply the proposed methods to a longitudinal functional dataset on physical activity. The computer code for the proposed method is available at https://github.com/rli20ST758/FILF.


Subject(s)
Exercise , Research Design , Computer Simulation , Humans , Longitudinal Studies
18.
Stat Med ; 41(13): 2417-2426, 2022 06 15.
Article in English | MEDLINE | ID: mdl-35253259

ABSTRACT

Testing a global null hypothesis that there are no significant predictors for a binary outcome of interest among a large set of biomarker measurements is an important task in biomedical studies. We seek to improve the power of such testing methods by leveraging ensemble machine learning methods. Ensemble machine learning methods such as random forest, bagging, and adaptive boosting model the relationship between the outcome and the predictor nonparametrically, while stacking combines the strength of multiple learners. We demonstrate the power of the proposed testing methods through Monte Carlo studies and show the use of the methods by applying them to the immunologic biomarkers dataset from the RV144 HIV vaccine efficacy trial.
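One generic way to build such a test with an ensemble learner (a sketch of the general idea, not the authors' specific procedure) is to take cross-validated random forest AUC as the test statistic and calibrate it by permuting the outcome labels; the biomarker data below are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cv_auc(X, y, seed=0):
    """Cross-validated AUC of a random forest as a measure of overall predictive signal."""
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

def global_null_test(X, y, n_perm=100, seed=None):
    """Permutation p-value for H0: no biomarker is associated with the outcome."""
    rng = np.random.default_rng(seed)
    observed = cv_auc(X, y)
    null = np.array([cv_auc(X, rng.permutation(y)) for _ in range(n_perm)])
    p = (1 + (null >= observed).sum()) / (n_perm + 1)
    return observed, p

# Synthetic biomarker panel: 30 markers, only the first two weakly predictive.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 30))
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=150) > 0).astype(int)

auc, p = global_null_test(X, y, n_perm=100, seed=1)
print(f"cross-validated AUC = {auc:.3f}, permutation p-value = {p:.4f}")
```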


Subject(s)
Machine Learning , Humans
19.
Sensors (Basel) ; 22(2)2022 Jan 07.
Article in English | MEDLINE | ID: mdl-35062397

ABSTRACT

Distinguishing between wireless and wired traffic in a network middlebox is an essential ingredient for numerous applications, including security monitoring and quality-of-service (QoS) provisioning. The majority of existing approaches exploit the larger delay statistics, such as round-trip time and inter-packet arrival time, observed in wireless traffic to infer whether the traffic originates from Ethernet (i.e., wired) or Wi-Fi (i.e., wireless), based on the assumption that the capacity of the wireless link is much lower than that of the wired link. However, this underlying assumption is no longer valid given wireless data rates beyond 1 Gbps enabled by recent Wi-Fi technologies such as 802.11ac/ax. In this paper, we revisit the problem of identifying Wi-Fi traffic in network middleboxes as the wireless link capacity approaches that of the wired link. We present Weigh-in-Motion, a lightweight online detection scheme that analyzes the traffic patterns observed at the middleboxes and infers whether the traffic originates from high-speed Wi-Fi devices. To this end, we introduce the concept of ACKBunch, which captures the unique characteristics of high-speed Wi-Fi and is further utilized to distinguish whether the observed traffic originates from a wired or wireless device. The effectiveness of the proposed scheme is evaluated via extensive real experiments, demonstrating its capability of accurately identifying wireless traffic from/to Gigabit 802.11 devices.

20.
Pharm Stat ; 21(1): 133-149, 2022 01.
Article in English | MEDLINE | ID: mdl-34350678

ABSTRACT

In multiregional randomized clinical trials (MRCTs), determining the regional treatment effect of a new treatment over an existing one is important to both the sponsor and the relevant regulatory agencies. Of particular interest is testing the null hypothesis that the treatment benefit is the same across all regions. Existing methods are mainly designed for continuous endpoints and use parametric models, which are not robust. MRCTs are known to face increased variation and heterogeneity, so a robust model for their design and analysis is desirable. We consider clinical trials with a binary primary endpoint and propose a robust semiparametric logistic model that has a known parametric component and an unknown nonparametric component. The parametric component represents our prior knowledge about the model, and the nonparametric part reflects uncertainty. Compared to the classic logistic model for this problem, the proposed model has the following advantages: it is robust to model assumptions, more flexible and accurate in modeling the relationship between the response and covariates, and can yield more accurate parameter estimates. The model parameters are estimated by a profile maximum likelihood approach, and the null hypothesis that the regional treatment differences are equal is tested by the profile likelihood ratio statistic. Asymptotic properties of the estimates are derived. Simulation studies are conducted to evaluate the performance of the proposed model and demonstrate clear advantages over the classic logistic model. The method is then applied to the analysis of a real MRCT.


Subject(s)
Models, Statistical , Computer Simulation , Humans , Likelihood Functions , Logistic Models , Randomized Controlled Trials as Topic