Results 1-20 of 1,584
1.
J Gen Intern Med ; 2024 Oct 02.
Article in English | MEDLINE | ID: mdl-39358502

ABSTRACT

BACKGROUND: Early identification of a patient with resistant hypertension (RH) enables quickly intensified treatment, short-interval follow-up, or perhaps case management to bring his or her blood pressure under control and reduce the risk of complications. OBJECTIVE: To identify predictors of RH among individuals with newly diagnosed hypertension (HTN), while comparing different prediction models and techniques for managing missing covariates using electronic health records data. DESIGN: Risk prediction study in a retrospective cohort. PARTICIPANTS: Adult patients with incident HTN treated in any of the primary care clinics of one health system between April 2013 and December 2016. MAIN MEASURES: Predicted risk of RH at the time of HTN identification and candidate predictors for variable selection in future model development. KEY RESULTS: Among 26,953 individuals with incident HTN, 613 (2.3%) met criteria for RH after 4.7 months (interquartile range, 1.2-11.3). Variables selected by the least absolute shrinkage and selection operator (LASSO) included baseline systolic blood pressure (SBP) and its missing indicator (a dummy variable created when baseline SBP is absent), use of antihypertensive medication at the time of cohort entry, body mass index, and atherosclerosis risk. The random forest technique achieved the highest area under the curve (AUC) of 0.893 (95% CI, 0.881-0.904) and the best calibration, with a calibration slope of 1.01. Complete case analysis was not a viable option (AUC = 0.625). CONCLUSIONS: Machine learning techniques and traditional logistic regression exhibited comparable predictive performance after the missingness was handled. We suggest that the variables identified by this study may be good candidates for clinical prediction models that alert clinicians to the need for short-interval follow-up and more intensive early therapy for HTN.
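
The missing-indicator-plus-shrinkage selection and the random forest benchmark described above can be approximated with scikit-learn. The sketch below uses simulated data and hypothetical column names (baseline_sbp, on_antihypertensive, bmi, ascvd_risk); it illustrates the general workflow, not the authors' model or data.

```python
# Minimal sketch: LASSO-style variable selection with a missing indicator for
# baseline SBP, plus a random forest benchmark (simulated, hypothetical data).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "baseline_sbp": rng.normal(140, 18, n),
    "on_antihypertensive": rng.integers(0, 2, n),
    "bmi": rng.normal(29, 6, n),
    "ascvd_risk": rng.uniform(0, 0.4, n),
})
df.loc[rng.random(n) < 0.2, "baseline_sbp"] = np.nan   # simulate missing SBP

# Missing indicator + simple fill, so the model can use missingness itself.
df["sbp_missing"] = df["baseline_sbp"].isna().astype(int)
df["baseline_sbp"] = df["baseline_sbp"].fillna(df["baseline_sbp"].median())
y = (rng.random(n) < 0.02 + 0.002 * (df["baseline_sbp"] - 130).clip(0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.3, random_state=0)
lasso = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=10).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
for name, m in [("LASSO-logistic", lasso), ("random forest", rf)]:
    print(name, "AUC:", roc_auc_score(y_te, m.predict_proba(X_te)[:, 1]))
```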

2.
Caspian J Intern Med ; 15(4): 615-622, 2024.
Article in English | MEDLINE | ID: mdl-39359440

ABSTRACT

Background: Diabetes, a major current health threat, has severe consequences for individuals' health. The present study aimed to investigate the factors affecting changes in the longitudinal outcome of blood sugar using a three-level analysis in the presence of missing data in diabetic patients. Methods: A total of 526 diabetic patients were followed longitudinally; they were selected from the annual data collected on the rural population monitored by Tonekabon health centers in northern Iran during 2018-2019 through the Iranian Integrated Health System (SIB) database. In analyzing these longitudinal data, a three-level model (level 1: observation (time); level 2: subject; level 3: health center) was fitted, with multiple imputation of possible missing values in the longitudinal data. Results: The fitted three-level model indicated that each unit of change in body mass index (BMI) significantly increased fasting blood sugar by an average of 0.5 mg/dl (p=0.024). The level-1 (observation) effect was not significant in the three-level model, but the level-3 random effect for health centers was highly significant (14.62, p<0.001). Conclusion: BMI reduction, the health centers' socioeconomic status, and the health services provided have potential effects on controlling diabetes.
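
A three-level structure of this kind (observations nested in patients nested in health centers) can be expressed with variance components in statsmodels. The sketch below uses simulated data and hypothetical variable names (fbs, bmi, center, patient) and omits the multiple imputation step; it is an illustration of the model structure, not the SIB analysis.

```python
# Sketch of a three-level mixed model for fasting blood sugar:
# observations (level 1) within patients (level 2) within centers (level 3).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
centers, patients_per_center, visits = 10, 20, 4
rows = []
for c in range(centers):
    u_c = rng.normal(0, 3)                      # center-level random effect
    for p in range(patients_per_center):
        u_p = rng.normal(0, 5)                  # patient-level random effect
        bmi = rng.normal(27, 4)
        for t in range(visits):
            fbs = 110 + 0.5 * bmi + u_c + u_p + rng.normal(0, 8)
            rows.append({"center": c, "patient": f"{c}-{p}", "time": t,
                         "bmi": bmi, "fbs": fbs})
df = pd.DataFrame(rows)

# groups = level-3 units (centers); patients enter as a variance component.
model = smf.mixedlm("fbs ~ bmi + time", data=df, groups="center",
                    re_formula="1", vc_formula={"patient": "0 + C(patient)"})
print(model.fit().summary())
```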

3.
Ann Appl Stat ; 18(2): 1195-1212, 2024 Jun.
Article in English | MEDLINE | ID: mdl-39360180

ABSTRACT

Multivariate longitudinal data are frequently encountered in practice, as in our motivating longitudinal microbiome study. It is of general interest to associate such high-dimensional, longitudinal measures with a univariate continuous outcome. However, incomplete observations are common even in a regular study design, as not all samples are measured at every time point, giving rise to so-called blockwise missing values. Such a missing structure imposes significant challenges for association analysis and defies many existing methods that require complete samples. In this paper we propose to represent multivariate longitudinal data as a three-way tensor array (i.e., sample-by-feature-by-time) and exploit a parsimonious scalar-on-tensor regression model for association analysis. We develop a regularized covariance-based estimation procedure that effectively leverages all available observations without imputation. The method achieves variable selection and smooth estimation of time-varying effects. The application to the motivating microbiome study reveals interesting links between preterm infants' gut microbiome dynamics and their neurodevelopment. Additional numerical studies on synthetic data and a longitudinal aging study further demonstrate the efficacy of the proposed method.
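
The representation and the "use all available observations without imputation" idea can be illustrated conceptually in a few lines. The sketch below builds a sample-by-feature-by-time array with blockwise-missing occasions and computes feature covariances from the samples observed at each time point; it is a conceptual illustration on simulated data, not the authors' regularized scalar-on-tensor estimator.

```python
# Conceptual sketch: a sample x feature x time array with blockwise-missing
# time points, with per-time covariance estimated from available samples only.
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_features, n_times = 50, 8, 6
X = rng.normal(size=(n_samples, n_features, n_times))

# Blockwise missingness: a sample not measured at a time loses the whole
# feature block for that (sample, time) pair.
for i in range(n_samples):
    for t in rng.choice(n_times, size=2, replace=False):
        X[i, :, t] = np.nan

cov_by_time = []
for t in range(n_times):
    observed = X[~np.isnan(X[:, 0, t]), :, t]     # rows measured at time t
    cov_by_time.append(np.cov(observed, rowvar=False))
print("time-0 covariance shape:", cov_by_time[0].shape,
      "from", (~np.isnan(X[:, 0, 0])).sum(), "observed samples")
```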

4.
BMC Med Res Methodol ; 24(1): 217, 2024 Sep 27.
Article in English | MEDLINE | ID: mdl-39333923

ABSTRACT

BACKGROUND: In computer-aided diagnosis (CAD) studies utilizing multireader multicase (MRMC) designs, missing data may occur when readers misinterpret or overlook instances, or when there are problems with measurement techniques. Improper handling of these missing data can lead to bias. However, little research has been conducted on addressing the missing data issue within the MRMC framework. METHODS: We introduced a novel approach that integrates multiple imputation with MRMC analysis (MI-MRMC). An elaborate simulation study was conducted to compare the efficacy of our proposed approach with that of the traditional complete case analysis strategy within the MRMC design. Furthermore, we applied both approaches to a real MRMC-design CAD study on aneurysm detection via head and neck CT angiograms to further validate their practicality. RESULTS: Compared with traditional complete case analysis, the simulation study demonstrated that the MI-MRMC approach provides an almost unbiased estimate of diagnostic capability, alongside satisfactory performance in terms of statistical power and the type I error rate within the MRMC framework, even in small-sample scenarios. In the real CAD study, the proposed MI-MRMC method further demonstrated strong performance in terms of both point estimates and confidence intervals compared with traditional complete case analysis. CONCLUSION: Within MRMC design settings, adopting an MI-MRMC approach in the face of missing data can facilitate the attainment of unbiased and robust estimates of diagnostic capability.
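
The multiple-imputation backbone of such an approach (impute, analyze each completed dataset, pool) can be sketched briefly. The snippet below uses simulated reader scores and a plain reader-averaged AUC as a stand-in figure of merit; a real MRMC analysis would use an MRMC variance model, and this is not the authors' MI-MRMC implementation.

```python
# Sketch of the MI workflow behind MI-MRMC: impute missing reader scores
# several times, compute a figure of merit per completed dataset, then pool.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n_cases, n_readers = 200, 4
truth = rng.integers(0, 2, n_cases)
scores = truth[:, None] * 1.2 + rng.normal(0, 1, (n_cases, n_readers))
scores[rng.random((n_cases, n_readers)) < 0.15] = np.nan   # missing reads

m, aucs = 10, []
for k in range(m):
    completed = IterativeImputer(sample_posterior=True,
                                 random_state=k).fit_transform(scores)
    aucs.append(np.mean([roc_auc_score(truth, completed[:, r])
                         for r in range(n_readers)]))

pooled = np.mean(aucs)               # pooled point estimate (Rubin's rules)
between_var = np.var(aucs, ddof=1)   # between-imputation variance component
print(f"pooled reader-averaged AUC: {pooled:.3f} "
      f"(between-imputation variance {between_var:.5f})")
```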


Subject(s)
Computer Simulation , Humans , Research Design , Algorithms , Data Interpretation, Statistical
5.
Eur J Cancer ; 212: 114313, 2024 Sep 18.
Article in English | MEDLINE | ID: mdl-39305741

ABSTRACT

BACKGROUND: Patient-reported outcomes (PROs) play a crucial role in cancer clinical trials. Despite the availability of validated PRO measures (PROMs), challenges related to low completion rates and missing data remain, potentially affecting the validity of trial results. This review explored strategies to improve and maintain high PROM completion rates in cancer clinical trials. METHODOLOGY: A scoping review was performed across Medline, Embase, Scopus, and regulatory guidelines. Key recommendations were synthesized into categories such as stakeholder involvement, study design, PRO assessment, mode of assessment, participant support, and monitoring. RESULTS: The review identified 114 recommendations from 18 papers (16 peer-reviewed articles and 2 policy documents). The recommendations included integrating comprehensive PRO information into the study protocol, enhancing patient involvement during the protocol development phase and in education, and collecting relevant PRO data at clinically meaningful time points. Electronic data collection, effective monitoring systems, and sufficient time, capacity, workforce, and financial resources were highlighted. DISCUSSION: Further research is needed to evaluate the effectiveness of these strategies in various contexts and to translate these recommendations into practical and effective strategies. This would enhance PRO completion rates and patient-centred care. However, obstacles such as patient burden, low health literacy, and conflicting recommendations may present challenges in application.

6.
Genome Biol ; 25(1): 236, 2024 Sep 03.
Article in English | MEDLINE | ID: mdl-39227979

ABSTRACT

Missing covariate data is a common problem that has not been addressed in observational studies of gene expression. Here, we present a multiple imputation method that accommodates high-dimensional gene expression data by incorporating principal component analysis of the transcriptome into the multiple imputation prediction models to avoid bias. Simulation studies using three datasets show that this method outperforms complete case and single imputation analyses at uncovering true positive differentially expressed genes, limiting false discovery rates, and minimizing bias. The method is easily implemented via an R Bioconductor package, RNAseqCovarImpute, which integrates with the limma-voom pipeline for differential expression analysis.
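
The core idea of feeding transcriptome principal components into the imputation model for a missing covariate can be sketched in Python. The snippet below uses simulated expression values and hypothetical covariates (age, smoking); it illustrates the concept only and is not the RNAseqCovarImpute package, which is an R/Bioconductor implementation built around limma-voom.

```python
# Sketch: include principal components of the expression matrix as predictors
# when multiply imputing a missing covariate (concept illustration only).
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(4)
n_samples, n_genes = 120, 2000
expr = rng.normal(size=(n_samples, n_genes))            # stand-in for log-CPM
covars = pd.DataFrame({
    "age": rng.normal(40, 10, n_samples),
    "smoking": rng.integers(0, 2, n_samples).astype(float),
})
covars.loc[rng.random(n_samples) < 0.25, "smoking"] = np.nan  # missing covariate

# Transcriptome PCs carry expression-related signal into the imputation model.
pcs = PCA(n_components=10).fit_transform(expr)
design = np.column_stack([covars.to_numpy(), pcs])

imputations = []
for k in range(5):                                      # m = 5 imputations
    completed = IterativeImputer(sample_posterior=True,
                                 random_state=k).fit_transform(design)
    imputations.append(completed[:, 1])                 # imputed smoking column
# Each completed dataset would then feed a differential-expression analysis,
# with results pooled across imputations (Rubin's rules).
print("mean imputed smoking prevalence per dataset:",
      [round(col.mean(), 2) for col in imputations])
```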


Subject(s)
Software , Humans , Gene Expression Profiling/methods , Transcriptome , Principal Component Analysis , Sequence Analysis, RNA/methods
7.
Front Big Data ; 7: 1422650, 2024.
Article in English | MEDLINE | ID: mdl-39234189

ABSTRACT

Time series data are recorded in many sectors, generating large volumes of data. However, the continuity of these data is often interrupted, resulting in periods of missing values. Several algorithms are used to impute the missing data, and their performance varies widely. Apart from the choice of algorithm, effective imputation depends on the nature of the missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated missing data over different time spans and imputed them using different algorithms with binned data of different sizes. Performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data compared with the entire dataset, particularly for the expectation-maximization (EM) algorithm. RMSE was reduced when using binned data for 1-, 5-, and 15-min missing spans, with the greatest reduction observed for 15-min missing data. We also observed the effect of data fluctuation. We conclude that the usefulness of binned data depends on the span of missing data, the sampling frequency of the data, and the fluctuation within the data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binned data can be used to impute a wide variety of data, including biological heart rate data derived from Internet of Things (IoT) devices such as smartwatches and non-biological data such as household power consumption data.
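
The binned-versus-whole-series comparison can be illustrated with a toy example. The sketch below simulates heart-rate-like data, imputes a 15-minute gap from a local bin versus from the entire series, and scores both with RMSE; simple mean imputation is used as a stand-in for the EM algorithm evaluated in the paper, and all values are simulated.

```python
# Sketch: impute a contiguous gap using a local bin vs the whole series,
# and compare RMSE (mean imputation as a simple stand-in for EM).
import numpy as np

rng = np.random.default_rng(5)
t = np.arange(24 * 60)                                   # one day, 1-min sampling
series = 70 + 10 * np.sin(2 * np.pi * t / (24 * 60)) + rng.normal(0, 2, t.size)

gap = slice(600, 615)                                    # 15-minute gap
truth = series[gap].copy()
observed = series.copy()
observed[gap] = np.nan

def rmse(est):
    return float(np.sqrt(np.mean((est - truth) ** 2)))

whole_fill = np.nanmean(observed)                        # whole-series statistic
bin_mask = (t >= 540) & (t < 660)                        # 2-hour bin around gap
bin_fill = np.nanmean(observed[bin_mask])

print("RMSE, whole-series mean:", round(rmse(np.full(truth.shape, whole_fill)), 2))
print("RMSE, 2-hour bin mean:  ", round(rmse(np.full(truth.shape, bin_fill)), 2))
```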

8.
BMC Med Res Methodol ; 24(1): 194, 2024 Sep 06.
Article in English | MEDLINE | ID: mdl-39243025

ABSTRACT

BACKGROUND: Early identification of children at high risk of developing myopia is essential for introducing timely interventions to prevent myopia progression. However, missing data and measurement error (ME) are common challenges in risk prediction modelling that can introduce bias into myopia prediction. METHODS: We explore four imputation methods to address missing data and ME: single imputation (SI), multiple imputation under missing at random (MI-MAR), multiple imputation with a calibration procedure (MI-ME), and multiple imputation under missing not at random (MI-MNAR). We compare four machine-learning models (decision tree, naive Bayes, random forest, and XGBoost) and three statistical models (logistic regression, stepwise logistic regression, and least absolute shrinkage and selection operator logistic regression) for myopia risk prediction. We apply these models to the Shanghai Jinshan Myopia Cohort Study and also conduct a simulation study to investigate the impact of missing mechanisms, the degree of ME, and the importance of predictors on model performance. Model performance is evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). RESULTS: Our findings indicate that in scenarios with missing data and ME, using MI-ME in combination with logistic regression yields the best prediction results. In scenarios without ME, employing MI-MAR to handle missing data outperforms SI regardless of the missing mechanism. When ME has a greater impact on prediction than missing data, the relative advantage of MI-MAR diminishes and MI-ME becomes superior. Furthermore, our results demonstrate that statistical models exhibit better prediction performance than machine-learning models. CONCLUSION: MI-ME emerges as a reliable method for handling missing data and ME in important predictors for early-onset myopia risk prediction.
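
The evaluation loop behind such comparisons (handle missingness, fit a model, score with AUROC and AUPRC) can be sketched briefly. The snippet below uses simulated predictors and a logistic model, contrasting a single-imputation and an iterative (MI-like) strategy; it does not reproduce the cohort data or the calibration step for measurement error (MI-ME).

```python
# Sketch of the evaluation loop: imputation strategy -> logistic model ->
# AUROC/AUPRC on a held-out set (simulated data, illustration only).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(6)
n = 3000
X = rng.normal(size=(n, 5))                       # hypothetical predictors
y = (1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1]))) > rng.random(n)).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan             # 20% missing at random

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
for name, imputer in [("single (mean)", SimpleImputer()),
                      ("iterative (MI-like)", IterativeImputer(random_state=0))]:
    clf = LogisticRegression(max_iter=1000).fit(imputer.fit_transform(X_tr), y_tr)
    proba = clf.predict_proba(imputer.transform(X_te))[:, 1]
    print(f"{name:20s} AUROC={roc_auc_score(y_te, proba):.3f} "
          f"AUPRC={average_precision_score(y_te, proba):.3f}")
```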


Subject(s)
Machine Learning , Myopia , Humans , Myopia/diagnosis , Myopia/epidemiology , Female , Child , Male , Logistic Models , Models, Statistical , Risk Assessment/methods , Risk Assessment/statistics & numerical data , Risk Factors , ROC Curve , Bayes Theorem , China/epidemiology , Cohort Studies , Age of Onset
9.
BMC Med Res Methodol ; 24(1): 203, 2024 Sep 13.
Article in English | MEDLINE | ID: mdl-39272007

ABSTRACT

BACKGROUND: Evaluating outcome reliability is critical in real-world evidence studies. Overall survival is a common outcome in these studies; however, its capture in real-world data (RWD) sources is often incomplete and supplemented with linked mortality information from external sources. Conflicting recommendations exist for censoring overall survival in real-world evidence studies. This simulation study aimed to understand the impact of different censoring methods on estimating median survival and log hazard ratios when external mortality information is partially captured. METHODS: We used Monte Carlo simulation to emulate a non-randomized comparative effectiveness study of two treatments with RWD from electronic health records and linked external mortality data. We simulated the time to death, the time to last database activity, and the time to data cutoff. Death events after the last database activity were attributed to linked external mortality data and randomly set to missing to reflect the sensitivity of contemporary real-world data sources. Two censoring schemes were evaluated: (1) censoring at the last activity date and (2) censoring at the end of data availability (data cutoff) without an observed death. We assessed the performance of each method in estimating median survival and log hazard ratios using bias, coverage, variance, and rejection rate under varying amounts of incomplete mortality information and varying treatment effects, length of follow-up, and sample size. RESULTS: When mortality information was fully captured, median survival estimates were unbiased when censoring at data cutoff and underestimated when censoring at the last activity. When linked mortality information was missing, censoring at the last activity date underestimated the median survival, while censoring at the data cutoff overestimated it. As missing linked mortality information increased, bias decreased when censoring at the last activity date and increased when censoring at data cutoff. CONCLUSIONS: Researchers should consider the completeness of linked external mortality information when choosing how to censor the analysis of overall survival using RWD. Substantial bias in median survival estimates can occur if an inappropriate censoring scheme is selected. We advocate for RWD providers to perform validation studies of their mortality data and publish their findings to inform methodological decisions better.
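
The two censoring schemes compared above translate directly into how (time, event) pairs are built before a Kaplan-Meier or Cox analysis. The sketch below shows that construction on a toy table with hypothetical columns (death_time from linked mortality data, last_activity, cutoff); it is an illustration, not the simulation code from the study.

```python
# Sketch of the two censoring schemes: scheme 1 censors patients without an
# observed death at their last database activity; scheme 2 censors at data cutoff.
import pandas as pd

df = pd.DataFrame({
    "death_time":    [250.0, None, 400.0, None],   # from linked mortality data
    "last_activity": [240.0, 300.0, 180.0, 500.0],
    "cutoff":        [600.0, 600.0, 600.0, 600.0],
})

def scheme1(row):   # censor at last activity date
    if pd.notna(row.death_time):
        return pd.Series({"time": row.death_time, "event": 1})
    return pd.Series({"time": row.last_activity, "event": 0})

def scheme2(row):   # censor at end of data availability (data cutoff)
    if pd.notna(row.death_time):
        return pd.Series({"time": row.death_time, "event": 1})
    return pd.Series({"time": row.cutoff, "event": 0})

print(pd.concat([df.apply(scheme1, axis=1).add_prefix("s1_"),
                 df.apply(scheme2, axis=1).add_prefix("s2_")], axis=1))
```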


Subject(s)
Computer Simulation , Humans , Survival Analysis , Monte Carlo Method , Electronic Health Records/statistics & numerical data , Proportional Hazards Models , Reproducibility of Results , Mortality/trends
10.
Stat Med ; 2024 Sep 09.
Article in English | MEDLINE | ID: mdl-39248704

ABSTRACT

Analyzing longitudinal data in health studies is challenging due to sparse and error-prone measurements, strong within-individual correlation, missing data, and various trajectory shapes. While mixed-effect models (MM) effectively address these challenges, they remain parametric and may incur computational costs. In contrast, functional principal component analysis (FPCA) is a non-parametric approach developed for regular and dense functional data that flexibly describes temporal trajectories at a potentially lower computational cost. This article presents an empirical simulation study evaluating the behavior of FPCA with sparse and error-prone repeated measures and its robustness under different missing data schemes, in comparison with MM. The results show that FPCA is well suited in the presence of missing-at-random data caused by dropout, except in scenarios involving the most frequent and systematic dropout. Like MM, FPCA fails under a missing-not-at-random mechanism. FPCA was applied to describe the trajectories of four cognitive functions before clinical dementia and to contrast them with those of matched controls in a case-control study nested in a population-based aging cohort. The average cognitive decline of future dementia cases diverged suddenly from that of their matched controls, with a sharp acceleration 5 to 2.5 years prior to diagnosis.

11.
Behav Res Methods ; 2024 Sep 09.
Article in English | MEDLINE | ID: mdl-39251529

ABSTRACT

The selection of auxiliary variables is an important first step in appropriately implementing missing data methods such as full information maximum likelihood (FIML) estimation or multiple imputation. However, practical guidelines and statistical tests for selecting useful auxiliary variables are somewhat lacking, leading to potentially biased estimates. We propose the use of random forest analysis and lasso regression as alternative methods for selecting auxiliary variables, particularly in situations in which the missing data pattern is nonlinear or otherwise complex (i.e., involves interactive relationships between variables and missingness). Monte Carlo simulations demonstrate the effectiveness of random forest analysis and lasso regression compared with traditional methods (t-tests, Little's MCAR test, logistic regressions), in terms of both selecting auxiliary variables and the performance of those auxiliary variables when incorporated into an analysis with missing data. Both techniques outperformed traditional methods, providing a promising direction for improving practical methods for handling missing data in statistical analyses.
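
One way to operationalize this screening idea is to predict the missingness indicator from candidate auxiliary variables and rank them by importance. The sketch below does this with a random forest and an L1 (lasso) logistic regression on simulated data with an interactive missingness mechanism; variable names are hypothetical and the snippet is not the authors' simulation code.

```python
# Sketch: rank candidate auxiliary variables by how well they predict the
# missingness indicator, via random forest importances and lasso coefficients.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 2000
aux = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"aux{i}" for i in range(6)])
# Interactive missingness: the product aux0*aux1 drives whether data are missing.
p_miss = 1 / (1 + np.exp(-(1.5 * aux["aux0"] * aux["aux1"])))
missing = (rng.random(n) < p_miss).astype(int)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(aux, missing)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(
    StandardScaler().fit_transform(aux), missing)

ranking = pd.DataFrame({"rf_importance": rf.feature_importances_,
                        "lasso_coef": lasso.coef_[0]}, index=aux.columns)
print(ranking.sort_values("rf_importance", ascending=False))
```

In this interactive scenario the forest tends to surface aux0 and aux1 even though their main-effect lasso coefficients stay near zero, which mirrors the abstract's point about complex missingness patterns.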

12.
J Contam Hydrol ; 266: 104418, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39217676

ABSTRACT

Scarcity of stream salinity data poses a challenge to understanding salinity dynamics and its implications for water supply management in water-scarce salt-prone regions around the world. This paper introduces a framework for generating continuous daily stream salinity estimates using instance-based transfer learning (TL) and assessing the reliability of the synthetic salinity data through uncertainty quantification via prediction intervals (PIs). The framework was developed using two temporally distinct specific conductance (SC) datasets from the Upper Red River Basin (URRB) located in southwestern Oklahoma and the Texas Panhandle, United States. The instance-based TL approach was implemented by calibrating Feedforward Neural Networks (FFNNs) on a source SC dataset of around 1200 instantaneous grab samples collected by the United States Geological Survey (USGS) from 1959 to 1993. The trained FFNNs were subsequently tested on a target dataset (1998-present) of 220 instantaneous grab samples collected by the Oklahoma Water Resources Board (OWRB). The framework's generalizability was assessed in the data-rich Bird Creek watershed in Oklahoma by manipulating continuous SC data to simulate data-scarce conditions for training the models and using the complete Bird Creek dataset for model evaluation. The Lower Upper Bound Estimation (LUBE) method was used with FFNNs to estimate PIs for uncertainty quantification. Autoregressive SC prediction methods via FFNN were found to be reliable with Nash-Sutcliffe Efficiency (NSE) values of 0.65 and 0.45 on in-sample and out-of-sample test data, respectively. The same modeling scenario resulted in an NSE of 0.54 for the Bird Creek data using a similar missing data ratio, whereas a higher ratio of observed data increased the accuracy (NSE = 0.84). The relatively narrow estimated PIs for the North Fork Red River in the URRB indicated satisfactory stream salinity predictions, showing an average width equivalent to 25% of the observed range and a confidence level of 70%.
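
The source-to-target pattern described above can be sketched with a small feed-forward network and the Nash-Sutcliffe efficiency. The snippet below uses simulated predictors (flow, season) and specific-conductance values; it is a hedged illustration of the general idea, not the URRB models, and the LUBE prediction-interval step is not reproduced.

```python
# Sketch: fit a feed-forward network on a "source" SC dataset and score it on
# a sparser "target" dataset with Nash-Sutcliffe efficiency (simulated data).
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)

def make_dataset(n, noise):
    flow = rng.uniform(1, 100, n)                 # hypothetical predictors
    season = rng.uniform(0, 1, n)
    sc = 2000 / np.sqrt(flow) + 300 * season + rng.normal(0, noise, n)
    return np.column_stack([flow, season]), sc

X_src, y_src = make_dataset(1200, noise=40)       # historical grab samples
X_tgt, y_tgt = make_dataset(220, noise=60)        # recent, sparser samples

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                                   random_state=0)).fit(X_src, y_src)
pred = model.predict(X_tgt)

nse = 1 - np.sum((y_tgt - pred) ** 2) / np.sum((y_tgt - y_tgt.mean()) ** 2)
print(f"NSE on target data: {nse:.2f}")
```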


Subject(s)
Environmental Monitoring , Rivers , Salinity , Rivers/chemistry , Uncertainty , Oklahoma , Environmental Monitoring/methods , Texas , Neural Networks, Computer , Models, Theoretical
13.
BMC Med Res Methodol ; 24(1): 193, 2024 Sep 04.
Article in English | MEDLINE | ID: mdl-39232661

ABSTRACT

BACKGROUND: Missing data are common in observational studies and often occur in several of the variables required when estimating a causal effect, i.e. the exposure, outcome and/or variables used to control for confounding. Analyses involving multiple incomplete variables are not as straightforward as analyses with a single incomplete variable. For example, in the context of multivariable missingness, the standard missing data assumptions ("missing completely at random", "missing at random" [MAR], "missing not at random") are difficult to interpret and assess. It is not clear how the complexities that arise due to multivariable missingness are being addressed in practice. The aim of this study was to review how missing data are managed and reported in observational studies that use multiple imputation (MI) for causal effect estimation, with a particular focus on missing data summaries, missing data assumptions, primary and sensitivity analyses, and MI implementation. METHODS: We searched five top general epidemiology journals for observational studies that aimed to answer a causal research question and used MI, published between January 2019 and December 2021. Article screening and data extraction were performed systematically. RESULTS: Of the 130 studies included in this review, 108 (83%) derived an analysis sample by excluding individuals with missing data in specific variables (e.g., outcome) and 114 (88%) had multivariable missingness within the analysis sample. Forty-four (34%) studies provided a statement about missing data assumptions, 35 of which stated the MAR assumption, but only 11/44 (25%) studies provided a justification for these assumptions. The number of imputations, MI method and MI software were generally well-reported (71%, 75% and 88% of studies, respectively), while aspects of the imputation model specification were not clear for more than half of the studies. A secondary analysis that used a different approach to handle the missing data was conducted in 69/130 (53%) studies. Of these 69 studies, 68 (99%) lacked a clear justification for the secondary analysis. CONCLUSION: Effort is needed to clarify the rationale for and improve the reporting of MI for estimation of causal effects from observational data. We encourage greater transparency in making and reporting analytical decisions related to missing data.


Subject(s)
Observational Studies as Topic , Research Design , Causality , Data Interpretation, Statistical , Research Design/standards
14.
Struct Equ Modeling ; 31(5): 891-908, 2024.
Article in English | MEDLINE | ID: mdl-39308934

ABSTRACT

Dynamic structural equation modeling (DSEM) is a useful technique for analyzing intensive longitudinal data. A challenge in applying DSEM is the missing data problem. The impact of missing data on DSEM, especially on widely applied DSEMs such as two-level vector autoregressive (VAR) cross-lagged models, is nevertheless understudied. To fill this research gap, we evaluated how well the fixed effects and variance parameters in two-level bivariate VAR models are recovered under different missingness percentages, sample sizes, numbers of time points, and heterogeneity in missingness distributions through two simulation studies. To facilitate the use of DSEM under customized data and model scenarios (different from those in our simulations), we provide illustrative examples of how to conduct Monte Carlo simulations in Mplus to determine whether a data configuration is sufficient to obtain accurate and precise results from a specific DSEM.
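
The kind of data configuration such a Monte Carlo study generates (a two-level bivariate VAR(1) process with person-specific means and occasion-level missingness) can be sketched outside Mplus. The snippet below is a numpy illustration of the data-generation step only, with made-up parameter values; the paper's simulations and model fitting are done in Mplus.

```python
# Sketch: simulate two-level bivariate VAR(1) data with whole occasions missing,
# the data configuration a DSEM Monte Carlo study would generate.
import numpy as np

rng = np.random.default_rng(9)
n_people, n_times, p_miss = 100, 50, 0.2
Phi = np.array([[0.3, 0.1],            # within-person cross-lagged dynamics
                [0.15, 0.25]])

data = np.empty((n_people, n_times, 2))
for i in range(n_people):
    mu = rng.normal(0, 1, 2)            # level-2 random means (between-person)
    y = mu.copy()
    for t in range(n_times):
        y = mu + Phi @ (y - mu) + rng.normal(0, 1, 2)   # level-1 VAR(1) step
        data[i, t] = y

data[rng.random(data.shape[:2]) < p_miss] = np.nan      # whole occasions missing
print("simulated array:", data.shape,
      "| proportion missing:", round(np.isnan(data[..., 0]).mean(), 2))
```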

15.
J Am Stat Assoc ; 119(547): 2282-2293, 2024.
Article in English | MEDLINE | ID: mdl-39328784

ABSTRACT

In this paper, we investigate the Gaussian graphical model inference problem in a novel setting that we call erose measurements, referring to irregularly measured or observed data. For graphs, this results in different node pairs having vastly different sample sizes, which frequently arises in data integration, genomics, neuroscience, and sensor networks. Existing works characterize graph selection performance using the minimum pairwise sample size, which provides little insight for erosely measured data, and no existing inference method is applicable. We aim to fill this gap by proposing the first inference method that characterizes the different uncertainty levels over the graph caused by the erose measurements, named GI-JOE (Graph Inference when Joint Observations are Erose). Specifically, we develop an edge-wise inference method and an affiliated FDR control procedure, where the variance of each edge depends on the sample sizes associated with the corresponding neighbors. We prove statistical validity under erose measurements, thanks to careful localized edge-wise analysis and disentangling of the dependencies across the graph. Finally, through simulation studies and a real neuroscience data example, we demonstrate the advantages of our inference methods for graph selection from erosely measured data.

16.
HGG Adv ; 5(4): 100338, 2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39095990

ABSTRACT

Multivariable Mendelian randomization allows simultaneous estimation of direct causal effects of multiple exposure variables on an outcome. When the exposure variables of interest are quantitative omic features, obtaining complete data can be economically and technically challenging: the measurement cost is high, and the measurement devices may have inherent detection limits. In this paper, we propose a valid and efficient method to handle unmeasured and undetectable values of the exposure variables in a one-sample multivariable Mendelian randomization analysis with individual-level data. We estimate the direct causal effects with maximum likelihood estimation and develop an expectation-maximization algorithm to compute the estimators. We show the advantages of the proposed method through simulation studies and provide an application to the Hispanic Community Health Study/Study of Latinos, which has a large amount of unmeasured exposure data.

17.
Am J Epidemiol ; 2024 Aug 27.
Article in English | MEDLINE | ID: mdl-39191658

ABSTRACT

Auxiliary variables are used in multiple imputation (MI) to reduce bias and increase efficiency. These variables may often themselves be incomplete. We explored how missing data in auxiliary variables influence estimates obtained from MI. We implemented a simulation study with three different missing data mechanisms for the outcome. We then examined the impact of increasing proportions of missing data and different missingness mechanisms in the auxiliary variable on the bias of an unadjusted linear regression coefficient and on the fraction of missing information. We illustrate our findings with an applied example in the Avon Longitudinal Study of Parents and Children. We found that where complete records analyses were biased, increasing proportions of missing data in auxiliary variables, under any missing data mechanism, reduced the ability of MI with the auxiliary variable included to mitigate this bias. Where there was no bias in the complete records analysis, inclusion of a missing-not-at-random auxiliary variable in MI introduced bias of potentially important magnitude (up to 17% of the effect size in our simulation). Careful consideration of the quantity and nature of missing data in auxiliary variables is needed when selecting them for use in MI models.

18.
Article in English | MEDLINE | ID: mdl-39138951

ABSTRACT

IMPORTANCE: Scales often arise from multi-item questionnaires, yet commonly face item non-response. Traditional solutions use a weighted mean (WMean) of available responses, but this may overlook intricacies of the missing data. Advanced methods like multiple imputation (MI) address broader missing data problems but demand increased computational resources. Researchers frequently use survey data in the All of Us Research Program (All of Us), and it is imperative to determine whether the increased computational burden of employing MI to handle non-response is justifiable. OBJECTIVES: Using the 5-item Physical Activity Neighborhood Environment Scale (PANES) in All of Us, this study assessed the tradeoff between efficacy and computational demands of WMean, MI, and inverse probability weighting (IPW) when dealing with item non-response. MATERIALS AND METHODS: Synthetic missingness, allowing non-response in 1 or more items, was introduced into PANES across 3 missing mechanisms and various missing percentages (10%-50%). Each scenario compared the WMean of completed questions, MI, and IPW on bias, variability, coverage probability, and computation time. RESULTS: All methods showed minimal bias (all <5.5%) when internal consistency was good, with WMean suffering the most under poor consistency. IPW showed considerable variability as the missing percentage increased. MI required substantially more computational resources, taking >8000 and >100 times longer than WMean and IPW in the full data analysis, respectively. DISCUSSION AND CONCLUSION: The marginal performance advantages of MI for item non-response in highly reliable scales do not warrant its escalated cloud computational burden in All of Us, particularly when coupled with computationally demanding post-imputation analyses. Researchers using survey scales with low missingness could use WMean to reduce the computing burden.
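
The WMean (prorated) score recommended above for low-missingness, reliable scales is simple to compute: average the items a respondent did answer and rescale to the full item count. The sketch below shows this on a hypothetical 5-item response matrix; item names and values are made up and this is not All of Us data or the PANES scoring code.

```python
# Sketch of the WMean (prorated) scale score under item non-response.
import numpy as np
import pandas as pd

items = pd.DataFrame({
    "item1": [4, 3, np.nan, 2],
    "item2": [4, np.nan, 3, 2],
    "item3": [5, 3, 3, np.nan],
    "item4": [4, 4, np.nan, 2],
    "item5": [5, 3, 3, 1],
})

n_items = items.shape[1]
answered = items.notna().sum(axis=1)
# Prorated total: mean of the answered items times the number of items in the scale.
wmean_score = items.mean(axis=1, skipna=True) * n_items
print(pd.DataFrame({"answered": answered, "wmean_score": wmean_score}))
```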

19.
Sci Rep ; 14(1): 19268, 2024 Aug 20.
Article in English | MEDLINE | ID: mdl-39164405

ABSTRACT

Due to various unavoidable reasons or gross error elimination, missing data inevitably exist in global navigation satellite system (GNSS) position time series, which may render many analysis methods inapplicable. Typically, interpolating the missing data is a crucial preprocessing step before analyzing the time series. Conventional methods for filling missing data do not consider the influence of adjacent stations. In this work, an improved Gaussian process (GP) approach is developed to fill the missing data in GNSS time series, in which the time series of adjacent stations are used to construct impact factors; the approach is compared with the conventional GP and the commonly used cubic spline methods. For the simulation experiments, the root mean square error (RMSE), mean absolute error (MAE), and correlation coefficient (R) are adopted to evaluate the performance of the improved GP. The results show that the missing data filled by the improved GP are closer to the true values than those of the conventional GP and cubic spline methods, regardless of the missing percentage, which ranged from 5% to 30% in 5% increments. Specifically, the mean relative RMSE and MAE improvements of the improved GP with respect to the conventional GP are 21.2%, 21.3% and 8.3% (RMSE) and 12.7%, 16.2% and 11.01% (MAE) for the North (N), East (E) and Up (U) components, respectively. In the real experiment, eight GNSS stations are analyzed using the improved GP, together with the conventional GP and the cubic spline. The results indicate that the first three principal components (PCs) from the improved GP preserve 98.3%, 99.8% and 77.0% of the total variance for the N, E and U components, respectively, clearly higher than the values for the conventional GP and cubic spline. Therefore, we conclude that the improved GP fills missing data in GNSS position time series better than the conventional GP and cubic spline because it accounts for the impacts of adjacent stations.
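
The core idea of letting a neighbouring station inform the gap filling can be sketched with an off-the-shelf Gaussian process. The snippet below uses simulated daily position series and scikit-learn's GaussianProcessRegressor with time plus a neighbouring station's series as inputs; it is a stand-in illustration, not the authors' improved GP or their impact-factor construction.

```python
# Sketch: fill gaps in one station's series with a GP whose inputs are time
# plus a neighbouring station's series (simulated data, illustration only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(10)
t = np.arange(365.0)
common = 3 * np.sin(2 * np.pi * t / 365)                 # shared seasonal signal
target = common + rng.normal(0, 0.5, t.size)             # station with gaps
neighbor = common + rng.normal(0, 0.5, t.size)           # nearby station, complete

missing = rng.random(t.size) < 0.2                        # 20% of days missing
X = np.column_stack([t, neighbor])                        # time + adjacent station
gp = GaussianProcessRegressor(kernel=RBF(length_scale=30.0) + WhiteKernel(),
                              normalize_y=True)
gp.fit(X[~missing], target[~missing])
filled = gp.predict(X[missing])

rmse = np.sqrt(np.mean((filled - target[missing]) ** 2))
print(f"RMSE of filled values vs truth: {rmse:.2f}")
```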

20.
Int J Epidemiol ; 53(5)2024 Aug 14.
Article in English | MEDLINE | ID: mdl-39186942

ABSTRACT

MOTIVATION: The Peter Clark (PC) algorithm is a popular causal discovery method for learning causal graphs in a data-driven way. Until recently, existing PC algorithm implementations in R had important limitations regarding missing values, temporal structure, and mixed measurement scales (categorical/continuous), which are all common features of cohort data. The new R packages presented here, micd and tpc, fill these gaps. IMPLEMENTATION: micd and tpc are R packages. GENERAL FEATURES: The micd package provides add-on functionality for dealing with missing values to the existing pcalg R package, including methods for multiple imputation relying on the Missing At Random assumption. micd also allows for mixed measurement scales assuming conditional Gaussianity. The tpc package efficiently exploits temporal information in a way that results in more informative output that is less prone to statistical errors. AVAILABILITY: The tpc and micd packages are freely available on the Comprehensive R Archive Network (CRAN). Their source code is also available on GitHub (https://github.com/bips-hb/micd; https://github.com/bips-hb/tpc).


Subject(s)
Algorithms , Causality , Software , Humans , Cohort Studies , Data Interpretation, Statistical