Search | VHL Regional Portal

1.

Efficient use of binned data for imputing univariate time series data.

Darji, Jay; Biswas, Nupur; Padul, Vijay; Gill, Jaya; Kesari, Santosh; Ashili, Shashaanka.

Front Big Data ; 7: 1422650, 2024.

Article in English | MEDLINE | ID: mdl-39234189

ABSTRACT

Time series data are recorded in various sectors, resulting in a large amount of data. However, the continuity of these data is often interrupted, resulting in periods of missing data. Several algorithms are used to impute the missing data, and the performance of these methods is widely varied. Apart from the choice of algorithm, the effective imputation depends on the nature of missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated the missing data for different time spans and imputed using different algorithms with binned data of different sizes. The performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data compared to the entire dataset, particularly in the case of the expectation-maximization (EM) algorithm. We found that RMSE was reduced when using binned data for 1-, 5-, and 15-min missing data, with greater reduction observed for 15-min missing data. We also observed the effect of data fluctuation. We conclude that the usefulness of binned data depends precisely on the span of missing data, sampling frequency of the data, and fluctuation within data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binned data can impute a wide variety of data, including biological heart rate data derived from the Internet of Things (IoT) device smartwatch and non-biological data such as household power consumption data.

2.

RNAseqCovarImpute: a multiple imputation procedure that outperforms complete case and single imputation differential expression analysis.

Baker, Brennan H; Sathyanarayana, Sheela; Szpiro, Adam A; MacDonald, James W; Paquette, Alison G.

Genome Biol ; 25(1): 236, 2024 Sep 03.

Article in English | MEDLINE | ID: mdl-39227979

ABSTRACT

Missing covariate data is a common problem that has not been addressed in observational studies of gene expression. Here, we present a multiple imputation method that accommodates high dimensional gene expression data by incorporating principal component analysis of the transcriptome into the multiple imputation prediction models to avoid bias. Simulation studies using three datasets show that this method outperforms complete case and single imputation analyses at uncovering true positive differentially expressed genes, limiting false discovery rates, and minimizing bias. This method is easily implemented via an R Bioconductor package, RNAseqCovarImpute that integrates with the limma-voom pipeline for differential expression analysis.

Subject(s)

Software , Humans , Gene Expression Profiling/methods , Transcriptome , Principal Component Analysis , Sequence Analysis, RNA/methods

3.

Gaps in the usage and reporting of multiple imputation for incomplete data: findings from a scoping review of observational studies addressing causal questions.

Mainzer, Rheanna M; Moreno-Betancur, Margarita; Nguyen, Cattram D; Simpson, Julie A; Carlin, John B; Lee, Katherine J.

BMC Med Res Methodol ; 24(1): 193, 2024 Sep 04.

Article in English | MEDLINE | ID: mdl-39232661

ABSTRACT

BACKGROUND: Missing data are common in observational studies and often occur in several of the variables required when estimating a causal effect, i.e. the exposure, outcome and/or variables used to control for confounding. Analyses involving multiple incomplete variables are not as straightforward as analyses with a single incomplete variable. For example, in the context of multivariable missingness, the standard missing data assumptions ("missing completely at random", "missing at random" [MAR], "missing not at random") are difficult to interpret and assess. It is not clear how the complexities that arise due to multivariable missingness are being addressed in practice. The aim of this study was to review how missing data are managed and reported in observational studies that use multiple imputation (MI) for causal effect estimation, with a particular focus on missing data summaries, missing data assumptions, primary and sensitivity analyses, and MI implementation. METHODS: We searched five top general epidemiology journals for observational studies that aimed to answer a causal research question and used MI, published between January 2019 and December 2021. Article screening and data extraction were performed systematically. RESULTS: Of the 130 studies included in this review, 108 (83%) derived an analysis sample by excluding individuals with missing data in specific variables (e.g., outcome) and 114 (88%) had multivariable missingness within the analysis sample. Forty-four (34%) studies provided a statement about missing data assumptions, 35 of which stated the MAR assumption, but only 11/44 (25%) studies provided a justification for these assumptions. The number of imputations, MI method and MI software were generally well-reported (71%, 75% and 88% of studies, respectively), while aspects of the imputation model specification were not clear for more than half of the studies. A secondary analysis that used a different approach to handle the missing data was conducted in 69/130 (53%) studies. Of these 69 studies, 68 (99%) lacked a clear justification for the secondary analysis. CONCLUSION: Effort is needed to clarify the rationale for and improve the reporting of MI for estimation of causal effects from observational data. We encourage greater transparency in making and reporting analytical decisions related to missing data.

Subject(s)

Observational Studies as Topic , Humans , Observational Studies as Topic/methods , Observational Studies as Topic/statistics & numerical data , Data Interpretation, Statistical , Causality , Research Design/standards , Research Design/statistics & numerical data

4.

Stream salinity prediction in data-scarce regions: Application of transfer learning and uncertainty quantification.

Khodkar, Kasra; Mirchi, Ali; Nourani, Vahid; Kaghazchi, Afsaneh; Sadler, Jeffrey M; Mansaray, Abubakarr; Wagner, Kevin; Alderman, Phillip D; Taghvaeian, Saleh; Bailey, Ryan T.

J Contam Hydrol ; 266: 104418, 2024 Aug 26.

Article in English | MEDLINE | ID: mdl-39217676

ABSTRACT

Scarcity of stream salinity data poses a challenge to understanding salinity dynamics and its implications for water supply management in water-scarce salt-prone regions around the world. This paper introduces a framework for generating continuous daily stream salinity estimates using instance-based transfer learning (TL) and assessing the reliability of the synthetic salinity data through uncertainty quantification via prediction intervals (PIs). The framework was developed using two temporally distinct specific conductance (SC) datasets from the Upper Red River Basin (URRB) located in southwestern Oklahoma and Texas Panhandle, United States. The instance-based TL approach was implemented by calibrating Feedforward Neural Networks (FFNNs) on a source SC dataset of around 1200 instantaneous grab samples collected by United States Geological Survey (USGS) from 1959 to 1993. The trained FFNNs were subsequently tested on a target dataset (1998-present) of 220 instantaneous grab samples collected by the Oklahoma Water Resources Board (OWRB). The framework's generalizability was assessed in the data-rich Bird Creek watershed in Oklahoma by manipulating continuous SC data to simulate data-scarce conditions for training the models and using the complete Bird Creek dataset for model evaluation. The Lower Upper Bound Estimation (LUBE) method was used with FFNNs to estimate PIs for uncertainty quantification. Autoregressive SC prediction methods via FFNN were found to be reliable with Nash Sutcliffe Efficiency (NSE) values of 0.65 and 0.45 on in-sample and out-of-sample test data, respectively. The same modeling scenario resulted in an NSE of 0.54 for the Bird Creek data using a similar missing data ratio, whereas a higher ratio of observed data increased the accuracy (NSE = 0.84). The relatively narrow estimated PIs for the North Fork Red River in the URRB indicated satisfactory stream salinity predictions, showing an average width equivalent to 25 % of the observed range and a confidence level of 70 %.

5.

Novel logarithmic imputation procedures using multi auxiliary information under ranked set sampling.

Kumar, Anoop; Bhushan, Shashi; Emam, Walid; Tashkandy, Yusra; Khan, M J S.

Sci Rep ; 14(1): 18027, 2024 Aug 04.

Article in English | MEDLINE | ID: mdl-39098844

ABSTRACT

Ranked set sampling (RSS) is known to increase the efficiency of the estimators while comparing it with simple random sampling. The problem of missingness creates a gap in the information that needs to be addressed before proceeding for estimation. Negligible amount of work has been carried out to deal with missingness utilizing RSS. This paper proposes some logarithmic type methods of imputation for the estimation of population mean under RSS using auxiliary information. The properties of the suggested imputation procedures are examined. A simulation study is accomplished to show that the proposed imputation procedures exhibit better results in comparison to some of the existing imputation procedures. Few real applications of the proposed imputation procedures is also provided to generalize the simulation study.

6.

An improved Gaussian process for filling the missing data in GNSS position time series considering the influence of adjacent stations.

Qiu, Xiaomeng; Wang, Fengwei; Zhang, Qiuxi; Tao, Guoqiang; Zhou, Shijian.

Sci Rep ; 14(1): 19268, 2024 Aug 20.

Article in English | MEDLINE | ID: mdl-39164405

ABSTRACT

Due to various unavoidable reasons or gross error elimination, missing data inevitably exist in global navigation satellite system (GNSS) position time series, which may result in many analysis methods not being applicable. Typically, interpolating the missing data is a crucial preprocessing step before analyzing the time series. The conventional methods for filling missing data do not consider the influence of adjacent stations. In this work, an improved Gaussian process (GP) approach is developed to fill the missing data of GNSS time series, in which the time series of adjacent stations are applied to construct impact factors, together with a comparison of the conventional GP and the commonly used cubic spline methods. For the simulation experiments, the root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (R) are adopted to evaluate the performance of the improved GP. The results show that the filled missing data of the improved GP are closer to the true values than those of the conventional GP and cubic spline methods, regardless of the missing percentages ranging from 5 to 30%, with an interval of 5%. Specifically, the mean relative RMSE and MAE improvements for the improved GP with respect to the conventional GP are 21.2%, 21.3% and 8.3% and 12.7%, 16.2% and 11.01% for the North (N), East (E) and Up (U) components, respectively. In the real experiment, eight GNSS stations are analyzed using improved GP, together with conventional GP and a cubic spline. The results indicate that the first three principal components (PCs) of the improved GP can perverse 98.3%, 99.8% and 77.0% of the total variance for the N, E and U components, respectively. This value is obviously higher than those of the conventional GP and cubic spline. Therefore, we can conclude that the improved GP can better fill in the missing data in GNSS position time series than the conventional GP and cubic spline because of the impacts of adjacent stations.

7.

Balancing efficacy and computational burden: weighted mean, multiple imputation, and inverse probability weighting methods for item non-response in reliable scales.

Guide, Andrew; Garbett, Shawn; Feng, Xiaoke; Mapes, Brandy M; Cook, Justin; Sulieman, Lina; Cronin, Robert M; Chen, Qingxia.

J Am Med Inform Assoc ; 2024 Aug 13.

Article in English | MEDLINE | ID: mdl-39138951

ABSTRACT

IMPORTANCE: Scales often arise from multi-item questionnaires, yet commonly face item non-response. Traditional solutions use weighted mean (WMean) from available responses, but potentially overlook missing data intricacies. Advanced methods like multiple imputation (MI) address broader missing data, but demand increased computational resources. Researchers frequently use survey data in the All of Us Research Program (All of Us), and it is imperative to determine if the increased computational burden of employing MI to handle non-response is justifiable. OBJECTIVES: Using the 5-item Physical Activity Neighborhood Environment Scale (PANES) in All of Us, this study assessed the tradeoff between efficacy and computational demands of WMean, MI, and inverse probability weighting (IPW) when dealing with item non-response. MATERIALS AND METHODS: Synthetic missingness, allowing 1 or more item non-response, was introduced into PANES across 3 missing mechanisms and various missing percentages (10%-50%). Each scenario compared WMean of complete questions, MI, and IPW on bias, variability, coverage probability, and computation time. RESULTS: All methods showed minimal biases (all <5.5%) for good internal consistency, with WMean suffered most with poor consistency. IPW showed considerable variability with increasing missing percentage. MI required significantly more computational resources, taking >8000 and >100 times longer than WMean and IPW in full data analysis, respectively. DISCUSSION AND CONCLUSION: The marginal performance advantages of MI for item non-response in highly reliable scales do not warrant its escalated cloud computational burden in All of Us, particularly when coupled with computationally demanding post-imputation analyses. Researchers using survey scales with low missingness could utilize WMean to reduce computing burden.

8.

Multivariable Mendelian randomization with incomplete measurements on the exposure variables in the Hispanic Community Health Study/Study of Latinos.

Li, Yilun; Wong, Kin Yau; Howard, Annie Green; Gordon-Larsen, Penny; Highland, Heather M; Graff, Mariaelisa; North, Kari E; Downie, Carolina G; Avery, Christy L; Yu, Bing; Young, Kristin L; Buchanan, Victoria L; Kaplan, Robert; Hou, Lifang; Joyce, Brian Thomas; Qi, Qibin; Sofer, Tamar; Moon, Jee-Young; Lin, Dan-Yu.

HGG Adv ; 5(4): 100338, 2024 Aug 02.

Article in English | MEDLINE | ID: mdl-39095990

ABSTRACT

Multivariable Mendelian randomization allows simultaneous estimation of direct causal effects of multiple exposure variables on an outcome. When the exposure variables of interest are quantitative omic features, obtaining complete data can be economically and technically challenging: the measurement cost is high, and the measurement devices may have inherent detection limits. In this paper, we propose a valid and efficient method to handle unmeasured and undetectable values of the exposure variables in a one-sample multivariable Mendelian randomization analysis with individual-level data. We estimate the direct causal effects with maximum likelihood estimation and develop an expectation-maximization algorithm to compute the estimators. We show the advantages of the proposed method through simulation studies and provide an application to the Hispanic Community Health Study/Study of Latinos, which has a large amount of unmeasured exposure data.

9.

Deeply-Learned Generalized Linear Models with Missing Data.

Lim, David K; Rashid, Naim U; Oliva, Junier B; Ibrahim, Joseph G.

J Comput Graph Stat ; 33(2): 638-650, 2024.

Article in English | MEDLINE | ID: mdl-39184956

ABSTRACT

Deep Learning (DL) methods have dramatically increased in popularity in recent years, with significant growth in their application to various supervised learning problems. However, the greater prevalence and complexity of missing data in such datasets present significant challenges for DL methods. Here, we provide a formal treatment of missing data in the context of deeply learned generalized linear models, a supervised DL architecture for regression and classification problems. We propose a new architecture, dlglm, that is one of the first to be able to flexibly account for both ignorable and non-ignorable patterns of missingness in input features and response at training time. We demonstrate through statistical simulation that our method outperforms existing approaches for supervised learning tasks in the presence of missing not at random (MNAR) missingness. We conclude with a case study of the Bank Marketing dataset from the UCI Machine Learning Repository, in which we predict whether clients subscribed to a product based on phone survey data. Supplementary materials for this article are available online.

10.

Analyses Using Multiple Imputation Need to Consider Missing Data in Auxiliary Variables.

Madley-Dowd, Paul; Curnow, Elinor; Hughes, Rachael A; Cornish, Rosie; Tilling, Kate; Heron, Jon.

Am J Epidemiol ; 2024 Aug 27.

Article in English | MEDLINE | ID: mdl-39191658

ABSTRACT

Auxiliary variables are used in multiple imputation (MI) to reduce bias and increase efficiency. These variables may often themselves be incomplete. We explored how missing data in auxiliary variables influenced estimates obtained from MI. We implemented a simulation study with three different missing data mechanisms for the outcome. We then examined the impact of increasing proportions of missing data and different missingness mechanisms for the auxiliary variable on bias of an unadjusted linear regression coefficient and the fraction of missing information. We illustrate our findings with an applied example in the Avon Longitudinal Study of Parents and Children. We found that where complete records analyses were biased, increasing proportions of missing data in auxiliary variables, under any missing data mechanism, reduced the ability of MI including the auxiliary variable to mitigate this bias. Where there was no bias in the complete records analysis, inclusion of a missing not at random auxiliary variable in MI introduced bias of potentially important magnitude (up to 17% of the effect size in our simulation). Careful consideration of the quantity and nature of missing data in auxiliary variables needs to be made when selecting them for use in MI models.

11.

Software application profile: tpc and micd-R packages for causal discovery with incomplete cohort data.

Andrews, Ryan M; Bang, Christine W; Didelez, Vanessa; Witte, Janine; Foraita, Ronja.

Int J Epidemiol ; 53(5)2024 Aug 14.

Article in English | MEDLINE | ID: mdl-39186942

ABSTRACT

MOTIVATION: The Peter Clark (PC) algorithm is a popular causal discovery method to learn causal graphs in a data-driven way. Until recently, existing PC algorithm implementations in R had important limitations regarding missing values, temporal structure or mixed measurement scales (categorical/continuous), which are all common features of cohort data. The new R packages presented here, micd and tpc, fill these gaps. IMPLEMENTATION: micd and tpc packages are R packages. GENERAL FEATURES: The micd package provides add-on functionality for dealing with missing values to the existing pcalg R package, including methods for multiple imputations relying on the Missing At Random assumption. Also, micd allows for mixed measurement scales assuming conditional Gaussianity. The tpc package efficiently exploits temporal information in a way that results in a more informative output that is less prone to statistical errors. AVAILABILITY: The tpc and micd packages are freely available on the Comprehensive R Archive Network (CRAN). Their source code is also available on GitHub (https://github.com/bips-hb/micd; https://github.com/bips-hb/tpc).

Subject(s)

Algorithms , Causality , Software , Humans , Cohort Studies , Data Interpretation, Statistical

12.

Machine Learning-Based Risk Prediction of Discharge Status for Sepsis.

Cai, Kaida; Lou, Yuqing; Wang, Zhengyan; Yang, Xiaofang; Zhao, Xin.

Entropy (Basel) ; 26(8)2024 Jul 25.

Article in English | MEDLINE | ID: mdl-39202095

ABSTRACT

As a severe inflammatory response syndrome, sepsis presents complex challenges in predicting patient outcomes due to its unclear pathogenesis and the unstable discharge status of affected individuals. In this study, we develop a machine learning-based method for predicting the discharge status of sepsis patients, aiming to improve treatment decisions. To enhance the robustness of our analysis against outliers, we incorporate robust statistical methods, specifically the minimum covariance determinant technique. We utilize the random forest imputation method to effectively manage and impute missing data. For feature selection, we employ Lasso penalized logistic regression, which efficiently identifies significant predictors and reduces model complexity, setting the stage for the application of more complex predictive methods. Our predictive analysis incorporates multiple machine learning methods, including random forest, support vector machine, and XGBoost. We compare the prediction performance of these methods with Lasso penalized logistic regression to identify the most effective approach. Each method's performance is rigorously evaluated through ten iterations of 10-fold cross-validation to ensure robust and reliable results. Our comparative analysis reveals that XGBoost surpasses the other models, demonstrating its exceptional capability to navigate the complexities of sepsis data effectively.

13.

Factors Associated with Missing Sociodemographic Data in the IRIS® (Intelligent Research in Sight) Registry.

Ross, Connor; Ivanov, Alexander; Elze, Tobias; Miller, Joan W; Lum, Flora; Lorch, Alice C; Oke, Isdin.

Ophthalmol Sci ; 4(6): 100542, 2024.

Article in English | MEDLINE | ID: mdl-39139543

ABSTRACT

Purpose: To describe the prevalence of missing sociodemographic data in the IRIS® (Intelligent Research in Sight) Registry and to identify practice-level characteristics associated with missing sociodemographic data. Design: Cross-sectional study. Participants: All patients with clinical encounters at practices participating in the IRIS Registry prior to December 31, 2020. Methods: We describe geographic and temporal trends in the prevalence of missing data for each sociodemographic variable (age, sex, race, ethnicity, geographic location, insurance type, and smoking status). Each practice contributing data to the registry was categorized based on the number of patients, number of physicians, geographic location, patient visit frequency, and patient population demographics. Main Outcome Measures: Multivariable linear regression was used to describe the association of practice-level characteristics with missing patient-level sociodemographic data. Results: This study included the electronic health records of 66 477 365 patients receiving care at 3306 practices participating in the IRIS Registry. The median number of patients per practice was 11 415 (interquartile range: 5849-24 148) and the median number of physicians per practice was 3 (interquartile range: 1-7). The prevalence of missing patient sociodemographic data were 0.1% for birth year, 0.4% for sex, 24.8% for race, 30.2% for ethnicity, 2.3% for 3-digit zip code, 14.8% for state, 5.5% for smoking status, and 17.0% for insurance type. The prevalence of missing data increased over time and varied at the state-level. Missing race data were associated with practices that had fewer visits per patient (P < 0.001), cared for a larger nonprivately insured patient population (P = 0.001), and were located in urban areas (P < 0.001). Frequent patient visits were associated with a lower prevalence of missing race (P < 0.001), ethnicity (P < 0.001), and insurance (P < 0.001), but a higher prevalence of missing smoking status (P < 0.001). Conclusions: There are geographic and temporal trends in missing race, ethnicity, and insurance type data in the IRIS Registry. Several practice-level characteristics, including practice size, geographic location, and patient population, are associated with missing sociodemographic data. While the prevalence and patterns of missing data may change in future versions of the IRIS registry, there will remain a need to develop standardized approaches for minimizing potential sources of bias and ensure reproducibility across research studies. Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

14.

Phylogenomic analyses of Blattodea combining traditional methods, incremental tree-building, and quality-aware support.

Evangelista, Dominic A; Nelson, Dvorah; Kotyková Varadínová, Zuzana; Kotyk, Michael; Rousseaux, Nicolas; Shanahan, Tristan; Grandcolas, Phillippe; Legendre, Frédéric.

Mol Phylogenet Evol ; 200: 108177, 2024 Aug 13.

Article in English | MEDLINE | ID: mdl-39142526

ABSTRACT

Despite the many advances of the genomic era, there is a persistent problem in assessing the uncertainty of phylogenomic hypotheses. We see this in the recent history of phylogenetics for cockroaches and termites (Blattodea), where huge advances have been made, but there are still major inconsistencies between studies. To address this, we present a phylogenetic analysis of Blattodea that emphasizes identification and quantification of uncertainty. We analyze 1183 gene domains using three methods (multi-species coalescent inference, concatenation, and a supermatrix-supertree hybrid approach) and assess support for controversial relationships while considering data quality. The hybrid approach-here dubbed "tiered phylogenetic inference"-incorporates information about data quality into an incremental tree building framework. Leveraging this method, we are able to identify cases of low or misleading support that would not be possible otherwise, and explore them more thoroughly with follow-up tests. In particular, quality annotations pointed towards nodes with high bootstrap support that later turned out to have large ambiguities, sometimes resulting from low-quality data. We also clarify issues related to some recalcitrant nodes: Anaplectidae's placement lacks unbiased signal, Ectobiidae s.s. and Anaplectoideini need greater taxon sampling, the deepest relationships among most Blaberidae lack signal. As a result, several previous phylogenetic uncertainties are now closer to being resolved (e.g., African and Malagasy "Rhabdoblatta" spp. are the sister to all other Blaberidae, and Oxyhaloinae is sister to the remaining Blaberidae). Overall, we argue for more approaches to quantifying support that take data quality into account to uncover the nature of recalcitrant nodes.

15.

Bayesian Simultaneous Factorization and Prediction Using Multi-Omic Data.

Samorodnitsky, Sarah; Wendt, Chris H; Lock, Eric F.

Comput Stat Data Anal ; 1972024 Sep.

Article in English | MEDLINE | ID: mdl-38947282

ABSTRACT

Integrative factorization methods for multi-omic data estimate factors explaining biological variation. Factors can be treated as covariates to predict an outcome and the factorization can be used to impute missing values. However, no available methods provide a comprehensive framework for statistical inference and uncertainty quantification for these tasks. A novel framework, Bayesian Simultaneous Factorization (BSF), is proposed to decompose multi-omics variation into joint and individual structures simultaneously within a probabilistic framework. BSF uses conjugate normal priors and the posterior mode of this model can be estimated by solving a structured nuclear norm-penalized objective that also achieves rank selection and motivates the choice of hyperparameters. BSF is then extended to simultaneously predict a continuous or binary phenotype while estimating latent factors, termed Bayesian Simultaneous Factorization and Prediction (BSFP). BSF and BSFP accommodate concurrent imputation, i.e., imputation during the model-fitting process, and full posterior inference for missing data, including "blockwise" missingness. It is shown via simulation that BSFP is competitive in recovering latent variation structure, and demonstrate the importance of accounting for uncertainty in the estimated factorization within the predictive model. The imputation performance of BSF is examined via simulation under missing-at-random and missing-not-at-random assumptions. Finally, BSFP is used to predict lung function based on the bronchoalveolar lavage metabolome and proteome from a study of HIV-associated obstructive lung disease, revealing multi-omic patterns related to lung function decline and a cluster of patients with obstructive lung disease driven by shared metabolomic and proteomic abundance patterns.

16.

Missing data imputation using classification and regression trees.

Chen, Cheng-Yang; Chang, Yu-Wei.

PeerJ Comput Sci ; 10: e2119, 2024.

Article in English | MEDLINE | ID: mdl-38983189

ABSTRACT

Background: Missing data are common when analyzing real data. One popular solution is to impute missing data so that one complete dataset can be obtained for subsequent data analysis. In the present study, we focus on missing data imputation using classification and regression trees (CART). Methods: We consider a new perspective on missing data in a CART imputation problem and realize the perspective through some resampling algorithms. Several existing missing data imputation methods using CART are compared through simulation studies, and we aim to investigate the methods with better imputation accuracy under various conditions. Some systematic findings are demonstrated and presented. These imputation methods are further applied to two real datasets: Hepatitis data and Credit approval data for illustration. Results: The method that performs the best strongly depends on the correlation between variables. For imputing missing ordinal categorical variables, the rpart package with surrogate variables is recommended under correlations larger than 0 with missing completely at random (MCAR) and missing at random (MAR) conditions. Under missing not at random (MNAR), chi-squared test methods and the rpart package with surrogate variables are suggested. For imputing missing quantitative variables, the iterative imputation method is most recommended under moderate correlation conditions.

17.

Multiple Imputation with Factor Scores: A Practical Approach for Handling Simultaneous Missingness Across Items in Longitudinal Designs.

Li, Yanling; Oravecz, Zita; Ji, Linying; Chow, Sy-Miin.

Multivariate Behav Res ; : 1-29, 2024 Jul 12.

Article in English | MEDLINE | ID: mdl-38997153

ABSTRACT

Missingness in intensive longitudinal data triggered by latent factors constitute one type of nonignorable missingness that can generate simultaneous missingness across multiple items on each measurement occasion. To address this issue, we propose a multiple imputation (MI) strategy called MI-FS, which incorporates factor scores, lag/lead variables, and missing data indicators into the imputation model. In the context of process factor analysis (PFA), we conducted a Monte Carlo simulation study to compare the performance of MI-FS to listwise deletion (LD), MI with manifest variables (MI-MV, which implements MI on both dependent variables and covariates), and partial MI with MVs (PMI-MV, which implements MI on covariates and handles missing dependent variables via full-information maximum likelihood) under different conditions. Across conditions, we found MI-based methods overall outperformed the LD; the MI-FS approach yielded lower root mean square errors (RMSEs) and higher coverage rates for auto-regression (AR) parameters compared to MI-MV; and the PMI-MV and MI-MV approaches yielded higher coverage rates for most parameters except AR parameters compared to MI-FS. These approaches were also compared using an empirical example investigating the relationships between negative affect and perceived stress over time. Recommendations on when and how to incorporate factor scores into MI processes were discussed.

18.

The FACT-GP5 as a global tolerability measure: responsiveness and robustness to missing assessments.

Arizmendi, Cara; Zhu, Yanyan; Khan, Maryam; Gable, Jonathon; Reeve, Bryce B; King-Kallimanis, Bellinda; Bell, Jill.

Qual Life Res ; 2024 Jul 24.

Article in English | MEDLINE | ID: mdl-39046616

ABSTRACT

PURPOSE: The Functional Assessment of Cancer Therapy item (FACT-GP5) has the potential to provide an understanding of global treatment tolerability from the patient perspective. Longitudinal evaluations of the FACT-GP5 and challenges posed by data missing-not-at-random (MNAR) have not been explored. Robustness of the FACT-GP5 to missing data assumptions and the responsiveness of the FACT-GP5 to key side-effects are evaluated. METHODS: In a randomized, double-blind study (NCT00065325), postmenopausal women (n = 618) with hormone receptor-positive (HR+), advanced breast cancer received either fulvestrant or exemestane and completed FACT measures monthly for seven months. Cumulative link mixed models (CLMM) were fit to evaluate: (1) the trajectory of the FACT-GP5 and (2) the responsiveness of the FACT-GP5 to CTCAE grade, Eastern Cooperative Oncology Group (ECOG) Performance Status scale, and key side-effects from the FACT. Sensitivity analyses of the missing-at-random (MAR) assumption were conducted. RESULTS: Odds of reporting worse side-effect bother increased over time. There were positive within-person relationships between level of side-effect bother (FACT-GP5) and severity of other FACT items, as well as ECOG performance status and Common Terminology Criteria for Adverse Events (CTCAE) grade. The number of missing FACT-GP5 assessments impacted the trajectory of the FACT-GP5 but did not impact the relationships between the FACT-GP5 and other items (except for nausea [FACT-GP2]). CONCLUSIONS: Results support the responsiveness of the FACT-GP5. Generally speaking, the responsiveness of the FACT-GP5 is robust to missing assessments. Missingness should be considered, however, when evaluating change over time of the FACT-GP5. TRIAL REGISTRATION: NCT00065325. TRIAL REGISTRATION YEAR: 2003.

Researchers have been exploring the use of a single question, FACT-GP5 ("I am bothered by side effects of treatment"), as a quick way to learn about drug tolerability from the patients' perspective. This study explores if this single question can capture changes in tolerability during treatment, and if the assessment is missed by patients, whether that impacts the interpretation of tolerability. In our study, we found that the FACT-GP5 can be used to understand how tolerability changes during treatment. Missing assessments of the FACT-GP5 are important to account for when interpreting results. The FACT-GP5 may be a useful question for capturing the patient experience of drug tolerability.

19.

A novel deep machine learning algorithm with dimensionality and size reduction approaches for feature elimination: thyroid cancer diagnoses with randomly missing data.

Tutsoy, Onder; Sumbul, Hilmi Erdem.

Brief Bioinform ; 25(4)2024 May 23.

Article in English | MEDLINE | ID: mdl-39007597

ABSTRACT

Thyroid cancer incidences endure to increase even though a large number of inspection tools have been developed recently. Since there is no standard and certain procedure to follow for the thyroid cancer diagnoses, clinicians require conducting various tests. This scrutiny process yields multi-dimensional big data and lack of a common approach leads to randomly distributed missing (sparse) data, which are both formidable challenges for the machine learning algorithms. This paper aims to develop an accurate and computationally efficient deep learning algorithm to diagnose the thyroid cancer. In this respect, randomly distributed missing data stemmed singularity in learning problems is treated and dimensionality reduction with inner and target similarity approaches are developed to select the most informative input datasets. In addition, size reduction with the hierarchical clustering algorithm is performed to eliminate the considerably similar data samples. Four machine learning algorithms are trained and also tested with the unseen data to validate their generalization and robustness abilities. The results yield 100% training and 83% testing preciseness for the unseen data. Computational time efficiencies of the algorithms are also examined under the equal conditions.

Subject(s)

Algorithms , Deep Learning , Thyroid Neoplasms , Thyroid Neoplasms/diagnosis , Humans , Machine Learning , Cluster Analysis

20.

Novel approach exploring the correlation between presepsin and routine laboratory parameters using explainable artificial intelligence.

Jeong, Jae-Seung; Kang, Tak Ho; Ju, Hyunsu; Cho, Chi-Hyun.

Heliyon ; 10(13): e33826, 2024 Jul 15.

Article in English | MEDLINE | ID: mdl-39027625

ABSTRACT

Although presepsin, a crucial biomarker for the diagnosis and management of sepsis, has gained prominence in contemporary medical research, its relationship with routine laboratory parameters, including demographic data and hospital blood test data, remains underexplored. This study integrates machine learning with explainable artificial intelligence (XAI) to provide insights into the relationship between presepsin and these parameters. Advanced machine learning classifiers provide a multilateral view of data and play an important role in highlighting the interrelationships between presepsin and other parameters. XAI enhances analysis by ensuring transparency in the model's decisions, especially in selecting key parameters that significantly enhance classification accuracy. Utilizing XAI, this study successfully identified critical parameters that increased the predictive accuracy for sepsis patients, achieving a remarkable ROC AUC of 0.97 and an accuracy of 0.94. This breakthrough is possibly attributed to the comprehensive utilization of XAI in refining parameter selection, thus leading to these significant predictive metrics. The presence of missing data in datasets is another concern; this study addresses it by employing Extreme Gradient Boosting (XGBoost) to manage missing data, effectively mitigating potential biases while preserving both the accuracy and relevance of the results. The perspective of examining data from higher dimensions using machine learning transcends traditional observation and analysis. The findings of this study hold the potential to enhance patient diagnoses and treatment, underscoring the value of merging traditional research methods with advanced analytical tools.

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL