Results 1 - 20 of 389
1.
PeerJ Comput Sci ; 10: e2119, 2024.
Article in English | MEDLINE | ID: mdl-38983189

ABSTRACT

Background: Missing data are common when analyzing real data. One popular solution is to impute the missing values so that a complete dataset can be obtained for subsequent analysis. In the present study, we focus on missing data imputation using classification and regression trees (CART). Methods: We consider a new perspective on missing data in a CART imputation problem and realize this perspective through several resampling algorithms. Existing CART-based missing data imputation methods are compared through simulation studies, with the aim of identifying which methods achieve better imputation accuracy under various conditions. Systematic findings are presented, and the imputation methods are further applied to two real datasets, Hepatitis data and Credit approval data, for illustration. Results: The method that performs best strongly depends on the correlation between variables. For imputing missing ordinal categorical variables, the rpart package with surrogate variables is recommended when correlations are larger than 0 under missing completely at random (MCAR) and missing at random (MAR) conditions. Under missing not at random (MNAR), chi-squared test methods and the rpart package with surrogate variables are suggested. For imputing missing quantitative variables, the iterative imputation method is most recommended under moderate correlation conditions.
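
As a rough illustration of tree-based imputation of the kind compared above, the sketch below uses scikit-learn's experimental IterativeImputer with a CART-style DecisionTreeRegressor on synthetic data with MCAR missingness. The dataset, missingness rate, and tree settings are assumptions; the study itself works with R packages such as rpart and its own resampling algorithms, not this code.

```python
# Hypothetical sketch: iterative imputation with regression trees (CART-style),
# loosely analogous to the tree-based imputation methods compared in the abstract.
# scikit-learn's IterativeImputer is experimental and must be enabled explicitly.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=200)    # correlated column
mask = rng.random(X.shape) < 0.15                        # ~15% MCAR missingness
X_miss = X.copy()
X_miss[mask] = np.nan

imputer = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=5, random_state=0),
                           max_iter=10, random_state=0)
X_imp = imputer.fit_transform(X_miss)

rmse = np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2))
print(f"Imputation RMSE on masked entries: {rmse:.3f}")
```

As the abstract's central finding suggests, the accuracy of such imputation is sensitive to the correlation structure among the variables.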

2.
Chemosphere ; 363: 142697, 2024 Jun 24.
Article in English | MEDLINE | ID: mdl-38925515

ABSTRACT

The identification of arsenic (As)-contaminated areas is an important prerequisite for soil management and reclamation. Although previous studies have attempted to identify soil As contamination via machine learning (ML) methods combined with soil spectroscopy, they have ignored the rarity of As-contaminated soil samples, leading to an imbalanced learning problem. A novel ML framework was thus designed herein to solve the imbalance issue in identifying soil As contamination from soil visible and near-infrared spectra. Spectral preprocessing, imbalanced dataset resampling, and model comparisons were combined in the ML framework, and the optimal combination was selected based on the recall. In addition, Bayesian optimization was used to tune the model hyperparameters. The optimized model achieved recall, area under the curve, and balanced accuracy values of 0.83, 0.88, and 0.79, respectively, on the testing set. The recall was further improved to 0.87 with the threshold adjustment, indicating the model's excellent performance and generalization capability in classifying As-contaminated soil samples. The optimal model was applied to a global soil spectral dataset to predict areas at a high risk of soil As contamination on a global scale. The ML framework established in this study represents a milestone in the classification of soil As contamination and can serve as a valuable reference for contamination management in soil science.
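
The pipeline described above (spectral preprocessing, resampling, model comparison, Bayesian tuning, threshold adjustment) can be illustrated, in a much reduced form, by the hedged sketch below: SMOTE oversampling of a rare "contaminated" class, a random forest, and a lowered decision threshold to favor recall. The synthetic features, the choice of SMOTE, and the specific threshold are assumptions, and the Bayesian optimization step is omitted.

```python
# Hypothetical sketch: handle a rare "contaminated" class with SMOTE, then lower
# the decision threshold to trade precision for recall, as the abstract describes.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, weights=[0.9, 0.1],
                           random_state=0)              # stand-in for spectral features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_res, y_res)

proba = clf.predict_proba(X_te)[:, 1]
for threshold in (0.5, 0.35):                           # threshold-adjustment step
    pred = (proba >= threshold).astype(int)
    print(threshold, recall_score(y_te, pred), balanced_accuracy_score(y_te, pred))
```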

3.
Sensors (Basel) ; 24(12)2024 Jun 18.
Article in English | MEDLINE | ID: mdl-38931742

ABSTRACT

Corn (Zea mays L.) is the most abundant food/feed crop, making accurate yield estimation a critical data point for monitoring global food production. Sensors with varying spatial/spectral configurations have been used to develop corn yield models from intra-field (0.1 m ground sample distance (GSD)) to regional scales (>250 m GSD). Understanding the spatial and spectral dependencies of these models is imperative for interpreting results and for scaling and deploying models. We leveraged high-spatial-resolution hyperspectral data collected with an unmanned aerial system-mounted sensor (272 spectral bands from 0.4-1 µm at 0.063 m GSD) to estimate silage yield. We subjected our imagery to three band selection algorithms to quantitatively assess the applicability of spectral reflectance features to yield estimation. We then derived 11 spectral configurations, which were spatially resampled to multiple GSDs and applied to a support vector regression (SVR) yield estimation model. Results indicate that accuracy degrades above 4 m GSD across all configurations, and that a seven-band multispectral sensor sampling the red edge and multiple near-infrared bands resulted in higher accuracy in 90% of regression trials. These results bode well for our quest toward a definitive sensor definition for global corn yield modeling, with only temporal dependencies requiring additional investigation.
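
A hedged sketch of the two steps that matter most here, spatial resampling to a coarser GSD and SVR-based yield regression, is given below on a synthetic hyperspectral cube; the cube, the seven-band subset, and the simulated yields are assumptions rather than the study's data or code.

```python
# Hypothetical sketch: block-average a (synthetic) hyperspectral cube to a coarser
# GSD, then fit a support vector regression yield model on a small band subset.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
cube = rng.random((64, 64, 272))                 # rows x cols x bands (synthetic)

def block_average(img, factor):
    """Spatially resample by averaging non-overlapping factor x factor blocks."""
    r, c, b = img.shape
    return img[:r - r % factor, :c - c % factor, :] \
        .reshape(r // factor, factor, c // factor, factor, b).mean(axis=(1, 3))

coarse = block_average(cube, 8)                  # e.g. 0.063 m -> ~0.5 m GSD
X = coarse.reshape(-1, coarse.shape[-1])[:, :7]  # pretend seven-band multispectral subset
y = X @ rng.random(7) + 0.1 * rng.normal(size=len(X))   # synthetic silage yield

scores = cross_val_score(SVR(C=10.0, epsilon=0.1), X, y, cv=5, scoring="r2")
print("Mean CV R^2:", scores.mean().round(3))
```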

4.
Environ Sci Pollut Res Int ; 31(29): 42088-42110, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38862797

ABSTRACT

The temporal aspect of groundwater vulnerability to contaminants such as nitrate is often overlooked, assuming vulnerability has a static nature. This study bridges this gap by employing machine learning with the Detecting Breakpoints and Estimating Segments in Trend (DBEST) algorithm to reveal the underlying relationship between nitrate, water table, vegetation cover, and precipitation time series, which are related to agricultural activities and groundwater demand in a semi-arid region. The contamination probability of the Lenjanat Plain has been mapped by comparing random forest (RF), support vector machine (SVM), and K-nearest-neighbors (KNN) models, fed with 32 input variables (DEM-derived factors, physiography, distance and density maps, and time series data). Imbalanced learning and feature selection techniques were also investigated as supplementary methods, yielding four scenarios. Results showed that the RF model, integrated with forward sequential feature selection (SFS) and the SMOTE-Tomek resampling method, outperformed the other models (F1-score: 0.94, MCC: 0.83). The SFS techniques outperformed the other feature selection methods in enhancing model accuracy, at the cost of additional computation, and the cost-sensitive function proved more efficient in tackling imbalanced data than the other investigated methods. The DBEST method identified significant breakpoints within each time series dataset, revealing a clear association between agricultural practices along the Zayandehrood River and substantial nitrate contamination within the Lenjanat region. Additionally, the groundwater vulnerability maps created using the candidate RF model and an ensemble of the best RF, SVM, and KNN models predicted mid to high levels of vulnerability in the central parts and the downhill areas in the southwest.
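
The winning combination reported above (forward SFS, SMOTE-Tomek resampling, random forest) can be sketched as follows; the synthetic stand-in data, the number of selected features, and the forest size are assumptions, and the DBEST trend analysis is not reproduced.

```python
# Hypothetical sketch: forward sequential feature selection plus SMOTE-Tomek
# resampling feeding a random forest, mirroring the best-performing combination
# reported above (synthetic stand-in data, not the Lenjanat Plain variables).
from imblearn.combine import SMOTETomek
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=32, n_informative=10,
                           weights=[0.85, 0.15], random_state=0)   # 32 input variables
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
sfs = SequentialFeatureSelector(rf, n_features_to_select=10, direction="forward", cv=3)
sfs.fit(X_tr, y_tr)                              # forward SFS on the training split

X_res, y_res = SMOTETomek(random_state=0).fit_resample(sfs.transform(X_tr), y_tr)
rf.fit(X_res, y_res)                             # refit on the resampled, reduced data
pred = rf.predict(sfs.transform(X_te))
print("F1:", round(f1_score(y_te, pred), 2), "MCC:", round(matthews_corrcoef(y_te, pred), 2))
```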


Subject(s)
Environmental Monitoring , Groundwater , Machine Learning , Nitrates , Nitrates/analysis , Groundwater/chemistry , Iran , Environmental Monitoring/methods , Water Pollutants, Chemical/analysis , Support Vector Machine
5.
Sci Rep ; 14(1): 13097, 2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38849493

ABSTRACT

Customer churn remains a critical concern for businesses, highlighting the significance of retaining existing customers over acquiring new ones. Effective prediction of potential churners aids in devising robust retention policies and efficient customer management strategies. This study examines machine learning algorithms for predictive analysis in churn prediction, addressing the inherent challenge posed by diverse and imbalanced customer churn data distributions. It introduces a novel approach, the Ratio-based data balancing technique, which addresses data skewness as a pre-processing step, ensuring improved accuracy in predictive modelling. The study fills gaps in the existing literature by highlighting the effectiveness of ensemble algorithms and the critical role of data balancing techniques in optimizing churn prediction models, while avenues for further exploration remain. This work evaluates several machine learning algorithms (Perceptron, Multi-Layer Perceptron, Naive Bayes, Logistic Regression, K-Nearest Neighbour, and Decision Tree, alongside ensemble techniques such as Gradient Boosting and Extreme Gradient Boosting (XGBoost)) on balanced datasets obtained through the proposed Ratio-based data balancing technique and through commonly used data resampling. Results reveal that the proposed Ratio-based data balancing technique notably outperforms traditional over-sampling and under-sampling methods in churn prediction accuracy. Additionally, ensemble algorithms such as Gradient Boosting and XGBoost performed better than single methods. Evaluated in terms of accuracy, precision, recall, and F-score, these ensemble methods proved better suited to predicting customer churn; in particular, a 75:25 balancing ratio combined with XGBoost yielded the most promising results presented in this work.
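
The abstract does not specify how the Ratio-based balancing technique works, so the sketch below is only a rough stand-in: it undersamples the majority class to a fixed 75:25 ratio with imbalanced-learn and trains a gradient boosting classifier (scikit-learn's GradientBoostingClassifier is used here instead of XGBoost to avoid an extra dependency). The synthetic data and the choice of random undersampling are assumptions.

```python
# Hypothetical sketch: rebalance churn data to a fixed 75:25 majority:minority ratio
# (a stand-in for the paper's Ratio-based technique), then fit gradient boosting.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.93, 0.07],
                           random_state=0)              # churners are the minority class
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# sampling_strategy is the minority/majority ratio, so 25/75 yields a 75:25 class ratio.
rus = RandomUnderSampler(sampling_strategy=25 / 75, random_state=0)
X_bal, y_bal = rus.fit_resample(X_tr, y_tr)

clf = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```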

6.
Cortex ; 177: 130-149, 2024 May 28.
Article in English | MEDLINE | ID: mdl-38852224

ABSTRACT

Although event-related potential (ERP) research on language processing has capitalized on key, theoretically influential components such as the N400 and P600, their measurement properties-especially the variability in their temporal and spatial parameters-have rarely been examined. The current study examined the measurement properties of the N400 and P600 effects elicited by semantic and syntactic anomalies, respectively, during sentence processing. We used a bootstrap resampling procedure to randomly draw many thousands of resamples varying in sample size and stimulus count from a larger sample of 187 participants and 40 stimulus sentences of each type per condition. Our resampling investigation focused on three issues: (a) statistical power; (b) variability in the magnitudes of the effects; and (c) variability in the temporal and spatial profiles of the effects. At the level of grand averages, the N400 and P600 effects were both robust and substantial. However, across resamples, there was a high degree of variability in effect magnitudes, onset times, and scalp distributions, which may be greater than is currently appreciated in the literature, especially for the P600 effects. These results provide a useful basis for designing future studies using these two well-established ERP components. At the same time, the results also highlight challenges that need to be addressed in future research (e.g., how best to analyze the ERP data without engaging in such questionable research practices as p-hacking).
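
A hedged numpy sketch of the resampling logic is given below: draw many bootstrap resamples of participants and stimuli, and record how often a one-sample test on the subject-level effect reaches significance at each sample size. The simulated effect size, noise level, and test are assumptions, not the study's ERP data or analysis pipeline.

```python
# Hypothetical sketch: bootstrap resampling over participants and stimulus items to see
# how statistical power for a mean ERP effect varies with sample size and trial count.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subj, n_items = 187, 40
# Simulated single-item "N400 effect" amplitudes (anomalous minus control), in microvolts.
effects = rng.normal(loc=-1.5, scale=4.0, size=(n_subj, n_items))

def resample_power(sub_n, item_n, n_boot=2000, alpha=0.05):
    hits = 0
    for _ in range(n_boot):
        subs = rng.choice(n_subj, size=sub_n, replace=True)      # resample participants
        items = rng.choice(n_items, size=item_n, replace=True)   # resample stimuli
        subj_means = effects[np.ix_(subs, items)].mean(axis=1)
        if stats.ttest_1samp(subj_means, 0.0).pvalue < alpha:
            hits += 1
    return hits / n_boot

for sub_n, item_n in [(20, 20), (40, 40), (80, 40)]:
    print(sub_n, item_n, resample_power(sub_n, item_n))
```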

7.
Stat Med ; 43(14): 2783-2810, 2024 Jun 30.
Article in English | MEDLINE | ID: mdl-38705726

ABSTRACT

Propensity score matching is commonly used to draw causal inference from observational survival data. However, its asymptotic properties have yet to be established, and variance estimation is still open to debate. We derive the statistical properties of the propensity score matching estimator of the marginal causal hazard ratio based on matching with replacement and a fixed number of matches. We also propose a double-resampling technique for variance estimation that takes into account the uncertainty due to propensity score estimation prior to matching.


Subject(s)
Propensity Score , Proportional Hazards Models , Humans , Survival Analysis , Causality , Computer Simulation , Observational Studies as Topic/statistics & numerical data , Models, Statistical
8.
Diagnostics (Basel) ; 14(10)2024 May 08.
Article in English | MEDLINE | ID: mdl-38786282

ABSTRACT

Breast cancer is the most prevalent type of cancer in women. Risk factor assessment can aid in directing counseling regarding risk reduction and breast cancer surveillance. This research aims to (1) investigate the relationship between various risk factors and breast cancer incidence using the BCSC (Breast Cancer Surveillance Consortium) Risk Factor Dataset and create a prediction model for assessing the risk of developing breast cancer; (2) diagnose breast cancer using the Breast Cancer Wisconsin diagnostic dataset; and (3) analyze breast cancer survivability using the SEER (Surveillance, Epidemiology, and End Results) Breast Cancer Dataset. Applying resampling techniques to the training dataset before using various machine learning techniques can affect the performance of the classifiers. The three breast cancer datasets were examined using a variety of pre-processing approaches and classification models to assess their performance in terms of accuracy, precision, F1 scores, and other metrics. The PCA (principal component analysis) and resampling strategies produced remarkable results. For the BCSC Dataset, the Random Forest algorithm exhibited the best performance among the applied classifiers, with an accuracy of 87.53%. Among the resampling techniques applied to the training dataset for training the Random Forest classifier, Tomek Links exhibited the best test accuracy, at 87.47%. We compared all the models used with previously used techniques. After applying the resampling techniques, the test accuracy decreased even though the training accuracy increased. For the Breast Cancer Wisconsin diagnostic dataset, the K-Nearest Neighbor algorithm had the best accuracy on the original test set, at 94.71%, while the PCA test set exhibited 95.29% accuracy for detecting breast cancer. Using the SEER Dataset, this study also explores survival analysis, employing supervised and unsupervised learning approaches to offer insights into the variables affecting breast cancer survivability. This study emphasizes the significance of individualized approaches in the management and treatment of breast cancer by incorporating phenotypic variations and recognizing the heterogeneity of the disease. Through data-driven insights and advanced machine learning, this study contributes significantly to the ongoing efforts in breast cancer research, diagnostics, and personalized medicine.
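
A hedged sketch of one of the combinations discussed above, PCA followed by Tomek-link cleaning and a random forest, is shown below using imbalanced-learn's pipeline. The bundled Wisconsin diagnostic dataset merely stands in for the BCSC risk-factor data, and the component count and forest size are assumptions.

```python
# Hypothetical sketch: PCA + Tomek-link resampling + random forest in one pipeline.
# The bundled Wisconsin diagnostic data stand in for the BCSC risk-factor dataset.
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("pca", PCA(n_components=10, random_state=0)),
    ("tomek", TomekLinks()),                 # resampling is applied only during fit
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
])
pipe.fit(X_tr, y_tr)
pred = pipe.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred), "precision:", precision_score(y_te, pred))
```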

9.
Appl Radiat Isot ; 210: 111341, 2024 Aug.
Article in English | MEDLINE | ID: mdl-38744039

ABSTRACT

We developed a novel quadratic resampling method for summing γ-ray spectra with different calibration parameters. We investigated a long-term environmental background γ-ray spectrum by summing 114 spectra measured using a 30% HPGe detector between 2017 and 2021. Gain variations in different measurement periods shift γ-ray peak positions by fractions of the pulse-height bin size, up to around 2 keV. The resampling method was applied to measure low-level background peaks in the γ-ray spectrum over a wide energy range from 50 keV to 3 MeV. We additionally document temporal variations in the activities of major γ-ray peaks, such as ⁴⁰K (1461 keV), ²⁰⁸Tl (2615 keV), and other typical nuclides, along with contributions from cosmic rays. The normal distribution of γ-ray background count rates, as evidenced by quantile-quantile plots, indicates consistent data collection throughout the measurement period. Consequently, we assert that the quadratic resampling method for accumulating γ-ray spectra surpasses the linear method (Bossew, 2005) in several respects.
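
The quadratic scheme itself is not given in the abstract; as a hedged illustration of the underlying problem, the numpy sketch below rebins a spectrum recorded with a drifted energy calibration onto a common energy grid by interpolating cumulative counts, which conserves total counts but is only a linear-order scheme. The bin layout, gain drift, and Poisson counts are assumptions.

```python
# Hypothetical sketch: rebin a gamma-ray spectrum recorded with a drifted energy
# calibration onto a common energy axis by interpolating cumulative counts
# (a count-conserving, linear-order scheme; the paper's quadratic method is not given).
import numpy as np

def rebin_spectrum(counts, edges_in, edges_out):
    """Map per-bin counts from edges_in onto edges_out, conserving total counts."""
    cum = np.concatenate(([0.0], np.cumsum(counts)))   # cumulative counts at bin edges
    cum_out = np.interp(edges_out, edges_in, cum)      # cumulative counts at new edges
    return np.diff(cum_out)

edges_nominal = np.linspace(50.0, 3000.0, 4097)        # 4096 bins, 50 keV to 3 MeV
edges_shifted = edges_nominal * 1.0007                 # gain drift of roughly 2 keV at 3 MeV
rng = np.random.default_rng(0)
spec1 = rng.poisson(5.0, size=4096).astype(float)      # run with nominal calibration
spec2 = rng.poisson(5.0, size=4096).astype(float)      # run with shifted calibration

summed = spec1 + rebin_spectrum(spec2, edges_shifted, edges_nominal)
print("original total:", spec1.sum() + spec2.sum(), "summed-spectrum total:", summed.sum())
```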

10.
Ecology ; 105(5): e4302, 2024 May.
Article in English | MEDLINE | ID: mdl-38594213

ABSTRACT

Identifying the mechanisms underlying the changes in the distribution of species is critical to accurately predict how species have responded and will respond to climate change. Here, we take advantage of a late-1950s study on ant assemblages in a canyon near Boulder, Colorado, USA, to understand how and why species distributions have changed over a 60-year period. Community composition changed over 60 years with increasing compositional similarity among ant assemblages. Community composition differed significantly between the periods, with aspect and tree cover influencing composition. Species that foraged in broader temperature ranges became more widespread over the 60-year period. Our work highlights that shifts in community composition and biotic homogenization can occur even in undisturbed areas without strong habitat degradation. We also show the power of pairing historical and contemporary data and encourage more mechanistic studies to predict species changes under climate change.


Subject(s)
Ants , Ecosystem , Temperature , Ants/physiology , Animals , Colorado , Climate Change , Time Factors
11.
Entropy (Basel) ; 26(3)2024 Mar 02.
Article in English | MEDLINE | ID: mdl-38539740

ABSTRACT

The knowledge of the causal mechanisms underlying one single system may not be sufficient to answer certain questions. One can gain additional insights from comparing and contrasting the causal mechanisms underlying multiple systems and uncovering consistent and distinct causal relationships. For example, discovering common molecular mechanisms among different diseases can lead to drug repurposing. The problem of comparing causal mechanisms among multiple systems is non-trivial, since the causal mechanisms are usually unknown and need to be estimated from data. If we estimate the causal mechanisms from data generated by different systems and directly compare them (the naive method), the result can be sub-optimal. This is especially true if the data generated by the different systems differ substantially in sample size. In this case, the quality of the estimated causal mechanisms for the different systems will differ, which can in turn affect the accuracy of the similarities and differences among the systems estimated via the naive method. To mitigate this problem, we introduce the bootstrap estimation method and the equal-sample-size resampling estimation method for estimating the difference between causal networks. Both methods use resampling to assess the confidence of the estimation. We compared these methods with the naive method in a set of systematically simulated experimental conditions covering a variety of network structures and sample sizes, and using different performance metrics. We also evaluated these methods on various real-world biomedical datasets covering a wide range of data designs.
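
A hedged sketch of the equal-sample-size idea follows: repeatedly subsample the larger dataset down to the smaller one's size, estimate a network from each resample, and count how often each edge difference recurs. A sparse partial-correlation graph via GraphicalLassoCV stands in for a causal-discovery procedure; the data, edge threshold, and number of resamples are assumptions.

```python
# Hypothetical sketch: equal-sample-size resampling when comparing networks learned
# from two systems with very different sample sizes. A sparse partial-correlation
# graph (GraphicalLassoCV) stands in for a full causal-discovery procedure.
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
p = 6
A = rng.normal(size=(p, p))
cov = A @ A.T + p * np.eye(p)                          # shared covariance for both systems
X_small = rng.multivariate_normal(np.zeros(p), cov, size=150)    # system 1 (small n)
X_large = rng.multivariate_normal(np.zeros(p), cov, size=3000)   # system 2 (large n)

def edge_set(X):
    """Edges of a sparse partial-correlation graph estimated from X."""
    prec = GraphicalLassoCV().fit(X).precision_
    return (np.abs(prec) > 1e-4) & ~np.eye(p, dtype=bool)

edges_small = edge_set(X_small)
diff_freq = np.zeros((p, p))
for _ in range(20):                                    # resample system 2 down to n = 150
    idx = rng.choice(len(X_large), size=len(X_small), replace=False)
    diff_freq += edge_set(X_large[idx]) != edges_small

print("edge pairs differing in >50% of equal-size resamples:",
      int((diff_freq / 20 > 0.5).sum() // 2))
```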

12.
Proc Biol Sci ; 291(2018): 20240079, 2024 Mar 13.
Article in English | MEDLINE | ID: mdl-38471547

ABSTRACT

The fast rate of replacement of natural areas by expanding cities is a key threat to wildlife worldwide. Many wild species occur in cities, yet little is known on the dynamics of urban wildlife assemblages due to species' extinction and colonization that may occur in response to the rapidly evolving conditions within urban areas. Namely, species' ability to spread within urban areas, besides habitat preferences, is likely to shape the fate of species once they occur in a city. Here we use a long-term dataset on mammals occurring in one of the largest and most ancient cities in Europe to assess whether and how spatial spread and association with specific habitats drive the probability of local extinction within cities. Our analysis included mammalian records dating between years 1832 and 2023, and revealed that local extinctions in urban areas are biased towards species associated with wetlands and that were naturally rare within the city. Besides highlighting the role of wetlands within urban areas for conserving wildlife, our work also highlights the importance of long-term biodiversity monitoring in highly dynamic habitats such as cities, as a key asset to better understand wildlife trends and thus foster more sustainable and biodiversity-friendly cities.


Subject(s)
Ecosystem , Wetlands , Animals , Cities , Mammals , Biodiversity , Animals, Wild
13.
Stat Med ; 43(9): 1804-1825, 2024 Apr 30.
Article in English | MEDLINE | ID: mdl-38356231

ABSTRACT

Statistical data simulation is essential in the development of statistical models and methods as well as in their performance evaluation. To capture complex data structures, in particular for high-dimensional data, a variety of simulation approaches have been introduced including parametric and the so-called plasmode simulations. While there are concerns about the realism of parametrically simulated data, it is widely claimed that plasmodes come very close to reality with some aspects of the "truth" known. However, there are no explicit guidelines or state-of-the-art on how to perform plasmode data simulations. In the present paper, we first review existing literature and introduce the concept of statistical plasmode simulation. We then discuss advantages and challenges of statistical plasmodes and provide a step-wise procedure for their generation, including key steps to their implementation and reporting. Finally, we illustrate the concept of statistical plasmodes as well as the proposed plasmode generation procedure by means of a public real RNA data set on breast carcinoma patients.
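
As a hedged illustration of the basic plasmode idea described above, the sketch below resamples real covariates with replacement and regenerates the outcome from a model fitted to the real data, with one coefficient fixed at a known value. The dataset, the fitted logistic model, and the injected effect are all assumptions and do not reproduce the paper's proposed step-wise procedure.

```python
# Hypothetical sketch: a minimal statistical plasmode simulation. Real covariates are
# resampled with replacement and outcomes are regenerated from a model fitted to the
# real data, with one coefficient fixed at a known "true" value to be recovered later.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)             # stand-in for an RNA dataset
X = StandardScaler().fit_transform(X)
base = LogisticRegression(max_iter=5000).fit(X, y)     # model of the real data

rng = np.random.default_rng(0)
coef = base.coef_.ravel().copy()
coef[0] = 0.5                                          # injected, known effect of feature 0

def plasmode_dataset(n=400):
    idx = rng.choice(len(X), size=n, replace=True)     # keep the real covariate structure
    X_b = X[idx]
    probs = 1.0 / (1.0 + np.exp(-(X_b @ coef + base.intercept_[0])))
    return X_b, rng.binomial(1, probs)                 # simulated outcome with known truth

X_sim, y_sim = plasmode_dataset()
print(X_sim.shape, y_sim.mean())
```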


Subject(s)
Models, Statistical , Humans , Computer Simulation
14.
Stat Med ; 43(10): 1849-1866, 2024 May 10.
Article in English | MEDLINE | ID: mdl-38402907

ABSTRACT

Several methods in survival analysis are based on the proportional hazards assumption. However, this assumption is very restrictive and often not justifiable in practice. Therefore, effect estimands that do not rely on the proportional hazards assumption are highly desirable in practical applications. One popular example is the restricted mean survival time (RMST). It is defined as the area under the survival curve up to a prespecified time point and thus summarizes the survival curve into a meaningful estimand. For two-sample comparisons based on the RMST, previous research found that the asymptotic test inflates the type I error in small samples, and a two-sample permutation test has therefore already been developed. The first goal of the present paper is to further extend the permutation test to general factorial designs and general contrast hypotheses by considering a Wald-type test statistic and its asymptotic behavior. Additionally, a groupwise bootstrap approach is considered. Moreover, when a global test detects a significant difference by comparing the RMSTs of more than two groups, it is of interest which specific RMST differences cause the result; global tests, however, do not provide this information. Therefore, multiple tests for the RMST are developed in a second step to infer several null hypotheses simultaneously. In doing so, the asymptotically exact dependence structure between the local test statistics is incorporated to gain more power. Finally, the small sample performance of the proposed global and multiple testing procedures is analyzed in simulations and illustrated in a real data example.


Subject(s)
Research Design , Humans , Survival Rate , Survival Analysis , Proportional Hazards Models
15.
Open Forum Infect Dis ; 11(2): ofad659, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38328495

ABSTRACT

Background: The conventional diagnostic for Schistosoma mansoni infection is stool microscopy with the Kato-Katz (KK) technique to detect eggs. Its outcomes are highly variable on a day-to-day basis and may lead to biased estimates of community infection used to inform public health programs. Our goal is to develop a resampling method that leverages data from a large-scale randomized trial to accurately predict community infection. Methods: We developed a resampling method that provides unbiased community estimates of prevalence, intensity and other statistics for S mansoni infection when a community survey is conducted using KK stool microscopy with a single sample per host. It leverages a large-scale data set, collected in the Schistosomiasis Consortium for Operational Research and Evaluation (SCORE) project, and allows linking single-stool specimen community screening to its putative multiday "true statistics." Results: SCORE data analysis reveals the limited sensitivity of KK stool microscopy and systematic bias of single-day community testing versus multiday testing; for prevalence estimate, it can fall up to 50% below the true value. The proposed SCORE cluster method reduces systematic bias and brings the estimated prevalence values within 5%-10% of the true value. This holds for a broad swath of transmission settings, including SCORE communities, and other data sets. Conclusions: Our SCORE cluster method can markedly improve the S mansoni prevalence estimate in settings using stool microscopy.

16.
BMC Health Serv Res ; 24(1): 37, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-38183029

ABSTRACT

BACKGROUND: No-show to medical appointments has significant adverse effects on healthcare systems and their clients. Using machine learning to predict no-shows allows managers to implement strategies such as overbooking and reminders targeting the patients most likely to miss appointments, optimizing the use of resources. METHODS: In this study, we proposed a detailed analytical framework for predicting no-shows while addressing imbalanced datasets. The framework includes a novel use of z-fold cross-validation performed twice during the modeling process to improve model robustness and generalization. We also introduce Symbolic Regression (SR) as a classification algorithm and Instance Hardness Threshold (IHT) as a resampling technique, and compare their performance with that of other classification algorithms, such as K-Nearest Neighbors (KNN) and Support Vector Machine (SVM), and resampling techniques, such as Random Under-Sampling (RUS), the Synthetic Minority Oversampling Technique (SMOTE), and NearMiss-1. We validated the framework using two attendance datasets from Brazilian hospitals with no-show rates of 6.65% and 19.03%. RESULTS: From the academic perspective, our study is the first to propose using SR and IHT to predict patient no-shows. Our findings indicate that SR and IHT presented superior performance compared to the other techniques, particularly IHT, which excelled when combined with all classification algorithms and led to low variability in the performance metrics. Our results also outperformed the sensitivity outcomes reported in the literature, with values above 0.94 for both datasets. CONCLUSION: This is the first study to use SR and IHT methods to predict patient no-shows and the first to propose performing z-fold cross-validation twice. Our study highlights the importance of not relying on only a few validation runs for imbalanced datasets, as this may lead to biased results and an inadequate analysis of the generalization and stability of the models obtained during the training stage.
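
A hedged sketch of the IHT resampling step is shown below using imbalanced-learn's InstanceHardnessThreshold paired with a KNN classifier. The synthetic data, the roughly 7% no-show rate, and the random-forest hardness estimator are assumptions, and the symbolic regression and double z-fold cross-validation components of the framework are not reproduced.

```python
# Hypothetical sketch: Instance Hardness Threshold (IHT) undersampling followed by a
# KNN classifier, on synthetic data with roughly a 7% positive (no-show) rate.
from imblearn.under_sampling import InstanceHardnessThreshold
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=8000, n_features=15, weights=[0.93, 0.07],
                           random_state=0)             # stand-in for appointment features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# IHT keeps the majority-class samples that a probabilistic classifier finds easy to
# classify, dropping the hardest ones to rebalance the training data.
iht = InstanceHardnessThreshold(estimator=RandomForestClassifier(random_state=0),
                                random_state=0)
X_res, y_res = iht.fit_resample(X_tr, y_tr)

knn = KNeighborsClassifier(n_neighbors=7).fit(X_res, y_res)
print("sensitivity (recall on no-shows):", round(recall_score(y_te, knn.predict(X_te)), 3))
```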


Subject(s)
Algorithms , Benchmarking , Humans , Brazil , Machine Learning , Decision Support Techniques
17.
Regen Biomater ; 11: rbad082, 2024.
Article in English | MEDLINE | ID: mdl-38213739

ABSTRACT

Biomaterials with surface nanostructures effectively enhance protein secretion and stimulate tissue regeneration. When nanoparticles (NPs) enter the living system, they quickly interact with proteins in the body fluid, forming the protein corona (PC). The accurate prediction of the PC composition is critical for analyzing the osteoinductivity of biomaterials and guiding the reverse design of NPs. However, achieving accurate predictions remains a significant challenge. Although several machine learning (ML) models like Random Forest (RF) have been used for PC prediction, they often fail to consider the extreme values in the abundance region of PC adsorption and struggle to improve accuracy due to the imbalanced data distribution. In this study, resampling embedding was introduced to resolve the issue of imbalanced distribution in PC data. Various ML models were evaluated, and the RF model was finally used for prediction, yielding good correlation coefficient (R²) and root-mean-square error (RMSE) values. Our ablation experiments demonstrated that the proposed method achieved an R² of 0.68, an improvement of approximately 10%, and an RMSE of 0.90, a reduction of approximately 10%. Furthermore, in a verification based on label-free quantification of four NPs, hydroxyapatite (HA), titanium dioxide (TiO₂), silicon dioxide (SiO₂), and silver (Ag), we achieved a prediction performance with an R² value >0.70 using Random Oversampling. Additionally, the feature analysis revealed that the composition of the PC is most significantly influenced by the incubation plasma concentration, the PDI, and the surface modification.

18.
Behav Res Methods ; 56(2): 750-764, 2024 Feb.
Article in English | MEDLINE | ID: mdl-36814007

ABSTRACT

Mediation analysis in repeated measures studies can shed light on the mechanisms through which experimental manipulations change the outcome variable. However, the literature on interval estimation for the indirect effect in the 1-1-1 single mediator model is sparse. Most simulation studies to date evaluating mediation analysis in multilevel data considered scenarios that do not match the expected numbers of level 1 and level 2 units typically encountered in experimental studies, and no study to date has compared resampling and Bayesian methods for constructing intervals for the indirect effect in this context. We conducted a simulation study to compare statistical properties of interval estimates of the indirect effect obtained using four bootstrap and two Bayesian methods in the 1-1-1 mediation model with and without random effects. Bayesian credibility intervals had coverage closest to the nominal value and no instances of excessive Type I error rates, but lower power than resampling methods. Findings indicated that the pattern of performance for resampling methods often depended on the presence of random effects. We provide suggestions for selecting an interval estimator for the indirect effect depending on the most important statistical property for a given study, as well as code in R for implementing all methods evaluated in the simulation study. Findings and code from this project will hopefully support the use of mediation analysis in experimental research with repeated measures.
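
A hedged sketch of a percentile bootstrap interval for an indirect effect a*b in a simple single-level mediation model is given below; the multilevel 1-1-1 structure, the random effects, and the Bayesian alternatives studied in the paper are not reproduced, and the simulated path coefficients are assumptions.

```python
# Hypothetical sketch: percentile bootstrap CI for the indirect effect a*b in a simple
# single-level mediation model (X -> M -> Y), estimated by ordinary least squares.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)                       # path a = 0.5
y = 0.4 * m + 0.2 * x + rng.normal(size=n)             # path b = 0.4, direct effect 0.2

def indirect(x, m, y):
    a = np.polyfit(x, m, 1)[0]                         # slope of M on X
    design = np.column_stack([np.ones_like(x), m, x])
    b = np.linalg.lstsq(design, y, rcond=None)[0][1]   # slope of Y on M, adjusting for X
    return a * b

boot = np.empty(5000)
for i in range(boot.size):                             # case resampling with replacement
    idx = rng.integers(0, n, size=n)
    boot[i] = indirect(x[idx], m[idx], y[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"indirect effect = {indirect(x, m, y):.3f}, 95% percentile CI = [{lo:.3f}, {hi:.3f}]")
```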


Subject(s)
Mediation Analysis , Models, Statistical , Humans , Bayes Theorem , Computer Simulation , Multilevel Analysis
19.
Toxics ; 11(12)2023 Nov 23.
Article in English | MEDLINE | ID: mdl-38133356

ABSTRACT

Many countries have attempted to mitigate and manage issues related to harmful algal blooms (HABs) by monitoring and predicting their occurrence. The infrequency and duration of HAB occurrences pose the challenge of data imbalance when constructing machine learning models for their prediction. Furthermore, the appropriate selection of input variables is a significant issue because of the complex relationships between the input and output variables. Therefore, the objective of this study was to improve the predictive performance of HAB models using feature selection and data resampling. Data resampling was used to address the imbalance in the minority class data. Two machine learning models were constructed to predict algal alert levels using 10 years of meteorological, hydrodynamic, and water quality data. The improvement in model accuracy due to changes in resampling methods was more noticeable than that due to changes in feature selection methods. Models constructed using combinations of original and synthetic data, across all resampling methods, demonstrated higher prediction performance for the caution level (L1) and warning level (L2) than models constructed using the original data alone. In particular, the optimal artificial neural network and random forest models constructed using combinations of original and synthetic data showed significantly improved prediction accuracy for L1 and L2, which represent the transition from the normal state to bloom formation, in both the training and testing steps. The test results of the optimal RF model using the original data indicated prediction accuracies of 98.8% for L0, 50.0% for L1, and 50.0% for L2. In contrast, the optimal random forest model using the Synthetic Minority Oversampling Technique combined with Edited Nearest Neighbours (SMOTE-ENN) sampling method achieved accuracies of 85.0% for L0, 85.7% for L1, and 100% for L2. Therefore, applying synthetic data can address the imbalance in the observed data and improve the detection performance of machine learning models. Reliable predictions using improved models can support the design of management practices to mitigate HABs in reservoirs and ultimately ensure safe and clean water resources.
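
A hedged sketch of the SMOTE-ENN resampling step on a synthetic three-class alert-level problem is shown below; the actual meteorological, hydrodynamic, and water-quality features, the feature selection step, and the neural-network model are not reproduced, and the class proportions are assumptions.

```python
# Hypothetical sketch: SMOTE-ENN resampling of an imbalanced three-class problem
# (L0 normal, L1 caution, L2 warning) before training a random forest.
from collections import Counter
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=12, n_informative=6,
                           n_classes=3, weights=[0.90, 0.07, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_tr, y_tr)
print("class counts before/after:", Counter(y_tr), Counter(y_res))

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_res, y_res)
print(classification_report(y_te, rf.predict(X_te), digits=3))
```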

20.
Mathematics (Basel) ; 11(3)2023 Feb.
Article in English | MEDLINE | ID: mdl-37990696

ABSTRACT

High-dimensional data applications often entail the use of various statistical and machine-learning algorithms to identify an optimal signature based on biomarkers and other patient characteristics that predicts the desired clinical outcome in biomedical research. Both the composition and the predictive performance of such biomarker signatures are critical in various biomedical research applications. In the presence of a large number of features, however, a conventional regression analysis approach fails to yield a good prediction model. A widely used remedy is to introduce regularization when fitting the relevant regression model. In particular, an L1 penalty on the regression coefficients is extremely useful, and very efficient numerical algorithms have been developed for fitting such models with different types of responses. This L1-based regularization tends to generate a parsimonious prediction model with promising prediction performance, i.e., feature selection is achieved along with construction of the prediction model. The variable selection, and hence the composition of the signature, as well as the prediction performance of the model depend on the choice of the penalty parameter used in the L1 regularization. The penalty parameter is often chosen by K-fold cross-validation. However, such an algorithm tends to be unstable and may yield very different choices of the penalty parameter across multiple runs on the same dataset. In addition, the predictive performance estimates from the internal cross-validation procedure in this algorithm tend to be inflated. In this paper, we propose a Monte Carlo approach to improve the robustness of regularization parameter selection, along with an additional cross-validation wrapper for objectively evaluating the predictive performance of the final model. We demonstrate the improvements via simulations and illustrate the application via a real dataset.
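
A hedged sketch of the general idea follows: repeat K-fold cross-validation for the lasso penalty over many random partitions, take an aggregate (here the median) penalty, and then judge the resulting model with a separate outer cross-validation wrapper. The paper's exact Monte Carlo scheme may differ, and the data, repetition count, and aggregation rule are assumptions.

```python
# Hypothetical sketch: stabilize L1 penalty selection by repeating K-fold CV over
# many random partitions (Monte Carlo), then judge the final model with an outer
# cross-validation wrapper so performance is not taken from the inner CV itself.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=150, n_features=500, n_informative=10,
                       noise=5.0, random_state=0)        # high-dimensional setting

alphas = []
for seed in range(25):                                   # Monte Carlo repetitions
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    alphas.append(LassoCV(cv=cv, random_state=seed).fit(X, y).alpha_)
alpha_mc = float(np.median(alphas))                      # more robust penalty choice

outer = cross_val_score(Lasso(alpha=alpha_mc), X, y,
                        cv=KFold(5, shuffle=True, random_state=123), scoring="r2")
print(f"median alpha = {alpha_mc:.3f}, outer-CV R^2 = {outer.mean():.3f}")
```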
