ABSTRACT
BACKGROUND: Machine learning (ML) predictive models have shown that they can improve risk prediction and assist medical decision-making; nevertheless, accurate systems for the early identification of future rapid chronic kidney disease (CKD) progressors are lacking in Colombia and across South America. OBJECTIVE: The purpose of this study was to develop a series of interpretable machine learning models that predict GFR at 6, 9, and 12 months. STUDY DESIGN AND SETTING: More than 29,000 patients with CKD stages 1 to 3b (estimated GFR, <60 mL/min/1.73 m²) and an average of 3 years of follow-up data were included. We used extreme gradient boosting (XGBoost) to build three models to predict the next eGFR. Models were internally and externally validated. In addition, we included SHapley Additive exPlanation (SHAP) values to offer interpretable global and local prediction models. RESULTS: All models performed well in development and external validation. However, the 6-month XGBoost prediction model performed best in internal validation (average MAE = 6.07; RMSE = 78.87) and in external validation (average MAE = 6.45; RMSE = 18.94). The three most influential features that pushed the predicted eGFR toward lower values were the interpolated values for eGFR and creatinine, and eGFR at baseline. CONCLUSION: In the current study we developed and validated machine learning models to predict the next eGFR value at different intervals. Furthermore, we sought to address the need for prediction explanation by offering transparent predictions.
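The two error metrics reported above can be sketched as follows. This is a minimal illustration of the standard definitions of mean absolute error (MAE) and root mean squared error (RMSE), not the authors' code; the eGFR values and the function names are hypothetical.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_squared_error(observed, predicted):
    """Square root of the average squared difference."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mse)


# Hypothetical eGFR values in mL/min/1.73 m² (not taken from the study).
observed_egfr = [52.0, 47.0, 38.0, 61.0]
predicted_egfr = [49.0, 50.0, 42.0, 55.0]

mae = mean_absolute_error(observed_egfr, predicted_egfr)
rmse = root_mean_squared_error(observed_egfr, predicted_egfr)
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

Both metrics are expressed in the units of the target. On the same set of errors, RMSE is never smaller than MAE, and it grows faster when a few predictions are far off.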
ABSTRACT
Leptospirosis is a zoonosis with global public health impact, particularly in poor socio-economic settings in tropical regions. Transmitted through urine-contaminated water or soil from rodents, dogs, and livestock, leptospirosis causes over a million clinical cases annually. Risk factors include outdoor activities, livestock production, and substandard housing that foster high densities of animal reservoirs. This One Health study in southern Chile examined Leptospira serological evidence of exposure in people from urban slums, semi-rural settings, and farm settings, using the Extreme Gradient Boosting algorithm to identify key influencing factors. In urban slums, age, shrub terrain, distance to Leptospira-positive households, and neighborhood housing density were contributing factors. Human exposure in semi-rural communities was linked to environmental factors (trees, shrubs, and lower vegetation terrain) and animal variables (Leptospira-positive dogs and rodents and proximity to Leptospira-positive households). On farms, dog counts, animal Leptospira prevalence, and proximity to Leptospira-contaminated water samples were significant drivers. The study underscores that disease dynamics vary across landscapes, with distinct drivers in each community setting. This case study demonstrates how the integration of machine learning with comprehensive cross-sectional epidemiological and geospatial data provides valuable insights into leptospirosis eco-epidemiology. These insights are crucial for informing targeted public health strategies and generating hypotheses for future research.
ABSTRACT
PURPOSE: There are currently more than 480 primary immune deficiency (PID) diseases and about 7000 rare diseases that together afflict around 1 in every 17 humans. Computational aids based on data mining and machine learning might facilitate the diagnostic task by extracting rules from large datasets and making predictions when faced with new problem cases. In a proof-of-concept data mining study, we aimed to predict PID diagnoses using a supervised machine learning algorithm based on classification tree boosting. METHODS: Through a data query at the USIDNET registry we obtained a database of 2396 patients with common diagnoses of PID, including their clinical and laboratory features. We kept 286 features and all 12 diagnoses to include in the model. We used the XGBoost package with parallel tree boosting for the supervised classification model, and SHAP for variable importance interpretation, in Python v3.7. The patient database was split into training and testing subsets, and after boosting through gradient descent, the predictive model provided measures of diagnostic prediction accuracy and individual feature importance. After a baseline performance test, we used the class-weighting hyperparameter, scale_pos_weight, to correct for class imbalance. RESULTS: The twelve PID diagnoses were CVID (1098 patients), DiGeorge syndrome, Chronic granulomatous disease, Congenital agammaglobulinemia, PID not otherwise classified, Specific antibody deficiency, Complement deficiency, Hyper-IgM, Leukocyte adhesion deficiency, ectodermal dysplasia with immune deficiency, Severe combined immune deficiency, and Wiskott-Aldrich syndrome. For CVID, the model achieved an accuracy of 0.80 on the training sample, with an area under the ROC curve (AUC) of 0.80 and a Gini coefficient of 0.60. In the test subset, accuracy was 0.76, AUC 0.75, and Gini 0.51.
The positive feature value to predict CVID was highest for upper respiratory infections, asthma, autoimmunity and hypogammaglobulinemia. Features with the highest negative predictive value were high IgE, growth delay, abscess, lymphopenia, and congenital heart disease. For the rest of the diagnoses, accuracy stayed between 0.75 and 0.99, AUC 0.46-0.87, Gini 0.07-0.75, and LogLoss 0.09-8.55. DISCUSSION: Clinicians should remember to consider the negative predictive features together with the positives. We are calling this a proof-of-concept study to continue with our explorations. A good performance is encouraging, and feature importance might aid feature selection for future endeavors. In the meantime, we can learn from the rules derived by the model and build a user-friendly decision tree to generate differential diagnoses.
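Two quantities in this abstract follow from simple arithmetic, sketched below under stated assumptions. The first is the usual relation between the Gini coefficient and AUC (Gini = 2 × AUC − 1); the abstract does not say how Gini was computed, so treating it this way is an assumption. The second is the negatives-to-positives ratio, a commonly suggested starting value for XGBoost's `scale_pos_weight`; the value the authors actually used is not reported, and the one-vs-rest framing for CVID is also an assumption. The function names are illustrative.

```python
def gini_from_auc(auc):
    """Gini coefficient implied by an AUC, using Gini = 2 * AUC - 1."""
    return 2.0 * auc - 1.0


def negatives_to_positives_ratio(n_positive, n_total):
    """Ratio of negative to positive cases, a common heuristic for scale_pos_weight."""
    n_negative = n_total - n_positive
    return n_negative / n_positive


# AUC values reported for CVID in the abstract.
train_gini = gini_from_auc(0.80)
test_gini = gini_from_auc(0.75)

# 1098 CVID patients out of 2396 in the registry extract (CVID vs. all others).
cvid_weight = negatives_to_positives_ratio(1098, 2396)

print(f"train Gini = {train_gini:.2f}, test Gini = {test_gini:.2f}")
print(f"CVID negatives/positives = {cvid_weight:.3f}")
```

The training value reproduces the reported 0.60. The test value comes out at 0.50 from the rounded AUC of 0.75, which is compatible with the reported 0.51 if the unrounded AUC was slightly higher.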
Subjects
Primary Immunodeficiency Diseases , Wiskott-Aldrich Syndrome , Humans , Differential Diagnosis , Machine Learning , Data Mining
ABSTRACT
Identifying risk factors associated with COVID-19 lethality is crucial in combating the ongoing pandemic. In this study, we developed lethality predictive models for each epidemiological wave and for the overall dataset using the Extreme Gradient Boosting technique and analyzed them using Shapley values to determine the contribution levels of various features, including demographics, comorbidities, medical units, and recent medical information from confirmed COVID-19 cases in Mexico between February 23, 2020, and April 15, 2022. The results showed that pneumonia and advanced age were the most important factors predicting patient death in all cohorts. Additionally, the medical unit where the patient received care acted as a risk or protective factor. IMSS medical units were identified as high-risk factors in all cohorts, except in wave four, while SSA medical units generally were moderate protective factors. We also found that intubation was a high-risk factor in the first epidemiological wave and a moderate-risk factor in the following waves. Female gender was a protective factor of moderate-high importance in all cohorts, while being between 18 and 29 years old was a moderate protective factor and being between 50 and 59 years old was a moderate risk factor. Additionally, diabetes (all cohorts), obesity (third wave), and hypertension (fourth wave) were identified as moderate risk factors. Finally, residing in municipalities with the lowest Human Development Index level represented a moderate risk factor. In conclusion, this study identified several significant risk factors associated with COVID-19 lethality in Mexico, which could aid policymakers in developing targeted interventions to reduce mortality rates.
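The idea behind the Shapley analysis can be illustrated with an exact computation on a toy model. This is a sketch of the textbook definition (the average marginal contribution of a feature over all feature orderings), not the SHAP library and not the study's model; the two features, the risk numbers, and the function names are hypothetical.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        coalition = frozenset()
        for f in order:
            with_f = coalition | {f}
            totals[f] += value(with_f) - value(coalition)
            coalition = with_f
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_risk(present):
    """Hypothetical predicted death risk given which risk factors are present."""
    risk = 0.05  # baseline risk with neither factor
    if "pneumonia" in present:
        risk += 0.30
    if "age_60_plus" in present:
        risk += 0.20
    if {"pneumonia", "age_60_plus"} <= present:
        risk += 0.10  # interaction term, shared equally by both features
    return risk


phi = shapley_values(["pneumonia", "age_60_plus"], toy_risk)
print(phi)
```

A positive value marks a feature that pushes the prediction toward death (a risk factor) and a negative value marks a protective one. The values always sum to the difference between the full prediction and the baseline.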
Subjects
COVID-19 , Humans , Female , Adolescent , Young Adult , Adult , Middle Aged , COVID-19/epidemiology , Mexico/epidemiology , Risk Factors , Obesity , Machine Learning
ABSTRACT
The large amount of data generated during the COVID-19 pandemic requires advanced tools for the long-term prediction of risk factors associated with COVID-19 mortality with higher accuracy. Machine learning (ML) methods directly address this topic and are essential tools to guide public health interventions. Here, we used ML to investigate the importance of demographic and clinical variables on COVID-19 mortality. We also analyzed how comorbidity networks are structured according to age groups. We conducted a retrospective study of COVID-19 mortality with hospitalized patients from Londrina, Parana, Brazil, registered in the database for severe acute respiratory infections (SIVEP-Gripe), from January 2021 to February 2022. We tested four ML models to predict the COVID-19 outcome: Logistic Regression, Support Vector Machine, Random Forest, and XGBoost. We also constructed a comorbidity network to investigate the impact of co-occurring comorbidities on COVID-19 mortality. Our study comprised 8358 hospitalized patients, of whom 2792 (33.40%) died. The XGBoost model achieved excellent performance (ROC-AUC = 0.90). Both permutation method and SHAP values highlighted the importance of age, ventilatory support status, and intensive care unit admission as key features in predicting COVID-19 outcomes. The comorbidity networks for old deceased patients are denser than those for young patients. In addition, the co-occurrence of heart disease and diabetes may be the most important combination to predict COVID-19 mortality, regardless of age and sex. This work presents a valuable combination of machine learning and comorbidity network analysis to predict COVID-19 outcomes. Reliable evidence on this topic is crucial for guiding the post-pandemic response and assisting in COVID-19 care planning and provision.
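A comorbidity network of the kind described above can be sketched by counting how often pairs of conditions co-occur in the same patient. The abstract does not give the authors' construction, so this is one plausible reading; the patient records, the density definition (observed pairs over possible pairs), and the function names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical patient records: each set lists one patient's comorbidities.
patients = [
    {"heart_disease", "diabetes"},
    {"heart_disease", "diabetes", "obesity"},
    {"diabetes"},
    {"obesity", "heart_disease"},
]


def cooccurrence_edges(records):
    """Count, for every pair of comorbidities, the patients who have both."""
    edges = Counter()
    for comorbidities in records:
        for pair in combinations(sorted(comorbidities), 2):
            edges[pair] += 1
    return edges


def network_density(edges, n_nodes):
    """Share of all possible node pairs that are linked by at least one patient."""
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible if possible else 0.0


edges = cooccurrence_edges(patients)
nodes = set().union(*patients)
density = network_density(edges, len(nodes))

for pair, count in edges.most_common():
    print(pair, count)
print(f"density = {density:.2f}")
```

Building one such network per age group and comparing densities is one way to express the statement that the networks for older deceased patients are denser than those for younger patients.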
ABSTRACT
The surging demand for commodity crops has led to rapid and severe agricultural frontier expansion globally and has put producing regions increasingly under pressure. However, knowledge about spatial patterns of agricultural frontier dynamics, their leading spatial determinants, and socio-ecological trade-offs is often lacking, hindering contextualized decision making towards more sustainable food systems. Here, we used inventory data to map frontier dynamics of avocado production, a cash crop of increasing importance in global diets, for Michoacán, Mexico, before and after the implementation of the North American Free Trade Agreement (NAFTA). We compiled a set of environmental, accessibility and social variables and identified the leading determinants of avocado frontier expansion and their interactions using extreme gradient boosting. We predicted potential expansion patterns and assessed their impacts on areas important for biodiversity conservation. Avocado frontiers expanded more than tenfold from 12,909 ha (1974) to 152,493 ha (2011), particularly after NAFTA. Annual precipitation, distance to settlements, and land tenure were key factors explaining avocado expansion. Under favorable climatic and accessibility conditions, most avocado expansion occurred on private lands. In contrast, under suboptimal conditions, most avocado expansion occurred on communal lands. Large areas suitable for further avocado expansion overlapped with priority sites for restoration, highlighting an imminent conflict between conservation and economic revenues. This is the first analysis of avocado frontier dynamics and their spatial determinants across a major production region, and our results provide entry points to implement government-based strategies to support small-scale farmers, mostly those on communal lands, while trying to minimize the socio-environmental impacts of avocado production.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s10113-022-01883-6.
ABSTRACT
COVID-19 is the current pandemic disease and can cause hospital overcrowding. It also affects the lungs and may cause pneumonia. The most common technique for diagnosing pneumonia is the evaluation of X-ray images. However, a sufficient number of radiologists is needed to interpret them. High rates of child deaths due to pneumonia have been reported. With an automated system, a diagnosis can be made quickly and treatment can start promptly. This study aims to diagnose pneumonia with an automatic tool based on boosting techniques, which reduces the workload of doctors and radiologists. Boosting techniques are a family of machine learning techniques. Gradient Boosting Machine (GBM), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Categorical Boosting (CatBoost) are used in the study. These techniques were chosen for their short simulation duration in modeling and their suitability for real-time applications. L2 normalization and feature selection are applied to the data before the techniques are applied. A Random Forest classifier is used as the estimator for feature selection. After modeling, the Categorical Boosting algorithm was found to be faster than the other techniques, with a simulation duration of 0.7 seconds. With this automatic tool, the user can upload an X-ray image to the system and read the result from the screen without a radiologist or doctor.
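The L2 normalization step mentioned above can be sketched as follows. This is a minimal stdlib illustration of the standard operation (dividing a feature vector by its Euclidean norm), not the authors' preprocessing code; the example vector and the function name are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector so that its Euclidean (L2) norm equals 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # An all-zero vector has no direction; return it unchanged.
        return list(vector)
    return [x / norm for x in vector]


# Hypothetical two-feature sample.
features = [3.0, 4.0]
normalized = l2_normalize(features)
print(normalized)
```

After this step every sample has unit length, so samples differ only in the relative proportions of their features rather than in overall magnitude.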
ABSTRACT
The COVID-19 pandemic, which originated in December 2019 in the city of Wuhan, China, continues to have a devastating effect on the health and well-being of the global population. Currently, approximately 8.8 million people have been infected and more than 465,740 people have died worldwide. An important step in combating COVID-19 is the screening of infected patients using chest X-ray (CXR) images. However, this task is extremely time-consuming and prone to variability among specialists owing to its heterogeneity. Therefore, the present study aims to assist specialists in identifying COVID-19 patients from their chest radiographs, using automated computational techniques. The proposed method has four main steps: (1) the acquisition of the dataset from two public databases; (2) the standardization of images through preprocessing; (3) the extraction of features using a deep features-based approach implemented through the networks VGG19, Inception-v3, and ResNet50; (4) the classification of images into COVID-19 and non-COVID-19 groups, using eXtreme Gradient Boosting (XGBoost) optimized by particle swarm optimization (PSO). In the best-case scenario, the proposed method achieved an accuracy of 98.71%, a precision of 98.89%, a recall of 99.63%, and an F1-score of 99.25%. In our study, we demonstrated that the problem of classifying CXR images of patients under COVID-19 and non-COVID-19 conditions can be solved efficiently by combining a deep features-based approach with a robust classifier (XGBoost) optimized by an evolutionary algorithm (PSO). The proposed method offers considerable advantages for clinicians seeking to tackle the current COVID-19 pandemic.
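The classification metrics reported above are related by standard formulas, sketched below. The confusion-matrix counts and the function names are hypothetical and only illustrate the definitions; the last lines recompute the F1-score from the precision and recall given in the abstract as a consistency check.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}")

# Consistency check using the precision and recall reported in the abstract.
reported_f1 = f1_from(0.9889, 0.9963)
print(f"F1 implied by reported precision and recall = {reported_f1:.4f}")
```

The implied F1 agrees with the reported 99.25% to within rounding of the published precision and recall.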