ABSTRACT
Type 2 diabetes mellitus (T2DM) is one of the most common metabolic diseases in the world and poses a significant public health challenge. Early detection and management of this metabolic disorder is crucial to prevent complications and improve outcomes. This paper aims to find core differences in male and female markers to detect T2DM by their clinic and anthropometric features, seeking out ranges in potential biomarkers identified to provide useful information as a pre-diagnostic tool whie excluding glucose-related biomarkers using machine learning (ML) models. We used a dataset containing clinical and anthropometric variables from patients diagnosed with T2DM and patients without TD2M as control. We applied feature selection with three different techniques to identify relevant biomarker models: an improved recursive feature elimination (RFE) evaluating each set from all the features to one feature with the Akaike information criterion (AIC) to find optimal outputs; Least Absolute Shrinkage and Selection Operator (LASSO) with glmnet; and Genetic Algorithms (GA) with GALGO and forward selection (FS) applied to GALGO output. We then used these for comparison with the AIC to measure the performance of each technique and collect the optimal set of global features. Then, an implementation and comparison of five different ML models was carried out to identify the most accurate and interpretable one, considering the following models: logistic regression (LR), artificial neural network (ANN), support vector machine (SVM), k-nearest neighbors (KNN), and nearest centroid (Nearcent). The models were then combined in an ensemble to provide a more robust approximation. The results showed that potential biomarkers such as systolic blood pressure (SBP) and triglycerides are together significantly associated with T2DM. This approach also identified triglycerides, cholesterol, and diastolic blood pressure as biomarkers with differences between male and female actors that have not been previously reported in the literature. The most accurate ML model was selection with RFE and random forest (RF) as the estimator improved with the AIC, which achieved an accuracy of 0.8820. In conclusion, this study demonstrates the potential of ML models in identifying potential biomarkers for early detection of T2DM, excluding glucose-related biomarkers as well as differences between male and female anthropometric and clinic profiles. These findings may help to improve early detection and management of the T2DM by accounting for differences between male and female subjects in terms of anthropometric and clinic profiles, potentially reducing healthcare costs and improving personalized patient attention. Further research is needed to validate these potential biomarkers ranges in other populations and clinical settings.
ABSTRACT
The escalating prevalence of Type 2 Diabetes (T2D) represents a substantial burden on global healthcare systems, especially in regions such as Mexico. Existing diagnostic techniques, although effective, often require invasive procedures and labor-intensive efforts. The promise of artificial intelligence and data science for streamlining and enhancing T2D diagnosis is well-recognized; however, these advancements are frequently constrained by the limited availability of comprehensive patient datasets. To mitigate this challenge, the present study investigated the efficacy of Generative Adversarial Networks (GANs) for augmenting existing T2D patient data, with a focus on a Mexican cohort. The researchers utilized a dataset of 1019 Mexican nationals, divided into 499 non-diabetic controls and 520 diabetic cases. GANs were applied to create synthetic patient profiles, which were subsequently used to train a Random Forest (RF) classification model. The study's findings revealed a notable improvement in the model's diagnostic accuracy, validating the utility of GAN-based data augmentation in a clinical context. The results bear significant implications for enhancing the robustness and reliability of Machine Learning tools in T2D diagnosis and management, offering a pathway toward more timely and effective patient care.
ABSTRACT
In recent years, the application of artificial intelligence (AI) in the automotive industry has led to the development of intelligent systems focused on road safety, aiming to improve protection for drivers and pedestrians worldwide to reduce the number of accidents yearly. One of the most critical functions of these systems is pedestrian detection, as it is crucial for the safety of everyone involved in road traffic. However, pedestrian detection goes beyond the front of the vehicle; it is also essential to consider the vehicle's rear since pedestrian collisions occur when the car is in reverse drive. To contribute to the solution of this problem, this research proposes a model based on convolutional neural networks (CNN) using a proposed one-dimensional architecture and the Inception V3 architecture to fuse the information from the backup camera and the distance measured by the ultrasonic sensors, to detect pedestrians when the vehicle is reversing. In addition, specific data collection was performed to build a database for the research. The proposed model showed outstanding results with 99.85% accuracy and 99.86% correct classification performance, demonstrating that it is possible to achieve the goal of pedestrian detection using CNN by fusing two types of data.
ABSTRACT
Driver identification refers to the process whose primary purpose is identifying the person behind the steering wheel using collected information about the driver him/herself. The constant monitoring of drivers through sensors generates great benefits in advanced driver assistance systems (ADAS), to learn more about the behavior of road users. Currently, there are many research works that address the subject in search of creating intelligent models that help to identify vehicle users in an efficient and objective way. However, the different methodologies proposed to create these models are based on data generated from sensors that include different vehicle brands on routes established in real environments, which, although they provide very important information for different purposes, in the case of driver identification, there may be a certain degree of bias due to the different situations in which the route environment may change. The proposed method seeks to intelligently and objectively select the most outstanding statistical features from motor activity generated in the main elements of the vehicle with genetic algorithms for driver identification, this process being newer than those established by the state-of-the-art. The results obtained from the proposal were an accuracy of 90.74% to identify two drivers and 62% for four, using a Random Forest Classifier (RFC). With this, it can be concluded that a comprehensive selection of features can greatly optimize the identification of drivers.
Subject(s)
Automobile Driving , Humans , Male , Accidents, Traffic , Random Forest , Learning , Motor ActivityABSTRACT
Abstract It is estimated that depression affects more than 300 million people in worldwide. Unfortunately, the current method of psychiatric evaluation requires a great effort on the part of clinicians to collect complete information. The aim of this paper is determine the optimal time intervals to detect depression using genetic algorithms and machine learning techniques; from motor activity readings of 55 participants during a week at one-minute intervals. The time intervals with the best performance in detecting depression in individuals were selected by applying Genetic Algorithms (GA). Methodology. 385 observations of the study participants were evaluated, obtaining an accuracy of 83.0 % with Logistic Regression (LR). Conclusion. There is a relationship between motor activity and people with depression since it is possible to detect it using machine learning techniques. However, the changes in the variables of the time intervals could be established as key factors since, at different times, they could give good or bad results because the motor activity in the patients could vary. However, the results present a first approximation for developing tools that help the opportune and objective diagnosis of depression.
Resumen Se estima que la depresión afecta a más de 300 millones de personas en el mundo. Desafortunadamente, el método de evaluación psiquiátrica actual requiere un gran esfuerzo por parte de los médicos para recopilar información completa. Objetivo. Determinar los intervalos de tiempo óptimos para detectar depresión mediante algoritmos genéticos y técnicas de aprendizaje automático, a partir de las lecturas de actividad motora de 55 sujetos durante una semana en intervalos de un minuto. Los intervalos de tiempo con mejor desempeño en la detección de depresión en individuos fueron seleccionados aplicando algoritmos genéticos. Metodología. Se evaluaron 385 observaciones de los sujetos de estudio, obteniendo una precisión del 83.0 % con Regresión Logística (LR). Conclusión. Existe una relación entre la actividad motora y las personas con depresión ya que es posible detectarla utilizando técnicas de aprendizaje automático. Sin embargo, los cambios en las variables de los intervalos de tiempo podrían establecerse como factores clave ya que en diferentes momentos podrían dar buenos o malos resultados debido a que la actividad motora en los pacientes podría llegar a variar. No obstante, los resultados presentan una primera aproximación para el desarrollo de herramientas que ayuden al diagnóstico oportuno y objetivo de la depresión.
ABSTRACT
Breast cancer is the most common cancer among women worldwide, after lung cancer. However, early detection of breast cancer can help to reduce death rates in breast cancer patients and also prevent cancer from spreading to other parts of the body. This work proposes a new method to design a bio-marker integrating Bayesian predictive models, pyRadiomics System and genetic algorithms to classify the benign and malignant lesions. The method allows one to evaluate two types of images: The radiologist-segmented lesion, and a novel automated breast cancer detection by the analysis of the whole breast. The results demonstrate only a difference of 12% of effectiveness for the cases of calcification between the radiologist generated segmentation and the automatic whole breast analysis, and a 25% of difference between the lesion and the breast for the cases of masses. In addition, our approach was compared against other proposed methods in the literature, providing an AUC = 0.86 for the analysis of images with lesions in breast calcification, and AUC = 0.96 for masses.
ABSTRACT
According to the World Health Organization (WHO), type 2 diabetes mellitus (T2DM) is a result of the inefficient use of insulin by the body. More than 95% of people with diabetes have T2DM, which is largely due to excess weight and physical inactivity. This study proposes an intelligent feature selection of metabolites related to different stages of diabetes, with the use of genetic algorithms (GA) and the implementation of support vector machines (SVMs), K-Nearest Neighbors (KNNs) and Nearest Centroid (NEARCENT) and with a dataset obtained from the Instituto Mexicano del Seguro Social with the protocol name of the following: "Análisis metabolómico y transcriptómico diferencial en orina y suero de pacientes pre diabéticos, diabéticos y con nefropatía diabética para identificar potenciales biomarcadores pronósticos de daño renal" (differential metabolomic and transcriptomic analyses in the urine and serum of pre-diabetic, diabetic and diabetic nephropathy patients to identify potential prognostic biomarkers of kidney damage). In order to analyze which machine learning (ML) model is the most optimal for classifying patients with some stage of T2DM, the novelty of this work is to provide a genetic algorithm approach that detects significant metabolites in each stage of progression. More than 100 metabolites were identified as significant between all stages; with the data analyzed, the average accuracies obtained in each of the five most-accurate implementations of genetic algorithms were in the range of 0.8214-0.9893 with respect to average accuracy, providing a precise tool to use in detections and backing up a diagnosis constructed entirely with metabolomics. By providing five potential biomarkers for progression, these extremely significant metabolites are as follows: "Cer(d18:1/24:1) i2", "PC(20:3-OH/P-18:1)", "Ganoderic acid C2", "TG(16:0/17:1/18:1)" and "GPEtn(18:0/20:4)".
ABSTRACT
Type 2 diabetes mellitus (T2DM) represents one of the biggest health problems in Mexico, and it is extremely important to early detect this disease and its complications. For a noninvasive detection of T2DM, a machine learning (ML) approach that uses ensemble classification models with dichotomous output that is also fast and effective for early detection and prediction of T2D can be used. In this article, an ensemble technique by hard voting is designed and implemented using generalized linear regression (GLM), support vector machines (SVM) and artificial neural networks (ANN) for the classification of T2DM patients. In the materials and methods as a first step, the data is balanced, standardized, imputed and integrated into the three models to classify the patients in a dichotomous result. For the selection of features, an implementation of LASSO is developed, with a 10-fold cross-validation and for the final validation, the Area Under the Curve (AUC) is used. The results in LASSO showed 12 features, which are used in the implemented models to obtain the best possible scenario in the developed ensemble model. The algorithm with the best performance of the three is SVM, this model obtained an AUC of 92% ± 3%. The ensemble model built with GLM, SVM and ANN obtained an AUC of 90% ± 3%.
ABSTRACT
Major depressive disorder (MDD) is the most recurrent mental illness globally, affecting approximately 5% of adults. Furthermore, according to the National Institute of Mental Health (NIMH) of the U.S., calculating an actual schizophrenia prevalence rate is challenging because of this illness's underdiagnosis. Still, most current global metrics hover between 0.33% and 0.75%. Machine-learning scientists use data from diverse sources to analyze, classify, or predict to improve the psychiatric attention, diagnosis, and treatment of MDD, schizophrenia, and other psychiatric conditions. Motor activity data are gaining popularity in mental illness diagnosis assistance because they are a cost-effective and noninvasive method. In the knowledge discovery in databases (KDD) framework, a model to classify depressive and schizophrenic patients from healthy controls is constructed using accelerometer data. Taking advantage of the multiple sleep disorders caused by mental disorders, the main objective is to increase the model's accuracy by employing only data from night-time activity. To compare the classification between the stages of the day and improve the accuracy of the classification, the total activity signal was cut into hourly time lapses and then grouped into subdatasets depending on the phases of the day: morning (06:00-11:59), afternoon (12:00-17:59), evening (18:00-23:59), and night (00:00-05:59). Random forest classifier (RFC) is the algorithm proposed for multiclass classification, and it uses accuracy, recall, precision, the Matthews correlation coefficient, and F1 score to measure its efficiency. The best model was night-featured data and RFC, with 98% accuracy for the classification of three classes. The effectiveness of this experiment leads to less monitoring time for patients, reducing stress and anxiety, producing more efficient models, using wearables, and increasing the amount of data.
ABSTRACT
Sudden infant death syndrome (SIDS) represents the leading cause of death in under one year of age in developing countries. Even in our century, its etiology is not clear, and there is no biomarker that is discriminative enough to predict the risk of suffering from it. Therefore, in this work, taking a public dataset on the lipidomic profile of babies who died from this syndrome compared to a control group, a univariate analysis was performed using the Mann-Whitney U test, with the aim of identifying the characteristics that enable discriminating between both groups. Those characteristics with a p-value less than or equal to 0.05 were taken; once these characteristics were obtained, classification models were implemented (random forests (RF), logistic regression (LR), support vector machine (SVM) and naive Bayes (NB)). We used seventy percent of the data for model training, subjecting it to a cross-validation (k = 5) and later submitting to validation in a blind test with 30% of the remaining data, which allows simulating the scenario in real life-that is, with an unknown population for the model. The model with the best performance was RF, since in the blind test, it obtained an AUC of 0.9, specificity of 1, and sensitivity of 0.8. The proposed model provides the basis for the construction of a SIDS risk prediction computer tool, which will contribute to prevention, and proposes lines of research to deal with this pathology.
ABSTRACT
One of the main microvascular complications presented in the Mexican population is diabetic retinopathy which affects 27.50% of individuals with type 2 diabetes. Therefore, the purpose of this study is to construct a predictive model to find out the risk factors of this complication. The dataset contained a total of 298 subjects, including clinical and paraclinical features. An analysis was constructed using machine learning techniques including Boruta as a feature selection method, and random forest as classification algorithm. The model was evaluated through a statistical test based on sensitivity, specificity, area under the curve (AUC), and receiving operating characteristic (ROC) curve. The results present significant values obtained by the model obtaining 69% of AUC. Moreover, a risk evaluation was incorporated to evaluate the impact of the predictors. The proposed method identifies creatinine, lipid treatment, glomerular filtration rate, waist hip ratio, total cholesterol, and high density lipoprotein as risk factors in Mexican subjects. The odds ratio increases by 3.5916 times for control patients which have high levels of cholesterol. It is possible to conclude that this proposed methodology is a preliminary computer-aided diagnosis tool for clinical decision-helping to identify the diagnosis of DR.
ABSTRACT
Worldwide, motor vehicle accidents are one of the leading causes of death, with alcohol-related accidents playing a significant role, particularly in child death. Aiming to aid in the prevention of this type of accidents, a novel non-invasive method capable of detecting the presence of alcohol inside a motor vehicle is presented. The proposed methodology uses a series of low-cost alcohol MQ3 sensors located inside the vehicle, whose signals are stored, standardized, time-adjusted, and transformed into 5 s window samples. Statistical features are extracted from each sample and a feature selection strategy is carried out using a genetic algorithm, and a forward selection and backwards elimination methodology. The four features derived from this process were used to construct an SVM classification model that detects presence of alcohol. The experiments yielded 7200 samples, 80% of which were used to train the model. The rest were used to evaluate the performance of the model, which obtained an area under the ROC curve of 0.98 and a sensitivity of 0.979. These results suggest that the proposed methodology can be used to detect the presence of alcohol and enforce prevention actions.
Subject(s)
Automobile Driving , Driving Under the Influence , Accidents, Traffic/prevention & control , Algorithms , Child , Humans , Motor VehiclesABSTRACT
Children's healthcare is a relevant issue, especially the prevention of domestic accidents, since it has even been defined as a global health problem. Children's activity classification generally uses sensors embedded in children's clothing, which can lead to erroneous measurements for possible damage or mishandling. Having a non-invasive data source for a children's activity classification model provides reliability to the monitoring system where it is applied. This work proposes the use of environmental sound as a data source for the generation of children's activity classification models, implementing feature selection methods and classification techniques based on Bayesian networks, focused on the recognition of potentially triggering activities of domestic accidents, applicable in child monitoring systems. Two feature selection techniques were used: the Akaike criterion and genetic algorithms. Likewise, models were generated using three classifiers: naive Bayes, semi-naive Bayes and tree-augmented naive Bayes. The generated models, combining the methods of feature selection and the classifiers used, present accuracy of greater than 97% for most of them, with which we can conclude the efficiency of the proposal of the present work in the recognition of potentially detonating activities of domestic accidents.
ABSTRACT
Alzheimer's disease (AD) is a neurodegenerative disease that mainly affects older adults. Currently, AD is associated with certain hypometabolic biomarkers, beta-amyloid peptides, hyperphosphorylated tau protein, and changes in brain morphology. Accurate diagnosis of AD, as well as mild cognitive impairment (MCI) (prodromal stage of AD), is essential for early care of the disease. As a result, machine learning techniques have been used in recent years for the diagnosis of AD. In this research, we propose a novel methodology to generate a multivariate model that combines different types of features for the detection of AD. In order to obtain a robust biomarker, ADNI baseline data, clinical and neuropsychological assessments (1024 features) of 106 patients were used. The data were normalized, and a genetic algorithm was implemented for the selection of the most significant features. Subsequently, for the development and validation of the multivariate classification model, a support vector machine model was created, and a five-fold cross-validation with an AUC of 87.63% was used to measure model performance. Lastly, an independent blind test of our final model, using 20 patients not considered during the model construction, yielded an AUC of 100%.
ABSTRACT
The main cause of death in Mexico and the world is heart disease, and it will continue to lead the death rate in the next decade according to data from the World Health Organization (WHO) and the National Institute of Statistics and Geography (INEGI). Therefore, the objective of this work is to implement, compare and evaluate machine learning algorithms that are capable of classifying normal and abnormal heart sounds. Three different sounds were analyzed in this study; normal heart sounds, heart murmur sounds and extra systolic sounds, which were labeled as healthy sounds (normal sounds) and unhealthy sounds (murmur and extra systolic sounds). From these sounds, fifty-two features were calculated to create a numerical dataset; thirty-six statistical features, eight Linear Predictive Coding (LPC) coefficients and eight Cepstral Frequency-Mel Coefficients (MFCC). From this dataset two more were created; one normalized and one standardized. These datasets were analyzed with six classifiers: k-Nearest Neighbors, Naive Bayes, Decision Trees, Logistic Regression, Support Vector Machine and Artificial Neural Networks, all of them were evaluated with six metrics: accuracy, specificity, sensitivity, ROC curve, precision and F1-score, respectively. The performances of all the models were statistically significant, but the models that performed best for this problem were logistic regression for the standardized data set, with a specificity of 0.7500 and a ROC curve of 0.8405, logistic regression for the normalized data set, with a specificity of 0.7083 and a ROC curve of 0.8407, and Support Vector Machine with a lineal kernel for the non-normalized data; with a specificity of 0.6842 and a ROC curve of 0.7703. Both of these metrics are of utmost importance in evaluating the performance of computer-assisted diagnostic systems.
ABSTRACT
Diabetes incidence has been a problem, because according with the World Health Organization and the International Diabetes Federation, the number of people with this disease is increasing very fast all over the world. Diabetic treatment is important to prevent the development of several complications, also lipid profile monitoring is important. For that reason the aim of this work is the implementation of machine learning algorithms that are able to classify cases, that corresponds to patients diagnosed with diabetes that have diabetes treatment, and controls that refers to subjects who do not have diabetes treatment but some of them have diabetes, bases on lipids profile levels. Logistic regression, K-nearest neighbor, decision trees and random forest were implemented, all of them were evaluated with accuracy, sensitivity, specificity and AUC-ROC curve metrics. Artificial neural network obtain an acurracy of 0.685 and an AUC value of 0.750, logistic regression achieve an accuracy of 0.729 and an AUC value of 0.795, K-nearest neighbor gets an accuracy of 0.669 and an AUC value of 0.709, on the other hand, decision tree reached an accuracy pg 0.691 and a AUC value of 0.683, finally random forest achieve an accuracy of 0.704 and an AUC curve of 0.776. The performance of all models was statistically significant, but the best performance model for this problem corresponds to logistic regression.
ABSTRACT
The prevalence of diabetes mellitus is increasing worldwide, causing health and economic implications. One of the principal microvascular complications of type 2 diabetes is Distal Symmetric Polyneuropathy (DSPN), affecting 42.6% of the population in Mexico. Therefore, the purpose of this study was to find out the predictors of this complication. The dataset contained a total number of 140 subjects, including clinical and paraclinical features. A multivariate analysis was constructed using Boruta as a feature selection method and Random Forest as a classification algorithm applying the strategy of K-Folds Cross Validation and Leave One Out Cross Validation. Then, the models were evaluated through a statistical analysis based on sensitivity, specificity, area under the curve (AUC) and receiving operating characteristic (ROC) curve. The results present significant values obtained by the model with this approach, presenting 67% of AUC with only three features as predictors. It is possible to conclude that this proposed methodology can classify patients with DSPN, obtaining a preliminary computer-aided diagnosis tool for the clinical area in helping to identify the diagnosis of DSPN.
ABSTRACT
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the coronavirus disease (COVID-19), a highly contagious infectious disease that has caused many deaths worldwide. Despite global efforts, it continues to cause great losses, and leaving multiple unknowns that we must resolve in order to face the pandemic more effectively. One of the questions that has arisen recently is what happens, after recovering from COVID-19. For this reason, the objective of this study is to identify the risk of presenting persistent symptoms in recovered from COVID-19. This case-control study was conducted in one state of Mexico. Initially the data were obtained from the participants, through a questionnaire about symptoms that they had at the moment of the interview. Initially were captured the collected data, to make a dataset. After the pre-processed using the R project tool to eliminate outliers or missing data. Obtained finally a total of 219 participants, 141 recovered and 78 controls. It was used confidence level of 90% and a margin of error of 7%. From results it was obtained that all symptoms have an associated risk in those recovered. The relative risk of the selected symptoms in the recovered patients goes from 3 to 22 times, being infinite for the case of dyspnea, due to the fact that there is no control that presents this symptom at the moment of the interview, followed by the nausea and the anosmia with a RR of 8.5. Therefore, public health strategies must be rethought, to treat or rehabilitate, avoiding chronic problems in patients recovered from COVID-19.
Subject(s)
COVID-19/diagnosis , Adult , COVID-19/physiopathology , Case-Control Studies , Chronic Disease , Female , Humans , Male , Mexico/epidemiology , Middle Aged , Surveys and QuestionnairesABSTRACT
Sudden infant death syndrome (SIDS) is defined as the death of a child under one year of age, during sleep, without apparent cause, after exhaustive investigation, so it is a diagnosis of exclusion. SIDS is the principal cause of death in industrialized countries. Inborn errors of metabolism (IEM) have been related to SIDS. These errors are a group of conditions characterized by the accumulation of toxic substances usually produced by an enzyme defect and there are thousands of them and included are the disorders of the ß-oxidation cycle, similarly to what can affect the metabolism of different types of fatty acid chain (within these, short chain fatty acids (SCFAs)). In this work, an analysis of postmortem SCFAs profiles of children who died due to SIDS is proposed. Initially, a set of features containing SCFAs information, obtained from the NIH Common Fund's National Metabolomics Data Repository (NMDR) is submitted to an univariate analysis, developing a model based on the relationship between each feature and the binary output (death due to SIDS or not), obtaining 11 univariate models. Then, each model is validated, calculating their receiver operating characteristic curve (ROC curve) and area under the ROC curve (AUC) value. For those features whose models presented an AUC value higher than 0.650, a new multivariate model is constructed, in order to validate its behavior in comparison to the univariate models. In addition, a comparison between this multivariate model and a model developed based on the whole set of features is finally performed. From the results, it can be observed that each SCFA which comprises of the SFCAs profile, has a relationship with SIDS and could help in risk identification.
ABSTRACT
Major Depression Disease has been increasing in the last few years, affecting around 7 percent of the world population, but nowadays techniques to diagnose it are outdated and inefficient. Motor activity data in the last decade is presented as a better way to diagnose, treat and monitor patients suffering from this illness, this is achieved through the use of machine learning algorithms. Disturbances in the circadian rhythm of mental illness patients increase the effectiveness of the data mining process. In this paper, a comparison of motor activity data from the night, day and full day is carried out through a data mining process using the Random Forest classifier to identified depressive and non-depressive episodes. Data from Depressjon dataset is split into three different subsets and 24 features in time and frequency domain are extracted to select the best model to be used in the classification of depression episodes. The results showed that the best dataset and model to realize the classification of depressive episodes is the night motor activity data with 99.37% of sensitivity and 99.91% of specificity.