ABSTRACT
Background: Financial strain resulting from cancer treatment correlates with reduced quality of life, treatment nonadherence, bankruptcy, and maladaptive behaviours. This study aims to explore the potential of a supervised machine learning algorithm in predicting financial toxicity in cancer patients based on their Tweets. Methods: A dataset of Tweets related to cancer and financial toxicity was constructed using Twitter's API. The dataset was curated, and synthetic Tweets were generated to augment the final dataset. A supervised machine learning algorithm, specifically Multinomial Naïve Bayes, was trained and tested to predict financial toxicity in cancer patients. Results: The model demonstrated high accuracy (0.97), precision (0.95), recall (0.99), specificity (0.96), F-1 score (0.97) and area-under-the-receiver-operating-characteristics (0.98) in predicting financial toxicity from Tweets. Wordcloud visualizations illustrated distinct linguistic patterns between Tweets related to financial toxicity and those unrelated to financial toxicity. The study also outlined potential proactive strategies for leveraging social media platforms like Twitter to identify and support cancer patients experiencing financial toxicity. Conclusions: This study marks the first attempt to construct a dataset of Tweets related to financial toxicity in cancer patients and to evaluate a predictive model trained on this dataset. The findings highlight the predictive capabilities of the model and its potential utility in guiding health systems and cancer center financial navigators to alleviate economic burdens associated with cancer treatment.
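The metrics reported above (accuracy, precision, recall, specificity, F-1 score) all derive from the four cells of a binary confusion matrix. A minimal sketch of those definitions follows; the function name and the counts in the usage note are hypothetical illustrations, not the study's data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. All counts here are hypothetical inputs.
    """
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),  # share of actual negatives that are found
        "f1": 2 * precision * recall / (precision + recall),
    }
```

For example, `binary_metrics(9, 1, 3, 7)` gives a precision of 0.9 and a recall of 0.75, which shows how the two can diverge even when accuracy (0.8) looks reasonable.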
ABSTRACT
In outpatient intelligent diagnosis, the naive Bayes model (NBM) does not effectively distinguish between symptom types that involve different ranges of subjects. To address this, an improved naive Bayes algorithm is proposed that introduces an IDF factor to assign different weights to different symptom types. First, a diagnostic corpus was collected and curated from authoritative medical literature as the training data set. Then, following the naive Bayes method, the prior probabilities and class-conditional probabilities were calculated, and IDF factors were trained for the different symptoms. Finally, the IDF factor was applied to the different symptom combinations in the classification decision to smooth the different symptom types. In the accuracy comparison experiment on intelligent diagnosis, the recall of the improved algorithm rose by about 11%, clearly higher than that of the naive Bayes method.
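The abstract does not give the exact weighting formula, so the following is only one plausible reading: each symptom's log-likelihood is multiplied by its IDF weight, so that symptoms occurring in few records count for more. The function names, the smoothed IDF form `log(n / df) + 1`, the Laplace smoothing, and the toy records are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train(records):
    """records: list of (disease, set_of_symptoms) pairs (hypothetical toy format)."""
    n = len(records)
    class_counts = Counter(disease for disease, _ in records)
    symptom_counts = defaultdict(Counter)  # disease -> symptom -> count
    doc_freq = Counter()                   # symptom -> number of records containing it
    for disease, symptoms in records:
        for s in symptoms:
            symptom_counts[disease][s] += 1
            doc_freq[s] += 1
    # Assumed IDF form: rarer symptoms receive larger weights.
    idf = {s: math.log(n / df) + 1.0 for s, df in doc_freq.items()}
    return class_counts, symptom_counts, idf, n

def predict(model, symptoms):
    """Return the disease with the highest IDF-weighted naive Bayes score."""
    class_counts, symptom_counts, idf, n = model
    vocab_size = len(idf)
    best, best_score = None, -math.inf
    for disease, count in class_counts.items():
        score = math.log(count / n)  # log prior
        total = sum(symptom_counts[disease].values())
        for s in symptoms:
            if s not in idf:
                continue  # ignore symptoms never seen in training
            # Laplace-smoothed class-conditional probability
            p = (symptom_counts[disease][s] + 1) / (total + vocab_size)
            score += idf[s] * math.log(p)  # IDF-weighted log-likelihood
        if score > best_score:
            best, best_score = disease, score
    return best

# Toy, made-up records for illustration only.
toy_records = [
    ("flu", {"fever", "cough"}),
    ("flu", {"fever", "aches"}),
    ("migraine", {"headache", "nausea"}),
    ("migraine", {"headache", "fever"}),
]
toy_model = train(toy_records)
```

With these toy records, "fever" appears in three of four records and so receives the smallest weight, while "cough" and "nausea" receive the largest.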
ABSTRACT
OBJECTIVES: To establish a menstrual blood identification model based on Naïve Bayes and multivariate logistic regression, using menstrual blood-specific mRNA markers combined with statistical methods, and to quantitatively distinguish menstrual blood from other body fluids. METHODS: Body fluid samples were collected, comprising 86 menstrual blood, 48 peripheral blood, 48 vaginal secretion, 24 semen and 24 saliva samples. RNA was extracted from the samples and cDNA was obtained by reverse transcription. Five menstrual blood-specific markers, namely the matrix metalloproteinase (MMP) family members MMP3, MMP7 and MMP11, progestogen-associated endometrial protein (PAEP) and stanniocalcin-1 (STC1), were amplified and analyzed by electrophoresis. The results were analyzed by Naïve Bayes and multivariate logistic regression. RESULTS: The accuracy of the classification model was 88.37% with Naïve Bayes and 91.86% with multivariate logistic regression. Among non-menstrual blood samples, the distinguishing accuracy for peripheral blood, saliva and semen was generally higher than 90%, while that for vaginal secretions was lower, at 16.67% and 33.33%, respectively. CONCLUSIONS: mRNA detection technology combined with statistical methods can be used to establish a classification and discrimination model for menstrual blood that distinguishes menstrual blood from other body fluids and describes the analysis results quantitatively, which has a certain application value in body fluid stain identification.
Subject(s)
Female , Humans , RNA, Messenger/metabolism , Bayes Theorem , Logistic Models , Menstruation , Body Fluids , Saliva , Semen , Forensic Genetics/methods
ABSTRACT
Introducción: en el campo de la salud, cada decisión representa datos, y las técnicas de minería de datos han empezado a ser una metodología prometedora para el análisis de esta información, especialmente en el diseño de los modelos predictivos. Métodos: estudio observacional analítico de pacientes mayores de 15 años, con reporte de punción de aspiración con aguja fina con estudio Bethesda IV, sometidos a manejo quirúrgico en el Hospital de San José de Bogotá. Los datos recogidos de los pacientes se incluyeron en tres grupos: la información sociodemográfica y clínica, los hallazgos en la citología y los reportes de la ecografía. Se realizó el análisis mediante Naive-Bayes, árbol de decisión y redes neuronales. Se usó la herramienta Weka versión 3.8.2. Resultados: de los 427 pacientes, 195 tuvieron resultados de patología de carcinoma de tiroides (45,6 %). Se evidenciaron mejores resultados usando la validación cruzada (10 fold) comparado con partición (66 %), la técnica de Bayes tuvo mejores resultados de clasificación correcta (91,1 %), comparado con la técnica de árbol (87,8 %) y la red neuronal (88,2 %). Conclusiones: el uso de la técnica de Naive Bayes muestra una importante exactitud para determinar la predicción de riesgo de malignidad en los pacientes con estudio citológico Bethesda IV, lo cual permitiría orientar de forma adecuada el manejo quirúrgico de los pacientes
Introduction: In the health field, each decision represents data, and data mining techniques have begun to emerge as a promising methodology for analyzing this information, especially in the design of predictive models. Methods: Analytical observational study of patients older than 15 years with a Bethesda IV report after fine needle aspiration biopsy who underwent surgical management at the Hospital de San José in Bogotá. The data collected from these patients were grouped into three sets: sociodemographic and clinical information, cytology findings, and ultrasound reports. Analysis was performed using three techniques: Naive Bayes, decision trees, and neural networks. Weka version 3.8.2 was used. Results: Of the 427 patients, 195 had a pathology result of thyroid carcinoma (45.6%). Cross-validation (10-fold) gave better results than a partition (66%). The Naive Bayes technique had a higher rate of correct classification (91.1%) than the tree technique (87.8%) and the neural network (88.2%). Conclusions: The Naive Bayes technique shows considerable accuracy in predicting the risk of malignancy in patients with a Bethesda IV cytological study, which would help guide the surgical management of these patients appropriately.
Subject(s)
Humans , Data Mining
ABSTRACT
Diabetes mellitus (DM) is a category of metabolic disorders caused by high blood sugar. DM affects human metabolism and causes many complications, such as heart disease, neuropathy, diabetic retinopathy, kidney problems, skin disorders and slow healing. It is therefore essential to predict the presence of DM using an automated diabetes diagnosis system, which can be implemented using machine learning algorithms. A variety of automated diabetes prediction systems have been proposed in previous studies. Even so, the low prediction accuracy of DM prediction systems remains a major issue. This work developed a DM prediction system that improves prediction accuracy using an Optimized Gaussian Naïve Bayes algorithm. The proposed model uses the Pima Indians diabetes dataset as input to build the DM predictive model. Missing values in the input dataset are imputed using the regression imputation method. Sequential backward feature elimination is used to select the relevant risk factors for diabetes. The proposed machine learning classifier, named Optimized Gaussian Naïve Bayes (OGNB), is applied to the selected risk factors to create an enhanced diabetes diagnostic system that predicts diabetes in an individual. The performance analysis of this prediction architecture shows that the OGNB achieves a classification accuracy of 81.85%, higher than that of other traditional machine learning classifiers. The proposed DM prediction system is effective compared with other diabetes prediction systems found in the literature. According to our experimental study, the OGNB-based diabetes mellitus prediction system is more appropriate for DM prediction.
ABSTRACT
The clinical manifestations of patients with schizophrenia and patients with depression not only have a certain similarity, but also change with the patient's mood, and thus lead to misdiagnosis in clinical diagnosis. Electroencephalogram (EEG) analysis provides an important reference and objective basis for accurate differentiation and diagnosis between patients with schizophrenia and patients with depression. In order to solve the problem of misdiagnosis between patients with schizophrenia and patients with depression, and to improve the accuracy of the classification and diagnosis of these two diseases, in this study we extracted the resting-state EEG features from 100 patients with depression and 100 patients with schizophrenia, including information entropy, sample entropy and approximate entropy, statistical properties feature and relative power spectral density (rPSD) of each EEG rhythm (δ, θ, α, β). Then feature vectors were formed to classify these two types of patients using the support vector machine (SVM) and the naive Bayes (NB) classifier. Experimental results indicate that: ① The rPSD feature vector performs the best in classification, achieving an average accuracy of 84.2% and a highest accuracy of 86.3%; ② The accuracy of SVM is obviously better than that of NB; ③ For the rPSD of each rhythm, the β rhythm performs the best with the highest accuracy of 76%; ④ Electrodes with large feature weight are mainly concentrated in the frontal lobe and parietal lobe. The results of this study indicate that the rPSD feature vector in conjunction with SVM can effectively distinguish depression and schizophrenia, and can also play an auxiliary role in the relevant clinical diagnosis.
Subject(s)
Humans , Bayes Theorem , Depression , Electroencephalography , Schizophrenia , Signal Processing, Computer-Assisted , Support Vector Machine
ABSTRACT
The paper introduces the research idea, design and implementation of a distributed Naive Bayesian intelligent diagnosis system based on Hadoop, and describes optimizations and improvements made for its application in the Traditional Chinese Medicine (TCM) Hospital of Guangdong Province, including improvements to the algorithm design and enhancements to the accuracy, extensibility and security of the system.
ABSTRACT
This paper proposes a novel P1-weighted Lukasiewicz Logic based Fuzzy Similarity Classifier for classifying the Denver Group of chromosomes and compares its performance with the other classifiers under study. A chromosome is classified into one of seven groups, A to G, based on the Denver System of chromosome classification. Chromosomes within a particular Denver Group are difficult to identify because their extracted features have almost identical characteristics. This work evaluates the performance of supervised classifiers, including Naive Bayes, Support Vector Machine with Gaussian Kernel (SVM) and Multilayer Perceptron (MLP), and of a novel, unsupervised, P1-weighted Lukasiewicz Logic based Fuzzy Similarity Classifier in classifying the Denver Group of chromosomes. A fundamental review of fuzzy similarity based classification is presented. Experimental results clearly demonstrate that the proposed P1-weighted Lukasiewicz Logic based Fuzzy Similarity Classifier, using the generalized Minkowski mean metric, produces the best classification results, almost identical to the Ground Truth values. One-way Analysis of Variance (ANOVA) at the 95% and 99% levels of confidence and Tukey's post-hoc analysis are performed to validate the selection of the classifier. The proposed P1-weighted Lukasiewicz Logic based Fuzzy Similarity Classifier gives the most promising classification results and can be applied to any large-scale biomedical data and to other applications.
Este trabalho propõe uma nova lógica P1pondera de Lukasiewicz de acordo com o classificador de similarida fuzzy para classificar cromossomas do Grupo Denver e compara o seu desempenho com os outros classificadores em estudo. Um cromossoma é classificado com um dos sete grupos de A a G, com base no Sistema de Denver de classificação de cromossomos. Cromossomos dentro de um grupo de Denver particular são difíceis de identificar, com características quase idênticas para os recursos extraídos. Este trabalho avalia o desempenho de classificadores supervisionados, incluindo Naive Bayes, Support Vector Machine com Gaussian Kernel (SVM), perceptron multicamadas (MLP) e um novo classificador sem supervisão, P1-weighted, lógica de Lukasiewicz de acordo com o classificador de similaridade Fuzzy para a classificação do Grupo Denver de cromossomos . Apresenta-se ma revisão fundamentada na classificação de acordo com similaridade difusa. Resultados experimentais demonstram claramente que Classificador Similaridade Fuzzy proposto de acordo com a lógica de Lukasiewicz P1-weighted usando a médica métrica de Minkowski para produz melhores resultados de classificação. Estes valores foram muito similares aos valores de Ground Truth . Análise de variancia (ANOVA) com 95% de grau de confiança e análise post-hoc de Tukey 99% foram realizadas para validar a seleção do classificador. Este classificador P1-weighted de lógica de Lukasiewicz está de acordo com o classificador de similaridade difusa oferecendo resultados declassificação mais promissoras. Portanto, podendo ser aplicado a dados biomédicos em larga escala além de outras aplicações.
Subject(s)
Chromosomes , Classification , Fuzzy Logic
ABSTRACT
OBJECTIVE: To develop breast cancer prediction models and to compare their predictive performance using Bayesian Networks (BN), Naive Bayes (NB), Classification and Regression Trees (CART), and Logistic Regression (LR). METHODS: A dataset consisting of 109 breast cancer patients and 100 healthy women was used. Hugin Researcher(TM) 6.7 and Poulin-Hugin 1.5, both of which are NB modeling software, were used. For the LR model and CART, ECMiner was used. RESULTS: The highest area under the receiver operating characteristic curve (AUC) was obtained by the Tree Augmented NB model, at .90. The lowest AUC was that of CART, at .48; that of the LR model was .86. The two BN models, with and without prior knowledge, showed virtually no difference (.64 vs. .65). The lifts of four models (Simple NB, Tree Augmented NB, Hierarchical NB, LR) were 1.9. The AUCs of both the NB and LR models were higher than those of previously published models built using LR methods. CONCLUSION: NB, which is more or less free from statistical assumptions and limitations, could be preferred to LR in developing a predictive model to promote regular screening tests and early detection.
Subject(s)
Female , Humans , Area Under Curve , Bays , Breast , Breast Neoplasms , Logistic Models , Mass Screening , Risk Assessment , ROC Curve
ABSTRACT
OBJECTIVE: In the United States today, about one in eight women is affected by breast cancer over her lifetime. Various prediction models using SEER (Surveillance, Epidemiology, and End Results) datasets have been proposed in past studies. However, appropriate methods for predicting the 5-year survival rate of breast cancer have not been established. In this study, we evaluate those models for predicting the survival rate of breast cancer patients. METHODS: Five algorithms were used to evaluate the prediction models on a dataset of 37,256 follow-up cases from 1992 to 1997: four data mining algorithms (Artificial Neural Network, Naive Bayes, Decision Trees (ID3) and Decision Trees (J48)) and the most widely used statistical method (Logistic Regression). We also used 10-fold cross-validation to obtain an unbiased estimate of the performance of the five prediction models for comparison. RESULTS: The accuracy was 85.8+/-0.2%, 84.3+/-1.4%, 83.9+/-0.2%, 82.3+/-0.2% and 75.1+/-0.2% for Logistic Regression, Artificial Neural Network, Naive Bayes, Decision Trees (ID3) and Decision Trees (J48), respectively. Logistic Regression showed the highest accuracy and Decision Trees (J48) the lowest. CONCLUSIONS: Logistic Regression achieved the best accuracy, whereas Decision Trees (J48) performed worst. Artificial Neural Network showed relatively high performance.
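The 10-fold cross-validation mentioned above partitions the cases into ten folds and uses each fold once as the test set while training on the remaining nine. A minimal sketch of the index bookkeeping follows; the function name is hypothetical, and shuffling and stratification, which the abstract does not describe, are deliberately left out.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Folds are contiguous and differ in size by at most one sample.
    Shuffle the data beforehand if its order is not random.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first (n_samples % k) folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

Averaging a model's accuracy over the ten test folds gives figures of the mean +/- spread form reported in the results.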
Subject(s)
Female , Humans , Bays , Breast Neoplasms , Breast , Data Mining , Dataset , Decision Trees , Epidemiology , Follow-Up Studies , Hand , Logistic Models , SEER Program , Survival Rate , United States
ABSTRACT
Predicting enzyme class from protein structure parameters is a challenging problem in protein analysis. We developed a method to predict enzyme class that combines the strengths of statistical and data-mining methods. This method has a strong mathematical foundation and is simple to implement, achieving an accuracy of 45%. A comparison with the methods found in the literature designed to predict enzyme class showed that our method outperforms the existing methods.
Subject(s)
Humans , Protein Conformation , Enzymes/chemistry , Enzymes/classification , Bayes Theorem , Algorithms , Sequence Alignment
ABSTRACT
OBJECTIVE: The purpose of this study was to explore the usability of a feature selection method based on mutual information theory for increasing the predictive performance of a classifier in data mining. METHODS: The HIV Cost and Services Utilization Study (HCSUS) dataset was used to apply the feature selection method to a classifier. Its contribution to the predictive performance of the classifier was evaluated by comparing Naive Bayes (NB) and Logistic Regression (LG) models built with different variables. Infrequent office visits, representing limited health service utilization, were selected as the outcome variable. HUGIN Researcher(TM) 6.3 was used to train and test the NB models, and SAS(R) 8.0 was used for the LG modeling. RESULTS: The NB model achieved a higher AUC with the variables selected by the mutual information based feature selection method (AUC=.639, CI=.611, .660) than with the variables defined by a previous study (AUC=.599, CI=.570, .620). There was no difference between the LG models with different variables. CONCLUSION: This study demonstrated that the mutual information method may be useful as a feature selection method for identifying relevant predictors, which can contribute to an increase in the predictive performance of a classifier.
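Mutual information based feature selection scores each candidate variable by how much knowing it reduces uncertainty about the outcome, then keeps the highest-scoring variables. The abstract does not state the exact procedure, so the sketch below shows only the standard definition for discrete variables, I(X;Y) = sum over (x, y) of p(x,y) log2[p(x,y) / (p(x)p(y))]. The function names and the ranking helper are assumptions for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def rank_features(features, outcome):
    """features: dict of name -> list of discrete values. Returns names, best first."""
    scores = {name: mutual_information(values, outcome) for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A variable that determines the outcome exactly scores the outcome's full entropy, while a variable independent of the outcome scores zero, so the ranking places informative predictors first.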
Subject(s)
Area Under Curve , Data Mining , Dataset , Health Services , HIV , Information Theory , Office Visits
ABSTRACT
Human Papillomavirus (HPV) infection is known to be the main factor in cervical cancer, which is a leading cause of cancer deaths in women worldwide. Because there are more than 100 types of HPV, it is critical to discriminate the HPVs related to cervical cancer from those that are not. In this paper, the risk type of HPVs is classified using their textual explanation. The important issue in this problem is to distinguish false negatives from false positives. That is, we must find as many high-risk HPVs as possible, even though we may miss some low-risk HPVs. For this purpose, AdaCost, a cost-sensitive learner, is adopted to account for different costs between training examples. The experimental results on the HPV sequence database show that taking costs into account gives higher performance. The improvement in F-score is greater than that in accuracy, which implies that the number of high-risk HPVs found is increased.