SMS Scam Detection Application Based on Optical Character Recognition for Image Data Using Unsupervised and Deep Semi-Supervised Learning.

Shinde, Anjali; Shahra, Essa Q; Basurra, Shadi; Saeed, Faisal; AlSewari, Abdulrahman A; Jabbar, Waheb A

Affiliation

Shinde A; Faculty of Computing, Engineering and Built Environment, Birmingham City University, Birmingham B4 7RQ, UK.
Shahra EQ; Faculty of Computing, Engineering and Built Environment, Birmingham City University, Birmingham B4 7RQ, UK.
Basurra S; Faculty of Computing, Engineering and Built Environment, Birmingham City University, Birmingham B4 7RQ, UK.
Saeed F; Faculty of Computing, Engineering and Built Environment, Birmingham City University, Birmingham B4 7RQ, UK.
AlSewari AA; Faculty of Computing, Engineering and Built Environment, Birmingham City University, Birmingham B4 7RQ, UK.
Jabbar WA; Faculty of Computing, Engineering and Built Environment, Birmingham City University, Birmingham B4 7RQ, UK.

Sensors (Basel); 24(18), 2024 Sep 20.

Article in En | MEDLINE | ID: mdl-39338829

ABSTRACT

The growing problem of unsolicited text messages (smishing) and data irregularities necessitates stronger spam detection solutions. This paper explores the development of a model designed to identify smishing messages by understanding the complex relationships among words, images, and context-specific factors, areas that remain underexplored in existing research. To address this, we merge a UCI spam dataset of regular text messages with real-world spam data, leveraging OCR technology for comprehensive analysis. The study employs a combination of traditional machine learning models, including K-means, Non-Negative Matrix Factorization, and Gaussian Mixture Models, along with feature extraction techniques such as TF-IDF and PCA. Additionally, deep learning models such as RNN-Flatten, LSTM, and Bi-LSTM are utilized. The selection of these models is driven by their complementary strengths in capturing both the linear and non-linear relationships inherent in smishing messages. Machine learning models are chosen for their efficiency in handling structured text data, while deep learning models are selected for their superior ability to capture sequential dependencies and contextual nuances. The performance of these models is evaluated using accuracy, precision, recall, and F1 score, enabling a comparative analysis between the machine learning and deep learning approaches. Notably, the K-means feature extraction with vectorizer achieved 91.01% accuracy, and the RNN-Flatten model reached 94.13% accuracy, emerging as the top performer. The rationale behind highlighting these models is their potential to significantly improve smishing detection rates. For instance, the high accuracy of the RNN-Flatten model suggests its applicability in real-time spam detection systems, but its computational complexity might limit scalability in large-scale deployments. Similarly, while K-means with vectorizer excels in accuracy, it may struggle with the dynamic and evolving nature of smishing attacks, necessitating continual retraining.
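
The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `binary_metrics`, the label convention (1 = smishing, 0 = legitimate), and the toy labels below are all assumptions made for illustration only.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Hypothetical helper for illustration; `positive` marks the class
    treated as smishing (an assumed convention, not from the paper).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of the messages flagged as smishing, how many really were.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real smishing messages, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (made up): 1 = smishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can look high on imbalanced spam data, which is presumably why precision, recall, and F1 are reported alongside it.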

Keywords

deep learning semi supervised; feature extraction; smishing message; unsupervised machine learning

Full text: 1 Collections: 01-international Database: MEDLINE Language: En Journal: Sensors (Basel) Publication year: 2024 Document type: Article Country of publication: Switzerland
