Pesquisa | Portal Regional da BVS

Offensive language detection in low resource languages: A use case of Persian language.

Mozafari, Marzieh; Mnassri, Khouloud; Farahbakhsh, Reza; Crespi, Noel.

PLoS One ; 19(6): e0304166, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38905214

RESUMO

THIS ARTICLE USES WORDS OR LANGUAGE THAT IS CONSIDERED PROFANE, VULGAR, OR OFFENSIVE BY SOME READERS. Different types of abusive content such as offensive language, hate speech, aggression, etc. have become prevalent in social media and many efforts have been dedicated to automatically detect this phenomenon in different resource-rich languages such as English. This is mainly due to the comparative lack of annotated data related to offensive language in low-resource languages, especially the ones spoken in Asian countries. To reduce the vulnerability among social media users from these regions, it is crucial to address the problem of offensive language in such low-resource languages. Hence, we present a new corpus of Persian offensive language consisting of 6,000 out of 520,000 randomly sampled micro-blog posts from X (Twitter) to deal with offensive language detection in Persian as a low-resource language in this area. We introduce a method for creating the corpus and annotating it according to the annotation practices of recent efforts for some benchmark datasets in other languages which results in categorizing offensive language and the target of offense as well. We perform extensive experiments with three classifiers in different levels of annotation with a number of classical Machine Learning (ML), Deep learning (DL), and transformer-based neural networks including monolingual and multilingual pre-trained language models. Furthermore, we propose an ensemble model integrating the aforementioned models to boost the performance of our offensive language detection task. Initial results on single models indicate that SVM trained on character or word n-grams are the best performing models accompanying monolingual transformer-based pre-trained language model ParsBERT in identifying offensive vs non-offensive content, targeted vs untargeted offense, and offensive towards individual or group. In addition, the stacking ensemble model outperforms the single models by a substantial margin, obtaining 5% respective macro F1-score improvement for three levels of annotation.

Assuntos

Idioma , Humanos , Mídias Sociais , Aprendizado de Máquina , Irã (Geográfico)

A survey on multi-lingual offensive language detection.

Mnassri, Khouloud; Farahbakhsh, Reza; Chalehchaleh, Razieh; Rajapaksha, Praboda; Jafari, Amir Reza; Li, Guanlin; Crespi, Noel.

PeerJ Comput Sci ; 10: e1934, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38660178

RESUMO

The prevalence of offensive content on online communication and social media platforms is growing more and more common, which makes its detection difficult, especially in multilingual settings. The term "Offensive Language" encompasses a wide range of expressions, including various forms of hate speech and aggressive content. Therefore, exploring multilingual offensive content, that goes beyond a single language, focus and represents more linguistic diversities and cultural factors. By exploring multilingual offensive content, we can broaden our understanding and effectively combat the widespread global impact of offensive language. This survey examines the existing state of multilingual offensive language detection, including a comprehensive analysis on previous multilingual approaches, and existing datasets, as well as provides resources in the field. We also explore the related community challenges on this task, which include technical, cultural, and linguistic ones, as well as their limitations. Furthermore, in this survey we propose several potential future directions toward more efficient solutions for multilingual offensive language detection, enabling safer digital communication environment worldwide.

Multilingual Hate Speech Detection: A Semi-Supervised Generative Adversarial Approach.

Mnassri, Khouloud; Farahbakhsh, Reza; Crespi, Noel.

Entropy (Basel) ; 26(4)2024 Apr 18.

Artigo em Inglês | MEDLINE | ID: mdl-38667898

RESUMO

Social media platforms have surpassed cultural and linguistic boundaries, thus enabling online communication worldwide. However, the expanded use of various languages has intensified the challenge of online detection of hate speech content. Despite the release of multiple Natural Language Processing (NLP) solutions implementing cutting-edge machine learning techniques, the scarcity of data, especially labeled data, remains a considerable obstacle, which further requires the use of semisupervised approaches along with Generative Artificial Intelligence (Generative AI) techniques. This paper introduces an innovative approach, a multilingual semisupervised model combining Generative Adversarial Networks (GANs) and Pretrained Language Models (PLMs), more precisely mBERT and XLM-RoBERTa. Our approach proves its effectiveness in the detection of hate speech and offensive language in Indo-European languages (in English, German, and Hindi) when employing only 20% annotated data from the HASOC2019 dataset, thereby presenting significantly high performances in each of multilingual, zero-shot crosslingual, and monolingual training scenarios. Our study provides a robust mBERT-based semisupervised GAN model (SS-GAN-mBERT) that outperformed the XLM-RoBERTa-based model (SS-GAN-XLM) and reached an average F1 score boost of 9.23% and an accuracy increase of 5.75% over the baseline semisupervised mBERT model.

RESUMO

Assuntos

RESUMO

RESUMO

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA