Results 1 - 7 of 7
1.
J Med Internet Res ; 22(12): e22609, 2020 12 08.
Article in English | MEDLINE | ID: mdl-33207310

ABSTRACT

BACKGROUND: The massive scale of social media platforms requires an automatic solution for detecting hate speech. These automatic solutions will help reduce the need for manual analysis of content. Most previous literature has cast the hate speech detection problem as a supervised text classification task using classical machine learning methods or, more recently, deep learning methods. However, work investigating this problem in Arabic cyberspace is still limited compared to the published work on English text. OBJECTIVE: This study aims to identify hate speech related to the COVID-19 pandemic posted by Twitter users in the Arab region and to discover the main issues discussed in tweets containing hate speech. METHODS: We used the ArCOV-19 dataset, an ongoing collection of Arabic tweets related to COVID-19, starting from January 27, 2020. Tweets were analyzed for hate speech using a pretrained convolutional neural network (CNN) model; each tweet was given a score between 0 and 1, with 1 being the most hateful text. We also used nonnegative matrix factorization to discover the main issues and topics discussed in hate tweets. RESULTS: The analysis of Twitter data in the Arab region showed that non-hate tweets greatly outnumbered hate tweets: hate tweets accounted for 3.2% (11,743/547,554) of COVID-19-related tweets. The analysis also revealed that the majority of hate tweets (8385/11,743, 71.4%) contained a low level of hate based on the score provided by the CNN. This study identified Saudi Arabia as the Arab country from which the most COVID-19 hate tweets originated during the pandemic. Furthermore, we showed that the largest number of hate tweets appeared during March 1-30, 2020, representing 51.9% of all hate tweets (6095/11,743).
Contrary to expectations, the spread of COVID-19-related hate speech on Twitter in the Arab region was only weakly correlated with the spread of the pandemic itself, based on the Pearson correlation coefficient (r=0.1982, P=.50). The study also identified the most commonly discussed topics in hate tweets during the pandemic. Analysis of the 7 extracted topics showed that 6 of them were related to hate speech against China and Iran. Arab users also discussed topics related to political conflicts in the Arab region during the COVID-19 pandemic. CONCLUSIONS: The COVID-19 pandemic poses serious public health challenges to nations worldwide. During the pandemic, frequent use of social media can contribute to the spread of hate speech. Hate speech on the web can have a negative impact on society, and it may correlate directly with real hate crimes, which increases the threat of being targeted by hate speech and abusive language. This study is the first to analyze hate speech in the context of Arabic COVID-19-related tweets in the Arab region.
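The weak relationship reported above can be checked with the standard Pearson formula. A minimal sketch in plain Python, using hypothetical weekly counts rather than the study's data:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical weekly series: confirmed COVID-19 cases vs. hate tweets.
cases = [10, 40, 90, 160, 250]
hate_tweets = [5, 3, 8, 4, 7]
r = pearson_r(cases, hate_tweets)
```

A value of r near 0, as reported in the abstract, indicates that hate-tweet volume did not track case counts.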


Subject(s)
COVID-19/epidemiology , Deep Learning/standards , Hate , SARS-CoV-2/pathogenicity , Social Media/standards , Speech/physiology , Humans , Pandemics , Research Design , Saudi Arabia
2.
Biomed Res Int ; 2019: 6750296, 2019.
Article in English | MEDLINE | ID: mdl-30809545

ABSTRACT

In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSMs). Continuous growth in the volume and diversity of data has produced what is known as big data, which current biological SSMs cannot handle; these measures therefore need a way to cope with big data's scale. We used parallel and distributed processing, splitting data into multiple partitions and applying SSMs to each partition; this approach helped manage the scalability and computational problems of big data. Our solution involves three steps: gene ontology (GO) splitting, data clustering, and semantic similarity calculation. To test this method, split-GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] were enhanced by introducing threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in SSMs reduced the time needed to calculate semantic similarity between gene pairs and improved the performance of all three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, combined with the split-GO and data clustering algorithms to partition input data based on similarity, reduced the average time more than equally dividing the input data did. Time reduction increased with the number of splits: 24.1%, 39.2%, and 66.6% for Threaded SSDD and 33.0%, 78.2%, and 93.1% for Threaded SORA in the case of 2, 3, and 4 slaves, respectively, and 92.04% for Threaded Resnik in the case of 4 slaves.
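The threaded step described above can be sketched with Python's thread pool. The similarity function below is a toy Jaccard overlap on hypothetical GO annotation sets, standing in for real measures such as Resnik or SSDD:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical GO-term annotation sets; real pipelines would load these
# from Gene Ontology annotation files.
ANNOTATIONS = {
    "geneA": {"GO:1", "GO:2", "GO:3"},
    "geneB": {"GO:2", "GO:3", "GO:4"},
    "geneC": {"GO:5"},
}

def similarity(pair):
    """Toy Jaccard similarity between the annotation sets of two genes."""
    g1, g2 = pair
    a, b = ANNOTATIONS[g1], ANNOTATIONS[g2]
    return (g1, g2, len(a & b) / len(a | b))

def threaded_similarities(pairs, workers=4):
    """Score gene pairs concurrently, as in the threaded SSM variants."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(similarity, pairs))

results = threaded_similarities([("geneA", "geneB"), ("geneA", "geneC")])
```

Note that in CPython threads mainly pay off when the per-pair computation does I/O or releases the GIL; the distributed split-and-cluster design in the paper addresses the CPU-bound case.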


Subject(s)
Big Data , Computational Biology/methods , Proteins/genetics , Semantics , Algorithms , Cluster Analysis , Gene Ontology , Molecular Sequence Annotation , Software
3.
Data Brief ; 22: 237-240, 2019 Feb.
Article in English | MEDLINE | ID: mdl-30591941

ABSTRACT

Grammar error correction can be considered a "translation" problem, in which an erroneous sentence is "translated" into a correct version of the sentence in the same language. This can be accomplished by employing techniques such as Statistical Machine Translation (SMT) or Neural Machine Translation (NMT). Producing SMT or NMT models for the goal of grammar correction requires a monolingual parallel corpus of the language in question. This data article presents a monolingual parallel corpus of Arabic text called A7׳ta (). It contains 470 erroneous sentences and their 470 error-free counterparts. This Arabic parallel corpus can be used as a linguistic resource for Arabic natural language processing (NLP), mainly to train sequence-to-sequence models for grammar checking. Sentences were manually collected from a book prepared as a guide for correctly writing and using Arabic grammar and other linguistic features. Although a number of Arabic corpora of errors and corrections are available [2], such as QALB [10] and the Arabic Learner Corpus [11], the corpus presented in this article is an effort to increase the number of freely available Arabic corpora of errors and corrections by providing a detailed error specification and leveraging the work of language experts.
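A parallel corpus of this kind is typically serialized as line-aligned source (erroneous) and target (correct) files, the input format most seq2seq toolkits consume. A minimal sketch, with illustrative English stand-in pairs rather than entries from A7׳ta:

```python
# Each entry pairs an erroneous sentence with its corrected form.
# These example sentences are illustrative, not from the actual corpus.
corpus = [
    ("He go to school yesterday.", "He went to school yesterday."),
    ("She have two cat.", "She has two cats."),
]

def write_aligned(pairs, src_path, tgt_path):
    """Write line-aligned source (erroneous) and target (correct) files."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for erroneous, correct in pairs:
            src.write(erroneous + "\n")
            tgt.write(correct + "\n")
```

Line i of the source file then aligns with line i of the target file, which is exactly the pairing a sequence-to-sequence grammar checker trains on.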

4.
BMC Health Serv Res ; 18(1): 139, 2018 02 27.
Article in English | MEDLINE | ID: mdl-29482618

ABSTRACT

BACKGROUND: Written Medicine Information (WMI) is one of the sources that patients use to obtain information about medicines. This paper aims to assess the readability of two types of WMI in the Arabic language, based on vocabulary use and sentence structure, using a panel of experts and consumers. METHODS: This is a descriptive study. Two types of materials were evaluated: online text from the King Abdullah Bin Abdulaziz Arabic Health Encyclopaedia (KAAHE) and medication leaflets submitted by manufacturers to the Saudi Food and Drug Authority (SFDA). We selected a group of sentences from each WMI. Readability was assessed by experts (n = 5) and consumers (n = 5). The readability of each sentence was measured using specific criteria and rated as 1 = easy, 2 = intermediate, or 3 = difficult. RESULTS: A total of 4476 sentences (SFDA: 2231; KAAHE: 2245) were extracted from websites or patient information leaflets covering 50 medications and evaluated. The majority of the vocabulary and sentence structure was considered easy by both the expert (SFDA: 68%; KAAHE: 76%) and consumer (SFDA: 76%; KAAHE: 84%) groups. The sentences with difficult or intermediate vocabulary and sentence structure were derived primarily from the precautions and side effects sections. CONCLUSIONS: The SFDA and KAAHE WMIs are easy to read and understand, as judged by our study sample. However, there is room for improvement, especially in the sections related to side effects and precautions.
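The per-source percentages above come from simple aggregation over the study's 3-point scale (1 = easy, 2 = intermediate, 3 = difficult). A minimal sketch with hypothetical ratings, not the study's data:

```python
from collections import Counter

# Hypothetical sentence ratings per WMI source on the 3-point scale:
# 1 = easy, 2 = intermediate, 3 = difficult.
ratings = {
    "SFDA":  [1, 1, 2, 1, 3, 1, 2, 1],
    "KAAHE": [1, 1, 1, 2, 1, 1, 1, 3],
}

def percent_easy(scores):
    """Share of sentences rated 1 (easy), as reported per WMI source."""
    counts = Counter(scores)
    return 100.0 * counts[1] / len(scores)

summary = {source: percent_easy(scores) for source, scores in ratings.items()}
```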


Subject(s)
Comprehension , Consumer Health Information , Language , Patient Education as Topic , Health Services Research , Humans , Internet , Pamphlets , Saudi Arabia , Vocabulary
5.
Sensors (Basel) ; 16(5)2016 05 16.
Article in English | MEDLINE | ID: mdl-27196906

ABSTRACT

In recent years, indoor positioning has emerged as a critical function in many end-user applications, including military, civilian, disaster relief, and peacekeeping missions. In comparison with outdoor environments, sensing location information indoors requires higher precision and is a more challenging task, in part because various objects reflect and disperse signals. Ultra WideBand (UWB) is an emerging technology in the field of indoor positioning that has shown better performance than others. To set the stage for this work, we provide a survey of state-of-the-art indoor positioning technologies, followed by a detailed comparative analysis of UWB positioning technologies. We also provide an analysis of strengths, weaknesses, opportunities, and threats (SWOT) to assess the present state of UWB positioning technologies. While SWOT is not a quantitative approach, it helps in assessing the real status of UWB positioning and in revealing its potential to effectively address the indoor positioning problem. Unlike previous studies, this paper presents new taxonomies, reviews some major recent advances, and argues for further exploration of this challenging problem space by the research community.

6.
ScientificWorldJournal ; 2014: 631394, 2014.
Article in English | MEDLINE | ID: mdl-24892064

ABSTRACT

We present the implementation and evaluation of a sentiment analysis system for Arabic text with evaluative content. Our system comprises two components. The first is a game that enables users to annotate large corpora of text in a fun manner. The game produces the linguistic resources needed by the second component, the sentiment analyzer. Two algorithms were designed to employ these linguistic resources to analyze text and classify it according to its sentiment polarity. The first approach uses sentimental tag patterns and reached a precision of 56.14%. The second is the sentimental majority approach, which counts the negative and positive phrases in a sentence and classifies the sentence according to the dominant polarity. In evaluation, the first variation of the sentimental majority approach yielded the system's highest accuracy, 60.5%, while the second variation scored 60.32%.
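The sentimental majority approach lends itself to a compact sketch: count positive and negative phrases and take the dominant polarity. The tiny English lexicons below are illustrative stand-ins for the game-built Arabic resources:

```python
# Illustrative lexicons; the actual system uses game-annotated Arabic resources.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}

def majority_polarity(sentence):
    """Classify a sentence by whichever polarity has more matching tokens."""
    tokens = sentence.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```

The real analyzer works over phrases rather than single tokens, but the tie-breaking logic is the same: the dominant polarity wins.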


Subject(s)
Linguistics , Arab World , Humans
7.
ScientificWorldJournal ; 2014: 602745, 2014.
Article in English | MEDLINE | ID: mdl-25610910

ABSTRACT

Despite the accessibility of numerous online corpora, students and researchers engaged in the fields of Natural Language Processing (NLP), corpus linguistics, and language learning and teaching may encounter situations in which they need to develop their own corpora. Several commercial and free standalone corpora processing systems are available to process such corpora. In this study, we first propose a framework for the evaluation of standalone corpora processing systems and then use it to evaluate seven freely available systems. The proposed framework considers the usability, functionality, and performance of the evaluated systems, while taking into account their suitability for Arabic corpora. Although most of the evaluated systems exhibited comparable usability scores, their functionality and performance scores differed substantially with respect to support for the Arabic language and n-gram profile generation. The results of our evaluation will help potential users of the evaluated systems choose the system that best meets their needs. More importantly, the results will help the developers of the evaluated systems enhance them, and will provide developers of new corpora processing systems with a reference framework.
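N-gram profile generation, one of the features on which the evaluated systems differed, can be sketched in a few lines. This character-level version is illustrative only, not drawn from any of the evaluated systems:

```python
from collections import Counter

def ngram_profile(text, n=2):
    """Character n-gram frequency profile, a feature corpora tools commonly emit."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return Counter(grams)

profile = ngram_profile("abab", 2)
```

Working at the character level rather than the word level is what makes Arabic support nontrivial: a robust tool must handle right-to-left script, diacritics, and clitics when segmenting n-grams.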


Subject(s)
Natural Language Processing , Algorithms , Arab World , Software