Results 1 - 2 of 2
1.
Data Brief; 47: 108933, 2023 Apr.
Article in English | MEDLINE | ID: mdl-36819905

ABSTRACT

Reading comprehension (RC) is an increasingly popular topic in Bangla Natural Language Processing (NLP) research, in both machine learning and deep learning. However, no original Bangla RC dataset built from native sources exists; the available datasets are translations of foreign RC datasets and contain anomalies and mismatched translations. In this paper, we present UDDIPOK, a novel wide-ranging, open-domain Bangla reading comprehension dataset. It contains 270 reading passages and 3636 questions and answers drawn from diverse origins, for instance textbooks, middle and high school exam questions, and newspapers. The dataset is formatted as a CSV file with three columns (passages, questions, and answers), so the data can be handled quickly and easily in any machine learning research.
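The three-column CSV layout makes the dataset straightforward to consume programmatically. Below is a minimal sketch of loading it with pandas; the file name "uddipok.csv" and the exact column labels are assumptions, since the abstract specifies only that the three columns hold passages, questions, and answers.

```python
import pandas as pd

# Minimal sketch: load the UDDIPOK CSV described in the abstract.
# The file name and column labels below are assumptions; the paper
# only states that the CSV has three columns for passages,
# questions, and answers.
df = pd.read_csv("uddipok.csv")

print(df.columns.tolist())  # assumed: ['passage', 'question', 'answer']
print(f"{df['passage'].nunique()} unique passages, {len(df)} QA pairs")
```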

2.
Heliyon; 8(10): e11052, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36254291

ABSTRACT

A question answering (QA) system in any language is an assortment of mechanisms for obtaining answers to user questions from various data compositions. Reading comprehension (RC) is one such composition, and its popularity in Natural Language Processing (NLP) research is growing steadily. Some work has been done in several languages, mainly English; in Bangla, no RC dataset was available and no prior work existed. In this research, we develop a question answering system based on RC. To do so, we construct a dataset of 3636 reading comprehension questions and answers with their passages. We apply transformer-based deep neural network models to obtain answers to questions about the reading passages precisely and swiftly. We train several deep neural network architectures on our dataset: LSTM (Long Short-Term Memory), Bi-LSTM (Bidirectional LSTM) with attention, RNN (Recurrent Neural Network), ELECTRA, and BERT (Bidirectional Encoder Representations from Transformers). The transformer-based pre-trained language architectures BERT and ELECTRA outperform the others. The trained BERT model achieves a satisfactory 87.78% testing accuracy and 99% training accuracy, while ELECTRA achieves a training accuracy of 82.5% and a testing accuracy of 93%.
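As a rough illustration of the extractive QA setup the abstract describes (a passage and a question in, an answer span out), here is a minimal sketch using the Hugging Face transformers question-answering pipeline. The checkpoint name is a placeholder, not the authors' model; reproducing the reported accuracy would require a BERT or ELECTRA model fine-tuned on the Bangla RC dataset.

```python
from transformers import pipeline

# Minimal sketch of extractive QA with a transformer, in the spirit of
# the BERT setup described above. NOT the authors' code: the checkpoint
# is a placeholder, and its QA head is not fine-tuned, so answers will
# be unreliable until the model is trained on the Bangla RC dataset.
qa = pipeline("question-answering", model="bert-base-multilingual-cased")

context = "UDDIPOK is a Bangla reading comprehension dataset with 270 passages."
question = "How many passages does the dataset contain?"

result = qa(question=question, context=context)
print(result["answer"], result["score"])  # answer span extracted from context
```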
