Named Entity Recognition and Relation Extraction for COVID-19: Explainable Active Learning with Word2vec Embeddings and Transformer-Based BERT Models

Arguello-Casteleiro, M.; Maroto, N.; Wroe, C.; Torrado, C. S.; Henson, C.; Des-Diz, J.; Fernandez-Prieto, M. J.; Furmston, T.; Fernandez, D. M.; Kulshrestha, M.; Stevens, R.; Keane, J.; Peters, S.

41st SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, AI 2021 ; 13101 LNAI:158-163, 2021.

Article in English | Scopus | ID: covidwho-1603584

ABSTRACT

Deep learning for natural language processing acquires dense vector representations for n-grams from large-scale unstructured corpora. Converting static embeddings of n-grams into a dataset of interlinked concepts with explicit contextual semantic dependencies provides the foundation to acquire reusable knowledge. However, the validation of this knowledge requires cross-checking with ground-truths that may be unavailable in an actionable or computable form. This paper presents a novel approach from the new field of explainable active learning that combines methods for learning static embeddings (word2vec models) with methods for learning dynamic contextual embeddings (transformer-based BERT models). We created a dataset for named entity recognition (NER) and relation extraction (REX) for Coronavirus Disease 2019 (COVID-19). The COVID-19 dataset has 2,212 associations captured by 11 word2vec models with additional examples of use from the biomedical literature. We propose interpreting the NER and REX tasks for COVID-19 as Question Answering (QA) incorporating general medical knowledge within the question, e.g. “does ‘cough’ (n-gram) belong to ‘clinical presentation/symptoms’ for COVID-19?”. We evaluated biomedical-specific pre-trained language models (BioBERT, SciBERT, ClinicalBERT, BlueBERT, and PubMedBERT) versus general-domain pre-trained language models (BERT and RoBERTa) for transfer learning with the COVID-19 dataset, i.e. task-specific fine-tuning considering NER as a sequence-level task. Using 2,060 QA for training (associations from 10 word2vec models) and 152 QA for validation (associations from 1 word2vec model), BERT obtained an F-measure of 87.38%, with precision = 93.75% and recall = 81.82%. SciBERT achieved the highest F-measure of 94.34%, with precision = 98.04% and recall = 90.91%. © 2021, Springer Nature Switzerland AG.
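
The reported F-measures can be cross-checked against the reported precision and recall. The abstract does not state which F-measure variant was used; the sketch below assumes the balanced F-measure (F1, the harmonic mean of precision and recall). The function name `f_measure` and the `reported` dictionary are illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced F1 score.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as stated in the abstract.
reported = {
    "BERT": (93.75, 81.82),
    "SciBERT": (98.04, 90.91),
}

# Recompute the F-measure for each model, rounded to two decimals.
scores = {name: round(f_measure(p, r), 2) for name, (p, r) in reported.items()}

for name, score in scores.items():
    print(f"{name}: F-measure = {score:.2f}%")
```

Under this assumption the recomputed values agree with the abstract's figures of 87.38% for BERT and 94.34% for SciBERT.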

Keywords

Deep learning for natural language processing; Embeddings; Explainable active learning; Transfer learning; Transformer-based models; Computational linguistics; Deep learning; Extraction; Natural language processing systems; Semantics; Active Learning; Coronaviruses; N-grams; Named entity recognition; Relation extraction; Transformer-based model; Coronavirus

Full text: Available | Collection: Databases of international organizations | Database: Scopus | Language: English | Journal: 41st SGAI International Conference on Innovative Techniques and Applications of Artificial Intelligence, AI 2021 | Year: 2021 | Document Type: Article
