We are not ready yet: limitations of state-of-the-art disease named entity recognizers.

Kühnel, Lisa; Fluck, Juliane

Kühnel L; ZB MED - Information Centre for Life Sciences, Gleueler Str. 60, Cologne, Germany. kuehnel@zbmed.de.
Fluck J; Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, Postfach 10 01 31, 33501, Bielefeld, Germany. kuehnel@zbmed.de.

J Biomed Semantics; 13(1): 26, 2022-10-27.

Article in English | MEDLINE | ID: covidwho-2089233

ABSTRACT

BACKGROUND:

Biomedical natural language processing has been the subject of intense research. Since the breakthrough of transfer learning-based methods, BERT models have been used in a variety of biomedical and clinical applications. On the available data sets, these models show excellent results, in part exceeding the inter-annotator agreement. However, biomedical named entity recognition applied to COVID-19 preprints shows a performance drop compared to the results on test data. This raises the question of how well trained models are able to predict on completely new data, i.e. to generalize.

RESULTS:

Using disease named entity recognition as an example, we investigate the robustness of different machine learning-based methods, including transfer learning, and show that current state-of-the-art methods work well for a given training set and the corresponding test set but show a significant lack of generalization when applied to new data.

CONCLUSIONS:

We argue that there is a need for larger annotated data sets for training and testing. Therefore, we foresee the curation of further data sets and, moreover, the investigation of continual learning processes for machine learning-based models.

Subject(s)

COVID-19; Data Mining; Humans; Data Mining/methods; Natural Language Processing; Machine Learning

Keywords

BERT; Manual Curation; Text mining; bioNLP

Full text: Available; Collection: International databases; Database: MEDLINE; Main subject: Data Mining / COVID-19; Type of study: Prognostic study / Reviews; Limits: Humans; Language: English; Journal: J Biomed Semantics; Year: 2022; Document Type: Article; Affiliation country: S13326-022-00280-6
