An Overview of Drugs, Diseases, Genes and Proteins in the CORD-19 Corpus

Badenes-Olmedo, C.; Alonso, A.; Corcho, O.

Procesamiento Del Lenguaje Natural; (69): 165-176, 2022.

Article in English | Web of Science | ID: covidwho-2218007

ABSTRACT

Several initiatives have emerged during the COVID-19 pandemic to gather scientific publications related to coronaviruses. Among them, the COVID-19 Open Research Dataset (CORD-19) has proven to be a valuable resource that provides full-text articles from the PubMed Central, bioRxiv and medRxiv repositories. Such a large amount of biomedical literature needs to be properly managed to facilitate and promote its use by health professionals, for example by tagging documents with the biomedical entities that appear in them. We created a biomedical named entity recognition (NER) system that also normalizes (NEN) the drugs, diseases, genes and proteins mentioned in texts to the codes of the main standardization systems, such as MeSH, ICD-10, ATC, SNOMED, ChEBI, GARD and NCBI. It is based on fine-tuning the BioBERT language model independently for each entity type using domain-specific datasets, and on an inverted index search to normalize the recognized mentions. We have used the resulting BioNER+BioNEN system to process the CORD-19 corpus and offer an overview of the drugs, diseases, genes and proteins related to coronaviruses over the last fifty years.
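The normalization step mentioned in the abstract can be illustrated with a minimal sketch of an inverted (inverse) index lookup. This is not the authors' implementation: the vocabulary, the `EX:` identifiers, the function names and the Jaccard token-overlap scoring are all assumptions made for illustration, and the identifiers are placeholders rather than real MeSH, ICD-10 or ATC codes.

```python
import re
from collections import defaultdict

# Hypothetical vocabulary: the "EX:" identifiers are placeholders,
# not real codes from MeSH, ICD-10, ATC or any other system.
VOCABULARY = {
    "EX:0001": ["coronavirus infection", "coronavirus disease"],
    "EX:0002": ["severe acute respiratory syndrome", "sars"],
    "EX:0003": ["viral pneumonia"],
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(vocabulary):
    """Map each token to the (code, name) pairs whose name contains it."""
    index = defaultdict(set)
    for code, names in vocabulary.items():
        for name in names:
            for token in tokenize(name):
                index[token].add((code, name))
    return index


def normalize(mention, index):
    """Return the code of the best-matching name, or None if nothing matches.

    Candidates are retrieved through the inverted index and ranked by
    Jaccard overlap between mention tokens and name tokens (an assumed
    scoring choice, used here only to keep the sketch short).
    """
    tokens = set(tokenize(mention))
    candidates = set()
    for token in tokens:
        candidates |= index.get(token, set())
    best_code, best_score = None, 0.0
    for code, name in sorted(candidates):
        name_tokens = set(tokenize(name))
        score = len(tokens & name_tokens) / len(tokens | name_tokens)
        if score > best_score:
            best_code, best_score = code, score
    return best_code


INDEX = build_index(VOCABULARY)
```

In this sketch, a mention produced by the recognizer (for example "SARS") would be passed to `normalize(mention, INDEX)`; only names sharing at least one token with the mention are scored, which is what makes the lookup cheap on large vocabularies.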

Keywords

NER; normalization; bioentities; document retrieval; named entity recognition; ontology

Full text: Available
Collection: Databases of international organizations
Database: Web of Science
Language: English
Journal: Procesamiento Del Lenguaje Natural
Year: 2022
Document Type: Article