An Overview of Drugs, Diseases, Genes and Proteins in the CORD-19 Corpus
Procesamiento Del Lenguaje Natural
(69):165-176, 2022.
Article in English | Web of Science | ID: covidwho-2218007
ABSTRACT
Several initiatives have emerged during the COVID-19 pandemic to gather scientific publications related to coronaviruses. Among them, the COVID-19 Open Research Dataset (CORD-19) has proven to be a valuable resource that provides full-text articles from the PubMed Central, bioRxiv and medRxiv repositories. Such a large amount of biomedical literature needs to be properly managed to facilitate and promote its use by health professionals, for example by tagging documents with the biomedical entities that appear in them. We created a biomedical named entity recognition (NER) system that also normalizes (NEN) the drugs, diseases, genes and proteins mentioned in texts to the codes of the main standardization systems, such as MeSH, ICD-10, ATC, SNOMED, ChEBI, GARD and NCBI. It is based on fine-tuning the BioBERT language model independently for each entity type using domain-specific datasets, and on an inverse index search to normalize the references. We have used the resulting BioNER+BioNEN system to process the CORD-19 corpus and offer an overview of the drugs, diseases, genes and proteins related to coronaviruses over the last fifty years.
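The normalization step mentioned in the abstract (mapping a recognized mention to a standard code through an inverse index search) can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary, the `CODE-00x` identifiers, the whitespace tokenizer and the vote-counting score are all assumptions made for illustration, and a real system would index the actual MeSH, ATC or ICD-10 terminologies.

```python
from collections import Counter, defaultdict

# Hypothetical toy vocabulary of (code, preferred term) pairs.
# The codes are placeholders, not real MeSH/ATC/ICD-10 identifiers.
VOCABULARY = [
    ("CODE-001", "severe acute respiratory syndrome"),
    ("CODE-002", "middle east respiratory syndrome"),
    ("CODE-003", "acute kidney injury"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(vocabulary):
    """Map each token to the set of codes whose term contains that token."""
    index = defaultdict(set)
    for code, term in vocabulary:
        for token in tokenize(term):
            index[token].add(code)
    return index


def normalize(mention, index):
    """Return the code sharing the most tokens with the mention, or None."""
    votes = Counter()
    for token in tokenize(mention):
        for code in index.get(token, ()):
            votes[code] += 1
    if not votes:
        return None
    # Sorting the candidates first makes tie-breaking deterministic.
    return max(sorted(votes), key=votes.get)


if __name__ == "__main__":
    index = build_index(VOCABULARY)
    print(normalize("Severe Acute Respiratory Syndrome", index))
```

In this sketch the NER model would supply the mention strings. The index lookup only touches codes that share at least one token with the mention, which is what keeps the search cheap on large terminologies.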
Full text: Available
Collection: Databases of international organizations
Database: Web of Science
Language: English
Journal: Procesamiento Del Lenguaje Natural
Year: 2022
Document Type: Article