Epidemiologic information discovery from open-access COVID-19 case reports via pretrained language model.

Wang, Zhizheng; Liu, Xiao Fan; Du, Zhanwei; Wang, Lin; Wu, Ye; Holme, Petter; Lachmann, Michael; Lin, Hongfei; Wong, Zoie S Y; Xu, Xiao-Ke; Sun, Yuanyuan

Wang Z; College of Computer Science and Technology, Dalian University of Technology, Haishan Building No.2 Linggong Road, Dalian, Liaoning 116023, China.
Liu XF; Web Mining Laboratory, Department of Media and Communication, City University of Hong Kong, Hong Kong Special Administrative Region, China.
Du Z; WHO Collaborating Centre for Infectious Disease Epidemiology and Control, School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong Special Administrative Region, China.
Wang L; Department of Genetics, University of Cambridge, Cambridge CB2 3EH, UK.
Wu Y; Computational Communication Research Center and School of Journalism and Communication, Beijing Normal University, Beijing, China.
Holme P; Tokyo Tech World Research Hub Initiative (WRHI), Institute of Innovative Research, Tokyo Institute of Technology, Tokyo, Japan.
Lachmann M; Santa Fe Institute, Santa Fe, NM, USA.
Lin H; College of Computer Science and Technology, Dalian University of Technology, Haishan Building No.2 Linggong Road, Dalian, Liaoning 116023, China.
Wong ZSY; Graduate School of Public Health, St. Luke's International University, Tokyo, Japan.
Xu XK; College of Information and Communication Engineering, Dalian Minzu University, Liaoning, China.
Sun Y; College of Computer Science and Technology, Dalian University of Technology, Haishan Building No.2 Linggong Road, Dalian, Liaoning 116023, China.

iScience; 25(10): 105079, 2022 Oct 21.

Article in English | MEDLINE | ID: covidwho-2007782

ABSTRACT

Although open-access data are increasingly common and useful for epidemiological research, curating such datasets is resource-intensive and time-consuming. Regularly disclosed case reports are a major source of COVID-19 data, but they are often written in natural language with no structured format. Here, we propose a computational framework that automatically extracts epidemiological information from open-access COVID-19 case reports. We develop this framework by coupling a language model built on deep neural networks with training samples compiled using an optimized data annotation strategy. When applied to COVID-19 case reports collected from mainland China, our framework outperforms all other state-of-the-art deep learning models. The information extracted by our approach is highly consistent with that obtained from gold-standard manual coding, with a matching rate of 80%. To disseminate our algorithm, we provide an open-access online platform that can estimate key epidemiological statistics in real time, with much less data curation effort.

Keywords

Artificial intelligence; Health sciences; Machine learning; Virology

Full text: Available | Collection: International databases | Database: MEDLINE | Type of study: Case report / Observational study | Language: English | Journal: iScience | Year: 2022 | Document Type: Article | Affiliation country: J.isci.2022.105079
