Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

Abdul-Mageed, M.; Elmadany, A.; Nagoudi, E. B.; Pabbi, D.; Verma, K.; Lin, R.

16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): 3402-3420, 2021.

Article in English | Web of Science | ID: covidwho-2156484

ABSTRACT

We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes back as far as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F1 = 97%) and another for detecting misinformation about COVID-19 (best F1 = 92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide host of phenomena related to the pandemic. Mega-COV and our models are publicly available.

Keywords

DISASTER; TWITTER

Collection: Databases of international organizations; Database: Web of Science; Language: English; Journal: 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021); Year: 2021; Document Type: Article
