TweetsCOV19 - A Semantically Annotated Corpus of Tweets About the COVID-19 Pandemic

What

TweetsCOV19 is a semantically annotated corpus of Tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims at capturing online discourse about various aspects of the pandemic and its societal impact. Metadata information about the tweets as well as extracted entities, sentiments, hashtags and user mentions are exposed in RDF using established RDF/S vocabularies.

This dataset consists of 41,307,082 tweets in total, posted by 12,825,911 users and reflects the societal discourse about COVID-19 on Twitter in the period of October 2019 until August 2022. In total, this makes 676,368,018 statements in RDF, which can be queried using the SPARQL-endpoint described below.

More information is available at the following paper:

Dimitrov, D., Baran, E., Fafalios, F., Yu, R., Zhu, X., Zloch, M., and Dietze, S.,

TweetsCOV19 -- A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic,

29th ACM International Conference on Information & Knowledge Management (CIKM2020), Resource Track, ACM 2020.

[preprint] pdf

Why

The TweetsCOV19 dataset reflects online discourse during the COVID-19 pandemic in a pre-processed fashion, following established knowledge graph principles. Thus, TweetsCOV19 represents a unique corpus for studying online discourse during the Corona pandemic together with its societal impact.

On the one hand, the dataset facilitates research in the (computational) social sciences, for instance, about information diffusion processes or the impact of (dis-)information on attitudes, solidarity, risk assessment and public opinion. On the other hand, the data may serve to evaluate and improve computational methods for tasks such as sentiment analysis, event detection, topic analysis or retweet prediction.

How

To extract the dataset from TweetsKB, we applied a seed list of 268 COVID-19-related keywords. The seed list is an extension of the seed list of Chen et al. and allows a broader view on the societal discourse on COVID-19 on Twitter.

Tweets in TweetsCOV19 contain at least one keyword from the set of seed terms, are written in English and published throughout the aforementioned time period. Data cleaning and enrichment as described in TweetsKB has been applied.

Dataset

• TweetsCOV19 is available as Notation3 (N3) and tab-separated values (tsv) files through the Zenodo data repository (under a Creative Commons Attribution 4.0 license):

Dataset Part	Zenodo Link	Entity Linking	Keywords
Part 1 (Oct 2019 - April 2020)	https://zenodo.org/record/3871753	Yahoo FEL Wikipedia April 2020 Dump	List v1
Part 2 (May 2020 - May 2020)	https://zenodo.org/record/4593502	Yahoo FEL Wikipedia April 2020 Dump	List v1
Part 3 (June 2020 - Dec 2020)	https://zenodo.org/record/4593524	Yahoo FEL Wikipedia February 2021 Dump	List v1.1
Part 4 (Jan 2021 - Aug 2022)	https://doi.org/10.7802/2470	Yahoo FEL Wikipedia August 2022 Dump	List v1.1

• SPARQL endpoint containing the full dataset: SPARQL endpoint (Default Graph: http://data.gesis.org/tweetscov19)

• TSV File format Each line contains features of a tweet instance. Features are separated by tab character ("\t"). The following list indicate the feature indices:

Tweet Id: Long.
Username: String. Encrypted for privacy issues.
Timestamp: Format ( "EEE MMM dd HH:mm:ss Z yyyy" ).
#Followers: Integer.
#Friends: Integer.
#Retweets: Integer.
#Favorites: Integer.
Entities: String. For each entity, we aggregated the original text, the annotated entity and the produced score from FEL library. Each entity is separated from another entity by char ";". Also, each entity is separated by char ":" in order to store "original_text:annotated_entity:score;". If FEL did not find any entities, we have stored "null;".
Sentiment: String. SentiStrength produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. We splitted these two numbers by whitespace char " ". Positive sentiment was stored first and then negative sentiment (i.e. "2 -1").
Mentions: String. If the tweet contains mentions, we remove the char "@" and concatenate the mentions with whitespace char " ". If no mentions appear, we have stored "null;".
Hashtags: String. If the tweet contains hashtags, we remove the char "#" and concatenate the hashtags with whitespace char " ". If no hashtags appear, we have stored "null;".
URLs: String: If the tweet contains URLs, we concatenate the URLs using ":-: ". If no URLs appear, we have stored "null;"

Statistics

Overview of the TweetsCOV19 dataset (as of August 2022):

Feature	Total	Unique	Ratio of tweets with at least one feature
Hashtags:	18,734,035	1,775,379	0.28
Mentions:	28,139,872	5,124,474	0.44
Entities:	59,378,032	700,136	0.70
Non-neutral sentiment:	24,528,721	-	0.59

Selected hashtag usage over time (as of April 2020):

The figure shows a comparison of hashtag popularity over time for (a) the two most popular hashtags #coronavirus and #covid19, and for (b) #hydroxychloroquine vs. #vaccine.

Sentiments of selected PLDs mentions and user mentions over time (as of April 2020):

The figure shows the sentiment of tweets mentioning Donald Trump and WHO (the upper two figures), and containing URLs to Breitbart (the lower two figures) - a politically far-right-wing associated news media - and CNN - a left-wing associated media.

Entities over time (as of April 2020):

The table shows the top five entities (confidence level -2) and their frequency per month in the TweetsCOV19 dataset since the beginning of 2020.

January 2020		February 2020
entity	frequency	entity	frequency
Wuhan	10,147	Wuhan	10,494
Iran	5,905	Coronavirus_disease_2019	4,999
BTS	5,014	BTS	4,513
What's_Happening!!	4,899	What's_Happening!!	3,431
Twitter	4,105	Twitter	3,351

March 2020		April 2020
entity	frequency	entity	frequency
Coronavirus_disease_2019	178,396	Coronavirus_disease_2019	200,342
Social_distancing	66,176	Social_distancing	52,323
Italy	22,164	India	18,992
Wuhan	16,804	Hydroxychloroquine	15,820
India	15,822	Wuhan	14,478

Geolocation (as of April 2020 hydrated data):
The following map shows the approximated geolocation distribution of TweetsCOV19 tweets at a global scale. We also provide countrly level geolocation for USA, UK, Germany, France, Italy, Spain and India. Maps are created from geotags obtained through the pre-trained DeepGeo neutral network.

Monthly Tweet Distribution and Hydration (as of October 2020 hydrated data):
The Figure shows the monthly distribution of TweetsCOV19 tweets before (original dataset) and after hydration. From the 8,151,524 tweets IDs in TweetsCOV19, the hydrator was able to obtain the raw tweet data for 6.856.135 IDs.

Data model and features

The following figure illustrates the data model and features captured for single tweet instances. Features are extracted from tweets using a pipeline described in the paper listed here. Please note that entity and sentiment annotations may include noise, that is, our pipeline does not achieve perfect precision and recall. For all entity annotations, we provide a confidence score (nee:confidence) which allows you to choose a confidence threshold suitable to your use case, for instance, emphasising either precision or recall.

RDF/S Model:

Instantiation example:

Examples

The following query gives you the top entities co-occurring with the entity Donald Trump (US president) during April 2020.

PREFIX schema: <http://schema.org/> PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#> PREFIX dc: <http://purl.org/dc/terms/> PREFIX dbr: <http://dbpedia.org/resource/> SELECT ?otherEntity ( COUNT(?tweet) AS ?count ) WHERE { ?tweet schema:mentions ?entity ; dc:created ?date FILTER( year(?date) = 2020 and month(?date) = 4 ) . ?entity a nee:Entity ; nee:hasMatchedURI dbr:Donald_Trump . ?tweet schema:mentions ?entity2 . ?entity2 a nee:Entity ; nee:hasMatchedURI ?otherEntity FILTER ( ?otherEntity != dbr:Donald_Trump ) } GROUP BY ?otherEntity ORDER BY DESC(?count) LIMIT 5
Run query

The following query lists the number of tweets per day in April 2020 mentioning the entity Hydroxychloroquine.

PREFIX schema: <http://schema.org/> PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#> PREFIX dc: <http://purl.org/dc/terms/> PREFIX dbr: <http://dbpedia.org/resource/> SELECT ( day(?date) AS ?day ) ( COUNT(?tweet) AS ?count ) WHERE { ?tweet schema:mentions ?entity ; dc:created ?date FILTER( year(?date) = 2020 and month(?date) = 4 ) . ?entity a nee:Entity ; nee:hasMatchedURI dbr:Hydroxychloroquine . } GROUP BY day(?date) ORDER BY day(?date)
Run query

This query lists the top URLs mentioned in tweets of 6 April 2020 together with the entity Hydroxychloroquine.

PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#> PREFIX schema: <http://schema.org/> PREFIX dc: <http://purl.org/dc/terms/> PREFIX dbr: <http://dbpedia.org/resource/> SELECT ?url ( COUNT(?tweet) AS ?count ) WHERE { ?tweet schema:mentions ?entity ; dc:created ?date FILTER( year(?date)=2020 and month(?date)=4 and (day(?date)=6 || day(?date)=7) ) . ?entity a nee:Entity ; nee:hasMatchedURI dbr:Hydroxychloroquine . ?tweet schema:citation ?url } GROUP BY ?url ORDER BY DESC(?count)
Run query

To capture COVID-19-related discourse on Twitter, several datasets have been released for academic use, including one stream API (last entry in table). We list these here as a reference point for other researchers interested in using complementary datasets and features. To the best of our knowledge, TweetsCOV19 and TweetsKB are the only publicly available knowledge bases containing both precomputed entity and sentiment annotations together with extracted tweet metadata.

available tweet information	other annotation	dates contained	extraction method	languages	available format	number of tweets
A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration
tweet ID, date, time	top 1000 frequent terms, bigrams and trigrams	March 11, 2020 - present (smaller proportion January 1, 2020 - March 11th 2020)	tweets mentioning specific keywords (13 keywords)	all	csv and tsv files	309,326,736
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set
tweet ID, date	-	January 21, 2020 - present, updated weekly	tweets mentioning specific keywords, and tweets from specified accounts	all	txt files	129,911,732 (v1.9)
Coronavirus (COVID-19) Tweets Dataset
tweet ID, time frame	Sentiment score of each tweet	March 20, 2020 - present updated daily	tweets mentioning specific keywords	English	csv files	102,650,603
Coronavirus (COVID-19) Geo-tagged Tweets Dataset
tweet ID, longitude, latitude	-	since April 28, 2020, updated daily	tweets mentioning specific keywords and contains location informaion	all	csv and json files	-
Coronavirus Twitter Data: A collection of COVID-19 tweets with automated annotations
tweet ID, user ID, date	inferred geolocation	February 6, 2020 - May 20, 2020, updated regularly	tweets mentioning specific keywords (15 keywords)	all	json files	-
Coronavirus Tweet Ids
tweet ID	-	March 3, 2020 - May 1, 2020 (version 5)	tweets mentioning specific keywords (3 keywords)	all	txt files	188,026,475
GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information
tweet ID, user ID, tweet location, user location,	place mentioned in tweets	Feb 1, 2020 - May 1, 2020	800 hashtags and keywords	multilingual (62)	tsv, json	524,353,432
Crowdbreaks: Tracking health trends using public social media data and crowdsourcing
tweet ID, location	place mentioned in tweets	January 12, 2020 - May 20, 2020	tweets mentioning specific keywords (5 keywords)	English	txt files	-
Large Arabic Twitter Dataset on COVID-19
tweet ID, date	-	January 1 - April 30, 2020	tweets mentioning specific keywords, and written in Arabic	Arabic	txt files	4,514,136
ArCOV-19: The First Arabic COVID-19 Twitter Dataset with Propagation Networks
tweet ID, date	-	January 27, 2020 - March 31, 2020	tweets written in Arabic and returned by Twitter Standard search API when using COVID related keywords (e.g. Corona) as queries	Arabic	plain text file	747,599
NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset
tweet ID/micro blog ID, user ID, date	-	January 20, 2020 - March 24, 2020	filtering based on combination (AND, OR) of keywords and language	English, Japanese. Chinese (from Weibo)	tsv files	16,250,038 in English, 9,501,866 in Japanese, 173,869 in Chinese
Corona Virus (COVID-19) Turkish Tweets Dataset
tweet ID, time, user ID To-User-Id(if it is sent to a user), number of retweets	-	March 9, 2020 - May 6, 2020	tweets mentioning specific keywords (4 keywords) and written in Turkish	Turkish	csv files	4.8 million
COVID-19 stream
Full tweet objects	-	real-time	tweets mentioning specific keywords (590 hashtags and keywords by 13 May 2020)	all	streaming endpoint	-

About Us

Team:


                                        - Erdal Baran,

                                        - Stefan
                                        Dietze(GESIS – Leibniz Institute for the Social Sciences and Heinrich-Heine-University Düsseldorf) (main contact),

                                        - Dennis Segeth

                                        - Dimitar
                                        Dimitrov,

                                        
                                        - Pavlos
                                        Fafalios,

                                        - Robert
                                        Jäschke,

                                        - Ran Yu,

                                        - Xiaofei Zhu,

                                        - Matthäus
                                        Zloch

                                        - Yudong Zhang

Assert a data protection request: dimitar.dimitrov@gesis.org

L3S Research Center, University of Hannover, Germany
GESIS – Leibniz Institute for the Social Sciences, Germany