TweetsCOV19

What

TweetsCOV19 is a semantically annotated corpus of Tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims at capturing online discourse about various aspects of the pandemic and its societal impact. Metadata information about the tweets as well as extracted entities, sentiments, hashtags and user mentions are exposed in RDF using established RDF/S vocabularies.

This dataset consists of 41,307,082 tweets in total, posted by 12,825,911 users, and reflects the societal discourse about COVID-19 on Twitter from October 2019 to August 2022. In total, this amounts to 676,368,018 RDF statements, which can be queried using the SPARQL endpoint described below.

More information is available in the following paper:

Dimitrov, D., Baran, E., Fafalios, P., Yu, R., Zhu, X., Zloch, M., and Dietze, S.,
TweetsCOV19 -- A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic,
29th ACM International Conference on Information & Knowledge Management (CIKM2020), Resource Track, ACM 2020.

Why

The TweetsCOV19 dataset reflects online discourse during the COVID-19 pandemic in a pre-processed fashion, following established knowledge graph principles. Thus, TweetsCOV19 represents a unique corpus for studying online discourse during the Corona pandemic together with its societal impact.

On the one hand, the dataset facilitates research in the (computational) social sciences, for instance, about information diffusion processes or the impact of (dis-)information on attitudes, solidarity, risk assessment and public opinion. On the other hand, the data may serve to evaluate and improve computational methods for tasks such as sentiment analysis, event detection, topic analysis or retweet prediction.

How

To extract the dataset from TweetsKB, we applied a seed list of 268 COVID-19-related keywords. The seed list is an extension of the seed list of Chen et al. and allows a broader view on the societal discourse on COVID-19 on Twitter.

Tweets in TweetsCOV19 contain at least one keyword from the set of seed terms, are written in English and were published within the aforementioned time period. Data cleaning and enrichment as described in TweetsKB have been applied.

Dataset

• TweetsCOV19 is available as Notation3 (N3) and tab-separated values (tsv) files through the Zenodo data repository (under a Creative Commons Attribution 4.0 license):

Dataset Part | Zenodo Link | Entity Linking | Keywords
Part 1 (Oct 2019 - April 2020) | https://zenodo.org/record/3871753 | Yahoo FEL, Wikipedia April 2020 Dump | List v1
Part 2 (May 2020 - May 2020) | https://zenodo.org/record/4593502 | Yahoo FEL, Wikipedia April 2020 Dump | List v1
Part 3 (June 2020 - Dec 2020) | https://zenodo.org/record/4593524 | Yahoo FEL, Wikipedia February 2021 Dump | List v1.1
Part 4 (Jan 2021 - Aug 2022) | https://doi.org/10.7802/2470 | Yahoo FEL, Wikipedia August 2022 Dump | List v1.1

• SPARQL endpoint containing the full dataset: SPARQL endpoint (Default Graph: http://data.gesis.org/tweetscov19)

• TSV file format: Each line contains the features of one tweet instance. Features are separated by a tab character ("\t"). The following list indicates the feature indices:

  1. Tweet Id: Long.
  2. Username: String. Encrypted for privacy reasons.
  3. Timestamp: Format ( "EEE MMM dd HH:mm:ss Z yyyy" ).
  4. #Followers: Integer.
  5. #Friends: Integer.
  6. #Retweets: Integer.
  7. #Favorites: Integer.
  8. Entities: String. For each entity, we store the original text, the annotated entity and the score produced by the FEL library in the form "original_text:annotated_entity:score;". The three parts of an entity are separated by the char ":", and entities are separated from one another by the char ";". If FEL did not find any entities, we store "null;".
  9. Sentiment: String. SentiStrength produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. The two numbers are separated by a whitespace char " ", with the positive sentiment first and the negative sentiment second (e.g. "2 -1").
  10. Mentions: String. If the tweet contains mentions, we remove the char "@" and concatenate the mentions with whitespace char " ". If no mentions appear, we store "null;".
  11. Hashtags: String. If the tweet contains hashtags, we remove the char "#" and concatenate the hashtags with whitespace char " ". If no hashtags appear, we store "null;".
  12. URLs: String. If the tweet contains URLs, we concatenate the URLs using ":-: ". If no URLs appear, we store "null;".
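
The field list above can be turned into a small parser. The following Python sketch is only an illustration and not part of the official tooling: the names FIELDS, is_null, parse_entities and parse_line are hypothetical, the sample line is made up, and the strptime pattern is our translation of "EEE MMM dd HH:mm:ss Z yyyy". It also assumes that the original text of an entity contains no ":" itself.

```python
from datetime import datetime

# Hypothetical field names, in the order of the feature indices listed above.
FIELDS = ["tweet_id", "username", "timestamp", "followers", "friends",
          "retweets", "favorites", "entities", "sentiment", "mentions",
          "hashtags", "urls"]

# Assumed Python equivalent of the pattern "EEE MMM dd HH:mm:ss Z yyyy".
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def is_null(value):
    """Return True for the "null;" placeholder (or an empty field)."""
    return value.strip() in ("", "null", "null;")


def parse_entities(value):
    """Turn "text:entity:score;..." into (text, entity, score) tuples."""
    if is_null(value):
        return []
    entities = []
    for item in value.split(";"):
        if not item.strip():
            continue  # skip the empty piece after the trailing ";"
        # Assumption: the score is the last ":"-separated part and the
        # original text contains no ":"; the entity name may contain one.
        rest, score = item.rsplit(":", 1)
        text, entity = rest.split(":", 1)
        entities.append((text, entity, float(score)))
    return entities


def parse_line(line):
    """Parse one TSV line into a dictionary of typed values."""
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
    raw = dict(zip(FIELDS, columns))
    positive, negative = (int(x) for x in raw["sentiment"].split())
    return {
        "tweet_id": int(raw["tweet_id"]),
        "username": raw["username"],
        "timestamp": datetime.strptime(raw["timestamp"], TIMESTAMP_FORMAT),
        "followers": int(raw["followers"]),
        "friends": int(raw["friends"]),
        "retweets": int(raw["retweets"]),
        "favorites": int(raw["favorites"]),
        "entities": parse_entities(raw["entities"]),
        "sentiment": (positive, negative),
        "mentions": [] if is_null(raw["mentions"]) else raw["mentions"].split(),
        "hashtags": [] if is_null(raw["hashtags"]) else raw["hashtags"].split(),
        "urls": [] if is_null(raw["urls"])
                else [u.strip() for u in raw["urls"].split(":-:") if u.strip()],
    }


# Made-up example line, for illustration only.
sample = "\t".join([
    "1234", "abc", "Tue Mar 03 12:00:00 +0000 2020", "10", "5", "2", "3",
    "corona virus:Coronavirus:-1.5;", "2 -1", "null;", "covid19 stayhome",
    "https://a.example:-: https://b.example",
])
tweet = parse_line(sample)
print(tweet["hashtags"], tweet["sentiment"])
```

Splitting the score off from the right keeps entity names that contain a colon intact; if the original text can contain colons in your data, the split would need a different rule.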

Statistics

Data model and features

The following figure illustrates the data model and features captured for single tweet instances. Features are extracted from tweets using a pipeline described in the paper listed here. Please note that entity and sentiment annotations may include noise, that is, our pipeline does not achieve perfect precision and recall. For all entity annotations, we provide a confidence score (nee:confidence) which allows you to choose a confidence threshold suitable to your use case, for instance, emphasising either precision or recall.
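
As an illustration of choosing a confidence threshold, the following Python sketch filters entity annotations by score. The function name filter_by_confidence, the example annotations and the threshold values are all made up, and the sketch assumes that a larger score means a more confident annotation.

```python
def filter_by_confidence(annotations, threshold):
    """Keep (text, entity, confidence) tuples scoring at least `threshold`."""
    return [a for a in annotations if a[2] >= threshold]


# Made-up annotations; real scores come with the dataset.
annotations = [
    ("corona", "Coronavirus", -1.2),
    ("who", "World_Health_Organization", -2.8),
    ("china", "China", -0.4),
]

# A stricter threshold favours precision, a more lenient one favours recall.
strict = filter_by_confidence(annotations, -1.0)
lenient = filter_by_confidence(annotations, -3.0)
print(len(strict), len(lenient))
```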

RDF/S Model:

 
Instantiation example:

Examples

To capture COVID-19-related discourse on Twitter, several datasets have been released for academic use, including one stream API (last entry in table). We list these here as a reference point for other researchers interested in using complementary datasets and features. To the best of our knowledge, TweetsCOV19 and TweetsKB are the only publicly available knowledge bases containing both precomputed entity and sentiment annotations together with extracted tweet metadata.

dataset | available tweet information | other annotation | dates contained | extraction method | languages | available format | number of tweets
A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration | tweet ID, date, time | top 1000 frequent terms, bigrams and trigrams | March 11, 2020 - present (smaller proportion January 1, 2020 - March 11, 2020) | tweets mentioning specific keywords (13 keywords) | all | csv and tsv files | 309,326,736
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set | tweet ID, date | - | January 21, 2020 - present, updated weekly | tweets mentioning specific keywords, and tweets from specified accounts | all | txt files | 129,911,732 (v1.9)
Coronavirus (COVID-19) Tweets Dataset | tweet ID, time frame | sentiment score of each tweet | March 20, 2020 - present, updated daily | tweets mentioning specific keywords | English | csv files | 102,650,603
Coronavirus (COVID-19) Geo-tagged Tweets Dataset | tweet ID, longitude, latitude | - | since April 28, 2020, updated daily | tweets mentioning specific keywords and containing location information | all | csv and json files | -
Coronavirus Twitter Data: A collection of COVID-19 tweets with automated annotations | tweet ID, user ID, date | inferred geolocation | February 6, 2020 - May 20, 2020, updated regularly | tweets mentioning specific keywords (15 keywords) | all | json files | -
Coronavirus Tweet Ids | tweet ID | - | March 3, 2020 - May 1, 2020 (version 5) | tweets mentioning specific keywords (3 keywords) | all | txt files | 188,026,475
GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information | tweet ID, user ID, tweet location, user location | place mentioned in tweets | Feb 1, 2020 - May 1, 2020 | 800 hashtags and keywords | multilingual (62) | tsv, json | 524,353,432
Crowdbreaks: Tracking health trends using public social media data and crowdsourcing | tweet ID, location | place mentioned in tweets | January 12, 2020 - May 20, 2020 | tweets mentioning specific keywords (5 keywords) | English | txt files | -
Large Arabic Twitter Dataset on COVID-19 | tweet ID, date | - | January 1 - April 30, 2020 | tweets mentioning specific keywords, and written in Arabic | Arabic | txt files | 4,514,136
ArCOV-19: The First Arabic COVID-19 Twitter Dataset with Propagation Networks | tweet ID, date | - | January 27, 2020 - March 31, 2020 | tweets written in Arabic and returned by Twitter Standard search API when using COVID related keywords (e.g. Corona) as queries | Arabic | plain text file | 747,599
NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset | tweet ID/micro blog ID, user ID, date | - | January 20, 2020 - March 24, 2020 | filtering based on combination (AND, OR) of keywords and language | English, Japanese, Chinese (from Weibo) | tsv files | 16,250,038 in English, 9,501,866 in Japanese, 173,869 in Chinese
Corona Virus (COVID-19) Turkish Tweets Dataset | tweet ID, time, user ID, To-User-Id (if it is sent to a user), number of retweets | - | March 9, 2020 - May 6, 2020 | tweets mentioning specific keywords (4 keywords) and written in Turkish | Turkish | csv files | 4.8 million
COVID-19 stream | full tweet objects | - | real-time | tweets mentioning specific keywords (590 hashtags and keywords by 13 May 2020) | all | streaming endpoint | -

 

About Us

Team:

- Erdal Baran
- Stefan Dietze (main contact)
- Dennis Segeth
- Dimitar Dimitrov
- Pavlos Fafalios
- Robert Jäschke
- Ran Yu
- Xiaofei Zhu
- Matthäus Zloch
L3S Research Center, University of Hannover, Germany
GESIS – Leibniz Institute for the Social Sciences, Germany

    


 
