What

TweetsCOV19 is a semantically annotated corpus of tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims to capture online discourse about various aspects of the pandemic and its societal impact. Tweet metadata as well as extracted entities, sentiments, hashtags and user mentions are exposed in RDF using established RDF/S vocabularies.

This dataset consists of 41,307,082 tweets in total, posted by 12,825,911 users, and reflects the societal discourse about COVID-19 on Twitter from October 2019 to August 2022. In total, this amounts to 676,368,018 RDF statements, which can be queried using the SPARQL endpoint described below.

More information is available in the following paper:

Dimitrov, D., Baran, E., Fafalios, P., Yu, R., Zhu, X., Zloch, M., and Dietze, S.,
TweetsCOV19 -- A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic,
29th ACM International Conference on Information & Knowledge Management (CIKM2020), Resource Track, ACM 2020.

Why

The TweetsCOV19 dataset provides online discourse during the COVID-19 pandemic in pre-processed form, following established knowledge graph principles. Thus, TweetsCOV19 represents a unique corpus for studying online discourse during the pandemic together with its societal impact.

On the one hand, the dataset facilitates research in the (computational) social sciences, for instance, about information diffusion processes or the impact of (dis-)information on attitudes, solidarity, risk assessment and public opinion. On the other hand, the data may serve to evaluate and improve computational methods for tasks such as sentiment analysis, event detection, topic analysis or retweet prediction.

How

To extract the dataset from TweetsKB, we applied a seed list of 268 COVID-19-related keywords. The list extends the seed list of Chen et al. and allows a broader view of the societal discourse about COVID-19 on Twitter.

Tweets in TweetsCOV19 contain at least one keyword from the set of seed terms, are written in English and were published during the aforementioned time period. Data cleaning and enrichment as described in TweetsKB have been applied.
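
The selection criteria above can be sketched roughly as follows. This is only an illustration under assumptions: the exact matching procedure is not described here, the record fields ("text", "lang", "created") are hypothetical, and the two seed terms are placeholders rather than the actual 268-term list.

```python
import re
from datetime import date


def build_keyword_matcher(seed_terms):
    """Return a predicate that is True if a text contains any seed term."""
    # Longer terms first so multi-word terms are tried before shorter ones.
    ordered = sorted(seed_terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # Require non-word characters around the term; "#covid19" still matches.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return lambda text: pattern.search(text) is not None


def select_tweets(tweets, seed_terms, start, end):
    """Keep English tweets dated within [start, end] that contain a seed term."""
    has_keyword = build_keyword_matcher(seed_terms)
    return [
        t for t in tweets
        if t["lang"] == "en"
        and start <= t["created"] <= end
        and has_keyword(t["text"])
    ]


# Hypothetical records and seed terms, for illustration only.
example_tweets = [
    {"text": "Stay safe during #COVID19", "lang": "en", "created": date(2020, 3, 1)},
    {"text": "Nice weather today", "lang": "en", "created": date(2020, 3, 1)},
    {"text": "covid19 ist ueberall", "lang": "de", "created": date(2020, 3, 1)},
    {"text": "coronavirus news", "lang": "en", "created": date(2019, 1, 1)},
]
selected = select_tweets(
    example_tweets, ["covid19", "coronavirus"], date(2019, 10, 1), date(2022, 8, 31)
)
print([t["text"] for t in selected])
```

Only the first record passes all three checks; the others fail on keyword, language and date respectively.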

Dataset

• TweetsCOV19 is available as Notation3 (N3) and tab-separated values (tsv) files through the Zenodo data repository (under a Creative Commons Attribution 4.0 license):

Dataset Part | Zenodo Link | Entity Linking | Keywords
Part 1 (Oct 2019 - April 2020) | https://zenodo.org/record/3871753 | Yahoo FEL, Wikipedia April 2020 Dump | List v1
Part 2 (May 2020) | https://zenodo.org/record/4593502 | Yahoo FEL, Wikipedia April 2020 Dump | List v1
Part 3 (June 2020 - Dec 2020) | https://zenodo.org/record/4593524 | Yahoo FEL, Wikipedia February 2021 Dump | List v1.1
Part 4 (Jan 2021 - Aug 2022) | https://doi.org/10.7802/2470 | Yahoo FEL, Wikipedia August 2022 Dump | List v1.1

• SPARQL endpoint containing the full dataset: SPARQL endpoint (Default Graph: http://data.gesis.org/tweetscov19)

• TSV file format: Each line contains the features of one tweet instance. Features are separated by a tab character ("\t"). The following list indicates the feature indices:

  1. Tweet Id: Long.
  2. Username: String. Encrypted for privacy reasons.
  3. Timestamp: Format ( "EEE MMM dd HH:mm:ss Z yyyy" ).
  4. #Followers: Integer.
  5. #Friends: Integer.
  6. #Retweets: Integer.
  7. #Favorites: Integer.
  8. Entities: String. For each entity, we aggregated the original text, the annotated entity and the score produced by the FEL library. Entities are separated from one another by the char ";", and the three parts of each entity are separated by the char ":", giving "original_text:annotated_entity:score;". If FEL did not find any entities, we have stored "null;".
  9. Sentiment: String. SentiStrength produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. We separated these two numbers with the whitespace char " ". The positive sentiment is stored first, followed by the negative sentiment (e.g. "2 -1").
  10. Mentions: String. If the tweet contains mentions, we remove the char "@" and concatenate the mentions with whitespace char " ". If no mentions appear, we have stored "null;".
  11. Hashtags: String. If the tweet contains hashtags, we remove the char "#" and concatenate the hashtags with whitespace char " ". If no hashtags appear, we have stored "null;".
  12. URLs: String. If the tweet contains URLs, we concatenate the URLs using ":-: ". If no URLs appear, we have stored "null;".
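
A minimal sketch of a parser for one TSV line, following the twelve fields listed above. The function and key names are our own, the timestamp pattern is a Python translation of "EEE MMM dd HH:mm:ss Z yyyy", and splitting entities from the right assumes that the annotated entity name itself contains no ":" character, which may not hold for every entity.

```python
from datetime import datetime

# Python equivalent of the Java-style pattern "EEE MMM dd HH:mm:ss Z yyyy".
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def split_or_empty(field, sep=None):
    """Split a field on `sep` (whitespace if None); "null;" yields an empty list."""
    field = field.strip()
    if field in ("", "null", "null;"):
        return []
    return [part.strip() for part in field.split(sep) if part.strip()]


def parse_entities(field):
    """Parse "original_text:annotated_entity:score;" chunks into tuples."""
    entities = []
    for chunk in split_or_empty(field, ";"):
        # Split from the right so a ":" inside the original text is kept.
        parts = chunk.rsplit(":", 2)
        if len(parts) != 3:
            continue  # skip chunks that do not have three parts
        text, entity, score = parts
        entities.append((text, entity, float(score)))
    return entities


def parse_line(line):
    """Turn one tab-separated line into a dictionary of typed features."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 12:
        raise ValueError("expected 12 tab-separated fields, got %d" % len(cols))
    positive, negative = cols[8].split()
    return {
        "tweet_id": int(cols[0]),
        "username": cols[1],
        "timestamp": datetime.strptime(cols[2], TIMESTAMP_FORMAT),
        "followers": int(cols[3]),
        "friends": int(cols[4]),
        "retweets": int(cols[5]),
        "favorites": int(cols[6]),
        "entities": parse_entities(cols[7]),
        "sentiment": (int(positive), int(negative)),
        "mentions": split_or_empty(cols[9]),
        "hashtags": split_or_empty(cols[10]),
        "urls": split_or_empty(cols[11], ":-:"),
    }
```

Applied to a file, this would typically be used as `records = [parse_line(l) for l in open(path, encoding="utf-8")]`, where `path` points to one of the downloaded TSV files.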

Statistics

  • Overview of the TweetsCOV19 dataset (as of August 2022):
    Feature | Total | Unique | Ratio of tweets with at least one feature
    Hashtags | 18,734,035 | 1,775,379 | 0.28
    Mentions | 28,139,872 | 5,124,474 | 0.44
    Entities | 59,378,032 | 700,136 | 0.70
    Non-neutral sentiment | 24,528,721 | - | 0.59

  • Selected hashtag usage over time (as of April 2020):

    The figure shows a comparison of hashtag popularity over time for (a) the two most popular hashtags #coronavirus and #covid19, and for (b) #hydroxychloroquine vs. #vaccine.


  • Sentiment of selected pay-level domain (PLD) mentions and user mentions over time (as of April 2020):

    The figure shows the sentiment of tweets mentioning Donald Trump and WHO (the upper two panels), and of tweets containing URLs to Breitbart, a news outlet associated with the far right, and CNN, a news outlet associated with the left (the lower two panels).


  • Entities over time (as of April 2020):

    The table shows the top five entities (confidence level -2) and their frequency per month in the TweetsCOV19 dataset since the beginning of 2020.

    Entity (January 2020) | Frequency | Entity (February 2020) | Frequency
    Wuhan | 10,147 | Wuhan | 10,494
    Iran | 5,905 | Coronavirus_disease_2019 | 4,999
    BTS | 5,014 | BTS | 4,513
    What's_Happening!! | 4,899 | What's_Happening!! | 3,431
    Twitter | 4,105 | Twitter | 3,351

    Entity (March 2020) | Frequency | Entity (April 2020) | Frequency
    Coronavirus_disease_2019 | 178,396 | Coronavirus_disease_2019 | 200,342
    Social_distancing | 66,176 | Social_distancing | 52,323
    Italy | 22,164 | India | 18,992
    Wuhan | 16,804 | Hydroxychloroquine | 15,820
    India | 15,822 | Wuhan | 14,478

  • Geolocation (as of April 2020 hydrated data):

    The following map shows the approximate geolocation distribution of TweetsCOV19 tweets at a global scale. We also provide country-level geolocation for the USA, UK, Germany, France, Italy, Spain and India. Maps are created from geotags obtained through the pre-trained DeepGeo neural network.


  • Monthly Tweet Distribution and Hydration (as of October 2020 hydrated data):

    The figure shows the monthly distribution of TweetsCOV19 tweets before (original dataset) and after hydration. Of the 8,151,524 tweet IDs in TweetsCOV19, the hydrator was able to obtain the raw tweet data for 6,856,135 IDs.

Data model and features

The following figure illustrates the data model and the features captured for single tweet instances. Features are extracted from tweets using a pipeline described in the paper listed above. Please note that entity and sentiment annotations may include noise, that is, our pipeline does not achieve perfect precision and recall. For all entity annotations, we provide a confidence score (nee:confidence), which allows you to choose a confidence threshold suitable for your use case, for instance, emphasising either precision or recall.
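
The effect of a confidence threshold can be illustrated with a short sketch. It assumes that entity annotations are available as (original_text, annotated_entity, score) tuples and that higher (less negative) scores indicate more confident annotations; both the representation and the example scores are assumptions made for illustration and should be checked against the data.

```python
def filter_by_confidence(entities, threshold):
    """Keep annotations whose confidence score is at least `threshold`."""
    return [entity for entity in entities if entity[2] >= threshold]


# Hypothetical annotations: (original_text, annotated_entity, score).
annotations = [
    ("trump", "Donald_Trump", -1.2),
    ("who", "World_Health_Organization", -2.6),
    ("bts", "BTS", -1.9),
]

# A stricter threshold favours precision, a looser one favours recall.
strict = filter_by_confidence(annotations, -1.5)
lenient = filter_by_confidence(annotations, -3.0)
print(len(strict), len(lenient))
```

With these example scores, the stricter threshold keeps one annotation and the looser one keeps all three.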

RDF/S Model:

 
Instantiation example:

Examples

  • The following query gives you the top entities co-occurring with the entity Donald Trump (US president) during April 2020.
     



  • The following query lists the number of tweets per day in April 2020 mentioning the entity Hydroxychloroquine.
     



  • This query lists the top URLs mentioned in tweets of 6 April 2020 together with the entity Hydroxychloroquine.
     



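
As a rough offline analogue of the second example, the number of tweets per day mentioning a given entity can also be computed from the TSV files rather than the SPARQL endpoint. The sketch below assumes the twelve-column TSV layout described in the Dataset section; the function name and the sample rows are hypothetical, and counts obtained this way need not match the endpoint exactly.

```python
from collections import Counter
from datetime import date, datetime

TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_day(lines, entity, year, month):
    """Count tweets per calendar day whose entity field contains `entity`."""
    counts = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 12:
            continue  # skip lines that do not have the expected layout
        created = datetime.strptime(cols[2], TIMESTAMP_FORMAT)
        if (created.year, created.month) != (year, month):
            continue
        annotated = set()
        for chunk in cols[7].split(";"):
            parts = chunk.rsplit(":", 2)  # original_text:annotated_entity:score
            if len(parts) == 3:
                annotated.add(parts[1])
        if entity in annotated:
            counts[created.date()] += 1
    return counts


def make_row(timestamp, entities):
    """Build a hypothetical TSV line with placeholder values."""
    return "\t".join(
        ["1", "user", timestamp, "0", "0", "0", "0", entities, "1 -1", "null;", "null;", "null;"]
    )


sample = [
    make_row("Wed Apr 01 10:00:00 +0000 2020", "hcq:Hydroxychloroquine:-1.0;"),
    make_row("Wed Apr 01 18:00:00 +0000 2020", "hcq:Hydroxychloroquine:-1.5;trump:Donald_Trump:-1.1;"),
    make_row("Thu Apr 02 09:00:00 +0000 2020", "null;"),
    make_row("Fri May 01 09:00:00 +0000 2020", "hcq:Hydroxychloroquine:-1.0;"),
]
daily = tweets_per_day(sample, "Hydroxychloroquine", 2020, 4)
print(sorted(daily.items()))
```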

To capture COVID-19-related discourse on Twitter, several datasets have been released for academic use, including one stream API (last entry in table). We list these here as a reference point for other researchers interested in using complementary datasets and features. To the best of our knowledge, TweetsCOV19 and TweetsKB are the only publicly available knowledge bases containing both precomputed entity and sentiment annotations together with extracted tweet metadata.

Dataset | Available tweet information | Other annotation | Dates contained | Extraction method | Languages | Available format | Number of tweets
A large-scale COVID-19 Twitter chatter dataset for open scientific research - an international collaboration | tweet ID, date, time | top 1000 frequent terms, bigrams and trigrams | March 11, 2020 - present (smaller proportion January 1, 2020 - March 11, 2020) | tweets mentioning specific keywords (13 keywords) | all | csv and tsv files | 309,326,736
Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set | tweet ID, date | - | January 21, 2020 - present, updated weekly | tweets mentioning specific keywords, and tweets from specified accounts | all | txt files | 129,911,732 (v1.9)
Coronavirus (COVID-19) Tweets Dataset | tweet ID, time frame | sentiment score of each tweet | March 20, 2020 - present, updated daily | tweets mentioning specific keywords | English | csv files | 102,650,603
Coronavirus (COVID-19) Geo-tagged Tweets Dataset | tweet ID, longitude, latitude | - | since April 28, 2020, updated daily | tweets mentioning specific keywords and containing location information | all | csv and json files | -
Coronavirus Twitter Data: A collection of COVID-19 tweets with automated annotations | tweet ID, user ID, date | inferred geolocation | February 6, 2020 - May 20, 2020, updated regularly | tweets mentioning specific keywords (15 keywords) | all | json files | -
Coronavirus Tweet Ids | tweet ID | - | March 3, 2020 - May 1, 2020 (version 5) | tweets mentioning specific keywords (3 keywords) | all | txt files | 188,026,475
GeoCoV19: A Dataset of Hundreds of Millions of Multilingual COVID-19 Tweets with Location Information | tweet ID, user ID, tweet location, user location | place mentioned in tweets | Feb 1, 2020 - May 1, 2020 | 800 hashtags and keywords | multilingual (62) | tsv, json | 524,353,432
Crowdbreaks: Tracking health trends using public social media data and crowdsourcing | tweet ID, location | place mentioned in tweets | January 12, 2020 - May 20, 2020 | tweets mentioning specific keywords (5 keywords) | English | txt files | -
Large Arabic Twitter Dataset on COVID-19 | tweet ID, date | - | January 1 - April 30, 2020 | tweets mentioning specific keywords, and written in Arabic | Arabic | txt files | 4,514,136
ArCOV-19: The First Arabic COVID-19 Twitter Dataset with Propagation Networks | tweet ID, date | - | January 27, 2020 - March 31, 2020 | tweets written in Arabic and returned by Twitter Standard search API when using COVID related keywords (e.g. Corona) as queries | Arabic | plain text file | 747,599
NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset | tweet ID/micro blog ID, user ID, date | - | January 20, 2020 - March 24, 2020 | filtering based on combination (AND, OR) of keywords and language | English, Japanese, Chinese (from Weibo) | tsv files | 16,250,038 in English, 9,501,866 in Japanese, 173,869 in Chinese
Corona Virus (COVID-19) Turkish Tweets Dataset | tweet ID, time, user ID, To-User-Id (if it is sent to a user), number of retweets | - | March 9, 2020 - May 6, 2020 | tweets mentioning specific keywords (4 keywords) and written in Turkish | Turkish | csv files | 4.8 million
COVID-19 stream | full tweet objects | - | real-time | tweets mentioning specific keywords (590 hashtags and keywords by 13 May 2020) | all | streaming endpoint | -

 

About Us