What

TweetsKB is a public RDF corpus of anonymized data for a large collection of annotated tweets. The dataset currently contains data for nearly 3.1 billion tweets, spanning more than 10 years up to June 2023, when access to the Twitter API was closed. Tweet metadata as well as extracted entities, sentiments, hashtags and user mentions are exposed in RDF using established RDF/S vocabularies. To protect privacy, we encrypt the usernames and do not provide the text of the tweets. However, the tweet IDs can be used to retrieve the original tweet text.

More information is available in the following paper:

P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze,
TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets,
15th Extended Semantic Web Conference (ESWC'18), Heraklion, Crete, Greece, June 3-7, 2018.
Nominated for the "Best Resource Paper" award!

TweetsCOV19 is a subset of TweetsKB that contains COVID-related tweets and reflects the societal discourse about COVID-19 on Twitter from October 2019 to April 2020.

Why

  • To relieve data consumers of the computationally intensive process of extracting and processing tweets.
  • To facilitate a variety of multi-aspect data consumption, exploration and analytics scenarios. These include:
    • time-aware and entity-centric exploration of the Twitter archive
    • data integration by directly exploiting existing knowledge bases (like DBpedia)
    • entity-centric analytics and knowledge discovery by inferring multi-aspect information related to one or more entities during certain time periods (like popularity, attitude or relations with other entities)

Dataset

TweetsKB is available as Notation3 (N3) files (split by month) through the Zenodo data repository (under a Creative Commons Attribution 4.0 license) and GESIS Search:

Dataset Part | Link (Zenodo or GESIS) | Entity Linking
Part 1 (Feb 2013 - Feb 2014) | https://zenodo.org/record/573852 | Yahoo FEL with Wikipedia Dump from November 2015
Part 2 (Mar 2014 - Nov 2014) | https://zenodo.org/record/577572 | Yahoo FEL with Wikipedia Dump from November 2015
Part 3 (Dec 2014 - Dec 2015) | https://zenodo.org/record/579597 | Yahoo FEL with Wikipedia Dump from November 2015
Part 4 (Jan 2016 - Nov 2016) | https://zenodo.org/record/579601 | Yahoo FEL with Wikipedia Dump from November 2015
Part 5 (Dec 2016 - Oct 2017) | https://zenodo.org/record/1095592 | Yahoo FEL with Wikipedia Dump from November 2015
Part 6 (Nov 2017 - Mar 2018) | https://zenodo.org/record/1808741 | Yahoo FEL with Wikipedia Dump from November 2015
Part 7 (Apr 2018 - Apr 2019) | https://zenodo.org/record/3828929 | Yahoo FEL with Wikipedia Dump from April 2020
Part 8 (May 2019 - Apr 2020) | https://zenodo.org/record/3828949 | Yahoo FEL with Wikipedia Dump from April 2020
Part 9 (May 2020 - Dec 2020) | https://zenodo.org/record/4420178 | Yahoo FEL with Wikipedia Dump from April 2020
Part 10 (Jan 2021 - Dec 2021) | https://doi.org/10.7802/2472 | Yahoo FEL with Wikipedia Dump from August 2022
Part 11 (Jan 2022 - Aug 2022) | https://doi.org/10.7802/2473 | Yahoo FEL with Wikipedia Dump from August 2022
Part 12 (Sep 2022 - Jun 2023) | https://doi.org/10.7802/2781 | Yahoo FEL with Wikipedia Dump from June 2023

• A SPARQL endpoint containing a 5% sample of the dataset is also available (default graph: http://data.gesis.org/tweetskb).
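As a rough illustration of how the sample could be queried programmatically, the sketch below builds a GET request URL using the standard SPARQL protocol parameters `query` and `default-graph-uri`. It assumes the endpoint accepts plain SPARQL-protocol GET requests. `ENDPOINT_URL` is a placeholder (the actual endpoint address is not given in this text), and the helper name `build_request_url` is hypothetical.

```python
from urllib.parse import urlencode

# Placeholder only: substitute the actual TweetsKB endpoint address.
ENDPOINT_URL = "https://example.org/sparql"
# Default graph named on this page.
DEFAULT_GRAPH = "http://data.gesis.org/tweetskb"


def build_request_url(query: str,
                      endpoint: str = ENDPOINT_URL,
                      default_graph: str = DEFAULT_GRAPH) -> str:
    """Return a SPARQL-protocol GET URL that runs `query` on `endpoint`."""
    params = {"query": query, "default-graph-uri": default_graph}
    # urlencode percent-encodes the query text so it is safe inside a URL.
    return endpoint + "?" + urlencode(params)


# A small probe query that fetches a handful of triples.
PROBE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
url = build_request_url(PROBE_QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client. Building the URL separately from sending it makes it easy to inspect the request before running it against the live endpoint.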

Statistics

Data model and features

The following figure illustrates the data model and the features captured for a single tweet instance. Features are extracted from tweets using the pipeline described in the paper listed above. Please note that entity and sentiment annotations may be noisy, since our pipeline does not achieve perfect precision and recall. For every entity annotation we provide a confidence score (nee:confidence), which lets you choose a confidence threshold suited to your use case, for instance to emphasise either precision or recall.
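The effect of a confidence threshold can be sketched in a few lines of Python. This is only an illustration: the record type `EntityAnnotation`, its field names, and the sample scores are made up, and it assumes that a higher nee:confidence value means a more reliable annotation. The actual score scale should be checked against the data.

```python
from typing import List, NamedTuple


class EntityAnnotation(NamedTuple):
    """One entity annotation; field names are illustrative, not the RDF schema."""
    tweet_id: str
    entity_uri: str
    confidence: float  # value taken from nee:confidence


def filter_by_confidence(annotations: List[EntityAnnotation],
                         threshold: float) -> List[EntityAnnotation]:
    """Keep only annotations whose confidence is at least `threshold`."""
    return [a for a in annotations if a.confidence >= threshold]


# Made-up annotations; the score scale here is an assumption.
sample = [
    EntityAnnotation("t1", "http://dbpedia.org/resource/Refugee", 0.9),
    EntityAnnotation("t2", "http://dbpedia.org/resource/Refugee", 0.4),
    EntityAnnotation("t3", "http://dbpedia.org/resource/Barack_Obama", 0.7),
]

strict = filter_by_confidence(sample, 0.8)   # fewer, more reliable annotations
lenient = filter_by_confidence(sample, 0.3)  # more annotations, more noise
print(len(strict), len(lenient))
```

A stricter threshold favours precision at the cost of recall, and a looser one does the opposite.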

RDF/S Model:

 
Instantiation example:

Examples

  • The following query requests the number of tweets per month mentioning Alexis Tsipras (Greek prime minister at the time) in 2015.
  • The result of this query shows that the number of tweets increased significantly in June and July, most likely because of the Greek bailout referendum held in July 2015, following the bank holiday and capital controls of June 2015.
  • The following query retrieves the top-5 entities co-occurring with Barack Obama in tweets of summer 2016.
  • The following query requests the top-10 hashtags co-occurring with the entity Refugee (http://dbpedia.org/resource/Refugee) in 2016.
  • The following query lists the top entities co-occurring with the entity Donald Trump in tweets of April 2020 (replace the entity in the FILTER with any other entity from DBpedia).
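The query texts themselves are not reproduced here, so the following Python sketch shows how queries of this kind might be assembled. It is an approximation, not the original queries. The predicates `dc:created`, `schema:mentions` and `nee:hasMatchedURI`, the prefix URIs, and the assumption that `dc:created` holds an xsd:dateTime value are all assumptions about the data model and should be checked against the RDF/S model. The function names are hypothetical.

```python
# Prefix declarations; the namespace URIs are assumptions, not taken from this page.
PREFIXES = (
    "PREFIX dc: <http://purl.org/dc/terms/>\n"
    "PREFIX schema: <http://schema.org/>\n"
    "PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#>\n"
)


def tweets_per_month_query(entity_uri: str, year: int) -> str:
    """Build a query counting tweets per month of `year` that mention `entity_uri`."""
    return PREFIXES + f"""\
SELECT ?month (COUNT(DISTINCT ?tweet) AS ?numTweets) WHERE {{
  ?tweet dc:created ?date ;
         schema:mentions ?mention .
  ?mention nee:hasMatchedURI <{entity_uri}> .
  FILTER(YEAR(?date) = {int(year)})
  BIND(MONTH(?date) AS ?month)
}}
GROUP BY ?month
ORDER BY ?month
"""


def cooccurring_entities_query(entity_uri: str, year: int,
                               first_month: int, last_month: int,
                               limit: int = 5) -> str:
    """Build a query for the entities most often mentioned alongside `entity_uri`."""
    return PREFIXES + f"""\
SELECT ?other (COUNT(DISTINCT ?tweet) AS ?numTweets) WHERE {{
  ?tweet dc:created ?date ;
         schema:mentions ?m1, ?m2 .
  ?m1 nee:hasMatchedURI <{entity_uri}> .
  ?m2 nee:hasMatchedURI ?other .
  FILTER(?other != <{entity_uri}>)
  FILTER(YEAR(?date) = {int(year)} && MONTH(?date) >= {int(first_month)} && MONTH(?date) <= {int(last_month)})
}}
GROUP BY ?other
ORDER BY DESC(?numTweets)
LIMIT {int(limit)}
"""


# Monthly tweet counts for Alexis Tsipras in 2015.
print(tweets_per_month_query("http://dbpedia.org/resource/Alexis_Tsipras", 2015))
# Top-5 entities co-occurring with Barack Obama in June to August 2016.
print(cooccurring_entities_query("http://dbpedia.org/resource/Barack_Obama", 2016, 6, 8))
```

Swapping the DBpedia URI or the date bounds adapts either query to another entity or time period. A hashtag variant would follow the same pattern with the hashtag predicate from the data model in place of the second entity mention.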

Next Step

We plan to publish further subsets of TweetsKB as separate datasets.

About Us


 
This work was partially funded by the European Commission through the ERC Advanced Grant ALEXANDRIA (grant No. 339233) and the H2020 grant No. 687916 (AFEL project).
