TweetsKB is a public RDF corpus of anonymized data for a large collection of annotated tweets. The dataset currently covers more than 2.0 billion tweets, spanning more than 7 years (February 2013 - April 2020). Tweet metadata, together with extracted entities, sentiments, hashtags and user mentions, is exposed in RDF using established RDF/S vocabularies. To protect privacy, we encrypt the tweet IDs and usernames, and we do not provide the text of the tweets.

More information is available in the following paper:

P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze,
TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets,
15th Extended Semantic Web Conference (ESWC'18), Heraklion, Crete, Greece, June 3-7, 2018.
Nominated for the "Best Resource Paper" award!

TweetsCOV19 is a subset of TweetsKB that contains COVID-related tweets and reflects the societal discourse about COVID-19 on Twitter from October 2019 to April 2020.

why

  • To relieve data consumers of the computationally intensive process of extracting and processing tweets.
  • To facilitate a variety of multi-aspect data consumption, exploration and analytics scenarios. These include:
    • time-aware and entity-centric exploration of the Twitter archive
    • data integration by directly exploiting existing knowledge bases (like DBpedia)
    • entity-centric analytics and knowledge discovery by inferring multi-aspect information related to one or more entities during certain time periods (like popularity, attitude or relations with other entities)

dataset

• TweetsKB is available as Notation3 (N3) files (split by month) through the Zenodo data repository (under a Creative Commons Attribution 4.0 license):

Part 1 (Feb 2013 - Feb 2014) https://zenodo.org/record/573852
Part 2 (Mar 2014 - Nov 2014) https://zenodo.org/record/577572
Part 3 (Dec 2014 - Dec 2015) https://zenodo.org/record/579597
Part 4 (Jan 2016 - Nov 2016) https://zenodo.org/record/579601
Part 5 (Dec 2016 - Oct 2017) https://zenodo.org/record/1095592
Part 6 (Nov 2017 - Mar 2018) https://zenodo.org/record/1808741
Part 7 (Apr 2018 - Apr 2019) https://zenodo.org/record/3828929
Part 8 (May 2019 - Apr 2020) https://zenodo.org/record/3828949

• A SPARQL endpoint containing a 5% sample of the dataset is also available (Default Graph: http://data.gesis.org/tweetskb)

statistics

data model and features

The following figure illustrates the data model and the features captured for a single tweet instance. Features are extracted from tweets using the pipeline described in the paper listed above. Please note that entity and sentiment annotations may be noisy: our pipeline does not achieve perfect precision or recall. For every entity annotation, we provide a confidence score (nee:confidence), which allows you to choose a confidence threshold suitable for your use case, for instance to favour precision over recall or vice versa.
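
As a rough illustration of the confidence threshold mentioned above, here is a minimal Python sketch. It assumes the annotations have already been read into (entity URI, confidence) pairs and that a higher score means a more confident annotation. The record layout, the function name and all sample scores are invented for illustration; only nee:confidence itself comes from the data model, and the score range in the actual data may differ.

```python
from typing import Iterable, List, Tuple

# Hypothetical record layout: (matched entity URI, nee:confidence value).
Annotation = Tuple[str, float]


def filter_by_confidence(annotations: Iterable[Annotation],
                         threshold: float) -> List[Annotation]:
    """Keep only annotations whose confidence is at least `threshold`.

    Assumes that a higher score means a more confident annotation.
    """
    return [(uri, score) for uri, score in annotations if score >= threshold]


# Invented sample scores, for illustration only.
sample = [
    ("http://dbpedia.org/resource/Refugee", 0.9),
    ("http://dbpedia.org/resource/Barack_Obama", 0.5),
    ("http://dbpedia.org/resource/Greece", 0.2),
]

# A strict threshold favours precision; a loose one favours recall.
strict = filter_by_confidence(sample, 0.8)
loose = filter_by_confidence(sample, 0.1)
```

With a strict threshold, fewer but more reliable annotations survive; with a loose one, more annotations are kept at the cost of more noise.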

RDF/S Model:

Instantiation example:

example queries

  • The following query requests the number of tweets per month mentioning Alexis Tsipras (Greek prime minister) in 2015.

    The result of this query shows that the number of tweets increased significantly in June and July, most likely because of the Greek bailout referendum held in July 2015, which followed the bank holiday and capital controls of June 2015.
  • The following query retrieves the top-5 entities co-occurring with Barack Obama in tweets of summer 2016.
  • The following query requests the top-10 hashtags co-occurring with the entity Refugee (http://dbpedia.org/resource/Refugee) in 2016.
  • The following query lists the top entities co-occurring with the entity Donald Trump in tweets of April 2020 (replace the entity in the query's FILTER with any other entity from DBpedia).
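
The first query in the list above (tweets per month mentioning an entity in a given year) might be sketched as follows in Python. This is an illustration under stated assumptions, not the original query: the prefixes and the property names (sioc:Post, dc:created, schema:mentions, nee:hasMatchedURI) are assumptions that should be checked against the RDF/S model, the DBpedia URI for Alexis Tsipras is assumed, and the function names are hypothetical. The endpoint URL is deliberately left as a parameter because it is not given here.

```python
import json
import urllib.parse
import urllib.request

# Default graph named on this page.
DEFAULT_GRAPH = "http://data.gesis.org/tweetskb"

# Assumed prefixes; verify them against the RDF/S model before use.
PREFIXES = "\n".join([
    "PREFIX sioc: <http://rdfs.org/sioc/ns#>",
    "PREFIX dc: <http://purl.org/dc/terms/>",
    "PREFIX schema: <http://schema.org/>",
    "PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#>",
])


def monthly_mention_query(entity_uri: str, year: int) -> str:
    """Build a query counting tweets per month that mention `entity_uri` in `year`."""
    body = f"""
SELECT ?month (COUNT(DISTINCT ?tweet) AS ?numOfTweets) WHERE {{
  ?tweet a sioc:Post ;
         dc:created ?date ;
         schema:mentions ?entity .
  ?entity nee:hasMatchedURI <{entity_uri}> .
  FILTER(YEAR(?date) = {int(year)})
}}
GROUP BY (MONTH(?date) AS ?month)
ORDER BY ?month
"""
    return PREFIXES + body


def run_query(endpoint_url: str, query: str,
              default_graph: str = DEFAULT_GRAPH) -> dict:
    """Send the query via an HTTP GET request and return the parsed JSON results."""
    params = urllib.parse.urlencode(
        {"query": query, "default-graph-uri": default_graph}
    )
    request = urllib.request.Request(
        f"{endpoint_url}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Assumed DBpedia URI for Alexis Tsipras.
query = monthly_mention_query("http://dbpedia.org/resource/Alexis_Tsipras", 2015)
```

Calling run_query with the address of the SPARQL endpoint and the query string above would return the per-month counts, provided the endpoint supports JSON results. The other queries in the list follow the same pattern with a different graph pattern, grouping and LIMIT.
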
about

This work was partially funded by the European Commission through the ERC Advanced Grant ALEXANDRIA (grant No. 339233) and the H2020 grant No. 687916 (AFEL project).

imprint

You can find the imprint with provider and legal notices here