
TweetsKB

What

TweetsKB is a public RDF corpus of anonymized data for a large collection of annotated tweets. The dataset currently contains data for nearly 3.0 billion tweets, spanning more than 9 years (February 2013 - August 2022). Tweet metadata as well as extracted entities, sentiments, hashtags, and user mentions are exposed in RDF using established RDF/S vocabularies. For the sake of privacy, we encrypt the usernames and do not provide the text of the tweets. However, the tweet IDs can be used to retrieve the original tweet text.

More information is available in the following paper:

P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze,
TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets,
15th Extended Semantic Web Conference (ESWC'18), Heraklion, Crete, Greece, June 3-7, 2018.
Nominated for the "Best Resource Paper" award!

TweetsCOV19 is a subset of TweetsKB that contains COVID-related tweets and reflects the societal discourse about COVID-19 on Twitter from October 2019 until April 2020.

Why

Dataset

TweetsKB is available as Notation3 (N3) files (split by month) through the Zenodo data repository (under a Creative Commons Attribution 4.0 license):

Dataset Part | Zenodo Link | Entity Linking
Part 1 (Feb 2013 - Feb 2014) | https://zenodo.org/record/573852 | Yahoo FEL with Wikipedia Dump from November 2015
Part 2 (Mar 2014 - Nov 2014) | https://zenodo.org/record/577572 | Yahoo FEL with Wikipedia Dump from November 2015
Part 3 (Dec 2014 - Dec 2015) | https://zenodo.org/record/579597 | Yahoo FEL with Wikipedia Dump from November 2015
Part 4 (Jan 2016 - Nov 2016) | https://zenodo.org/record/579601 | Yahoo FEL with Wikipedia Dump from November 2015
Part 5 (Dec 2016 - Oct 2017) | https://zenodo.org/record/1095592 | Yahoo FEL with Wikipedia Dump from November 2015
Part 6 (Nov 2017 - Mar 2018) | https://zenodo.org/record/1808741 | Yahoo FEL with Wikipedia Dump from November 2015
Part 7 (Apr 2018 - Apr 2019) | https://zenodo.org/record/3828929 | Yahoo FEL with Wikipedia Dump from April 2020
Part 8 (May 2019 - Apr 2020) | https://zenodo.org/record/3828949 | Yahoo FEL with Wikipedia Dump from April 2020
Part 9 (May 2020 - Dec 2020) | https://zenodo.org/record/4420178 | Yahoo FEL with Wikipedia Dump from April 2020
Part 10 (Jan 2021 - Dec 2021) | https://doi.org/10.7802/2472 | Yahoo FEL with Wikipedia Dump from August 2022
Part 11 (Jan 2022 - Aug 2022) | https://doi.org/10.7802/2473 | Yahoo FEL with Wikipedia Dump from August 2022

• A SPARQL endpoint containing a 5% sample of the dataset is also available (Default Graph: http://data.gesis.org/tweetskb)
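
As a rough illustration of how the sample endpoint might be queried, the Python sketch below builds a SPARQL query and a request URL following the standard SPARQL protocol parameters (query, default-graph-uri). The endpoint address is a placeholder because the actual address is not reproduced on this page. The schema:mentions property and the nee: namespace IRI are assumptions about the vocabulary and should be checked against the data model; only nee:confidence and the default graph are taken from this page. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Default graph as stated on this page.
DEFAULT_GRAPH = "http://data.gesis.org/tweetskb"

# Placeholder only: replace with the real endpoint address.
PLACEHOLDER_ENDPOINT = "https://example.org/sparql"


def build_query(min_confidence: float, limit: int = 10) -> str:
    """Build a query for entity annotations at or above a confidence value.

    Assumptions: schema:mentions links a tweet to an entity annotation,
    and the nee: prefix resolves to the IRI below.
    """
    return (
        "PREFIX schema: <http://schema.org/>\n"
        "PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#>\n"
        "SELECT ?tweet ?entity ?confidence\n"
        "WHERE {\n"
        "  ?tweet schema:mentions ?entity .\n"
        "  ?entity nee:confidence ?confidence .\n"
        f"  FILTER(?confidence >= {min_confidence})\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a SPARQL protocol GET request URL."""
    params = {"default-graph-uri": DEFAULT_GRAPH, "query": query}
    return f"{endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # The threshold value is arbitrary; the score scale is not stated here.
    url = build_request_url(PLACEHOLDER_ENDPOINT, build_query(-2.0, limit=5))
    print(url)
```

The resulting URL can be fetched with any HTTP client. Since the endpoint holds only a 5% sample, counts obtained this way do not reflect the full dataset.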

Statistics

Data model and features

The following figure illustrates the data model and features captured for single tweet instances. Features are extracted from tweets using the pipeline described in the paper listed above. Please note that entity and sentiment annotations may include noise because our pipeline does not achieve perfect precision and recall. For all entity annotations, we provide a confidence score (nee:confidence), which allows you to choose a confidence threshold suitable to your use case, for instance, emphasising either precision or recall.
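
The effect of a confidence threshold can be sketched in a few lines of Python. Everything in this example is hypothetical: the EntityAnnotation record, the tweet IDs, the entity URIs, and the score values are made up for illustration, and the sketch assumes that a higher score means a more confident annotation. Check the actual score range in the data before choosing a threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EntityAnnotation:
    """Hypothetical in-memory record for one entity annotation."""
    tweet_id: str
    surface_form: str
    entity_uri: str
    confidence: float  # value of nee:confidence (assumed: higher is better)


def filter_by_confidence(
    annotations: List[EntityAnnotation], threshold: float
) -> List[EntityAnnotation]:
    """Keep only annotations whose confidence is at or above the threshold."""
    return [a for a in annotations if a.confidence >= threshold]


# Made-up sample data, not taken from the dataset.
SAMPLE = [
    EntityAnnotation("t1", "Merkel", "http://dbpedia.org/resource/Angela_Merkel", -0.8),
    EntityAnnotation("t1", "Berlin", "http://dbpedia.org/resource/Berlin", -1.9),
    EntityAnnotation("t2", "apple", "http://dbpedia.org/resource/Apple_Inc.", -2.7),
]

# A strict threshold favours precision: fewer, more reliable annotations.
strict = filter_by_confidence(SAMPLE, -1.0)

# A lenient threshold favours recall: more annotations, including noisier ones.
lenient = filter_by_confidence(SAMPLE, -3.0)

if __name__ == "__main__":
    print(len(strict), len(lenient))
```

In this sketch the ambiguous mention "apple" only survives the lenient threshold, which is the kind of annotation a precision-oriented analysis would want to exclude.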

RDF/S Model:

 
Instantiation example:

Examples


 

About Us

Team:

- Erdal Baran
- Stefan Dietze (main contact)
- Dimitar Dimitrov
- Pavlos Fafalios
- Robert Jäschke
- Vasileios Iosifidis
- Ran Yu
- Xiaofei Zhu
- Matthäus Zloch
- Sebastian Schellhammer

L3S Research Center, University of Hannover, Germany
GESIS – Leibniz Institute for the Social Sciences, Germany

This work was partially funded by the European Commission through the ERC Advanced Grant ALEXANDRIA (grant No. 339233) and the H2020 grant No. 687916 (AFEL project).

Imprint

You can find the imprint with provider and legal notices here