TweetsKB - A Public and Large-Scale RDF Corpus of Annotated Tweets

What

TweetsKB is a public RDF corpus of anonymized data for a large collection of annotated tweets. The dataset currently contains data for nearly 3.1 billion tweets, spanning more than 10 years until the Tweeter API was closed in 06.2023. Metadata information about the tweets as well as extracted entities, sentiments, hashtags and user mentions are exposed in RDF using established RDF/S vocabularies. For the sake of privacy, we encrypt the usernames and we do not provide the text of the tweets. However, the tweet IDs can be used to retrieve the original Tweet text.

More information is available at the following paper:

P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze,

TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets,

15th Extended Semantic Web Conference (ESWC'18), Heraklion, Crete, Greece, June 3-7, 2018.
Nominated for the "Best Resource Paper" award!

pdf • bib • slides

TweetsCOV19 is a subset of TweetsKB containing COVID-related tweets and reflects the societal discourse about COVID-19 on Twitter in the period of October 2019 until April 2020.

Why

For relieving data consumers from the computationally intensive process of extracting and processing tweets.
For facilitating a variety of multi-aspect data consumption, exploration and analytics scenarios. These include:

time-aware and entity-centric exploration of the Twitter archive
data integration by directly exploiting existing knowledge bases (like DBpedia)
entity-centric analytics and knowledge discovery by inferring multi-aspect information related to one or more entities during certain time periods (like popularity, attitude or relations with other entities)

Dataset

•TweetsKB is available as Notation3 (N3) files (split by month) through the Zenodo data repository (under a Creative Commons Attribution 4.0 license) and GESIS Search:

Dataset Part	Zenodo Link	Entity Linking
Part 1 (Feb 2013 - Feb 2014)	https://zenodo.org/record/573852	Yahoo FEL with Wikipedia Dump from November 2015
Part 2 (Mar 2014 - Nov 2014)	https://zenodo.org/record/577572	Yahoo FEL with Wikipedia Dump from November 2015
Part 3 (Dec 2014 - Dec 2015)	https://zenodo.org/record/579597	Yahoo FEL with Wikipedia Dump from November 2015
Part 4 (Jan 2016 - Nov 2016)	https://zenodo.org/record/579601	Yahoo FEL with Wikipedia Dump from November 2015
Part 5 (Dec 2016 - Oct 2017)	https://zenodo.org/record/1095592	Yahoo FEL with Wikipedia Dump from November 2015
Part 6 (Nov 2017 - Mar 2018)	https://zenodo.org/record/1808741	Yahoo FEL with Wikipedia Dump from November 2015
Part 7 (Apr 2018 - Apr 2019)	https://zenodo.org/record/3828929	Yahoo FEL with Wikipedia Dump from April 2020
Part 8 (May 2019 - Apr 2020)	https://zenodo.org/record/3828949	Yahoo FEL with Wikipedia Dump from April 2020
Part 9 (May 2020 - Dec 2020)	https://zenodo.org/record/4420178	Yahoo FEL with Wikipedia Dump from April 2020
Part 10 (Jan 2021 - Dec 2021)	https://doi.org/10.7802/2472	Yahoo FEL with Wikipedia Dump from August 2022
Part 11 (Jan 2022 - Aug 2022)	https://doi.org/10.7802/2473	Yahoo FEL with Wikipedia Dump from August 2022
Part 12 (Sep 2022 - Jun 2023)	https://doi.org/10.78l02/2781	Yahoo FEL with Wikipedia Dump from June 2023

• SPARQL endpoint containing a 5% sample of the dataset: SPARQL endpoint (Default Graph: http://data.gesis.org/tweetskb)

Statistics

Main statistics (as of August 2022):

Feature	Total	Unique	Ratio of tweets with at least one feature
Hashtags:	1,161,839,471	68,832,205	0.19
Mentions:	1,840,456,543	149,277,474	0.38
Entities:	2,563,433,997	2,265,201	0.56
Non-neutral sentiment:	1,265,974,641	-	0.5

Distribution of top-100,000 entity occurrences (as of December 2020):

• Distribution of top-100,000 hashtag occurrences (as of December 2020):

Overview of the top-20 entity types (as of December 2020):

Entities with confidence thresholds -1.

Entity	Count
YouTube	11,031,812
Twitter	5,530,023
Facebook	5,064,670
Guess_(clothing)	4,416,317
Harry_Styles	2,657,646
The_Weather_Channel	2,499,980
CNN	2,414,882
Instagram	2,129,375
Netflix	1,990,180
EBay	1,973,536
SoundCloud	1,816,846
Britney_Spears	1,439,859
Etsy	1,289,528
Lana_Del_Rey	1,260,523
Google	1,207,049
Blackpink	1,079,633
Amazon.com	1,063,110
Not_safe_for_work	1,048,587
Got7	1,039,332
Spotify	1,035,797

Entities with confidence thresholds -2.

Entity	Count
You	100,902,750
LOL	34,304,976
YouTube	14,305,365
Facebook	10,657,535
Twitter	9,641,694
Yahoo!_News	9,371,001
BTS	7,429,176
Happy_Birthday_to_You	7,072,576
World_of_Warcraft	5,852,425
Young-adult_fiction	5,795,295
Yo	4,789,585
The_Weather_Channel	4,516,875
Guess_(clothing)	4,416,500
Ask.com	4,046,042
5_Seconds_of_Summer	4,040,199
Primary_lateral_sclerosis	4,026,400
PPL_Corporation	3,507,805
LMFAO	3,354,678
Instagram	3,298,283
Lady_Gaga	3,245,063

Data model and features

The following figure illustrates the data model and features captured for single tweet instances. Features are extracted from tweets using a pipeline described in the paper listed here. Please note that entity and sentiment annotations may include noise, that is, our pipeline does not achieve perfect precision and recall. For all entity annotations, we provide a confidence score (nee:confidence) which allows you to choose a confidence threshold suitable to your use case, for instance, emphasising either precision or recall.

RDF/S Model:

Instantiation example:

Examples

The following query requests the number of tweets per month mentioning Alexis Tsipras (Greek prime minister) in 2015.

PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#> PREFIX schema: <http://schema.org/> PREFIX dc: <http://purl.org/dc/terms/> PREFIX dbr: <http://dbpedia.org/resource/> SELECT (month(?date) AS ?month) (count(?tweet) AS ?numOfTweets) WHERE { ?tweet schema:mentions ?entity ; dc:created ?date FILTER(year(?date) = 2015) . ?entity a nee:Entity ; nee:hasMatchedURI dbr:Alexis_Tsipras . } GROUP BY month(?date) ORDER BY ?month

Run query

The following query retrieves the top-5 entities co-occurring with Barack Obama in tweets of summer 2016.

PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#> PREFIX schema: <http://schema.org/> PREFIX dc: <http://purl.org/dc/terms/> PREFIX dbr: <http://dbpedia.org/resource/> PREFIX dbo: <http://dbpedia.org/ontology/> SELECT ?relatedEntity (count(distinct ?tweet) as ?num) WHERE { ?tweet dc:created ?date FILTER(month(?date) >= 6 && month(?date) <= 8) . ?tweet schema:mentions ?entity . ?entity a nee:Entity ; nee:hasMatchedURI dbr:Barack_Obama . ?tweet schema:mentions ?entity2. ?entity2 nee:hasMatchedURI ?relatedEntity FILTER (?relatedEntity != dbr:Barack_Obama) } GROUP BY ?relatedEntity ORDER BY DESC(?num) LIMIT 5
Run query

The following query requests the top-10 hashtags co-occuring with the entity Refugee (http://dbpedia.org/resource/Refugee) in 2016.

PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#> PREFIX schema: <http://schema.org/> PREFIX dc: <http://purl.org/dc/terms/> PREFIX dbr: <http://dbpedia.org/resource/> PREFIX sioc: <http://rdfs.org/sioc/types#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> SELECT ?hastagLabel (count(distinct ?tweet) as ?num) WHERE { ?tweet dc:created ?date FILTER(year(?date) = 2016) . ?tweet schema:mentions ?entity . ?entity a nee:Entity ; nee:hasMatchedURI dbr:Refugee . ?tweet schema:mentions ?hashtag. ?hashtag a sioc:Tag ; rdfs:label ?hastagLabel } GROUP BY ?hastagLabel ORDER BY DESC(?num) LIMIT 10
Run query

The following query lists top entities co-occurring with the entity DonaldTrump in tweets of April 2020 (replace the FILTER below with any other entity from DBpedia).

PREFIX nee: <http://www.ics.forth.gr/isl/oae/core#> PREFIX schema: <http://schema.org/> PREFIX dc: <http://purl.org/dc/terms/> PREFIX dbr: <http://dbpedia.org/resource/> SELECT ?otherEntity ( COUNT(?tweet) AS ?count ) WHERE { ?tweet schema:mentions ?entity ; dc:created ?date FILTER ( year(?date) = 2020 and month(?date) = 4 ) . ?entity a nee:Entity ; nee:hasMatchedURI dbr:Donald_Trump . ?tweet schema:mentions ?entity2 . ?entity2 a nee:Entity ; nee:hasMatchedURI ?otherEntity FILTER ( ?otherEntity != dbr:Donald_Trump ) } GROUP BY ?otherEntity ORDER BY DESC( ?count )
Run query

Next Step

We plan to publish datasets for other subsets.

About Us

Team:


                                        - Erdal Baran,

                                        - Stefan
                                        Dietze(GESIS – Leibniz Institute for the Social Sciences and Heinrich-Heine-University Düsseldorf) (main
                                        contact),

                                        - Dimitar Dimitrov,

                                        - Pavlos Fafalios,

                                        - Robert Jäschke,

                                        - Vasileios Iosifidis,

                                        - Ran Yu,

                                        - Xiaofei Zhu,

                                        - Matthäus
                                        Zloch

                                        - Sebastian Schellhammer

                                        - Yudong Zhang

Assert a data protection request: dimitar.dimitrov@gesis.org

L3S Research Center, University of Hannover, Germany
GESIS – Leibniz Institute for the Social Sciences, Germany

The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233 and the H2020 Grant No. 687916 (AFEL project).

>