Tweetplomacy 23 - a semantically annotated corpus of tweets

What

Tweetplomacy 23 is a semantically annotated corpus of tweets capturing digital communicative interaction between international political leaders, peer groups and citizens in the wake of three major global crises: (1) the increasing emphasis on the security of energy supplies following Russia’s invasion of Ukraine; (2) the political and geo-economic consequences of the COVID-19 pandemic; (3) the intensified debate on the progression of climate change. These events occurred between 2018 and 2023, each of them marking a significant shake-up of the international system.

The dataset focuses on the strategic use of networked information on X (formerly Twitter) by executive political actors facing exogenous shocks during a global crisis. It is extracted from an X archive of more than 14 billion tweets collected from the 1% random sample API. To extract the dataset, we use a list of top executives of the political administration (heads of state, heads of government, ministers of foreign affairs) or their respective public-relations offices. Their tweets are filtered using a list of thematically relevant keywords in four languages (English, German, French, Spanish) that reflect the discourse on the three crises mentioned above.
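The filtering step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `match_tweet`, the case-insensitive substring matching, and the example handle and keywords are hypothetical and not taken from the authors' pipeline. Treating both the author and the mentioned users as candidates is inferred from the `matchingUserName` and `matchingUserMentions` fields of the dataset.

```python
def match_tweet(text, author, mentions, handles, keywords):
    """Return (matched_handles, matched_keywords), or None if the tweet is dropped.

    Assumption: a tweet is kept when its author or one of its mentions is on
    the handle list AND its text contains at least one keyword. Matching is
    case-insensitive substring matching, which is a simplification.
    """
    handles_lower = {h.lower() for h in handles}
    involved = [author] + list(mentions)
    matched_handles = sorted({u for u in involved if u.lower() in handles_lower})

    text_lower = text.lower()
    matched_keywords = [k for k in keywords if k.lower() in text_lower]

    if matched_handles and matched_keywords:
        return matched_handles, matched_keywords
    return None


# Hypothetical handle and keywords, for illustration only.
example = match_tweet(
    text="Energy security matters",
    author="example_leader",
    mentions=[],
    handles=["Example_Leader"],
    keywords=["energy security", "pandemic"],
)
```

In this sketch a tweet that matches a handle but no keyword (or the reverse) is discarded, which mirrors the requirement that both conditions hold.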

Our sample covers instances from the beginning of 2018 up to May 2023, representing statements made by leading politicians from 83 countries on all continents. As a subset, tweets published by the political leaders of the 38 member states of the OECD and the five BRICS countries (Brazil, Russia, India, China, South Africa) have been extracted. Additionally, the sample comprises a selection of 10 international organizations.

Why

The dataset can be used to:
track and examine the repercussions and resonance produced by the ‘digital audience’ of the most influential political leaders in the course of the three crises, thus hinting at the political and societal impact of their communicative actions in the digital realm.
identify changes in sentiment, argumentation and/or tonality, as well as more general breakpoints of discussion, through in-depth analyses of the online discourse relating to each of the three debates.
yield new insights into networks of communication among ‘online champions’ in the diplomatic community with regard to global political crises. To this end, researchers can employ both quantitative/statistical and qualitative/hermeneutic methodologies to explore and compare the specific communicative motivations of national political leaders and the global ‘digital public’ in such cases.
provide valuable empirical input not only for political or media scientists, but also for scholars focusing on sociological, economic or socio-psychological aspects of crisis communication.

Dataset

The entire data collection consists of the following files: (1) users: an Excel file listing 654 Twitter user handles (usernames) of top executives of the political administration (and/or their institutional accounts), together with their nationalities, functions/roles and tenure; (2) keywords: an Excel file listing 72 crisis-related keywords; (3) one gzipped JSONL file per language: each line is a JSON object containing metadata about a tweet that matches one or more of the actors’ user handles and one or more of the keywords in the respective language. Additionally, semantic enrichments (i.e., entities and sentiments) computed from the tweet text are provided. Each JSON object includes the following fields:

tweetId: integer, unique ID for an original tweet
timeStamp: format ("EEE MMM dd HH:mm:ss Z yyyy"), the timestamp of the original tweet
userName: JSON object containing the MD5-hashed user names for private persons or the user names for public persons and institutions
userBio: string (available only for public users and institutions), metadata at the timepoint of the original tweets or of retweets
followers: integer, metadata at the timepoint of the original tweets or of retweets
followees: integer, metadata at the timepoint of the original tweets or of retweets
retweets: integer, metadata at the timepoint of the original tweets or of retweets
favorites: integer, metadata at the timepoint of the original tweets or of retweets
replies: integer, metadata at the timepoint of the original tweets or of retweets
matchingKeywords: list of strings representing the matching keywords
matchingUserMentions: list of strings representing the matching user mentions
matchingUserName: string representing the matching user name
sentiments: JSON object containing the output of the VADER sentiment analysis tool (available only for English, German and French)
entities: JSON object containing the output of the Entity Fishing named entity linking tool
hashtags: list of strings containing the hashtags extracted from the tweet text
mentions: list of strings containing the user mentions extracted from the tweet text
urls: JSON object containing (resolved) URLs extracted from the tweet text
retweetId: integer, unique ID for the retweet of an original tweet with an ID captured in the tweetId field
retweetTimeStamp: format ("EEE MMM dd HH:mm:ss Z yyyy"), the timestamp of the retweet
retweetUserName: JSON object containing the MD5-hashed username of the retweeting user
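As a minimal sketch of how the files described above might be read, the following snippet iterates over a gzipped JSONL file and parses the documented timestamp pattern. The helper names `iter_tweets` and `parse_timestamp` are hypothetical, and the `strptime` format string is our translation of the pattern "EEE MMM dd HH:mm:ss Z yyyy"; it assumes English day and month abbreviations and a numeric UTC offset such as +0000.

```python
import gzip
import json
from datetime import datetime

# Python counterpart of the documented pattern "EEE MMM dd HH:mm:ss Z yyyy"
# (assumption: English abbreviations and a numeric offset like +0000).
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_timestamp(value):
    """Parse a timeStamp or retweetTimeStamp string into a timezone-aware datetime."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def iter_tweets(path):
    """Yield one dict per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

A typical loop would call `iter_tweets` on one of the per-language files and pass each record's `timeStamp` to `parse_timestamp`; the actual file names are not assumed here.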

Dataset version	DOI Link
01.01.2018 - 31.05.2023	https://doi.org/10.7802/2985

Dataset Analysis

The Jupyter notebooks for analyzing the Tweetplomacy 23 dataset are available at https://github.com/trovdimi/tweetplomacy-23.

Descriptive statistics

The table shows the percentages of tweets and users, as well as the means and standard deviations (M/Std) of replies, retweets and favorites, per crisis, language and type of user (public/private) in the dataset. The statistics are based on 2,048,232 tweets and 914,533 users. The percentages given for tweets per crisis do not add up to exactly 100 because some tweets cover multiple topics. Similarly, some users tweet about multiple topics or in multiple languages.

	Overall	Energy sec.	COVID-19	Climate chg.	English	German	French	Spanish	Public	Private
tweets (%)	100	46.20	48.96	8.02	61.49	1.42	2.69	34.40	5.17	94.83
users (%)	100	52.70	51.97	9.69	67.61	1.80	3.18	28.07	0.06	99.94
replies (M/Std)	31/684	27/482	34/840	30/486	39/860	25/199	23/231	16/172	401/2,899	11/165
retweets (M/Std)	92/829	81/682	105/953	74/688	102/1,017	29/173	52/309	80/376	712/3,064	58/437
favorites (M/Std)	310/4,271	258/3,006	358/5,098	307/4,896	411/5,371	159/1,068	159/1,293	147/1,113	3,324/17,127	146/1,654
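The following sketch illustrates why the per-crisis percentages in the table can sum to more than 100: a tweet is counted once for every crisis it matches. It is a toy example under stated assumptions. The `keyword_to_crisis` mapping, the placeholder keywords and the four records are invented for illustration, and the use of the population standard deviation (`pstdev`) is a choice made here; the documentation does not say which variant was used.

```python
from statistics import mean, pstdev


def crisis_shares(tweets, keyword_to_crisis):
    """Percentage of tweets per crisis; multi-topic tweets count for each crisis."""
    counts = {}
    for tweet in tweets:
        crises = {
            keyword_to_crisis[k]
            for k in tweet["matchingKeywords"]
            if k in keyword_to_crisis
        }
        for crisis in crises:
            counts[crisis] = counts.get(crisis, 0) + 1
    total = len(tweets)
    return {crisis: 100.0 * n / total for crisis, n in counts.items()}


def mean_and_std(tweets, field):
    """Mean and population standard deviation of an integer field such as 'replies'."""
    values = [t[field] for t in tweets]
    return mean(values), pstdev(values)


# Hypothetical mapping and records, for illustration only.
keyword_to_crisis = {"kw_energy": "energy", "kw_covid": "covid"}
toy_tweets = [
    {"matchingKeywords": ["kw_energy"], "replies": 2},
    {"matchingKeywords": ["kw_covid"], "replies": 4},
    {"matchingKeywords": ["kw_energy", "kw_covid"], "replies": 4},
    {"matchingKeywords": ["kw_covid"], "replies": 6},
]

shares = crisis_shares(toy_tweets, keyword_to_crisis)  # energy: 50.0, covid: 75.0
m, s = mean_and_std(toy_tweets, "replies")
```

In the toy data the shares add up to 125 because the third tweet matches both crises.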

Hashtag usage per crisis

	Energy security	COVID-19	Climate change
tweets with hashtags	42%	56%	53%
users using hashtags	31%	46%	45%
hashtags per tweet	2.60	2.43	2.64
hashtags per user	2.37	2.37	2.56
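One plausible way to derive the tweet-level figures from the `hashtags` field is sketched below. The function name `hashtag_usage` is hypothetical, and averaging only over tweets that contain at least one hashtag is an assumption; the documentation does not state which denominator was used.

```python
def hashtag_usage(tweets):
    """Return (share of tweets with hashtags in %, mean hashtags per such tweet)."""
    with_tags = [t for t in tweets if t.get("hashtags")]
    share = 100.0 * len(with_tags) / len(tweets) if tweets else 0.0
    # Assumption: the mean is taken over tweets that use hashtags at all.
    per_tweet = (
        sum(len(t["hashtags"]) for t in with_tags) / len(with_tags)
        if with_tags
        else 0.0
    )
    return share, per_tweet


# Toy records, for illustration only.
toy_tweets = [
    {"hashtags": ["#a", "#b"]},
    {"hashtags": []},
    {"hashtags": ["#c"]},
    {"hashtags": []},
]
share, per_tweet = hashtag_usage(toy_tweets)
```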

Top hashtags and mentions ranked by their occurrence counts for each crisis

Energy security
hashtag	occ.	mention	occ.
#ukraine	42,008	@realDonaldTrump	250,202
#fanb	33,197	@NicolasMaduro	175,825
#covid19	32,378	@lopezobrador_	140,301
#gnb	31,421	@POTUS	121,766
#tigraygenocide	21,005	@JoeBiden	63,722
COVID-19
hashtag	occ.	mention	occ.
#covid19	164,954	@realDonaldTrump	318,982
#coronavirus	62,037	@NicolasMaduro	200,455
#fanb	28,221	@narendramodi	113,549
#covid_19	26,966	@PMOIndia	102,017
#gnb	26,556	@lopezobrador_	101,899
Climate change
hashtag	occ.	mention	occ.
#climatechange	13,701	@realDonaldTrump	42,256
#climateaction	10,794	@POTUS	17,280
#covid19	5,416	@NicolasMaduro	14,766
#climatecrisis	4,231	@lopezobrador_	14,048
#climate	4,134	@JoeBiden	12,015
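A ranking like the one above can be approximated with a simple counter over the `hashtags` or `mentions` lists. This is a sketch, not the authors' procedure: the helper `top_items` is hypothetical, and lowercasing hashtags before counting is an assumption suggested by the lowercase hashtags in the tables.

```python
from collections import Counter


def top_items(tweets, field, n=5, lowercase=False):
    """Return the n most frequent entries of a list-valued field."""
    counter = Counter()
    for tweet in tweets:
        for item in tweet.get(field) or []:
            counter[item.lower() if lowercase else item] += 1
    return counter.most_common(n)


# Toy records, for illustration only.
toy_tweets = [
    {"hashtags": ["#COVID19", "#covid19"]},
    {"hashtags": ["#Ukraine"]},
    {"hashtags": None},
]
ranking = top_items(toy_tweets, "hashtags", lowercase=True)
```

For mentions, calling `top_items(tweets, "mentions")` without lowercasing keeps the original capitalization of user names.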

Top five detected entities ranked by the number of occurrences

Energy security		COVID-19		Climate change
entity ID	occ.	entity ID	occ.	entity ID	occ.
Q918	227,352	Q918	307,804	Q918	93,109
Q212	169,666	Q84263196	219,251	Q7942	27,602
Q11696	89,800	Q134808	81,218	Q11696	13,631
Q30	40,375	Q81068910	71,304	Q1065	12,364
Q159	40,092	Q7817	63,232	Q208645	8,058

Sentiments of tweets mentioning URLs from news media outlets

The figure shows the average compound sentiment scores for tweets sharing URLs from a qualitative selection of news outlets with different political leanings according to Media Bias/Fact Check.

Dataset	Extracted information	Timespan	Method of extraction	Languages	Number of postings	Publication
TweetIntent@Crisis (Twitter/X) – Ai et al. (2024)	Tweet ID, date, text	02/01/2022 – 02/28/2023	Keywords from tweets (after verification and topic modeling)	English	17,854 (after cleaning)	TweetIntent@Crisis: A Dataset Revealing Narratives of Both Sides in the Russia-Ukraine Crisis, Proceedings of the International AAAI Conference on Web and Social Media
IsamasRed (Reddit) – Chen et al. (2024)	Conversations and comments	08/2023 – 11/2023	Keywords (after automated extraction)	English	412,258 conversations; 8,089,095 comments	IsamasRed: A Public Dataset Tracking Reddit Discussions on Israel-Hamas Conflict, Proceedings of the International AAAI Conference on Web and Social Media
TweetsCOV19 (Twitter/X) – Dimitrov et al. (2020)	Tweet ID, date, metadata, entities, sentiments, hashtags, mentions	10/2019 – 08/2022	Keywords from tweets (manual)	English	41,307,082	TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic
Tracking Social Media Discourse About the COVID-19 Pandemic (Twitter/X) – Chen et al. (2020)	Tweet ID, date	01/21/2020 – present	Keywords from tweets, tweets from accounts (manual)	multilingual	129,911,732	Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set, JMIR Public Health and Surveillance
Coronavirus Tweets Dataset (Twitter/X) – Lamsal (2021)	Tweet ID, time frame	03/20/2020 – present	Keywords from tweets (manual, tokenized)	English	102,650,603	Coronavirus (COVID-19) Tweets Dataset, IEEE DataPort
SocialDrought (Twitter/X; online news articles; meteorological data) – Shang et al. (2024)	Tweet ID, date, text, geolocation (Twitter); title, text, URL (news articles); weather indicators (meteorological records)	01/2012 – 04/2023 (Twitter); 01/2022 – 12/2023 (news articles); 01/2012 – 12/2023 (meteorological records)	Keywords from tweets and news articles; analysis of weather statistics from the US (manual)	English	3,562,605 (tweets); 1,482 (news articles); 31,977 (meteorological records)	SocialDrought: A Social and News Media Driven Dataset and Analytical Platform towards Understanding Societal Impact of Drought, Proceedings of the International AAAI Conference on Web and Social Media

About Us

Team:

- Dimitar Dimitrov
- Jan-Henrik Petermann
- Yudong Zhang

To submit a data protection request, contact: dimitar.dimitrov@gesis.org

L3S Research Center, University of Hannover, Germany
GESIS Web Data, Germany
GESIS – Leibniz Institute for the Social Sciences, Germany