COVID-19 Retweet Prediction Challenge

Task Description

The ongoing Coronavirus disease 2019 (COVID-19) pandemic has drastically changed our daily routines and behavior patterns, not only offline but also online. One example is the shift in reading patterns on Wikipedia and Reddit [1,2]. COVID-19 has also been a hot topic on other social media platforms such as Facebook, Twitter, and YouTube. To understand the information spreading mechanisms during the COVID-19 pandemic, this challenge focuses on Twitter. Twitter is an online social network where users can follow each other and share information using short text posts called tweets. The platform offers a function to retweet a tweet, which means sharing it unchanged with your followers. Retweeting is a popular function, and it has also found its way into other online social networks such as Weibo. Because retweeting amplifies the spread of original content, retweet prediction is a crucial task when studying information spreading processes. Understanding retweet behavior has many practical applications, e.g., political audience design [3,4], fake news spreading and tracking [5,6], health promotion [7], and mass emergency management [8]. Modeling retweet behavior has been an active research area and is especially important during times of crisis, such as the current COVID-19 pandemic.

The retweet prediction task in the challenge is based on the TweetsCOV19 dataset, a publicly available dataset containing more than 8 million COVID-19-related tweets, spanning the period from October 2019 to April 2020.

The COVID-19 Retweet Prediction Challenge is part of the CIKM2020 AnalytiCup, and the winners will be invited to present their solutions during the *online* AnalytiCup Workshop in October 2020.

Join the COVID-19 Retweet Prediction Challenge at CodaLab!

Timeline

01.07.20 Contest and Phase 1 Begin (Validation Leaderboard opens)
15.08.20 Phase 2 Begin (Testing Leaderboard opens)
31.08.20 Last Shot & Contest End
01.09.20 Semi-Finalists Announcement (top six teams on the Testing Leaderboard)
01.10.20 Report & Code Due
20.10.20 Winners Announcement

Data

In this competition, you are provided with TweetsCOV19, a publicly available dataset of more than 8 million tweets spanning the period from October 2019 to April 2020.

TSV File format
Each line contains the features of one tweet instance. Features are separated by a tab character ("\t"). The following list indicates the features in index order:

Tweet Id: Long. [Provided only in the training data.]
Username: String. Encrypted for privacy reasons.
Timestamp: Format ( "EEE MMM dd HH:mm:ss Z yyyy" ).
#Followers: Integer.
#Friends: Integer.
#Retweets: Integer. [The target variable to predict!]
#Favorites: Integer.
Entities: String. For each entity, we aggregated the original text, the annotated entity, and the score produced by the FEL library. Entities are separated from one another by the char ";", and the three fields within an entity are separated by the char ":", so each entity is stored as "original_text:annotated_entity:score;". If FEL did not find any entities, we stored "null;".
Sentiment: String. SentiStrength produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. We separated these two numbers with a whitespace char " ". Positive sentiment is stored first, followed by negative sentiment (e.g., "2 -1").
Mentions: String. If the tweet contains mentions, we removed the char "@" and concatenated the mentions with a whitespace char " ". If no mentions appear, we stored "null;".
Hashtags: String. If the tweet contains hashtags, we removed the char "#" and concatenated the hashtags with a whitespace char " ". If no hashtags appear, we stored "null;".
URLs: String. If the tweet contains URLs, we concatenated the URLs using ":-: ". If no URLs appear, we stored "null;".
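
The format above could be parsed roughly as follows. This is only a minimal sketch in Python, not an official loader: it assumes that a training line carries all twelve features in the listed order, and the names `FIELDS`, `parse_line`, and `split_or_empty` are hypothetical helpers introduced for illustration. It also assumes that the colon-separated parts of an entity can be recovered by splitting from the right, which would break if an annotated entity itself contained ":".

```python
from datetime import datetime

# Assumed column order, taken from the feature list above (hypothetical names).
FIELDS = ["tweet_id", "username", "timestamp", "followers", "friends",
          "retweets", "favorites", "entities", "sentiment", "mentions",
          "hashtags", "urls"]

# Python equivalent of "EEE MMM dd HH:mm:ss Z yyyy"; the day and month
# names are parsed according to the current locale (English expected).
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def split_or_empty(value, sep):
    """Split a multi-valued field; the placeholder "null;" means no values."""
    if value == "null;":
        return []
    return [part.strip() for part in value.split(sep) if part.strip()]


def parse_entities(value):
    """Turn "text:entity:score;..." into (text, entity, score) tuples."""
    entities = []
    for item in split_or_empty(value, ";"):
        # Split from the right so that a ":" inside the original text survives.
        text, entity, score = item.rsplit(":", 2)
        entities.append((text, entity, float(score)))
    return entities


def parse_line(line):
    """Parse one tab-separated training line into a dictionary."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    row = dict(zip(FIELDS, values))
    positive, negative = row["sentiment"].split(" ")
    return {
        "tweet_id": int(row["tweet_id"]),
        "username": row["username"],
        "timestamp": datetime.strptime(row["timestamp"], TIMESTAMP_FORMAT),
        "followers": int(row["followers"]),
        "friends": int(row["friends"]),
        "retweets": int(row["retweets"]),
        "favorites": int(row["favorites"]),
        "entities": parse_entities(row["entities"]),
        "sentiment": (int(positive), int(negative)),
        "mentions": split_or_empty(row["mentions"], " "),
        "hashtags": split_or_empty(row["hashtags"], " "),
        "urls": split_or_empty(row["urls"], ":-:"),
    }
```

Applied to each line of the training file, `parse_line` yields typed values, with empty lists standing in for the "null;" placeholder.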

Problem

Given the set of features for a tweet from TweetsCOV19, the task is to predict the number of times it will be retweeted (#retweets).

Leaderboard

Can you do better?

Rules

Violating any competition rule specified below is grounds for disqualification. In the event of any dispute in connection with the Competition, or with the interpretation or implementation of these rules, the decision of the Organizers shall be final.

Eligibility

The competition is open to everyone except for anyone involved with the organization.

One account per participant

You may not sign up to CodaLab with multiple accounts, and therefore you may not submit from multiple accounts.

Team size

There is no limit to the number of team members. The only restriction is that the total number of submissions from all team members must be less than or equal to the maximum allowed in the respective phase of the competition.

Team mergers

Team mergers are allowed throughout Phase 1 and can be performed by the team leader (go to your account's User Settings and indicate team name and members). In order to merge, the combined team must have a total submission count less than or equal to the maximum allowed as of the merge date. The maximum allowed is the number of submissions per day multiplied by the number of days the competition has been running. The organizers do not provide any assistance regarding team mergers.
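
The merge condition is simple arithmetic and might be checked as in the Python sketch below. The function names are hypothetical, the default of 20 submissions per day is the Phase 1 limit stated under Submissions, and how the organizers count the number of days the competition has been running (for example, whether the start day is included) is not specified here, so `days_running` is left as an input.

```python
def max_allowed_submissions(days_running, per_day=20):
    """Submissions per day multiplied by the number of days running."""
    return per_day * days_running


def can_merge(team_submission_counts, days_running, per_day=20):
    """Check whether the combined submission count stays within the limit."""
    combined = sum(team_submission_counts)
    return combined <= max_allowed_submissions(days_running, per_day)
```

For instance, after 10 days the limit under these assumptions is 200, so two teams with 50 and 120 submissions could merge, while teams with 150 and 60 could not.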

Additional data

Participants are free to use any additional datasets that have been made publicly available *before* ~~the beginning of the Competition~~ April 30, 2020.

No private sharing outside teams

Privately sharing code or data outside of teams is not permitted. It is permitted to share code if it is made available to all participants on the forums or as a public repository (e.g., GitHub).

Submissions

You may submit a maximum of 20 entries per day during Phase 1 (Validation Leaderboard). For Phase 2 (Testing Leaderboard), you can only submit 10 entries in total.

At the end of Phase 2, the semi-finalists (the top six teams) are to submit ~~their code as well as~~ a report describing their solution (4 pages in ACM format) and make their code publicly available by the stipulated deadline.

The submitted code and reports may be inspected to check the validity of the solution. The reports will eventually be made publicly available on the CIKM conference website.

Selected teams will also be invited to present their solutions *online* at the CIKM AnalytiCup Workshop in October 2020. To allocate the limited presentation slots, preference will be given to award-winning teams, as well as teams deemed by the organizers to have interesting or remarkable solutions.

Ethics

We trust that all used data, methods, and resources comply with the ACM code of ethics.

Winners

The ranking of entries based on the prediction score (MSLE) during Phase 2 will be used to determine the semi-finalists (top six teams), subject to the validity of the solutions. Winners will be the top 3 teams among the semi-finalist teams. A tie in the prediction score will be broken in favor of the earlier submission on the final leaderboard.
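
MSLE stands for mean squared logarithmic error. The Python sketch below uses the standard definition, the mean of (log(1 + predicted) - log(1 + actual))^2 over all tweets; the exact scoring script used on the leaderboard is not shown here, so treat the function name `msle` and any implementation details as assumptions.

```python
import math


def msle(actual, predicted):
    """Mean squared logarithmic error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one value is required")
    total = 0.0
    for a, p in zip(actual, predicted):
        if a < 0 or p < 0:
            raise ValueError("retweet counts must be non-negative")
        # log1p(x) computes log(1 + x), which is defined for a count of zero.
        total += (math.log1p(p) - math.log1p(a)) ** 2
    return total / len(actual)
```

Because the error is measured on a logarithmic scale, missing a tweet with 10,000 retweets by 100 costs far less than predicting 100 retweets for a tweet that received none.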

Organizers

The challenge is part of the CIKM2020 AnalytiCup

Dimitar Dimitrov, GESIS – Leibniz Institute for the Social Sciences, Germany
Xiaofei Zhu, Chongqing University of Technology, China
Questions related to the challenge?

References

[1] Gozzi, N., Tizzani, M., Starnini, M., Ciulla, F., Paolotti, D., Panisson, A. and Perra, N., 2020. Collective response to the media coverage of COVID-19 Pandemic on Reddit and Wikipedia. arXiv preprint arXiv:2006.06446.

[2] Ribeiro, M.H., Gligorić, K., Peyrard, M., Lemmerich, F., Strohmaier, M. and West, R., 2020. Sudden Attention Shifts on Wikipedia Following COVID-19 Mobility Restrictions. arXiv preprint arXiv:2005.08505.

[3] Stieglitz, S. and Dang-Xuan, L., 2012, January. Political communication and influence through microblogging--An empirical analysis of sentiment in Twitter messages and retweet behavior. In 2012 45th Hawaii International Conference on System Sciences (pp. 3500-3509). IEEE.

[4] Kim, E., Sung, Y. and Kang, H., 2014. Brand followers’ retweeting behavior on Twitter: How brand relationships influence brand electronic word-of-mouth. Computers in Human Behavior, 37, pp.18-25.

[5] Lumezanu, C., Feamster, N. and Klein, H., 2012, May. #bias: Measuring the tweeting behavior of propagandists. In Sixth International AAAI Conference on Weblogs and Social Media.

[6] Vosoughi, S., Roy, D. and Aral, S., 2018. The spread of true and false news online. Science, 359(6380), pp.1146-1151.

[7] Chung, J.E., 2017. Retweeting in health promotion: Analysis of tweets about Breast Cancer Awareness Month. Computers in Human Behavior, 74, pp.112-119.

[8] Kogan, M., Palen, L. and Anderson, K.M., 2015, February. Think local, retweet global: Retweeting by the geographically-vulnerable during Hurricane Sandy. In Proceedings of the 18th ACM conference on computer supported cooperative work & social computing (pp. 981-993).