Week 7

This week I did some meta data calculatation as all the data for 32 nfl teams have been scraped:

There are in total 262 pre posts and 28 paired games. There are in total 1083 game posts and 527 paired games. There are in total 1015 post posts and 461 paired games.

For future conveniences, I cleared out some tables for quick data reference.

  1. a giant tsv (tab separated file) of all comments. tsv has 3 columns: comment_id, post_id, raw_comment, parent_comment_id which is 0 when it is a top level comment.
  2. a tsv of posts, which has the columns post_id, game_id, subreddit.
  3. a tsv of game information, including competing team names, score, etc.

In addition, I set up another experimental task allowing only “in” and “out” options. This time human and gpt agree on annotation 78% of the time. Meanwhile, the cohen kappa score among multiple runs of gpt annotation is 0.58. Since we are masking and prediting ingroup/outgroup/third-party entities, I calculate the number of usable comments in the dataset. To provide enought context, we keep comments with at least five words. There are in total 5019538 comments from paired posts. Among them, there are 190963 third-party mentions, 225591 ingroup mentions, and 201075 outgroup mentions. 124230 comments contain only third-party entity, 177301 comments contain only ingroup entity, 162211 comments contain only outgroup entity. 17606 comments have both ingroup and outgroup entities, 12832 comments have ingroup and third-party entities, 10384 comments have outgroup and third-party entities. There are 1888 comments that include all three types of entities. Finally, there are 506452 comments that have at least one type of masked entities. Those statistics demonstrate that using only the team name, there are sufficiently many comments that we can take advantage of.

Written on July 7, 2023