All about side threads

Introduction

This is a brief post on some preliminary work I’ve done looking at the side threads on rcounting. The post is split into two sections. First, how I’ve worked with the log files to get them into a structured database that’s easier for me to reason about, and after that, how I’veused the logs to plot two different charts: A cumulative total of the number of counts made in each thread, and the most popular side thread over time.

It’s also the first time I’m working with the plotly library to make an interactive graph, so please let me know if you have any comments on that!

Adding side threads to a single database

For me it’s easier to work with a single database file with a well-defined structure than it is to work with a bunch of csv files, especially when I want to analyse all the side thread counts together. This section is mainly a bunch of code to go from one representation to the other1. Feel free to skip to Section 3 if you want!

  • 1 I’ll make a separate script that logs threads to this database later, but to start with I’ll populate it with all the historical counts that u/Countletics and u/Antichess have been nice enough to provide.

  • Code for importing packages and loading data
    import itertools
    import os
    import sqlite3
    from pathlib import Path
    
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    import plotly.express as px
    import plotly.graph_objects as go
    import plotly.io as pio
    import seaborn as sns
    from IPython.display import Markdown
    from rcounting import analysis, counters, graph_tools, parsing
    from rcounting.models import submission_to_dict
    from rcounting.reddit_interface import reddit
    
    pio.templates.default = "seaborn"
    sns.set_theme()
    data_directory = Path(os.getenv("HOME")) / "Downloads" / "side_threads"
    side_threads_db = sqlite3.connect(
        Path(os.getenv("COUNTING_DATA")) / "side_threads.sqlite"
    )

    I’ll start by loading in all the csv files to a single DataFrame, with an additional column identifying which file every comment came from. I then save the comments (minus this last column) to a table in the database. I don’t add the side thread information to this table because that’s a property of submissions, not of comments.

    Code
    filenames = os.listdir(data_directory)
    filename = filenames[0]
    df = pd.concat(
        [
            pd.read_csv(
                data_directory / filename,
                names=["body", "username", "timestamp", "comment_id", "submission_id"],
                dtype={"body": str},
            ).assign(thread=filename.replace("_log.csv", "").replace("_", " "))
            for filename in filenames
        ],
        axis=0,
        ignore_index=True,
    )
    
    df.drop(["thread"], axis=1).to_sql(
        "comments", side_threads_db, index=False, if_exists="replace"
    )

    I want to store more information about each submission than just the submission id, so I’ll use the info endpoint of the reddit API to get that in batches of up to 100 submissions per request. This only works because I know the id of each submission.

    Code
    submissions = reddit.info(fullnames=["t3_" + x for x in df["submission_id"].unique()])
    submissions = pd.DataFrame(
        [submission_to_dict(submission) for submission in submissions]
    )

    The final piece of information I need to write in the submissions table is which thread every submission belongs to. I’ll follow the standard practice of the directory and identify each side thread with the id of the first submission in the thread. That’s a unique identifier which doesn’t change, so it makes for a good key.

    Code
    submissions = submissions.merge(
        pd.concat(
            [
                df.drop_duplicates("submission_id")["submission_id"],
                df.drop_duplicates("submission_id")
                .groupby("thread")
                .transform("first")["submission_id"]
                .rename("thread_id"),
            ],
            axis=1,
        ),
        left_on="submission_id",
        right_on="submission_id",
    )

    And then I’ll write the submissions table to the database.

    Code
    submissions.to_sql("submissions", side_threads_db, index=False, if_exists="replace")

    The thread id is a useful key for identifying a given thread, but it’s hard to understand which thread we’re working with if we just have the id. To make things easier, I’ll store a human-readable name as well, in a threads table. For many of the threads, I already have my own name for them in the rcounting tools, which might differ from the one derived from the name of the csv file. So for the ones that I’ve already given a name, I’ll use that one, and for the new threads, I’ll use the name from the csv files.

    In the future, this table could be populated with some more information, such as the length of each thread.

    Code
    from rcounting.side_threads import known_thread_ids
    
    thread_names = (
        df.groupby("thread")["submission_id"]
        .first()
        .reset_index()
        .rename({"thread": "thread_name", "submission_id": "thread_id"}, axis=1)
    )
    
    
    def rename_thread(series):
        return (
            series["thread_name"]
            if series["thread_id"] not in known_thread_ids
            else known_thread_ids[series["thread_id"]]
        )
    
    
    thread_names["thread_name"] = thread_names.apply(rename_thread, axis=1)
    thread_names.to_sql("threads", side_threads_db, index=False, if_exists="replace")

    And that’s the database created, with the structure that each comment belongs to a submission, and each submission belongs to a thread; and with information about each of those stored in their respective tables. Onward to the analysis!

    Analysing side thread data

    The first and most obvious thing to do with the side thread data is to plot a chart of the cumulative total number of counts in each thread.

    I’ll pretend that I didn’t just create the side threads database, so I’ll start by loading in some comments:

    Code to query the database and load the side thread comments
    comments = pd.read_sql(
        "SELECT comments.timestamp, submissions.thread_id, comments.username "
        "FROM comments JOIN submissions "
        "ON comments.submission_id == submissions.submission_id "
        "ORDER by comments.timestamp",
        side_threads_db,
    )
    comments["username"] = comments["username"].apply(counters.apply_alias)
    known_threads = pd.read_sql("select * from threads", side_threads_db).set_index(
        "thread_id"
    )

    In the thread directory, we treat the top 25 threads specially, and so I’ll focus this analysis on those threads. This also helps make the plots slightly less crowded.

    I’ll use a groupby and a size to get the total number of counts in each thread, and use that to find the top 25 threads. Then I select only the comments in those threads, and get a 12 hour resampled count of how many comments were made in each side thread.

    Code to calculate how many counts were made in each side thread in 12 hour periods
    top_threads = (
        comments.groupby("thread_id")
        .size()
        .sort_values(ascending=False)
        .head(25)
        .reset_index(name="size")
    )
    frequency = "12h"
    subset = top_threads.merge(comments, left_on="thread_id", right_on="thread_id")[
        ["thread_id", "timestamp"]
    ]
    subset = subset.set_index(pd.to_datetime(subset["timestamp"], unit="s"))["thread_id"]
    resampled = pd.get_dummies(subset.sort_index()).resample(frequency).sum()

    From there it’s just a short hop and a skip to the chart I described. As teased earlier, the chart is interactive and you can pan and zoom as you like. You can also highlight specific threads by double clicking the legend.

    Code to plot the cumulative side thread counts
    cumulative = (
        resampled.expanding()
        .sum()
        .rename(known_threads.to_dict()["thread_name"], axis=1)
        .melt(ignore_index=False)
        .reset_index()
    )
    order = list(
        cumulative.groupby("variable")["value"].last().sort_values(ascending=False).index
    )
    fig = px.line(
        data_frame=cumulative,
        x="timestamp",
        y="value",
        line_group="variable",
        color="variable",
        category_orders={"variable": order},
        labels={"timestamp": "Date", "variable": "Side Thread", "value": "Total Counts"},
    )
    fig.show()

    Looking a the chart, there are a couple of interesting things to notice. One is the impressive and sustained rise of tug of war, which is by far our most popular side thread. The next is how well we’ve managed to keep parity between binary and hexadecimal over the years. Apart from a brief blip in September 20202 the two lines are basically on top of each other for their entire history. Then there’s the impressive rise of no repeating digits starting in mid 2019, and continuing until the present.

  • 2 Caused by counters who were unaware of the parity goal running a couple of threads

  • Below the top five threads we have 20 threads that are basically track each other, all growing at approximately the same rate. In this group, it’s maybe interesting to note the growth of age, which wasn’t started until mid-2019, and saw a big rise, before levelling off after the start of 2022. And then there’s time, which made it to the very top of this group, before completely halting in late 2022.

    The most prolific side thread counters

    We can use basically the same approach to find and plot the top side thread counters over time,

    Code to plot the hall of side threads
    counts = comments.groupby("username").size().sort_values(ascending=False)
    top_counters = [x for x in counts.index if not counters.is_banned_counter(x)][:25]
    df = comments.loc[comments["username"].isin(top_counters), ["username", "timestamp"]]
    total = (
        pd.get_dummies(df.set_index(pd.to_datetime(df["timestamp"], unit="s"))["username"])
        .resample(frequency)
        .sum()
        .cumsum()
        .melt(ignore_index=False)
        .reset_index()
        .rename(
            {"timestamp": "Date", "variable": "Counter", "value": "Total Counts"}, axis=1
        )
    )
    
    fig = px.line(
        data_frame=total,
        x="Date",
        y="Total Counts",
        line_group="Counter",
        color="Counter",
        category_orders={"Counter": top_counters},
    )
    fig.show()

    A couple of things stand out about this plot too. The first is how consistent u/TheNitromeFan’s counting rate was between late 2017 and the start of 2022, followed by his semi-retirement since then. Similarly, you can see how atomicimploder bascially left the subreddit for a couple of years before coming back to reclaim his number 2 spot in the total number of side thread counts.

    It’s also fun to see how many of our counters have wildly varying rates of side thread activity over time, which makes for bumpy lines on this plot.