All about side threads

Introduction

This is a brief post on some preliminary work I’ve done looking at the side threads on rcounting. The post is split into two sections. First, how I’ve worked with the log files to get them into a structured database that’s easier for me to reason about, and after that, how I’veused the logs to plot two different charts: A cumulative total of the number of counts made in each thread, and the most popular side thread over time.

It’s also the first time I’m working with the plotly library to make an interactive graph, so please let me know if you have any comments on that!

Adding side threads to a single database

For me it’s easier to work with a single database file with a well-defined structure than it is to work with a bunch of csv files, especially when I want to analyse all the side thread counts together. This section is mainly a bunch of code to go from one representation to the other¹. Feel free to skip to Section 3 if you want!

¹ I’ll make a separate script that logs threads to this database later, but to start with I’ll populate it with all the historical counts that u/Countletics and u/Antichess have been nice enough to provide.

Code for importing packages and loading data

import itertools
import os
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import seaborn as sns
from IPython.display import Markdown
from rcounting import analysis, counters, graph_tools, parsing
from rcounting.models import submission_to_dict
from rcounting.reddit_interface import reddit

pio.templates.default = "seaborn"
sns.set_theme()
data_directory = Path(os.getenv("HOME")) / "Downloads" / "side_threads"
side_threads_db = sqlite3.connect(
    Path(os.getenv("COUNTING_DATA")) / "side_threads.sqlite"
)

I’ll start by loading in all the csv files to a single DataFrame, with an additional column identifying which file every comment came from. I then save the comments (minus this last column) to a table in the database. I don’t add the side thread information to this table because that’s a property of submissions, not of comments.

Code

filenames = os.listdir(data_directory)
filename = filenames[0]
df = pd.concat(
    [
        pd.read_csv(
            data_directory / filename,
            names=["body", "username", "timestamp", "comment_id", "submission_id"],
            dtype={"body": str},
        ).assign(thread=filename.replace("_log.csv", "").replace("_", " "))
        for filename in filenames
    ],
    axis=0,
    ignore_index=True,
)

df.drop(["thread"], axis=1).to_sql(
    "comments", side_threads_db, index=False, if_exists="replace"
)

I want to store more information about each submission than just the submission id, so I’ll use the info endpoint of the reddit API to get that in batches of up to 100 submissions per request. This only works because I know the id of each submission.

Code

submissions = reddit.info(fullnames=["t3_" + x for x in df["submission_id"].unique()])
submissions = pd.DataFrame(
    [submission_to_dict(submission) for submission in submissions]
)

The final piece of information I need to write in the submissions table is which thread every submission belongs to. I’ll follow the standard practice of the directory and identify each side thread with the id of the first submission in the thread. That’s a unique identifier which doesn’t change, so it makes for a good key.

Code

submissions = submissions.merge(
    pd.concat(
        [
            df.drop_duplicates("submission_id")["submission_id"],
            df.drop_duplicates("submission_id")
            .groupby("thread")
            .transform("first")["submission_id"]
            .rename("thread_id"),
        ],
        axis=1,
    ),
    left_on="submission_id",
    right_on="submission_id",
)

And then I’ll write the submissions table to the database.

Code

submissions.to_sql("submissions", side_threads_db, index=False, if_exists="replace")

The thread id is a useful key for identifying a given thread, but it’s hard to understand which thread we’re working with if we just have the id. To make things easier, I’ll store a human-readable name as well, in a threads table. For many of the threads, I already have my own name for them in the rcounting tools, which might differ from the one derived from the name of the csv file. So for the ones that I’ve already given a name, I’ll use that one, and for the new threads, I’ll use the name from the csv files.

In the future, this table could be populated with some more information, such as the length of each thread.

Code

from rcounting.side_threads import known_thread_ids

thread_names = (
    df.groupby("thread")["submission_id"]
    .first()
    .reset_index()
    .rename({"thread": "thread_name", "submission_id": "thread_id"}, axis=1)
)


def rename_thread(series):
    return (
        series["thread_name"]
        if series["thread_id"] not in known_thread_ids
        else known_thread_ids[series["thread_id"]]
    )


thread_names["thread_name"] = thread_names.apply(rename_thread, axis=1)
thread_names.to_sql("threads", side_threads_db, index=False, if_exists="replace")

And that’s the database created, with the structure that each comment belongs to a submission, and each submission belongs to a thread; and with information about each of those stored in their respective tables. Onward to the analysis!

Analysing side thread data

The first and most obvious thing to do with the side thread data is to plot a chart of the cumulative total number of counts in each thread.

I’ll pretend that I didn’t just create the side threads database, so I’ll start by loading in some comments:

Code to query the database and load the side thread comments

comments = pd.read_sql(
    "SELECT comments.timestamp, submissions.thread_id, comments.username "
    "FROM comments JOIN submissions "
    "ON comments.submission_id == submissions.submission_id "
    "ORDER by comments.timestamp",
    side_threads_db,
)
comments["username"] = comments["username"].apply(counters.apply_alias)
known_threads = pd.read_sql("select * from threads", side_threads_db).set_index(
    "thread_id"
)

In the thread directory, we treat the top 25 threads specially, and so I’ll focus this analysis on those threads. This also helps make the plots slightly less crowded.

I’ll use a groupby and a size to get the total number of counts in each thread, and use that to find the top 25 threads. Then I select only the comments in those threads, and get a 12 hour resampled count of how many comments were made in each side thread.

Code to calculate how many counts were made in each side thread in 12 hour periods

top_threads = (
    comments.groupby("thread_id")
    .size()
    .sort_values(ascending=False)
    .head(25)
    .reset_index(name="size")
)
frequency = "12h"
subset = top_threads.merge(comments, left_on="thread_id", right_on="thread_id")[
    ["thread_id", "timestamp"]
]
subset = subset.set_index(pd.to_datetime(subset["timestamp"], unit="s"))["thread_id"]
resampled = pd.get_dummies(subset.sort_index()).resample(frequency).sum()

From there it’s just a short hop and a skip to the chart I described. As teased earlier, the chart is interactive and you can pan and zoom as you like. You can also highlight specific threads by double clicking the legend.

Code to plot the cumulative side thread counts

cumulative = (
    resampled.expanding()
    .sum()
    .rename(known_threads.to_dict()["thread_name"], axis=1)
    .melt(ignore_index=False)
    .reset_index()
)
order = list(
    cumulative.groupby("variable")["value"].last().sort_values(ascending=False).index
)
fig = px.line(
    data_frame=cumulative,
    x="timestamp",
    y="value",
    line_group="variable",
    color="variable",
    category_orders={"variable": order},
    labels={"timestamp": "Date", "variable": "Side Thread", "value": "Total Counts"},
)
fig.show()

Looking a the chart, there are a couple of interesting things to notice. One is the impressive and sustained rise of tug of war, which is by far our most popular side thread. The next is how well we’ve managed to keep parity between binary and hexadecimal over the years. Apart from a brief blip in September 2020² the two lines are basically on top of each other for their entire history. Then there’s the impressive rise of no repeating digits starting in mid 2019, and continuing until the present.

² Caused by counters who were unaware of the parity goal running a couple of threads

Below the top five threads we have 20 threads that are basically track each other, all growing at approximately the same rate. In this group, it’s maybe interesting to note the growth of age, which wasn’t started until mid-2019, and saw a big rise, before levelling off after the start of 2022. And then there’s time, which made it to the very top of this group, before completely halting in late 2022.

The most prolific side thread counters

We can use basically the same approach to find and plot the top side thread counters over time,

Code to plot the hall of side threads

counts = comments.groupby("username").size().sort_values(ascending=False)
top_counters = [x for x in counts.index if not counters.is_banned_counter(x)][:25]
df = comments.loc[comments["username"].isin(top_counters), ["username", "timestamp"]]
total = (
    pd.get_dummies(df.set_index(pd.to_datetime(df["timestamp"], unit="s"))["username"])
    .resample(frequency)
    .sum()
    .cumsum()
    .melt(ignore_index=False)
    .reset_index()
    .rename(
        {"timestamp": "Date", "variable": "Counter", "value": "Total Counts"}, axis=1
    )
)

fig = px.line(
    data_frame=total,
    x="Date",
    y="Total Counts",
    line_group="Counter",
    color="Counter",
    category_orders={"Counter": top_counters},
)
fig.show()

A couple of things stand out about this plot too. The first is how consistent u/TheNitromeFan’s counting rate was between late 2017 and the start of 2022, followed by his semi-retirement since then. Similarly, you can see how atomicimploder bascially left the subreddit for a couple of years before coming back to reclaim his number 2 spot in the total number of side thread counts.

It’s also fun to see how many of our counters have wildly varying rates of side thread activity over time, which makes for bumpy lines on this plot.

Plotting the most popular side thread over time

We can also look at which side thread is the most popular in any given 30 day period³, shown here below

³ To be precise, we’ll be looking at which of the current top 25 threads is the most popular at any time. So if a thread was popular once, but has since dropped out of the top 25 you won’t find it here.

Code to plot the most popular side thread

window = "30d"
one_hot = resampled.rolling(window).sum().idxmax(axis=1)
mode = pd.get_dummies(one_hot).rename(known_threads.to_dict()["thread_name"], axis=1)
mode = mode[[x for x in order if x in mode.columns]]
labels = {"timestamp": "Date", "variable": "Side thread"}
palette = sns.color_palette("colorblind", len(mode.columns))
colors = [f"rgb{tuple(256*np.array(x))}" for x in palette]

fig = go.Figure()
for column, color in zip(mode.columns, colors):
    fig.add_trace(
        go.Scatter(
            x=mode.index,
            y=mode[column],
            fill="tozeroy",
            mode="none",
            fillcolor=color,
            name=column,
            hoveron="points+fills",
        )
    )
fig.update_yaxes(range=[0, 1], visible=False, showticklabels=False)
fig.show()

What I’d most like to draw your attention to with this plot is the four month stretch in 2021 when ternary was our most popular thread, and 25k counts were made in it. This is the only time ternary has ever been the most popular side thread, and the reason for the rapid rise is that somebody had decided to push the thread hard to reach an extra digit. Once that was accomplished, the activity declined to basically where it was before.

I was originally going to include some deeper analysis of tug of war in this post as well, but cleaning up that data is going to take a lot longer than I thought, so that’s all for now! If you have any suggestions for things you’d like to see me do with the side thread data, let me know!

--- title: All about side threads --- # Introduction This is a brief post on some preliminary work I've done looking at the side threads on `rcounting`. The post is split into two sections. First, how I've worked with the log files to get them into a structured database that's easier for me to reason about, and after that, how I'veused the logs to plot two different charts: A cumulative total of the number of counts made in each thread, and the most popular side thread over time. It's also the first time I'm working with the plotly library to make an interactive graph, so please let me know if you have any comments on that! # Adding side threads to a single database For me it's easier to work with a single database file with a well-defined structure than it is to work with a bunch of csv files, especially when I want to analyse all the side thread counts together. This section is mainly a bunch of code to go from one representation to the other[^1]. Feel free to skip to @sec-analysis if you want! ```{python} # | code-summary: "Code for importing packages and loading data" import itertools import os import sqlite3 from pathlib import Path import matplotlib.pyplot as plt import numpy as np import pandas as pd import plotly.express as px import plotly.graph_objects as go import plotly.io as pio import seaborn as sns from IPython.display import Markdown from rcounting import analysis, counters, graph_tools, parsing from rcounting.models import submission_to_dict from rcounting.reddit_interface import reddit pio.templates.default = "seaborn" sns.set_theme() data_directory = Path(os.getenv("HOME")) / "Downloads" / "side_threads" side_threads_db = sqlite3.connect( Path(os.getenv("COUNTING_DATA")) / "side_threads.sqlite" ) ``` I'll start by loading in all the csv files to a single `DataFrame`, with an additional column identifying which file every comment came from. I then save the comments (minus this last column) to a table in the database. I don't add the side thread information to this table because that's a property of submissions, not of comments. ```{python} # | eval: false # | code-fold: show filenames = os.listdir(data_directory) filename = filenames[0] df = pd.concat( [ pd.read_csv( data_directory / filename, names=["body", "username", "timestamp", "comment_id", "submission_id"], dtype={"body": str}, ).assign(thread=filename.replace("_log.csv", "").replace("_", " ")) for filename in filenames ], axis=0, ignore_index=True, ) df.drop(["thread"], axis=1).to_sql( "comments", side_threads_db, index=False, if_exists="replace" ) ``` I want to store more information about each submission than just the submission id, so I'll use the `info` endpoint of the reddit API to get that in batches of up to 100 submissions per request. This only works because I know the id of each submission. ```{python} # | eval: false # | code-fold: show submissions = reddit.info(fullnames=["t3_" + x for x in df["submission_id"].unique()]) submissions = pd.DataFrame( [submission_to_dict(submission) for submission in submissions] ) ``` The final piece of information I need to write in the submissions table is which thread every submission belongs to. I'll follow the standard practice of the directory and identify each side thread with the id of the first submission in the thread. That's a unique identifier which doesn't change, so it makes for a good key. ```{python} # | eval: false # | code-fold: show submissions = submissions.merge( pd.concat( [ df.drop_duplicates("submission_id")["submission_id"], df.drop_duplicates("submission_id") .groupby("thread") .transform("first")["submission_id"] .rename("thread_id"), ], axis=1, ), left_on="submission_id", right_on="submission_id", ) ``` And then I'll write the submissions table to the database. ```{python} # | eval: false # | code-fold: show submissions.to_sql("submissions", side_threads_db, index=False, if_exists="replace") ``` The thread id is a useful key for identifying a given thread, but it's hard to understand which thread we're working with if we just have the id. To make things easier, I'll store a human-readable name as well, in a threads table. For many of the threads, I already have my own name for them in the `rcounting` tools, which might differ from the one derived from the name of the csv file. So for the ones that I've already given a name, I'll use that one, and for the new threads, I'll use the name from the csv files. In the future, this table could be populated with some more information, such as the length of each thread. ```{python} # | eval: false # | code-fold: show from rcounting.side_threads import known_thread_ids thread_names = ( df.groupby("thread")["submission_id"] .first() .reset_index() .rename({"thread": "thread_name", "submission_id": "thread_id"}, axis=1) ) def rename_thread(series): return ( series["thread_name"] if series["thread_id"] not in known_thread_ids else known_thread_ids[series["thread_id"]] ) thread_names["thread_name"] = thread_names.apply(rename_thread, axis=1) thread_names.to_sql("threads", side_threads_db, index=False, if_exists="replace") ``` And that's the database created, with the structure that each comment belongs to a submission, and each submission belongs to a thread; and with information about each of those stored in their respective tables. Onward to the analysis! # Analysing side thread data {#sec-analysis} The first and most obvious thing to do with the side thread data is to plot a chart of the cumulative total number of counts in each thread. I'll pretend that I didn't just create the side threads database, so I'll start by loading in some comments: ```{python} # | code-summary: "Code to query the database and load the side thread comments" comments = pd.read_sql( "SELECT comments.timestamp, submissions.thread_id, comments.username " "FROM comments JOIN submissions " "ON comments.submission_id == submissions.submission_id " "ORDER by comments.timestamp", side_threads_db, ) comments["username"] = comments["username"].apply(counters.apply_alias) known_threads = pd.read_sql("select * from threads", side_threads_db).set_index( "thread_id" ) ``` In the thread directory, we treat the top 25 threads specially, and so I'll focus this analysis on those threads. This also helps make the plots slightly less crowded. I'll use a `groupby` and a `size` to get the total number of counts in each thread, and use that to find the top 25 threads. Then I select only the comments in those threads, and get a 12 hour resampled count of how many comments were made in each side thread. ```{python} # | code-summary: "Code to calculate how many counts were made in each side thread in 12 hour periods" top_threads = ( comments.groupby("thread_id") .size() .sort_values(ascending=False) .head(25) .reset_index(name="size") ) frequency = "12h" subset = top_threads.merge(comments, left_on="thread_id", right_on="thread_id")[ ["thread_id", "timestamp"] ] subset = subset.set_index(pd.to_datetime(subset["timestamp"], unit="s"))["thread_id"] resampled = pd.get_dummies(subset.sort_index()).resample(frequency).sum() ``` From there it's just a short hop and a skip to the chart I described. As teased earlier, the chart is interactive and you can pan and zoom as you like. You can also highlight specific threads by double clicking the legend. ```{python} # | code-summary: "Code to plot the cumulative side thread counts" # | column: body-outset cumulative = ( resampled.expanding() .sum() .rename(known_threads.to_dict()["thread_name"], axis=1) .melt(ignore_index=False) .reset_index() ) order = list( cumulative.groupby("variable")["value"].last().sort_values(ascending=False).index ) fig = px.line( data_frame=cumulative, x="timestamp", y="value", line_group="variable", color="variable", category_orders={"variable": order}, labels={"timestamp": "Date", "variable": "Side Thread", "value": "Total Counts"}, ) fig.show() ``` Looking a the chart, there are a couple of interesting things to notice. One is the impressive and sustained rise of `tug of war`, which is by far our most popular side thread. The next is how well we've managed to keep parity between `binary` and `hexadecimal` over the years. Apart from a brief blip in September 2020[^2] the two lines are basically on top of each other for their entire history. Then there's the impressive rise of `no repeating digits` starting in mid 2019, and continuing until the present. Below the top five threads we have 20 threads that are basically track each other, all growing at approximately the same rate. In this group, it's maybe interesting to note the growth of `age`, which wasn't started until mid-2019, and saw a big rise, before levelling off after the start of 2022. And then there's `time`, which made it to the very top of this group, before completely halting in late 2022. ## The most prolific side thread counters We can use basically the same approach to find and plot the top side thread counters over time, ```{python} # | code-summary: "Code to plot the hall of side threads" #| column: body-outset counts = comments.groupby("username").size().sort_values(ascending=False) top_counters = [x for x in counts.index if not counters.is_banned_counter(x)][:25] df = comments.loc[comments["username"].isin(top_counters), ["username", "timestamp"]] total = ( pd.get_dummies(df.set_index(pd.to_datetime(df["timestamp"], unit="s"))["username"]) .resample(frequency) .sum() .cumsum() .melt(ignore_index=False) .reset_index() .rename( {"timestamp": "Date", "variable": "Counter", "value": "Total Counts"}, axis=1 ) ) fig = px.line( data_frame=total, x="Date", y="Total Counts", line_group="Counter", color="Counter", category_orders={"Counter": top_counters}, ) fig.show() ``` A couple of things stand out about this plot too. The first is how consistent `u/TheNitromeFan`'s counting rate was between late 2017 and the start of 2022, followed by his semi-retirement since then. Similarly, you can see how `atomicimploder` bascially left the subreddit for a couple of years before coming back to reclaim his number 2 spot in the total number of side thread counts. It's also fun to see how many of our counters have wildly varying rates of side thread activity over time, which makes for bumpy lines on this plot. ## Plotting the most popular side thread over time We can also look at which side thread is the most popular in any given 30 day period[^3], shown here below ```{python} # | code-summary: "Code to plot the most popular side thread" # | column: body-outset window = "30d" one_hot = resampled.rolling(window).sum().idxmax(axis=1) mode = pd.get_dummies(one_hot).rename(known_threads.to_dict()["thread_name"], axis=1) mode = mode[[x for x in order if x in mode.columns]] labels = {"timestamp": "Date", "variable": "Side thread"} palette = sns.color_palette("colorblind", len(mode.columns)) colors = [f"rgb{tuple(256*np.array(x))}" for x in palette] fig = go.Figure() for column, color in zip(mode.columns, colors): fig.add_trace( go.Scatter( x=mode.index, y=mode[column], fill="tozeroy", mode="none", fillcolor=color, name=column, hoveron="points+fills", ) ) fig.update_yaxes(range=[0, 1], visible=False, showticklabels=False) fig.show() ``` What I'd most like to draw your attention to with this plot is the four month stretch in 2021 when `ternary` was our most popular thread, and 25k counts were made in it. This is the only time `ternary` has ever been the most popular side thread, and the reason for the rapid rise is that somebody had decided to push the thread hard to reach an extra digit. Once that was accomplished, the activity declined to basically where it was before. I was originally going to include some deeper analysis of `tug of war` in this post as well, but cleaning up that data is going to take a lot longer than I thought, so that's all for now! If you have any suggestions for things you'd like to see me do with the side thread data, let me know! [^1]: I'll make a separate script that logs threads to this database later, but to start with I'll populate it with all the historical counts that [u/Countletics](https://reddit.com/user/Countletics) and [u/Antichess](https://reddit.com/user/Antichess) have been nice enough to provide. [^2]: Caused by counters who were unaware of the parity goal running a couple of threads [^3]: To be precise, we'll be looking at which of the **current top 25 threads** is the most popular at any time. So if a thread was popular once, but has since dropped out of the top 25 you won't find it here.