Analysing rcounting data

This page highlights some of the initial analysis I’ve done of the counts made on r/counting. To do this, I use the database of all counts that was put together by u/davidjl123 and u/Countletics (and others!), as modified by me. For more niche analysis, see some of the other pages in the sidebar on the left. Most of the figures and tables here have also been posted on the subreddit, but I wanted to have them in a central place. I also liked being able to show the code and the tables or figures it generates in the same document, so that people can see both. Some of the code isn’t particularly interesting, so I’ve hidden it behind a code widget; you can unfold it just by clicking.

The idea of this page is also that I’ll try and keep the analysis current as more counts come in, while the other pages might slowly grow stale.

Imports and initialization

We’ll start with some imports, after which we can connect to the database of counts.

Code for importing packages and connecting to the database
import re
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import seaborn as sns
from IPython.display import Markdown
from rcounting import analysis, counters, parsing, side_threads
from rcounting import thread_navigation as tn
from rcounting.reddit_interface import reddit

pio.templates.default = "seaborn"
sns.set_theme()

data_directory = Path("../data")

db = sqlite3.connect(data_directory / "counting.sqlite")

Then we load some data, both the counts and the gets. We convert the timestamp to a date column, and add a “replying to” column, since some of what we’ll be doing later needs it.

Code for loading counts from the counting database
counts = pd.read_sql(
    "select comments.username, comments.submission_id, comments.timestamp "
    "from comments join submissions "
    "on comments.submission_id = submissions.submission_id "
    "where comments.position > 0 "
    "order by submissions.timestamp, comments.position",
    db,
)
counts["date"] = pd.to_datetime(counts["timestamp"], unit="s")
counts["username"] = counts["username"].apply(counters.apply_alias)
counts.drop("timestamp", inplace=True, axis=1)
counts["replying_to"] = counts["username"].shift(1)
print(f"There are {len(counts)} comments logged on main")
# The "get" is the last count of each thread; basecount is the count each thread
# starts at. Threads are 1000 counts long, with the first two special-cased.
gets = counts.groupby("submission_id").last().sort_values("date").reset_index()
gets["basecount"] = (gets.index + 15) * 1000
gets.loc[[0, 1], ["basecount"]] = [0, 16691]
There are 5248939 comments logged on main

Counting progress over time

A first bit of analysis is to visualize the progress of r/counting over time. That isn’t particularly difficult to do.

Code to plot the cumulative counts over time
data = gets.set_index("date")
fig = px.line(
    data_frame=data.resample("30d")[["basecount"]].mean().reset_index(),
    x="date",
    y="basecount",
    labels={"basecount": "Count", "date": "Date"},
)
fig.show()

Figure 1: Cumulative counts made on /r/counting as a function of time

In Figure 1 you can see that the counting rate varies quite a bit over time, with significant peaks and lulls in activity. Whether or not there are active runners really changes how fast the count is progressing!

We can split the total counts into different groups according to who made them, and plot the total number of counts made over time by the top participants on rcounting.

Code to plot the hall of counters
frequency = "24h"
totals = counts.groupby("username").size().sort_values(ascending=False)
top_counters = [x for x in totals.index if not counters.is_banned_counter(x)][:25]
series = counts.loc[counts["username"].isin(top_counters), ["username", "date"]]
total = (
    pd.get_dummies(series.set_index("date")["username"])
    .resample(frequency)
    .sum()
    .cumsum()
    .melt(ignore_index=False)
    .reset_index()
)

order = list(
    total.groupby("variable")["value"].last().sort_values(ascending=False).index
)

fig = px.line(
    data_frame=total,
    x="date",
    y="value",
    line_group="variable",
    color="variable",
    category_orders={"variable": order},
    labels={"date": "Date", "variable": "Counter", "value": "Total Counts"},
)
fig.show()

Total counts vs k parts

We can try plotting thread participation vs total counts. The expectation is that generally, people who’ve made more total counts will also have counted in more threads. However, some users might have periods where they make a count every now and then but never do any runs, leading to comparatively more k_parts. On the other hand, some counters might only do runs, giving a counts-per-thread ratio of up to 500 (half of the 1000 counts in a thread, since nobody can reply to their own count).

We start by extracting the number of counts and the threads participated in, using the groupby functionality of pandas, and then we’ll make a scatter plot of how the two quantities are related. Since the most prolific counters have orders of magnitude more counts and k parts than the least prolific counters, we’ll show both axes on a log scale.

Code to plot total counts vs k parts
groups = counts.groupby("username")["submission_id"]
k_parts = groups.nunique()
hoc = groups.count()
combined = pd.concat([k_parts, hoc], axis=1)
combined.columns = ["k_parts", "total_counts"]
combined = combined.query("k_parts >= 10")
fig = px.scatter(
    data_frame=combined,
    x="k_parts",
    y="total_counts",
    opacity=0.7,
    log_x=True,
    log_y=True,
    trendline="ols",
    trendline_options=dict(log_x=True, log_y=True),
    labels={"k_parts": "Threads participated in", "total_counts": "Total Counts"},
)

fig.show()

Figure 2: The relationship between the total number of counts for each user and the number of threads they’ve participated in

You can see what that looks like in Figure 2. The line is a linear fit on the log-log plot, and it has a slope of 1.3. In this model, that means that if you double the total number of threads participated in by a user, you would expect to multiply their total counts by 2.5.
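If you want to check that number yourself, the fitted parameters can be pulled back out of the plotly figure. This is a minimal sketch, assuming the fig object from the previous block is still in scope:

results = px.get_trendline_results(fig).iloc[0]["px_fit_results"]
slope = results.params[1]  # slope of the fit in log-log space
print(f"slope = {slope:.2f}; doubling k_parts multiplies counts by {2 ** slope:.1f}")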

Number of partners and effective number of partners

As with the number of counts vs threads participated in, we can expect that different counters might have qualitatively different behaviour when it comes to how many counting partners they have, and how often they’ve counted with each one. Some counters might count a little bit with everybody, while others might run with only a few partners, and drop a count with others every now and then.

To quantify how uneven the counting distribution is we can look at the effective number of partners of each counter, and compare with the actual number of partners.
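The heavy lifting in the code below is done by analysis.effective_number_of_counters, whose implementation isn’t reproduced here. As a rough illustration of the idea, here is a minimal sketch assuming the standard inverse Simpson index definition (the helper function and example numbers are hypothetical):

def effective_number_of_partners(counts_with_each_partner):
    # Hypothetical stand-in for analysis.effective_number_of_counters:
    # the inverse Simpson index of the partner distribution
    shares = counts_with_each_partner / counts_with_each_partner.sum()
    return 1 / (shares**2).sum()

# A counter who splits 1000 counts as 800/150/50 has three partners,
# but effectively fewer than two
effective_number_of_partners(pd.Series([800, 150, 50]))  # ≈ 1.5

With a definition like this, a perfectly even split between k partners gives exactly k, and the more skewed the distribution, the smaller the result.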

Code to find the effective and actual number of counting partners
sorted_counters = counts.groupby("username").size().sort_values(ascending=False)
top_counters = [
    x for x in sorted_counters.index[:35] if not counters.is_banned_counter(x)
][:30]
top = sorted_counters.filter(items=top_counters)
df = (
    counts.loc[counts["username"].isin(top_counters)]
    .groupby(["username", "replying_to"])
    .size()
)
effective_partners = (
    df.groupby(level=0).apply(analysis.effective_number_of_counters).to_frame()
)
partners = df.groupby(level=0).count()
combined = pd.concat([top, effective_partners, partners], axis=1)
combined["HOC rank"] = range(1, len(combined) + 1)
combined.columns = ["counts", "c_eff", "c", "rank"]
combined = combined[["rank", "c", "c_eff"]]
combined.c_eff = combined.c_eff.round().astype(int)
combined.columns = ["HOC rank", "N", "N_(effective)"]
combined.index.name = "Username"
combined.head(25)
Username HOC rank N N_(effective)
thephilsblogbar2 1 1511 13
Countletics 2 551 8
Antichess 3 838 7
davidjl123 4 1593 23
GarlicoinAccount 5 389 4
Smartstocks 6 1058 29
nonsensy 7 261 3
TheNitromeFan 8 1885 33
atomicimploder 9 2471 30
ClockButTakeOutTheL 10 208 3
qwertylool 11 371 7
Ezekiel134 12 905 13
Trial-Name 13 162 4
rideride 14 689 11
RandomRedditorWithNo 15 750 22
Urbul 16 1093 30
Mooraell 17 1212 24
kdiuro13 18 848 24
qualw 19 267 10
Removedpixel 20 1198 25
TehVulpez 21 710 20
Adinida 22 299 9
rschaosid 23 335 18
colby6666 24 143 6
LeMinerWithCheese 25 89 7

We can also get the replying-to and replied-by stats for a single user.

Code to calculate reply stats for a single user
counter = "thephilsblogbar2"
nick = "phil"
subset = counts.loc[counts["username"] == counter].copy()
replied_by = counts["username"].shift(-1).loc[subset.index]
subset["replied_by"] = replied_by
result = pd.concat(
    [
        subset.groupby("replied_by").count().iloc[:, 0].sort_values(ascending=False),
        subset.groupby("replying_to").count().iloc[:, 0].sort_values(ascending=False),
    ],
    axis=1,
)
headers = ["Counting partner", f"No. of replies by {nick}", f"No. of replies to {nick}"]
Markdown(result.head(10).to_markdown(headers=headers))
Table 1: The most popular counting partners of a single user
Counting partner No. of replies by phil No. of replies to phil
GarlicoinAccount 98821 98844
ClockButTakeOutTheL 55440 55439
Countletics 35399 35482
Antichess 18736 18919
TheNitromeFan 16122 16225
atomicimploder 12439 12548
CutOnBumInBandHere9 11669 11767
Trial-Name 11592 11625
amazingpikachu_38 11492 11484
colby6666 10840 10830

Oldest counters

We can see who the oldest still-active counters are, where still-active is generously defined as “having made a count within the last six months”.

Code to find the oldest still-active counters
cutoff_date = pd.to_datetime("today") - pd.Timedelta("180d")
active_counters = (
    counts.loc[counts["date"] > cutoff_date].groupby("username").groups.keys()
)
oldest_counters = (
    counts.loc[counts["username"].isin(active_counters)]
    .groupby("username")["date"]
    .agg(["min", "max"])
)
oldest_counters = oldest_counters.sort_values("min").head(25)
headers = ["**username**", "**First Count**", "**Latest Count**"]
Markdown(oldest_counters.apply(lambda x: x.dt.date).to_markdown(headers=headers))
Table 2: The 25 currently-active counters who’ve been counting for the longest time
username First Count Latest Count
Z3F 2012-06-10 2023-04-13
CanGreenBeret 2012-06-12 2023-01-30
OverlordLork 2012-06-14 2023-02-17
949paintball 2012-06-15 2022-12-27
mrstickman 2012-07-18 2023-03-22
Bluethulhu 2012-11-04 2023-05-19
kdiuro13 2013-04-07 2023-05-06
zhige 2013-05-03 2023-03-20
falsehood 2013-06-24 2023-03-22
O_Baby_Baby 2013-12-13 2023-02-03
CutOnBumInBandHere9 2013-12-13 2023-05-30
onewhitelight 2013-12-18 2023-03-22
OreoObserver 2014-01-25 2023-01-04
D-alx 2014-02-13 2023-04-06
atomicimploder 2014-02-26 2023-05-25
dude_why_would_you 2014-02-28 2023-03-03
rideride 2014-04-07 2023-04-26
Juqu 2014-04-12 2023-02-01
Uebeltank 2014-06-05 2023-04-01
slockley 2014-06-23 2023-01-20
Blimp_Blimp 2014-06-29 2022-12-13
artbn 2014-07-06 2023-04-23
davidjl123 2014-07-16 2023-05-29
Mooraell 2015-01-04 2023-05-30
Maniac_34 2015-01-12 2023-05-30

Gets and streaks

Similarly to the oldest counters, we can see what the longest difference between a counter’s first and last get is; that’s shown in Table 3. Some counters have been active and getting gets for quite a while!

Code to find the counters with the longest get spans
Markdown(
    gets.groupby("username")
    .agg(lambda x: x.index[-1] - x.index[0])
    .iloc[:, 0]
    .sort_values(ascending=False)
    .head(10)
    .to_markdown(headers=["**Username**", "**Get span**"])
)
Table 3: The longest differences between the first and last get of r/counting users (1000s of counts)
Username Get span
atomicimploder 4833
Maniac_34 4620
origamimissile 4577
Mooraell 4563
musicbuilder 4515
Pookah 4500
FartyMcNarty 4373
TheNitromeFan 4305
Sharpeye468 4177
davidjl123 4111

We can also calculate what the longest get streaks are.

Code to find the longest get streaks
y = gets["username"]
groups = gets.groupby((y != y.shift()).cumsum())
columns = ["username", "submission_id", "basecount"]
length = 10

indices = (-groups.size()).sort_values(kind="mergesort").index
old = groups.first().loc[indices, columns]
new = groups.last().loc[indices, columns]
combined = old.join(new, rsuffix="_new")
combined = (
    combined.loc[~combined["username"].apply(counters.is_banned_counter)]
    .head(length)
    .reset_index(drop=True)
)


def old_link(row):
    return (
        f"[{int(row.basecount / 1000) + 1}K]"
        + f"(https://reddit.com/comments/{row.submission_id}/)"
    )


def new_link(row):
    return (
        f"[{int(row.basecount_new / 1000) + 1}K]"
        + f"(https://reddit.com/comments/{row.submission_id_new}/)"
    )


headers = ["Rank", "username", "First Get", "Last Get", "Streak Length"]
headers = [f"**{x}**" for x in headers]
columns = ["username", "old_link", "new_link", "streak"]
combined["old_link"] = combined.apply(old_link, axis=1)
combined["new_link"] = combined.apply(new_link, axis=1)
combined["streak"] = 1 + (combined["basecount_new"] - combined["basecount"]) // 1000
combined.index += 1
combined.index.name = "Rank"
Markdown(combined[columns].to_markdown(headers=headers))
Table 4: The longest get streaks
Rank username First Get Last Get Streak Length
1 Antichess 5093K 5115K 23
2 Countletics 2896K 2914K 19
3 LeMinerWithCheese 4570K 4584K 15
4 Countletics 5187K 5200K 14
5 davidjl123 3038K 3047K 10
6 Countletics 3091K 3100K 10
7 Countletics 3190K 3199K 10
8 Countletics 4384K 4393K 10
9 Countletics 5143K 5151K 9
10 qualw 2042K 2049K 8

The core of the extraction is the line that says groups = gets.groupby((y != y.shift()).cumsum()). Let’s unpack it:

  • y != y.shift() is True for every get made by a different counter than the one before it
  • .cumsum() sums up these True values; the net result is that each get streak is given its own unique number
  • .groupby() extracts these groups for later use

The groups are then sorted according to size, and the largest ones are shown in Table 4.
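A toy example (with made-up usernames) shows the trick in action:

s = pd.Series(["alice", "alice", "bob", "alice", "alice", "alice"])
streak_ids = (s != s.shift()).cumsum()  # [1, 1, 2, 3, 3, 3]
# Grouping by these ids recovers each streak, who made it, and how long it was
s.groupby(streak_ids).agg(["first", "size"])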