Analysing rcounting data

This page highlights some of the initial analysis I’ve done of the counts made on r/counting. To do this, I use the database of all counts that was put together by u/davidjl123 and u/Countletics (and others!), as modified by me. For more niche analysis, see some of the other pages in the sidebar on the left. Most of the figures and tables here have also been posted on the subreddit, but I wanted to have them in a central place. I also liked being able to show the code and the tables or figures it generates in the same document, so that people can see both. Some of the code isn’t particularly interesting, so I’ve hidden it behind a code widget; you can unfold it just by clicking.

The idea of this page is also that I’ll try and keep the analysis current as more counts come in, while the other pages might slowly grow stale.

Imports and initialization

We’ll start with some imports, after which we can connect to the database of counts.

Code for importing packages and connecting to the database
import re
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import seaborn as sns
from IPython.display import Markdown
from rcounting import analysis, counters, parsing, side_threads
from rcounting import thread_navigation as tn
from rcounting.reddit_interface import reddit

pio.templates.default = "seaborn"
sns.set_theme()

data_directory = Path("../data")

db = sqlite3.connect(data_directory / "counting.sqlite")

Then we load some data, both the counts and the gets. We convert the timestamp to a date column, and add a “replying to” column, since some of what we’ll be doing later needs it.

Code for loading counts from the counting database
counts = pd.read_sql(
    "select comments.username, comments.submission_id, comments.timestamp "
    "from comments join submissions "
    "on comments.submission_id = submissions.submission_id "
    "where comments.position > 0 "
    "order by submissions.timestamp, comments.position",
    db,
)
counts["date"] = pd.to_datetime(counts["timestamp"], unit="s")
counts["username"] = counts["username"].apply(counters.apply_alias)
counts.drop("timestamp", inplace=True, axis=1)
counts["replying_to"] = counts["username"].shift(1)
print(f"There are {len(counts)} comments logged on main")
# The "get" is the last count of each thread; basecount is the count each thread
# starts at. Threads are 1000 counts long, with the first two special-cased.
gets = counts.groupby("submission_id").last().sort_values("date").reset_index()
gets["basecount"] = (gets.index + 15) * 1000
gets.loc[[0, 1], ["basecount"]] = [0, 16691]
There are 5248939 comments logged on main

Counting progress over time

A first bit of analysis is to visualize the progress of r/counting over time. That isn’t particularly difficult to do.

Code to plot the cumulative counts over time
data = gets.set_index("date")
fig = px.line(
    data_frame=data.resample("30d")[["basecount"]].mean().reset_index(),
    x="date",
    y="basecount",
    labels={"basecount": "Count", "date": "Date"},
)
fig.show()

Figure 1: Cumulative counts made on /r/counting as a function of time

In Figure 1 you can see that the counting rate varies quite a bit over time, with significant peaks and lulls in activity. Whether or not there are active runners really changes how fast the count is progressing!

We can split the total counts into different groups according to who made them, and plot the total number of counts made over time by the top participants on rcounting.

Code to plot the hall of counters
frequency = "24h"
totals = counts.groupby("username").size().sort_values(ascending=False)
top_counters = [x for x in totals.index if not counters.is_banned_counter(x)][:25]
series = counts.loc[counts["username"].isin(top_counters), ["username", "date"]]
total = (
    pd.get_dummies(series.set_index("date")["username"])
    .resample(frequency)
    .sum()
    .cumsum()
    .melt(ignore_index=False)
    .reset_index()
)

order = list(
    total.groupby("variable")["value"].last().sort_values(ascending=False).index
)

fig = px.line(
    data_frame=total,
    x="date",
    y="value",
    line_group="variable",
    color="variable",
    category_orders={"variable": order},
    labels={"date": "Date", "variable": "Counter", "value": "Total Counts"},
)
fig.show()

Total counts vs k parts

We can try plotting thread participation vs total counts. The expectation is that generally, people who’ve made more total counts will also have counted in more threads. However, some users might have periods where they make a count every now and then but never do any runs, leading to comparatively more k_parts. On the other hand, some counters might only do runs, giving a counts-per-thread ratio of up to 500 (half of the 1000 counts in a thread, since nobody can reply to their own count).

We start by extracting the number of counts and the threads participated in, using the groupby functionality of pandas, and then we’ll make a scatter plot of how the two quantities are related. Since the most prolific counters have orders of magnitude more counts and k parts than the least prolific counters, we’ll show both axes on a log scale.

Code to plot total counts vs k parts
groups = counts.groupby("username")["submission_id"]
k_parts = groups.nunique()
hoc = groups.count()
combined = pd.concat([k_parts, hoc], axis=1)
combined.columns = ["k_parts", "total_counts"]
combined = combined.query("k_parts >= 10")
fig = px.scatter(
    data_frame=combined,
    x="k_parts",
    y="total_counts",
    opacity=0.7,
    log_x=True,
    log_y=True,
    trendline="ols",
    trendline_options=dict(log_x=True, log_y=True),
    labels={"k_parts": "Threads participated in", "total_counts": "Total Counts"},
)

fig.show()

Figure 2: The relationship between the total number of counts for each user and the number of threads they’ve participated in

You can see what that looks like in Figure 2. The line is a linear fit on the log-log plot, and it has a slope of 1.3. In this model, that means that if you double the total number of threads participated in by a user, you would expect to multiply their total counts by 2.5.
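If you want to check that number yourself, the fitted parameters can be pulled back out of the plotly figure. This is a minimal sketch, assuming the fig object from the previous block is still in scope:

results = px.get_trendline_results(fig).iloc[0]["px_fit_results"]
slope = results.params[1]  # slope of the fit in log-log space
print(f"slope = {slope:.2f}; doubling k_parts multiplies counts by {2 ** slope:.1f}")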

Number of partners and effective number of partners

As with the number of counts vs threads participated in, we can expect that different counters might have qualitatively different behaviour when it comes to how many counting partners they have, and how often they’ve counted with each one. Some counters might count a little bit with everybody, while others might run with only a few partners, and drop a count with others every now and then.

To quantify how uneven the counting distribution is we can look at the effective number of partners of each counter, and compare with the actual number of partners.
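The heavy lifting in the code below is done by analysis.effective_number_of_counters, whose implementation isn’t reproduced here. As a rough illustration of the idea, here is a minimal sketch assuming the standard inverse Simpson index definition (the helper function and example numbers are hypothetical):

def effective_number_of_partners(counts_with_each_partner):
    # Hypothetical stand-in for analysis.effective_number_of_counters:
    # the inverse Simpson index of the partner distribution
    shares = counts_with_each_partner / counts_with_each_partner.sum()
    return 1 / (shares**2).sum()

# A counter who splits 1000 counts as 800/150/50 has three partners,
# but effectively fewer than two
effective_number_of_partners(pd.Series([800, 150, 50]))  # ≈ 1.5

With a definition like this, a perfectly even split between k partners gives exactly k, and the more skewed the distribution, the smaller the result.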

Code to find the effective and actual number of counting partners
sorted_counters = counts.groupby("username").size().sort_values(ascending=False)
top_counters = [
    x for x in sorted_counters.index[:35] if not counters.is_banned_counter(x)
][:30]
top = sorted_counters.filter(items=top_counters)
df = (
    counts.loc[counts["username"].isin(top_counters)]
    .groupby(["username", "replying_to"])
    .size()
)
effective_partners = (
    df.groupby(level=0).apply(analysis.effective_number_of_counters).to_frame()
)
partners = df.groupby(level=0).count()
combined = pd.concat([top, effective_partners, partners], axis=1)
combined["HOC rank"] = range(1, len(combined) + 1)
combined.columns = ["counts", "c_eff", "c", "rank"]
combined = combined[["rank", "c", "c_eff"]]
combined.c_eff = combined.c_eff.round().astype(int)
combined.columns = ["HOC rank", "N", "N_(effective)"]
combined.index.name = "Username"
combined.head(25)
Username HOC rank N N_(effective)
thephilsblogbar2 1 1511 13
Countletics 2 551 8
Antichess 3 838 7
davidjl123 4 1593 23
GarlicoinAccount 5 389 4
Smartstocks 6 1058 29
nonsensy 7 261 3
TheNitromeFan 8 1885 33
atomicimploder 9 2471 30
ClockButTakeOutTheL 10 208 3
qwertylool 11 371 7
Ezekiel134 12 905 13
Trial-Name 13 162 4
rideride 14 689 11
RandomRedditorWithNo 15 750 22
Urbul 16 1093 30
Mooraell 17 1212 24
kdiuro13 18 848 24
qualw 19 267 10
Removedpixel 20 1198 25
TehVulpez 21 710 20
Adinida 22 299 9
rschaosid 23 335 18
colby6666 24 143 6
LeMinerWithCheese 25 89 7

We can also get the replying-to and replied-by stats for a single user.

Code to calculate reply stats for a single user
counter = "thephilsblogbar2"
nick = "phil"
subset = counts.loc[counts["username"] == counter].copy()
replied_by = counts["username"].shift(-1).loc[subset.index]
subset["replied_by"] = replied_by
result = pd.concat(
    [
        subset.groupby("replied_by").count().iloc[:, 0].sort_values(ascending=False),
        subset.groupby("replying_to").count().iloc[:, 0].sort_values(ascending=False),
    ],
    axis=1,
)
headers = ["Counting partner", f"No. of replies by {nick}", f"No. of replies to {nick}"]
Markdown(result.head(10).to_markdown(headers=headers))
Table 1: The most popular counting partners of a single user
Counting partner No. of replies by phil No. of replies to phil
GarlicoinAccount 98821 98844
ClockButTakeOutTheL 55440 55439
Countletics 35399 35482
Antichess 18736 18919
TheNitromeFan 16122 16225
atomicimploder 12439 12548
CutOnBumInBandHere9 11669 11767
Trial-Name 11592 11625
amazingpikachu_38 11492 11484
colby6666 10840 10830

Oldest counters

We can see who the oldest still-active counters are, where still-active is generously defined as “having made a count within the last six months”.

Code to find the oldest still-active counters
cutoff_date = pd.to_datetime("today") - pd.Timedelta("180d")
active_counters = (
    counts.loc[counts["date"] > cutoff_date].groupby("username").groups.keys()
)
oldest_counters = (
    counts.loc[counts["username"].isin(active_counters)]
    .groupby("username")["date"]
    .agg(["min", "max"])
)
oldest_counters = oldest_counters.sort_values("min").head(25)
headers = ["**username**", "**First Count**", "**Latest Count**"]
Markdown(oldest_counters.apply(lambda x: x.dt.date).to_markdown(headers=headers))
Table 2: The 25 currently-active counters who’ve been counting for the longest time
username First Count Latest Count
Z3F 2012-06-10 2023-04-13
CanGreenBeret 2012-06-12 2023-01-30
OverlordLork 2012-06-14 2023-02-17
949paintball 2012-06-15 2022-12-27
mrstickman 2012-07-18 2023-03-22
Bluethulhu 2012-11-04 2023-05-19
kdiuro13 2013-04-07 2023-05-06
zhige 2013-05-03 2023-03-20
falsehood 2013-06-24 2023-03-22
O_Baby_Baby 2013-12-13 2023-02-03
CutOnBumInBandHere9 2013-12-13 2023-05-30
onewhitelight 2013-12-18 2023-03-22
OreoObserver 2014-01-25 2023-01-04
D-alx 2014-02-13 2023-04-06
atomicimploder 2014-02-26 2023-05-25
dude_why_would_you 2014-02-28 2023-03-03
rideride 2014-04-07 2023-04-26
Juqu 2014-04-12 2023-02-01
Uebeltank 2014-06-05 2023-04-01
slockley 2014-06-23 2023-01-20
Blimp_Blimp 2014-06-29 2022-12-13
artbn 2014-07-06 2023-04-23
davidjl123 2014-07-16 2023-05-29
Mooraell 2015-01-04 2023-05-30
Maniac_34 2015-01-12 2023-05-30

Gets and streaks

Similarly to the oldest counters, we can see what the longest difference between a counter’s first and last get is; that’s shown in Table 3. Some counters have been active and getting gets for quite a while!

Code to find the counters with the longest get spans
Markdown(
    gets.groupby("username")
    .agg(lambda x: x.index[-1] - x.index[0])
    .iloc[:, 0]
    .sort_values(ascending=False)
    .head(10)
    .to_markdown(headers=["**Username**", "**Get span**"])
)
Table 3: The longest differences between the first and last get of r/counting users (1000s of counts)
Username Get span
atomicimploder 4833
Maniac_34 4620
origamimissile 4577
Mooraell 4563
musicbuilder 4515
Pookah 4500
FartyMcNarty 4373
TheNitromeFan 4305
Sharpeye468 4177
davidjl123 4111

We can also calculate what the longest get streaks are.

Code to find the longest get streaks
y = gets["username"]
groups = gets.groupby((y != y.shift()).cumsum())
columns = ["username", "submission_id", "basecount"]
length = 10

indices = (-groups.size()).sort_values(kind="mergesort").index
old = groups.first().loc[indices, columns]
new = groups.last().loc[indices, columns]
combined = old.join(new, rsuffix="_new")
combined = (
    combined.loc[~combined["username"].apply(counters.is_banned_counter)]
    .head(length)
    .reset_index(drop=True)
)


def old_link(row):
    return (
        f"[{int(row.basecount / 1000) + 1}K]"
        + f"(https://reddit.com/comments/{row.submission_id}/)"
    )


def new_link(row):
    return (
        f"[{int(row.basecount_new / 1000) + 1}K]"
        + f"(https://reddit.com/comments/{row.submission_id_new}/)"
    )


headers = ["Rank", "username", "First Get", "Last Get", "Streak Length"]
headers = [f"**{x}**" for x in headers]
columns = ["username", "old_link", "new_link", "streak"]
combined["old_link"] = combined.apply(old_link, axis=1)
combined["new_link"] = combined.apply(new_link, axis=1)
combined["streak"] = 1 + (combined["basecount_new"] - combined["basecount"]) // 1000
combined.index += 1
combined.index.name = "Rank"
Markdown(combined[columns].to_markdown(headers=headers))
Table 4: The longest get streaks
Rank username First Get Last Get Streak Length
1 Antichess 5093K 5115K 23
2 Countletics 2896K 2914K 19
3 LeMinerWithCheese 4570K 4584K 15
4 Countletics 5187K 5200K 14
5 davidjl123 3038K 3047K 10
6 Countletics 3091K 3100K 10
7 Countletics 3190K 3199K 10
8 Countletics 4384K 4393K 10
9 Countletics 5143K 5151K 9
10 qualw 2042K 2049K 8

The core of the extraction is the line that says groups = gets.groupby((y != y.shift()).cumsum()). Let’s unpack it:

  • y != y.shift() is True for every get made by a different counter than the one before it
  • .cumsum() sums up these True values; the net result is that each get streak is given its own unique number
  • .groupby() extracts these groups for later use

The groups are then sorted according to size, and the largest ones are shown in Table 4.
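A toy example (with made-up usernames) shows the trick in action:

s = pd.Series(["alice", "alice", "bob", "alice", "alice", "alice"])
streak_ids = (s != s.shift()).cumsum()  # [1, 1, 2, 3, 3, 3]
# Grouping by these ids recovers each streak, who made it, and how long it was
s.groupby(streak_ids).agg(["first", "size"])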