The Distribution of Counts or: How I Learned to Stop Worrying and Love Pandas Indexing

Introduction

This round of analysis is a deep dive into the distribution of counts and counters across the thousands of threads that have been posted on r/counting over the years. I’ll be trying to answer questions like “Which number from 1 to 1000 has been counted by the most people?” and “Which counters have a significant preference for odd or even numbers?”.

We’ll start off with some code to import the relevant packages and load the data.

Code for importing packages and loading data
import itertools
import os
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import seaborn as sns
from IPython.display import Markdown
from rcounting import analysis, counters, graph_tools

pio.templates.default = "seaborn"
sns.set_theme()

# Connect to the comment database and load every count, ordered by time
data_directory = Path(os.getenv("COUNTING_DATA"))
db = sqlite3.connect(data_directory / "counting.sqlite")
counts = pd.read_sql(
    "select username, submission_id, position from comments where position > 0 order by timestamp",
    db,
)
# Merge known alternate accounts into their main username
counts["username"] = counts["username"].apply(counters.apply_alias)
# Convert the overall position of each count into a 1-1000 position within its thread
counts["position"] = (counts["position"] - 1) % 1000 + 1

100% Completion

A natural place to start is to look at how many counters have counted every number from 1 to 1000. For this (and most of the other stats on this page) I’ll be restricting my attention to people who have made at least 1000 counts.

Code to find how many values each counter is missing
# Restrict to counters with at least 1000 counts who aren't banned
totals = counts.groupby("username").size()
usernames = [x for x in totals[totals >= 1000].index if not counters.is_banned_counter(x)]
subset = counts.query("username in @usernames").copy()

# For each counter, how many of the values 1-1000 have they never counted?
missing_values = 1000 - subset.groupby("username")["position"].nunique()
print(f"{sum(missing_values == 0)} counters have counted every number from 1-1000")
80 counters have counted every number from 1-1000

We can drill a little deeper and ask how quickly each counter reached the full set. In theory it’s possible to do that after only 1000 counts, but in practice it’s extremely unlikely. Indeed, Table 1 shows that even the fastest counters needed several times that many.
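
Just how unlikely? As a back-of-the-envelope sketch (my toy model, assuming each count lands on a uniformly random position, which real counting certainly doesn’t satisfy), this is the classic coupon collector problem, with an expected completion time of \(1000 \times H_{1000} \approx 7485\) counts:

Code for a toy coupon-collector simulation
# Simulate how many uniformly random counts it takes to hit every position once
rng = np.random.default_rng(42)

def counts_to_complete(n=1000):
    seen = np.zeros(n, dtype=bool)
    total = 0
    while not seen.all():
        seen[rng.integers(n)] = True
        total += 1
    return total

print(f"Average counts to complete: {np.mean([counts_to_complete() for _ in range(100)]):.0f}")

The fastest counters in Table 1 come in well under this toy estimate, presumably because runs sweep systematically through consecutive positions rather than sampling them at random.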

Code to make the table
completists = [username for username in usernames if missing_values[username] == 0]
complete = subset.query("username in @completists").copy()
complete["index"] = complete.index
# The first time each counter hit each position, in chronological order;
# the last of these is the count that completed their set
last_counts = complete.groupby("username").apply(
    lambda x: (x.groupby("position").head(1)).tail(1)["index"]
)
# Translate that comment's index into how many counts its author had made by then
totals = counts.groupby("username").cumcount()
df = (
    last_counts.reset_index(level=1)["index"]
    .apply(lambda x: totals.loc[x])
    .sort_values()
)
headers = ["**Username**", "**Number of Counts**"]
# The five slowest completists (in descending order), then the five fastest
df = pd.concat([df.tail().sort_values(ascending=False), df.head()])
Markdown(df.to_markdown(headers=headers))
Table 1: The fastest and slowest counters to count every number from 1 to 1000, according to how many counts they had made when they completed the full set
| **Username** | **Number of Counts** |
|:---|---:|
| bluesolid | 12272 |
| KingCaspianX | 11706 |
| Removedpixel | 9979 |
| Smartstocks | 9904 |
| _zachhall | 8993 |
| noduorg | 2865 |
| Countletics | 2963 |
| MorallyGray | 3606 |
| Christmas_Missionary | 3735 |
| Sharpeye468 | 3794 |

Similarly, we can take a look at which counters are only missing a few values before they have the full set:

Code to make the table
# Counters who are missing at most five values from the full set
missing = [username for username in usernames if 0 < missing_values[username] <= 5]
df = (
    subset.query("username in @missing")
    .groupby("username")
    .size()
    .sort_values(ascending=False)
    .to_frame(name="n_counts")
)
# Work out exactly which values each of them is missing
df["missing_value"] = (
    subset.query("username in @missing")
    .groupby("username")["position"]
    .apply(lambda x: set(range(1, 1001)) - set(x.unique()))
)
headers = ["**Username**", "**Total Counts**", "**Missing Values**"]
Markdown(df.head(10).to_markdown(headers=headers))
The 10 counters with the most counts who are missing at most five values from the full set of numbers from 1 to 1000
| **Username** | **Total Counts** | **Missing Values** |
|:---|---:|:---|
| antechiss | 19979 | {1000, 999} |
| whit4you | 11663 | {998} |
| idunnowhy9000 | 10824 | {952, 999} |
| 949paintball | 10611 | {995, 991} |
| MetArtScroll | 9923 | {56, 54} |
| Krazeli | 8303 | {280} |
| Nekyiia | 6706 | {11, 5} |
| gordonpt8 | 5953 | {128} |
| teddymaniacc | 5788 | {379, 9, 11} |
| cob331 | 5143 | {1000} |

Efficient getters

The counts on r/counting are organised into threads of 1000 counts, and getting the last count in a thread is a bit of a prize. Some counters are strongly motivated by gets and others less so, but everyone is aware of them.

On average, a counter should have made 1000 counts for each get they have, since every complete thread contains 1000 counts and exactly one get. Still, some counters have ratios that are significantly different from that. We can look at which counters have made the fewest total counts while still managing to obtain a get.
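
As a quick sanity check on that baseline, we can compute the overall ratio directly; it should land very close to 1000, give or take the thread that is currently in progress:

Code to check the overall counts-per-get baseline
# Every complete thread contributes 1000 counts and exactly one get
print(f"{len(counts) / counts['submission_id'].nunique():.1f} counts per get")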

Code to make the table
# The final comment in each thread is the get
getters = counts.groupby("submission_id").last()
gets = getters.reset_index()["username"].value_counts()
gets.name = "n_gets"
# Line up each getter's total number of counts with their number of gets
totals = pd.concat([counts["username"].value_counts().loc[gets.index], gets], axis=1)
totals.columns = ["counts", "gets"]
headers = ["**Username**", "**Counts**", "**Gets**"]
Markdown(totals.sort_values(by="counts").head().to_markdown(headers=headers))
The counters with the fewest total counts who still have at least one get.
| **Username** | **Counts** | **Gets** |
|:---|---:|---:|
| thephilsnipebar | 1 | 1 |
| MeNowDealWithIt | 1 | 1 |
| apurplish | 3 | 2 |
| Hotshot2k4 | 3 | 1 |
| ItzTaken | 5 | 1 |

Looking at this list, thephilsnipebar is immediately recognisable as a counting alt. MeNowDealWithIt made significantly more than one count, but deleted their account before their comments were picked up by the script, so I have no idea how many. Hotshot2k4 seems legit: they made three counts, the first of which was the 54k get. More recently, ItzTaken got a free get that VitaminB16 left in the 4195k thread; before that, they had made two counts in the 2M era.

Code to calculate and print the <100 and <1000 getters
lt_100 = (totals["counts"] < 100).sum()
lt_1000 = (totals["counts"] < 1000).sum()
s = (f"All in all we have ~{round(lt_100, -1)} counters with a get and less than 100 total counts "
     f"and ~{round(lt_1000, -1)} with less than 1000 counts. "
     "But again, there's a significant number of counts where I don't know the author, "
     "and a significant number of usernames that are unknown aliases.")
Markdown(s)

All in all we have ~30 counters with a get and less than 100 total counts and ~190 with less than 1000 counts. But again, there’s a significant number of counts where I don’t know the author, and a significant number of usernames that are unknown aliases.

Maybe it’s better instead to look only at counters who have made at least 1000 counts. The relevant comparison is then the ratio of counts to gets. That’s shown in Table 2, and veterans of r/counting will recognise some of the first five names as counters who like to try and snipe the get.

Code to make the table
totals = totals.loc[totals["counts"] >= 1000].copy()
totals["ratio"] = totals["counts"] / totals["gets"]
totals = totals.sort_values(by="ratio")
headers = ["**Username**", "**Counts**", "**Gets**", "**Counts / Gets**"]
df = pd.concat([totals.head(), totals.tail()])
Markdown(df.to_markdown(headers=headers, floatfmt=".0f"))
Table 2: The counters with at least 1000 counts who have the lowest and highest ratio of counts to gets.
| **Username** | **Counts** | **Gets** | **Counts / Gets** |
|:---|---:|---:|---:|
| boxofkangaroos | 3704 | 38 | 97 |
| Maniac_34 | 10870 | 70 | 155 |
| Sharpeye468 | 6877 | 31 | 222 |
| Damnified | 1566 | 7 | 224 |
| Unknow3n | 1688 | 7 | 241 |
| a-username-for-me | 10465 | 2 | 5232 |
| idunnowhy9000 | 10824 | 2 | 5412 |
| zvrizdinus | 5746 | 1 | 5746 |
| whit4you | 11663 | 2 | 5832 |
| 949paintball | 10611 | 1 | 10611 |

The Overall Counting Distribution

I promised to write about the distribution of counts, and so far I’ve mainly written about which numbers individual counters have or have not counted. And there hasn’t been a single graph yet! But I promise that’s about to change. A fun thing to look at first is how many people have counted each number from 1 to 1000. It wouldn’t be far-fetched to assume that each number has been counted by roughly the same number of people, but that’s not at all what happens. Figure 1 has the details.

Code to make the plot
# How many distinct people have counted each position
aggregated = counts.groupby("position")["username"].nunique().to_frame()
labels = {"position": "Thread position", "username": "Number of different counters"}
fig = px.line(
    aggregated.reset_index(), x="position", y="username", labels=labels
)
fig.update_xaxes(range=[0, 1000])
fig.show()

Figure 1: The number of people who have counted each number. You can see a very sharp rise from the start of each thread up to position ~50, followed by a steady decline towards the get. The most popular number has been counted by more than twice as many people as the least popular.

We can also look at the counting distributions of individual counters. Again, the default assumption would be that everybody has counted each number roughly the same number of times. Not too surprisingly, this assumption holds better for some counters than for others. Figure 2 shows the counting distributions of the most and the least regular counter, and you can really see the difference between the two.

The graph has been split into odds and evens, because there’s generally a consistent difference between those two series. Intuitively, that makes sense, since most counts are made in runs where a given user makes every second count. It’s therefore not too strange that the behaviour at a value \(n\) is more similar to that at \(n - 2\) than at \(n - 1\).
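
As a quick check of this intuition, we can compare the lag-1 and lag-2 autocorrelations of a single counter’s per-position frequencies; here I’m using thephilsblogbar2, a prolific counter who’ll show up again in Table 3. If runs dominate, the lag-2 correlation should come out clearly higher:

Code to compare lag-1 and lag-2 autocorrelations
# How often one counter has counted each of the 1000 positions
freqs = (
    counts.query("username == 'thephilsblogbar2'")
    .groupby("position")
    .size()
    .reindex(range(1, 1001), fill_value=0)
)
print(f"lag 1: {freqs.autocorr(lag=1):.2f}, lag 2: {freqs.autocorr(lag=2):.2f}")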

Code to plot the distributions of the most and least regular counters
# Per-position counting frequency for each completist, normalised to mean 1
totals = complete[["username", "position"]].value_counts().unstack(level=0).fillna(0)
totals /= totals.mean()
# After normalising, each column's standard deviation is its coefficient of variation
variations = totals.agg("std").sort_values()
# Plot the most regular counter (smallest variation) against the least regular one
pair = totals[[variations.index[0], variations.index[-1]]].copy()
# Split each position into its parity and the even number directly below it
parity_columns = pair.index.to_series().apply(
    lambda x: pd.Series([["even", "odd"][x % 2], x - x % 2])
)
pair[["parity", "position"]] = parity_columns
labels = {
    "username": "Counter",
    "position": "Thread Position",
    "value": "Relative frequency",
    "parity": "Parity",
}
fig = px.line(
    pair.melt(id_vars=["parity", "position"]),
    color="username",
    line_dash="parity",
    x="position",
    y="value",
    labels=labels,
)
fig.show()

Figure 2: The normalised number of counts made at each value for the most and least regular counters. If every number had been counted exactly the same number of times, there would be a flat line at y = 1.

We can quantify the difference for each counter through the coefficient of variation, which expresses how far their counting distribution is from uniform:
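
\[
\textrm{CV} = \frac{\sigma}{\mu}.
\]

Because each counter’s distribution was normalised to a mean of 1 in the code above, the standard deviation computed there is already the CV. Here’s a table of the five most and five least regular counters: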

Code to make the table
headers = ["**Username**", "**Coefficient of Variation [%]**"]
df = pd.concat([100 * variations.head(), 100 * variations.tail()])

Markdown(df.to_markdown(headers=headers, floatfmt=".0f"))
| **Username** | **Coefficient of Variation [%]** |
|:---|---:|
| Countletics | 9 |
| Ezekiel134 | 10 |
| Antichess | 10 |
| RandomRedditorWithNo | 10 |
| nonsensy | 11 |
| a-username-for-me | 46 |
| ThreeDomeHome | 48 |
| bluesolid | 52 |
| _zachhall | 52 |
| mistyskye14 | 58 |

Of course, we saw from Figure 1 that there’s significant variation in how many people have counted each number, so perhaps the uniform distribution is a bad model for how often we should expect each counter to have counted a particular number. Indeed, since twice as many people have counted the number 50 as have counted 1000, people who have counted 1000 have **on average** done so twice as often as people who have counted 50. That leads to a model distribution that goes as \(f(n) \propto \frac{1}{\textrm{number of people who have counted } n}\).
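
To spell out the reasoning: every complete thread contributes exactly one count at each position, so the total number of counts made at a position \(n\) is essentially constant. If \(k(n)\) people share those counts, the average number of counts per person is

\[
f(n) \propto \frac{\textrm{total counts at } n}{k(n)} \propto \frac{1}{k(n)}.
\]

We can again look through all the counters and see who has a counting distribution closest to this ideal: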

Code to make the plot
# The model: inversely proportional to the number of people who have counted
# each position, normalised to mean 1
distribution = 1 / aggregated["username"]
distribution = distribution / distribution.mean()
# The counter whose distribution is closest to the model, in the least-squares sense
username = totals.agg(lambda x: ((x - distribution) ** 2).sum()).sort_values().index[0]
labels.update({username: "Relative frequency"})
column = totals[username].to_frame()
column[["parity", "position"]] = parity_columns
fig = px.line(
    column,
    x="position",
    y=username,
    color="parity",
    labels=labels,
    title=f"{username}'s counting distribution",
)
# Add the model distribution as a grey reference line
line = go.Scatter(
    x=distribution.index,
    y=distribution.values,
    mode="lines",
    line_color="hsl(0, 0%, 70%)",
    showlegend=False,
)
fig.add_trace(line)
# Reorder the traces so the reference line sits behind the data
fig.data = fig.data[-1:] + fig.data[:-1]
fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01))
fig.show()

Figure 3: The distribution of counts for the counter who most closely matches the model distribution.

It’s impressive just how closely david’s counting frequency matches the toy model I suggested above!

Odds and evens

In the previous section we saw that for some counters, there’s a significant difference between how they’ve counted the odd numbers, and how they’ve counted the even numbers. The difference is not unexpected, since the nature of counting means that for any given run you’ll be stuck on either the odd numbers or the even numbers.
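
As a toy illustration (a minimal sketch with hypothetical usernames), two counters alternating through a run will each only ever see a single parity:

Code for a toy run between two alternating counters
# Two hypothetical counters, alice and bob, alternate through a ten-count run
run_counters = itertools.cycle(["alice", "bob"])
run = {n: next(run_counters) for n in range(1, 11)}
print({user: [n for n, who in run.items() if who == user] for user in ("alice", "bob")})
# {'alice': [1, 3, 5, 7, 9], 'bob': [2, 4, 6, 8, 10]}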

It is striking though just how large the difference can be for some counters, so here’s a table of the most odd counters, the most even counters and the most balanced counters:

Code to make the table
counts["is_even"] = counts["position"] % 2 == 0
offsets = [
    "1gm10t",
    "7hn2tm",
    "b471wg",
    "bz6r0g",
    "d6pgni",
    "ebnh39",
    "grggc0",
    "oj50hj",
    "ob4a2h",
    "t81gug",
]
for offset in offsets:
    counts.loc[counts["submission_id"] == offset, "is_even"] = (
        1 - counts.loc[counts["submission_id"] == offset, "is_even"]
    )
counts["is_odd"] = 1 - counts["is_even"]
subset = counts.query("username in @usernames")
table = subset[["username", "is_even", "is_odd"]].groupby("username").sum()
table.columns = ["n_even", "n_odd"]
table["difference"] = table["n_even"] - table["n_odd"]
table["relative_difference"] = (
    (table["n_even"] - table["n_odd"]) / (table["n_even"] + table["n_odd"]) * 100
)
table["absolute_difference"] = abs(table["relative_difference"])
headers = [
    "**Username**",
    "**n_(even)**",
    "**n_(odd)**",
    "**Difference**",
    "**Relative Difference [%]**",
]
columns = ["n_even", "n_odd", "difference", "relative_difference"]
floatfmt=(".0f",) * len(columns) + (".2f",)
Markdown(
    pd.concat(
        [
            table.sort_values(by="difference").head(),
            table.sort_values(by="difference", ascending=False).head(),
            table.sort_values(by="absolute_difference").head(),
        ]
    )[columns].to_markdown(headers=headers, floatfmt=floatfmt)
)
Table 3: Three sets of counters: those with the largest excess of odd counts, those with the largest excess of even counts, and those closest to perfect parity balance.
| **Username** | **n_(even)** | **n_(odd)** | **Difference** | **Relative Difference [%]** |
|:---|---:|---:|---:|---:|
| thephilsblogbar2 | 207927 | 259213 | -51286 | -10.98 |
| Smartstocks | 85517 | 116357 | -30840 | -15.28 |
| colby6666 | 13575 | 27755 | -14180 | -34.31 |
| Trial-Name | 34762 | 48318 | -13556 | -16.32 |
| rschaosid | 19358 | 27815 | -8457 | -17.93 |
| GarlicoinAccount | 113830 | 94681 | 19149 | 9.18 |
| Countletics | 241304 | 225320 | 15984 | 3.43 |
| LeMinerWithCheese | 27467 | 13503 | 13964 | 34.08 |
| ClockButTakeOutTheL | 63126 | 50520 | 12606 | 11.09 |
| Antichess | 169789 | 157584 | 12205 | 3.73 |
| Ynax | 7613 | 7615 | -2 | -0.01 |
| cob331 | 2572 | 2571 | 1 | 0.02 |
| supersammy00 | 7886 | 7900 | -14 | -0.09 |
| nonsensy | 85604 | 85441 | 163 | 0.10 |
| animus777 | 819 | 817 | 2 | 0.12 |

That’s all for now!