This round of analysis is a deep dive into the distribution of counts and counters, across the thousands of threads that have been posted on r/counting over the years. I’ll be trying to answer questions like “Which number from 1-1000 has been counted by the most people?” and “Which people have a significant preference for counting odd or even numbers?”.
We’ll start off with some code to import the relevant packages and load the data.
Code for importing packages and loading data
```python
import itertools
import os
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import seaborn as sns
from IPython.display import Markdown

from rcounting import analysis, counters, graph_tools

pio.templates.default = "seaborn"
sns.set_theme()

data_directory = Path(os.getenv("COUNTING_DATA"))
db = sqlite3.connect(data_directory / "counting.sqlite")
counts = pd.read_sql(
    "select username, submission_id, position from comments where position > 0 order by timestamp",
    db,
)
counts["username"] = counts["username"].apply(counters.apply_alias)
counts["position"] = (counts["position"] - 1) % 1000 + 1
```
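The modular arithmetic in the last line of the loading code maps a comment's raw running position onto its place within a 1000-count thread. A quick spot-check in isolation (`thread_position` is just an illustrative helper, not part of the rcounting package):

```python
def thread_position(position):
    """Map a raw running position to its 1-1000 place within a thread."""
    return (position - 1) % 1000 + 1

# Positions 1-1000 map to themselves; 1001 wraps around to 1.
print(thread_position(1), thread_position(1000), thread_position(1001), thread_position(2500))
# 1 1000 1 500
```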
100% Completion
A natural place to start is to look at how many counters have counted every number from 1 to 1000. For this (and most of the other stats in this page) I’ll be restricting my attention to people who have made at least 1000 counts.
Code to find how many values each counter is missing
```python
totals = counts.groupby("username").size()
usernames = [x for x in totals[totals >= 1000].index if not counters.is_banned_counter(x)]
subset = counts.query("username in @usernames").copy()
missing_values = 1000 - subset.groupby("username")["position"].nunique()
print(f"{sum(missing_values == 0)} counters have counted every number from 1-1000")
```
80 counters have counted every number from 1-1000
We can drill a little deeper and ask how quickly each counter reached the full set. In theory it’s possible to do that after only 1000 counts, but in practice it’s extremely unlikely. Indeed, Table 1 shows that even the fastest counters needed multiple times that.
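Just how unlikely can be estimated with the classic coupon-collector result. Under the toy assumption that each count lands on a uniformly random position (real counting happens in consecutive runs, so this is only a rough guide), the expected number of counts needed to hit all 1000 positions is 1000 times the 1000th harmonic number:

```python
# Coupon-collector expectation under a uniform toy model: the expected
# number of draws needed to see all n positions is n * (1 + 1/2 + ... + 1/n).
n = 1000
harmonic = sum(1 / k for k in range(1, n + 1))
expected_counts = n * harmonic
print(round(expected_counts))  # ≈ 7485
```

So even a counter with no preferences at all would typically need about 7500 counts, consistent with the fastest real completions below.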
Code to make the table
```python
completists = [username for username in usernames if missing_values[username] == 0]
complete = subset.query("username in @completists").copy()
complete["index"] = complete.index
last_counts = complete.groupby("username").apply(
    lambda x: (x.groupby("position").head(1)).tail(1)["index"]
)
totals = counts.groupby("username").cumcount()
df = (
    last_counts.reset_index(level=1)["index"]
    .apply(lambda x: totals.loc[x])
    .sort_values()
)
headers = ["**Username**", "**Number of Counts**"]
df = pd.concat([df.tail().sort_values(ascending=False), df.head()])
Markdown(df.to_markdown(headers=headers))
```
Table 1: The fastest and slowest counters to count every number from 1 to 1000, according to how many counts they had when they reached the full set
| Username | Number of Counts |
| --- | --- |
| bluesolid | 12272 |
| KingCaspianX | 11706 |
| Removedpixel | 9979 |
| Smartstocks | 9904 |
| _zachhall | 8993 |
| noduorg | 2865 |
| Countletics | 2963 |
| MorallyGray | 3606 |
| Christmas_Missionary | 3735 |
| Sharpeye468 | 3794 |
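The chained groupby in the table code is dense, so here is a miniature of the same idea (the logic, not the verbatim code) on a made-up two-user log: keep each user's first visit to each value, treat the last of those rows as the completion point, and read off a running per-user count at that row.

```python
import pandas as pd

# Toy log: user "a" completes the set {1, 2, 3} on their 3rd count,
# user "b" repeats values and needs 5 counts.
df = pd.DataFrame({
    "username": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "position": [1, 2, 3, 1, 1, 2, 2, 3],
})
first_seen = df.groupby(["username", "position"]).head(1)  # first visit to each value
completion = first_seen.groupby("username").tail(1)        # last *new* value seen
running = df.groupby("username").cumcount() + 1            # counts made so far, per row
completion_counts = {
    u: int(running.loc[i]) for u, i in zip(completion["username"], completion.index)
}
print(completion_counts)  # {'a': 3, 'b': 5}
```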
Similarly, we can take a look at which counters are only missing a few values before they have the full set:
Code to make the table
```python
missing = [username for username in usernames if 0 < missing_values[username] <= 5]
df = (
    subset.query("username in @missing")
    .groupby("username")
    .size()
    .sort_values(ascending=False)
    .to_frame(name="n_counts")
)
df["missing_value"] = (
    subset.query("username in @missing")
    .groupby("username")["position"]
    .apply(lambda x: set(range(1, 1001)) - set(x.unique()))
)
headers = ["**Username**", "**Total Counts**", "**Missing Values**"]
Markdown(df.head(10).to_markdown(headers=headers))
```
The 10 counters with the most total counts who are still missing five or fewer values from the full set of numbers from 1 to 1000
| Username | Total Counts | Missing Values |
| --- | --- | --- |
| antechiss | 19979 | {1000, 999} |
| whit4you | 11663 | {998} |
| idunnowhy9000 | 10824 | {952, 999} |
| 949paintball | 10611 | {995, 991} |
| MetArtScroll | 9923 | {56, 54} |
| Krazeli | 8303 | {280} |
| Nekyiia | 6706 | {11, 5} |
| gordonpt8 | 5953 | {128} |
| teddymaniacc | 5788 | {379, 9, 11} |
| cob331 | 5143 | {1000} |
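The set difference at the heart of the table code is easy to check on a toy series (illustrative values, a range of 1-5 instead of 1-1000):

```python
import pandas as pd

positions = pd.Series([1, 2, 2, 4, 5])  # a toy user who never counted 3
missing = set(range(1, 6)) - set(positions.unique())
print(missing)  # {3}
```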
Efficient getters
The counts on r/counting are organised in threads of 1000 counts, and getting the last count on a thread is a bit of a prize. Some counters are very motivated by gets, and some counters are less motivated by them, but everyone is aware of them.
On average, everyone should have made 1000 counts for each get they have, but some counters have ratios that are significantly different from that. We can look at which counters have made the fewest total counts but still managed to obtain a get:
The counters with the fewest total number of counts who still have at least one get.
| Username | Counts | Gets |
| --- | --- | --- |
| thephilsnipebar | 1 | 1 |
| MeNowDealWithIt | 1 | 1 |
| apurplish | 3 | 2 |
| Hotshot2k4 | 3 | 1 |
| ItzTaken | 5 | 1 |
Looking at this list, thephilsnipebar is immediately recognisable as a counting alt. MeNowDealWithIt made significantly more than one count, but deleted their account before their counts were picked up by the script, so I have no idea how many. Hotshot2k4 seems legit: they made three counts, the first of which was the 54k get. More recently, ItzTaken got a free get that VitaminB16 left in the 4195k thread. Before that they had made two counts in the 2M era.
Code to calculate and print the <100 and <1000 getters
```python
lt_100 = (totals["counts"] < 100).sum()
lt_1000 = (totals["counts"] < 1000).sum()
s = (
    f"All in all we have ~{round(lt_100, -1)} counters with a get and less than 100 total counts "
    f"and ~{round(lt_1000, -1)} with less than 1000 counts. "
    "But again, there's a significant number of counts where I don't know the author, "
    "and a significant number of usernames that are unknown aliases."
)
Markdown(s)
```
All in all we have ~30 counters with a get and less than 100 total counts and ~190 with less than 1000 counts. But again, there’s a significant number of counts where I don’t know the author, and a significant number of usernames that are unknown aliases.
Maybe it’s better instead to look only at counters who have made at least 1000 counts. The relevant comparison would then be the ratio of counts to gets. That’s shown in Table 2, and veterans of r/counting will recognise some of the first five names as counters who like to try and snipe the get.
Table 2: The counters with at least 1000 counts who have the lowest and highest ratio of counts to gets.
| Username | Counts | Gets | Counts / Gets |
| --- | --- | --- | --- |
| boxofkangaroos | 3704 | 38 | 97 |
| Maniac_34 | 10870 | 70 | 155 |
| Sharpeye468 | 6877 | 31 | 222 |
| Damnified | 1566 | 7 | 224 |
| Unknow3n | 1688 | 7 | 241 |
| a-username-for-me | 10465 | 2 | 5232 |
| idunnowhy9000 | 10824 | 2 | 5412 |
| zvrizdinus | 5746 | 1 | 5746 |
| whit4you | 11663 | 2 | 5832 |
| 949paintball | 10611 | 1 | 10611 |
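To get a feel for how extreme the bottom of Table 2 is: under a toy model where every count independently has a 1-in-1000 chance of being the get, a counter's gets are approximately Poisson with mean counts/1000. The total below is taken from the table; the independence assumption is of course wrong (gets are sniped, not drawn at random), so this is illustrative only.

```python
import math

counts_made = 10465                 # a-username-for-me's total, from Table 2
lam = counts_made / 1000            # expected gets under the toy model
# Probability of ending up with 2 or fewer gets despite ~10.5 expected
p = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(3))
print(f"{p:.4f}")  # well under 1%
```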
The Overall Counting Distribution
I promised to write about a distribution of counts, and so far I’ve mainly written about what numbers individual counters have or have not counted. And there hasn’t been a single graph yet! But I promise that’s about to change. A fun thing to look at first is how many people have counted each number from 1 to 1000. It wouldn’t be far-fetched to assume that each number had been counted by roughly the same number of people, but that’s not at all what happens. Figure 1 has the details.
Code to make the plot
```python
aggregated = counts.groupby("position")["username"].nunique().to_frame()
labels = {"position": "Thread position", "username": "Number of different counters"}
fig = px.line(aggregated.reset_index(), x="position", y="username", labels=labels)
fig.update_xaxes(range=[0, 1000])
fig.show()
```
We can also look at the counting distributions for individual counters. Again, the default assumption would be that everybody has counted each number roughly the same number of times. Not too surprisingly, we see that this assumption holds better for some counters than it does for others. Figure 2 shows the counting distributions for the most and least regular counter, and you can really see the difference between the two.
The graph has been split into odds and evens, because there’s generally a consistent difference between those two series. Intuitively, that makes sense, since most counts are made in runs where a given user makes every second count. It’s therefore not too strange that the behaviour at a value \(n\) is more similar to that at \(n - 2\) than at \(n - 1\).
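The run mechanics can be shown in a couple of lines: in any alternating run, each of the two participants is locked to a single parity for the whole run.

```python
run = list(range(101, 151))        # fifty consecutive counts in one run
alice, bob = run[0::2], run[1::2]  # the two counters simply alternate
assert all(x % 2 == 1 for x in alice)  # Alice only ever lands on odd numbers here
assert all(x % 2 == 0 for x in bob)    # Bob only ever lands on even numbers
```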
Code to plot the distributions of the most and least regular counters
We can quantify the difference for each counter through the Coefficient of Variation, which expresses how far their counting distribution is from uniform. Here’s a table of the five most and five least regular counters:
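Concretely, the Coefficient of Variation is just the standard deviation of the per-number frequencies divided by their mean, so a perfectly regular counter scores 0. A quick synthetic check with made-up frequency vectors (not real counter data):

```python
import numpy as np

rng = np.random.default_rng(0)
flat = np.full(1000, 5.0)            # counted every number exactly 5 times
lumpy = rng.exponential(5.0, 1000)   # wildly uneven per-number frequencies

def cv(x):
    return x.std() / x.mean() * 100  # Coefficient of Variation, in percent

print(cv(flat))          # 0.0 for the perfectly regular counter
print(round(cv(lumpy)))  # close to 100 for exponentially distributed frequencies
```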
Of course, we saw from Figure 1 that there’s significant variation in how many people have counted each number, so perhaps the uniform distribution is a bad model for how often we should expect each counter to have counted a particular number. Indeed, since twice as many people have counted the number 50 as have counted 1000, then **on average** people who have counted 1000 have done so twice as often as people who have counted 50. That leads to a model distribution that goes as \(f(n) \propto \frac{1}{\textrm{number of people who have counted n}}\). We can again look through all the counters and see who has a counting distribution closest to this ideal:
It’s impressive just how closely david’s counting frequency matches the toy model I suggested above!
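To make the toy model concrete with made-up numbers: if 60 people have counted position 50 but only 30 have counted position 1000, the model assigns position 1000 twice the relative frequency of position 50.

```python
import numpy as np

# Hypothetical "how many people have counted n" tallies for three positions
people = np.array([60.0, 40.0, 30.0])
f = 1 / people      # model: frequency inversely proportional to popularity
f = f / f.mean()    # normalize so a uniform distribution would sit at 1.0
print(f)            # the least-counted position gets the highest relative frequency
print(f[2] / f[0])  # exactly 2.0: half the people, twice the frequency
```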
Odds and evens
In the previous section we saw that for some counters, there’s a significant difference between how they’ve counted the odd numbers, and how they’ve counted the even numbers. The difference is not unexpected, since the nature of counting means that for any given run you’ll be stuck on either the odd numbers or the even numbers.
It is striking though just how large the difference can be for some counters, so here’s a table of the most odd counters, the most even counters and the most balanced counters:
Table 3: Three sets of counters, organised by parity: Those with most odd counts, those with most even counts, and those who are closest to being perfectly balanced.
| Username | n_even | n_odd | Difference | Relative Difference [%] |
| --- | --- | --- | --- | --- |
| thephilsblogbar2 | 207927 | 259213 | -51286 | -10.98 |
| Smartstocks | 85517 | 116357 | -30840 | -15.28 |
| colby6666 | 13575 | 27755 | -14180 | -34.31 |
| Trial-Name | 34762 | 48318 | -13556 | -16.32 |
| rschaosid | 19358 | 27815 | -8457 | -17.93 |
| GarlicoinAccount | 113830 | 94681 | 19149 | 9.18 |
| Countletics | 241304 | 225320 | 15984 | 3.43 |
| LeMinerWithCheese | 27467 | 13503 | 13964 | 34.08 |
| ClockButTakeOutTheL | 63126 | 50520 | 12606 | 11.09 |
| Antichess | 169789 | 157584 | 12205 | 3.73 |
| Ynax | 7613 | 7615 | -2 | -0.01 |
| cob331 | 2572 | 2571 | 1 | 0.02 |
| supersammy00 | 7886 | 7900 | -14 | -0.09 |
| nonsensy | 85604 | 85441 | 163 | 0.10 |
| animus777 | 819 | 817 | 2 | 0.12 |
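How seriously should we take the balanced end of Table 3? If every count were an independent fair coin flip for parity, the even-odd difference would scale like the square root of the total. Runs make consecutive counts strongly dependent, so the real spread is much larger than this naive model suggests; the figures below come from the table, and the calculation is only a rough sketch.

```python
import math

def parity_z(n_even, n_odd):
    """Even-odd difference in units of the independent-coin-flip standard deviation."""
    return (n_even - n_odd) / math.sqrt(n_even + n_odd)

print(parity_z(7613, 7615))      # Ynax: tiny, indistinguishable from chance
print(parity_z(207927, 259213))  # thephilsblogbar2: huge even under the naive model
```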
That’s all for now!
Source Code
---
title: "The Distribution of Counts or: How I Learned to Stop Worrying and Love Pandas Indexing"
---

# Introduction

This round of analysis is a deep dive into the distribution of counts and counters, across the thousands of threads that have been posted on `r/counting` over the years. I'll be trying to answer questions like "Which number from 1-1000 has been counted by the most people?" and "Which people have a significant preference for counting odd or even numbers?".

We'll start off with some code to import the relevant packages and load the data.

```{python}
# | code-summary: "Code for importing packages and loading data"
import itertools
import os
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import seaborn as sns
from IPython.display import Markdown

from rcounting import analysis, counters, graph_tools

pio.templates.default = "seaborn"
sns.set_theme()

data_directory = Path(os.getenv("COUNTING_DATA"))
db = sqlite3.connect(data_directory / "counting.sqlite")
counts = pd.read_sql(
    "select username, submission_id, position from comments where position > 0 order by timestamp",
    db,
)
counts["username"] = counts["username"].apply(counters.apply_alias)
counts["position"] = (counts["position"] - 1) % 1000 + 1
```

# 100% Completion

A natural place to start is to look at how many counters have counted every number from 1 to 1000. For this (and most of the other stats in this page) I'll be restricting my attention to people who have made at least 1000 counts.

```{python}
# | code-summary: "Code to find how many values each counter is missing"
totals = counts.groupby("username").size()
usernames = [x for x in totals[totals >= 1000].index if not counters.is_banned_counter(x)]
subset = counts.query("username in @usernames").copy()
missing_values = 1000 - subset.groupby("username")["position"].nunique()
print(f"{sum(missing_values == 0)} counters have counted every number from 1-1000")
```

We can drill a little deeper and ask how quickly each counter reached the full set. In theory it's possible to do that after only 1000 counts, but in practice it's extremely unlikely. Indeed, @tbl-complete shows that even the fastest counters needed multiple times that.

```{python}
# | code-summary: "Code to make the table"
# | label: tbl-complete
# | tbl-cap: The fastest and slowest counters to count every number from 1 to 1000, according to how many counts they had when they reached the full set
completists = [username for username in usernames if missing_values[username] == 0]
complete = subset.query("username in @completists").copy()
complete["index"] = complete.index
last_counts = complete.groupby("username").apply(
    lambda x: (x.groupby("position").head(1)).tail(1)["index"]
)
totals = counts.groupby("username").cumcount()
df = (
    last_counts.reset_index(level=1)["index"]
    .apply(lambda x: totals.loc[x])
    .sort_values()
)
headers = ["**Username**", "**Number of Counts**"]
df = pd.concat([df.tail().sort_values(ascending=False), df.head()])
Markdown(df.to_markdown(headers=headers))
```

Similarly, we can take a look at which counters are only missing a few values before they have the full set:

```{python}
# | code-summary: "Code to make the table"
# | tbl-cap: The 10 counters with the most counts who are still missing up to five values in order to have counted the full set of numbers from 1 to 1000
missing = [username for username in usernames if 0 < missing_values[username] <= 5]
df = (
    subset.query("username in @missing")
    .groupby("username")
    .size()
    .sort_values(ascending=False)
    .to_frame(name="n_counts")
)
df["missing_value"] = (
    subset.query("username in @missing")
    .groupby("username")["position"]
    .apply(lambda x: set(range(1, 1001)) - set(x.unique()))
)
headers = ["**Username**", "**Total Counts**", "**Missing Values**"]
Markdown(df.head(10).to_markdown(headers=headers))
```

# Efficient getters

The counts on r/counting are organised in threads of 1000 counts, and getting the last count on a thread is a bit of a prize. Some counters are very motivated by gets, and some counters are less motivated by them, but everyone is aware of them.

On average, everyone should have made 1000 counts for each get they have, but some counters have ratios that are significantly different from that. We can look at which counters have made the fewest total counts but still managed to obtain a get:

```{python}
# | code-summary: "Code to make the table"
# | tbl-cap: The counters with the fewest total number of counts who still have at least one get.
getters = counts.groupby("submission_id").last()
gets = getters.reset_index()["username"].value_counts()
gets.name = "n_gets"
totals = pd.concat([counts["username"].value_counts().loc[gets.index], gets], axis=1)
totals.columns = ["counts", "gets"]
headers = ["**Username**", "**Counts**", "**Gets**"]
Markdown(totals.sort_values(by="counts").head().to_markdown(headers=headers))
```

Looking at this list, `thephilsnipebar`{.verbatim} is immediately recognisable as a counting alt. `MeNowDealWithIt`{.verbatim} made significantly more than one count, but deleted their account before their counts were picked up by the script, so I have no idea how many. `Hotshot2k4`{.verbatim} seems legit: they made three counts, the first of which was the 54k get. More recently, `ItzTaken`{.verbatim} got a [free get](http://reddit.com/r/counting/comments/mlqtr1/_/gtobrvf?context=3) that `VitaminB16`{.verbatim} left in the 4195k thread. Before that they had made two counts in the 2M era.

```{python}
# | code-summary: "Code to calculate and print the <100 and <1000 getters"
lt_100 = (totals["counts"] < 100).sum()
lt_1000 = (totals["counts"] < 1000).sum()
s = (
    f"All in all we have ~{round(lt_100, -1)} counters with a get and less than 100 total counts "
    f"and ~{round(lt_1000, -1)} with less than 1000 counts. "
    "But again, there's a significant number of counts where I don't know the author, "
    "and a significant number of usernames that are unknown aliases."
)
Markdown(s)
```

Maybe it's better instead to look only at counters who have made at least 1000 counts. The relevant comparison would then be the ratio of counts to gets. That's shown in @tbl-ratio, and veterans of r/counting will recognise some of the first five names as counters who like to try and snipe the get.

```{python}
# | code-summary: "Code to make the table"
# | label: tbl-ratio
# | tbl-cap: The counters with at least 1000 counts who have the lowest and highest ratio of counts to gets.
totals = totals.loc[totals["counts"] >= 1000].copy()
totals["ratio"] = totals["counts"] / totals["gets"]
totals = totals.sort_values(by="ratio")
headers = ["**Username**", "**Counts**", "**Gets**", "**Counts / Gets**"]
df = pd.concat([totals.head(), totals.tail()])
Markdown(df.to_markdown(headers=headers, floatfmt=".0f"))
```

# The Overall Counting Distribution

I promised to write about a distribution of counts, and so far I've mainly written about what numbers individual counters have or have not counted. And there hasn't been a single graph yet! But I promise that's about to change. A fun thing to look at first is how many people have counted each number from 1 to 1000. It wouldn't be far-fetched to assume that each number had been counted by roughly the same number of people, but that's not at all what happens. @fig-distribution has the details.

```{python}
# | code-summary: "Code to make the plot"
# | label: fig-distribution
# | fig-cap: The number of people who have counted each number. You can see a very sharp rise from the start of each thread to ~50, followed by a steady decline towards the get. The most popular number has been counted by more than twice as many people as the least popular.
aggregated = counts.groupby("position")["username"].nunique().to_frame()
labels = {"position": "Thread position", "username": "Number of different counters"}
fig = px.line(aggregated.reset_index(), x="position", y="username", labels=labels)
fig.update_xaxes(range=[0, 1000])
fig.show()
```

We can also look at the counting distributions for individual counters. Again, the default assumption would be that everybody has counted each number roughly the same number of times. Not too surprisingly, we see that this assumption holds better for some counters than it does for others. @fig-regularity shows the counting distributions for the most and least regular counter, and you can really see the difference between the two.

The graph has been split into odds and evens, because there's generally a consistent difference between those two series. Intuitively, that makes sense, since most counts are made in runs where a given user makes every second count. It's therefore not too strange that the behaviour at a value $n$ is more similar to that at $n - 2$ than at $n - 1$.

```{python}
# | code-summary: "Code to plot the distributions of the most and least regular counters"
# | label: fig-regularity
# | fig-cap: The normalized number of counts made at each value for the most and least regular counters. If every number had been counted exactly the same amount of times, there would be a flat line at y=1
totals = complete[["username", "position"]].value_counts().unstack(level=0).fillna(0)
totals /= totals.mean()
variations = totals.agg("std").sort_values()
pair = totals[[variations.index[0], variations.index[-1]]].copy()
parity_columns = pair.index.to_series().apply(
    lambda x: pd.Series([["even", "odd"][x % 2], x - x % 2])
)
pair[["parity", "position"]] = parity_columns
labels = {
    "username": "Counter",
    "position": "Thread Position",
    "value": "Relative frequency",
    "parity": "Parity",
}
fig = px.line(
    pair.melt(id_vars=["parity", "position"]),
    color="username",
    line_dash="parity",
    x="position",
    y="value",
    labels=labels,
)
fig.show()
```

We can quantify the difference for each counter through the [Coefficient of Variation](https://en.wikipedia.org/wiki/Coefficient_of_variation), which expresses how far their counting distribution is from uniform. Here's a table of the five most and five least regular counters:

```{python}
# | code-summary: "Code to make the table"
headers = ["**Username**", "**Coefficient of Variation [%]**"]
df = pd.concat([100 * variations.head(), 100 * variations.tail()])
Markdown(df.to_markdown(headers=headers, floatfmt=".0f"))
```

Of course, we saw from @fig-distribution that there's a significant variation in how many people have counted each number, so perhaps the uniform distribution is a bad model for how often we should expect each counter to have counted a particular number. Indeed, since twice as many people have counted the number 50 as have counted 1000, then **on average** people who have counted 1000 have done so twice as often as people who have counted 50. That leads to a model distribution that goes as $f(n) \propto \frac{1}{\textrm{number of people who have counted n}}$. We can again look through all the counters and see who has a counting distribution closest to this ideal:

```{python}
# | code-summary: "Code to make the plot"
# | label: fig-model
# | fig-cap: The distribution of counts for the counter who most closely matches the model distribution.
distribution = 1 / aggregated["username"]
distribution = distribution / distribution.mean()
username = totals.agg(lambda x: ((x - distribution) ** 2).sum()).sort_values().index[0]
labels.update({username: "Relative frequency"})
column = totals[username].to_frame()
column[["parity", "position"]] = parity_columns
fig = px.line(
    column,
    x="position",
    y=username,
    color="parity",
    labels=labels,
    title=f"{username}'s counting distribution",
)
line = go.Scatter(
    x=distribution.index,
    y=distribution.values,
    mode="lines",
    line_color="hsl(0, 0%, 70%)",
    showlegend=False,
)
fig.add_trace(line)
fig.data = fig.data[-1:] + fig.data[:-1]
fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01))
fig.show()
```

It's impressive just how closely david's counting frequency matches the toy model I suggested above!

# Odds and evens

In the previous section we saw that for some counters, there's a significant difference between how they've counted the odd numbers, and how they've counted the even numbers. The difference is not unexpected, since the nature of counting means that for any given run you'll be stuck on either the odd numbers or the even numbers.

It is striking though just how large the difference can be for some counters, so here's a table of the most odd counters, the most even counters and the most balanced counters:

```{python}
# | code-summary: "Code to make the table"
# | label: tbl-parity
# | tbl-cap: "Three sets of counters, organised by parity: Those with most odd counts, those with most even counts, and those who are closest to being perfectly balanced."
counts["is_even"] = counts["position"] % 2 == 0
offsets = [
    "1gm10t",
    "7hn2tm",
    "b471wg",
    "bz6r0g",
    "d6pgni",
    "ebnh39",
    "grggc0",
    "oj50hj",
    "ob4a2h",
    "t81gug",
]
for offset in offsets:
    counts.loc[counts["submission_id"] == offset, "is_even"] = (
        1 - counts.loc[counts["submission_id"] == offset, "is_even"]
    )
counts["is_odd"] = 1 - counts["is_even"]
subset = counts.query("username in @usernames")
table = subset[["username", "is_even", "is_odd"]].groupby("username").sum()
table.columns = ["n_even", "n_odd"]
table["difference"] = table["n_even"] - table["n_odd"]
table["relative_difference"] = (
    (table["n_even"] - table["n_odd"]) / (table["n_even"] + table["n_odd"]) * 100
)
table["absolute_difference"] = abs(table["relative_difference"])
headers = [
    "**Username**",
    "**n_(even)**",
    "**n_(odd)**",
    "**Difference**",
    "**Relative Difference [%]**",
]
columns = ["n_even", "n_odd", "difference", "relative_difference"]
floatfmt = (".0f",) * len(columns) + (".2f",)
Markdown(
    pd.concat(
        [
            table.sort_values(by="difference").head(),
            table.sort_values(by="difference", ascending=False).head(),
            table.sort_values(by="absolute_difference").head(),
        ]
    )[columns].to_markdown(headers=headers, floatfmt=floatfmt)
)
```

That's all for now!