This round of analysis is a deep dive into the distribution of counts and counters, across the thousands of threads that have been posted on r/counting over the years. I’ll be trying to answer questions like “Which number from 1-1000 has been counted by the most people?” and “Which people have a significant preference for counting odd or even numbers?”.
We’ll start off with some code to import the relevant packages and load the data.
Code for importing packages and loading data
```python
import itertools
import os
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import seaborn as sns
from IPython.display import Markdown

from rcounting import analysis, counters, graph_tools

pio.templates.default = "seaborn"
sns.set_theme()

data_directory = Path(os.getenv("COUNTING_DATA"))
db = sqlite3.connect(data_directory / "counting.sqlite")
counts = pd.read_sql(
    "select username, submission_id, position from comments where position > 0 order by timestamp",
    db,
)
counts["username"] = counts["username"].apply(counters.apply_alias)
counts["position"] = (counts["position"] - 1) % 1000 + 1
```
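The modular arithmetic in the last line of the loading code maps a comment's raw running position onto its place within a 1000-count thread. A quick spot-check in isolation (`thread_position` is just an illustrative helper, not part of the rcounting package):

```python
def thread_position(position):
    """Map a raw running position to its 1-1000 place within a thread."""
    return (position - 1) % 1000 + 1

# Positions 1-1000 map to themselves; 1001 wraps around to 1.
print(thread_position(1), thread_position(1000), thread_position(1001), thread_position(2500))
# 1 1000 1 500
```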
100% Completion
A natural place to start is to look at how many counters have counted every number from 1 to 1000. For this (and most of the other stats in this page) I’ll be restricting my attention to people who have made at least 1000 counts.
Code to find how many values each counter is missing
```python
totals = counts.groupby("username").size()
usernames = [x for x in totals[totals >= 1000].index if not counters.is_banned_counter(x)]
subset = counts.query("username in @usernames").copy()
missing_values = 1000 - subset.groupby("username")["position"].nunique()
print(f"{sum(missing_values == 0)} counters have counted every number from 1-1000")
```
80 counters have counted every number from 1-1000
We can drill a little deeper and ask how quickly each counter reached the full set. In theory it’s possible to do that after only 1000 counts, but in practice it’s extremely unlikely. Indeed, Table 1 shows that even the fastest counters needed multiple times that.
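Just how unlikely can be estimated with the classic coupon-collector result. Under the toy assumption that each count lands on a uniformly random position (real counting happens in consecutive runs, so this is only a rough guide), the expected number of counts needed to hit all 1000 positions is 1000 times the 1000th harmonic number:

```python
# Coupon-collector expectation under a uniform toy model: the expected
# number of draws needed to see all n positions is n * (1 + 1/2 + ... + 1/n).
n = 1000
harmonic = sum(1 / k for k in range(1, n + 1))
expected_counts = n * harmonic
print(round(expected_counts))  # ≈ 7485
```

So even a counter with no preferences at all would typically need about 7500 counts, consistent with the fastest real completions below.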
Code to make the table
```python
completists = [username for username in usernames if missing_values[username] == 0]
complete = subset.query("username in @completists").copy()
complete["index"] = complete.index
last_counts = complete.groupby("username").apply(
    lambda x: (x.groupby("position").head(1)).tail(1)["index"]
)
totals = counts.groupby("username").cumcount()
df = (
    last_counts.reset_index(level=1)["index"]
    .apply(lambda x: totals.loc[x])
    .sort_values()
)
headers = ["**Username**", "**Number of Counts**"]
df = pd.concat([df.tail().sort_values(ascending=False), df.head()])
Markdown(df.to_markdown(headers=headers))
```
Table 1: The fastest and slowest counters to count every number from 1 to 1000, according to how many counts they had when they reached the full set
| Username | Number of Counts |
| --- | --- |
| bluesolid | 12272 |
| KingCaspianX | 11706 |
| Removedpixel | 9979 |
| Smartstocks | 9904 |
| _zachhall | 8993 |
| noduorg | 2865 |
| Countletics | 2963 |
| MorallyGray | 3606 |
| Christmas_Missionary | 3735 |
| Sharpeye468 | 3794 |
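The chained groupby in the table code is dense, so here is a miniature of the same idea (the logic, not the verbatim code) on a made-up two-user log: keep each user's first visit to each value, treat the last of those rows as the completion point, and read off a running per-user count at that row.

```python
import pandas as pd

# Toy log: user "a" completes the set {1, 2, 3} on their 3rd count,
# user "b" repeats values and needs 5 counts.
df = pd.DataFrame({
    "username": ["a", "a", "a", "b", "b", "b", "b", "b"],
    "position": [1, 2, 3, 1, 1, 2, 2, 3],
})
first_seen = df.groupby(["username", "position"]).head(1)  # first visit to each value
completion = first_seen.groupby("username").tail(1)        # last *new* value seen
running = df.groupby("username").cumcount() + 1            # counts made so far, per row
completion_counts = {
    u: int(running.loc[i]) for u, i in zip(completion["username"], completion.index)
}
print(completion_counts)  # {'a': 3, 'b': 5}
```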
Similarly, we can take a look at which counters are only missing a few values before they have the full set:
Code to make the table
```python
missing = [username for username in usernames if 0 < missing_values[username] <= 5]
df = (
    subset.query("username in @missing")
    .groupby("username")
    .size()
    .sort_values(ascending=False)
    .to_frame(name="n_counts")
)
df["missing_value"] = (
    subset.query("username in @missing")
    .groupby("username")["position"]
    .apply(lambda x: set(range(1, 1001)) - set(x.unique()))
)
headers = ["**Username**", "**Total Counts**", "**Missing Values**"]
Markdown(df.head(10).to_markdown(headers=headers))
```
The 10 counters with the most total counts who are still missing five or fewer values from the full set of numbers from 1 to 1000
| Username | Total Counts | Missing Values |
| --- | --- | --- |
| antechiss | 19979 | {1000, 999} |
| whit4you | 11663 | {998} |
| idunnowhy9000 | 10824 | {952, 999} |
| 949paintball | 10611 | {995, 991} |
| MetArtScroll | 9923 | {56, 54} |
| Krazeli | 8303 | {280} |
| Nekyiia | 6706 | {11, 5} |
| gordonpt8 | 5953 | {128} |
| teddymaniacc | 5788 | {379, 9, 11} |
| cob331 | 5143 | {1000} |
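The set difference at the heart of the table code is easy to check on a toy series (illustrative values, a range of 1-5 instead of 1-1000):

```python
import pandas as pd

positions = pd.Series([1, 2, 2, 4, 5])  # a toy user who never counted 3
missing = set(range(1, 6)) - set(positions.unique())
print(missing)  # {3}
```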
Efficient getters
The counts on r/counting are organised in threads of 1000 counts, and getting the last count on a thread is a bit of a prize. Some counters are very motivated by gets, and some counters are less motivated by them, but everyone is aware of them.
On average, everyone should have made 1000 counts for each get they have, but some counters have ratios that are significantly different from that. We can look at which counters have made the fewest total counts but still managed to obtain a get:
The counters with the fewest total number of counts who still have at least one get.
| Username | Counts | Gets |
| --- | --- | --- |
| thephilsnipebar | 1 | 1 |
| MeNowDealWithIt | 1 | 1 |
| apurplish | 3 | 2 |
| Hotshot2k4 | 3 | 1 |
| ItzTaken | 5 | 1 |
Looking at this list, thephilsnipebar is immediately recognisable as a counting alt. MeNowDealWithIt made significantly more than one count, but deleted their account before their counts were picked up by the script, so I have no idea how many. Hotshot2k4 seems legit: they made three counts, the first of which was the 54k get. More recently, ItzTaken got a free get that VitaminB16 left in the 4195k thread. Before that they had made two counts in the 2M era.
Code to calculate and print the <100 and <1000 getters
```python
lt_100 = (totals["counts"] < 100).sum()
lt_1000 = (totals["counts"] < 1000).sum()
s = (
    f"All in all we have ~{round(lt_100, -1)} counters with a get and less than 100 total counts "
    f"and ~{round(lt_1000, -1)} with less than 1000 counts. "
    "But again, there's a significant number of counts where I don't know the author, "
    "and a significant number of usernames that are unknown aliases."
)
Markdown(s)
```
All in all we have ~30 counters with a get and less than 100 total counts and ~190 with less than 1000 counts. But again, there’s a significant number of counts where I don’t know the author, and a significant number of usernames that are unknown aliases.
Maybe it’s better instead to look only at counters who have made at least 1000 counts. The relevant comparison would then be the ratio of counts to gets. That’s shown in Table 2, and veterans of r/counting will recognise some of the first five names as counters who like to try and snipe the get.
Table 2: The counters with at least 1000 counts who have the lowest and highest ratio of counts to gets.
| Username | Counts | Gets | Counts / Gets |
| --- | --- | --- | --- |
| boxofkangaroos | 3704 | 38 | 97 |
| Maniac_34 | 10870 | 70 | 155 |
| Sharpeye468 | 6877 | 31 | 222 |
| Damnified | 1566 | 7 | 224 |
| Unknow3n | 1688 | 7 | 241 |
| a-username-for-me | 10465 | 2 | 5232 |
| idunnowhy9000 | 10824 | 2 | 5412 |
| zvrizdinus | 5746 | 1 | 5746 |
| whit4you | 11663 | 2 | 5832 |
| 949paintball | 10611 | 1 | 10611 |
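To get a feel for how extreme the bottom of Table 2 is: under a toy model where every count independently has a 1-in-1000 chance of being the get, a counter's gets are approximately Poisson with mean counts/1000. The total below is taken from the table; the independence assumption is of course wrong (gets are sniped, not drawn at random), so this is illustrative only.

```python
import math

counts_made = 10465                 # a-username-for-me's total, from Table 2
lam = counts_made / 1000            # expected gets under the toy model
# Probability of ending up with 2 or fewer gets despite ~10.5 expected
p = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(3))
print(f"{p:.4f}")  # well under 1%
```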
The Overall Counting Distribution
I promised to write about a distribution of counts, and so far I’ve mainly written about what numbers individual counters have or have not counted. And there hasn’t been a single graph yet! But I promise that’s about to change. A fun thing to look at first is how many people have counted each number from 1 to 1000. It wouldn’t be far-fetched to assume that each number had been counted by roughly the same number of people, but that’s not at all what happens. Figure 1 has the details.
Code to make the plot
```python
aggregated = counts.groupby("position")["username"].nunique().to_frame()
labels = {"position": "Thread position", "username": "Number of different counters"}
fig = px.line(aggregated.reset_index(), x="position", y="username", labels=labels)
fig.update_xaxes(range=[0, 1000])
fig.show()
```
We can also look at the counting distributions for individual counters. Again, the default assumption would be that everybody has counted each number roughly the same number of times. Not too surprisingly, we see that this assumption holds better for some counters than it does for others. Figure 2 shows the counting distributions for the most and least regular counter, and you can really see the difference between the two.
The graph has been split into odds and evens, because there’s generally a consistent difference between those two series. Intuitively, that makes sense, since most counts are made in runs where a given user makes every second count. It’s therefore not too strange that the behaviour at a value \(n\) is more similar to that at \(n - 2\) than at \(n - 1\).
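The run mechanics can be shown in a couple of lines: in any alternating run, each of the two participants is locked to a single parity for the whole run.

```python
run = list(range(101, 151))        # fifty consecutive counts in one run
alice, bob = run[0::2], run[1::2]  # the two counters simply alternate
assert all(x % 2 == 1 for x in alice)  # Alice only ever lands on odd numbers here
assert all(x % 2 == 0 for x in bob)    # Bob only ever lands on even numbers
```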
Code to plot the distributions of the most and least regular counters
We can quantify the difference for each counter through the Coefficient of Variation, which expresses how far their counting distribution is from uniform. Here’s a table of the five most and five least regular counters:
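Concretely, the Coefficient of Variation is just the standard deviation of the per-number frequencies divided by their mean, so a perfectly regular counter scores 0. A quick synthetic check with made-up frequency vectors (not real counter data):

```python
import numpy as np

rng = np.random.default_rng(0)
flat = np.full(1000, 5.0)            # counted every number exactly 5 times
lumpy = rng.exponential(5.0, 1000)   # wildly uneven per-number frequencies

def cv(x):
    return x.std() / x.mean() * 100  # Coefficient of Variation, in percent

print(cv(flat))          # 0.0 for the perfectly regular counter
print(round(cv(lumpy)))  # close to 100 for exponentially distributed frequencies
```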
Of course, we saw from Figure 1 that there’s significant variation in how many people have counted each number, so perhaps the uniform distribution is a bad model for how often we should expect each counter to have counted a particular number. Indeed, since twice as many people have counted the number 50 as have counted 1000, then **on average** people who have counted 1000 have done so twice as often as people who have counted 50. That leads to a model distribution that goes as \(f(n) \propto \frac{1}{\textrm{number of people who have counted n}}\). We can again look through all the counters and see who has a counting distribution closest to this ideal:
It’s impressive just how closely david’s counting frequency matches the toy model I suggested above!
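To make the toy model concrete with made-up numbers: if 60 people have counted position 50 but only 30 have counted position 1000, the model assigns position 1000 twice the relative frequency of position 50.

```python
import numpy as np

# Hypothetical "how many people have counted n" tallies for three positions
people = np.array([60.0, 40.0, 30.0])
f = 1 / people      # model: frequency inversely proportional to popularity
f = f / f.mean()    # normalize so a uniform distribution would sit at 1.0
print(f)            # the least-counted position gets the highest relative frequency
print(f[2] / f[0])  # exactly 2.0: half the people, twice the frequency
```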
Odds and evens
In the previous section we saw that for some counters, there’s a significant difference between how they’ve counted the odd numbers, and how they’ve counted the even numbers. The difference is not unexpected, since the nature of counting means that for any given run you’ll be stuck on either the odd numbers or the even numbers.
It is striking though just how large the difference can be for some counters, so here’s a table of the most odd counters, the most even counters and the most balanced counters:
Table 3: Three sets of counters, organised by parity: Those with most odd counts, those with most even counts, and those who are closest to being perfectly balanced.
| Username | n_even | n_odd | Difference | Relative Difference [%] |
| --- | --- | --- | --- | --- |
| thephilsblogbar2 | 207927 | 259213 | -51286 | -10.98 |
| Smartstocks | 85517 | 116357 | -30840 | -15.28 |
| colby6666 | 13575 | 27755 | -14180 | -34.31 |
| Trial-Name | 34762 | 48318 | -13556 | -16.32 |
| rschaosid | 19358 | 27815 | -8457 | -17.93 |
| GarlicoinAccount | 113830 | 94681 | 19149 | 9.18 |
| Countletics | 241304 | 225320 | 15984 | 3.43 |
| LeMinerWithCheese | 27467 | 13503 | 13964 | 34.08 |
| ClockButTakeOutTheL | 63126 | 50520 | 12606 | 11.09 |
| Antichess | 169789 | 157584 | 12205 | 3.73 |
| Ynax | 7613 | 7615 | -2 | -0.01 |
| cob331 | 2572 | 2571 | 1 | 0.02 |
| supersammy00 | 7886 | 7900 | -14 | -0.09 |
| nonsensy | 85604 | 85441 | 163 | 0.10 |
| animus777 | 819 | 817 | 2 | 0.12 |
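How seriously should we take the balanced end of Table 3? If every count were an independent fair coin flip for parity, the even-odd difference would scale like the square root of the total. Runs make consecutive counts strongly dependent, so the real spread is much larger than this naive model suggests; the figures below come from the table, and the calculation is only a rough sketch.

```python
import math

def parity_z(n_even, n_odd):
    """Even-odd difference in units of the independent-coin-flip standard deviation."""
    return (n_even - n_odd) / math.sqrt(n_even + n_odd)

print(parity_z(7613, 7615))      # Ynax: tiny, indistinguishable from chance
print(parity_z(207927, 259213))  # thephilsblogbar2: huge even under the naive model
```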
That’s all for now!
Source Code
---
title: "The Distribution of Counts or: How I Learned to Stop Worrying and Love Pandas Indexing"
---

# Introduction

This round of analysis is a deep dive into the distribution of counts and counters, across the thousands of threads that have been posted on `r/counting` over the years. I'll be trying to answer questions like "Which number from 1-1000 has been counted by the most people?" and "Which people have a significant preference for counting odd or even numbers?".

We'll start off with some code to import the relevant packages and load the data.

```{python}
# | code-summary: "Code for importing packages and loading data"
import itertools
import os
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import seaborn as sns
from IPython.display import Markdown

from rcounting import analysis, counters, graph_tools

pio.templates.default = "seaborn"
sns.set_theme()

data_directory = Path(os.getenv("COUNTING_DATA"))
db = sqlite3.connect(data_directory / "counting.sqlite")
counts = pd.read_sql(
    "select username, submission_id, position from comments where position > 0 order by timestamp",
    db,
)
counts["username"] = counts["username"].apply(counters.apply_alias)
counts["position"] = (counts["position"] - 1) % 1000 + 1
```

# 100% Completion

A natural place to start is to look at how many counters have counted every number from 1 to 1000. For this (and most of the other stats in this page) I'll be restricting my attention to people who have made at least 1000 counts.

```{python}
# | code-summary: "Code to find how many values each counter is missing"
totals = counts.groupby("username").size()
usernames = [x for x in totals[totals >= 1000].index if not counters.is_banned_counter(x)]
subset = counts.query("username in @usernames").copy()
missing_values = 1000 - subset.groupby("username")["position"].nunique()
print(f"{sum(missing_values == 0)} counters have counted every number from 1-1000")
```

We can drill a little deeper and ask how quickly each counter reached the full set. In theory it's possible to do that after only 1000 counts, but in practice it's extremely unlikely. Indeed, @tbl-complete shows that even the fastest counters needed multiple times that.

```{python}
# | code-summary: "Code to make the table"
# | label: tbl-complete
# | tbl-cap: The fastest and slowest counters to count every number from 1 to 1000, according to how many counts they had when they reached the full set
completists = [username for username in usernames if missing_values[username] == 0]
complete = subset.query("username in @completists").copy()
complete["index"] = complete.index
last_counts = complete.groupby("username").apply(
    lambda x: (x.groupby("position").head(1)).tail(1)["index"]
)
totals = counts.groupby("username").cumcount()
df = (
    last_counts.reset_index(level=1)["index"]
    .apply(lambda x: totals.loc[x])
    .sort_values()
)
headers = ["**Username**", "**Number of Counts**"]
df = pd.concat([df.tail().sort_values(ascending=False), df.head()])
Markdown(df.to_markdown(headers=headers))
```

Similarly, we can take a look at which counters are only missing a few values before they have the full set:

```{python}
# | code-summary: "Code to make the table"
# | tbl-cap: The 10 counters with the most counts who are still missing up to five values in order to have counted the full set of numbers from 1 to 1000
missing = [username for username in usernames if 0 < missing_values[username] <= 5]
df = (
    subset.query("username in @missing")
    .groupby("username")
    .size()
    .sort_values(ascending=False)
    .to_frame(name="n_counts")
)
df["missing_value"] = (
    subset.query("username in @missing")
    .groupby("username")["position"]
    .apply(lambda x: set(range(1, 1001)) - set(x.unique()))
)
headers = ["**Username**", "**Total Counts**", "**Missing Values**"]
Markdown(df.head(10).to_markdown(headers=headers))
```

# Efficient getters

The counts on r/counting are organised in threads of 1000 counts, and getting the last count on a thread is a bit of a prize. Some counters are very motivated by gets, and some counters are less motivated by them, but everyone is aware of them.

On average, everyone should have made 1000 counts for each get they have, but some counters have ratios that are significantly different from that. We can look at which counters have made the fewest total counts but still managed to obtain a get:

```{python}
# | code-summary: "Code to make the table"
# | tbl-cap: The counters with the fewest total number of counts who still have at least one get.
getters = counts.groupby("submission_id").last()
gets = getters.reset_index()["username"].value_counts()
gets.name = "n_gets"
totals = pd.concat([counts["username"].value_counts().loc[gets.index], gets], axis=1)
totals.columns = ["counts", "gets"]
headers = ["**Username**", "**Counts**", "**Gets**"]
Markdown(totals.sort_values(by="counts").head().to_markdown(headers=headers))
```

Looking at this list, `thephilsnipebar`{.verbatim} is immediately recognisable as a counting alt. `MeNowDealWithIt`{.verbatim} made significantly more than one count, but deleted their account before their counts were picked up by the script, so I have no idea how many. `Hotshot2k4`{.verbatim} seems legit: they made three counts, the first of which was the 54k get. More recently, `ItzTaken`{.verbatim} got a [free get](http://reddit.com/r/counting/comments/mlqtr1/_/gtobrvf?context=3) that `VitaminB16`{.verbatim} left in the 4195k thread. Before that they had made two counts in the 2M era.

```{python}
# | code-summary: "Code to calculate and print the <100 and <1000 getters"
lt_100 = (totals["counts"] < 100).sum()
lt_1000 = (totals["counts"] < 1000).sum()
s = (
    f"All in all we have ~{round(lt_100, -1)} counters with a get and less than 100 total counts "
    f"and ~{round(lt_1000, -1)} with less than 1000 counts. "
    "But again, there's a significant number of counts where I don't know the author, "
    "and a significant number of usernames that are unknown aliases."
)
Markdown(s)
```

Maybe it's better instead to look only at counters who have made at least 1000 counts. The relevant comparison would then be the ratio of counts to gets. That's shown in @tbl-ratio, and veterans of r/counting will recognise some of the first five names as counters who like to try and snipe the get.

```{python}
# | code-summary: "Code to make the table"
# | label: tbl-ratio
# | tbl-cap: The counters with at least 1000 counts who have the lowest and highest ratio of counts to gets.
totals = totals.loc[totals["counts"] >= 1000].copy()
totals["ratio"] = totals["counts"] / totals["gets"]
totals = totals.sort_values(by="ratio")
headers = ["**Username**", "**Counts**", "**Gets**", "**Counts / Gets**"]
df = pd.concat([totals.head(), totals.tail()])
Markdown(df.to_markdown(headers=headers, floatfmt=".0f"))
```

# The Overall Counting Distribution

I promised to write about a distribution of counts, and so far I've mainly written about what numbers individual counters have or have not counted. And there hasn't been a single graph yet! But I promise that's about to change. A fun thing to look at first is how many people have counted each number from 1 to 1000. It wouldn't be far-fetched to assume that each number had been counted by roughly the same number of people, but that's not at all what happens. @fig-distribution has the details.

```{python}
# | code-summary: "Code to make the plot"
# | label: fig-distribution
# | fig-cap: The number of people who have counted each number. You can see a very sharp rise from the start of each thread to ~50, followed by a steady decline towards the get. The most popular number has been counted by more than twice as many people as the least popular.
aggregated = counts.groupby("position")["username"].nunique().to_frame()
labels = {"position": "Thread position", "username": "Number of different counters"}
fig = px.line(aggregated.reset_index(), x="position", y="username", labels=labels)
fig.update_xaxes(range=[0, 1000])
fig.show()
```

We can also look at the counting distributions for individual counters. Again, the default assumption would be that everybody has counted each number roughly the same number of times. Not too surprisingly, we see that this assumption holds better for some counters than it does for others. @fig-regularity shows the counting distributions for the most and least regular counter, and you can really see the difference between the two.

The graph has been split into odds and evens, because there's generally a consistent difference between those two series. Intuitively, that makes sense, since most counts are made in runs where a given user makes every second count. It's therefore not too strange that the behaviour at a value $n$ is more similar to that at $n - 2$ than at $n - 1$.

```{python}
# | code-summary: "Code to plot the distributions of the most and least regular counters"
# | label: fig-regularity
# | fig-cap: The normalized number of counts made at each value for the most and least regular counters. If every number had been counted exactly the same amount of times, there would be a flat line at y=1
totals = complete[["username", "position"]].value_counts().unstack(level=0).fillna(0)
totals /= totals.mean()
variations = totals.agg("std").sort_values()
pair = totals[[variations.index[0], variations.index[-1]]].copy()
parity_columns = pair.index.to_series().apply(
    lambda x: pd.Series([["even", "odd"][x % 2], x - x % 2])
)
pair[["parity", "position"]] = parity_columns
labels = {
    "username": "Counter",
    "position": "Thread Position",
    "value": "Relative frequency",
    "parity": "Parity",
}
fig = px.line(
    pair.melt(id_vars=["parity", "position"]),
    color="username",
    line_dash="parity",
    x="position",
    y="value",
    labels=labels,
)
fig.show()
```

We can quantify the difference for each counter through the [Coefficient of Variation](https://en.wikipedia.org/wiki/Coefficient_of_variation), which expresses how far their counting distribution is from uniform. Here's a table of the five most and five least regular counters:

```{python}
# | code-summary: "Code to make the table"
headers = ["**Username**", "**Coefficient of Variation [%]**"]
df = pd.concat([100 * variations.head(), 100 * variations.tail()])
Markdown(df.to_markdown(headers=headers, floatfmt=".0f"))
```

Of course, we saw from @fig-distribution that there's a significant variation in how many people have counted each number, so perhaps the uniform distribution is a bad model for how often we should expect each counter to have counted a particular number. Indeed, since twice as many people have counted the number 50 as have counted 1000, then **on average** people who have counted 1000 have done so twice as often as people who have counted 50. That leads to a model distribution that goes as $f(n) \propto \frac{1}{\textrm{number of people who have counted n}}$. We can again look through all the counters and see who has a counting distribution closest to this ideal:

```{python}
# | code-summary: "Code to make the plot"
# | label: fig-model
# | fig-cap: The distribution of counts for the counter who most closely matches the model distribution.
distribution = 1 / aggregated["username"]
distribution = distribution / distribution.mean()
username = totals.agg(lambda x: ((x - distribution) ** 2).sum()).sort_values().index[0]
labels.update({username: "Relative frequency"})
column = totals[username].to_frame()
column[["parity", "position"]] = parity_columns
fig = px.line(
    column,
    x="position",
    y=username,
    color="parity",
    labels=labels,
    title=f"{username}'s counting distribution",
)
line = go.Scatter(
    x=distribution.index,
    y=distribution.values,
    mode="lines",
    line_color="hsl(0, 0%, 70%)",
    showlegend=False,
)
fig.add_trace(line)
fig.data = fig.data[-1:] + fig.data[:-1]
fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01))
fig.show()
```

It's impressive just how closely david's counting frequency matches the toy model I suggested above!

# Odds and evens

In the previous section we saw that for some counters, there's a significant difference between how they've counted the odd numbers, and how they've counted the even numbers. The difference is not unexpected, since the nature of counting means that for any given run you'll be stuck on either the odd numbers or the even numbers.

It is striking though just how large the difference can be for some counters, so here's a table of the most odd counters, the most even counters and the most balanced counters:

```{python}
# | code-summary: "Code to make the table"
# | label: tbl-parity
# | tbl-cap: "Three sets of counters, organised by parity: Those with most odd counts, those with most even counts, and those who are closest to being perfectly balanced."
counts["is_even"] = counts["position"] % 2 == 0
offsets = [
    "1gm10t",
    "7hn2tm",
    "b471wg",
    "bz6r0g",
    "d6pgni",
    "ebnh39",
    "grggc0",
    "oj50hj",
    "ob4a2h",
    "t81gug",
]
for offset in offsets:
    counts.loc[counts["submission_id"] == offset, "is_even"] = (
        1 - counts.loc[counts["submission_id"] == offset, "is_even"]
    )
counts["is_odd"] = 1 - counts["is_even"]
subset = counts.query("username in @usernames")
table = subset[["username", "is_even", "is_odd"]].groupby("username").sum()
table.columns = ["n_even", "n_odd"]
table["difference"] = table["n_even"] - table["n_odd"]
table["relative_difference"] = (
    (table["n_even"] - table["n_odd"]) / (table["n_even"] + table["n_odd"]) * 100
)
table["absolute_difference"] = abs(table["relative_difference"])
headers = [
    "**Username**",
    "**n_(even)**",
    "**n_(odd)**",
    "**Difference**",
    "**Relative Difference [%]**",
]
columns = ["n_even", "n_odd", "difference", "relative_difference"]
floatfmt = (".0f",) * len(columns) + (".2f",)
Markdown(
    pd.concat(
        [
            table.sort_values(by="difference").head(),
            table.sort_values(by="difference", ascending=False).head(),
            table.sort_values(by="absolute_difference").head(),
        ]
    )[columns].to_markdown(headers=headers, floatfmt=floatfmt)
)
```

That's all for now!