Counting counters

I’ve previously described r/counting as a collaborative incremental game, and that for me sums up the essence of counting fairly well. A natural question to ask about the game is how many people have played over the years

We’ll start of by importing the relevant packages and loading some data. Since we’re only interested in the counters in each thread, we only load those two columns from the database.

Code for importing packages and loading data

import re
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import scipy
import seaborn as sns
from IPython.display import Markdown
from rcounting import parsing

sns.set_theme()
pio.templates.default = "seaborn"
data_directory = Path("../data")

pd.options.plotting.backend = "plotly"
db = sqlite3.connect(data_directory / "counting.sqlite")

counts = pd.read_sql(
    "select counters.canonical_username as username, submission_id from comments "
    " join counters on comments.username=counters.username "
    "where comments.position > 0 and submission_id != 'uuikz' order by timestamp",
    db,
)
submissions = pd.read_sql("select * from submissions", db)


def format_title(row):
    return (
        f"[{row.title}](http://www.reddit.com/r/counting/comments/{row.submission_id})"
    )


submissions["link"] = submissions.apply(format_title, axis=1)

Now finding the total number of counters is easy

Code

counts["username"].nunique()

That’s more than I was expecting!

The number of counters in each thread

The counts in r/counting are split into threads of 1000 counts each, and in principle it should be possible to have a thread with 1000 different counters participating. That’s never happened, especially since most counts are made as part of a series of replies between just two users. Still, it might be interesting to see which threads had the most counters taking part:

Code

levels = counts.groupby(['submission_id', 'username'], sort=False).size()
top = levels.groupby(level=0, sort=False).size().sort_values(ascending=False).head()
top_submissions =  submissions.query("submission_id in @top.index").copy()
combined = pd.concat([top, top_submissions.set_index("submission_id")], axis=1)
Markdown(combined[["link", 0]].to_markdown(headers=["**Thread**", "**Number of counters**"], index=False))

Thread	Number of counters
100k Counting thread. We are now in the land of six digits.	189
1,103k Counting Thread	172
250K Counting Thread	148
336K Counting Thread	146
389k Counting Thread	144

Some of these threads really had a lot of participants!

On the oppositve end of the scale, we can look at the threads with fewest participants. Since you’re not allowed to reply to yourself, at least two people have to take part in each thread. We can easily see how many times that’s happened:

Code

perfect = levels.groupby(level=0, sort=False).size() == 2
perfect = perfect.loc[perfect].index
len(perfect)

So not a huge amount of times, but it’s happened. The last five threads with only two counters are

Code

perfect_500s = submissions.query("submission_id in @perfect").copy().tail().iloc[::-1]
def find_counters(submission_id):
    return pd.Series(levels.loc[submission_id].index)
perfect_500s[["first_counter", "second_counter"]] = perfect_500s["submission_id"].apply(find_counters)
Markdown(perfect_500s[["link", "first_counter", "second_counter"]].to_markdown(headers=["**Thread**", "**First Counter**", "**Second Counter**"], index=False))

Thread	First Counter	Second Counter
5078k Counting Thread	Antichess	Countletics
5079k Counting Thread	Antichess	Countletics
5,104k Counting Thread	ClockButTakeOutTheL	Antichess
5120k Counting Thread	Antichess	Countletics
5121k Counting Thread	Antichess	Countletics

We can plot the distribution of the number of counters in each thread; this is shown on Figure 1.

Code

counters = levels.groupby(level=0, sort=False).size()
fig = px.histogram(
    list(counters[counters <= 100]),
    labels={"value": "Number of Counters"},
)
fig.update_layout(showlegend=False, yaxis_title_text='Occurences')
fig.show()

Figure 1: The distribution of the number of counters participating in a thread

Effective number of counters per thread

The total number of counters that participate in a thread is an inherently noisy quantity. One person making a single count can change the total even if they make no other counts in the thread. A better way is to look at the effective number of counters taking part in a thread. The effective number takes into account how skewed the distribution of participants is. If 10 people count 100 times each in a thread, then both the actual and the effective number of counters is 10. If instead two people count 496 times each, and 8 people count once each, then the effective number of counters is 2.02, because two people made basically all the counts.

We can find the submission with the highest number of effective counters.

Code

from rcounting.analysis import effective_number_of_counters
effective_counters = levels.groupby(level=0, sort=False).apply(effective_number_of_counters)
submission_id = effective_counters.idxmax()
s = (f"The thread with the highest number of effective counters is "
     f"{submissions.query('submission_id == @submission_id')['link'].iat[0]}, "
     f"with {effective_counters.loc[submission_id]:.1f} counters.")
Markdown(s)

The thread with the highest number of effective counters is 336K Counting Thread, with 28.2 counters.

We can also compare the total and the effective number of counters

Code

total_counters = levels.groupby(level=0, sort=False).size()
merged = (pd.concat([effective_counters, total_counters], axis=1))
merged.columns = ['Effective counters', 'Actual counters']

Code

table = merged.describe().transpose()[["mean", "50%", "max"]]
Markdown(table.to_markdown(floatfmt=".1f", headers=["**Mean**", "**Median**", "**Maximum**"]))

	Mean	Median	Maximum
Effective counters	4.5	3.5	28.2
Actual counters	20.4	18.0	189.0

We can see that both the total and effective number of counters have a median that is lower than the mean, indicating that the distributions have long tails to the right. We can plot these, which is done on figure Figure 2. You can clearly see how much more spread out the actual number of counters is compared with the effective number. The effective number is really sharply peaked at 2, with 25% of the counts lying in the range 2-2.4.

Code

limits = [0, 50]
kde1 = scipy.stats.gaussian_kde(merged["Effective counters"])
kde2 = scipy.stats.gaussian_kde(merged["Actual counters"])
axis = np.linspace(*limits, 100, endpoint=False)
data = pd.DataFrame(
    {
        "Number of counters": axis,
        "Effective counters": kde1(axis),
        "Actual counters": kde2(axis),
    }
)

fig = px.line(
    data_frame=data.melt(id_vars=["Number of counters"]),
    x="Number of counters",
    y="value",
    color="variable",
    labels={"value": "Probability density", "variable": "Model"},
)
fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99))
fig.update_yaxes(range=(0, 0.28))

fig.show()

Figure 2: The distributions of the number of effective and actual counters in each thread

We can also plot how the effective and actual number of counters have evolved throughout r/counting history; this is shown on figure Figure 3. The actual and effective number of counters track each other quite closely across threads. It seems there’s been a gradual decline in the number of counters participating in each thread, but with spikes of activity. One thing I was expecting to see was clear spikes at 100k threads, since running isn’t allowed on those. And those spikes just aren’t apparent in the data.

Figure 3: How the number of effective and actual counters has changed through r/counting history, a 10-thread rolling average

We can also plot the effective number of counters as a function of the actual number of counters. You can see generally, the more actual counters there are ina thread, there more effective counters there will be, but the relationship is fairly noisy.

Code

fig = px.scatter(data_frame=merged, x="Actual counters", y="Effective counters", trendline="ols")
fig.update_traces(opacity=0.5)
fig.update_yaxes(range=(2, 25))
fig.update_xaxes(range=(0, 150))
fig.show()

--- title: "Counting counters" --- I've previously described r/counting as a collaborative incremental game, and that for me sums up the essence of counting fairly well. A natural question to ask about the game is how many people have played over the years We'll start of by importing the relevant packages and loading some data. Since we're only interested in the counters in each thread, we only load those two columns from the database. ```{python} # | code-summary: "Code for importing packages and loading data" import re import sqlite3 from pathlib import Path import matplotlib.pyplot as plt import numpy as np import pandas as pd import plotly.express as px import plotly.io as pio import scipy import seaborn as sns from IPython.display import Markdown from rcounting import parsing sns.set_theme() pio.templates.default = "seaborn" data_directory = Path("../data") pd.options.plotting.backend = "plotly" db = sqlite3.connect(data_directory / "counting.sqlite") counts = pd.read_sql( "select counters.canonical_username as username, submission_id from comments " " join counters on comments.username=counters.username " "where comments.position > 0 and submission_id != 'uuikz' order by timestamp", db, ) submissions = pd.read_sql("select * from submissions", db) def format_title(row): return ( f"[{row.title}](http://www.reddit.com/r/counting/comments/{row.submission_id})" ) submissions["link"] = submissions.apply(format_title, axis=1) ``` Now finding the total number of counters is easy ```{python} counts["username"].nunique() ``` That's more than I was expecting! # The number of counters in each thread The counts in r/counting are split into threads of 1000 counts each, and in principle it should be possible to have a thread with 1000 different counters participating. That's never happened, especially since most counts are made as part of a series of replies between just two users. Still, it might be interesting to see which threads had the most counters taking part: ```{python} levels = counts.groupby(['submission_id', 'username'], sort=False).size() top = levels.groupby(level=0, sort=False).size().sort_values(ascending=False).head() top_submissions = submissions.query("submission_id in @top.index").copy() combined = pd.concat([top, top_submissions.set_index("submission_id")], axis=1) Markdown(combined[["link", 0]].to_markdown(headers=["**Thread**", "**Number of counters**"], index=False)) ``` Some of these threads really had a lot of participants! On the oppositve end of the scale, we can look at the threads with fewest participants. Since you're not allowed to reply to yourself, at least two people have to take part in each thread. We can easily see how many times that's happened: ```{python} perfect = levels.groupby(level=0, sort=False).size() == 2 perfect = perfect.loc[perfect].index len(perfect) ``` So not a huge amount of times, but it's happened. The last five threads with only two counters are ```{python} perfect_500s = submissions.query("submission_id in @perfect").copy().tail().iloc[::-1] def find_counters(submission_id): return pd.Series(levels.loc[submission_id].index) perfect_500s[["first_counter", "second_counter"]] = perfect_500s["submission_id"].apply(find_counters) Markdown(perfect_500s[["link", "first_counter", "second_counter"]].to_markdown(headers=["**Thread**", "**First Counter**", "**Second Counter**"], index=False)) ``` We can plot the distribution of the number of counters in each thread; this is shown on @fig-counters-hist. ```{python} # | label: fig-counters-hist # | fig-cap: The distribution of the number of counters participating in a thread # | column: body counters = levels.groupby(level=0, sort=False).size() fig = px.histogram( list(counters[counters <= 100]), labels={"value": "Number of Counters"}, ) fig.update_layout(showlegend=False, yaxis_title_text='Occurences') fig.show() ``` # Effective number of counters per thread The total number of counters that participate in a thread is an inherently noisy quantity. One person making a single count can change the total even if they make no other counts in the thread. A better way is to look at the [effective](https://en.wikipedia.org/wiki/Effective_number_of_parties) number of counters taking part in a thread. The effective number takes into account how skewed the distribution of participants is. If 10 people count 100 times each in a thread, then both the actual and the effective number of counters is 10. If instead two people count 496 times each, and 8 people count once each, then the effective number of counters is 2.02, because two people made basically all the counts. We can find the submission with the highest number of effective counters. ```{python} from rcounting.analysis import effective_number_of_counters effective_counters = levels.groupby(level=0, sort=False).apply(effective_number_of_counters) submission_id = effective_counters.idxmax() s = (f"The thread with the highest number of effective counters is " f"{submissions.query('submission_id == @submission_id')['link'].iat[0]}, " f"with {effective_counters.loc[submission_id]:.1f} counters.") Markdown(s) ``` We can also compare the total and the effective number of counters ```{python} total_counters = levels.groupby(level=0, sort=False).size() merged = (pd.concat([effective_counters, total_counters], axis=1)) merged.columns = ['Effective counters', 'Actual counters'] ``` ```{python} table = merged.describe().transpose()[["mean", "50%", "max"]] Markdown(table.to_markdown(floatfmt=".1f", headers=["**Mean**", "**Median**", "**Maximum**"])) ``` We can see that both the total and effective number of counters have a median that is lower than the mean, indicating that the distributions have long tails to the right. We can plot these, which is done on figure @fig-kdes. You can clearly see how much more spread out the actual number of counters is compared with the effective number. The effective number is really sharply peaked at 2, with 25% of the counts lying in the range 2-2.4. ```{python} # | label: fig-kdes # | fig-cap: The distributions of the number of effective and actual counters in each thread # | column: body limits = [0, 50] kde1 = scipy.stats.gaussian_kde(merged["Effective counters"]) kde2 = scipy.stats.gaussian_kde(merged["Actual counters"]) axis = np.linspace(*limits, 100, endpoint=False) data = pd.DataFrame( { "Number of counters": axis, "Effective counters": kde1(axis), "Actual counters": kde2(axis), } ) fig = px.line( data_frame=data.melt(id_vars=["Number of counters"]), x="Number of counters", y="value", color="variable", labels={"value": "Probability density", "variable": "Model"}, ) fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99)) fig.update_yaxes(range=(0, 0.28)) fig.show() ``` We can also plot how the effective and actual number of counters have evolved throughout r/counting history; this is shown on figure @fig-rolling. The actual and effective number of counters track each other quite closely across threads. It seems there's been a gradual decline in the number of counters participating in each thread, but with spikes of activity. One thing I was expecting to see was clear spikes at 100k threads, since running isn't allowed on those. And those spikes just aren't apparent in the data. ```{python} #| echo: false #| label: fig-rolling #| fig-cap: How the number of effective and actual counters has changed through r/counting history, a 10-thread rolling average bar = merged.reset_index(drop=True) bar.set_index((bar.index + 15)/1000, inplace=True) data = bar.rolling(10).mean() import plotly.graph_objects as go from plotly.subplots import make_subplots # Create figure with secondary y-axis fig = make_subplots(specs=[[{"secondary_y": True}]]) # Add traces fig.add_trace( go.Scatter(x=data.index, y=data["Actual counters"], name="Actual counters"), secondary_y=False, ) fig.add_trace( go.Scatter(x=data.index, y=data["Effective counters"], name="Effective countters"), secondary_y=True, ) # Set x-axis title fig.update_xaxes(title_text="Count [millions]") fig.update_traces(opacity=0.7) fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.95)) fig.show() ``` We can also plot the effective number of counters as a function of the actual number of counters. You can see generally, the more actual counters there are ina thread, there more effective counters there will be, but the relationship is fairly noisy. ```{python} fig = px.scatter(data_frame=merged, x="Actual counters", y="Effective counters", trendline="ols") fig.update_traces(opacity=0.5) fig.update_yaxes(range=(2, 25)) fig.update_xaxes(range=(0, 150)) fig.show() ```