I’ve previously described r/counting as a collaborative incremental game, and that for me sums up the essence of counting fairly well. A natural question to ask about the game is how many people have played over the years.
We’ll start off by importing the relevant packages and loading some data. Since we’re only interested in the counters in each thread, we only load those two columns from the database.
Code for importing packages and loading data
```python
import re
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import scipy
import seaborn as sns
from IPython.display import Markdown
from rcounting import parsing

sns.set_theme()
pio.templates.default = "seaborn"

data_directory = Path("../data")
pd.options.plotting.backend = "plotly"

db = sqlite3.connect(data_directory / "counting.sqlite")
counts = pd.read_sql(
    "select counters.canonical_username as username, submission_id from comments "
    " join counters on comments.username=counters.username "
    "where comments.position > 0 and submission_id != 'uuikz' order by timestamp",
    db,
)
submissions = pd.read_sql("select * from submissions", db)


def format_title(row):
    return (
        f"[{row.title}](http://www.reddit.com/r/counting/comments/{row.submission_id})"
    )


submissions["link"] = submissions.apply(format_title, axis=1)
```
Now finding the total number of counters is easy:
```python
counts["username"].nunique()
```

15714
That’s more than I was expecting!
The number of counters in each thread
The counts in r/counting are split into threads of 1000 counts each, and in principle it should be possible to have a thread with 1000 different counters participating. That’s never happened, especially since most counts are made as part of a series of replies between just two users. Still, it might be interesting to see which threads had the most counters taking part:
Some of these threads really had a lot of participants!
On the opposite end of the scale, we can look at the threads with the fewest participants. Since you’re not allowed to reply to yourself, at least two people have to take part in each thread. We can easily see how many times that’s happened:
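A minimal sketch of both checks, run on invented toy data rather than the real database. It assumes a table with one row per count and the same `submission_id` and `username` columns as `counts`; the name `toy_counts` and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one row per count, shaped like the `counts` table.
toy_counts = pd.DataFrame(
    {
        "submission_id": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "username": ["alice", "bob", "alice", "bob", "alice", "carol", "dave", "alice"],
    }
)

# Number of distinct counters taking part in each thread.
counters_per_thread = toy_counts.groupby("submission_id")["username"].nunique()

# Threads with the most participants come first.
busiest = counters_per_thread.sort_values(ascending=False)

# Threads where exactly two people made every count.
two_person_threads = counters_per_thread[counters_per_thread == 2].index.tolist()

print(busiest.to_dict())
print(two_person_threads)
```

In this toy example thread `b` has three participants, while thread `a` is a two-person thread.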
The total number of counters that participate in a thread is an inherently noisy quantity. One person making a single count can change the total even if they make no other counts in the thread. A better way is to look at the effective number of counters taking part in a thread. The effective number takes into account how skewed the distribution of participants is. If 10 people count 100 times each in a thread, then both the actual and the effective number of counters is 10. If instead two people count 496 times each, and 8 people count once each, then the effective number of counters is 2.02, because two people made basically all the counts.
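To make the two examples concrete, here is a minimal sketch of one common definition of the effective number: the inverse of the sum of squared shares, the same idea as the effective number of parties in political science. The helper name `effective_number` is made up for this illustration, and it is an assumption, not a confirmed fact, that `effective_number_of_counters` in `rcounting.analysis` computes exactly this quantity.

```python
import numpy as np


def effective_number(counts_per_person):
    """Inverse sum of squared shares (assumed definition, see the text above)."""
    counts = np.asarray(counts_per_person, dtype=float)
    shares = counts / counts.sum()  # fraction of the thread made by each person
    return 1.0 / np.sum(shares**2)


even_thread = [100] * 10              # ten people, 100 counts each
skewed_thread = [496, 496] + [1] * 8  # two people dominate, eight count once

print(effective_number(even_thread))
print(effective_number(skewed_thread))
```

Under this definition the even thread comes out at 10 (up to floating-point rounding) and the skewed thread at roughly 2.03, so the eight one-off counters barely register.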
We can find the submission with the highest number of effective counters.
```python
from rcounting.analysis import effective_number_of_counters

# `levels` holds the number of counts per (submission_id, username) pair,
# so grouping on level 0 gives one group of per-user totals for each thread.
effective_counters = levels.groupby(level=0, sort=False).apply(effective_number_of_counters)
submission_id = effective_counters.idxmax()
s = (
    f"The thread with the highest number of effective counters is "
    f"{submissions.query('submission_id == @submission_id')['link'].iat[0]}, "
    f"with {effective_counters.loc[submission_id]:.1f} counters."
)
Markdown(s)
```
The thread with the highest number of effective counters is 336K Counting Thread, with 28.2 counters.
We can also compare the total and the effective number of counters.
We can see that both the total and effective number of counters have a median that is lower than the mean, indicating that the distributions have long tails to the right. The distributions are plotted in Figure 2. You can clearly see how much more spread out the actual number of counters is compared with the effective number. The effective number is sharply peaked at 2, with 25% of threads lying in the range 2-2.4.
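The link between "median below mean" and a long right tail is easy to see on a small made-up sample. The numbers below are invented for illustration and are not taken from the r/counting data.

```python
import numpy as np

# Invented per-thread counter totals: mostly small, with a couple of busy threads.
counters_per_thread = np.array([2, 2, 2, 2, 3, 3, 4, 5, 10, 40])

mean = counters_per_thread.mean()
median = np.median(counters_per_thread)

# A few large values drag the mean up while leaving the median untouched.
print(f"mean={mean:.1f}, median={median:.1f}")
```

Here the two busiest threads pull the mean well above the median, which is the same pattern the summary statistics show for the real threads.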
We can also plot how the effective and actual number of counters have evolved throughout r/counting history; this is shown in Figure 3. The actual and effective number of counters track each other quite closely across threads. It seems there’s been a gradual decline in the number of counters participating in each thread, but with spikes of activity. One thing I was expecting to see was clear spikes at 100k threads, since running isn’t allowed on those, but those spikes just aren’t apparent in the data.
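Per-thread numbers like these are noisy, so a plot of them is usually smoothed first. A minimal sketch of a 10-thread rolling mean with pandas, on an invented series (the name `per_thread` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical per-thread series: 20 threads with steadily rising participation.
per_thread = pd.Series(range(1, 21), dtype=float)

# 10-thread rolling mean; the first 9 entries are NaN because the window is incomplete.
smoothed = per_thread.rolling(10).mean()

print(smoothed.iloc[9])   # mean of threads 1 to 10
print(smoothed.iloc[-1])  # mean of threads 11 to 20
```

A wider window gives a smoother curve but also flattens short spikes, which is worth keeping in mind when looking for single-thread events such as the 100k threads.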
We can also plot the effective number of counters as a function of the actual number of counters. You can see that, in general, the more actual counters there are in a thread, the more effective counters there will be, but the relationship is fairly noisy.
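One way to put a number on "positive but noisy" is a straight-line least-squares fit together with a correlation coefficient. The sketch below uses `numpy.polyfit` and `numpy.corrcoef` on invented values; the arrays are hypothetical and stand in for the real per-thread data.

```python
import numpy as np

# Invented (actual, effective) pairs for six hypothetical threads.
actual = np.array([2, 5, 10, 20, 40, 80], dtype=float)
effective = np.array([2.0, 2.5, 3.5, 3.0, 6.0, 9.0])

# Least-squares line: effective is approximately slope * actual + intercept.
slope, intercept = np.polyfit(actual, effective, deg=1)

# Pearson correlation between the two quantities.
correlation = np.corrcoef(actual, effective)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.2f}, r={correlation:.2f}")
```

A positive slope with a correlation noticeably below 1 would match the picture described above: more participants generally means more effective counters, but with plenty of scatter.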
---
title: "Counting counters"
---

I've previously described r/counting as a collaborative incremental game, and that for me sums up the essence of counting fairly well. A natural question to ask about the game is how many people have played over the years.

We'll start off by importing the relevant packages and loading some data. Since we're only interested in the counters in each thread, we only load those two columns from the database.

```{python}
# | code-summary: "Code for importing packages and loading data"
import re
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import scipy
import seaborn as sns
from IPython.display import Markdown
from rcounting import parsing

sns.set_theme()
pio.templates.default = "seaborn"

data_directory = Path("../data")
pd.options.plotting.backend = "plotly"

db = sqlite3.connect(data_directory / "counting.sqlite")
counts = pd.read_sql(
    "select counters.canonical_username as username, submission_id from comments "
    " join counters on comments.username=counters.username "
    "where comments.position > 0 and submission_id != 'uuikz' order by timestamp",
    db,
)
submissions = pd.read_sql("select * from submissions", db)


def format_title(row):
    return (
        f"[{row.title}](http://www.reddit.com/r/counting/comments/{row.submission_id})"
    )


submissions["link"] = submissions.apply(format_title, axis=1)
```

Now finding the total number of counters is easy:

```{python}
counts["username"].nunique()
```

That's more than I was expecting!

# The number of counters in each thread

The counts in r/counting are split into threads of 1000 counts each, and in principle it should be possible to have a thread with 1000 different counters participating. That's never happened, especially since most counts are made as part of a series of replies between just two users. Still, it might be interesting to see which threads had the most counters taking part:

```{python}
levels = counts.groupby(['submission_id', 'username'], sort=False).size()
top = levels.groupby(level=0, sort=False).size().sort_values(ascending=False).head()
top_submissions = submissions.query("submission_id in @top.index").copy()
combined = pd.concat([top, top_submissions.set_index("submission_id")], axis=1)
Markdown(combined[["link", 0]].to_markdown(headers=["**Thread**", "**Number of counters**"], index=False))
```

Some of these threads really had a lot of participants!

On the opposite end of the scale, we can look at the threads with the fewest participants. Since you're not allowed to reply to yourself, at least two people have to take part in each thread. We can easily see how many times that's happened:

```{python}
perfect = levels.groupby(level=0, sort=False).size() == 2
perfect = perfect.loc[perfect].index
len(perfect)
```

So not a huge number of times, but it's happened. The last five threads with only two counters are:

```{python}
perfect_500s = submissions.query("submission_id in @perfect").copy().tail().iloc[::-1]


def find_counters(submission_id):
    return pd.Series(levels.loc[submission_id].index)


perfect_500s[["first_counter", "second_counter"]] = perfect_500s["submission_id"].apply(find_counters)
Markdown(perfect_500s[["link", "first_counter", "second_counter"]].to_markdown(headers=["**Thread**", "**First Counter**", "**Second Counter**"], index=False))
```

We can plot the distribution of the number of counters in each thread; this is shown in @fig-counters-hist.

```{python}
# | label: fig-counters-hist
# | fig-cap: The distribution of the number of counters participating in a thread
# | column: body
counters = levels.groupby(level=0, sort=False).size()
fig = px.histogram(
    list(counters[counters <= 100]),
    labels={"value": "Number of Counters"},
)
fig.update_layout(showlegend=False, yaxis_title_text='Occurrences')
fig.show()
```

# Effective number of counters per thread

The total number of counters that participate in a thread is an inherently noisy quantity. One person making a single count can change the total even if they make no other counts in the thread. A better way is to look at the [effective](https://en.wikipedia.org/wiki/Effective_number_of_parties) number of counters taking part in a thread. The effective number takes into account how skewed the distribution of participants is. If 10 people count 100 times each in a thread, then both the actual and the effective number of counters is 10. If instead two people count 496 times each, and 8 people count once each, then the effective number of counters is 2.02, because two people made basically all the counts.

We can find the submission with the highest number of effective counters.

```{python}
from rcounting.analysis import effective_number_of_counters

effective_counters = levels.groupby(level=0, sort=False).apply(effective_number_of_counters)
submission_id = effective_counters.idxmax()
s = (
    f"The thread with the highest number of effective counters is "
    f"{submissions.query('submission_id == @submission_id')['link'].iat[0]}, "
    f"with {effective_counters.loc[submission_id]:.1f} counters."
)
Markdown(s)
```

We can also compare the total and the effective number of counters:

```{python}
total_counters = levels.groupby(level=0, sort=False).size()
merged = (pd.concat([effective_counters, total_counters], axis=1))
merged.columns = ['Effective counters', 'Actual counters']
```

```{python}
table = merged.describe().transpose()[["mean", "50%", "max"]]
Markdown(table.to_markdown(floatfmt=".1f", headers=["**Mean**", "**Median**", "**Maximum**"]))
```

We can see that both the total and effective number of counters have a median that is lower than the mean, indicating that the distributions have long tails to the right. The distributions are plotted in @fig-kdes. You can clearly see how much more spread out the actual number of counters is compared with the effective number. The effective number is sharply peaked at 2, with 25% of threads lying in the range 2-2.4.

```{python}
# | label: fig-kdes
# | fig-cap: The distributions of the number of effective and actual counters in each thread
# | column: body
limits = [0, 50]
kde1 = scipy.stats.gaussian_kde(merged["Effective counters"])
kde2 = scipy.stats.gaussian_kde(merged["Actual counters"])
axis = np.linspace(*limits, 100, endpoint=False)
data = pd.DataFrame(
    {
        "Number of counters": axis,
        "Effective counters": kde1(axis),
        "Actual counters": kde2(axis),
    }
)
fig = px.line(
    data_frame=data.melt(id_vars=["Number of counters"]),
    x="Number of counters",
    y="value",
    color="variable",
    labels={"value": "Probability density", "variable": "Model"},
)
fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.99))
fig.update_yaxes(range=(0, 0.28))
fig.show()
```

We can also plot how the effective and actual number of counters have evolved throughout r/counting history; this is shown in @fig-rolling. The actual and effective number of counters track each other quite closely across threads. It seems there's been a gradual decline in the number of counters participating in each thread, but with spikes of activity. One thing I was expecting to see was clear spikes at 100k threads, since running isn't allowed on those, but those spikes just aren't apparent in the data.

```{python}
#| echo: false
#| label: fig-rolling
#| fig-cap: How the number of effective and actual counters has changed through r/counting history, a 10-thread rolling average
bar = merged.reset_index(drop=True)
bar.set_index((bar.index + 15) / 1000, inplace=True)
data = bar.rolling(10).mean()
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(x=data.index, y=data["Actual counters"], name="Actual counters"),
    secondary_y=False,
)
fig.add_trace(
    go.Scatter(x=data.index, y=data["Effective counters"], name="Effective counters"),
    secondary_y=True,
)

# Set x-axis title
fig.update_xaxes(title_text="Count [millions]")
fig.update_traces(opacity=0.7)
fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="right", x=0.95))
fig.show()
```

We can also plot the effective number of counters as a function of the actual number of counters. You can see that, in general, the more actual counters there are in a thread, the more effective counters there will be, but the relationship is fairly noisy.

```{python}
fig = px.scatter(data_frame=merged, x="Actual counters", y="Effective counters", trendline="ols")
fig.update_traces(opacity=0.5)
fig.update_yaxes(range=(2, 25))
fig.update_xaxes(range=(0, 150))
fig.show()
```