This page highlights some of the initial analysis I’ve done of the counts made on [r/counting](reddit.com/r/counting). To do this, I use the database of all counts that was put together by u/davidjl123 and u/Countletics (and others!), as modified by me. For more niche analysis, see some of the other pages on the sidebar on the left. Most of the figures and tables here have also been posted on the subreddit, but I wanted to have them in a central place. I also liked being able to show the code and the tables or figures it generates in the same document, so that people can see both. Some of it isn’t particularly interesting, so I’ve hidden it behind a code widget. You can unfold it just by clicking.
The idea of this page is also that I’ll try and keep the analysis current as more counts come in, while the other pages might slowly grow stale.
Imports and initialization
We’ll start with some imports, after which we can connect to the database of counts.
Code for importing packages and connecting to the database
```{python}
import re
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import seaborn as sns
from IPython.display import Markdown

from rcounting import analysis, counters, parsing, side_threads
from rcounting import thread_navigation as tn
from rcounting.reddit_interface import reddit

pio.templates.default = "seaborn"
sns.set_theme()

data_directory = Path("../data")
db = sqlite3.connect(data_directory / "counting.sqlite")
```
Then we load some data, both the counts and the gets. We convert the timestamp to a date column, and add a “replying to” column, since some of what we’ll be doing later needs it.
Code for loading counts from the counting database
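```{python}
counts = pd.read_sql(
    "select comments.username, comments.submission_id, comments.timestamp "
    "from comments join submissions "
    "on comments.submission_id = submissions.submission_id "
    "where comments.position > 0 "
    "order by submissions.timestamp, comments.position",
    db,
)
counts["date"] = pd.to_datetime(counts["timestamp"], unit="s")
counts["username"] = counts["username"].apply(counters.apply_alias)
counts.drop("timestamp", inplace=True, axis=1)
counts["replying_to"] = counts["username"].shift(1)
print(f"There are {len(counts)} comments logged on main")

gets = counts.groupby("submission_id").last().sort_values("date").reset_index()
gets["basecount"] = (gets.index + 15) * 1000
gets.loc[[0, 1], ["basecount"]] = [0, 16691]
```

Counting progress over time

A first bit of analysis is to visualize the progress of r/counting over time. That isn’t particularly difficult to do.

Code to plot the cumulative counts over time

```{python}
data = gets.set_index("date")
fig = px.line(
    data_frame=data.resample("30d")[["basecount"]].mean().reset_index(),
    x="date",
    y="basecount",
    labels={"basecount": "Count", "date": "Date"},
)
fig.show()
```

Figure 1: Cumulative counts made on r/counting as a function of time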
In Figure 1 you can see that the counting rate varies quite a bit over time, with significant peaks and lulls in activity. Whether or not there are active runners really changes how fast the count is progressing!
We can split the total counts into different groups according to who made them, and plot the total number of counts made over time by the top participants on r/counting.
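Code to plot the hall of counters

```{python}
frequency = "24h"
totals = counts.groupby("username").size().sort_values(ascending=False)
top_counters = [x for x in totals.index if not counters.is_banned_counter(x)][:25]
series = counts.loc[counts["username"].isin(top_counters), ["username", "date"]]
total = (
    pd.get_dummies(series.set_index("date")["username"])
    .resample(frequency)
    .sum()
    .cumsum()
    .melt(ignore_index=False)
    .reset_index()
)
order = list(
    total.groupby("variable")["value"].last().sort_values(ascending=False).index
)
fig = px.line(
    data_frame=total,
    x="date",
    y="value",
    line_group="variable",
    color="variable",
    category_orders={"variable": order},
    labels={"date": "Date", "variable": "Counter", "value": "Total Counts"},
)
fig.show()
```

Total counts vs k parts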
We can try plotting thread participation vs total counts. The expectation is that generally, people who’ve made more total counts will also have counted in more threads. However, some users might have periods where they make a count every now and then but never do any runs, leading to comparatively more k_parts. On the other hand, some counters might only do runs, giving a counts/thread of up to 500.
We start by extracting the number of counts and the threads participated in, using the groupby functionality of pandas, and then we’ll make a scatter plot of how the two quantities are related. Since the most prolific counters have orders of magnitude more counts and k parts than the least prolific counters, we’ll show both axes on a log scale.
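Code to plot total counts vs k parts

```{python}
groups = counts.groupby("username")["submission_id"]
k_parts = groups.nunique()
hoc = groups.count()
combined = pd.concat([k_parts, hoc], axis=1)
combined.columns = ["k_parts", "total_counts"]
combined = combined.query("k_parts >= 10")
fig = px.scatter(
    data_frame=combined,
    x="k_parts",
    y="total_counts",
    opacity=0.7,
    log_x=True,
    log_y=True,
    trendline="ols",
    trendline_options=dict(log_x=True, log_y=True),
    labels={"k_parts": "Threads participated in", "total_counts": "Total Counts"},
)
fig.show()
```

Figure 2: The relationship between the total number of counts for each user and the number of threads they’ve participated in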
You can see what that looks like in Figure 2. The line is a linear fit on the log-log plot, and it has a slope of 1.3. In this model, that means that if you double the total number of threads participated in by a user, you would expect to multiply their total counts by $2^{1.3} \approx 2.5$.
Number of partners and effective number of partners
As with the number of counts vs threads participated in, we can expect that different counters might have qualitatively different behaviour when it comes to how many counting partners they have, and how often they’ve counted with each one. Some counters might count a little bit with everybody, while others might run with only a few partners, and drop a count with others every now and then.
To quantify how uneven the counting distribution is, we can look at the [effective number of partners](https://en.wikipedia.org/wiki/Effective_number_of_parties) of each counter, and compare it with the actual number of partners.
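Concretely, if a fraction $p_i$ of a counter’s replies go to partner $i$, the effective number of partners is $1 / \sum_i p_i^2$: counting equally with $n$ partners gives exactly $n$, while a lopsided distribution gives a number close to the handful of dominant partners. Here’s a minimal standalone sketch of that calculation, assuming the inverse-Simpson definition from the linked article (the actual analysis below uses `analysis.effective_number_of_counters` from the rcounting package):

```{python}
# Illustrative sketch only; assumes the inverse-Simpson definition
# of the effective number of partners, 1 / sum(p_i ** 2).
def effective_number_of_partners(partner_counts):
    total = sum(partner_counts)
    return 1 / sum((c / total) ** 2 for c in partner_counts)

print(effective_number_of_partners([100, 100, 100, 100]))  # 4.0: four equal partners
print(effective_number_of_partners([850, 50, 50, 50]))  # ~1.4: one dominant partner
```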
Code to find the effective and actual number of counting partners
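```{python}
sorted_counters = counts.groupby("username").size().sort_values(ascending=False)
top_counters = [
    x for x in sorted_counters.index[:35] if not counters.is_banned_counter(x)
][:30]
top = sorted_counters.filter(items=top_counters)
df = (
    counts.loc[counts["username"].isin(top_counters)]
    .groupby(["username", "replying_to"])
    .size()
)
effective_partners = (
    df.groupby(level=0).apply(analysis.effective_number_of_counters).to_frame()
)
partners = df.groupby(level=0).count()
combined = pd.concat([top, effective_partners, partners], axis=1)
combined["HOC rank"] = range(1, len(combined) + 1)
combined.columns = ["counts", "c_eff", "c", "rank"]
combined = combined[["rank", "c", "c_eff"]]
combined.c_eff = combined.c_eff.round().astype(int)
combined.columns = ["HOC rank", "N", "N_(effective)"]
combined.index.name = "Username"
combined.head(25)
```

We can also get the replying-to and replied-by stats for a single user.

Code to calculate reply stats for a single user

```{python}
counter = "thephilsblogbar2"
nick = "phil"
subset = counts.loc[counts["username"] == counter].copy()
replied_by = counts["username"].shift(-1).loc[subset.index]
subset["replied_by"] = replied_by
result = pd.concat(
    [
        subset.groupby("replied_by").count().iloc[:, 0].sort_values(ascending=False),
        subset.groupby("replying_to").count().iloc[:, 0].sort_values(ascending=False),
    ],
    axis=1,
)
headers = ["Counting partner", f"No. of replies by {nick}", f"No. of replies to {nick}"]
Markdown(result.head(10).to_markdown(headers=headers))
```

Table 1: The most popular counting partners of a single user

Oldest counters

We can see who the oldest still-active counters are, where still-active is generously defined as “having made a count within the last six months”.

Code to find the oldest still-active counters

```{python}
cutoff_date = pd.to_datetime("today") - pd.Timedelta("180d")
active_counters = (
    counts.loc[counts["date"] > cutoff_date].groupby("username").groups.keys()
)
oldest_counters = (
    counts.loc[counts["username"].isin(active_counters)]
    .groupby("username")["date"]
    .agg([min, max])
)
oldest_counters = oldest_counters.sort_values("min").head(25)
headers = ["**username**", "**First Count**", "**Latest Count**"]
Markdown(oldest_counters.apply(lambda x: x.dt.date).to_markdown(headers=headers))
```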
Table 2: The 25 currently-active counters who’ve been counting for the longest time

| username | First Count | Latest Count |
|:---|:---|:---|
| Z3F | 2012-06-10 | 2023-04-13 |
| CanGreenBeret | 2012-06-12 | 2023-01-30 |
| OverlordLork | 2012-06-14 | 2023-02-17 |
| 949paintball | 2012-06-15 | 2022-12-27 |
| mrstickman | 2012-07-18 | 2023-03-22 |
| Bluethulhu | 2012-11-04 | 2023-05-19 |
| kdiuro13 | 2013-04-07 | 2023-05-06 |
| zhige | 2013-05-03 | 2023-03-20 |
| falsehood | 2013-06-24 | 2023-03-22 |
| O_Baby_Baby | 2013-12-13 | 2023-02-03 |
| CutOnBumInBandHere9 | 2013-12-13 | 2023-05-30 |
| onewhitelight | 2013-12-18 | 2023-03-22 |
| OreoObserver | 2014-01-25 | 2023-01-04 |
| D-alx | 2014-02-13 | 2023-04-06 |
| atomicimploder | 2014-02-26 | 2023-05-25 |
| dude_why_would_you | 2014-02-28 | 2023-03-03 |
| rideride | 2014-04-07 | 2023-04-26 |
| Juqu | 2014-04-12 | 2023-02-01 |
| Uebeltank | 2014-06-05 | 2023-04-01 |
| slockley | 2014-06-23 | 2023-01-20 |
| Blimp_Blimp | 2014-06-29 | 2022-12-13 |
| artbn | 2014-07-06 | 2023-04-23 |
| davidjl123 | 2014-07-16 | 2023-05-29 |
| Mooraell | 2015-01-04 | 2023-05-30 |
| Maniac_34 | 2015-01-12 | 2023-05-30 |
Gets and streaks
Similarly to the oldest counters, we can see which counters have the longest span between their first and their latest get; that’s shown in Table 3. Some counters have been active and getting gets for quite a while!
Code to find the counters with the longest get spans
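```{python}
Markdown(
    gets.groupby("username")
    .agg(lambda x: x.index[-1] - x.index[0])
    .iloc[:, 0]
    .sort_values(ascending=False)
    .head(10)
    .to_markdown(headers=["**Username**", "**Get span**"])
)
```

Table 3: The longest differences between the first and last get of r/counting users (1000s of counts)

We can also calculate what the longest get streaks are.

Code to find the longest get streaks

```{python}
y = gets["username"]
groups = gets.groupby((y != y.shift()).cumsum())
columns = ["username", "submission_id", "basecount"]
length = 10
indices = (-groups.size()).sort_values(kind="mergesort").index
old = groups.first().loc[indices, columns]
new = groups.last().loc[indices, columns]
combined = old.join(new, rsuffix="_new")
combined = (
    combined.loc[~combined["username"].apply(counters.is_banned_counter)]
    .head(length)
    .reset_index(drop=True)
)


def old_link(row):
    return (
        f"[{int(row.basecount / 1000) + 1}K]"
        + f"(https://reddit.com/comments/{row.submission_id}/)"
    )


def new_link(row):
    return (
        f"[{int(row.basecount_new / 1000) + 1}K]"
        + f"(https://reddit.com/comments/{row.submission_id_new}/)"
    )


headers = ["Rank", "username", "First Get", "Last Get", "Streak Length"]
headers = [f"**{x}**" for x in headers]
columns = ["username", "old_link", "new_link", "streak"]
combined["old_link"] = combined.apply(old_link, axis=1)
combined["new_link"] = combined.apply(new_link, axis=1)
combined["streak"] = 1 + (combined["basecount_new"] - combined["basecount"]) // 1000
combined.index += 1
combined.index.name = "Rank"
Markdown(combined[columns].to_markdown(headers=headers))
```

Table 4: The longest get streaks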
The core of the extraction is the line that says `groups = gets.groupby((y != y.shift()).cumsum())`. Let’s unpack it:

- `y != y.shift()` assigns a value of True to all threads with a username that’s different from their predecessor
- `.cumsum()` sums up all these True values. The net result is that each get streak is given its own unique number
- `.groupby()` extracts these groups for later use
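As a toy illustration of the trick (with made-up usernames), here’s what the first two steps produce:

```{python}
# Toy illustration with made-up usernames; each run of consecutive
# identical values ends up with its own group label.
import pandas as pd

y = pd.Series(["alice", "alice", "bob", "bob", "bob", "alice"])
print((y != y.shift()).tolist())  # [True, False, True, False, False, True]
print((y != y.shift()).cumsum().tolist())  # [1, 1, 2, 2, 2, 3]
```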
The groups are then sorted according to size, and the largest ones are shown in Table 4.