We have access to the body of each comment, so it’s possible to do some analysis on those. One interesting thing to look at is whether a given count is comma separated, space separated, or uses no separator at all. A natural follow-up question is how the distribution between those three types has changed over time.
Specifically, we’ll define the three types of count as:

- Comma separated counts look like `\d{1,3}(,\d{3})+`: one to three digits, followed by one or more groups of a comma and exactly three digits
- Space separated counts are the same, with the comma replaced by a space
- Counts with no separator are defined as one of
  - Counts with only one digit
  - Counts with no separators between their first and last digit, with separators defined fairly broadly.
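As a rough illustration of these rules, the sketch below applies them to a few made-up comment bodies. The `classify` helper and the sample strings are hypothetical, and the separator characters are one broad choice consistent with the description above. Note that the three flags are computed independently, so they are not guaranteed to be mutually exclusive.

```python
import re

# One to three digits, then one or more ",ddd" (or " ddd") groups.
COMMA_RE = re.compile(r"\d{1,3}(?:,\d{3})+")
SPACE_RE = re.compile(r"\d{1,3}(?: \d{3})+")

# Characters treated as separators for the "no separator" rule (illustrative set).
SEPARATORS = re.escape("' , .*/")
NO_SEPARATOR_RE = re.compile(
    # Either exactly one digit on the line ...
    rf"(?:^[^\d]*\d[^\d]*$)|"
    # ... or no separator character between the first and the last digit.
    rf"(?:^[^\d]*\d[^{SEPARATORS}]*\d[^\d]*$)"
)


def classify(body):
    """Return the three separator flags for a single comment body."""
    first_line = body.split("\n")[0]
    return {
        "commas": bool(COMMA_RE.search(body)),
        "spaces": bool(SPACE_RE.search(body)),
        "no separator": bool(NO_SEPARATOR_RE.search(first_line)),
    }


for sample in ["1,234,567", "1 234 567", "1234567", "7"]:
    print(sample, classify(sample))
```

Because the comma and space patterns require at least one separator group, a short count such as `123` only matches the no-separator rule.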
Code for importing packages and loading data
```python
import re
import sqlite3
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
import seaborn as sns
from rcounting import analysis, counters, parsing, side_threads
from rcounting import thread_navigation as tn
from rcounting.reddit_interface import reddit

pio.templates.default = "seaborn"
sns.set_theme()
from IPython.display import Markdown

data_directory = Path("../data")
db = sqlite3.connect(data_directory / "counting.sqlite")
# Load all comments with position > 0, ordered by thread and by position within the thread
counts = pd.read_sql(
    "select comments.body, comments.timestamp from comments join submissions "
    "on comments.submission_id = submissions.submission_id where comments.position > 0 "
    "order by submissions.timestamp, comments.position",
    db,
)
# Convert the Unix timestamps (in seconds) to datetimes and drop the raw column
counts["date"] = pd.to_datetime(counts["timestamp"], unit="s")
counts.drop("timestamp", inplace=True, axis=1)
```
We started by making the necessary imports and loading all the data; with that out of the way, we can implement the rules defined above.

```python
data = counts.set_index("date")
data["body"] = data["body"].apply(parsing.strip_markdown_links)

# One to three digits followed by at least one ",ddd" group
comma_regex = re.compile(r"\d{1,3}(?:,\d{3})+")
data["commas"] = data["body"].apply(lambda x: bool(re.search(comma_regex, x)))

# Same pattern, with a space instead of a comma
space_regex = re.compile(r"\d{1,3}(?: \d{3})+")
data["spaces"] = data["body"].apply(lambda x: bool(re.search(space_regex, x)))


def no_separators(body):
    # Only the first line of the comment is considered
    body = body.split("\n")[0]
    separators = re.escape("' , .*/")
    # Either a single digit, or no separator between the first and last digit
    regex = (
        rf"(?:^[^\d]*\d[^\d]*$)|"
        rf"(?:^[^\d]*\d[^{separators}]*\d[^\d]*$)"
    )
    regex = re.compile(regex)
    result = re.search(regex, body)
    return bool(result)


data["no separator"] = data["body"].apply(no_separators)
data.sort_index(inplace=True)
```

Once we have the data, we can take a 14-day rolling average and resample the points to regular 6-hour intervals. The resampling makes the plot look nicer, since a regularly spaced x-axis is easier for the plotting library to deal with.

Code for plotting the separator data

```python
# Percentage of counts per style: 14-day rolling mean, resampled to a 6-hour grid,
# then reshaped to long format with one row per (date, style) pair
resampled = (
    (data[["commas", "spaces", "no separator"]].rolling("14d").mean() * 100)
    .resample("6h")
    .mean()
    .melt(ignore_index=False)
    .reset_index()
)
labels = {
    "date": "Date",
    "variable": "Separator style",
    "value": "Percentage of counts",
}
fig = px.line(
    data_frame=resampled,
    x="date",
    y="value",
    color="variable",
    labels=labels,
    title="The separators used on r/counting over time",
)
fig.update_yaxes(range=[0, 100])
fig.show()
```

You can clearly see when the count crossed 100k: that’s when the ‘no separator’ line quickly drops from being the majority to being a clear minority of counts. That was followed by the era of commas, when the default format was simply to use commas as separators. Over the last few years, commas have declined significantly and have now been overtaken by spaces as the most popular separator, although there’s a lot of variation depending on who exactly is active. The no-separator style has bouts of popularity, but is generally below the other two options. Pretty neat!
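The rolling-average and resampling steps used for the plot can be tried out on synthetic data. In this sketch the timestamps, the `toy` frame, and the 60% comma rate are all made up for illustration; only the `rolling`, `resample`, and `melt` calls mirror the ones used for the plot.

```python
import numpy as np
import pandas as pd

# Synthetic data: 500 comments at irregular times over 60 days (made up).
rng = np.random.default_rng(0)
seconds = np.sort(rng.uniform(0, 60 * 24 * 3600, size=500))
timestamps = pd.Timestamp("2020-01-01") + pd.to_timedelta(seconds, unit="s")
toy = pd.DataFrame(
    {"commas": (rng.random(500) < 0.6).astype(float)},
    index=timestamps,
)
toy.index.name = "date"

# Time-based window: each row is averaged with the rows from the preceding 14 days.
rolling_pct = toy.rolling("14D").mean() * 100

# Resampling puts the irregularly spaced points on a regular 6-hour grid.
regular = rolling_pct.resample("6h").mean()

# Long format: one row per (date, variable) pair, convenient for px.line(color=...).
long_format = regular.melt(ignore_index=False).reset_index()
print(long_format.head())
```

Time-based windows need a sorted datetime index, which is why the synthetic timestamps are sorted before building the frame. Empty 6-hour bins come out as missing values after resampling.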