Once Upon a Time, on Reddit...
They say the internet is a chaotic place. We say it’s a mirror.
Welcome to our data story! We are the Ambassadors for Data Appreciation Team, and we’ve spent months analyzing the Reddit Hyperlink Network dataset, a massive web of connections between distinct communities on everyone's favorite social media platform. Our mission? Prove that online discourse isn't random; it follows the heartbeat of the physical world.
In the chapters ahead, you won't just see graphs. You will see the rhythm of the seasons affecting our collective mood, watch real-world history get reconstructed by anomaly detection algorithms, and (since, as everyone knows, money is everything) discover relationships between the stock market and Reddit anxiety.
Key Research Questions
To navigate a dataset of this magnitude, we needed a compass.
Our main question is:
"Can we see an impact of real-world events on Reddit posts?"
Seasonal Variations
"How do emotions vary over the seasons and months of the year?"
We investigate whether the 'Winter Blues' and the 'Summer Joy' are just myths or measurable phenomena. By tracking specific emotional markers over time, we uncover the seasonal heartbeat hidden in the mountain of comments.
Market Sentiment
"Is the stock market trend related to Reddit sentiment?"
Can you predict sentiments of subreddits based on stock market trends of the related companies? We analyze the correlation between erratic subreddit behavior and stock volatility to see if Wall Street dances to the same tune as Reddit.
Predictive Power
"Can we leverage ML to predict Reddit sentiment based on real-world data?"
We explore if machine learning can effectively predict different aspects of Reddit sentiment using real-world data signals, reconstructing the timeline of history purely from the 'noise' of community interactions.
And Fall wins it all!
Unsurprisingly, Winter gets a good dose of sadness, but it scores low on anger and anxiety. It is also associated with positive emotions, likely thanks to the holiday season.
What stands out is the intensity of Fall. It takes the crown for almost every emotion. As the leaves turn, Redditors turn to their keyboards with heightened feelings.
In contrast, Spring seems to be a season of emotional hibernation, scoring lowest in Affection and Positive Emotion.
We isolated communities using PCA. Explore the tabs to see how each community follows its own rhythm through the year!
Darker colors indicate higher intensity.
All subreddits aggregated.
| Emotion | Winter | Spring | Summer | Fall |
|---|---|---|---|---|
| Affection | 0.20 | 0.17 | 0.19 | 0.44 |
| Positive Emotion | 0.29 | 0.17 | 0.21 | 0.33 |
| Negative Emotion | 0.17 | 0.25 | 0.24 | 0.34 |
| Anxiety | 0.24 | 0.24 | 0.25 | 0.27 |
| Anger | 0.19 | 0.27 | 0.23 | 0.30 |
| Sadness | 0.26 | 0.25 | 0.23 | 0.26 |
The Annual Pulse
Zooming in with a 7-day moving average, the year's emotional rhythm becomes clear.
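The smoothing itself is nothing exotic. A minimal sketch in pandas, with synthetic daily scores standing in for our real LIWC series:

```python
import numpy as np
import pandas as pd

# Synthetic daily anxiety scores (stand-in for the real LIWC-scored posts).
rng = np.random.default_rng(0)
days = pd.date_range("2016-01-01", periods=365, freq="D")
daily = pd.Series(rng.normal(0.25, 0.05, len(days)), index=days, name="anxiety")

# A 7-day moving average smooths out day-of-week noise
# while keeping the slower, seasonal movements visible.
smoothed = daily.rolling(window=7, min_periods=7).mean()
```

The `min_periods=7` choice leaves the first six days undefined rather than averaging over a partial window.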
Click on the stories to reveal the patterns:
Can we predict the mood?
If seasons and communities are so influential, can we predict a post's sentiment just by knowing when and where it was posted?
The Experiment
We trained a Machine Learning model using only 3 basic features:
- Source Subreddit
- Target Subreddit
- Date of the post
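A minimal sketch of such a model, assuming one-hot encoded categoricals and a logistic regression (the rows and labels below are synthetic stand-ins for the real hyperlink dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in rows: real ones carry the same three fields.
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "source": rng.choice(["r/politics", "r/aww", "r/news", "r/funny"], n),
    "target": rng.choice(["r/worldnews", "r/cats", "r/science"], n),
    "month": rng.integers(1, 13, n),  # stands in for the post date
})
# Synthetic label: negative links are more likely from news-style subreddits.
p_neg = 0.2 + 0.5 * df["source"].isin(["r/politics", "r/news"])
df["negative"] = rng.random(n) < p_neg

# One-hot encode the three categorical features, then fit a linear classifier.
X = pd.get_dummies(df[["source", "target", "month"]].astype(str))
X_tr, X_te, y_tr, y_te = train_test_split(X, df["negative"], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
```

Because the planted signal lives in the `source` column, the model beats the majority-class baseline on this toy data, mirroring what we observed on the real features.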
The model performed significantly better than random guessing, proving that context and timing are key drivers of online sentiment.
How It Works
Distinguishing random noise from a historic moment required a precise pipeline. We started by tracking the velocity of hyperlinks, applying a 7-day rolling average to smooth out daily fluctuations.
When the signal deviated significantly from its normal behaviour, we flagged it as a Peak. For each peak, our system automatically deployed a scraper to collect the actual conversation threads from 5 days before and after the event.
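The peak-flagging step can be sketched as a rolling z-score test (the thresholds and synthetic counts below are illustrative, not our exact parameters):

```python
import numpy as np
import pandas as pd

def find_peaks(counts: pd.Series, window: int = 7, z_thresh: float = 3.0) -> pd.Series:
    """Flag days whose smoothed hyperlink volume deviates strongly from its baseline."""
    smooth = counts.rolling(window, min_periods=window).mean()
    baseline = smooth.rolling(90, min_periods=30).mean()
    spread = smooth.rolling(90, min_periods=30).std()
    z = (smooth - baseline) / spread
    return z > z_thresh

# Synthetic daily link counts with one injected week-long "event" surge.
days = pd.date_range("2016-01-01", periods=200, freq="D")
counts = pd.Series(100.0, index=days) + np.random.default_rng(1).normal(0, 5, 200)
counts.iloc[150:157] += 80
peaks = find_peaks(counts)
```

On this synthetic series, only the injected surge crosses the threshold; ordinary day-to-day noise stays below it.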
This raw context was then fed into the Apertus and Qwen LLMs, accessed via publicly available open source APIs. Their task was to synthesize thousands of fragmented comments into a single, cohesive explanation. The result is what you see in the dashboard: a perfect match between data spikes and real-world history.
Data Sources
Embedding Search Strategy
To capture the true sentiment around Apple, we don't just look at the main r/apple community.
We use vector embeddings to identify subreddit communities with semantically similar conversations and user overlaps.
This approach reveals hidden discussions across the platform, from adjacent technology forums to general financial discussions.
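A minimal sketch of this similarity search, with hypothetical 3-dimensional vectors in place of the real subreddit embeddings:

```python
import numpy as np

# Hypothetical subreddit embedding vectors (illustrative, not the real ones).
embeddings = {
    "apple":  np.array([0.9, 0.1, 0.3]),
    "iphone": np.array([0.8, 0.2, 0.4]),
    "stocks": np.array([0.2, 0.9, 0.1]),
    "aww":    np.array([0.1, 0.1, 0.9]),
}

def most_similar(name: str, k: int = 2) -> list:
    """Rank other subreddits by cosine similarity to the query embedding."""
    q = embeddings[name]
    def cos(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    scores = {s: cos(v) for s, v in embeddings.items() if s != name}
    return sorted(scores, key=scores.get, reverse=True)[:k]

neighbors = most_similar("apple")
```

Cosine similarity ignores vector length and compares direction only, which is the standard choice for embedding search.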
Sentiment vs Price
At first glance, we find a strong correlation in this first graph. However, this is largely an artifact of shared trends: Reddit's user base grew steadily over the period, while Apple's stock price rose from 2014 to 2017.
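A synthetic sketch of why shared trends inflate correlation: two unrelated series with a common upward drift correlate strongly, but the effect vanishes once differencing removes the trend.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(200)
# Two series that share an upward trend but have unrelated fluctuations.
price = 100 + 0.5 * t + rng.normal(0, 5, 200)            # stand-in for a stock price
sentiment = 0.2 + 0.002 * t + rng.normal(0, 0.05, 200)   # stand-in for mean sentiment

raw_corr = float(np.corrcoef(price, sentiment)[0, 1])
# Differencing removes the shared trend; what remains is the true co-movement.
diff_corr = float(np.corrcoef(np.diff(price), np.diff(sentiment))[0, 1])
```

Here `raw_corr` is high purely because both series drift upward, while `diff_corr` hovers near zero.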
Chapter 4: Behind the Curtain
What is there beneath the surface of the dataset?
Clustering
Analyzing the subreddit embeddings to find clusters of related communities.
LIWC Analysis
Understanding LIWC emotion relationships with sentiment classification.
Outlier Detection
Finding outlier clusters based on their emotional profiles.
Choosing the Number of Clusters
We want to understand the structure of the subreddit ecosystem by clustering similar communities together with the K-means algorithm. To determine the optimal number of clusters K, we perform a Silhouette Analysis.
While K=2 yields the highest score, it results in a trivial split (likely only separating outliers from the rest).
We chose the local maximum at K=9, which offers a much richer and more granular segmentation of the subreddit ecosystem, allowing us to identify distinct communities based on specific topics and behaviors rather than just broad differences.
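The selection procedure can be sketched as follows (toy blobs in place of the real embeddings; on this synthetic data the silhouette peaks at the true number of groups):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy stand-in for the subreddit embeddings: four well-separated groups.
centers = [(0, 0), (8, 0), (0, 8), (8, 8)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.8, random_state=0)

# Fit K-means for a range of K and score each clustering.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On the real, messier embeddings the global maximum sat at a trivial K=2 split, which is why we instead picked the local maximum at K=9.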
Cluster Distribution
and Characteristics
Click on a cluster in the chart to see its description and characteristics, or search for your favorite subreddit!
The principle of subreddit "closeness" is that two subreddits are similar if the users who post in them are similar. Clusters are therefore communities of subreddits with similar user bases.
The largest cluster (Cluster 2) contains the vast majority of subreddits, representing general-interest communities and niches. These subreddits have diverse user bases rather than a single "user type". The other clusters instead reflect certain characteristic Reddit users.
Dimensionality Reduction (PCA)
Using Principal Component Analysis (PCA), we reduced the high-dimensional embeddings of subreddits into a 3D space.
This visualization makes the relationships between communities easier to grasp at a glance. Colors represent the clusters identified in the previous step. You can also click on a cluster name to hide it; try it with Cluster 2 if you are interested in the structure of the smaller clusters!
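The projection step is only a few lines of scikit-learn (random vectors standing in for the real subreddit embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the high-dimensional subreddit embeddings.
rng = np.random.default_rng(3)
high_dim = rng.normal(size=(500, 64))

# Project onto the 3 directions of maximum variance for the 3D scatter plot.
pca = PCA(n_components=3)
coords = pca.fit_transform(high_dim)
```

PCA orders its components by explained variance, so the first axis of the plot always captures the largest spread in the data.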
Note: Features are scaled to focus on the main distribution (IQR). Extreme outliers (dashed lines) may extend beyond the view.
Linguistic Features (LIWC)
We analyzed the linguistic style of posts using their LIWC scores. These scores quantify the frequency in posts of various word categories related to emotions, cognitive processes, social concerns, and more.
These box plots show the distribution of various linguistic features across the dataset. We notice that some features are much more prevalent than others (e.g., Function Words (it, to, no, very) and Cognitive Mechanisms (cause, know, ought)), as they are the basis of discourse. Others are rarer (e.g., Biological Processes (eat, pain, sick) and Drives (affiliation, achievement, power)) because of their more specialized nature.
From LIWC to Sentiment
But which linguistic features are most influential to the sentiment of a post? We used Logistic Regression, a machine learning model, to identify which linguistic features correlate most strongly with sentiment.
As expected, negative emotions (Anger, Sad(ness), Anx(iety)) show high negative correlation. However, surprisingly, grammar also plays a role: a high usage of Verbs and Prepositions is a significant predictor of negative sentiment.
Interestingly, positive coefficients are much smaller in magnitude than negative ones, suggesting negative language carries more weight: negativity overwhelms positivity.
| LIWC Property | Coefficient | P-Value |
|---|---|---|
| LIWC_Anger | -0.8409 | 0.00e+00 |
| LIWC_Verbs | -0.2396 | 1.73e-38 |
| LIWC_Prep | -0.2392 | 0.00e+00 |
| LIWC_Sad | -0.2331 | 0.00e+00 |
| LIWC_Article | -0.2230 | 0.00e+00 |
| LIWC_Body | -0.1987 | 0.00e+00 |
| LIWC_Conj | -0.1784 | 1.92e-273 |
| LIWC_Ipron | -0.1735 | 0.00e+00 |
| LIWC_SheHe | -0.1640 | 0.00e+00 |
| LIWC_Anx | -0.1555 | 0.00e+00 |
| LIWC_Dissent | -0.1457 | 2.36e-298 |
| LIWC_Humans | -0.1393 | 0.00e+00 |
| LIWC_They | -0.1134 | 9.04e-263 |
| LIWC_Health | -0.1079 | 1.83e-205 |
| LIWC_Posemo | 0.1055 | 3.46e-84 |
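A ranking like the one above can be reproduced in spirit with a small sketch (synthetic LIWC scores and labels; the real model was fit on the actual posts):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic standardized LIWC scores for three categories.
rng = np.random.default_rng(5)
n = 5000
X = pd.DataFrame({
    "LIWC_Anger":  rng.normal(size=n),
    "LIWC_Posemo": rng.normal(size=n),
    "LIWC_Verbs":  rng.normal(size=n),
})
# Synthetic labels: anger pushes sentiment down, positive emotion pushes it up.
logit = -0.8 * X["LIWC_Anger"] + 0.1 * X["LIWC_Posemo"]
y = rng.random(n) < 1 / (1 + np.exp(-logit))

# Standardizing first makes the coefficients directly comparable in magnitude.
model = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
```

The fitted coefficients recover the planted effects: Anger comes out strongly negative, Posemo weakly positive, just as in the table.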
Linguistic Fingerprints
Some clusters exhibit distinctive linguistic profiles, with outlier values in certain LIWC feature scores.
This is a remarkable result: these unique "linguistic fingerprints" reveal peculiar characteristics of these communities.
A high amount of Family-related words in aesthetic-humor subreddits.
Driven by some subreddits like r/blunderyears, where captions of funny pictures are inherently related to family embarrassment ("My mom made me wear this"). The source of humor is often family-related.
We see high usage of I, Ipron (first-person pronouns), Past tense, and Tentat (tentative) words.
"Adulting" is introspective. When fixing a sink or budgeting, users narrate their own history. The high tentative score reveals users are not experts lecturing, but regular people admitting they don't know the answer.
High scores for Body, Sexual, Swear... and Friends???
Why is a pornography cluster talking about friendship? It's likely a framing device. Titles rely on social transgression (e.g., "My best friend sent me this") to generate interest.
High concentration of They and Inhib (inhibition) words.
A stark contrast to Cluster 1. The conversation is rarely about the poster (first person), but rather about external forces and institutions (third person). The high inhibition score (block, stop, prevent) paints a picture of resistance against barriers.