Once Upon a Time, on Reddit...
They say the internet is a chaotic place. We say it’s a mirror.
Welcome to our data story! We are the Ambassadors for Data Appreciation Team, and we’ve spent months analyzing the Reddit Hyperlink Network dataset, a massive web of connections between distinct communities on everyone's favorite social media platform. Our mission? Prove that online discourse isn't random; it follows the heartbeat of the physical world.
In the chapters ahead, you won't just see graphs. You will see the rhythm of the seasons affecting our collective mood, watch real-world history get reconstructed by anomaly detection algorithms, and (since, as everyone knows, money is everything) discover relationships between the stock market and Reddit anxiety.
Key Research Questions
To navigate a dataset of this magnitude, we needed a compass.
Our main question is:
"Can we see an impact of real-world events on Reddit posts?"
Seasonal Variations
"How do emotions vary over the seasons and months of the year?"
We investigate whether the 'Winter Blues' and the 'Summer Joy' are just myths or measurable phenomena. By tracking specific emotional markers over time, we uncover the seasonal heartbeat hidden in the mountain of comments.
Market Sentiment
"Is the stock market trend related to Reddit sentiment?"
Can you predict sentiments of subreddits based on stock market trends of the related companies? We analyze the correlation between erratic subreddit behavior and stock volatility to see if Wall Street dances to the same tune as Reddit.
Predictive Power
"Can we leverage ML to predict Reddit sentiment based on real-world data?"
We explore if machine learning can effectively predict different aspects of Reddit sentiment using real-world data signals, reconstructing the timeline of history purely from the 'noise' of community interactions.
And Fall wins it all!
Unsurprisingly, Winter gets a good dose of sadness, but it scores low on anger and anxiety. It is also associated with positive emotions, likely thanks to the holiday season.
What stands out is the intensity of Fall. It takes the crown for almost every emotion. As the leaves turn, Redditors turn to their keyboards with heightened feelings.
In contrast, Spring seems to be a season of emotional hibernation, scoring lowest in Affection and Positive Emotion.
We isolated communities using PCA. Explore the tabs to see how each community follows its own rhythm through the year!
Darker colors indicate higher intensity.
All subreddits aggregated.
| Emotion | Winter | Spring | Summer | Fall |
|---|---|---|---|---|
| Affection | 0.20 | 0.17 | 0.19 | 0.44 |
| Positive Emotion | 0.29 | 0.17 | 0.21 | 0.33 |
| Negative Emotion | 0.17 | 0.25 | 0.24 | 0.34 |
| Anxiety | 0.24 | 0.24 | 0.25 | 0.27 |
| Anger | 0.19 | 0.27 | 0.23 | 0.30 |
| Sadness | 0.26 | 0.25 | 0.23 | 0.26 |
The Annual Pulse
Zooming in with a 7-day moving average, the year's emotional rhythm becomes clear.
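The smoothing itself is nothing exotic. A minimal sketch in pandas, with synthetic daily scores standing in for our real LIWC series:

```python
import numpy as np
import pandas as pd

# Synthetic daily anxiety scores (stand-in for the real LIWC-scored posts).
rng = np.random.default_rng(0)
days = pd.date_range("2016-01-01", periods=365, freq="D")
daily = pd.Series(rng.normal(0.25, 0.05, len(days)), index=days, name="anxiety")

# A 7-day moving average smooths out day-of-week noise
# while keeping the slower, seasonal movements visible.
smoothed = daily.rolling(window=7, min_periods=7).mean()
```

The `min_periods=7` choice leaves the first six days undefined rather than averaging over a partial window.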
Click on the stories to reveal the patterns:
Can we predict the mood?
If seasons and communities are so influential, can we predict a post's sentiment just by knowing when and where it was posted?
The Experiment
We trained a Machine Learning model using only 3 basic features:
- Source Subreddit
- Target Subreddit
- Date of the post
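A minimal sketch of such a model, assuming one-hot encoded categoricals and a logistic regression (the rows and labels below are synthetic stand-ins for the real hyperlink dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in rows: real ones carry the same three fields.
rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "source": rng.choice(["r/politics", "r/aww", "r/news", "r/funny"], n),
    "target": rng.choice(["r/worldnews", "r/cats", "r/science"], n),
    "month": rng.integers(1, 13, n),  # stands in for the post date
})
# Synthetic label: negative links are more likely from news-style subreddits.
p_neg = 0.2 + 0.5 * df["source"].isin(["r/politics", "r/news"])
df["negative"] = rng.random(n) < p_neg

# One-hot encode the three categorical features, then fit a linear classifier.
X = pd.get_dummies(df[["source", "target", "month"]].astype(str))
X_tr, X_te, y_tr, y_te = train_test_split(X, df["negative"], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
```

Because the planted signal lives in the `source` column, the model beats the majority-class baseline on this toy data, mirroring what we observed on the real features.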
The model performed significantly better than random guessing, proving that context and timing are key drivers of online sentiment.
How It Works
Distinguishing random noise from a historic moment required a precise pipeline. We started by tracking the velocity of hyperlinks, applying a 7-day rolling average to smooth out daily fluctuations.
When the signal deviated significantly from its normal behaviour, we flagged it as a Peak. For each peak, our system automatically deployed a scraper to collect the actual conversation threads from 5 days before and after the event.
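The peak-flagging step can be sketched as a rolling z-score test (the thresholds and synthetic counts below are illustrative, not our exact parameters):

```python
import numpy as np
import pandas as pd

def find_peaks(counts: pd.Series, window: int = 7, z_thresh: float = 3.0) -> pd.Series:
    """Flag days whose smoothed hyperlink volume deviates strongly from its baseline."""
    smooth = counts.rolling(window, min_periods=window).mean()
    baseline = smooth.rolling(90, min_periods=30).mean()
    spread = smooth.rolling(90, min_periods=30).std()
    z = (smooth - baseline) / spread
    return z > z_thresh

# Synthetic daily link counts with one injected week-long "event" surge.
days = pd.date_range("2016-01-01", periods=200, freq="D")
counts = pd.Series(100.0, index=days) + np.random.default_rng(1).normal(0, 5, 200)
counts.iloc[150:157] += 80
peaks = find_peaks(counts)
```

On this synthetic series, only the injected surge crosses the threshold; ordinary day-to-day noise stays below it.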
This raw context was then fed into the Apertus and Qwen LLMs, accessed via publicly available open source APIs. Their task was to synthesize thousands of fragmented comments into a single, cohesive explanation. The result is what you see in the dashboard: a perfect match between data spikes and real-world history.
Data Sources
Embedding Search Strategy
To capture the true sentiment around Apple, we don't just look at the main r/apple community.
We use vector embeddings to identify subreddit communities with semantically similar conversations and user overlaps.
This approach reveals hidden discussions across the platform, from adjacent technology forums to general financial discussions.
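A minimal sketch of this similarity search, with hypothetical 3-dimensional vectors in place of the real subreddit embeddings:

```python
import numpy as np

# Hypothetical subreddit embedding vectors (illustrative, not the real ones).
embeddings = {
    "apple":  np.array([0.9, 0.1, 0.3]),
    "iphone": np.array([0.8, 0.2, 0.4]),
    "stocks": np.array([0.2, 0.9, 0.1]),
    "aww":    np.array([0.1, 0.1, 0.9]),
}

def most_similar(name: str, k: int = 2) -> list:
    """Rank other subreddits by cosine similarity to the query embedding."""
    q = embeddings[name]
    def cos(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    scores = {s: cos(v) for s, v in embeddings.items() if s != name}
    return sorted(scores, key=scores.get, reverse=True)[:k]

neighbors = most_similar("apple")
```

Cosine similarity ignores vector length and compares direction only, which is the standard choice for embedding search.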
Sentiment vs Price
At first glance, we find a strong correlation in this first graph. However, this is largely an artifact of shared trends: Reddit's user base grew steadily over the period, while Apple's stock price rose from 2014 to 2017.
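A synthetic sketch of why shared trends inflate correlation: two unrelated series with a common upward drift correlate strongly, but the effect vanishes once differencing removes the trend.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(200)
# Two series that share an upward trend but have unrelated fluctuations.
price = 100 + 0.5 * t + rng.normal(0, 5, 200)            # stand-in for a stock price
sentiment = 0.2 + 0.002 * t + rng.normal(0, 0.05, 200)   # stand-in for mean sentiment

raw_corr = float(np.corrcoef(price, sentiment)[0, 1])
# Differencing removes the shared trend; what remains is the true co-movement.
diff_corr = float(np.corrcoef(np.diff(price), np.diff(sentiment))[0, 1])
```

Here `raw_corr` is high purely because both series drift upward, while `diff_corr` hovers near zero.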
Chapter 4: Behind the Curtain
What is there beneath the surface of the dataset?
Clustering
Analyzing the subreddit embeddings to find clusters of related communities.
LIWC Analysis
Understanding LIWC emotion relationships with sentiment classification.
Outlier Detection
Finding outlier clusters based on their emotional profiles.
Choosing the Number of Clusters
We want to understand the structure of the subreddit ecosystem by clustering similar communities together with the K-means algorithm. To determine the optimal number of clusters K, we perform a Silhouette Analysis.
While K=2 yields the highest score, it results in a trivial split (likely only separating outliers from the rest).
We chose the local maximum at K=9, which offers a much richer and more granular segmentation of the subreddit ecosystem, allowing us to identify distinct communities based on specific topics and behaviors rather than just broad differences.
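The selection procedure can be sketched as follows (toy blobs in place of the real embeddings; on this synthetic data the silhouette peaks at the true number of groups):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy stand-in for the subreddit embeddings: four well-separated groups.
centers = [(0, 0), (8, 0), (0, 8), (8, 8)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.8, random_state=0)

# Fit K-means for a range of K and score each clustering.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

On the real, messier embeddings the global maximum sat at a trivial K=2 split, which is why we instead picked the local maximum at K=9.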
Cluster Distribution
and Characteristics
Click on a cluster in the chart to see its description and characteristics, or search for your favorite subreddit!
The principle of subreddit "closeness" is that two subreddits are similar if the users who post in them are similar. Clusters are therefore communities of subreddits with similar user bases.
The largest cluster (Cluster 2) contains the vast majority of subreddits, representing general-interest communities and niches. These subreddits have diverse user bases rather than a single "user type". The other clusters instead reflect certain characteristic Reddit users.
Dimensionality Reduction (PCA)
Using Principal Component Analysis (PCA), we reduced the high-dimensional embeddings of subreddits into a 3D space.
This visualization makes the relationships between communities easier to grasp at a glance. Colors represent the clusters identified in the previous step. You can also click on a cluster name to hide it; try it with Cluster 2 if you are interested in the structure of the smaller clusters!
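The projection step is only a few lines of scikit-learn (random vectors standing in for the real subreddit embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the high-dimensional subreddit embeddings.
rng = np.random.default_rng(3)
high_dim = rng.normal(size=(500, 64))

# Project onto the 3 directions of maximum variance for the 3D scatter plot.
pca = PCA(n_components=3)
coords = pca.fit_transform(high_dim)
```

PCA orders its components by explained variance, so the first axis of the plot always captures the largest spread in the data.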
Note: Features are scaled to focus on the main distribution (IQR). Extreme outliers (dashed lines) may extend beyond the view.
Linguistic Features (LIWC)
We analyzed the linguistic style of posts using their LIWC scores. These scores quantify the frequency in posts of various word categories related to emotions, cognitive processes, social concerns, and more.
These box plots show the distribution of various linguistic features across the dataset. We notice that some features are much more prevalent than others (e.g., Function Words (it, to, no, very) and Cognitive Mechanisms (cause, know, ought)), as they are the basis of discourse. Others are rarer (e.g., Biological Processes (eat, pain, sick) and Drives (affiliation, achievement, power)) because of their more specialized nature.
From LIWC to Sentiment
But which linguistic features are most influential to the sentiment of a post? We used Logistic Regression, a machine learning model, to identify which linguistic features correlate most strongly with sentiment.
As expected, negative emotions (Anger, Sad(ness), Anx(iety)) show high negative correlation. However, surprisingly, grammar also plays a role: a high usage of Verbs and Prepositions is a significant predictor of negative sentiment.
Interestingly, positive coefficients are much smaller in magnitude than negative ones, suggesting negative language carries more weight: negativity overwhelms positivity.
| LIWC Property | Coefficient | P-Value |
|---|---|---|
| LIWC_Anger | -0.8409 | 0.00e+00 |
| LIWC_Verbs | -0.2396 | 1.73e-38 |
| LIWC_Prep | -0.2392 | 0.00e+00 |
| LIWC_Sad | -0.2331 | 0.00e+00 |
| LIWC_Article | -0.2230 | 0.00e+00 |
| LIWC_Body | -0.1987 | 0.00e+00 |
| LIWC_Conj | -0.1784 | 1.92e-273 |
| LIWC_Ipron | -0.1735 | 0.00e+00 |
| LIWC_SheHe | -0.1640 | 0.00e+00 |
| LIWC_Anx | -0.1555 | 0.00e+00 |
| LIWC_Dissent | -0.1457 | 2.36e-298 |
| LIWC_Humans | -0.1393 | 0.00e+00 |
| LIWC_They | -0.1134 | 9.04e-263 |
| LIWC_Health | -0.1079 | 1.83e-205 |
| LIWC_Posemo | 0.1055 | 3.46e-84 |
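A ranking like the one above can be reproduced in spirit with a small sketch (synthetic LIWC scores and labels; the real model was fit on the actual posts):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic standardized LIWC scores for three categories.
rng = np.random.default_rng(5)
n = 5000
X = pd.DataFrame({
    "LIWC_Anger":  rng.normal(size=n),
    "LIWC_Posemo": rng.normal(size=n),
    "LIWC_Verbs":  rng.normal(size=n),
})
# Synthetic labels: anger pushes sentiment down, positive emotion pushes it up.
logit = -0.8 * X["LIWC_Anger"] + 0.1 * X["LIWC_Posemo"]
y = rng.random(n) < 1 / (1 + np.exp(-logit))

# Standardizing first makes the coefficients directly comparable in magnitude.
model = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
```

The fitted coefficients recover the planted effects: Anger comes out strongly negative, Posemo weakly positive, just as in the table.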
Linguistic Fingerprints
Some clusters exhibit distinctive linguistic profiles, with outlier values in certain LIWC feature scores.
This is a remarkable result: these unique "linguistic fingerprints" reveal peculiar characteristics of these communities.
A high amount of Family-related words in aesthetic-humor subreddits.
Driven by some subreddits like r/blunderyears, where captions of funny pictures are inherently related to family embarrassment ("My mom made me wear this"). The source of humor is often family-related.
We see high usage of I, Ipron (first-person pronouns), Past tense, and Tentat (tentative) words.
"Adulting" is introspective. When fixing a sink or budgeting, users narrate their own history. The high tentative score reveals users are not experts lecturing, but regular people admitting they don't know the answer.
High scores for Body, Sexual, Swear... and Friends???
Why is a pornography cluster talking about friendship? It's likely a framing device. Titles rely on social transgression (e.g., "My best friend sent me this") to generate interest.
High concentration of They and Inhib (inhibition) words.
A stark contrast to Cluster 1. The conversation is rarely about the poster (first person), but rather about external forces and institutions (third person). The high inhibition score (block, stop, prevent) paints a picture of resistance against barriers.