Reddit supports several hundred different online communities. But how different are the user bases of these communities? Initially, we were motivated by the question, “How different are the communities of Dota2 and League of Legends?” There have been debates about one community being more negative than the other, and we wanted to see if this was true. Building on this initial motivation, we expanded our scope to encompass multiple types of communities.
We noticed significant differences between subreddit communities in their jargon and vocabulary usage. We wanted to extract the words that characterize a particular subreddit and compare them with those of other subreddits to see how closely related the communities were. Doing so lets us cluster subreddits with one another. With our data, we created a dendrogram to visualize how similar or different the subreddits were based on jargon and vocabulary usage.
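As a rough illustration of how a dendrogram is built from pairwise distances, the sketch below performs average-linkage agglomerative clustering in plain Python. The linkage rule is an assumption (the write-up does not say which one was used), the function name `average_linkage` is hypothetical, and the distances are made-up numbers, not the project's measurements.

```python
from itertools import combinations


def average_linkage(labels, dist):
    """Agglomerative clustering with average linkage.

    `dist` maps frozenset({a, b}) to the distance between labels a and b.
    Returns the merges in order as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram draws.
    """
    def linkage(pair):
        a, b = pair
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average pairwise distance.
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Made-up distances for illustration only.
labels = ["Dota2", "leagueoflegends", "gaming", "tifu"]
toy = {
    ("Dota2", "leagueoflegends"): 0.2,
    ("gaming", "tifu"): 0.3,
    ("Dota2", "gaming"): 0.9,
    ("Dota2", "tifu"): 0.8,
    ("leagueoflegends", "gaming"): 0.85,
    ("leagueoflegends", "tifu"): 0.9,
}
dist = {frozenset(pair): d for pair, d in toy.items()}
merges = average_linkage(labels, dist)
for a, b, d in merges:
    print(a, "+", b, "at", round(d, 4))
```

Each merge corresponds to one junction in the dendrogram, and its distance gives the height of that junction. In practice, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` together with `dendrogram` could replace this hand-written loop and handle the plotting.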
We based our methods on those used in the article “Finding Cultural Holes: How Structure and Culture Diverge in Networks of Scholarly Communication”. This article describes a similar study that measures jargon differences between scientific fields.
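To make the idea of a jargon distance concrete, here is a minimal sketch that compares two word-count tables. It substitutes the Jensen-Shannon divergence, a standard symmetric measure between probability distributions, and is not claimed to be the measure from the cited article or the one used in this project. The function names and sample counts are hypothetical.

```python
import math
from collections import Counter


def to_distribution(counts, vocab):
    """Convert raw counts to relative frequencies over a shared vocabulary."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize an empty count table")
    return {w: counts.get(w, 0) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in bits; words with p[w] == 0 contribute nothing."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)


def jensen_shannon(counts_a, counts_b):
    """Jensen-Shannon divergence in bits: 0 for identical usage, 1 for disjoint."""
    vocab = set(counts_a) | set(counts_b)
    p = to_distribution(counts_a, vocab)
    q = to_distribution(counts_b, vocab)
    # The mixture m is nonzero wherever p or q is nonzero, so KL is well defined.
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Made-up counts for illustration only.
dota = Counter({"gank": 5, "mid": 3, "the": 10})
league = Counter({"gank": 4, "jungle": 2, "the": 9})
cooking = Counter({"oven": 6, "recipe": 4})

print(jensen_shannon(dota, league))
print(jensen_shannon(dota, cooking))
```

Communities that share most of their vocabulary score near 0, while communities with no words in common score 1, so the values can serve directly as pairwise distances for clustering.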
In summary, this project applies the approach of finding cultural holes and clustering groups to the popular online communities of Reddit.
Using a Python script, we accessed the Reddit API to extract the top 200 comments from each of the top 25 posts in each subreddit. We ran the script every 24 hours for 7 days to accumulate an extensive database of words and their counts. These words were stored in Firebase, along with the subscriber count for each subreddit.
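The counting step can be sketched as follows. This is only an illustration under stated assumptions: the tokenization rule (lowercase runs of letters, digits, and apostrophes) and the function names `count_words` and `merge_daily_counts` are hypothetical, and the Reddit API calls and Firebase writes are left out entirely.

```python
import re
from collections import Counter

# Assumed tokenization: lowercase runs of letters, digits, and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def count_words(comments):
    """Return a Counter of word tokens across a list of comment strings."""
    counts = Counter()
    for text in comments:
        counts.update(WORD_RE.findall(text.lower()))
    return counts


def merge_daily_counts(daily_counts):
    """Sum the Counters from each daily run into one running total."""
    total = Counter()
    for day in daily_counts:
        total.update(day)
    return total


# Two made-up daily runs for one subreddit.
day1 = count_words(["Nice gank mid!", "gank top next"])
day2 = count_words(["mid is missing"])
total = merge_daily_counts([day1, day2])
print(total.most_common(3))
```

Keeping one table per subreddit and adding each day's counts to it mirrors the accumulation over the 7 daily runs described above.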
For our project’s scope, we selected 27 subreddits. We wanted a diverse selection that included subreddits that were theoretically related as well as subreddits that were theoretically unrelated. Thus, we handpicked the default subreddits and several other subreddits we deemed necessary.
Our results are conclusive. We successfully found similarities between subreddits based on their word usage and jargon. The dendrogram we created showed clear clusters of subreddits. Some clusters were expected, such as the clustering of the Dota2 and League of Legends subreddits. However, we were surprised to see some subreddits distant from each other, like the Gaming and Minecraft subreddits. By examining the posts inside the subreddits more closely, we understood why some of the unexpected clusters occurred. Although the topics of two subreddits might differ, as with Gaming and TIFU, the comments posted in them have a similar tone.
We can take the data we stored in Firebase and display the top words of a subreddit in a bar graph. From this graph, we can see that word frequency plotted against word rank follows a Zipfian distribution. However, the ranking of individual words differs between subreddits.
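One simple way to check for a Zipfian shape is to fit a straight line to log frequency against log rank; a slope near -1 indicates the classic Zipf pattern. The sketch below uses ordinary least squares in plain Python. The function name `zipf_slope` is hypothetical and the input is synthetic, not the project's data.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two positive frequencies.
    """
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic counts following frequency = 1000 / rank, for illustration only.
ideal = [1000 / rank for rank in range(1, 51)]
slope = zipf_slope(ideal)
print(round(slope, 3))
```

For the synthetic input the fitted slope is very close to -1. Real word counts would be expected to deviate from that value, especially at the lowest ranks.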
Our results are based on a small sample, but they appear to be fairly accurate. For further research, we would like to explore jargon differences using longer n-grams, such as bigrams and trigrams, instead of unigrams. This way, we will have a better understanding of the semantics of a phrase. Additionally, by collecting data from more subreddits and more posts, we expect to generate more accurate results and clusters.
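A minimal sketch of what counting n-grams instead of single words could look like is shown below. The tokenization rule and the function name `ngram_counts` are assumptions for illustration. N-grams are counted within each comment so that they never span two unrelated comments.

```python
import re
from collections import Counter


def ngram_counts(comments, n=2):
    """Count n-grams (tuples of n consecutive tokens) within each comment."""
    counts = Counter()
    for text in comments:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        # Zipping n shifted copies of the token list yields consecutive n-tuples.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Made-up comments for illustration only.
bigrams = ngram_counts(["last hit the creep", "last hit under tower"])
print(bigrams.most_common(1))
```

Here the phrase "last hit" is captured as a single unit, whereas a unigram count would only record "last" and "hit" separately.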
We are four Informatics students from the University of Washington.