A brief analysis of a Reddit AMA


Show me the code

When

March 2017

What

A short crawler, an analysis tool and a bit of image-masking code, combined to analyse a Reddit account. As is, the code can download an account's entire post history and generate some statistics from it. I also wanted to try the mask functionality of the word_cloud library, so I produced a mask in Python with a few skimage functions. The results can be viewed below.

Why

I read through the AMA of Ewan McGregor and noticed that there were very few answers and that the answers given were very brief. Note that I'm a great fan of his work in Star Wars, Long Way Round, Long Way Down & Trainspotting. Some redditors noted that it felt more like an advertisement (as AMAs often do). I felt the same, so it got me thinking: how little does one actually have to do for such an enormous advertisement? This train of thought got me wondering about the number of words per karma and the number of views generated per given amount of time. The frontpage of Reddit is viewed by thousands of people a day, so having a thread on the frontpage is very valuable.

How

Get the data

The first thing to do was to get the data from Ewan's account. Luckily he had only 2 pages' worth of posts (45 posts in total), so I didn't have to make a lot of requests. Having quite some experience with parsers and none with Reddit's API, I decided to get the data through web requests and parse the HTML. With BeautifulSoup, parsing HTML is a breeze. Within an hour I had made the requests and parsed the data into a pandas dataframe.
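As a rough illustration of the parsing step, here is a minimal sketch. The tag and class names (div.comment, div.md, span.score, the time tag) are assumptions about what a Reddit user page looked like and may not match the real markup, and parse_comments is a hypothetical helper, not the original code.

```python
from bs4 import BeautifulSoup


def parse_comments(html):
    """Extract one dict per comment from a user page.

    All tag and class names below are assumptions about the page markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for comment in soup.find_all("div", class_="comment"):
        body = comment.find("div", class_="md")
        stamp = comment.find("time")
        score = comment.find("span", class_="score")
        if body is None or stamp is None:
            continue  # skip anything that does not look like a comment
        has_score = score is not None and score.has_attr("title")
        rows.append({
            "created": stamp.get("datetime"),
            "body": body.get_text(" ", strip=True),
            "score": int(score["title"]) if has_score else None,
        })
    return rows


# Made-up sample markup, only to show the shape of the result.
SAMPLE_HTML = """
<div class="thing comment">
  <span class="score unvoted" title="813">813 points</span>
  <time datetime="2017-03-01T18:00:00+00:00">1 day ago</time>
  <div class="md"><p>Thank you!</p></div>
</div>
<div class="thing comment">
  <time datetime="2017-03-01T18:00:28+00:00">1 day ago</time>
  <div class="md"><p>Yes.</p><p>Absolutely.</p></div>
</div>
"""

rows = parse_comments(SAMPLE_HTML)
```

The list of dicts can be handed straight to pandas with pd.DataFrame(rows); fetching each page of the account would be a plain web request per page before calling the parser.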

Get statistics

With the data neatly in a dataframe I was able to figure out all kinds of statistics, starting with the most basic ones:

Total amount of posts in AMA: 45
Total amount of time spent responding: 0 days 00:29:13
Amount of post / thread / link karma gained: 31422
Total amount of comment karma gained: 77026
Comment karma gained: Min 11 Max 6963 Average 1711 Median 813
Total amount of words: 660
Amount of words per post: Min 1 Max 48 Average 14 Median 13
Time between posts (s): Min 10 Max 224 Average 39 Median 28
Word length: Min 1 Max 14 Average 4 Median 4
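Numbers like the ones above can be derived from a dataframe in a few lines. The sketch below assumes hypothetical column names (created, body, score) and uses made-up sample rows; it is not the original analysis code.

```python
import pandas as pd


def summarize(series):
    """Min / max / average / median of a numeric series."""
    return {
        "min": series.min(),
        "max": series.max(),
        "average": series.mean(),
        "median": series.median(),
    }


def ama_statistics(df):
    """Basic statistics for a frame with 'created', 'body' and 'score' columns."""
    df = df.sort_values("created")
    words_per_post = df["body"].str.split().str.len()
    # Seconds between consecutive posts; the first post has no predecessor.
    gaps = df["created"].diff().dt.total_seconds().dropna()
    return {
        "posts": len(df),
        "time_spent": df["created"].max() - df["created"].min(),
        "total_words": int(words_per_post.sum()),
        "words_per_post": summarize(words_per_post),
        "seconds_between_posts": summarize(gaps),
        "comment_karma": summarize(df["score"]),
    }


# Made-up rows, only to show how the function is called.
sample = pd.DataFrame({
    "created": pd.to_datetime([
        "2017-03-01 12:00:00",
        "2017-03-01 12:00:10",
        "2017-03-01 12:00:40",
    ]),
    "body": ["Yes", "Thank you very much", "Hello there"],
    "score": [10, 30, 50],
})
stats = ama_statistics(sample)
```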

With this data in place I figured it'd be more useful to plot the min/max/median/average values:

Having generated a similar word list already in my Simple Whatsapp Analysis code I thought it might be fun to make a word cloud for a change. Installing & building the word cloud was amazingly simple:
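For reference, a minimal sketch of how such a word cloud can be built from word counts. The helper names, the tokenising regex and the output path are assumptions made for illustration; WordCloud, generate_from_frequencies and to_file are part of the word_cloud library.

```python
import re
from collections import Counter


def word_frequencies(texts, min_length=1):
    """Count lower-cased words across all post bodies."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z']+", text.lower()))
    return Counter(word for word in words if len(word) >= min_length)


def render_word_cloud(frequencies, path, mask=None):
    """Draw the word cloud and save it as an image file."""
    # Imported here so the counting helper works without the library installed.
    from wordcloud import WordCloud

    cloud = WordCloud(background_color="white", mask=mask)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)


# Made-up post bodies, only to show the shape of the result.
frequencies = word_frequencies(["Yes, thank you", "Thank you very much"])
```

Calling render_word_cloud(frequencies, "cloud.png") would then write the image; the file name is only an example.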

However, I noticed that the library also supports a mask, meaning that the word cloud can be cast into a certain shape. The image Ewan McGregor used as proof that he was doing the AMA looked like a fun image to try it on. Naturally, the mask had to be generated through Python instead of something obvious like Photoshop. After some tinkering I managed to get a good enough mask using an Otsu threshold, a closing filter and, finally, a watershed algorithm applied to the remainder of the image. The watershed algorithm results in several labels, and figuring out which label belongs to the white sheet was something I had to do manually. Nevertheless, I'm happy with the result, even though I did spend a bit too much time on this, haha.
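The three steps can be sketched roughly as follows. This is a sketch under assumptions, not the original code: the closing radius, the way watershed seeds are chosen (the core of the distance transform) and the synthetic test image are all made up for illustration, and on a real photo the right label still has to be picked by hand.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu
from skimage.morphology import closing, disk
from skimage.segmentation import watershed


def segment_bright_regions(gray, closing_radius=3):
    """Label the bright regions of a 2-D grayscale image (0 = background)."""
    # 1. Otsu threshold separates bright pixels from dark ones.
    binary = gray > threshold_otsu(gray)
    # 2. Closing fills small dark holes inside the bright regions.
    closed = closing(binary.astype(np.uint8), disk(closing_radius)).astype(bool)
    # 3. Watershed on the distance transform; the seeding rule is an assumption.
    distance = ndi.distance_transform_edt(closed)
    markers, _ = ndi.label(distance > 0.5 * distance.max())
    return watershed(-distance, markers, mask=closed)


def label_to_wordcloud_mask(labels, label):
    """White (255) is left empty by word_cloud; words go where the mask is 0."""
    mask = np.full(labels.shape, 255, dtype=np.uint8)
    mask[labels == label] = 0
    return mask


# Synthetic stand-in for the photo: a bright "sheet" on a dark background.
image = np.full((60, 80), 0.1)
image[10:50, 20:60] = 0.9
labels = segment_bright_regions(image)
mask = label_to_wordcloud_mask(labels, label=1)
```

The resulting array can be passed as the mask argument of WordCloud, which draws words only where the mask is not white.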

Finally, I wanted to make a plot that showed the amount of activity over time and, at the same time, the size of each post. In the plot below each line represents a post, with time on the X-axis. The color indicates the size of the post. I tried out several other representations, for example with the height instead of the color indicating the size of the post, but I preferred this one.
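A plot of this kind can be sketched with matplotlib as below. The colormap, figure size and labels are assumptions, plot_activity is a hypothetical helper, and the sample times and sizes are made up.

```python
from datetime import datetime, timedelta

import matplotlib
matplotlib.use("Agg")  # render to a file; no display needed
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize


def plot_activity(times, sizes, path=None):
    """One vertical line per post: x = time of post, color = size of post."""
    fig, ax = plt.subplots(figsize=(10, 2))
    norm = Normalize(vmin=min(sizes), vmax=max(sizes))
    cmap = plt.get_cmap("viridis")
    ax.vlines(times, 0, 1, colors=cmap(norm(sizes)), linewidth=2)
    ax.set_yticks([])  # the height carries no information here
    ax.set_xlabel("Time of post")
    fig.colorbar(plt.cm.ScalarMappable(norm=norm, cmap=cmap),
                 ax=ax, label="Words in post")
    if path is not None:
        fig.savefig(path, bbox_inches="tight")
    return fig, ax


# Made-up posting times and word counts, only to show the call.
start = datetime(2017, 3, 1, 18, 0, 0)
times = [start + timedelta(seconds=s) for s in (0, 28, 67)]
fig, ax = plot_activity(times, [1, 13, 48])
```

Using height instead of color would only require replacing the fixed upper bound of 1 in vlines with the post sizes.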

To conclude, this set-up is easily reusable for any account on Reddit, and this report can be generated for any thread in which the account is active. Besides selecting threads, it can be used for further analysis of accounts, for example to see which other subreddits a user is active in. I will toy around with this code more often in the future.
