Published: Tue 14 March 2017
By Casper
In Quick Hacks.
When
March 2017
What
A short crawler, an analysis tool and a bit of image-masking code, combined to analyse a Reddit account. As is, the code can be used to download an account's entire post history.
From this history some statistics are generated. Additionally, I wanted to use the mask functionality of the word_cloud library, so I went ahead and produced a mask through Python and some
skimage functions. The results can be viewed below.
Why
I read through the AMA of Ewan McGregor and noticed that there were very few answers and that the answers that were given were very brief. Note that I'm a great fan of his work in Star Wars, Long Way Round, Long Way Down & Trainspotting.
Some redditors noted that it felt more like an advertisement (as AMAs often do). I felt the same, so it got me thinking: how little does one actually have to do for such an enormous advertisement?
This train of thought got me wondering about the number of words per karma and the number of views generated in a given amount of time. The frontpage of Reddit is viewed by thousands of people a day,
so having a thread on the frontpage is very valuable.
How
Get the data
The first thing to do was to get the data from Ewan's account. Luckily he had only 2 pages worth of posts (45 posts in total), so I didn't have to make many requests. Having quite some experience
with parsers and no experience with Reddit's API, I decided to get the data through web requests and by parsing the HTML. With beautifulsoup, parsing HTML is a breeze. Within an hour I had
made the requests and parsed the data into a pandas dataframe.
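The parsing step can be sketched roughly as follows. This is not the original crawler: it swaps beautifulsoup for the standard library's html.parser so that it runs without extra packages, and the markup (a div with class comment-body carrying data-posted and data-score attributes) is a made-up stand-in for whatever Reddit's pages actually contained. The names CommentParser, parse_comments and SAMPLE_PAGE are illustrative too.

```python
from html.parser import HTMLParser


class CommentParser(HTMLParser):
    """Collect one record per comment from a hypothetical page layout.

    Assumed markup (NOT Reddit's real HTML):
        <div class="comment-body" data-posted="..." data-score="...">text</div>
    """

    def __init__(self):
        super().__init__()
        self.records = []     # finished comments, in page order
        self._current = None  # comment being read, or None outside one
        self._depth = 0       # <div> nesting depth inside the comment

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            # Already inside a comment: only track nested divs.
            if tag == "div":
                self._depth += 1
        elif tag == "div" and attrs.get("class") == "comment-body":
            self._current = {
                "posted": attrs.get("data-posted"),
                "score": int(attrs.get("data-score") or 0),
                "text": "",
            }
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is not None and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                # Comment closed: collapse whitespace and store the record.
                self._current["text"] = " ".join(self._current["text"].split())
                self.records.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def parse_comments(html):
    """Return a list of dicts, one per comment found in the HTML string."""
    parser = CommentParser()
    parser.feed(html)
    parser.close()
    return parser.records


# Made-up sample page; NOT taken from the AMA.
SAMPLE_PAGE = """
<div class="comment-body" data-posted="2017-01-01T12:00:05" data-score="120">
  <p>Thanks, that was a <em>great</em> trip.</p>
</div>
<div class="comment-body" data-posted="2017-01-01T12:00:33" data-score="7">
  <p>Yes.</p>
</div>
"""

records = parse_comments(SAMPLE_PAGE)
```

A list of dictionaries like records can be passed to pandas.DataFrame(records) to get one row per comment.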
Get statistics
With the data neatly in a dataframe I was able to figure out all kinds of statistics, starting with the most basic ones:
Total amount of posts in AMA: 45
Total amount of time spent responding: 0 days 00:29:13
Amount of post / thread / link karma gained: 31422
Total amount of comment karma gained: 77026
Comment karma gained: Min 11 Max 6963 Average 1711 Median 813
Total amount of words: 660
Amount of words per post: Min 1 Max 48 Average 14 Median 13
Time between posts (s): Min 10 Max 224 Average 39 Median 28
Word length: Min 1 Max 14 Average 4 Median 4
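Statistics like the ones listed above can be computed with a few lines of standard-library Python. The sketch below is an illustration, not the original analysis code: the records are made up, and the field names posted, text and score are assumptions.

```python
from datetime import datetime
from statistics import mean, median

# Made-up sample records; NOT the data from the AMA.
posts = [
    {"posted": "2017-01-01T12:00:00", "text": "Yes.", "score": 7},
    {"posted": "2017-01-01T12:00:30", "text": "Thanks, that was a great trip.", "score": 120},
    {"posted": "2017-01-01T12:01:40", "text": "I would love to do another one", "score": 45},
]


def summarize(values):
    """Return the min, max, average and median of a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "average": mean(values),
        "median": median(values),
    }


# Sort the timestamps so that consecutive differences are the gaps between posts.
times = sorted(datetime.fromisoformat(p["posted"]) for p in posts)
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
words_per_post = [len(p["text"].split()) for p in posts]
time_spent = times[-1] - times[0]

print("Total amount of posts:", len(posts))
print("Time spent responding:", time_spent)
print("Words per post:", summarize(words_per_post))
print("Time between posts (s):", summarize(gaps))
print("Comment karma:", summarize([p["score"] for p in posts]))
```

With the data in a pandas dataframe the same numbers could also come from the dataframe's own aggregation methods; the plain-Python version is used here only to keep the example free of dependencies.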
With this data in place I figured it'd be more useful to plot the min/max/median/average values:
Having already generated a similar word list in my Simple Whatsapp Analysis code, I thought it might be fun to make a word cloud for a change.
Installing & building the word cloud was amazingly simple:
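A minimal sketch of what that can look like, assuming the word_cloud package is installed under its PyPI name wordcloud. The helper names word_frequencies and render_cloud are illustrative, the sample texts are made up, and the image size and background color are arbitrary choices.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count lower-cased words across all posts."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z']+", text.lower()))
    return Counter(words)


def render_cloud(frequencies, path="wordcloud.png"):
    """Draw the frequencies as a word cloud image (needs `pip install wordcloud`)."""
    # Imported here so that the counting above works without the package.
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)


# Made-up sample texts; NOT the comments from the AMA.
frequencies = word_frequencies(["Yes, I would.", "I would love to, yes!"])
print(frequencies.most_common(3))
```

Calling render_cloud(frequencies) would then write the image to wordcloud.png.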
However, I noticed that the library can also use a mask, meaning that the word cloud can be cast into a certain shape. The image Ewan McGregor used to
provide proof that he was doing the AMA looked like a fun image to try it on. Obviously the mask needed to be generated through Python instead of something obvious like Photoshop.
After some tinkering I managed to get a good enough mask using an Otsu threshold, a closing filter and, finally, a watershed algorithm applied to the remainder of the image. The watershed algorithm
results in several labels, and figuring out which label belongs to the white sheet was something I had to do manually. Nevertheless, I'm happy with the result, even though I did spend a bit too much time on this, haha.
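To give an idea of what the first of those steps does, here is a dependency-free sketch of Otsu's method: it picks the gray level that best separates dark from bright pixels by maximizing the between-class variance. This is an illustration only, not the code used for the mask; skimage offers the same computation as skimage.filters.threshold_otsu, and the closing and watershed steps are not shown. The pixel values are made up, and turning the bright region into the drawable area assumes the words should land on the sheet.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels <= t form the dark class, pixels > t the bright class.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_t, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += histogram[t]
        sum_dark += t * histogram[t]
        weight_bright = total - weight_dark
        if weight_dark == 0:
            continue  # no pixels in the dark class yet
        if weight_bright == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_bright = (sum_all - sum_dark) / weight_bright
        variance = weight_dark * weight_bright * (mean_dark - mean_bright) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t


# Made-up 8-bit gray values: a dark background and a bright sheet of paper.
pixels = [20, 22, 25, 30, 220, 230, 240, 250]
threshold = otsu_threshold(pixels)
is_bright = [value > threshold for value in pixels]

# word_cloud treats pure white (255) as "do not draw here", so the bright
# sheet becomes 0 (drawable) and everything else 255 (masked out).
mask_values = [0 if bright else 255 for bright in is_bright]
```

On a real image the values would form a 2D array, which can be passed to the word cloud through the mask argument of WordCloud.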
Finally, I wanted to make a plot that shows the amount of activity over time and, at the same time, the size of each post. In the plot below each line represents a post, with time on the X-axis.
The color indicates the size of the post. I tried out several other representations, for example using the height instead of the color to indicate the size of the post, but I preferred
this representation.
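A plot of this kind can be sketched as below, assuming matplotlib is installed. It is not the original plotting code: the function names, the field names posted and text, the viridis colormap and the sample records are all assumptions made for illustration.

```python
from datetime import datetime


def timeline_data(posts):
    """Return (seconds since first post, word count) per post, in time order."""
    stamped = sorted(
        (datetime.fromisoformat(p["posted"]), len(p["text"].split())) for p in posts
    )
    start = stamped[0][0]
    return [((when - start).total_seconds(), size) for when, size in stamped]


def plot_timeline(points, path="timeline.png"):
    """Draw one vertical line per post, colored by word count (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without opening a window
    import matplotlib.pyplot as plt
    from matplotlib.cm import ScalarMappable
    from matplotlib.colors import Normalize

    offsets = [offset for offset, _ in points]
    sizes = [size for _, size in points]
    norm = Normalize(vmin=min(sizes), vmax=max(sizes))
    cmap = plt.get_cmap("viridis")

    fig, ax = plt.subplots(figsize=(10, 2))
    # Every post is a full-height line at its time offset; color encodes size.
    ax.vlines(offsets, 0, 1, colors=[cmap(norm(size)) for size in sizes])
    ax.set_xlabel("seconds since first post")
    ax.set_yticks([])
    fig.colorbar(ScalarMappable(norm=norm, cmap=cmap), ax=ax, label="words in post")
    fig.savefig(path, bbox_inches="tight")


# Made-up sample records; NOT the data from the AMA.
posts = [
    {"posted": "2017-01-01T12:00:00", "text": "Yes."},
    {"posted": "2017-01-01T12:00:30", "text": "Thanks, that was a great trip."},
    {"posted": "2017-01-01T12:01:40", "text": "I would love to do another one"},
]
points = timeline_data(posts)
```

Calling plot_timeline(points) would write the figure to timeline.png. Encoding the size as line height instead would only require passing the sizes as the upper end of each line.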
To conclude, I'd like to say that this set-up is easily reusable for any account on Reddit, and this report can be generated for any thread in which the account is active. Aside from
selecting threads, it can be used for further analysis of accounts, for example to see in which other subreddits a user is active. I will toy around with this code more often in the future.