When
End of 2016 - current
What
This is a small part of a larger project I'm working on at the moment. Even so, I found it a valuable exercise in its own right, and one I'm happy to share my thoughts on.
Crawl with Scrapy
I still wanted to crawl second hand car websites to analyze their data, and I had already decided to work with Scrapy. The first thing I wanted to do was let Scrapy make its requests through a Tor proxy.
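As a point of reference, the bare bones of such a spider look roughly like this; the spider name, URL, and selectors are placeholders, not the real site:

```python
import scrapy

class CarAdSpider(scrapy.Spider):
    # Hypothetical name and start URL, purely illustrative.
    name = "car_ads"
    start_urls = ["https://www.example-used-cars.com/listings"]

    def parse(self, response):
        # Follow every advertisement link found on the listing page
        # (the CSS selector stands in for the real site's markup).
        for href in response.css("a.advert::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_ad)

    def parse_ad(self, response):
        self.logger.info("Fetched advertisement: %s", response.url)
```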
Polipo
Having not been able to get this working at the start of 2016, I wanted to try again with my newly gained knowledge. To my surprise, it went smoothly this time. It turns out Scrapy does not support the SOCKS5 protocol, which is why my earlier attempts to let Scrapy make requests directly through Tor were fruitless. I had already come across Polipo, an HTTP proxy that can forward its traffic to a SOCKS5 parent, and installing it was a breeze, especially with the Windows Subsystem for Linux on Windows 10. Having pointed Scrapy at Polipo and Polipo at Tor, I was quickly able to make requests with Scrapy through Tor.
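The wiring is straightforward: Scrapy speaks plain HTTP to Polipo, and Polipo forwards everything over SOCKS5 to Tor. A minimal Polipo configuration along these lines (the ports shown are the defaults; the exact values I used are not recorded here):

```
# /etc/polipo/config (or ~/.polipo)
proxyAddress = "127.0.0.1"          # Polipo's HTTP side, where Scrapy connects
proxyPort = 8123
socksParentProxy = "127.0.0.1:9050" # Tor's default SOCKS port
socksProxyType = socks5
```

On the Scrapy side, each request then only needs `meta={'proxy': 'http://127.0.0.1:8123'}`, which the built-in HttpProxyMiddleware picks up, and the request flows through Polipo into Tor.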
1000 Tor & 1000 Polipo instances
I wanted to see how far I could push it, and I had already written a script that fires up Polipo and Tor instances. With 16GB of RAM and 20GB of swap space, 1000 instances left only 5GB of swap free. That was obviously too much, so I settled on 500 Tor & Polipo instances, which left some RAM free for other work. On top of this, I force each Tor instance to get a new identity every 70-130 requests, to minimize the number of requests made per IP in a given time slot.
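The launcher script itself isn't included here, but a stripped-down sketch of the idea looks roughly like this; the port ranges, data directories, and the bare-socket NEWNYM handshake are my own placeholder choices, not the original script:

```python
import socket
import subprocess

N = 500             # number of Tor/Polipo pairs
SOCKS_BASE = 10000  # Tor SOCKS ports:   10000..10499
CTRL_BASE = 20000   # Tor control ports: 20000..20499
HTTP_BASE = 30000   # Polipo HTTP ports: 30000..30499

def launch_pair(i):
    """Start one Tor instance plus one Polipo instance chained to it."""
    subprocess.Popen([
        "tor", "--quiet",
        "--SocksPort", str(SOCKS_BASE + i),
        "--ControlPort", str(CTRL_BASE + i),
        "--DataDirectory", "/tmp/tor-%d" % i,
    ])
    subprocess.Popen([
        "polipo",
        "proxyPort=%d" % (HTTP_BASE + i),
        "socksParentProxy=127.0.0.1:%d" % (SOCKS_BASE + i),
        "socksProxyType=socks5",
        "daemonise=false",
    ])

def new_identity(i):
    """Ask Tor instance i for a fresh circuit, and thus a new exit IP.

    Assumes the control port is left unauthenticated; note that Tor
    rate-limits NEWNYM to roughly once per ten seconds per instance.
    """
    with socket.create_connection(("127.0.0.1", CTRL_BASE + i)) as s:
        s.sendall(b'AUTHENTICATE ""\r\n')
        s.recv(1024)
        s.sendall(b"SIGNAL NEWNYM\r\n")
        s.recv(1024)

for i in range(N):
    launch_pair(i)
```

A downloader middleware can then round-robin requests over the 500 Polipo ports and call `new_identity` once an instance has served a random threshold of requests, e.g. `random.randint(70, 130)`.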
Low bandwidth
I crawled a second hand car advertisement website to gather some initial data, and noticed the next morning that the crawler was still running. On average it downloaded at a slow 300 kb/s. This low bandwidth is expected: Tor is infamous for its low speed and high latency, and each request took on average approximately 2 seconds to complete. Quite remarkable, because when I briefly ran the crawler without a proxy I easily managed 8 Mb/s. Interesting to know, but not nice towards the servers. Despite the low speed it was fast enough for my goals: I eventually gathered around 180k advertisements, racking up 2.4GB. That data poses the next challenge.
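A quick back-of-envelope with these numbers (assuming the 500 instances ran fully in parallel) suggests the crawl was bandwidth-bound rather than latency-bound:

```python
ads = 180000
total_bytes = 2.4e9  # 2.4GB as reported above
latency_s = 2.0      # average time per request

print(total_bytes / ads / 1024.0)    # ~13 kB per advertisement
print(ads * latency_s / 3600.0)      # ~100 hours if done one request at a time
print(ads * latency_s / 500 / 60.0)  # ~12 minutes of latency spread over 500 proxies
```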
HTML parsing
Content-wise, this data looks a lot like the data that would be crawled from other websites: cars often have the same properties, such as an engine, a color, a number of doors, a gearbox type, etc. This similarity is something I will be exploiting in the future (see the sketch after the list below). For now I still need to fix a few small things in the crawler:
- Get Splash working over multiple proxies
- Add logging functionality
- Debug the proxy rotator
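To illustrate the shared-schema idea from above: every site-specific spider could map its own markup onto one common item. The field names and selectors here are hypothetical:

```python
import scrapy

class CarAdItem(scrapy.Item):
    """One advertisement mapped onto a schema shared across sites."""
    url = scrapy.Field()
    price = scrapy.Field()
    color = scrapy.Field()
    doors = scrapy.Field()
    gearbox = scrapy.Field()

class ExampleCarSpider(scrapy.Spider):
    name = "example_cars"  # hypothetical site-specific spider

    def parse_ad(self, response):
        # Selectors are placeholders; each site gets its own mapping
        # onto the same CarAdItem fields, which is what makes the
        # cross-site similarity exploitable later on.
        item = CarAdItem()
        item["url"] = response.url
        item["price"] = response.css(".price::text").extract_first()
        item["color"] = response.css(".color::text").extract_first()
        item["doors"] = response.css(".doors::text").extract_first()
        item["gearbox"] = response.css(".gearbox::text").extract_first()
        yield item
```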
After these things are fixed I will crawl the remaining selected second hand car websites for their data and start work on the clustering algorithm - my next big project.