I've used Scrapy previously, and agree it's a good tool in the right situation.
It could just be me, but I prefer more control, and using the requests library gave me that. I ran into some obscure problems when I wanted to use and switch between multiple proxies with Scrapy.
I think for a long-running spider, Scrapy may be worth looking into. For simple one-off scripts or things that aren't run very often, I'd much prefer to write something custom and not have to learn about Pipelines, Filters, etc.
> It could just be me, but I prefer more control, and using the requests library gave me that
I had exactly the same opinion until a month ago, when I started using Scrapy + scrapyd[1] for a serious scraping task. Yes, it's a long-running spider, which is why I decided to use Scrapy in the first place. But so far I have been really impressed with it in general and might consider it even for one-off crawling in the future. IMO, pipelines etc. are worth learning. I would also recommend using the httpcache middleware during development; it makes testing and debugging easy without having to download content again and again.
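For reference, turning the cache on is just a couple of settings (the values here are only illustrative, not my actual configuration):

    # settings.py
    HTTPCACHE_ENABLED = True
    HTTPCACHE_DIR = "httpcache"        # responses are cached under .scrapy/httpcache
    HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached responses never expire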
On a separate note, for one-off scraping it's also worth checking whether a Python script is required at all. For example, if you just need to download all images, wget with its recursive download option should work pretty well in most cases. A while back I wanted to download all Game of Life configurations (plain text files with a .cells extension) from a certain site[2]. Initially I wrote a Python script, but when I learnt about wget's -r option, I replaced it with the following one-liner:
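(The exact command isn't reproduced in this thread; a reconstruction from the description, with a placeholder URL, would be something along these lines.)

    wget -r -np -nd -A .cells http://example.com/patterns/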
FWIW, the distinct proxy business can be solved in one of two ways, depending on how complex your proxy needs are.
First, you can just set ``request.meta["proxy"] = "http://..."`` at any point in the downloader middleware chain before the request is actually transmitted.
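In code, that first approach is a small downloader middleware. A minimal sketch, where the class name, priority, and proxy list are all made up for illustration:

    # settings.py (illustrative priority; must run before the built-in HttpProxyMiddleware)
    # DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RotatingProxyMiddleware": 350}

    import random

    class RotatingProxyMiddleware:
        """Assign a proxy from a (made-up) pool to every outgoing request."""

        PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]  # placeholders

        def process_request(self, request, spider):
            # request.meta["proxy"] is what Scrapy's built-in HttpProxyMiddleware
            # reads to decide which proxy to route the request through.
            request.meta["proxy"] = random.choice(self.PROXIES)
            return None  # returning None lets the download proceed normally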
Second, you could also package up that "more control over requests" you described and make it a downloader middleware. AFAIK, the first one in the chain whose process_request returns a populated Response object wins, so you can short-circuit the built-in downloading mechanism entirely.
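A bare-bones sketch of that second approach, using the requests library as the replacement downloader (ignoring headers, cookies, retries, and error handling):

    import requests
    from scrapy.http import HtmlResponse

    class RequestsDownloaderMiddleware:
        """Fetch pages with requests instead of Scrapy's own downloader."""

        def process_request(self, request, spider):
            resp = requests.get(request.url, timeout=30)
            # Returning a Response from process_request short-circuits the rest of
            # the download chain; Scrapy hands this response straight to the spider
            # (after running the process_response methods of installed middlewares).
            return HtmlResponse(url=request.url, body=resp.content,
                                encoding="utf-8", request=request)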
Thanks, good to know regarding the proxy. There were a couple of other little things that just didn't work the way I wanted, though (I honestly don't remember them now).
I've built private libraries on top of requests now that let me do everything in a trivial amount of time, so I prefer this approach with more control.
I think if I were going to write a long-running spider, I'd probably look into Scrapy again beforehand.
I tried using this and it was too much of a straitjacket. It's really, really hard to write a decent framework (for anything!) and I don't think scrapy really succeeds.
In the end I dumped my scrapy code and replaced it with a combination of celery, mechanize and pyquery. That worked much better, used less code and was much more flexible.
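For a sense of what that combination can look like, here's a rough sketch (the broker URL, task name, and crawl logic are all made up, not the setup described above):

    from urllib.parse import urljoin

    import mechanize
    from celery import Celery
    from pyquery import PyQuery as pq

    app = Celery("crawler", broker="redis://localhost:6379/0")  # assumed local Redis broker

    @app.task
    def scrape(url):
        br = mechanize.Browser()
        br.set_handle_robots(False)      # mechanize honours robots.txt by default
        html = br.open(url).read()
        doc = pq(html)
        # Fan out: queue every link on the page as its own celery task.
        for a in doc("a").items():
            href = a.attr("href")
            if href:
                scrape.delay(urljoin(url, href))
        return doc("title").text()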
That's interesting. I'd love to hear more about your problems with it. My email is in my profile, feel free to drop me a line.
I deployed a small but moderately complex crawler, backed by a database, with Scrapy. It had some custom pipelining as well that deduplicated information (based on a hash) and backed off exponentially if a page hadn't changed in a while.
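The deduplication half of that can be sketched roughly like so (not the actual pipeline; a real one would persist the seen hashes in the database, and the exponential backoff would live in the scheduling logic rather than here):

    import hashlib

    from scrapy.exceptions import DropItem

    class HashDedupPipeline:
        """Drop items whose content hash has already been seen."""

        def __init__(self):
            self.seen = set()  # in-memory only; a real deployment would use the database

        def process_item(self, item, spider):
            digest = hashlib.sha1(repr(sorted(dict(item).items())).encode()).hexdigest()
            if digest in self.seen:
                raise DropItem(f"Duplicate item: {digest}")
            self.seen.add(digest)
            return item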
This may be hypocritical given what we are discussing, but please don't use Tor for this kind of activity.
There are hundreds of proxy vendors (and curators of open proxy lists, if you're on a tight budget) whose services are designed to spread out the origin of the load and do some minor masking of your traffic.
But it would make me very sad to have an exit node blocked for being involved with this.
[1] http://scrapy.org/