
Why not just use Scrapy[1]? It's built for this sort of thing, easily extensible, and written in Python.

[1] http://scrapy.org/



I've used Scrapy previously, and agree it's a good tool in the right situation.

It could just be me, but I prefer more control, and using the requests library gave me that. I ran into some obscure problems when trying to use and switch between multiple proxies with Scrapy.

I think for a long-running spider, Scrapy is probably worth looking into. For simple one-off scripts and things not run very often, I'd much prefer to write something custom and not have to learn about Pipelines, Filters, etc.


> It could just be me, but I prefer more control, and using the requests library gave me that

I had the exact same opinion until a month back, when I started using Scrapy + scrapyd[1] for a serious scraping task. Yes, it's a long-running spider, which is why I decided to use Scrapy in the first place. But so far I have been really impressed with it in general and might consider it even for one-off crawling in the future. IMO, pipelines etc. are worth learning. I would also recommend using the httpcache middleware during development; it makes testing and debugging easy without having to download content again and again.
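For reference, enabling the cache is just a few lines in a project's settings.py. This is a minimal sketch; the setting names come from the Scrapy docs, but the expiration and ignore-list values are just example choices:

    # settings.py -- cache responses on disk so repeated runs during
    # development don't hit the site again.
    HTTPCACHE_ENABLED = True
    HTTPCACHE_EXPIRATION_SECS = 0          # 0 = cached pages never expire
    HTTPCACHE_DIR = 'httpcache'            # stored under the project's .scrapy dir
    HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]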

On a separate note, for one-off scraping it's also worth checking whether a Python script is required at all. For example, if you just need to download all images, wget with its recursive download option works pretty well in most cases. A while back I wanted to download all Game of Life configurations (plain text files with a .cells extension) from a certain site[2]. I initially wrote a Python script, but when I learnt about wget's -r option, I replaced it with the following one-liner:

$ wget -P ./cells -nd -r -l 2 -A cells http://www.bitstorm.org/gameoflife/lexicon/

[1]: http://scrapyd.readthedocs.org/en/latest/

[2]: http://www.bitstorm.org/gameoflife/lexicon/


FWIW, the distinct proxy business can be solved in one of two ways, depending on how complex your proxy needs are.

First, you can just set ``request.meta["proxy"] = "http://..."`` at any point in a downloader middleware before the request is actually transmitted.
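As a rough sketch (the class name and proxy URLs below are placeholders, not anything from Scrapy itself), that can look like this in a project's middlewares module, registered under DOWNLOADER_MIDDLEWARES:

    # middlewares.py -- assign a proxy to each request before it is downloaded.
    # Scrapy's built-in HttpProxyMiddleware honours the 'proxy' meta key.
    import random

    PROXIES = ['http://proxy1:8080', 'http://proxy2:8080']  # placeholder list

    class RandomProxyMiddleware(object):
        def process_request(self, request, spider):
            request.meta['proxy'] = random.choice(PROXIES)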

Second, you could also package up that "more control over requests" you described and make it a downloader middleware. AFAIK, the first one in the chain that returns a populated Response object wins, so you can short-circuit the built-in downloading machinery entirely.
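A hedged sketch of that idea, using the requests library to do the actual fetch and handing Scrapy a ready-made Response (class name and the hard-coded encoding are assumptions for illustration):

    # middlewares.py -- returning a Response from process_request skips the
    # remaining downloader middlewares and Scrapy's built-in downloader.
    import requests
    from scrapy.http import HtmlResponse

    class RequestsDownloaderMiddleware(object):
        def process_request(self, request, spider):
            resp = requests.get(request.url)       # full control over the HTTP call
            return HtmlResponse(url=request.url,
                                body=resp.content,
                                status=resp.status_code,
                                encoding='utf-8',   # assumption; detect properly in real use
                                request=request)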


Thanks, good to know regarding the proxy. There were a couple of other little things that just didn't work the way I wanted, though (I honestly don't remember them now).

I've now built private libraries on top of requests that let me do everything in a trivial amount of time, so I prefer this approach and the extra control it gives me.

I think if I were going to write a long-running spider, I'd probably look into Scrapy again beforehand.


I have to agree that Scrapy is oriented more towards long-running spiders. For a one-off script, I'd probably do the same thing as you.


I tried using this and it was too much of a straitjacket. It's really, really hard to write a decent framework (for anything!) and I don't think scrapy really succeeds.

In the end I dumped my scrapy code and replaced it with a combination of celery, mechanize and pyquery. That worked much better, used less code and was much more flexible.


That's interesting. I'd love to hear more about your problems with it. My email is in my profile, feel free to drop me a line.

I deployed a small but moderately complex database-backed crawler with Scrapy. It had some custom pipelining as well, which deduplicated information (based on a hash) and exponentially backed off if a page hadn't changed in a while.
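For illustration, a minimal sketch of a hash-based dedup pipeline along those lines (the in-memory set and the hashing scheme are assumptions; a real deployment would track seen hashes in the database):

    # pipelines.py -- drop items whose content hash has been seen before.
    import hashlib
    from scrapy.exceptions import DropItem

    class DedupPipeline(object):
        def __init__(self):
            self.seen = set()   # placeholder; persist this in a real crawler

        def process_item(self, item, spider):
            digest = hashlib.sha1(repr(sorted(item.items())).encode('utf-8')).hexdigest()
            if digest in self.seen:
                raise DropItem('Duplicate item: %s' % digest)
            self.seen.add(digest)
            return item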


Agree. Scrapy is great for this sort of thing. It takes care of the little details you don't want to worry about.

I coupled Scrapy with Stem to control Tor, in case my IP gets blocked. Works great.
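The Stem side of that is roughly the following; the controller port and password are assumptions (Tor's control port defaults to 9051, and the actual requests would go through Tor's SOCKS proxy, typically on 9050):

    # ask Tor for a fresh circuit via Stem when the current exit gets blocked
    from stem import Signal
    from stem.control import Controller

    def new_tor_identity(password='controller-password'):   # placeholder password
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password=password)
            controller.signal(Signal.NEWNYM)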


This may be hypocritical given what we are discussing, but please don't use Tor for this kind of activity.

There are hundreds of proxy vendors (and curators of open proxies, if you're on a tight budget) that are designed to spread out the origin of the load and do some minor masking of your traffic.

But it would make me very sad to have an exit node blocked for being involved with this.


Scrapy makes things a lot easier. I was really surprised at how fast it crawled a site the first time I used it (I had forgotten to set a delay).
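Setting a delay is a couple of lines in settings.py; the values here are just example numbers, not recommendations:

    # settings.py -- throttle the crawl so it doesn't hammer the site
    DOWNLOAD_DELAY = 2                   # seconds between requests to the same site
    CONCURRENT_REQUESTS_PER_DOMAIN = 4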


Thanks for creating Scrapy! I have used it many times before and it is a great tool.


I did not create Scrapy; you would have to thank Pablo Hoffman [1] for that. :) He's the co-creator and lead developer.

[1] https://github.com/pablohoffman



