Python web scraping (jakeaustwick.me)
138 points by Jake232 on March 10, 2014 | 62 comments


Why not just use Scrapy[1]? It's built for this sort of thing, easily extensible, and written in Python.

[1] http://scrapy.org/
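For anyone who hasn't seen it, a spider is only a few lines. This is just a sketch (the site and selector are made up):

  import scrapy

  class TitlesSpider(scrapy.Spider):
      name = "titles"
      start_urls = ["http://example.com/"]

      def parse(self, response):
          # yield one item per heading link on the page
          for title in response.xpath("//h2/a/text()").extract():
              yield {"title": title}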


I've used Scrapy previously, and agree it's a good tool in the right situation.

It could just be me, but I prefer more control, and using the requests library gave me that. I ran into some obscure problems when wanting to use / change multiple proxies with Scrapy.

I think for a long running spider, then maybe scrapy is worth looking into. For simple one-off scripts / things not run very often, I'd much prefer to write something custom and not have to learn about Pipelines, Filters, etc.


> It could just be me, but I prefer more control, and using the requests library gave me that

I had the exact same opinion until a month back when I started using scrapy + scrapyd[1] for a serious scraping task. Yes, it's a long running spider which is why I decided to use Scrapy in the first place. But so far I have been really impressed with it in general and might consider it even for one-off crawling in future. IMO, pipelines etc. are worth learning. I would also recommend using the httpcache middleware during development which makes testing and debugging easy without having to download content again and again.
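If you want to try that, the cache is just a couple of standard settings in your project's settings.py, something like:

  # settings.py -- enable Scrapy's HTTP cache so repeated runs hit disk, not the site
  HTTPCACHE_ENABLED = True
  HTTPCACHE_EXPIRATION_SECS = 0        # 0 = cached pages never expire
  HTTPCACHE_DIR = "httpcache"          # stored under the project's .scrapy directory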

On a separate note, for one-off scraping, it's also worth checking whether a python script is required at all. For example, if you just need to download all images, wget with recursive download option should work pretty well in most cases. For eg. a while back I wanted to download all game of life configs (plain text files with .cells extension) from a certain site[2]. Initially I wrote a Python script, but when I learnt about wget's -r option, I replaced it with the following one-liner:

$ wget -P ./cells -nd -r -l 2 -A cells http://www.bitstorm.org/gameoflife/lexicon/

[1]: http://scrapyd.readthedocs.org/en/latest/

[2]: http://www.bitstorm.org/gameoflife/lexicon/


FWIW, the distinct proxy business can be solved in one of two ways, depending on how complex your proxy needs are.

First, you can just put ``request.meta["proxy"] = "http://..."`` at any point along the DownloadMiddleware before the request is actually transmitted.

Second, you could also just package up that "more control over requests" you described and make it a DownloadMiddleware. AFAIK, the first one in the chain that returns a populated Response object wins, so you could short-circuit all of the built-in downloading mechanism.
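A rough sketch of the first option, as a downloader middleware that picks a proxy per request (the proxy URLs are placeholders):

  import random

  class RandomProxyMiddleware(object):
      # hypothetical proxy list -- swap in your own
      PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]

      def process_request(self, request, spider):
          # the downloader honours request.meta["proxy"] when fetching
          request.meta["proxy"] = random.choice(self.PROXIES)

Enable it via DOWNLOADER_MIDDLEWARES in settings.py and every outgoing request gets assigned a proxy.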


Thanks, good to know regarding the proxy. There were a couple of other little things that just didn't work the way I wanted though (I honestly don't remember them now).

I've built private libraries on top of requests now that allow me to do everything in such a trivial amount of time, so I prefer this approach with more control.

I think if I was going to write a long running spider, I'd probably look into scrapy again beforehand.


I have to agree that Scrapy is oriented more towards long-running spiders. For a one-off script, I'd probably do the same thing as you.


I tried using this and it was too much of a straitjacket. It's really, really hard to write a decent framework (for anything!) and I don't think scrapy really succeeds.

In the end I dumped my scrapy code and replaced it with a combination of celery, mechanize and pyquery. That worked much better, used less code and was much more flexible.


That's interesting. I'd love to hear more about your problems with it. My email is in my profile, feel free to drop me a line.

I deployed a small but moderately complex crawler backed by a database with Scrapy. It had some custom pipelining as well that deduplicated information (based upon a hash) and exponentially backed-off if the page hadn't changed in a while.
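Roughly, a hash-based dedup pipeline is only a few lines. This is a sketch rather than my actual code (the field name is made up):

  import hashlib

  from scrapy.exceptions import DropItem

  class DedupPipeline(object):
      def __init__(self):
          self.seen = set()

      def process_item(self, item, spider):
          # hash the scraped content and drop anything we've seen before
          digest = hashlib.sha1(item["body"].encode("utf-8")).hexdigest()
          if digest in self.seen:
              raise DropItem("duplicate content: %s" % digest)
          self.seen.add(digest)
          return item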


Agree. Scrapy is great for this sort of thing. Takes care of little details you don't want to be worrying about.

I coupled scrapy with Stem to control Tor on the odd chance your IP gets blocked. Works great.


This may be hypocritical given what we are discussing, but please don't use Tor for this kind of activity.

There are hundreds of proxy vendors, and other curators of open proxies if you have a tight budget, which are designed to spread out the origin of the load and do some minor masking of your traffic.

But it would make me very sad to have an exit node blocked for being involved with this.


Scrapy makes things a lot easier. I was really surprised at how fast it crawled a site the first time I used it (forgot to set a delay).


Thanks for creating Scrapy, I have used it before many times and it is a great tool!


I did not create Scrapy; you would have to thank Pablo Hoffman [1] for that. :) He's the co-creator and lead developer.

[1] https://github.com/pablohoffman


Good resource - I've been using BeautifulSoup[1] for the scraper I set up for my needs, and it's probably worth checking out as well!

[1]: http://www.crummy.com/software/BeautifulSoup/
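For the unfamiliar, a tiny example of what that looks like (the selector is made up):

  import requests
  from bs4 import BeautifulSoup

  soup = BeautifulSoup(requests.get("http://example.com/").text, "lxml")
  for a in soup.select("div.story a"):
      print(a.get("href"), a.get_text(strip=True))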


Thanks! I very briefly mentioned BeautifulSoup in the lxml section of the prerequisites. I just added a short section on CSS selectors; you can actually use them with lxml. I'll add a little more info mentioning BeautifulSoup and PyQuery though.


People who write (Python) crawlers use BeautifulSoup because it was created with invalid input in mind. If you're crawling the wild wild internet, chances are you'll find a wide range of invalid markup, and lxml barfs on it (it's got recover=True, but that's not supported for everything lxml does).


I've written a lot of Python crawlers, and have never really experienced an issue with lxml. It doesn't really ever seem to choke, and I'm sure I must have come across some pretty funky markup in my time. lxml has always handled things fine for me. Maybe I've just gotten lucky, I guess?
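For what it's worth, a quick way to see how lenient it is -- lxml.html uses libxml2's recovering HTML parser, so even badly broken markup comes back as a usable tree:

  from lxml import html

  broken = "<p>unclosed <b>tags <div>and <span>nesting chaos"
  tree = html.fromstring(broken)
  # lxml closes the tags for you and hands back a normal element tree
  print(html.tostring(tree))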


That's not actually lxml's fault; it depends on the libxml2 installed in your environment. Just like Jake232, I haven't found a site that lxml can't handle lately, using libxml 2.9.1. You can find your libxml2 version in Python with:

  >>> from lxml import etree
  >>> etree.LIBXML_VERSION

If you find a page in the wild that it can't handle, I'd love to know the URL.


Even BeautifulSoup makes use of lxml now (as of version 4)


PyQuery is pretty good for navigating around the DOM too.

http://pythonhosted.org//pyquery/
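A quick taste of it, for anyone who hasn't tried it (jQuery-style selectors over an lxml-backed document):

  from pyquery import PyQuery as pq

  d = pq("<div><a href='/one'>one</a><a href='/two'>two</a></div>")
  for a in d("a"):
      # iterating yields plain lxml elements
      print(a.get("href"))
  print(d("a").eq(1).text())   # "two"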


Used this on many projects, great library.


Never used it, but looks useful. I'll add a note regarding it. Thanks.


Great summary on how to start on the topic, really nice! I only wish it was longer, as I love playing with scraping (regex lover here), and unfortunately not many people consider going straight with lxml + XPath, which is ridiculously fast. Sometimes I see people writing a bunch of lines to walk through a tree of elements with selectors in BeautifulSoup, or even a full Scrapy project, and then I'm like, "dude, why didn't you just extract that tiny data with a single XPath?". Suggestion for a future update: try covering the caveats of lxml (invalid pages [technically not lxml's fault but okay], limitations of XPath 1.0 compared to 2.0 [not supported in lxml], tricky charset detection) and maybe throw in a few code samples in both BS and lxml to compare when to use each of them :-)
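Just to show what I mean by a single XPath (URL and selector are made up):

  import requests
  from lxml import html

  tree = html.fromstring(requests.get("http://example.com/").text)
  # one XPath instead of a dozen lines of tree walking
  titles = tree.xpath("//h2[@class='title']/a/text()")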


Hey, thanks for the feedback!

I'm planning on adding more to the article in the near future; this was just a start. I plan on it being a resource with almost everything in one place, so people can bookmark it for future use.

I really haven't seen lxml choking on pages - are you sure you have the latest libxml2 installed? lxml always seems to work for me. If you know of a URL where it doesn't, I'd love to see it.


For a browser that runs JS the author mentions PhantomJS, but it looks like its Python support is iffy. Mechanize is super-easy in Python: http://www.pythonforbeginners.com/cheatsheet/python-mechaniz...

Edit: so easy, in fact, that I prefer to just START with using mechanize to fetch the pages--why bother testing whether or not your downloader needs JS, cookies, a reasonable user agent, etc--just start with them.


A hearty second for mechanize. It basically wraps urllib2 which is neat. Although, I've encountered situations where language support for distributed systems would have really saved some frustration. There's a version of mechanize for Erlang [1], which I intend on trying out whenever I get around to learning Erlang :)

[1] https://github.com/tokenrove/mechanizerl


Thanks for the mechanize link, I'll add a note about it.

Mechanize is awfully slow though; it's not asynchronous, so it's not great if you need to crawl quickly. I wouldn't want to use it for general crawling. I guess you could patch the stdlib with gevent and try to get something working that way.
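Something along these lines might work, though I haven't benchmarked it -- monkey-patch the stdlib before importing mechanize and run the fetches in greenlets (URLs are placeholders):

  from gevent import monkey; monkey.patch_all()

  import gevent
  import mechanize

  def fetch(url):
      br = mechanize.Browser()
      br.set_handle_robots(False)
      return br.open(url).read()

  urls = ["http://example.com/", "http://example.org/"]
  jobs = [gevent.spawn(fetch, u) for u in urls]
  gevent.joinall(jobs)
  print([len(job.value) for job in jobs])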


Like the OP, I needed more control over the crawling behaviour for a project. All the scraping code quickly became a mess though, so I wrote a library that lets you declaratively define the data you're interested in (think Django forms). It also provides decorators that allow you to specify imperative code for organizing the data cleanup before and after parsing. See how easy it is to extract data from the HN front page: https://github.com/aGHz/structominer/blob/master/examples/hn...

I'm still working on proper packaging so for the moment the only way to install Struct-o-miner is to clone it from https://github.com/aGHz/structominer.


Hey,

I've actually built something similar to this myself; I plan on writing an article in the future with something along these lines.

Yours looks pretty polished though, good job!


Yeah, I'm pretty sure anyone having to do any moderate amount of scraping eventually arrives at a similar solution.

Thanks! It's still a work in progress, so if you have anything you'd like to see in there, I'd love to hear about it (I also welcome code contributions if you're so inclined).


> [...] prefer to stick with lxml for raw speed.

Is parsing speed really an issue when you are scraping the data you are parsing?


If you're running on a server with 1Gbps, then yes - it can be an issue. Another issue (I'm going to add a section on it) is that you can peg the CPU at 100% very easily. Parsing HTML / running XPaths uses a lot of CPU, so if you're running a good number of threads this can quickly become a problem.


It's an issue if you have a high internet speed. I've had cases where the data can be downloaded faster than it can be parsed, overflowing the internal queue. It's even worse if your crawler is not asynchronous; in that case a slow parser slows down the crawling speed as well.


> Use the Network tab in Chrome Developer Tools to find the AJAX request, you'll usually be greeted by a response in json.

But sometimes you won't. Sometimes you'll be assaulted with a response in a proprietary, obfuscated, or encrypted format. In situations where reverse-engineering the Javascript is unrealistic (perhaps it is equally obfuscated), I recommend Selenium[1][2] for scraping. It hooks a remote control to Firefox, Opera, Chrome, or IE, and allows you to read the data back out.

[1]: http://docs.seleniumhq.org/ [2]: http://docs.seleniumhq.org/projects/webdriver/
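A minimal sketch of that approach (the page and selector are hypothetical):

  from selenium import webdriver

  driver = webdriver.Firefox()
  driver.get("http://example.com/ajax-heavy-page")
  # page_source is the rendered DOM, after the JavaScript has run
  print(driver.page_source)
  rows = driver.find_elements_by_css_selector("table.results tr")
  print([row.text for row in rows])
  driver.quit()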


Hi.

I mentioned selenium in the section below that, but I'll drop a note to it in the AJAX section too!


Test this real scrapper framework: https://github.com/hernan604/HTML-Robot-Scraper


It's scraper (single p).


Warning: Self-Advertisment!

http://www.joyofdata.de/blog/using-linux-shell-web-scraping/

Okay, it's not as bold as using headers saying "Hire Me", but I would like to emphasize that sometimes even complex tasks can be super-easy when you use the right tools. And a combination of Linux shell tools makes this task really very straightforward (literally).


This is really cool -- particularly hxselect -- but I don't see how it's particularly simpler than using Python/requests. How would you handle following links and deduping visited URLs via piping?


Thanks - well, recursive downloading of a web page using wget is no big deal at all. And regarding "particularly simpler" - it is particularly shorter, and it's pretty clear what's going on.


Thanks for the self-advertisement. That way, I was introduced to some fine reading on your blog.

Greetings from Hamburg.


Thanks for the kind words - you're most welcome


> If you need to extract data from a web page, then the chances are you looked for their API. Unfortunately this isn't always available and you sometimes have to fall back to web scraping.

Also, many times, not all functionality/features are available through the API.

Edit: By the way, without JS enabled, the code blocks on your website are basically unviewable (at least on Firefox).

http://imgur.com/CSYxMfL


This is a great starting point. Can anyone recommend any resources for how to best set up a remote scraping box on AWS or another similar provider? Pitfalls, best tools to help manage/automate scripts etc. I've found a few "getting started" tutorials like this one but I haven't been able to find anything good that discusses scraping beyond running basic scripts on your local machine.


http://scrapy.org/ is more systemic than these ad-hoc solutions, and http://scrapyd.readthedocs.org/en/latest/ is the daemon into which one can deploy scrapers if you want a little more structure.

The plugin architecture alone makes Scrapy a hands-down winner over whatever you might dream up on your own. And as with any good plugin architecture, it ships with several optional toys, you can always add your own, and there is a pretty good community (e.g. https://github.com/darkrho/scrapy-redis)

http://crawlera.com/ will start to enter into your discussion unless you have a low-volume crawler, and http://scrapinghub.com/ are the folks behind Crawlera and (AFAIK) sponsor (or actually do) the development for Scrapy.
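Once a project is deployed to scrapyd, kicking off a crawl is just a POST to its schedule.json endpoint (the project and spider names here are made up):

  import requests

  resp = requests.post(
      "http://localhost:6800/schedule.json",
      data={"project": "mycrawler", "spider": "products"},
  )
  print(resp.json())   # {"status": "ok", "jobid": "..."} on success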


I created a news aggregator completely based on a Python scraper. I run scraping jobs as periodic Celery tasks. Usually I would look into the RSS feeds (to start) and parse them with the "feedparser" module. Then I would use "pyquery" to find the Open Graph tags.

Pyquery is an excellent module, but I think its parser is not very "forgiving", so it might fail on some invalid/non-standard markup.
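Roughly what each task does -- this is a sketch rather than my actual code, and the feed URL is made up:

  import feedparser
  import requests
  from pyquery import PyQuery as pq

  feed = feedparser.parse("http://example.com/rss")
  for entry in feed.entries:
      doc = pq(requests.get(entry.link).text)
      og_title = doc('meta[property="og:title"]').attr("content")
      og_image = doc('meta[property="og:image"]').attr("content")
      print(entry.link, og_title, og_image)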


pyquery uses lxml, which is pretty forgiving of non-standard/invalid markup.


The part on how to avoid detection was particularly useful for me.

I use webscraping (https://code.google.com/p/webscraping/) + BeautifulSoup. What I like about webscraping is that it automatically creates a local cache of the page you access so you don't end up needlessly hitting the site while you are testing the scraper.
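That caching idea is easy to replicate even without the library -- this isn't its API, just the general pattern of writing each fetched page to disk keyed by a hash of the URL:

  import hashlib
  import os

  import requests

  CACHE_DIR = "cache"

  def cached_get(url):
      # key the cache file on a hash of the URL
      path = os.path.join(CACHE_DIR, hashlib.sha1(url.encode("utf-8")).hexdigest())
      if os.path.exists(path):
          with open(path, "rb") as f:
              return f.read()
      body = requests.get(url).content
      if not os.path.isdir(CACHE_DIR):
          os.makedirs(CACHE_DIR)
      with open(path, "wb") as f:
          f.write(body)
      return body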


He recommends http://51proxy.com/ - I was curious and signed up for their smallest package. It's been 24 hours since I created the account and I have 48 "Waiting" proxies, and I haven't been able to connect to either of the "Live" ones.

Has anyone had any success with them?


For those interested, I wrote a "web scraping 101" tutorial in the past that uses BeautifulSoup: http://www.gregreda.com/2013/03/03/web-scraping-101-with-pyt...


For all those mentioning scrapy, would it be a good fit for authenticated scraping with parameters (log in to different banks and get recent transactions)?


I do a fair amount of web scraping with Ruby. Can anyone who has dabbled in both weigh in? Does Python offer superior libraries/tools?


Hey, article author here.

I've done extensive scraping in both Python and Ruby, as I wrote most of the scraping / crawling code at http://serpiq.com, so I can chip in.

Overall, I prefer Python. That is pretty much solely down to the requests library though; it makes everything so simple and quick. I haven't covered it in the article yet, but you can extend the Response() class easily, so you can for example add methods like images(), links(nofollow=True), etc. Overall, I just think the requests library is much more polished than anything available in Ruby.
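Not my actual library code, but the general shape of it is just a thin wrapper (or subclass) that parses the body once and exposes helpers; the method names and semantics below are made up:

  import requests
  from lxml import html

  class Page(object):
      """Thin wrapper around a requests Response with scraping helpers."""

      def __init__(self, response):
          self.response = response
          self.tree = html.fromstring(response.content)

      def links(self, nofollow=False):
          # nofollow=True returns only rel="nofollow" links (illustrative semantics)
          xpath = "//a[@rel='nofollow']/@href" if nofollow else "//a/@href"
          return self.tree.xpath(xpath)

      def images(self):
          return self.tree.xpath("//img/@src")

  page = Page(requests.get("http://example.com/"))
  print(page.links())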

grequests (Python) means I can make things concurrent in a matter of minutes. However in Ruby the only capable library supporting concurrent HTTP requests that I liked was Typhoeus. It just wasn't to the same standard though, and I ran across certain issues when using proxies etc.
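For example, fanning out a batch of requests with grequests is about this much code (the URLs are placeholders):

  import grequests

  urls = ["http://example.com/page/%d" % i for i in range(1, 11)]
  pending = (grequests.get(u) for u in urls)
  for response in grequests.map(pending, size=5):   # at most 5 requests in flight
      if response is not None:                      # None means the request failed
          print(response.url, response.status_code)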

As far as the HTML parsing goes, I don't really have any preference. Nokogiri and lxml are both equally capable.

I think they're both perfectly capable languages though, stick with what you prefer. I've been experimenting with Go lately.


Scrapy, mentioned here, is really easy; or, as the article mentions, a combination of requests, lxml, and something like PhantomJS if there is JavaScript involved makes scraping sites in Python a nice experience.

I have briefly made something with Scrapy to scrape my university's websites to notify me when we get new exam results [1], and that was okay. I might be slightly abusing scrapy but it was an okay experience.

Previously I have used 'scrubyt' for Ruby to scrape things, but their homepage seems to lead to a skin-related website now. What tools/libraries do you use for scraping stuff with Ruby these days? I remember Mechanize and Nokogiri was good/decent, but it's been more than a few years since I last used Ruby.

[1] https://github.com/flexd/studweb (description in Norwegian but it's not important)


I've used both Python and Ruby for web scraping. Whilst Python is my language of choice for most things, I enjoyed the web scraping experience more with Ruby (in particular, Nokogiri). Maybe it's just bad luck on my part, but I tend to find Unicode issues when scraping with Python 2.x, whereas Ruby has had decent Unicode support for a while. I've not used Python 3.x. YMMV.


RE: "whereas Ruby has had decent Unicode support for a while"

Really? I'd love for some examples for where Ruby shines when it comes to Unicode handling when dealing with web content.

I know a lot of work was done in Ruby 1.9+ to bring decent Unicode encoding support to the language, but I still see a good number of complaints/articles about issues with it.

The encoding issues I've run into with Python 2 have generally been cases where whatever framework I'm using to ingest the content took a website's declared encoding at face value: either it wasn't defined at all or it was defined incorrectly.

In the wild wild world of web, unless you're doing intelligent data inspection, you're just going to run into that sort of thing.

In python, that's why projects like this exist: https://github.com/LuminosoInsight/python-ftfy

They let you correct Unicode content that was decoded with the wrong encoding.
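It's basically a one-call fix (the sample string is typical UTF-8-decoded-as-Latin-1 mojibake):

  from ftfy import fix_text

  # text that was UTF-8 but got decoded as Latin-1 somewhere along the way
  print(fix_text(u"Ã©tude in the cafÃ©"))   # -> "étude in the café"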


Author of ftfy here - thanks for the shoutout.

By the way, here's another problem with taking the "encoding" parameter at face value: you're opening yourself up to DoS or data corruption bugs in the case where someone tells you to use a dangerous encoding.

There have been multiple bugs found in Python's UTF-7 decoder recently, and generally they were found by people who were scraping the Web with Python. These bugs, such as [1], could cause you to write strings that corrupt your data or crash the Python interpreter. And until the latest version -- and this is possibly still the case in all versions of Python 2 -- someone could give you a gzip bomb that decompresses to petabytes of data, and tell you it's in the "gzip" encoding [2].

I'm sure there are more bugs like this out there, and that Ruby has similar lurking bugs as well, given how recently they changed their Unicode system.

Basically, you shouldn't let someone else's Web page tell you what code to run, unless it's code you're planning to run. I recommend making a short list of encodings you trust, including ASCII, UTF-8, UTF-16, ISO-8859-x, Windows-125x, and MacRoman, and maybe a few others if you're working with CJK text, and just rejecting all others.

(The x's can be filled in with digits. Don't accept UTF-7, because it's clearly horrible. And I don't have any particular reason to be suspicious of UTF-32, but I've never seen anyone seriously use it.)
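A rough sketch of that whitelist idea in practice (the trusted set below is abridged and uses Python's canonical codec names; extend it to the full ISO-8859-x / Windows-125x families as needed):

  import codecs

  TRUSTED = {"ascii", "utf-8", "utf-16", "iso8859-1", "cp1252", "mac-roman"}

  def safe_decode(raw_bytes, declared_encoding):
      try:
          name = codecs.lookup(declared_encoding).name   # normalise aliases
      except LookupError:
          name = None
      if name not in TRUSTED:
          name = "utf-8"                                 # refuse anything unexpected
      return raw_bytes.decode(name, errors="replace")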

[1] http://bugs.python.org/issue19279

[2] http://bugs.python.org/issue20404


Thanks for sharing ftfy. I have had several issues that this little library appears to address perfectly.


Coolio. I'd use PyQuery instead of xpath.


Going to add a section on PyQuery, thanks!


How do the ruby web scrapers compare to the python ones?


Scrapy is made for this.



