
For a browser that runs JS, the author mentions PhantomJS, but its Python support looks iffy. Mechanize is super-easy in Python: http://www.pythonforbeginners.com/cheatsheet/python-mechaniz...

Edit: so easy, in fact, that I prefer to just start with mechanize to fetch the pages. Why bother testing whether your downloader needs cookies, a reasonable user agent, etc.? Just start with them.
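(Since mechanize basically sits on top of urllib2, here's roughly what that "cookies plus a reasonable user agent" setup looks like in Python 3 stdlib terms -- just a sketch, no mechanize required, and the URL is only a placeholder:)

```python
import http.cookiejar
import urllib.request

# Keep cookies across requests, as mechanize does automatically.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Send a browser-like User-Agent instead of the default "Python-urllib/x.y",
# which some sites block outright.
opener.addheaders = [("User-Agent", "Mozilla/5.0 (compatible; example-crawler)")]

# The actual fetch (a network call) would then be:
# html = opener.open("http://example.com").read()
```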



A hearty second for mechanize. It basically wraps urllib2, which is neat. That said, I've encountered situations where language support for distributed systems would have really saved some frustration. There's a version of mechanize for Erlang [1], which I intend to try out whenever I get around to learning Erlang :)

[1] https://github.com/tokenrove/mechanizerl


Thanks for the mechanize link, I'll add a note about it.

Mechanize is awfully slow, though: it isn't asynchronous, so if you need to crawl quickly I wouldn't want to use it for general crawling. I guess you could monkey-patch the stdlib with gevent and try to get something working that way.
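(Short of the gevent route, the stdlib thread pool is another way to overlap blocking fetches for I/O-bound crawling -- a minimal sketch, where the fetch body is just a stand-in for a real blocking mechanize/urllib call:)

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder: a real crawler would do a blocking mechanize/urllib
    # fetch here and return the page body.
    return "fetched " + url

urls = ["http://example.com/page/%d" % i for i in range(10)]

# Threads overlap the network waits, even though each individual
# fetch call blocks; results come back in input order.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, urls))
```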



