
There's phantomjs for that. Though I've actually had the most success with perl's WWW::Mechanize::Firefox and a headless X server.


You don't even need phantomjs, really. You can use python + webkitgtk+ through the gobject bindings. The problem with phantomjs is that it bundles an old version of webkit, inherited from the old qtwebkit that shipped with qt 4.8 (released in late 2011). By comparison, webkitgtk+ can be compiled from upstream webkit whenever you please.

If you insist on controlling webkit through javascript, you can use gnome-seedjs. But this is annoying because there's no commonjs implementation for it yet. In phantomjs you can require() node modules in the outside context; in gnome-seedjs, not so much.

Also, X means it's not actually headless, even if you're using xserver-xorg-video-dummy or xvfb. For this reason, phantomjs got rid of the X requirement a number of versions ago.
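
For anyone who does go the xvfb route anyway, here is a rough sketch of wrapping a scraper in xvfb-run from python. The helper names (xvfb_command, run_under_xvfb) and the screen geometry are made up for illustration; only the xvfb-run flags -a (pick a free display number) and -s (arguments passed to the X server) come from the tool itself.

```python
import shutil
import subprocess


def xvfb_command(argv, screen="1280x1024x24"):
    """Build the argv that runs `argv` inside a throwaway Xvfb display.

    Hypothetical helper: `screen` is WIDTHxHEIGHTxDEPTH for Xvfb's -screen option.
    """
    return ["xvfb-run", "-a", "-s", f"-screen 0 {screen}"] + list(argv)


def run_under_xvfb(argv):
    """Run a command under xvfb-run, failing early if it is not installed."""
    if shutil.which("xvfb-run") is None:
        raise RuntimeError("xvfb-run not found on PATH")
    # check=True raises CalledProcessError if the wrapped command fails.
    return subprocess.run(xvfb_command(argv), check=True)
```

Something like run_under_xvfb(["python3", "scrape.py"]) would then start the (hypothetical) scrape.py with DISPLAY pointing at the virtual framebuffer. As noted above, there is still an X server in the picture, just one without a screen attached.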


Do you have any other resources about using webkitgtk with python for this sort of purpose?


No, I haven't found a definitive tutorial or reference (even for webkit itself). In general, search for "from gi.repository import WebKit" and you will find relevant code. I'm not sure how the other phantomjs developers are learning webkit things, probably just by reading code.
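
In the absence of a tutorial, here is a minimal sketch of the python + webkitgtk idea: load a page in an offscreen view and read its title once loading finishes. It assumes PyGObject with the Gtk 3.0 and WebKit2 4.0 typelibs installed (the older binding is exposed as WebKit rather than WebKit2), and the function name fetch_title is made up. It still needs a display to connect to, so treat it as a starting point rather than a reference.

```python
def fetch_title(url):
    """Load `url` in an offscreen WebKitGTK view and return the page title."""
    # Imported lazily so this module can be imported without PyGObject.
    import gi
    gi.require_version("Gtk", "3.0")
    gi.require_version("WebKit2", "4.0")  # assumption: WebKit2GTK 4.0 is installed
    from gi.repository import Gtk, WebKit2

    result = {}

    # An offscreen window gives the view a toplevel without mapping it on screen.
    window = Gtk.OffscreenWindow()
    view = WebKit2.WebView()
    window.add(view)
    window.show_all()

    def on_load_failed(web_view, load_event, failing_uri, error):
        # Record the error; FINISHED is still emitted afterwards.
        result["error"] = error.message
        return False

    def on_load_changed(web_view, load_event):
        # FINISHED fires for both successful and failed loads.
        if load_event == WebKit2.LoadEvent.FINISHED:
            result["title"] = web_view.get_title()
            Gtk.main_quit()

    view.connect("load-failed", on_load_failed)
    view.connect("load-changed", on_load_changed)
    view.load_uri(url)

    Gtk.main()  # blocks until on_load_changed quits the loop
    window.destroy()

    if "error" in result:
        raise RuntimeError(result["error"])
    return result["title"]
```

Calling fetch_title("https://example.com/") should return that page's title string. There is no timeout here, so a page that never finishes loading will block forever; a real scraper would want to add one.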



Or you can use a full browser with an extension to do the scraping. That way you are always up to date with the latest browser release.



