On a side note, Amazon does not allow the use of its API for price trackers (they have made that clear), and it has lately made it very difficult to get the data in other ways.
If people can still access the same data just by going to Amazon with their browser, I don't see any fundamental reason why they wouldn't be able to get the data if they really wanted to. I've always believed that APIs on websites are more than just convenience - they're a way for corporations to exert control over the data they give out, sort of like DRM.
In the same way, they're also pretty trivial to circumvent. With web scraping, the major issue is that your traffic stands out, but distribute it widely enough and it won't look any different from the massive traffic Amazon gets already. You could make a browser plugin and have your users do it for you (just make sure your privacy policy is stated clearly...!)
I'm willing to bet that Amazon's data is already being scraped by many other sites, which completes the analogy: the "pirates" win, the "legitimate users" going through the API lose.
With all due respect: Spoken like someone who's never actually tried to do scraping on a large scale. This line just cracks me up:
> If people can still access the same data just by going to Amazon with their browser, I don't see any fundamental reason why they wouldn't be able to get the data if they really wanted to.
Just so naive, like a little kid saying they can build a rocket to the moon because they have a cardboard tube and some petrol.
Web scraping is difficult in its own right, and when the adversary is intentionally fighting it, it can be almost impossible. Modern web scrapers often have to be smart enough to incorporate almost a full web stack, because if you lack JS or CSS rendering it is trivial to hide content from you, or worse, to serve you bad content that real users will never even see (and keep in mind the scraper has to "understand" CSS using its own logic, since it cannot "view" the rendered result).
They can also dynamically alter how the page is rendered (e.g. if you're treating the DOM as a tree and walking down to a specific branch, it is trivial to re-order the HTML to break that). CSS effectively removes the relationship between HTML tag order and position on the page.
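To make the fragility concrete, here's a toy sketch using only Python's stdlib `html.parser`. The markup, class names, and prices are invented for illustration (nothing here is Amazon's real HTML): a positional selector ("take the second span") breaks the moment the site re-orders its tags, while matching on an attribute survives the shuffle.

```python
# Toy illustration: positional DOM selectors break under re-ordering,
# attribute-based selection survives. All markup here is made up.
from html.parser import HTMLParser

class AttrGrabber(HTMLParser):
    """Collects <span> text, both in document order and filtered by attribute."""
    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self.capture = False
        self.spans = []    # every span's text, in document order
        self.matches = []  # only spans whose attribute matches

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.capture = True
            self.match = dict(attrs).get(self.attr) == self.value

    def handle_data(self, data):
        if self.capture:
            self.spans.append(data)
            if self.match:
                self.matches.append(data)
            self.capture = False

page_v1 = '<span>Title</span><span class="price">$9.99</span>'
page_v2 = '<span class="price">$9.99</span><span>Title</span>'  # re-ordered

for page in (page_v1, page_v2):
    p = AttrGrabber("class", "price")
    p.feed(page)
    # spans[1] is "grab the second span" (positional); matches[0] is attribute-based
    print(p.spans[1], "|", p.matches[0])
```

On the re-ordered page the positional lookup returns "Title" instead of the price, while the attribute lookup still finds "$9.99". Real scraper-vs-site fights involve far nastier tricks than tag re-ordering, but this is the basic failure mode.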
If I had a team of engineers working full time, I could likely maintain a 90%+ good Amazon web scraper after about a year of pure development on it (likely using something like WebKit as a starting point). But nobody puts in this type of effort, because sooner or later Amazon would sue you for something obscure and effectively win via lawyer fees alone.
An easier solution is just to hire Mechanical Turk-like people to look at pages, note down the price, and pay them 1c/price. It would cost an absolute ton of money, but you'd be able to start operating almost immediately (no up-front development overhead). Plus you could limit it to Amazon's top 100,000 items and nothing more.
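Back-of-the-envelope, using the numbers above (100,000 items at 1c each; the one-refresh-per-day rate is my assumption, not something stated):

```python
# Rough cost of the "human scraper" approach. Numbers are the
# comment's figures plus an assumed one price-check per item per day.
items = 100_000          # top Amazon items to track
cost_per_lookup = 0.01   # 1 cent paid per price noted down
refreshes_per_day = 1    # assumed refresh rate

daily_cost = items * cost_per_lookup * refreshes_per_day
yearly_cost = daily_cost * 365
print(f"${daily_cost:,.0f}/day, ${yearly_cost:,.0f}/year")
```

That's $1,000/day, or $365,000/year, i.e. roughly the cost of the small engineering team from the scraper option, which is why "an absolute ton of money" is about right either way.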
> I've always believed that APIs on websites are more than just convenience - they're a way for corporations to exert control over the data they give out, sort of like DRM.
Back in the days of the Twitter API fiasco (2012/2013) I was playing Mass Effect, and this little gem of wisdom stood out to me as eerily relevant:
"Your civilization is based on the technology of the mass relays, our technology. By using it, your society develops along the paths we desire. We impose order on the chaos of organic evolution. You exist because we allow it, and you will end because we demand it."
(context without spoilers for those who haven't played the game: mass relays are devices that allow long-range FTL travel, left behind by a mysterious ancient civilization; every species in the galaxy finds a mass relay at some point, which leads them to discovering other species and becoming a part of galactic community)
This is the key point. If all you are doing is scraping prices for a personal project, questionable legality is ok (even if clear legality would be preferable). But if you are trying to build a business, ya gotta stay on the right side of the law.
I don't think it's mentioned on the wiki, but there might also be a DMCA anti-circumvention case if the site you're scraping has technical measures in place to prevent the data from being copied.
Amazon has made it much harder to scrape their site in recent months. They now serve captchas to bots that access the site too often from the same IP address. Unless you have a very large pool of low-cost proxies, crawling millions of prices on a daily basis is challenging. I think this is what he's referring to when he says "and has lately made it very difficult to get the data in other ways."
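The proxy-pool approach usually boils down to rotating the exit IP on every request so no single address trips the rate limit. A minimal stdlib sketch (the proxy addresses are placeholders, and whether this actually evades any given site's detection is a separate fight):

```python
# Minimal sketch of per-request proxy rotation with the stdlib.
# Proxy addresses below are placeholders, not real endpoints.
import itertools
import urllib.request

PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]
_pool = itertools.cycle(PROXIES)  # round-robin over the pool

def opener_for_next_proxy():
    """Build a urllib opener that routes its requests through the next proxy."""
    proxy = next(_pool)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)

# Each fetch cycles to a new exit IP, so no single address hammers the site.
for _ in range(4):
    proxy, opener = opener_for_next_proxy()
    print(proxy)  # opener.open(url, timeout=10) would do the actual fetch
```

In practice you'd also rotate user agents, add jittered delays, and retire proxies that start getting captcha'd, but the round-robin core looks like this.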
The best way to do this is with your users, using a Chrome extension to feed the data back to you. User lands on a page? Send the data back to your scraper-processing API.
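The nice part of this pattern is that the server side is trivial: the user's browser already rendered the page, so your backend just validates and stores whatever the extension POSTs. A rough sketch of that ingestion step, with all field names invented for illustration:

```python
# Rough sketch of the server-side half of the "users scrape for you"
# pattern: validate one JSON payload a (hypothetical) extension POSTs.
# Field names ("asin", "price", "url") are invented for this example.
import json
import time

def parse_price_report(raw: bytes) -> dict:
    """Validate and normalize one price report sent by the extension."""
    report = json.loads(raw)
    for field in ("asin", "price", "url"):
        if field not in report:
            raise ValueError(f"missing field: {field}")
    return {
        "asin": report["asin"],
        "price": float(report["price"]),  # normalize "9.99" -> 9.99
        "url": report["url"],
        "seen_at": int(time.time()),      # timestamp server-side, not client-side
    }

payload = json.dumps({"asin": "B000EXAMPLE", "price": "9.99",
                      "url": "https://example.com/item"}).encode()
print(parse_price_report(payload))
```

Note the server assigns the timestamp itself; anything the client sends (including the price) is untrusted input, which is also why the validation step exists at all.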
I've been scraping Amazon for the past few years. Yeah, the captcha they introduced around last February is a pain, but there are ways around it. I have made, and still do make, my living from scraping (not just Amazon); lemme know if you are interested and maybe I can help someone out.