On a side note, Amazon does not allow the use of its API for price trackers (they have made that clear), and it has lately made it very difficult to get the data in other ways.
If people can still access the same data just by going to Amazon with their browser, I don't see any fundamental reason why they wouldn't be able to get the data if they really wanted to. I've always believed that APIs on websites are more than just convenience - they're a way for corporations to exert control over the data they give out, sort of like DRM.
In the same way, they're also pretty trivial to circumvent. With web scraping, the major issue is that your traffic stands out, but distribute it widely enough and it won't look any different from the massive traffic Amazon gets already. You could make a browser plugin and have your users do it for you (just make sure your privacy policy is stated clearly...!)
I'm willing to bet that Amazon's data is already being scraped by many other sites, which completes the analogy: the "pirates" win, the "legitimate users" going through the API lose.
With all due respect: Spoken like someone who's never actually tried to do scraping on a large scale. This line just cracks me up:
> If people can still access the same data just by going to Amazon with their browser, I don't see any fundamental reason why they wouldn't be able to get the data if they really wanted to.
Just so naive, like a little kid saying they can build a rocket to the moon because they have a cardboard tube and some petrol.
Web scraping is difficult in its own right, and when the adversary is intentionally fighting it, it can be almost impossible. Modern web scrapers often have to be smart enough to incorporate almost a full web stack, because if you lack JS or CSS rendering it is trivial to hide content from you, or worse, to serve you bad content that real users will never even see (and keep in mind the scraper has to "understand" CSS using its own logic, since it cannot "view" the rendered result).
They can also dynamically alter how the page is rendered (e.g. if you're treating the DOM as a tree and walking down to a specific branch, it is trivial to re-order the HTML to break that). CSS effectively removes the relationship between HTML tag order and position on the page.
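To make the fragility concrete, here's a toy sketch using only Python's stdlib `html.parser`. The markup, class names, and prices are invented for illustration (nothing here is Amazon's real HTML): a positional selector ("take the second span") breaks the moment the site re-orders its tags, while matching on an attribute survives the shuffle.

```python
# Toy illustration: positional DOM selectors break under re-ordering,
# attribute-based selection survives. All markup here is made up.
from html.parser import HTMLParser

class AttrGrabber(HTMLParser):
    """Collects <span> text, both in document order and filtered by attribute."""
    def __init__(self, attr, value):
        super().__init__()
        self.attr, self.value = attr, value
        self.capture = False
        self.spans = []    # every span's text, in document order
        self.matches = []  # only spans whose attribute matches

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.capture = True
            self.match = dict(attrs).get(self.attr) == self.value

    def handle_data(self, data):
        if self.capture:
            self.spans.append(data)
            if self.match:
                self.matches.append(data)
            self.capture = False

page_v1 = '<span>Title</span><span class="price">$9.99</span>'
page_v2 = '<span class="price">$9.99</span><span>Title</span>'  # re-ordered

for page in (page_v1, page_v2):
    p = AttrGrabber("class", "price")
    p.feed(page)
    # spans[1] is "grab the second span" (positional); matches[0] is attribute-based
    print(p.spans[1], "|", p.matches[0])
```

On the re-ordered page the positional lookup returns "Title" instead of the price, while the attribute lookup still finds "$9.99". Real scraper-vs-site fights involve far nastier tricks than tag re-ordering, but this is the basic failure mode.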
If I had a team of engineers working full time, I could likely maintain a 90%+ good Amazon web scraper after about a year of pure development on it (likely using something like WebKit as a starting point). But nobody puts in this type of effort, because sooner or later Amazon would sue you for something obscure and effectively win via lawyer fees alone.
An easier solution is just to hire Mechanical Turk-like people to look at pages, note down the price, and pay them 1c/price. It would cost an absolute ton of money, but you'd be able to start operating almost immediately (no up-front development overhead). Plus you could limit it to Amazon's top 100,000 items and nothing more.
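Back-of-the-envelope, using the numbers above (100,000 items at 1c each; the one-refresh-per-day rate is my assumption, not something stated):

```python
# Rough cost of the "human scraper" approach. Numbers are the
# comment's figures plus an assumed one price-check per item per day.
items = 100_000          # top Amazon items to track
cost_per_lookup = 0.01   # 1 cent paid per price noted down
refreshes_per_day = 1    # assumed refresh rate

daily_cost = items * cost_per_lookup * refreshes_per_day
yearly_cost = daily_cost * 365
print(f"${daily_cost:,.0f}/day, ${yearly_cost:,.0f}/year")
```

That's $1,000/day, or $365,000/year, i.e. roughly the cost of the small engineering team from the scraper option, which is why "an absolute ton of money" is about right either way.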
> I've always believed that APIs on websites are more than just convenience - they're a way for corporations to exert control over the data they give out, sort of like DRM.
Back in the days of the Twitter API fiasco (2012/2013) I was playing Mass Effect, and this little gem of wisdom stood out to me as eerily relevant:
"Your civilization is based on the technology of the mass relays, our technology. By using it, your society develops along the paths we desire. We impose order on the chaos of organic evolution. You exist because we allow it, and you will end because we demand it."
(context without spoilers for those who haven't played the game: mass relays are devices that allow long-range FTL travel, left behind by a mysterious ancient civilization; every species in the galaxy finds a mass relay at some point, which leads them to discovering other species and becoming a part of galactic community)
This is the key point. If all you are doing is scraping prices for a personal project, questionable legality is ok (even if clear legality would be preferable). But if you are trying to build a business, ya gotta stay on the right side of the law.
I don't think it's mentioned on the wiki, but there might also be a DMCA anti-circumvention case if the site you're scraping has technical measures in place to prevent the data from being copied.
Amazon has made it much harder to scrape their site in recent months. They now serve captchas to bots that access the site too often from the same IP address. Unless you have a very large pool of low-cost proxies, crawling millions of prices on a daily basis is challenging. I think this is what he's referring to when he says "and has lately made it very difficult to get the data in other ways."
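The proxy-pool approach usually boils down to rotating the exit IP on every request so no single address trips the rate limit. A minimal stdlib sketch (the proxy addresses are placeholders, and whether this actually evades any given site's detection is a separate fight):

```python
# Minimal sketch of per-request proxy rotation with the stdlib.
# Proxy addresses below are placeholders, not real endpoints.
import itertools
import urllib.request

PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]
_pool = itertools.cycle(PROXIES)  # round-robin over the pool

def opener_for_next_proxy():
    """Build a urllib opener that routes its requests through the next proxy."""
    proxy = next(_pool)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)

# Each fetch cycles to a new exit IP, so no single address hammers the site.
for _ in range(4):
    proxy, opener = opener_for_next_proxy()
    print(proxy)  # opener.open(url, timeout=10) would do the actual fetch
```

In practice you'd also rotate user agents, add jittered delays, and retire proxies that start getting captcha'd, but the round-robin core looks like this.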
The best way to do this is with your users, using a Chrome extension to feed the data back to you. User lands on a page? Send the data back to your scraper-processing API.
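The nice part of this pattern is that the server side is trivial: the user's browser already rendered the page, so your backend just validates and stores whatever the extension POSTs. A rough sketch of that ingestion step, with all field names invented for illustration:

```python
# Rough sketch of the server-side half of the "users scrape for you"
# pattern: validate one JSON payload a (hypothetical) extension POSTs.
# Field names ("asin", "price", "url") are invented for this example.
import json
import time

def parse_price_report(raw: bytes) -> dict:
    """Validate and normalize one price report sent by the extension."""
    report = json.loads(raw)
    for field in ("asin", "price", "url"):
        if field not in report:
            raise ValueError(f"missing field: {field}")
    return {
        "asin": report["asin"],
        "price": float(report["price"]),  # normalize "9.99" -> 9.99
        "url": report["url"],
        "seen_at": int(time.time()),      # timestamp server-side, not client-side
    }

payload = json.dumps({"asin": "B000EXAMPLE", "price": "9.99",
                      "url": "https://example.com/item"}).encode()
print(parse_price_report(payload))
```

Note the server assigns the timestamp itself; anything the client sends (including the price) is untrusted input, which is also why the validation step exists at all.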
I've been scraping Amazon for the past few years. Yeah, the captcha they introduced around last February is a pain, but there are ways around it. I have made, and still do make, my living from scraping (not just Amazon); lemme know if you are interested and maybe I can help someone out.