In addition to a rate limit, a per-IP page limit is needed; this is specifically for things like source code repos (with massive commit histories), mailing-list archives, etc.
A whitelist would be needed for sites where getting all the pages makes sense. And on top of the 1 Hz rate, an extra cap of around 1k pages/day would probably be needed.
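Roughly, a crawler-side politeness gate could look like the sketch below (Python; the PolitenessLimiter class, the example whitelist entry, and the exact constants are illustrative, not any particular crawler's implementation):

    import time
    from collections import defaultdict
    from datetime import date

    RATE_HZ = 1.0                       # max requests per second, per domain
    DAILY_CAP = 1_000                   # max pages per domain per day
    WHITELIST = {"en.wikipedia.org"}    # illustrative: sites where a full crawl makes sense

    class PolitenessLimiter:
        def __init__(self):
            self.last_hit = {}                   # domain -> timestamp of last request
            self.daily_count = defaultdict(int)  # domain -> pages fetched today
            self.today = date.today()

        def allow(self, domain: str) -> bool:
            """True if we may fetch another page from this domain right now."""
            if date.today() != self.today:       # reset the daily budgets at midnight
                self.daily_count.clear()
                self.today = date.today()
            if domain not in WHITELIST and self.daily_count[domain] >= DAILY_CAP:
                return False                     # daily budget for this domain exhausted
            now = time.monotonic()
            last = self.last_hit.get(domain)
            if last is not None and now - last < 1.0 / RATE_HZ:
                return False                     # too soon since the previous request
            self.last_hit[domain] = now
            self.daily_count[domain] += 1
            return True

    limiter = PolitenessLimiter()
    if limiter.allow("example.com"):
        pass  # fetch the page here
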
I can see now why Google doesn't have much solid competition (Yandex/Baidu arguably don't compete due to network segmentation).
Scraping reliably is hard, and the chance of kicking Google off their throne may be reduced even further by AI crawler abuse.
PS: 958k hits is a lot! Even if your pages were a tiny 7.8 kB each (HN front page minus assets), that would be about 7 GB of data (about 4.6 Bee Movies in 720p H.265).
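For reference, the back-of-envelope math (the ~1.6 GB assumed for one 720p H.265 encode is my own guess to make the ratio work out):

    hits = 958_000
    page_kb = 7.8                        # HN front page minus assets
    total_gb = hits * page_kb / 1e6      # ≈ 7.5 GB of HTML
    bee_movie_gb = 1.6                   # assumed size of one 720p H.265 encode
    print(total_gb / bee_movie_gb)       # ≈ 4.6 "Bee Movies"
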