Hacker News

"Thanks to LLM scrapers, hosting costs went up 5000% last month"


Uggghhhh! AI crawling is fast becoming a headache for self-hosted content. Is using a CDN the "lowest effort" solution? Or is there something better/simpler?


Nah, just add a rate limiter (which any public website should have anyway). Alternatively, add some honeypot URLs to robots.txt, then set up fail2ban to ban any IP that accesses those URLs, and you'll get rid of 99% of the crawling in half a day.
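An untested sketch of that honeypot setup, assuming nginx combined-format access logs and a made-up bait path (`/do-not-crawl/`):

```ini
# robots.txt bait (the path is an arbitrary example, never linked anywhere):
#   Disallow: /do-not-crawl/

# /etc/fail2ban/filter.d/honeypot.conf
[Definition]
failregex = ^<HOST> .* "(GET|POST|HEAD) /do-not-crawl/

# /etc/fail2ban/jail.d/honeypot.conf
[honeypot]
enabled  = true
port     = http,https
filter   = honeypot
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```

With maxretry = 1, a single fetch of the bait path earns a day-long ban; well-behaved crawlers never see it because robots.txt tells them to stay away.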


I gave up after blocking 143,000 unique IPs hitting my personal Forgejo server in a single day. Rate limiting would have done literally nothing against the traffic patterns I saw.


2 unique IPs or 200,000 shouldn't make a difference: automatically ban the ones that make too many requests and you basically don't have to do anything.

Are people not using fail2ban and similar tools at all anymore? It used to be standard practice, I guess until people started using PaaS instead and "running web applications" became a different role from "developing web applications".


It makes a difference if there are 143,000 unique IPs and 286,000 requests. I think that's what the parent post is saying (lots of requests, but not very many per IP, since there are also lots of IPs).

Even harder with IPv6, considering things like privacy extensions, where the IPs intentionally and automatically rotate.


Yes, this is correct. I’d get at most 2 hits from an IP, spaced minutes apart.

I went as far as blocking every AS that fetched a tripwire URL, but ended up blocking a huge chunk of the Internet, to the point that I asked myself whether it’d be easier to allowlist IPs, which is a horrid way to run a website.

But I did block IPv6 addresses as /48 networks, figuring that was a reasonable prefixlen for an individual attacker.
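That /48 heuristic is easy to apply mechanically before inserting bans: collapse each offending address to its covering network, e.g. with Python's ipaddress module. (Banning /48s mirrors the comment above; it's a judgment call about typical attacker allocations, not a standard.)

```python
import ipaddress

def ban_prefix(addr: str) -> str:
    """Collapse an offender's address to the network to ban:
    a /48 for IPv6 (rotating privacy addresses share the prefix),
    the single /32 host for IPv4."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 6:
        # strict=False masks off the host bits for us
        return str(ipaddress.ip_network(f"{addr}/48", strict=False))
    return str(ipaddress.ip_network(f"{addr}/32"))

print(ban_prefix("2001:db8:abcd:1234::5"))  # 2001:db8:abcd::/48
print(ban_prefix("203.0.113.7"))            # 203.0.113.7/32
```

Deduplicating the resulting prefixes before feeding them to nftables/ipset keeps the ban list small even when the raw IP list is huge.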


If only it were that easy.


And for many people, "easy" is hardly the word to describe that.

No wonder small businesses just put their information on Facebook instead of trying to manage a website.


I mean, people are held hostage by "professionals" who will set up some overcomplicated backend or Vercel stuff instead of a single static HTML page with the opening hours and the menu.


The poison's also the cure! Just ask AI for a haproxy rate limit config


It will give you one. Will it work? No. You seem not to understand that AI crawlers masquerade as many separate clients precisely to evade rate limiting, and they're quite skilled at it.
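For reference, the kind of config the AI would hand back is a per-source-IP stick table, roughly like this sketch (frontend and backend names are illustrative). It throttles a single abusive client just fine, but, as noted, it does nothing when each crawler IP only sends a request or two:

```
# haproxy.cfg fragment (illustrative, not a drop-in config)
frontend web
    bind :80
    # track request rate per source IP over a 10s window
    stick-table type ip size 1m expire 10m store http_req_rate(10s)
    http-request track-sc0 src
    # reject sources exceeding 20 req/10s
    http-request deny deny_status 429 if { sc_http_req_rate(0) gt 20 }
    default_backend app
```

A distributed crawler that stays under 20 requests per IP per 10 seconds sails straight through.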


Depending on the content and software stack, caching might be a fairly easy option. For instance, WordPress's W3 Total Cache used to be pretty easy to configure and could easily take a small VPS from 6-10 req/sec to 100-200 req/sec.

There are also solutions for generating static sites instead of a "dynamic" CMS that stores everything in a DB.

If it's a new deployment, I'd say the easiest option is to start with a content hosting system that has built-in caching (assuming one exists for what you're trying to deploy).
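The same idea works one layer up, at the web server: a rough illustration with nginx microcaching in front of a dynamic backend (paths, zone name, and upstream port are all assumptions). Even a TTL of a few seconds turns a crawler stampede into roughly one backend hit per URL per TTL:

```
# illustrative nginx microcaching fragment
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=microcache:10m
                 max_size=100m inactive=10m;

server {
    listen 80;
    location / {
        proxy_cache microcache;
        proxy_cache_valid 200 301 10s;      # even 10s absorbs crawler bursts
        proxy_cache_lock on;                # collapse concurrent misses into one upstream hit
        proxy_cache_use_stale updating error timeout;
        proxy_pass http://127.0.0.1:8080;   # assumed app backend
    }
}
```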



