
> Robots.txt does absolutely apply to LLM engines and search engines equally.

It does not. It applies to whatever crawler built the search index the LLM accesses, and it would apply to an AI agent using an LLM to work recursively, but it does not apply to the LLM itself or the feature being discussed here.

The rest of your comment seems to just be repeating what I already said:

> Whatever search index they are using, the crawler for that search index needs to respect robots.txt because it’s acting recursively. But when the user asks the LLM to look at web results, it’s just getting a single set of URLs from that index and fetching them – assuming it’s even doing that and not using a cached version. It’s not acting recursively, so robots.txt does not apply.

There is a difference between an LLM, an index that it consults, and the crawler that builds that index, and I was drawing that distinction. You can’t just lump an LLM into the same category, because it’s doing a different thing.



> It does not.

Yes it does. I am the one controlling robots.txt on my server. I can put whatever user agent I want into my robots.txt, and I can block as much of my site as I want from it.

People can argue semantics as much as they want; in the end, site admins decide what's in robots.txt and what isn't.
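As a sketch of that point: a site admin can name any user agent they like in robots.txt, including a hypothetical LLM fetcher (the bot name and paths below are made up for illustration):

```
# Block a hypothetical LLM user agent from the whole site
User-agent: ExampleLLMBot
Disallow: /

# Everyone else may crawl everything except a private area
User-agent: *
Disallow: /private/
```

Whether the fetcher honors this is, of course, exactly the disagreement here: robots.txt is advisory, not enforced by the server.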

And if people believe they can just ignore those rules, they are right, they can. But they are gonna find it rather difficult to ignore when fail2ban starts dropping their packets with no reply ;-)
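A minimal sketch of that enforcement, assuming an nginx access log and a made-up bot name (both file paths and the filter name are illustrative, not a definitive setup):

```
# /etc/fail2ban/filter.d/badbots.conf (hypothetical filter)
[Definition]
# Ban hosts whose requests carry the made-up user-agent string
failregex = ^<HOST> .*"ExampleLLMBot"

# /etc/fail2ban/jail.local (hypothetical jail)
[badbots]
enabled  = true
filter   = badbots
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400
```

Unlike robots.txt, this is enforced at the firewall level, so the client sees timeouts rather than a polite refusal.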



