robots.txt is intended to control recursive fetches. It is not intended to block any and all access.
You can test this out using wget. Fetch a URL with wget. You will see that it only fetches that URL. Now pass it the --recursive flag. It will now fetch that URL, parse the links, fetch robots.txt, then fetch the permitted links. And so on.
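Concretely (example.com is just a placeholder):

    # single fetch: wget grabs only this URL and never looks at robots.txt
    wget https://example.com/some-page.html

    # recursive fetch: wget downloads robots.txt first and only follows the links it permits
    wget --recursive https://example.com/

    # and even that politeness is optional
    wget --recursive -e robots=off https://example.com/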
wget respects robots.txt. But it doesn’t even bother looking at it if it’s only fetching a single URL because it isn’t acting recursively, so robots.txt does not apply.
The same applies to Claude. Whatever search index they are using, the crawler for that search index needs to respect robots.txt because it’s acting recursively. But when the user asks the LLM to look at web results, it’s just getting a single set of URLs from that index and fetching them – assuming it’s even doing that and not using a cached version. It’s not acting recursively, so robots.txt does not apply.
I know a lot of people want to block any and all AI fetches from their sites, but robots.txt is the wrong mechanism if you want to do that. It’s simply not designed to do that. It is only designed for crawlers, i.e. software that automatically fetches links recursively.
While robots.txt is not there to directly prevent automated requests, it does prevent crawling, which is what building a search index requires.
Without recursive crawling, it is not possible for an engine to know which URLs are valid[1]. It would otherwise have to brute-force, say, HEAD requests for every common string combination and see which ones return 404s, or, more realistically, crawl the site to "discover" pages.
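Roughly the difference, with made-up URLs:

    # guessing: fire HEAD requests at paths and see which ones aren't 404s
    curl -I https://example.com/pricing
    curl -I https://example.com/blog/some-guess

    # discovering: read the published sitemap, or spider the site link by link
    curl https://example.com/sitemap.xml
    wget --recursive --spider https://example.com/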
Summarizing a specific URL on demand is a different problem[2], unrelated to the issue at hand of search tools crawling at scale and depriving sites of traffic.
Robots.txt absolutely does apply to LLM engines and search engines equally. All such engines build indices of some kind (RAG stores, inverted indexes, whatever) by crawling, and LLM engines have at times been very aggressive about ignoring robots.txt limits, as many webmasters have reported over the last couple of years.
---
[1] Unless published in sitemap.xml of course.
[2] You need the specific URL before you can ask the LLM to summarize it, which usually means you already visited the page. But when someone shares a link with you and a tool automatically summarizes the page, the webmaster is deprived of impressions and thus ad revenue or sales.
That said, this has been a common usage pattern in messaging apps from Slack to iMessage for a decade or more, as well as in news aggregators and social media sites, and webmasters have managed to live with it one way or another.
> Robots.txt does absolutely apply to LLMs engines and search engines equally.
It does not. It applies to whatever crawler built the search index the LLM accesses, and it would apply to an AI agent using an LLM to work recursively, but it does not apply to the LLM itself or the feature being discussed here.
The rest of your comment seems to just be repeating what I already said:
> Whatever search index they are using, the crawler for that search index needs to respect robots.txt because it’s acting recursively. But when the user asks the LLM to look at web results, it’s just getting a single set of URLs from that index and fetching them – assuming it’s even doing that and not using a cached version. It’s not acting recursively, so robots.txt does not apply.
There is a difference between an LLM, an index that it consults, and the crawler that builds that index, and I was drawing that distinction. You can’t just lump an LLM into the same category, because it’s doing a different thing.
Yes it does. I am the one controlling robots.txt on my server. I can put whatever user agent I want into my robots.txt, and I can block as much of my site as I want from it.
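For example (the agent names below are commonly published AI crawler user agents; check each vendor's documentation for the current list):

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: *
    Disallow: /private/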
People can argue semantics as much as they want...in the end, site admins decide what's in robots.txt and what isn't.
And if people believe they can just ignore them, they are right, they can. But they are gonna find it rather difficult to ignore when fail2ban starts dropping their packets with no reply ;-)
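For what it's worth, a rough sketch of that setup (the jail name, regex, and log path are illustrative; match them to your own web server's log format):

    # /etc/fail2ban/filter.d/ai-bots.conf
    [Definition]
    failregex = ^<HOST> .* ".*(GPTBot|ClaudeBot|Claude-User).*"$

    # /etc/fail2ban/jail.local
    [ai-bots]
    enabled  = true
    port     = http,https
    filter   = ai-bots
    logpath  = /var/log/nginx/access.log
    maxretry = 1
    bantime  = 86400

The ban happens at the firewall, which is what actually drops the packets rather than politely asking.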
No it doesn’t. It politely requests that crawlers not crawl, and if a crawler chooses to honour it, then that crawler will not crawl. That’s it. It can be, and is, ignored without penalty or enforcement.
It’s like suggesting that putting a sign in your front yard saying “please don’t rob my house” prevents burglaries.
> Robots.txt does absolutely apply to LLMs engines and search engines equally
No it doesn’t, because again, it’s a request system. It applies only to whatever chooses to pay attention to it and, further, decides to abide by the requests within it, which nothing requires it to do.
From Google themselves:
“The instructions in robots.txt files CANNOT ENFORCE crawler behavior to your site; it's up to the crawler to obey them.”
And as already pointed out, there is no requirement a crawler follow them, let alone anything else.
If you want to control access, and you’re using robots.txt, you’ve no idea what you’re doing and probably shouldn’t be in charge of doing it.
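If you actually want to control access, do it somewhere you can enforce it, at the server or the firewall. A minimal sketch with nginx (the user agent list is illustrative, and of course it only stops clients that identify themselves honestly):

    # inside the relevant server {} block
    if ($http_user_agent ~* "(GPTBot|ClaudeBot|Claude-User|PerplexityBot)") {
        return 403;
    }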