
almost no one does, robots.txt is practically a joke at this point — right up there with autocomplete=off


In what circles is it a joke? Google bots seem to respect it on my sites according to logs.


I know an artist who had noindex turned on by mistake in robots.txt for the last 5 years - google, kagi and duckduckgo find tons of links relevant to the artist and the artwork, but not a single one from the artist's own website.

So it's not "seems to" or "apparently" but a matter of fact: robots.txt works for its intended audience.


Not being indexed is different from not being crawled.


AI crawlers are part of the intended audience.


It's in a small circle of those that do. Blame the internet archive for starting this trend: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


Given that websites do disappear or, worse, get their content adulterated, and given the Internet Archive's long history as a non-profit and the commons service it has provided so far, the real joke would be to see that bot honor it.


Sorry to intrude with something unrelated, but YC closed the earlier discussion. I saw your comment about Kannel WAP from a few months back and wanted to ask if you know of any WAP Push full-service provider still in operation.


Nah I just know about that public gateway I linked. I can't use it anymore as 2G was shut down on my local towers back in January.


lol IA did not start that, if anything they were late to the game. only the top handful of US-based search engines ever bothered respecting it in the first place


Apparently, the regular search crawler does it, but the ai thingie doesn't.


Can confirm. My website is flooded with AI bots despite attempts to block crawlers from certain parts of it.


Huh? You can add Google-Extended[1] to opt out from Generative AI summaries.

[1] https://blog.google/technology/ai/an-update-on-web-publisher...


Google will still scrape it for training data either way, this only impacts search results.


> Today we’re announcing Google-Extended, a new control that web publishers can use to manage whether their sites help *improve Bard and Vertex AI generative APIs*, including future generations of models that power those products.


https://www.theverge.com/news/630079/openai-google-copyright...

they're literally asking for permission to break copyright law to train AI for national security. A sentence in a press release from 2 years ago is worthless... look at what they're actually doing


A small number of search engines respect it, no one else does. Just about every content scraping bot ignores it, including a number of Google's.


I have replaced all robots.txt rules with simple WAF rules, which are cheaper to maintain than dealing with offending bots.
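The WAF approach boils down to rejecting known scraper user agents before the request reaches the application. A minimal sketch of the idea as WSGI middleware, assuming a hypothetical bot list and app (not the commenter's actual rules):

```python
# Toy WAF-style filter: reject requests from blocked user agents with a 403.
# The bot list and the wrapped app are illustrative assumptions.
BLOCKED_UA_SUBSTRINGS = ["GPTBot", "CCBot", "Bytespider", "ClaudeBot"]

def ua_firewall(app):
    """Wrap a WSGI app; return 403 Forbidden for blocked user agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bot.lower() in ua.lower() for bot in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]

guarded = ua_firewall(hello_app)
```

Cheap to maintain in the sense that there is one list to update, and the rule is enforced rather than merely advisory, which robots.txt never is.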


I do essentially both: robots.txt backed by actual server-level enforcement of the rules in robots.txt. You'd think there would be zero hits on the server-level blocking since crawlers are supposed to read and respect robots.txt, but unsurprisingly they don't always. I don't know why this isn't a standard feature in web hosting.
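Server-level enforcement of robots.txt amounts to applying the same check a polite crawler would perform, but on the server itself. A small sketch using Python's stdlib `urllib.robotparser`, with made-up rules and paths:

```python
# Sketch: enforce the site's own robots.txt rules server-side.
# The rules, user agents, and paths below are illustrative assumptions.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_enforcer(robots_txt: str):
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    def allowed(user_agent: str, path: str) -> bool:
        # Same can_fetch test a compliant crawler would run itself.
        return rp.can_fetch(user_agent, path)
    return allowed

allowed = make_enforcer(ROBOTS_TXT)
```

A request handler would call `allowed(ua, path)` and return 403 on a violation, so a crawler that ignores robots.txt still hits a wall.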


For my personal stuff I also included a Nepenthes tarpit. Works great and slows the bots down while feeding them garbage. Not my fault when they consume stuff robots.txt says they shouldn't.

I'm just not sure if legal would love me doing that on our corporate servers...
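Nepenthes is its own project, but the tarpit idea itself is simple: serve an endless maze of garbage pages whose links lead back into the maze, drip-fed slowly so the crawler wastes time and ingests junk. A toy sketch under those assumptions:

```python
# Toy tarpit page generator (not Nepenthes itself): emits junk HTML full
# of links back into the tarpit, throttled per chunk to slow crawlers.
import random
import time

def tarpit_page(seed: int, n_links: int = 10, delay: float = 0.0):
    """Yield chunks of a garbage page; `delay` throttles each chunk."""
    rng = random.Random(seed)  # deterministic per URL seed
    yield "<html><body>\n"
    for _ in range(n_links):
        word = "".join(rng.choice("abcdefghij") for _ in range(8))
        time.sleep(delay)  # in production this would be seconds, not zero
        yield f'<p><a href="/pit/{word}">{word}</a></p>\n'
    yield "</body></html>\n"

page = "".join(tarpit_page(seed=1))
```

Each generated link points at another tarpit URL, so a crawler that ignores robots.txt can loop in here indefinitely while real content stays untouched.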


The WAF rule matches based on the user agent header? Perplexity is known to use generic browser user agents to bypass that.



