
almost no one does, robots.txt is practically a joke at this point — right up there with autocomplete=off


In what circles is it a joke? Google bots seem to respect it on my sites according to logs.


I know an artist who had noindex turned on by mistake in robots.txt for the last 5 years - google, kagi and duckduckgo find tons of links relevant to the artist and the artwork, but not a single one from the artist's own website.

So it's not "seems to" or "apparently" but a matter of fact: robots.txt works for its intended audience.


Not being indexed is different from not being crawled.


AI crawlers are part of the intended audience.


It's in a small circle of those that do. Blame the internet archive for starting this trend: https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


Given that websites do disappear or, worse, get their content adulterated, and given the Internet Archive's long history as a non-profit and the commons service it has provided so far, the real joke would be to see that bot honor it.


Sorry to intrude with something unrelated, but YC closed the earlier discussion. I saw your comment about Kannel WAP from a few months back and wanted to ask if you know of any WAP Push full-service provider still in operation.


Nah I just know about that public gateway I linked. I can't use it anymore as 2G was shut down on my local towers back in January.


lol IA did not start that, if anything they were late to the game. only the top handful of US-based search engines ever bothered respecting it in the first place


Apparently, the regular search crawler does it, but the ai thingie doesn't.


Can confirm. My website is flooded with AI bots despite attempts to block crawlers from certain parts of it.


Huh? You can add Google-Extended[1] to opt out from Generative AI summaries.

[1] https://blog.google/technology/ai/an-update-on-web-publisher...


Google will still scrape it for training data either way, this only impacts search results.


> Today we’re announcing Google-Extended, a new control that web publishers can use to manage whether their sites help *improve Bard and Vertex AI generative APIs*, including future generations of models that power those products.


https://www.theverge.com/news/630079/openai-google-copyright...

they're literally asking for permission to break copyright law to train AI for national security. A sentence in a press release from 2 years ago is worthless... look at what they're actually doing


A small number of search engines respect it, no one else does. Just about every content scraping bot ignores it, including a number of Google's.


I have replaced all robots.txt rules with simple WAF rules, which are cheaper to maintain than dealing with offending bots.
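The WAF approach boils down to rejecting known scraper user agents before the request reaches the application. A minimal sketch of the idea as WSGI middleware, assuming a hypothetical bot list and app (not the commenter's actual rules):

```python
# Toy WAF-style filter: reject requests from blocked user agents with a 403.
# The bot list and the wrapped app are illustrative assumptions.
BLOCKED_UA_SUBSTRINGS = ["GPTBot", "CCBot", "Bytespider", "ClaudeBot"]

def ua_firewall(app):
    """Wrap a WSGI app; return 403 Forbidden for blocked user agents."""
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(bot.lower() in ua.lower() for bot in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]

guarded = ua_firewall(hello_app)
```

Cheap to maintain in the sense that there is one list to update, and the rule is enforced rather than merely advisory, which robots.txt never is.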


I do essentially both: robots.txt backed by actual server-level enforcement of the rules in robots.txt. You'd think there would be zero hits on the server-level blocking since crawlers are supposed to read and respect robots.txt, but unsurprisingly they don't always. I don't know why this isn't a standard feature in web hosting.
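Server-level enforcement of robots.txt amounts to applying the same check a polite crawler would perform, but on the server itself. A small sketch using Python's stdlib `urllib.robotparser`, with made-up rules and paths:

```python
# Sketch: enforce the site's own robots.txt rules server-side.
# The rules, user agents, and paths below are illustrative assumptions.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def make_enforcer(robots_txt: str):
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    def allowed(user_agent: str, path: str) -> bool:
        # Same can_fetch test a compliant crawler would run itself.
        return rp.can_fetch(user_agent, path)
    return allowed

allowed = make_enforcer(ROBOTS_TXT)
```

A request handler would call `allowed(ua, path)` and return 403 on a violation, so a crawler that ignores robots.txt still hits a wall.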


For my personal stuff I also included a Nepenthes tarpit. Works great and slows the bots down while feeding them garbage. Not my fault when they consume stuff robots.txt says they shouldn't.

I'm just not sure if legal would love me doing that on our corporate servers...
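Nepenthes is its own project, but the tarpit idea itself is simple: serve an endless maze of garbage pages whose links lead back into the maze, drip-fed slowly so the crawler wastes time and ingests junk. A toy sketch under those assumptions:

```python
# Toy tarpit page generator (not Nepenthes itself): emits junk HTML full
# of links back into the tarpit, throttled per chunk to slow crawlers.
import random
import time

def tarpit_page(seed: int, n_links: int = 10, delay: float = 0.0):
    """Yield chunks of a garbage page; `delay` throttles each chunk."""
    rng = random.Random(seed)  # deterministic per URL seed
    yield "<html><body>\n"
    for _ in range(n_links):
        word = "".join(rng.choice("abcdefghij") for _ in range(8))
        time.sleep(delay)  # in production this would be seconds, not zero
        yield f'<p><a href="/pit/{word}">{word}</a></p>\n'
    yield "</body></html>\n"

page = "".join(tarpit_page(seed=1))
```

Each generated link points at another tarpit URL, so a crawler that ignores robots.txt can loop in here indefinitely while real content stays untouched.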


The WAF rule matches based on the user agent header? Perplexity is known to use generic browser user agents to bypass that.



