I know an artist who had noindex turned on by mistake in robots.txt for the last 5 years - Google, Kagi and DuckDuckGo find tons of links relevant to the artist and the artwork, but not a single one from the website itself.
So not "seems to" or "apparently", but as a matter of fact: robots.txt works for its intended audience.
Given that websites do disappear or, worse, get their content adulterated, and given the Internet Archive's long history as a non-profit and the commons service it has provided so far, the joke would be to see that bot honor it.
Sorry to intrude with something unrelated, but YC closed the earlier discussion. I saw your comment about Kannel WAP from a few months back and wanted to ask if you know of any WAP Push full-service provider still in operation.
lol IA did not start that, if anything they were late to the game. only the top handful of US-based search engines ever bothered respecting it in the first place
> Today we’re announcing Google-Extended, a new control that web publishers can use to manage whether their sites help *improve Bard and Vertex AI generative APIs*, including future generations of models that power those products.
They're literally asking for permission to break laws to train AI in the name of national security. A sentence in a press release from 2 years ago is worthless... look at what they're actually doing.
I do essentially both: robots.txt backed by actual server-level enforcement of the same rules. You'd think there would be zero hits on the server-level blocking, since crawlers are supposed to read and respect robots.txt, but unsurprisingly they don't always. I don't know why this isn't a standard feature in web hosting.
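A minimal sketch of what I mean by server-level enforcement, assuming a Python/WSGI setup and a robots.txt file sitting next to the app (ROBOTS_PATH and the middleware name are just placeholders, not any particular hosting feature): any request whose User-Agent is disallowed for that path by the published robots.txt gets a 403 instead of relying on the crawler's good behaviour.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_PATH = "robots.txt"  # hypothetical path to the same robots.txt the site serves

def load_rules(path=ROBOTS_PATH):
    """Parse the served robots.txt into a RobotFileParser we can query per request."""
    parser = RobotFileParser()
    with open(path) as f:
        parser.parse(f.read().splitlines())
    return parser

def enforce_robots(app, rules=None):
    """WSGI middleware: reject requests that robots.txt says the client shouldn't make."""
    rules = rules or load_rules()

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "*")
        path = environ.get("PATH_INFO", "/") or "/"
        # robots.txt itself must stay reachable, or crawlers can never read the rules
        if path != "/robots.txt" and not rules.can_fetch(agent, path):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Blocked: robots.txt disallows this path for your user agent\n"]
        return app(environ, start_response)

    return middleware
```

Wrap whatever app you're serving in enforce_robots() and the advertised rules become actual rules; matching is the same substring-style User-Agent matching RobotFileParser applies when reading robots.txt.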
For my personal stuff I also included a Nepenthes tarpit. Works great and slows the bots down while feeding them garbage. Not my fault when they consume stuff robots.txt says they shouldn't.
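Not Nepenthes itself (that's its own project with its own config), just a rough sketch of the tarpit idea it implements, assuming a WSGI app mounted under a /tarpit/ prefix that robots.txt already disallows: drip out nonsense and self-referential links very slowly, so crawlers that ignore robots.txt burn their time here instead of on real content.

```python
import random
import time

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "robots", "archive", "crawler"]

def tarpit_app(environ, start_response):
    """WSGI app for the disallowed /tarpit/ prefix: slow, endless-looking garbage."""
    start_response("200 OK", [("Content-Type", "text/html")])
    return generate_garbage()

def generate_garbage(paragraphs=50, delay=2.0):
    """Yield garbage HTML one chunk at a time, sleeping between chunks."""
    yield b"<html><body>"
    for _ in range(paragraphs):
        time.sleep(delay)  # the 'tar' in tarpit: keep the connection tied up
        text = " ".join(random.choices(WORDS, k=40))
        link = f'<a href="/tarpit/{random.randint(0, 10**9)}">more</a>'
        yield f"<p>{text} {link}</p>".encode()
    yield b"</body></html>"
```

Every page links to more randomly-named tarpit pages, so a crawler that followed a disallowed link once will keep finding "new" URLs to fetch at two seconds per paragraph.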
I'm just not sure if legal would love me doing that on our corporate servers...