IMO it's more vital than ever to fund projects like the Internet Archive. They're the only ones incentivized to maintain a snapshot of human knowledge from before LLM output clouded the training data, free of the hubris of "who cares about the old stuff, we should focus our archiving on the web as it exists today" that inevitably will take hold (or already has) at big tech companies, which will have laid off the vast majority of those voicing these concerns. We owe it to future generations to keep ourselves from falling into the training-cycle trap.
The Internet Archive does an amazing job, but the problem is that it does so despite the task being fundamentally intractable. Web content expands too quickly and too massively.
I wonder if the answer is a network of topic-focused archives; like moving from a "Library of Alexandria" model to a modern nationwide system of libraries.
"Web content expands too quickly and too massively."
If most of it is crap I would call not archiving it a feature.
There is a weird, convoluted analogue in CERN's particle detectors. They smash particles together and image the resulting storm of particle contrails with a detector that is basically a sandwiched CCD sensor (like the one in your camera, but different) the size of a cathedral. This produces far too much data for any system to analyze, or even store in the first place. Hence they need/needed to filter the massive stream of particle-trail signals at runtime and pick out only the critical ones.
If there is too much data you simply need to drop the parts you are fairly confident you don’t need.
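As a sketch of that kind of runtime filtering applied to archiving, a crawler could score each candidate page and drop the ones it is fairly confident it doesn't need before they ever hit storage. The heuristics and thresholds below are hypothetical, purely for illustration:

```python
# Minimal sketch of runtime filtering for an archiver crawler.
# The deny-list and size threshold are hypothetical, not real archiving policy.
from urllib.parse import urlparse

# Hypothetical path/query fragments that tend to mark low-value duplicates
# (pagination, sort orders, session junk).
LOW_VALUE_FRAGMENTS = ("/tag/", "/page/", "?sort=", "?session=")

def should_archive(url: str, content_length: int) -> bool:
    """Return True if the URL looks worth storing a snapshot of."""
    parsed = urlparse(url)
    path_and_query = parsed.path + ("?" + parsed.query if parsed.query else "")
    if any(frag in path_and_query for frag in LOW_VALUE_FRAGMENTS):
        return False  # likely pagination/session noise
    if content_length < 256:
        return False  # too small to be a meaningful document
    return True

urls = [
    ("https://example.org/essay/archives-matter", 40_000),
    ("https://example.org/blog/page/37?sort=new", 12_000),
]
kept = [u for u, size in urls if should_archive(u, size)]
# Only the essay survives the filter; the paginated listing is dropped.
```

The point isn't these particular rules; it's that, as at CERN, the filter has to run at ingest time, because storing first and deciding later is exactly what doesn't scale.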
There is no reason there should be only one internet archive, there might very well be parallel operations filtering a bit different things.
I guess it's a bit odd that UNESCO does not already have a parallel effort.
Okay so you build a knowledge graph on top of the internet archive. Now you are struggling to prioritize the resources necessary to capture long-tail content that doesn't mesh easily into popular corpuses. I imagine this would lead to the library equivalent of an echo chamber.
I was thinking more of a federated "webring" structure, with some content being present in more than one node, and where maintenance and curation are distributed (and gathered independently) among nodes.
The nation of, say, Japan has limited interest in funding an American nonprofit today; but it would likely have a great deal of interest in funding an equivalent focused on Japanese content, for example.
Ah, so more like Mastodon or IPFS, but specifically for the purpose of federated archiving.
So now you get into the issue of haves and have nots. Who is allowed to be considered an authorized archivist from a robots.txt perspective? Or what happens if an archivist becomes blacklisted for not respectfully crawling? How do national sanctions affect the Internet Archive of Russia? I imagine there would be a certification process and it would probably cost some money.
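Sites already have a crude mechanism for this: robots.txt rules keyed on user-agent, so a site operator can authorize one archivist's crawler while blocking others. A minimal sketch of how a crawler would check its own standing (the bot names and rules here are hypothetical):

```python
# Sketch: checking whether a hypothetical archiver user-agent is permitted
# to crawl a path, using robots.txt rules parsed by Python's standard library.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site operator might publish to authorize one
# archive crawler while restricting everyone else.
ROBOTS_TXT = """\
User-agent: TrustedArchiveBot
Allow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The authorized archivist may fetch anything; an unknown crawler is
# kept out of /private/.
trusted_ok = rp.can_fetch("TrustedArchiveBot", "https://example.org/private/page")
unknown_ok = rp.can_fetch("RandomBot", "https://example.org/private/page")
```

Of course robots.txt is purely advisory, which is the heart of the governance problem: the "certification" would have to live in reputation and sanctions, not in the protocol.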
It's an interesting topic and I'm simply looking at the weak spots. I'm not against the overall concept though.
All legitimate questions, but if we only built perfect systems we would never have had TCP, let alone the pile of hacks we're now using to discuss this topic.
Distributed governance on the internet is a massive issue, and it's effectively unsolved for everything from peering to DNS. In practice, good faith goes a long way, particularly in areas that are largely academic in scope, like archiving.
The curator being bandwidth-limited is not necessarily a problem if the problem you are solving is an overwhelmed audience in need of a curator. In other words, the Archive missing things may not really be a problem if the stuff that is not missing is on average of value.
It raises the issue of governance of the curator, but the IA is already more transparent than Google & co.
You're right, and it has a better signal to noise ratio than the internet in general, even when you factor in the Wayback Machine! Here's to curated knowledge!