IMO it's more vital than ever to fund projects like the Internet Archive. They're the only ones incentivized to maintain a snapshot of human knowledge from before LLM output clouded the training data, free of the hubris of "who cares about the old stuff, we should focus our archiving on the web as it exists today" that inevitably will take hold (or already has) at big tech companies, which will have laid off the vast majority of those voicing these concerns. We owe it to future generations to keep ourselves from falling into the training-cycle trap.
The Internet Archive does an amazing job, but the problem is that it does so despite the task being fundamentally intractable. Web content expands too quickly and too massively.
I wonder if the answer is a network of topic-focused archives; like moving from a "Library of Alexandria" model to a modern nationwide system of libraries.
"Web content expands too quickly and too massively."
If most of it is crap I would call not archiving it a feature.
There is a weird, convoluted analogue in CERN's particle detectors. They smash particles together and image the resulting storm of particle contrails with a detector that is basically a sandwiched CCD sensor (like the one in your camera, but different) the size of a cathedral. This produces far too much data for any system to analyze, or even store in the first place. Hence they need/needed to filter the massive stream of particle-trail signals at runtime and pick out only the critical ones.
If there is too much data you simply need to drop the parts you are fairly confident you don’t need.
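As a sketch of that kind of runtime filtering applied to archiving, a crawler could score each candidate page and drop the ones it is fairly confident it doesn't need before they ever hit storage. The heuristics and thresholds below are hypothetical, purely for illustration:

```python
# Minimal sketch of runtime filtering for an archiver crawler.
# The deny-list and size threshold are hypothetical, not real archiving policy.
from urllib.parse import urlparse

# Hypothetical path/query fragments that tend to mark low-value duplicates
# (pagination, sort orders, session junk).
LOW_VALUE_FRAGMENTS = ("/tag/", "/page/", "?sort=", "?session=")

def should_archive(url: str, content_length: int) -> bool:
    """Return True if the URL looks worth storing a snapshot of."""
    parsed = urlparse(url)
    path_and_query = parsed.path + ("?" + parsed.query if parsed.query else "")
    if any(frag in path_and_query for frag in LOW_VALUE_FRAGMENTS):
        return False  # likely pagination/session noise
    if content_length < 256:
        return False  # too small to be a meaningful document
    return True

urls = [
    ("https://example.org/essay/archives-matter", 40_000),
    ("https://example.org/blog/page/37?sort=new", 12_000),
]
kept = [u for u, size in urls if should_archive(u, size)]
# Only the essay survives the filter; the paginated listing is dropped.
```

The point isn't these particular rules; it's that, as at CERN, the filter has to run at ingest time, because storing first and deciding later is exactly what doesn't scale.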
There is no reason there should be only one internet archive, there might very well be parallel operations filtering a bit different things.
I guess it's a bit odd that UNESCO does not already have a parallel effort.
Okay so you build a knowledge graph on top of the internet archive. Now you are struggling to prioritize the resources necessary to capture long-tail content that doesn't mesh easily into popular corpuses. I imagine this would lead to the library equivalent of an echo chamber.
I was thinking more of a federated "webring" structure, with some content being present in more than one node, and where maintenance and curation are distributed (and gathered independently) among nodes.
The nation of, say, Japan has limited interest in funding an American nonprofit today; but it would likely have a great deal of interest in funding an equivalent focused on Japanese content, for example.
Ah, so more like Mastodon or IPFS, but specifically for the purpose of federated archiving.
So now you get into the issue of haves and have nots. Who is allowed to be considered an authorized archivist from a robots.txt perspective? Or what happens if an archivist becomes blacklisted for not respectfully crawling? How do national sanctions affect the Internet Archive of Russia? I imagine there would be a certification process and it would probably cost some money.
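Sites already have a crude mechanism for this: robots.txt rules keyed on user-agent, so a site operator can authorize one archivist's crawler while blocking others. A minimal sketch of how a crawler would check its own standing (the bot names and rules here are hypothetical):

```python
# Sketch: checking whether a hypothetical archiver user-agent is permitted
# to crawl a path, using robots.txt rules parsed by Python's standard library.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site operator might publish to authorize one
# archive crawler while restricting everyone else.
ROBOTS_TXT = """\
User-agent: TrustedArchiveBot
Allow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The authorized archivist may fetch anything; an unknown crawler is
# kept out of /private/.
trusted_ok = rp.can_fetch("TrustedArchiveBot", "https://example.org/private/page")
unknown_ok = rp.can_fetch("RandomBot", "https://example.org/private/page")
```

Of course robots.txt is purely advisory, which is the heart of the governance problem: the "certification" would have to live in reputation and sanctions, not in the protocol.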
It's an interesting topic and I'm simply looking at the weak spots. I'm not against the overall concept though.
All legitimate questions, but if we only built perfect systems we would never have had TCP, let alone the pile of hacks we're now using to discuss this topic.
Distributed governance on the internet is a massive issue, and it's effectively unsolved for everything from peering to DNS. In practice, good faith goes a long way, particularly in areas that are largely academic in scope, like archiving.
The curator being bandwidth-limited is not necessarily a problem if the problem you are solving is an overwhelmed audience in need of a curator. In other words, the Archive missing things may not really be a problem if the stuff that is not missing is on average of value.
It raises the issue of governance of the curator, but the IA is already more transparent than Google & co.
You're right, and it has a better signal to noise ratio than the internet in general, even when you factor in the Wayback Machine! Here's to curated knowledge!