
Even with 1 TB of weights (the probable size of the largest state-of-the-art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.




This sounds very wrong to me.

Take the C4 training dataset, for example. The uncompressed, uncleaned dataset is ~6 TB and contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1 TB.

I could go on, but I think it's already pretty obvious that 1 TB is more than enough storage to represent a significant portion of the internet.
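
A quick back-of-envelope check of those figures (a Python sketch; the 4:1 text-compression ratio and the 0.8 TB stand-in for "significantly less than 1 TB" are assumptions, not numbers from the dataset documentation):

    # Compare the sizes cited above against 1 TB of weights.
    TB = 1e12  # decimal terabyte, in bytes

    c4_raw_bytes = 6 * TB        # uncleaned C4, per the comment above
    c4_clean_bytes = 0.8 * TB    # assumed value for "significantly less than 1 TB"
    weights_bytes = 1 * TB       # hypothesized model size from the parent comment

    assumed_compression_ratio = 4  # assumption: plain English text compresses roughly 4:1

    raw_compressed = c4_raw_bytes / assumed_compression_ratio
    clean_compressed = c4_clean_bytes / assumed_compression_ratio

    print(f"raw C4, compressed ~4:1:     {raw_compressed / TB:.2f} TB")
    print(f"cleaned C4, compressed ~4:1: {clean_compressed / TB:.2f} TB")
    print(f"fits in 1 TB of weights? raw={raw_compressed <= weights_bytes}, "
          f"cleaned={clean_compressed <= weights_bytes}")

At that assumed ratio, the cleaned scrape would fit in 1 TB of weights several times over, while even the raw scrape misses by only about 1.5x.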


This would imply that the English internet is not much bigger than 20x the size of the English Wikipedia.

That seems implausible.
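
For what it's worth, the implied arithmetic behind the "20x" figure seems to be something like the sketch below, where the ~50 GB estimate for English Wikipedia's uncompressed article text is an assumption for illustration, not a number given in the thread:

    GB = 1e9
    TB = 1e12

    english_scrape_bytes = 1 * TB        # "significantly less than 1 TB", rounded up
    assumed_enwiki_text_bytes = 50 * GB  # assumed size of English Wikipedia plain text

    ratio = english_scrape_bytes / assumed_enwiki_text_bytes
    print(f"English scrape / English Wikipedia text ~= {ratio:.0f}x")  # ~= 20x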


> That seems implausible.

Why, exactly?

Refuting facts with "I doubt it, bro" isn't exactly a productive contribution to the conversation.


Because we can count? How could you possibly think that Wikipedia was 5% of the whole Internet? It's just such a bizarrely foolish idea.

A lot of the internet is duplicate data, low-quality content, SEO spam, etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.

I would be extremely surprised if it was that small.

I was curious about the scale of 1TiB of text. According to WolframAlpha, it's roughly 1.1 trillion characters, which breaks down to 180.2 billion words, 360.5 million pages, or 16.2 billion lines. In terms of professional typing speed, that's about 3800 years of continuous work.

So post-deduplication, I think it's a fair assessment that a significant portion of high-quality text could fit within 1TiB. Though 'high-quality' is a pretty squishy and subjective term.
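
Those figures can be reproduced with simple arithmetic (a Python sketch; the characters-per-word, characters-per-page, characters-per-line, and typing-speed constants are assumptions chosen to match the quoted numbers, not values taken from WolframAlpha):

    # Back-of-envelope breakdown of 1 TiB of plain text.
    TIB = 2 ** 40                 # 1 TiB in bytes; ~1 byte per character of plain text

    chars = TIB                   # ~1.1 trillion characters
    words = chars / 6.1           # assume ~6.1 characters per word (incl. space)
    pages = chars / 3050          # assume ~3,050 characters per printed page
    lines = chars / 68            # assume ~68 characters per line

    typing_wpm = 90               # assume a fast professional typist
    typing_years = words / typing_wpm / (60 * 24 * 365.25)

    print(f"{chars / 1e12:.1f} trillion characters")
    print(f"{words / 1e9:.0f} billion words")
    print(f"{pages / 1e6:.0f} million pages")
    print(f"{lines / 1e9:.1f} billion lines")
    print(f"~{typing_years:.0f} years of continuous typing")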


Yes, a million books is a reasonably big library.

But I would be surprised if the internet only filled a reasonably big library.


Well, a terabyte of text is... quite a lot of text.

This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.


