
Even with 1 TB of weights (the probable size of the largest state-of-the-art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.




This sounds very wrong to me.

Take the C4 training dataset, for example. The uncompressed, uncleaned dataset is ~6 TB and contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1 TB.

I could go on, but I think it's already pretty obvious that 1 TB is more than enough storage to represent a significant portion of the internet.
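
A quick back-of-envelope check of those figures (a Python sketch; the 4:1 text-compression ratio and the 0.8 TB stand-in for "significantly less than 1 TB" are assumptions, not numbers from the dataset documentation):

    # Compare the sizes cited above against 1 TB of weights.
    TB = 1e12  # decimal terabyte, in bytes

    c4_raw_bytes = 6 * TB        # uncleaned C4, per the comment above
    c4_clean_bytes = 0.8 * TB    # assumed value for "significantly less than 1 TB"
    weights_bytes = 1 * TB       # hypothesized model size from the parent comment

    assumed_compression_ratio = 4  # assumption: plain English text compresses roughly 4:1

    raw_compressed = c4_raw_bytes / assumed_compression_ratio
    clean_compressed = c4_clean_bytes / assumed_compression_ratio

    print(f"raw C4, compressed ~4:1:     {raw_compressed / TB:.2f} TB")
    print(f"cleaned C4, compressed ~4:1: {clean_compressed / TB:.2f} TB")
    print(f"fits in 1 TB of weights? raw={raw_compressed <= weights_bytes}, "
          f"cleaned={clean_compressed <= weights_bytes}")

At that assumed ratio, the cleaned scrape would fit in 1 TB of weights several times over, while even the raw scrape misses by only about 1.5x.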


This would imply that the English internet is not much bigger than 20x the size of the English Wikipedia.

That seems implausible.
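
For what it's worth, the implied arithmetic behind the "20x" figure seems to be something like the sketch below, where the ~50 GB estimate for English Wikipedia's uncompressed article text is an assumption for illustration, not a number given in the thread:

    GB = 1e9
    TB = 1e12

    english_scrape_bytes = 1 * TB        # "significantly less than 1 TB", rounded up
    assumed_enwiki_text_bytes = 50 * GB  # assumed size of English Wikipedia plain text

    ratio = english_scrape_bytes / assumed_enwiki_text_bytes
    print(f"English scrape / English Wikipedia text ~= {ratio:.0f}x")  # ~= 20x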


> That seems implausible.

Why, exactly?

Refuting facts with "I doubt it, bro" isn't exactly a productive contribution to the conversation.


Because we can count? How could you possibly think that Wikipedia was 5% of the whole Internet? It's just such a bizarrely foolish idea.

A lot of the internet is duplicate data, low-quality content, SEO spam, etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.

I would be extremely surprised if it was that small.

I was curious about the scale of 1TiB of text. According to WolframAlpha, it's roughly 1.1 trillion characters, which breaks down to 180.2 billion words, 360.5 million pages, or 16.2 billion lines. In terms of professional typing speed, that's about 3800 years of continuous work.

So post-deduplication, I think it's a fair assessment that a significant portion of high-quality text could fit within 1TiB. Though 'high-quality' is a pretty squishy and subjective term.
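
Those figures can be reproduced with simple arithmetic (a Python sketch; the characters-per-word, characters-per-page, characters-per-line, and typing-speed constants are assumptions chosen to match the quoted numbers, not values taken from WolframAlpha):

    # Back-of-envelope breakdown of 1 TiB of plain text.
    TIB = 2 ** 40                 # 1 TiB in bytes; ~1 byte per character of plain text

    chars = TIB                   # ~1.1 trillion characters
    words = chars / 6.1           # assume ~6.1 characters per word (incl. space)
    pages = chars / 3050          # assume ~3,050 characters per printed page
    lines = chars / 68            # assume ~68 characters per line

    typing_wpm = 90               # assume a fast professional typist
    typing_years = words / typing_wpm / (60 * 24 * 365.25)

    print(f"{chars / 1e12:.1f} trillion characters")
    print(f"{words / 1e9:.0f} billion words")
    print(f"{pages / 1e6:.0f} million pages")
    print(f"{lines / 1e9:.1f} billion lines")
    print(f"~{typing_years:.0f} years of continuous typing")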


Yes, a million books is a reasonably big library.

But I would be surprised if the internet only filled a reasonably big library.


Well, a terabyte of text is... quite a lot of text.

This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.


