
You can already pre-train compression on text without using an LLM:

    $ curl https://www.gutenberg.org/cache/epub/11/pg11.txt > text.txt
    $ split -n 500 text.txt trainpart.
Using a normal compression algorithm:

    $ zstd --train trainpart.* -o dictionary
    Save dictionary of size 112640 into file dictionary

    $ zstd -vD dictionary text.txt 
    *** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
    text.txt             : 15.41%   (   170 KiB =>   26.2 KiB, text.txt.zst)
For this example, zstd warns that the dictionary training set is 10x-100x too small to be efficient. Realistically, I guess you'd train it over, e.g., the entire Gutenberg library. Then you can distribute specific books to people who already have the dictionary.
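
For completeness, the recipient's side is just decompression with the same dictionary. Rough sketch; the output filename here is only illustrative:

    $ zstd -d -D dictionary text.txt.zst -o text.decoded.txt
    $ cmp text.txt text.decoded.txt && echo "lossless round-trip"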

Or:

    $ curl -L https://archive.org/download/completeworksofl1920carr/completeworksofl1920carr_hocr_searchtext.txt.gz |
        gzip -d |
        sed -E 's/\s+/ /g' > FullTextsSample.txt

    $ zstd -v -19 --patch-from FullTextsSample.txt text.txt
    text.txt             : 16.50%   (   170 KiB =>   28.1 KiB, text.txt.zst)
Not sure how much performance would drop for realistic use. But there are also some knobs you can tune.
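
For instance, something like this (a rough sketch; `--train-fastcover` and `--long` are described in the links below, and the sizes/levels here are arbitrary rather than tuned):

    $ zstd --train-fastcover trainpart.* -o dictionary --maxdict=1MB
    $ zstd -19 -D dictionary text.txt
    $ zstd -19 --long=27 --patch-from FullTextsSample.txt text.txt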

Refer to:

https://github.com/facebook/zstd/#dictionary-compression-how...

https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...

    $ man zstd
- Dictionary occupies only kilobytes or megabytes of storage, instead of gigabytes or terabytes.

- Dictionary can be re-trained for specific data at negligible cost.

- Compression and decompression are deterministic by default.

- Doesn't take large amounts of GPU resources to compress/decompress.

- It's actually designed for exactly this purpose.



Interesting you're showing 15% - 16%, and the LLM technique showed 15%.*

(To your point, one of those measures doesn't count the gigabytes of LLM weights against its size savings, treating them as part of the .exe rather than part of the compressed data.)

* EDIT to link to discussion further down: https://news.ycombinator.com/item?id=40245530


> Interesting you're showing 15% - 16%, and the LLM technique showed 15%.*

Yeah. But I don't think it's hinting at any fundamental theoretical limit.

Both the LLM and my examples were trained on data including the full text of Alice in Wonderland, which we're "compressing". Probably many copies of it, for the LLM. In theory they should both be able to reach 0% (or very close).

So both the blog post and my examples are a bit silly. It's like "losslessly compressing" an image by diffing it against a lossy JPEG, then claiming a higher compression ratio than PNG/JPEG XL because the compression program is a 1TB binary that bundles Sloot-style blurry copies of every known image.

In fact, by just adding `--maxdict=1MB` to my first example, `zstd -D` gets down to 13.5%. Probably lower with further tweaking. And adding an explicit `cat text.txt >> FullTextsSample.txt` brings `zstd --patch-from` down to… Uh. 0.02%. 40 bytes total. …And probably around a third of that is headers and checksum… So… Yeah. A bit silly.
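
Spelled out, those two tweaks are just the earlier commands with small changes (reusing the files from above; `-f` only to overwrite the earlier text.txt.zst):

    $ zstd --train trainpart.* -o dictionary --maxdict=1MB
    $ zstd -f -D dictionary text.txt

    $ cat text.txt >> FullTextsSample.txt
    $ zstd -f -v -19 --patch-from FullTextsSample.txt text.txt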

I think a better comparison should probably (rough sketch after the list):

- Have a clean separation between training data, and data to be compressed. Usually the compressed data should be similar to, but not included in, the training data.

- Use the same training data for both the LLM and conventional compressor.

- Include the dictionary/model size. And compare methods at the same dictionary/model size.
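
Something like this, very roughly (`corpus/` is a hypothetical shared training set with text.txt held out, and the 1MB cap is arbitrary):

    $ zstd --train corpus/*.txt -o dictionary --maxdict=1MB   # text.txt excluded from corpus/
    $ zstd -19 -D dictionary text.txt
    $ ls -l dictionary text.txt.zst                           # report both sizes, not just the ratio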

Also, as an aside, the method in the blog post could probably also get smaller by storing token probability ranks for most of its current explicit letters.


I am so glad I read your comment, so fascinating!



