
You can already pre-train compression on text without using an LLM:

    $ curl https://www.gutenberg.org/cache/epub/11/pg11.txt > text.txt
    $ split -n 500 text.txt trainpart.
Using a normal compression algorithm:

    $ zstd --train trainpart.* -o dictionary
    Save dictionary of size 112640 into file dictionary

    $ zstd -vD dictionary text.txt 
    *** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
    text.txt             : 15.41%   (   170 KiB =>   26.2 KiB, text.txt.zst)
For this example, zstd warns that the dictionary training set is 10x-100x too small to be efficient. Realistically, I guess you'd train it over, e.g., the entire Gutenberg library. Then you can distribute specific books to people who already have the dictionary.
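
For completeness, the recipient's side is just decompression with the same dictionary. Rough sketch; the output filename here is only illustrative:

    $ zstd -d -D dictionary text.txt.zst -o text.decoded.txt
    $ cmp text.txt text.decoded.txt && echo "lossless round-trip"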

Or:

    $ curl -L https://archive.org/download/completeworksofl1920carr/completeworksofl1920carr_hocr_searchtext.txt.gz |
        gzip -d |
        sed -E 's/\s+/ /g' > FullTextsSample.txt

    $ zstd -v -19 --patch-from FullTextsSample.txt text.txt
    text.txt             : 16.50%   (   170 KiB =>   28.1 KiB, text.txt.zst)
Not sure how much performance would drop for realistic use. But there are also some knobs you can tune.
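
For instance, something like this (a rough sketch; `--train-fastcover` and `--long` are described in the links below, and the sizes/levels here are arbitrary rather than tuned):

    $ zstd --train-fastcover trainpart.* -o dictionary --maxdict=1MB
    $ zstd -19 -D dictionary text.txt
    $ zstd -19 --long=27 --patch-from FullTextsSample.txt text.txt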

Refer to:

https://github.com/facebook/zstd/#dictionary-compression-how...

https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...

    $ man zstd
- Dictionary occupies only kilobytes or megabytes of storage, instead of gigabytes or terabytes.

- Dictionary can be re-trained for specific data at negligible cost.

- Compression and decompression are deterministic by default.

- Doesn't take large amounts of GPU resources to compress/decompress.

- It's actually designed for exactly this purpose.



Interesting you're showing 15% - 16%, and the LLM technique showed 15%.*

(To your point, one of those measures doesn't count the gigabytes of LLM weights against its size savings, treating them as part of the .exe rather than part of the compressed data.)

* EDIT to link to discussion further down: https://news.ycombinator.com/item?id=40245530


> Interesting you're showing 15% - 16%, and the LLM technique showed 15%.*

Yeah. But I don't think it's hinting at any fundamental theoretical limit.

Both the LLM and my examples were trained on data including the full text of Alice in Wonderland, which we're "compressing". Probably many copies of it, for the LLM. In theory they should both be able to reach 0% (or very close).

So both the blog post and my examples are a bit silly. It's like "losslessly compressing" an image by diffing it against a lossy JPEG, then claiming a higher compression ratio than PNG/JPEG XL because the compression program is a 1TB binary that bundles Sloot-style blurry copies of every known image.

In fact, by just adding `--maxdict=1MB` to my first example, `zstd -D` gets down to 13.5%. Probably lower with further tweaking. And adding an explicit `cat text.txt >> FullTextsSample.txt` brings `zstd --patch-from` down to… Uh. 0.02%. 40 bytes total. …And probably around a third of that is headers and checksum… So… Yeah. A bit silly.
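
Spelled out, those two tweaks are just the earlier commands with small changes (reusing the files from above; `-f` only to overwrite the earlier text.txt.zst):

    $ zstd --train trainpart.* -o dictionary --maxdict=1MB
    $ zstd -f -D dictionary text.txt

    $ cat text.txt >> FullTextsSample.txt
    $ zstd -f -v -19 --patch-from FullTextsSample.txt text.txt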

I think a better comparison should probably (rough sketch after the list):

- Have a clean separation between training data, and data to be compressed. Usually the compressed data should be similar to, but not included in, the training data.

- Use the same training data for both the LLM and conventional compressor.

- Include the dictionary/model size. And compare methods at the same dictionary/model size.
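
Something like this, very roughly (`corpus/` is a hypothetical shared training set with text.txt held out, and the 1MB cap is arbitrary):

    $ zstd --train corpus/*.txt -o dictionary --maxdict=1MB   # text.txt excluded from corpus/
    $ zstd -19 -D dictionary text.txt
    $ ls -l dictionary text.txt.zst                           # report both sizes, not just the ratio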

Also, as an aside, the method in the blog post could probably also get smaller by storing token probability ranks for most of its current explicit letters.


I am so glad I read your comment, so fascinating!



