$ zstd --train trainpart.* -o dictionary
Save dictionary of size 112640 into file dictionary
$ zstd -vD dictionary text.txt
*** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
text.txt : 15.41% ( 170 KiB => 26.2 KiB, text.txt.zst)
For this example, zstd warns that the dictionary training set is 10x-100x too small to be efficient. Realistically, I guess you'd train it over, e.g., the entire Gutenberg library. Then you can distribute specific books to people who already have the dictionary.
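Roughly, the workflow I'm picturing (untested sketch; gutenberg/, shared.dict, and alice.txt are made-up names):

# Train a shared dictionary once, over a big corpus:
$ zstd --train gutenberg/*.txt --maxdict=1MB -o shared.dict
# Compress an individual book against that dictionary:
$ zstd -19 -D shared.dict alice.txt -o alice.txt.zst
# Anyone who already has shared.dict can decompress it:
$ zstd -d -D shared.dict alice.txt.zst -o alice_restored.txt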
> Interesting you're showing 15%-16%, and the LLM technique showed 15%.
Yeah. But I don't think it's hinting at any fundamental theoretical limit.
Both the LLM and my examples were trained on data that includes the full text of Alice in Wonderland, which is the very text we're "compressing". Probably many copies of it, in the LLM's case. In theory, both should be able to reach 0% (or very close).
So both the blog post and my examples are a bit silly. It's like "losslessly compressing" an image by diffing it against a lossy JPEG, then claiming a higher compression ratio than PNG/JPEG XL because the compression program is a 1 TB binary that bundles Sloot-style blurry copies of every known image.
In fact, by just adding `--maxdict=1MB` to my first example, `zstd -D` gets down to 13.5%. Probably lower with further tweaking. And adding an explicit `cat text.txt >> FullTextsSample.txt` brings `zstd --patch-from` down to… Uh. 0.02%. 40 bytes total. …And probably around a third of that is headers and checksum… So… Yeah. A bit silly.
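For reference, those two tweaks look roughly like this (sketch, reusing the filenames from above; exact ratios will vary with data, dictionary size, and compression level):

# Bigger dictionary for the first example:
$ zstd --train trainpart.* --maxdict=1MB -o dictionary
$ zstd -D dictionary text.txt
# The silly patch-from version, where the reference already contains the target verbatim:
$ cat text.txt >> FullTextsSample.txt
$ zstd --patch-from=FullTextsSample.txt text.txt -o text.txt.zst
# Decompression needs the same reference (and possibly --long=# if the reference is large):
$ zstd -d --patch-from=FullTextsSample.txt text.txt.zst -o text_restored.txt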
I think a better comparison should probably:
- Have a clean separation between training data and data to be compressed. Usually the compressed data should be similar to, but not included in, the training data.
- Use the same training data for both the LLM and conventional compressor.
- Include the dictionary/model size, and compare methods at the same dictionary/model size. (See the sketch below for the zstd side.)
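On the zstd side, that could be as simple as the following (rough sketch; the train/ and test/ split and the compression level are placeholders, and the LLM side would need the equivalent discipline):

# Train only on the training split, never on the files being measured:
$ zstd --train train/*.txt --maxdict=1MB -o bench.dict
# Compress the held-out test split with that dictionary:
$ zstd -19 -D bench.dict test/*.txt
# Count what actually has to be shipped: dictionary plus compressed files:
$ wc -c bench.dict test/*.txt.zst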
Also, as an aside, the method in the blog post could probably get smaller still by storing token probability ranks in place of most of its current explicit letters.
Not sure how much performance would drop for realistic use, but there are also some knobs you can tune. Refer to:
https://github.com/facebook/zstd/#dictionary-compression-how...
https://github.com/facebook/zstd/wiki/Zstandard-as-a-patchin...
- Dictionary occupies only kilobytes or megabytes of storage, instead of gigabytes or terabytes.
- Dictionary can be re-trained for specific data at negligible cost.
- Compression and decompression are deterministic by default.
- Doesn't take a large amount of GPU resources to compress/decompress.
- It's actually designed to do this.