
I have the same question: what is the difference between an LLM and a dictionary in the context of compression? Can I not "train" a dictionary?


AIUI, a dictionary is built during compression to capture the statistics of a particular dataset, and it belongs to that specific dataset only. For example, it could be a ranking of the 10 most frequent symbols in the compressed file. That will be different for every input file.
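
A toy illustration of that kind of per-file ranking in Python (the input string is made up; Counter.most_common does the ranking):

    from collections import Counter

    # Rank the 10 most frequent byte values in one particular input.
    # A "dictionary" like this is tied to that input; a different file
    # would produce a different ranking.
    data = b"this is just an example input; real data has its own statistics"
    ranking = [bytes([b]) for b, _ in Counter(data).most_common(10)]
    print(ranking)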


> That will be different for every input file

That could be different for every input file, but it doesn't have to be. It could also be a fixed dictionary. For example, ZLIB allows for a user-defined dictionary [1].
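
For illustration, roughly what that looks like with Python's zlib bindings; the preset dictionary below is invented, and both sides must supply the identical zdict for decompression to succeed:

    import zlib

    # Hypothetical preset dictionary: byte strings we expect to recur in the input.
    preset = b'"name":"value","type":"record",'
    data = b'{"name":"example","type":"record"}'

    comp = zlib.compressobj(zdict=preset)
    compressed = comp.compress(data) + comp.flush()

    # The receiver has to be handed the same preset dictionary.
    decomp = zlib.decompressobj(zdict=preset)
    assert decomp.decompress(compressed) == data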

In this case, I'd consider the LLM to be a fixed dictionary of sorts. A very large, fixed dictionary with probabilistic return values.

[1] https://www.rfc-editor.org/rfc/rfc1950#page-9


Ah, I see. I’d never thought of the possibility of using a dictionary not created specifically from the given input dataset, heh


Admittedly, I don’t think it is common, but I think there was a project a few years ago (Google?) that tried to compress HTML using at least a partially fixed dictionary.

Nowadays, though, it’s apparently still something being tried: Chrome now supports shared dictionaries for Zstandard and Brotli. One idea is that a site would benefit from a single shared dictionary used to decompress multiple artifacts. You may not want everything compressed together in one file, so this way you get the compression benefit while still splitting those artifacts into separate files.

https://developer.chrome.com/blog/shared-dictionary-compress...
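
As for the "train a dictionary" part of the original question: zstd can build a shared dictionary from sample files. A rough sketch using the third-party python-zstandard package (the sample data and dictionary size are placeholders; real training wants many varied samples):

    import zstandard

    # Placeholder samples; in practice these would be many similar small
    # artifacts, e.g. JSON responses or HTML fragments from one site.
    roles = ["admin", "viewer", "editor"]
    samples = [('{"user":"user%d","role":"%s"}' % (i, roles[i % 3])).encode()
               for i in range(5000)]

    # Train a dictionary on the samples, then compress new data with it.
    dict_data = zstandard.train_dictionary(4096, samples)
    cctx = zstandard.ZstdCompressor(dict_data=dict_data)
    compressed = cctx.compress(b'{"user":"dave","role":"viewer"}')

    # The decompressor needs the same trained dictionary.
    dctx = zstandard.ZstdDecompressor(dict_data=dict_data)
    assert dctx.decompress(compressed) == b'{"user":"dave","role":"viewer"}'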



