This is an important trick for LZ4's speed. But, there is a bit more to it than this. Some important tricks are:
* LZ4 can represent a literal length of 14, and a match length of 18 in a single byte [0]. Most LZ4 sequences will have a literal length LL <= 14, and most LZ4 sequences will have a match length ML <= 18. Instead of copying LL and ML bytes, we can copy 14 bytes and 18 bytes. Then only if LL > 14 we copy extra bytes, and same for ML > 18.
* The case of overlapping matches (e.g. offset = 1, and ML = 100) needs to be handled specially, otherwise you will run into store forwarding problems.
* Always copy 16 or 32 bytes at a time and allow overwrites. Allowing overcopies simplifies the copy routine, improves branch prediction, and reduces store forwarding issues, at the expense of ensuring you always have enough slack at the end of your buffers.
The performance optimizations in modern CPUs are a deal with the Devil: They make your programs go faster, but you have to do increasingly crazy things to make them work.
Sadly, this is true generally in tech whenever an implementation evolves past the abstraction. The alternative is that CPUs get faster, but only if all programs are rewritten for the new architecture. Arguably, this has happened already on a market-by-market basis, several times, with no end in sight :)
* LZ4 can represent a literal length of 14, and a match length of 18 in a single byte [0]. Most LZ4 sequences will have a literal length LL <= 14, and most LZ4 sequences will have a match length ML <= 18. Instead of copying LL and ML bytes, we can copy 14 bytes and 18 bytes. Then only if LL > 14 we copy extra bytes, and same for ML > 18.
* The case of overlapping matches (e.g. offset = 1, and ML = 100) needs to be handled specially, otherwise you will run into store forwarding problems.
* Always copy 16 or 32 bytes at a time and allow overwrites. Allowing overcopies simplifies the copy routine, improves branch prediction, and reduces store forwarding issues, at the expense of ensuring you always have enough slack at the end of your buffers.
[0] https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md