This comment reads like real scientific skepticism, but from my recollection of events, it is more of a hit piece that takes what should be a technical discussion and drags in a bunch of personal baggage. In particular:
> HN has previously fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model way below the size of the model,
is not true at all. Someone else made the claims about 6GB RAM usage for a 30B model, I remember reading it at the time and thinking "Yeah, that doesn't make sense, but the loading time improvement is immense!" And it was - I run all my LLMs locally on CPU because I don't have dedicated hardware, and jart's work has improved usability a lot.
> and it's hard to overstate the degree of social pressure that needed to be overcome at the time for the skeptic position to reach fixation
I was reading the same HN discussions you were at the time, and it was pretty trivial to see that the loading time claim held up, and the RAM claim was dubious and likely simply due to not understanding some effect of the change completely. Heck, jart's own discussion of the topic reflected this at the time.
For the current change, I feel like your comment is even more misplaced. The blog post linked to for this story has a huge amount of detail about performance on specific processors (Skylake, Alderlake, RPi5/4, M2 Ultra, and 7995WX) with specific models. So when you say:
> It would be good to see some independent verification of this claim.
What I hear is "4bpp thinks there's a real risk the numbers in the linked post are fabricated, and jart is just trying to get attention."
And that doesn't seem reasonable at all, given the history of her work and the evidence in front of us.
The loading time improvements largely held up, and on balance the mmap contribution was ultimately good (though the way it was implemented was quite problematic as a matter of process and communication). However, as I point out in https://news.ycombinator.com/item?id=39894542, JT quite unambiguously did try to cash in on the "low memory usage" claim - uncritically reprinting positive claims by others about one's own work, claims that would otherwise have been largely invisible, really should not be treated differently from making those claims oneself.
I do think that there is a real risk that the numbers are wrong (not necessarily "fabricated", as that implies malfeasance, but possibly based on an erroneous measurement that went insufficiently questioned due to an excess of trust from the author and others, as the mmap ones were). This is also partly based on the fact that at the time of the mmap story, when I was more involved in the project, I was actually working on optimising the SIMD linear algebra code, and unless llama.cpp has since switched to a significantly less performant implementation, the proposition that so much more performance could be squeezed out strikes me as quite surprising. Here, your intuitions may say that Justine Tunney is just so brilliant that they make the seemingly impossible possible; but it was exactly this attitude that at the time made it so hard to evaluate the mmap memory usage claims rationally, and made the discussion around them far more dysfunctional than it had to be.