This comment reads like real scientific skepticism, but from my recollection of events, it is more of a hit piece that takes what should be a technical discussion and drags in a bunch of personal baggage. In particular:
> HN has previously fallen for a claim by the same author to have reduced llama.cpp memory usage for a dense model way below the size of the model,
is not true at all. Someone else made the claims about 6GB RAM usage for a 30B model, I remember reading it at the time and thinking "Yeah, that doesn't make sense, but the loading time improvement is immense!" And it was - I run all my LLMs locally on CPU because I don't have dedicated hardware, and jart's work has improved usability a lot.
> and it's hard to overstate the degree of social pressure that needed to be overcome at the time for the skeptic position to reach fixation
I was reading the same HN discussions you were at the time, and it was pretty trivial to see that the loading time claim held up, and the RAM claim was dubious and likely simply due to not understanding some effect of the change completely. Heck, jart's own discussion of the topic reflected this at the time.
For the current change, I feel like your comment is even more misplaced. The blog post linked to for this story has a huge amount of detail about performance on specific processors (Skylake, Alderlake, RPi5/4, M2 Ultra, and 7995WX) with specific models. So when you say:
> It would be good to see some independent verification of this claim.
What I hear is "4bpp thinks there's a real risk the numbers in the linked post are fabricated, and jart is just trying to get attention."
And that doesn't seem reasonable at all, given the history of her work and the evidence in front of us.
The loading time improvements largely held up, and on balance the mmap contribution was ultimately good (though the way it was implemented was quite problematic as a matter of process and communication). However, as I point out in https://news.ycombinator.com/item?id=39894542, JT quite unambiguously did try to cash in on the "low memory usage" claim - uncritically reprinting positive claims by others about one's own work, claims that would otherwise have been largely invisible, really should not be treated differently from making those claims oneself.
I do think that there is a real risk that the numbers are wrong (not necessarily "fabricated", as that implies malfeasance, but possibly based on an erroneous measurement that went insufficiently questioned due to an excess of trust from the author and others, as the mmap ones were). This is also partly based on the fact that at the time of the mmap story, when I was more involved in the project, I was actually working on optimising the SIMD linear algebra code, and unless llama.cpp has since switched to a significantly less performant implementation, the proposition that so much more performance could be squeezed out strikes me as quite surprising. Here, your intuitions may say that Justine Tunney is just so brilliant that they make the seemingly impossible possible; but it was exactly this attitude that at the time made it so hard to evaluate the mmap memory usage claims rationally, and made the discussion around them far more dysfunctional than it had to be.