KVarN: Native vLLM backend for KV-cache quantization by Huawei

throwa356262 · 2026-06-04T15:54:56 1780588496

Better performance than TQ and better quality than FP16?

Am I reading this right??

qeternity · 2026-06-04T17:04:42 1780592682

It's not better quality: 59.3% vs. 59.4% for FP16 on AIME 25.

sheepscreek · 2026-06-05T00:42:54 1780620174

A 0.1-point gap is within the margin of error. Depending on the performance boost, a minuscule quality hit might be worth taking.
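For a rough sense of scale, here is a minimal sketch of the binomial standard error on an accuracy score. The trial count `n` is an assumption: AIME 2025 has 30 problems, but the number of samples averaged per problem is not stated in this thread, and treating every trial as independent is a simplification. The function names are made up for illustration.

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured over n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


def gap_in_standard_errors(p_a: float, p_b: float, n: int) -> float:
    """Size of the gap between two accuracies, in units of the standard error
    of their difference (both assumed to be measured over n independent trials)."""
    se_diff = math.sqrt(
        accuracy_standard_error(p_a, n) ** 2 + accuracy_standard_error(p_b, n) ** 2
    )
    return abs(p_a - p_b) / se_diff


if __name__ == "__main__":
    # Scores quoted in the thread: 59.3% (quantized) vs 59.4% (FP16).
    # n values are hypothetical: 30 problems times 1, 16, or 64 samples each.
    for n in (30, 30 * 16, 30 * 64):
        gap = gap_in_standard_errors(0.593, 0.594, n)
        print(f"n={n}: gap is {gap:.3f} standard errors")
```

Under these assumptions the gap stays far below one standard error for any plausible sample count, so the benchmark alone cannot separate the two configurations.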

electroglyph · 2026-06-04T21:33:54 1780608834

Any divergence from full precision is error, even if the benchmark score is better.

thefox96 · 2026-06-04T17:02:26 1780592546

Faster than FP16, not better quality, I guess.

v3ss0n · 2026-06-04T15:53:48 1780588428

Why is this not a PR for vLLM?

woadwarrior01 · 2026-06-04T20:09:07 1780603747

Last I heard, vLLM was backed by a company that has raised $150m in seed funding. I'm sure they've got the resources to port it.

esafak · 2026-06-04T16:00:19 1780588819

It's the output of a research paper; the authors are not trying to build up vLLM, and they probably have no incentive to do so. You can submit a PR, though! It's easier now while the divergence is low, so don't wait. Since there are six authors, I bet you could get help with the inevitable review chores if you just take the step of creating the PR.

edit: It might not be clear that it is based on vLLM 0.22, which is the current version: https://github.com/huawei-csl/KVarN/commit/d6290e99098d7426d.... All you have to do is create a diff off it; it's fairly straightforward.

jmalicki · 2026-06-04T16:14:14 1780589654

And with the help of AI, pointing an AI at this paper and saying "make a vLLM PR from this paper" tends to work surprisingly well, even if you need to nudge it a little along the way.

electronsoup · 2026-06-04T23:05:37 1780614337

Why is this not a PR for llama.cpp?

thefox96 · 2026-06-04T17:28:33 1780594113

It should be easy to do, btw.

0xjeffro · 2026-06-04T21:58:09 1780610289

"Far ahead" (yao yao ling xian)