I made a kernel 2.2x faster. It made my training loop 3x slower

saagarjha · 2026-06-05T04:38:45 1780634325

I hope you don’t take this the wrong way, but this entire blog post reads a little like what I see happen when someone asks an LLM how to optimize their code. Half a dozen or more different things were tried: AI is really good at suggesting all sorts of things that might possibly help. But in the end very few of them panned out, and a lot of them were only really explained after the fact, even though the failure modes did not seem that unreasonable to predict for someone with experience in this area. Unfortunately, I find that this kind of workflow rarely results in nice wins, because it’s kind of a “try everything and see what sticks” approach without truly understanding what the problem space is and what’s reasonable to do.

querez · 2026-06-05T04:30:17 1780633817

I didn't know about the static cache, that was a nice learning for me. The rest is pretty obvious / too lengthy for anyone who has ever done profiling/benchmarking. Also, the text has a lot of LLM slop-phrases in it that need cleaning up ("the X is structural", "the Y is real", "the gap is Z").

vishal-padia · 2026-06-02T17:26:13 1780421173

Quick context on what's in the post:

1. From scratch Dr. GRPO implementation in ~300 lines of PyTorch (Qwen2.5-0.5B on GSM8K, A10G).
2. Profiling deep dive on the training loop. Generate is 90% of step time. Pre-allocating the KV cache via StaticCache took GPU utilization from 26% to 86%, biggest single win in the project.
3. Wrote a fused decode-attention kernel in CuteDSL (RoPE + KV cache write + attention in one launch). Benchmarks 2.2x faster than the SDPA path it replaces at the relevant scale.
4. Plugged it into HF generate and the decode step got 3x slower. The post is mostly about why this happened, what it took to figure out, and what would actually close the gap.
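For anyone who wants to sanity-check how a kernel-level number could translate to step time, here is a back-of-the-envelope Amdahl's law sketch. The 90% generate share and the 2.2x benchmark figure are from the list above; the share of generate time spent in decode attention is not stated there, so the `attn_share` values are purely hypothetical, and `overall_speedup` is an illustrative helper, not code from the post.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-step speedup when `fraction` of the step time
    becomes `local_speedup` times faster. A value below 1 models a slowdown."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


GENERATE_FRACTION = 0.90  # from the post: generate is 90% of step time
KERNEL_SPEEDUP = 2.2      # from the post: microbenchmark vs. the SDPA path

# ASSUMPTION: hypothetical shares of generate time spent in decode attention.
for attn_share in (0.10, 0.25, 0.50, 1.00):
    affected = GENERATE_FRACTION * attn_share
    best_case = overall_speedup(affected, KERNEL_SPEEDUP)
    print(f"attention share of generate {attn_share:.0%}: "
          f"step speedup at most {best_case:.2f}x")
```

Even in the extreme case where all of generate sped up by 2.2x, the whole step would gain less than 2x, so the end-to-end upside of a faster kernel is bounded well before any integration overhead is counted. Passing a `local_speedup` below 1 shows the other direction: a slowdown in a 90% component dominates the step.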

Happy to answer questions.