If you completely do away with autoregression, prompt tokens can pay attention to generated tokens, so even the prompt tokens' KV vectors change at every step and you cannot cache anything.
For this reason, models that generate text using diffusion typically generate blocks of tokens at a time: tokens within a block attend to each other freely, but across blocks there's causal masking, so each block only depends on the preceding ones and we're back to autoregression. That makes caching possible, but it also means you still can't have diffusion change the beginning of a long text to match the end.
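To make that concrete, here's a toy numpy sketch of the kind of attention mask I mean (illustrative names, not any specific model's code): bidirectional within a block, causal across blocks, so a finished block's KV vectors never change again.

```python
import numpy as np

def block_causal_mask(num_tokens: int, block_size: int) -> np.ndarray:
    """True where query token i may attend to key token j.

    Tokens attend bidirectionally within their own block and causally to
    every earlier block, so once a block is finished its KV vectors are
    fixed and can be cached.
    """
    blocks = np.arange(num_tokens) // block_size
    # query's block index >= key's block index -> same block or an earlier one
    return blocks[:, None] >= blocks[None, :]

# 6 tokens in blocks of 3: the first block never sees the second,
# so its KV cache stays valid while the second block is being denoised.
print(block_causal_mask(6, 3).astype(int))
```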
I specifically mean prompts here, and I don't mean they'd have causal attention.
Just run an encoder pass to prefill the KV cache for the prompt, then do non-causal diffusion generation of the response that references the cached prompt KV, without ever re-encoding the prompt.
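Something like this toy numpy sketch (single head, made-up weights, no positions or residuals, just to show the cache bookkeeping): the prompt's K/V are computed once and reused at every refinement step, while only the response side gets re-projected.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

def attend(q, k, v):
    """Plain single-head scaled dot-product attention, no masking."""
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Hypothetical projection weights; in a real model these come from the network.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# 1) Encode the prompt once: its K/V never change, so cache them.
prompt_h = rng.normal(size=(10, d))            # stand-in for encoder outputs
cached_k, cached_v = prompt_h @ Wk, prompt_h @ Wv

# 2) Iterative (diffusion-style) refinement of the response.
x = rng.normal(size=(5, d))                    # noisy response states
for step in range(4):
    q = x @ Wq
    k = np.concatenate([cached_k, x @ Wk])     # prompt K/V reused as-is
    v = np.concatenate([cached_v, x @ Wv])
    x = attend(q, k, v)                        # response tokens attend to the
                                               # prompt and each other, non-causally
```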
You don't need to revert to chunks to enjoy prompt caching, especially if you use it in a RAG-type way with minor provisions to allow KV caching of the RAG fragments (a bunch of work has been done on that; iirc even DeepSeekV3 would allow that).
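One common provision (a rough sketch of the general idea, not DeepSeekV3's actual implementation, and glossing over position handling) is to encode each fragment without letting it attend to its siblings, so its K/V block doesn't depend on which other fragments it later sits next to and can be reused across requests:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Hypothetical store: each RAG fragment is encoded on its own (no attention
# to other fragments), so its K/V block is context-independent.
fragment_states = {name: rng.normal(size=(8, d)) for name in ("doc_a", "doc_b", "doc_c")}
kv_cache = {name: (h @ Wk, h @ Wv) for name, h in fragment_states.items()}

def kv_for_request(fragment_names):
    """Assemble the prompt-side K/V for one request from cached blocks only."""
    ks, vs = zip(*(kv_cache[n] for n in fragment_names))
    return np.concatenate(ks), np.concatenate(vs)

# Two different requests reuse doc_a's cached K/V without re-encoding it.
k1, v1 = kv_for_request(["doc_a", "doc_b"])
k2, v2 = kv_for_request(["doc_a", "doc_c"])
```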