I have the exact same questions as you. I can barely understand how diffusion wo...

janalsncm · 2025-05-22T07:18:12 1747898292

Let’s suppose we have 10k possible tokens in the vocabulary.

Then text would be an image 10k pixels tall and N pixels wide, where N is the length of the text.

For each column, exactly 1 pixel is white (corresponding to the word which is there) and the rest are black.

Then the diffusion process is the same. Repeatedly denoising.

moralestapia · 2025-05-22T09:42:35 1747906955

No, that intuition is incorrect.

Denoising models work because a lot of regions turn out to be smooth, you cannot do that "in a discrete way" if that makes sense.

janalsncm · 2025-05-22T21:46:58 1747950418

Feel free to give a better explanation. I am not an expert. Clearly denoising models do work on text though.

moralestapia · 2025-05-22T22:10:43 1747951843

This one's closer to the thing.

lostmsu · 2025-05-22T16:40:42 1747932042

They may be smooth in embedding space