Then text would be an image 10k pixels tall and N pixels wide, where N is the length of the text.
For each column, exactly 1 pixel is white (corresponding to the word which is there) and the rest are black.
Then the diffusion process is the same. Repeatedly denoising.
Denoising models work because a lot of regions turn out to be smooth, you cannot do that "in a discrete way" if that makes sense.
https://news.ycombinator.com/item?id=44059646