20 years ago, it was extremely obvious to anyone who had to write forward/backward compatible parallelism that the-thing-nvidia-calls-SIMT was the correct approach. I thought CPU hardware manufacturers and language/compiler writers were so excessively stubborn that it would take them a decade to catch up. I was wrong. 20 years on, they still refuse to copy what works.
They search every corner of the earth for a clue, from the sulfur vents at the bottom of the ocean to the tallest mountains, all very impressive as feats of exploration -- but they are still suffering for want of a clue when clue city is right there next to them, bustling with happy, successful inhabitants, and they refuse to look at it. Look, guys, I'm glad you gave the alternatives a chance; sometimes they just need a bit of love to bloom, but you gave them that love, they didn't bloom, and it's time to move on. Do what works and spend your creative energy on a different problem; there are plenty.
Because SIMT is not a general programming framework the way CPUs are. It's a technique for a dedicated accelerator aimed at a specific kind of problem. SIMD, on the other hand, lets you get a meaningful speedup inline with traditional code.
No no no! The programming model "meets you where you are" in exactly the way that an auto-vectorizer does. You write unconstrained single-threaded code, the compiler tries to make it parallel, and if it fails your code still works, just slowly. The difference is a few abstractions and social contract tweaks to make the auto-vectorizer reliable and easy to think about. These tweaks "smell like" hacks, but CPU folks have spent 20 years trying to do better and their auto-vectorizers are still failing at the basics so it's past time to copy what works and move on.
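For concreteness, here's a minimal sketch of what that model looks like (an illustrative saxpy-style CUDA kernel, names mine): the kernel body reads like ordinary scalar code, the hardware runs one logical instance per element, and there is no vectorization step that can silently fail.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative sketch: the kernel body is written as if it handled a
    // single element. The SIMT hardware runs one logical thread per element;
    // the compiler never has to prove the loop is vectorizable, so there is
    // no "auto-vectorizer gave up" failure mode.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                      // ragged edge handled with ordinary control flow
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));   // unified memory keeps the sketch short
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);                // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }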
I think you maybe misunderstood what I was trying to say.
A model like CUDA only works well for a particular class of problems. It requires HW designed for that class of problems, a SW stack that can use it, and problems that fit well within that paradigm. It does not work well for problems that aren't embarrassingly parallel, where you process a little bit of data, make a decision, process a little bit more, and so on. As an example, go try to write a TCP stack in CUDA vs. a normal language to understand the inherent difficulty of such an approach.
And when I say "HW designed for this class of problems" I mean it. Why does the GPU have so much compute? It throws away the HW blocks that modern CPUs use to help with "normal" code, like speculative execution hardware, thread synchronization, etc.
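To put the "process a bit, decide, process a bit more" problem in code: within a 32-thread warp, threads that take different branches run those branches one after another with part of the warp masked off, so data-dependent control flow like this toy sketch (hypothetical kernel, purely for illustration) loses much of the parallelism the hardware is built around.

    // Toy sketch (hypothetical): data-dependent control flow in a kernel.
    // Within a 32-thread warp, divergent branches are executed one after
    // another with part of the warp masked off, so the "decide as you go"
    // structure that a CPU handles well serializes on a GPU.
    __global__ void classify(const int *packets, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int v = packets[i];
        if (v & 1) {                 // branch depends on the data itself
            // expensive path A: neighbours taking path B sit idle
            for (int k = 0; k < 100; ++k) v = v * 3 + 1;
        } else {
            // cheap path B: finishes early, then waits for path A
            v = v >> 1;
        }
        out[i] = v;
    }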
I'm so glad someone else gets it. We don't want an auto-vectorizer, it doesn't work, just give us a trivial way to vectorise the easy parts and leave the difficult parts to be difficult. We're better at the difficult stuff than your compiler.
If SIMT is so obviously the right path, why have just about all GPU vendors and standards reinvented SIMD, calling it subgroups (Vulkan), __shfl_sync (CUDA), work group/sub-group (OpenCL), wave intrinsics (HLSL), I think also simdgroup (Metal)?
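For reference, here's roughly what those intrinsics look like in CUDA (a minimal warp-sum sketch): the 32-lane warp width and the lane-to-lane data movement are explicit, which is classic SIMD-style programming inside the SIMT model.

    // Sketch: warp-level reduction using CUDA's shuffle intrinsics. The
    // programmer reasons explicitly about the 32 lanes of the warp and moves
    // data between them, i.e. SIMD-style thinking inside SIMT.
    __inline__ __device__ float warpReduceSum(float val) {
        // 0xffffffff: all 32 lanes of the warp participate
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;   // lane 0 now holds the sum of the warp's values
    }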