20 years ago, it was extremely obvious to anyone who had to write forward/backward compatible parallelism that the-thing-nvidia-calls-SIMT was the correct approach. I thought CPU hardware manufacturers and language/compiler writers were so excessively stubborn that it would take them a decade to catch up. I was wrong. 20 years on, they still refuse to copy what works.
They search every corner of the earth for a clue, from the sulfur vents at the bottom of the ocean to the tallest mountains, all very impressive as feats of exploration -- but they are still suffering for want of a clue when clue city is right there next to them, bustling with happy, successful inhabitants, and they refuse to look at it. Look, guys, I'm glad you gave the alternatives a chance; sometimes they just need a bit of love to bloom, but you gave them that love, they didn't bloom, and it's time to move on. Do what works and spend your creative energy on a different problem; there are plenty.
Because SIMT is not a general programming framework the way CPUs are. It's a technique for a dedicated accelerator aimed at a specific kind of problem. SIMD, on the other hand, lets you get a meaningful speedup inline with traditional code.
No no no! The programming model "meets you where you are" in exactly the way that an auto-vectorizer does. You write unconstrained single-threaded code, the compiler tries to make it parallel, and if it fails your code still works, just slowly. The difference is a few abstractions and social contract tweaks to make the auto-vectorizer reliable and easy to think about. These tweaks "smell like" hacks, but CPU folks have spent 20 years trying to do better and their auto-vectorizers are still failing at the basics so it's past time to copy what works and move on.
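For concreteness, here's a minimal sketch of what that model looks like (an illustrative saxpy-style CUDA kernel, names mine): the kernel body reads like ordinary scalar code, the hardware runs one logical instance per element, and there is no vectorization step that can silently fail.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Illustrative sketch: the kernel body is written as if it handled a
    // single element. The SIMT hardware runs one logical thread per element;
    // the compiler never has to prove the loop is vectorizable, so there is
    // no "auto-vectorizer gave up" failure mode.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                      // ragged edge handled with ordinary control flow
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));   // unified memory keeps the sketch short
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);                // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }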
I think you maybe misunderstood what I was trying to say.
A model like CUDA only works well for a particular class of problems. It requires HW designed for that class of problems, a SW stack that can use it, and problems that fit well within that paradigm. It does not work well for problems that aren't embarrassingly parallel, where you process a little bit of data, make a decision, process a little bit more, and so on. As an example, go try to write a TCP stack in CUDA vs. a normal language to understand the inherent difficulty of such an approach.
And when I say "HW designed for this class of problems" I mean it. Why does the GPU have so much compute? It throws away the HW blocks that modern CPUs use to help with "normal" code, like speculative execution hardware, thread synchronization, etc.
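To put the "process a bit, decide, process a bit more" problem in code: within a 32-thread warp, threads that take different branches run those branches one after another with part of the warp masked off, so data-dependent control flow like this toy sketch (hypothetical kernel, purely for illustration) loses much of the parallelism the hardware is built around.

    // Toy sketch (hypothetical): data-dependent control flow in a kernel.
    // Within a 32-thread warp, divergent branches are executed one after
    // another with part of the warp masked off, so the "decide as you go"
    // structure that a CPU handles well serializes on a GPU.
    __global__ void classify(const int *packets, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int v = packets[i];
        if (v & 1) {                 // branch depends on the data itself
            // expensive path A: neighbours taking path B sit idle
            for (int k = 0; k < 100; ++k) v = v * 3 + 1;
        } else {
            // cheap path B: finishes early, then waits for path A
            v = v >> 1;
        }
        out[i] = v;
    }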
I'm so glad someone else gets it. We don't want an auto-vectorizer, it doesn't work, just give us a trivial way to vectorise the easy parts and leave the difficult parts to be difficult. We're better at the difficult stuff than your compiler.
If SIMT is so obviously the right path, why have just about all GPU vendors and standards reinvented SIMD, calling it subgroups (Vulkan), __shfl_sync (CUDA), work group/sub-group (OpenCL), wave intrinsics (HLSL), I think also simdgroup (Metal)?
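For reference, here's roughly what those intrinsics look like in CUDA (a minimal warp-sum sketch): the 32-lane warp width and the lane-to-lane data movement are explicit, which is classic SIMD-style programming inside the SIMT model.

    // Sketch: warp-level reduction using CUDA's shuffle intrinsics. The
    // programmer reasons explicitly about the 32 lanes of the warp and moves
    // data between them, i.e. SIMD-style thinking inside SIMT.
    __inline__ __device__ float warpReduceSum(float val) {
        // 0xffffffff: all 32 lanes of the warp participate
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;   // lane 0 now holds the sum of the warp's values
    }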