It’s such a ridiculous situation we’re in. Just about every consumer CPU of the past 20 years packs an extra order of magnitude or two of punch for data-processing workloads, but to keep that from going to waste you have to resort to writing your inner loops with low-level, non-portable intrinsics that are just a step above assembly. Or pray that the gods of autovectorization are on your side.
Adding parallelism is much easier on the hardware side than the software side. We've kind of figured out the easy cases, such as independent tasks with limited interactions with each other, and made them practical for the average developer. But nobody understands the harder cases (such as SIMD) well enough to create useful abstractions that don't constrain hardware development too much.
Except for GPU manufacturers, who figured out the right way to do it 20 years ago.
20 years ago, it was extremely obvious to anyone who had to write forward/backward compatible parallelism that the-thing-nvidia-calls-SIMT was the correct approach. I thought CPU hardware manufacturers and language/compiler writers were so excessively stubborn that it would take them a decade to catch up. I was wrong. 20 years on, they still refuse to copy what works.
That's because what GPUs do just isn't what CPUs do. GPUs don't have to deal with ad-hoc 10..100-char strings. They don't have to deal with 20-iteration loops with necessarily-serial dependencies between iterations of those small loops. They don't have to deal with parallelizing mutable hashmap probing operations.
Indeed what GPUs do is good for what GPUs do. But we have a tool for doing things that GPUs do well - it's called, uhh, what's it, uhh... oh yeah, GPUs. Copying that into CPUs is somewhere between just completely unnecessary, and directly harmful to things that CPUs are actually supposed to be used for.
The GPU approach has pretty big downsides for anything other than the most embarrassingly-parallel code on very large inputs; namely, anything non-trivial (sorting, prefix sum) will typically require log(n) iterations and somewhere between twice as much and O(n*log(n)) memory access (and even summing requires stupid things like using memory for an accumulator instead of being able to just use vector registers), compared to the CPU SIMD approach of doing a single pass with some shuffles. GPUs handle this by trading off memory latency for more bandwidth, but any CPU that did that would immediately go right in the trash, because it'd utterly kill scalar code performance.
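To make "a single pass with some shuffles" concrete, here's a small sketch (my own illustration, not from the thread): an in-register prefix sum of four 32-bit ints with SSE2. The combining steps happen entirely in registers, so memory is read and written exactly once per element; carrying the top lane forward extends this to a whole array in a single pass.

    #include <emmintrin.h>  // SSE2

    // Inclusive prefix sum of four 32-bit ints inside one register:
    // two shift-and-add steps, no extra passes over memory.
    __m128i prefix_sum_epi32(__m128i x) {
        x = _mm_add_epi32(x, _mm_slli_si128(x, 4));  // x[i] += x[i-1]
        x = _mm_add_epi32(x, _mm_slli_si128(x, 8));  // x[i] += x[i-2]
        return x;
    }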
This reads like you haven't tried CUDA. The whole point of CUDA is that your CUDA code has single-threaded semantics. The problem you assumed it has is the problem it doesn't have, and the fact that it doesn't have this problem is the reason why it works and the reason why it wins.
EDIT: ok, now you're talking about the hardware difference between CPUs and GPUs. This is relevant for the types of programs that each can accelerate -- barrel processors are uniquely suited to embarrassingly parallel problems, obviously -- but it is not relevant for the question of "how to write code that is generic across block size, boundary conditions, and thread divergence." CUDA figured this out, non-embarrassingly-parallel programs still have this problem, and they should copy what works. The best time to copy what works was 20 years ago, but the second best time is today.
It has single-threaded semantics per element. Which is fine for anything that does completely independent computation for each element, but is quite annoying for everything else, requiring major algorithmic changes. And CPU SIMD is used for a lot of such things.
"Completely independent" except for anything that can be expressed using branches, queues, and locks. Which is everything. Again, are you sure you've tried CUDA? Past, like, the first tutorial?
I haven't used CUDA (I don't have an nvidia gpu), but I've looked at example code before. And it doesn't look any simpler than CPU SIMD for anything non-trivial.
And pleasepleaseplease don't have locks in something operating over a 20-element array; I'm pretty damn sure that's just gonna be a suboptimal approach in any scenario. (Even if those are just "software" locks for forcing serialized computation that don't actually end up as memory atomics or extra instructions, needing them at all hints at awful approaches like log(n) loops over an n=20-element array, or some in-memory accumulators, or something equally awful.)
As an extreme case of something I've had to do in CPU SIMD that I don't think would be sane in any other way:
How would I implement, in CUDA, code that: does elementwise 32-bit integer addition of two input arrays into a third array (which may be one of the inputs); checks for overflow; and, if any addition overflows (ideally early-exiting on it to not do useless work), reports in some way how much was processed? That way further code could redo the addition with a wider result type, and compute the full final wider result array even when some of the original inputs aren't available anymore because the input overlaps the output (which is fine, as for those elements the already-computed results can be used directly).
This is a pretty trivial CPU SIMD loop consisting of maybe a dozen intrinsics (even easily doable via any of the generalized arch-independent SIMD libraries!), but I'm pretty sure it'd require a ton of synchronization in anything CUDA-like, probably force early-exiting in way larger blocks, and probably require returning a bitmask of which threads wrote their results, as opposed to the SIMD loop trivially guaranteeing that the processed and unprocessed elements are split exactly where the loop stopped.
(for addition specifically you can also undo the addition to recover the input array, but that gets way worse for multiplication as the inverse there is division; and perhaps in the CUDA approach you might also want to split into separately checking for overflow and writing the results, but that's an inefficient two passes over memory just to split out a store of something the first part already computes)
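For concreteness, here's a minimal sketch of that loop in plain SSE2 (my own illustration; the function and variable names are made up): it adds until the first 4-element block containing an overflow and returns how many elements were written, so the caller knows exactly where to resume with a wider result type.

    #include <emmintrin.h>  // SSE2
    #include <stddef.h>
    #include <stdint.h>

    // Adds a[i] + b[i] into out[i] (out may alias an input), stopping before
    // the first 4-element block that contains a signed overflow.
    // Returns the number of elements actually written.
    size_t add_until_overflow(int32_t* out, const int32_t* a,
                              const int32_t* b, size_t n) {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128i va  = _mm_loadu_si128((const __m128i*)(a + i));
            __m128i vb  = _mm_loadu_si128((const __m128i*)(b + i));
            __m128i sum = _mm_add_epi32(va, vb);
            // Signed overflow iff the inputs share a sign and the sum's sign
            // differs: top bit of (~(a ^ b)) & (a ^ sum).
            __m128i ovf = _mm_andnot_si128(_mm_xor_si128(va, vb),
                                           _mm_xor_si128(va, sum));
            if (_mm_movemask_ps(_mm_castsi128_ps(ovf)) != 0)
                return i;  // early exit; report how much was processed
            _mm_storeu_si128((__m128i*)(out + i), sum);
        }
        for (; i < n; i++) {  // scalar tail, same overflow test
            int32_t s = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
            if ((~(a[i] ^ b[i]) & (a[i] ^ s)) < 0) return i;
            out[i] = s;
        }
        return n;
    }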
re edit - While the hardware differences are significant (some by necessity, some by tradition), they're not my point.
My specific point is that "how to write code that is generic across block size, boundary conditions, and thread divergence" is just not the right question to ask for many CPU SIMD use-cases. Many of those Just Do Not Fit That Paradigm. If you think you can just squeeze CPU SIMD usage into that box, then I don't think you've actually done any CPU SIMD beyond very trivial things (see my example problem in the other thread).
You want to take advantage of block size on CPUs. It's sad that GPU programmers don't get to. In other places I've seen multiple GPU programmers annoyed at not being able to use the CPU SIMD programming paradigm of explicit registers on GPUs. And doing anything about thread divergence on CPUs is just not gonna go well, due to the necessary focus on high clock rates (which makes branch mispredictions relatively ridiculously expensive).
You of course don't need anything fancy if you have a purely embarrassingly-parallel problem, which is what GPUs are explicitly made for. But for those, autovectorization actually already works, given hardware that has the necessary instructions (memory gathers/scatters, masked loads/stores if necessary; and of course no programming paradigm would magically make it work on hardware that doesn't). At worst you may need to add a _Pragma telling the compiler to ignore memory aliasing, at which point the loop body is exactly the same programming paradigm as CUDA (with thread synchronization being roughly "} for (...) {", but you gain better control over how things happen).
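As a tiny illustration of that aliasing pragma (my own sketch, using the directive form rather than _Pragma; the exact spelling is compiler-specific):

    #include <stddef.h>

    void scale(float* dst, const float* src, float k, size_t n) {
        // Promise the compiler there's no loop-carried dependency through
        // memory (i.e. dst doesn't alias src in a problematic way), so it can
        // vectorize freely. GCC spelling; clang has
        // "#pragma clang loop vectorize(assume_safety)".
        #pragma GCC ivdep
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * k;
    }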
Intel actually built and launched this 15 years ago. A GPU-like barrel processor with tons of memory bandwidth and wide vector instructions that ran x86. In later versions they even addressed the critical weakness of GPUs for many use cases (poor I/O bandwidth).
It went nowhere, aside from Intel doing Intel things, because most programmers struggle to write good code for those types of architectures so all of that potential was wasted.
That’s like saying that you have to describe your data flow in terms of gotos because the CPU doesn’t understand for loops and compilers aren’t magic. I don’t mean that autovectorization should just work (tm), I just mean that reasonable portable SIMD abstractions should not be this hard.
> I just mean that reasonable portable SIMD abstractions should not be this hard.
Morally, no, it really ought not to be this hard, and we need this. Practically, it really is hard, because SIMD instruction sets in CPUs are a mess. x86 and ARM have completely different sets of things they have instructions for, and even within the x86 family, even within a particular product class, things are inconsistent:
- On normal scalar registers, one has lzcnt (leading-zero count) and tzcnt (trailing-zero count), but on SIMD vectors there is only lzcnt. And you get that only on AVX-512, the latest-and-greatest in x86.
- You have horizontal adds (adding adjacent cells in a vector) for 16-bit ints, 32-bit ints, floats and doubles, and saturating horizontal add for 16-bit ints. https://www.intel.com/content/www/us/en/docs/intrinsics-guid... Where are horizontal adds for 8-bit or 64-bit ints, or any other saturating instructions?
- Since AVX-512 filled up a bunch of gaps in the instruction set, you have absolute value instructions on 8, 16, 32 and 64 bit ints in 128, 256 and 512 bit vectors. But absolute value on floats only exists on 512-bit vectors.
These are just the ones that I could find now, there is more. With this kind of inconsistency, any portable SIMD abstraction will be difficult to efficiently compile for the majority of CPUs, negating part of the advantage.
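To illustrate the lzcnt/tzcnt asymmetry from the first bullet: AVX-512CD gives you a per-lane lzcnt, and a vector tzcnt has to be pieced together. One possible emulation (my own sketch, not from the post), using the identity tzcnt(x) = 32 - lzcnt(~x & (x - 1)):

    #include <immintrin.h>

    // Per-lane trailing-zero count for 32-bit lanes, built from the
    // leading-zero-count instruction that AVX-512CD does provide.
    __m512i tzcnt_epi32(__m512i x) {
        const __m512i minus1 = _mm512_set1_epi32(-1);
        // ~x & (x - 1): a mask of the trailing zeros of x (all-ones if x == 0)
        __m512i low = _mm512_andnot_si512(x, _mm512_add_epi32(x, minus1));
        return _mm512_sub_epi32(_mm512_set1_epi32(32), _mm512_lzcnt_epi32(low));
    }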
If by that absolute value thing you mean _mm512_abs_pd, that's a pseudoinstruction for 'and'ing via a mask that zeroes out the top bit, which can be done equally as well on 128/256-bit vectors, just without an intrinsic for some arbitrary reason (a one-liner; see the sketch after the list below). But yeah the gaps are super annoying. Some of my personal picks:
- There's only 8- and 16-bit integer saturating add/subtract, even on AVX-512
- No 8-bit shifts anywhere either; AVX2 only has 32- and 64-bit dynamic shifts (and ≥16-bit constant shifts; no 64-bit arithmetic shift right though!), AVX-512 adds dynamic 16-bit shifts, still no 8-bit shifts (though with some GFNI magic you can emulate constant 8-bit shifts)
- Narrowing integer types pre-AVX-512 is rather annoying, taking multiple instructions. And even though AVX-512 has instructions for narrowing vectors, you're actually better off using multiple-table-input permute instructions and narrowing multiple vectors at the same time.
- Multiplies on x86 are extremely funky (there's a 16-bit high half instr, but no other width; a 32×32→64-bit instr, but no other doubling width instr; proper 32-bit multiply is only from AVX2, proper 64-bit only in AVX-512). ARM NEON doesn't have 64-bit multiplication.
- Extracting a single bit from each element (movemask/movmsk) exists for 8-/32-/64-bit elements, but not 16-bit, on x86 pre-AVX512; ARM NEON has none of those, requiring quite long instruction sequences (and you benefit quite a bit from unrolling and packing multiple vectors together, or even doing structure loads to do some of the rearranging)
- No 64-bit int min/max nor 16-bit element top-bit dynamic blend pre-AVX512
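And the absolute-value sketch promised above: the "missing" 128-bit intrinsic really is just the same AND-away-the-sign-bit trick that _mm512_abs_pd performs.

    #include <emmintrin.h>  // SSE2

    // abs() for doubles in a 128-bit vector: mask off the sign bit.
    static inline __m128d abs_pd128(__m128d x) {
        const __m128d abs_mask =
            _mm_castsi128_pd(_mm_set1_epi64x(0x7FFFFFFFFFFFFFFFLL));
        return _mm_and_pd(x, abs_mask);
    }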
I know dzaima is aware, but for all the other posters who might not be, our Highway library provides all these missing instructions, via emulation if required.
I do not understand why folks are still making do with direct use of intrinsics or compiler builtins. Having a library centralize workarounds (such as an MSAN compiler change which hit us last week) seems like an obvious win.
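For readers who haven't seen it, a rough sketch of what a width-agnostic Highway kernel looks like (adapted from the style of the project's README; treat names and details as approximate and check the docs -- real code would also handle tails and unaligned data via LoadU/StoreU):

    #include <stddef.h>
    #include "hwy/highway.h"

    namespace hn = hwy::HWY_NAMESPACE;

    // Adds two float arrays at whatever vector width the target provides.
    // Assumes n is a multiple of the lane count and aligned data, for brevity.
    void AddArrays(const float* a, const float* b, float* out, size_t n) {
      const hn::ScalableTag<float> d;  // descriptor for the widest available vector
      for (size_t i = 0; i < n; i += hn::Lanes(d)) {
        const auto va = hn::Load(d, a + i);
        const auto vb = hn::Load(d, b + i);
        hn::Store(hn::Add(va, vb), d, out + i);
      }
    }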
> Practically, it really is hard, because SIMD instruction sets in CPUs are a mess. X86 and ARM have completely different sets of things that they have instructions for
Not disagreeing that it's a mess, but there's also quite a big common subset containing all the basic arithmetic ops and some specialized ones like rsqrt, rcp, dot product, etc.
These should be easier to use without having to write the code for each instruction set. And they are with C vector extensions or Rust std::simd.
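For example, the GCC/Clang vector-extensions route looks roughly like this (a small sketch; type and size chosen arbitrarily):

    // 8 x int32 = a 256-bit vector; GCC/Clang pick the widest instructions
    // available for the target (with -mavx2 this is a vpmulld plus a vpaddd).
    typedef int v8si __attribute__((vector_size(32)));

    v8si madd(v8si a, v8si b, v8si c) {
        return a * b + c;  // plain operators, no per-ISA intrinsics
    }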
Some of the inconsistencies you mention are less of a problem in portable simd, taking Rust for example:
- lzcnt and tzcnt: std::simd::SimdInt has both leading_zeros and trailing_zeros (also leading/trailing_ones) for every integer size and vector width.
- horizontal adds: notably missing from std::simd (gotta use intrinsics if you want it), but there is reduce_sum (although it compiles to add and swizzle). Curiously LLVM does not compile `x + simd_swizzle!(x, [1, 0, 3, 2])` into haddps
- absolute values for iBxN and fBxN out of the box.
Also these have fallback code (which is mostly reasonable, but not always) when your target CPU doesn't have the instruction. You'll need to enable the features you want at compile time (-C target-feature=+avx2).
> With this kind of inconsistency, any portable SIMD abstraction will be difficult to efficiently compile for the majority of CPUs, negating part of the advantage.
I agree it negates a part of the advantage. But only a part, and for that you have zero cost fallback to intrinsics. And in my projects that part has been tiny compared to the overall amount of SIMD code I've written.
For basic arithmetic ops it's a huge win to only have to write the code once, and to use normal math operators (+, -, *, /) instead of memorizing the per-CPU intrinsics of two (or more) CPU vendors.
There are different ways of approaching it, which have different performance consequences. That's why accelerated libraries are common, but if you want accelerated primitives, you kinda have to roll your own.
This turned into a larger rant than I wanted it to be. But I need to let it out every now and then. Feel free to skip it.
IMHO programming languages are, for the most part, designed such that a compiler's job is easy if you want to compile to scalar code, but damned near impossible if you want it to compile to vectorized code.
So I have a point type, right? 'struct point3f { float x,y,z; };'. Easy peasy lemon squeezy. I add a bunch of member functions for normal stuff like addition, scalar multiplication, dot/cross product, etc. I write a triangle type: 'struct triangle { point3f a,b,c; };'. I write a bunch of functions for geometry stuff, intersections with rays, normals, etc.
Then I make an array of triangles. I have an origin point and a look ray. I want to iterate over each triangle and figure out which triangles intersect my origin/ray. Now I'm stuck. This is a perfect use case for vectorization, I can trivially get 4x/8x/16x speedup with SSE/AVX/AVX512, but the compiler can't do it. The data's in the wrong layout. It's in the correct layout for scalar code, but the wrong format for vector code. If you want to write vector code, your data has to be in a struct of arrays (SoA) layout.
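To spell out the layout contrast being described (a sketch; the SoA struct and its field names are hypothetical):

    #include <vector>

    // AoS, as above: natural for scalar code, x/y/z of one triangle adjacent.
    struct point3f { float x, y, z; };
    struct triangle { point3f a, b, c; };
    std::vector<triangle> tris_aos;

    // SoA: what the vectorizer wants. Each field is contiguous, so eight
    // consecutive triangles' a.x values load straight into one AVX register.
    struct triangles_soa {
        std::vector<float> ax, ay, az;
        std::vector<float> bx, by, bz;
        std::vector<float> cx, cy, cz;
    };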
There ought to exist a programming language that will, by default, automagically convert everything to SoA layout, unless you flag your array as AoS or your class as non-SoA-able. And range-based for loops should be, by default, unsequenced.
This will make autovectorization an order of magnitude easier, and will enable vectorization on complicated stuff that might be fiendishly difficult to vectorize, even by hand.
This isn't trivial, and it can't simply be tacked on to a language later on. SIMD needs to be a day-0 priority. Everything else needs to be in support of that.
Until then, until this happens, we will either be leaving 60% of our CPU's silicon idle 99% of the time, or scrubs like me will continue to write SIMD-intrinsic-laden code with lots of manual bullshit to do what the compiler/language design should be doing for me for free.
The Vector<T> type in dotnet core works precisely like this. Just yesterday I tried it for some spatial map aggregation filters and calculations, and it worked great.
On my older PC it used AVX2 instructions, processing 256 bits at a time; on my newer laptop it automatically switched to AVX-512 and got 4x faster, without a single line of code having to change!
I.e.: Vector<int> has 8x "int" elements on older processors, but it has 16x elements on newer ones. You can retrieve the count with Vector<int>.Count and use that to split your data.
The only issue is that swizzles and scatter/gather are unavailable for these generic types.
Here we are discussing the merits of the built-in SIMD facilities of not one but two programming languages. Waiting for the Zig guys to chime in to make it three.
People have said this for longer than I've been alive. I don't think it's a meaningful concept.