See how the code has been written only once, but multiple versions of the same functions were generated targeting different hardware features (e.g. SSE, AVX, AVX512). Then `HWY_DYNAMIC_DISPATCH` can be used to dynamically call the fastest one matching your CPU at runtime.
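For anyone curious what that looks like in practice, here is a rough sketch of the usual structure, closely following the quickstart example from Highway's README (function names like `MulAddLoop`/`CallMulAddLoop` are illustrative, and exact macro usage may differ between library versions):

```cpp
// mul_add.cc -- sketch of Highway's multi-target + dynamic dispatch pattern.
#include <cstddef>

// Tell foreach_target.h which file to re-include once per enabled target.
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "mul_add.cc"
#include "hwy/foreach_target.h"  // must come before highway.h
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace project {
// HWY_NAMESPACE expands to a unique name per target (N_SSE4, N_AVX2, ...),
// so this function is compiled once per target.
namespace HWY_NAMESPACE {
namespace hn = hwy::HWY_NAMESPACE;

// Assumes size is a multiple of Lanes(d), as in the README example.
void MulAddLoop(const float* HWY_RESTRICT mul, const float* HWY_RESTRICT add,
                size_t size, float* HWY_RESTRICT x) {
  const hn::ScalableTag<float> d;  // as many lanes as the target supports
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {
    const auto m = hn::Load(d, mul + i);
    const auto a = hn::Load(d, add + i);
    hn::Store(hn::MulAdd(m, hn::Load(d, x + i), a), d, x + i);
  }
}

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();

// This part is compiled only once; it declares the table of per-target
// function pointers and lets the dispatcher pick the best one for the
// running CPU.
#if HWY_ONCE
namespace project {
HWY_EXPORT(MulAddLoop);

void CallMulAddLoop(const float* mul, const float* add, size_t size,
                    float* x) {
  HWY_DYNAMIC_DISPATCH(MulAddLoop)(mul, add, size, x);
}
}  // namespace project
#endif  // HWY_ONCE
```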
Thank you so much, this explains it well. I was initially afraid that the dispatch would be costly, but from what I understand it's (almost) zero cost after the first call.
I only code for x86 with the vectorclass library, so I have never had to worry about portability. In practice, is it really possible to write generic SIMD code like the example using Highway? Or would you often find optimization opportunities by targeting a particular architecture?
You can go quite far with such libraries if you only perform data-parallel numerics on the CPU. However, if you work on complex algorithms or exotic data structures, there's almost always more upside in avoiding them and writing specialized code for each platform of interest.
I don't understand why it helps to "avoid them" entirely. For the (in my experience) >90% of code that is shared, we get the convenience of the wrapper library. For the rest, Highway allows target-specific specializations amidst your otherwise portable code: `#if HWY_TARGET == HWY_AVX2 ...`
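To make that concrete, a hedged sketch of what such a specialization can look like inside the per-target section (the helper name and the trivial body are made up; normally the AVX2 branch would contain genuinely different, hand-tuned code):

```cpp
// Inside the per-target HWY_NAMESPACE section of a Highway source file.
// Both branches are equivalent here; the point is that the #if block only
// ends up in the AVX2-compiled copy of the code.
namespace hn = hwy::HWY_NAMESPACE;

template <class D>
hn::Vec<D> ScaleExample(D d, hn::Vec<D> v) {
#if HWY_TARGET == HWY_AVX2
  // A hand-tuned AVX2 variant could live here, while every other target
  // falls through to the portable path below.
  return hn::Mul(v, hn::Set(d, 2.0f));
#else
  // Portable path for SSE4, AVX-512, NEON, SVE, WASM, ...
  return hn::Mul(v, hn::Set(d, 2.0f));
#endif
}
```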
In my experience, ISPC and Google's Highway project lead to better results in practice - this is mostly due to their dynamic dispatch features.