See how the code has been written only once, but multiple versions of the same functions were generated targeting different hardware features (e.g. SSE, AVX, AVX512). Then `HWY_DYNAMIC_DISPATCH` can be used to dynamically call the fastest one matching your CPU at runtime.
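For anyone curious what that looks like in practice, here is a rough sketch of the usual structure, closely following the quickstart example from Highway's README (function names like `MulAddLoop`/`CallMulAddLoop` are illustrative, and exact macro usage may differ between library versions):

```cpp
// mul_add.cc -- sketch of Highway's multi-target + dynamic dispatch pattern.
#include <cstddef>

// Tell foreach_target.h which file to re-include once per enabled target.
#undef HWY_TARGET_INCLUDE
#define HWY_TARGET_INCLUDE "mul_add.cc"
#include "hwy/foreach_target.h"  // must come before highway.h
#include "hwy/highway.h"

HWY_BEFORE_NAMESPACE();
namespace project {
// HWY_NAMESPACE expands to a unique name per target (N_SSE4, N_AVX2, ...),
// so this function is compiled once per target.
namespace HWY_NAMESPACE {
namespace hn = hwy::HWY_NAMESPACE;

// Assumes size is a multiple of Lanes(d), as in the README example.
void MulAddLoop(const float* HWY_RESTRICT mul, const float* HWY_RESTRICT add,
                size_t size, float* HWY_RESTRICT x) {
  const hn::ScalableTag<float> d;  // as many lanes as the target supports
  for (size_t i = 0; i < size; i += hn::Lanes(d)) {
    const auto m = hn::Load(d, mul + i);
    const auto a = hn::Load(d, add + i);
    hn::Store(hn::MulAdd(m, hn::Load(d, x + i), a), d, x + i);
  }
}

}  // namespace HWY_NAMESPACE
}  // namespace project
HWY_AFTER_NAMESPACE();

// This part is compiled only once; it declares the table of per-target
// function pointers and lets the dispatcher pick the best one for the
// running CPU.
#if HWY_ONCE
namespace project {
HWY_EXPORT(MulAddLoop);

void CallMulAddLoop(const float* mul, const float* add, size_t size,
                    float* x) {
  HWY_DYNAMIC_DISPATCH(MulAddLoop)(mul, add, size, x);
}
}  // namespace project
#endif  // HWY_ONCE
```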
Thank you so much, this explains it well. I was initially afraid that the dispatch would be costly, but from what I understand it's (almost) zero cost after the first call.
I only code for x86 with the vectorclass library, so I have never had to worry about portability. In practice, is it really possible to write generic SIMD code like the example using Highway? Or would you often find optimization opportunities by targeting a particular architecture?
You can go quite far with such libraries if you only perform data-parallel numerics on the CPU. However, if you work on complex algorithms or exotic data structures, there's almost always more upside in avoiding them and writing specialized code for each platform of interest.
I don't understand why it helps to "avoid them" entirely. For the (in my experience) >90% of code that is shared, we get the convenience of the wrapper library. For the rest, Highway allows target-specific specializations amidst your otherwise portable code: `#if HWY_TARGET == HWY_AVX2 ...`
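To make that concrete, a hedged sketch of what such a specialization can look like inside the per-target section (the helper name and the trivial body are made up; normally the AVX2 branch would contain genuinely different, hand-tuned code):

```cpp
// Inside the per-target HWY_NAMESPACE section of a Highway source file.
// Both branches are equivalent here; the point is that the #if block only
// ends up in the AVX2-compiled copy of the code.
namespace hn = hwy::HWY_NAMESPACE;

template <class D>
hn::Vec<D> ScaleExample(D d, hn::Vec<D> v) {
#if HWY_TARGET == HWY_AVX2
  // A hand-tuned AVX2 variant could live here, while every other target
  // falls through to the portable path below.
  return hn::Mul(v, hn::Set(d, 2.0f));
#else
  // Portable path for SSE4, AVX-512, NEON, SVE, WASM, ...
  return hn::Mul(v, hn::Set(d, 2.0f));
#endif
}
```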
In my experience, ISPC and Google's Highway project lead to better results in practice - this is mostly due to their dynamic dispatch features.