Thanks for the explanation. The only SIMD programming I've seen is where the programmer carefully calls CPU-brand-specific intrinsics and painstakingly manages the vector registers, making sure the numbers to be added, multiplied, etc. are evenly divided before being handed to the SIMD ALUs.
Sounds like what you are saying is that the fork-join model translates easily, via the compiler, to these SIMD instructions?
Some compilers can also vectorize plain loops, but you would advocate for fork-join?
> Sounds like what you are saying is that the fork-join model translates easily, via the compiler, to these SIMD instructions?

Why do you think CUDA has become so popular recently? That's exactly what CUDA, OpenCL, and ISPC do.
> Some compilers can also vectorize plain loops, but you would advocate for fork-join?

CUDA-style / OpenCL-style fork-join is clearly easier than reading compiler output and trying to debug why your loop failed to vectorize. That's the trouble with auto-vectorizers: you end up wading through tons of compiler diagnostics, or checking the assembly, just to make sure it worked.
ALL fork-join style CUDA / OpenCL code automagically compiles into SIMD instructions. Ditto with ISPC. Heck, GPU programmers have been doing this since the DirectX / OpenGL shader days, decades ago.
There's no "failed to vectorize". There's no looking up SIMD instructions, registers, or intrinsics. (Well, dropping to GPU assembly is allowed, but not necessary.) It just works.
-------
If you've never tried it, really try one of those languages. CUDA is for NVidia GPUs. OpenCL runs on AMD (and other vendors' hardware). ISPC targets Intel CPUs: instead of SIMD intrinsics, it gives you an OpenCL-like fork-join SIMD programming environment.
And of course, Julia and Python have CUDA plugins. These aren't as reliable as a dedicated language like OpenCL or ISPC, but they might be easier for you to play with than learning another language.
OpenMP is just #pragmas on top of your standard C, C++, or Fortran code, so any C / C++ / Fortran compiler can give this sort of thing a whirl rather easily.
---------
OpenMP has always been a fork-join-model #pragma add-on to C / C++ / Fortran. Eventually they realized that their fork-join model works for SIMD too, and finally added SIMD explicitly to the specification (OpenMP 4.0's `simd` directive).
Fortran Coarrays go far beyond simple fork-join. They enable one-sided remote memory access, something that is impossible in OpenMP or CUDA as far as I'm aware, and that requires the highest levels of skill to get right in MPI.