1. The compiler will vectorize simple operations like that pretty well.
2. Zig has built-in @Vector types: fixed-size vector types designed to compile down to SIMD as efficiently as possible, even when you ask for, say, 16-wide operations on a CPU that only supports 8-wide SIMD. You'd often write your high-level code as a runtime-known iteration count over those comptime-known vector widths (sketched after this list).
2a. Inline assembly or inspecting the architecture before choosing the @Vector width are both options, so you can write your high-level code with that information in mind if necessary (e.g., to make Bolt vector quantization work well in Zig I'm pretty sure you need to hand-write inline assembly for one of the swizzling operations).
3. You can always link to LAPACK and friends. Zig has a great C interface (see the @cImport sketch below).
4. Matrix/tensor ops aren't built in. That doesn't matter for a lot of what this demo shows, since it'll be RAM/cache bandwidth bound, but you'd definitely need to link in or hand-code inner-product routines to get better asymptotics and cache friendliness if you were doing a lot of large matrix multiplies.
5. Wrapping any of the above into a library would be pretty easy. That sort of code is easy to write, so I haven't looked to see what other people have made in that space, but I'm sure there's something.
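To make 2 and 2a concrete, here's a rough sketch of that pattern in Zig (recent syntax assumed; the 8-wide width and the function name are my own choices, not from any library). It's also the shape you'd start from for the hand-rolled inner-product routines item 4 mentions: a runtime-known trip count over comptime-width chunks, a horizontal @reduce, and a scalar tail:

    const std = @import("std");

    // W = 8 is an assumption; on recent Zig you could instead ask
    // std.simd.suggestVectorLength(f32) for a target-appropriate width.
    const W = 8;
    const Vf = @Vector(W, f32);

    fn dot(a: []const f32, b: []const f32) f32 {
        std.debug.assert(a.len == b.len);
        var acc: Vf = @splat(0);
        var i: usize = 0;
        // Runtime-known iteration count over comptime-known vector chunks.
        while (i + W <= a.len) : (i += W) {
            const va: Vf = a[i..][0..W].*;
            const vb: Vf = b[i..][0..W].*;
            acc += va * vb;
        }
        // Horizontal reduction, then a scalar tail for leftover elements.
        var sum: f32 = @reduce(.Add, acc);
        while (i < a.len) : (i += 1) sum += a[i] * b[i];
        return sum;
    }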
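And for 3, a hedged sketch of what calling a C BLAS routine looks like through @cImport. It assumes a CBLAS implementation (e.g., OpenBLAS) is installed and that you link against libc and the BLAS library when building:

    const std = @import("std");
    const c = @cImport(@cInclude("cblas.h"));

    pub fn main() void {
        const x = [_]f32{ 1, 2, 3 };
        const y = [_]f32{ 4, 5, 6 };
        // cblas_sdot computes a single-precision dot product.
        const d = c.cblas_sdot(3, &x, 1, &y, 1);
        std.debug.print("dot = {d}\n", .{d});
    }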
And their async/await is being redesigned IIRC, but at least the last version made for a low-overhead (in both developer time and runtime) way to parallelize any of the above. In the old version it was totally trivial to write a high-performance parallel variant-sudoku solver (knight's-move constraints and whatnot), and I doubt the new version will be any worse. Not that the interior of a linear algebra operation is often the best place to apply parallelism, but if you had a good reason to do so it wouldn't be hard.
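For example, while async/await is in flux, plain std.Thread already gives you a chunk-and-join pattern. A minimal sketch; the thread count, chunking strategy, and scaleChunk function are placeholder choices of mine:

    const std = @import("std");

    fn scaleChunk(chunk: []f32, factor: f32) void {
        for (chunk) |*v| v.* *= factor;
    }

    pub fn main() !void {
        var data = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8 };
        const n_threads = 2; // placeholder; you'd size this to the machine
        const chunk_len = data.len / n_threads;

        var threads: [n_threads]std.Thread = undefined;
        for (&threads, 0..) |*t, i| {
            const start = i * chunk_len;
            // The last thread picks up any remainder.
            const end = if (i == n_threads - 1) data.len else start + chunk_len;
            t.* = try std.Thread.spawn(.{}, scaleChunk, .{ data[start..end], @as(f32, 2.0) });
        }
        for (threads) |t| t.join();

        std.debug.print("{any}\n", .{data});
    }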
I'm not aware of anything in particular that would make multi-machine computation even slightly less painful than in other languages, but maybe someone can chime in here with ideas.
C++ / Eigen performs a lot of optimizations that you might expect from a Fortran compiler. For example, its ability to broadcast and perform reductions is pretty slick, e.g.:

    double x = (A - B).colwise().squaredNorm().mean();
Short of having tensor cores, Eigen does a good job of statically optimizing for whatever vectorization the underlying hardware supports.
> 2a. Inline assembly or inspecting the architecture before choosing the @Vector width are both options, so you can write your high-level code with that information in mind if necessary (e.g., to make Bolt vector quantization work well in Zig I'm pretty sure you need to hand-write inline assembly for one of the swizzling operations).
Inline assembly is great, but support for intrinsics would be really valuable for Zig IMO.
I could write a `pub fn main()` that consisted only of a call to a C library that implemented world_peace(). But it wouldn't change the fact that "support for intrinsics would be really valuable for Zig IMO."