Obviously you didn't read what I posted. Quantum transport is a compute-intensive physics problem that has a lot of optimization research behind it. One of the main bottlenecks in solving it is strided matrix multiplication.
There is no C vs C++ issue. You keep saying that constexpr and template metaprogramming matter in high-performance computing and GPGPU. I have given you links, benchmarks and actual code showing that what makes a difference is memory locality.
Ergo, as long as your language is low-level enough to control that locality, be it C, C++, Fortran, Rust, Nim, Zig, ... you can achieve speedups of several orders of magnitude, and that control is absolutely required to get high performance.
Constexpr and template metaprogramming don't matter in high-performance computing. Prove me wrong: walk the talk, don't drink the Kool-Aid.
There are plenty of well-studied computation kernels you can use: matrix multiplication, convolution, ray tracing, recurrent neural networks, Laplacian, video encoding, Cholesky decomposition, Gaussian filter, Jacobi, heat, Gauss-Seidel, ...
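To make the locality point concrete on the first of those kernels, here is a minimal sketch in plain C. The function names matmul_ijk and matmul_ikj are made up for illustration and are not taken from the code linked earlier. Both compute the same row-major n x n product; the only difference is loop order, and therefore the stride of the innermost memory accesses. No constexpr, no templates.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B for row-major n x n matrices.
 * Loop order i-j-k: the inner loop reads b[k*n + j], i.e. it walks B
 * with a stride of n elements, touching a new cache line almost every
 * iteration once n is large. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Same product, loop order i-k-j: the inner loop reads b[k*n + j] and
 * updates c[i*n + j] with j varying fastest, so both are walked
 * contiguously (unit stride). */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    assert(a && b && c);
    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0;
    for (size_t i = 0; i < n; ++i) {
        for (size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Same arithmetic, same language, same compiler. On a typical cached CPU the second version tends to be noticeably faster for large n, though how much depends on the hardware and compiler flags, so benchmark it yourself. Tiling and vectorization go further in the same direction, and none of it needs anything beyond control over memory layout.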
But then, since you saw it was a losing battle going down that path, you pulled the hardware rabbit out of the magician's hat.
So we moved from the "C++ is not faster than C" assertion to memory layouts, hardware design and data representation.
Now you are even asserting that it's not about C vs C++ vs PTX, and going down the quantum transport lane?
Yeah, whatever.