
I didn't say C is faster than C++; I said C++ is not faster than C, and so speed is probably not the main driver behind implementing a C++-to-Assembly pipeline without intermediate C.


Playing word games here?

"I said C++ is not faster than C" implies that C++ compilers don't beat C compilers, which, as many in the HPC, HFT and GPGPU computing domains know, has been false for years now. And no, restrict doesn't help that much against template metaprogramming and constexpr.


I'm not playing word games.

Template metaprogramming and constexpr don't make code faster in HPC or GPGPU; they help reduce the redundancy of your code, for example when you want one generic algorithm over float, double, int and complex.

What helps speed is being able to control memory allocations and having the tools to place the data your kernel requires in registers, L1 cache or L2 cache (and similarly for GPUs).

On current architectures, what is hard to optimize is memory and data movement: if your data is in the wrong place, or not prefetched at the right time, it will be literally 100 times more costly than an addition saved by constexpr.


Enough theory:

"Scientific Computing: C++ Versus Fortran" (1997)

https://www.drdobbs.com/cpp/scientific-computing-c-versus-fo...

"Micro-Optimisation in C++: HFT and Beyond"

http://research.ma.cx/NDCTechTown_2017_JMMcG_v1_1.pdf

"The Speed Game: Automated Trading Systems in C++"

https://www.youtube.com/watch?v=ulOLGX3HNCI

"When a Microsecond Is an Eternity: High Performance Trading Systems in C++"

https://www.youtube.com/watch?v=NH1Tta7purM


It might be the Internet and the difficulty of communicating emotion across it, but you sound quite worked up about this issue.

Anyway, I stand by what I said, and I'm backed by my high-performance code:

- Writing matrix multiplication that is as fast as Assembly, complete with analysis and control of register allocation, L1 and L2 cache tiling, and avoidance of TLB cache misses:

- https://github.com/numforge/laser/blob/master/laser/primitiv...

- Code, including caveat about hyperthreading: https://github.com/numforge/laser/blob/master/laser/primitiv...

- The code is all pure Nim and is as fast as or faster than OpenBLAS when multithreaded. Caveat: the single-threaded kernels are slightly slower, but it scales better on multiple cores.

- I've also written my own multithreading runtime. It scales better and has lower overhead than Intel TBB. There is no constexpr; you need type erasure to handle everything people might use a multithreading runtime for. Same comparison on GEMM: https://github.com/mratsim/weave/tree/v0.4.0/benchmarks/matm...

- More resources on the importance of memory bandwidth: optimizing convolutions https://github.com/numforge/laser/wiki/Convolution-optimisat...

- Optimizing matrix multiplication on GPUs: https://github.com/NervanaSystems/maxas/wiki/SGEMM; again, it's all about memory and cache optimization

- Let's switch to another domain with critical perf needs, cryptography. Even when the bounds of iterating on a bigint are known at compile time, compilers are very bad at producing optimized code; see GCC vs Clang https://gcc.godbolt.org/z/2h768y

- And crypto is the one place where integer templates are very useful, since you know the bounds.

- Another domain? VM interpretation. The slowness there is due to function-call overhead and/or switch dispatching, and to not properly using the hardware prefetchers. Same thing: C++ constexpr doesn't help, the problem is lower-level than that. See resources: https://github.com/status-im/nimbus/wiki/Interpreter-optimiz...

All the polyhedral research and the deep-learning compiler research, including Halide, Taichi, Tiramisu, Legion and DaCE, also confirm that memory is the big bottleneck.

Now, since you want to set theory aside and you mentioned HPC, pick your algorithm: matrix multiplication, QR decomposition, Cholesky, ... Any fast C++ code (or C, Fortran or Assembly) you find will be fast because of careful memory layout across all cache levels, not because of constexpr.

If you have your own library in one of those domains I would be also very happy to have a look.

As a simple example, let's pick an out-of-place kernel to transpose a matrix. Show me how you use constexpr and template metaprogramming to speed it up. Here is a detailed analysis of the impact of 1D and 2D tiling: https://github.com/numforge/laser/blob/master/benchmarks/tra...; throughput can be increased 4x with proper use of the memory caches.


Ah, so now the opinions of experts in the matter don't count, only what I prove myself?

I guess that is why NVidia has spent 10 years doing hardware design to optimize their cards for C++ execution.

Apparently that was wasted money, they should have kept using C.


I mentioned theory and experts; you said "enough theory".

I switched to practical applications and walked the talk by showing my code, and then you backed off and wanted to go back to opinions.

I see now that you want me to back myself with experts, since reproducible code and runnable benchmarks are not enough.

Apparently you recognize Nvidia as an expert, so let's talk about cuDNN, where optimizing convolution is all about memory layout (source: https://github.com/soumith/convnet-benchmarks/issues/93#issu...) and not about C vs C++ vs PTX.

Or let's hear what Nvidia says about optimizing GEMM: https://github.com/NVIDIA/cutlass/blob/master/media/docs/eff...; it's all about memory locality and tiling.

Or maybe Stanford, the US government and Nvidia Research are also wrong to pour significant research into Legion? https://legion.stanford.edu/

> Legion is a data-centric parallel programming system for writing portable high performance programs targeted at distributed heterogeneous architectures. Legion presents abstractions which allow programmers to describe properties of program data (e.g. independence, locality). By making the Legion programming system aware of the structure of program data, it can automate many of the tedious tasks programmers currently face, including correctly extracting task- and data-level parallelism and moving data around complex memory hierarchies. A novel mapping interface provides explicit programmer controlled placement of data in the memory hierarchy and assignment of tasks to processors in a way that is orthogonal to correctness, thereby enabling easy porting and tuning of Legion applications to new architectures.

Are you saying they should have just called it a day once they were done with C++?

Or you can read the DaCE paper on how to beat cuBLAS and cuDNN: https://arxiv.org/pdf/1902.10345.pdf; it's all about data movement. In section 6.4, "Case Study III: Quantum Transport" (optimizing transistor heat dissipation), Nvidia's strided matrix multiplication was improved upon by over 30%, and that part is pure Assembly; the improvement came from better use of the hardware caches.


Nah, I was answering the whole "C vs C++" issue.

But then, since you saw it was a losing battle going down that path, you pulled the hardware rabbit out of the magician's hat.

So we moved from the "C++ is not faster than C" assertion to memory layouts, hardware design and data representation.

Now you are even asserting that it's not about C vs C++ vs PTX, and going down the quantum-transport lane?

Yeah, whatever.


Obviously you didn't read what I posted. Quantum transport is a compute-intensive physics problem with a lot of optimization research behind it. One of the main bottlenecks in solving it is strided matrix multiplication.

There is no C vs C++ issue. You keep saying that constexpr and template metaprogramming matter in high-performance computing and GPGPU; I have given you links, benchmarks and actual code showing that what makes the difference is memory locality.

Ergo, as long as your language is low-level enough to control that locality (be it C, C++, Fortran, Rust, Nim, Zig, ...), you can achieve speedups of several orders of magnitude, and that control is absolutely required for high performance.

Constexpr and template metaprogramming don't matter in high performance computing, prove me wrong, walk the talk, don't drink the kool-aid.

There are plenty of well studied computation kernels you can use: matrix multiplication, convolution, ray-tracing, recurrent neural network, laplacian, video encoding, Cholesky decomposition, Gaussian filter, Jacobi, Heat, Gauss Seidel, ...



