Fair enough; this is not meant as an endorsement of the reference Fortran BLAS implementations over the optimized versions cited above, only that the mainstream compilers cited above appear capable of applying these optimizations to the standard Fortran BLAS without any additional effort.
I am basing these comments on quick inspection of the assembly output. Timings would be equally interesting to compare at each stage, but I'm only willing to go so far for a Hacker News comment. So all I will say is perhaps let's keep an open mind about the capability of simple Fortran code.
Check out The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. Chapter 5 walks through how to write an optimized GEMM. It involves clever use of block multiplication, with block sizes tuned to the cache hierarchy of specific chips. Modern compilers just aren't able to do such things on their own. I've spent a little time debugging things in scipy.linalg by swapping out OpenBLAS for reference BLAS, and I've found the slowdown from using reference BLAS is typically at least an order of magnitude.
You are right. I just tested this out, and my speed going from reference BLAS to OpenBLAS jumped from 6 GFLOP/s to 150 GFLOP/s. I can only imagine what BLIS and MKL would give. I apologize for my ignorance; apparently my faith in the compilers was wildly misplaced.
No, you can still trust compilers: 1) The hand-tuned BLAS routines are essentially a different algorithm with hard-coded chip-specific information. 2) The default OpenBLAS build uses OpenMP parallelism, so much of the speedup likely comes from multithreading. Set the OMP_NUM_THREADS environment variable to 1 before running your benchmarks. You will still see a significant performance difference, due to factors such as the extra hard-coded information in the OpenBLAS implementation.
I ran with OMP_NUM_THREADS=1, but your point is well taken.
As for the original post, I felt a bit embarrassed about my original comments, but I think the compilers actually did fairly well given what they had to work with, which I take to be your first point.