
What did you think of their explanation of "memory parallelism" in section 4.3? I read it quickly, but thought it was pretty accurate. I liked that they made explicit the link between the number of instructions and the degree of parallelism. I was a little disappointed by the final statement that they "suspect" that the CPU is capable of only 8 parallel requests.


The analysis looks exactly right to me (they don't explain how they calculated MLP, but the numbers look about as expected given x86 ROB and fill-buffer sizes).

One thing they don't mention is that the low MLP isn't inherent in the other algorithms! With the default/obvious implementation of a benchmark loop using each method you hit this issue, but in principle I think you could rewrite almost any algorithm that is "instruction count MLP limited" like this into another one that isn't. A simple sketch: do a batch of N loads up front, storing the results into a temporary buffer, then do the computation part.

The loading part will have maximum MLP (after all, it's pure loads), and then the computation part will get all L1 hits, unless you picked N wrong.

It loses a bit because the loads aren't as effectively overlapped with computation, but if the load time "dominates" as presented here, it would still work well.
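As a concrete illustration of the batching idea, here is a minimal sketch in C. The problem (summing a table through a random index array), the function names, and the batch size are all hypothetical choices, not from the linked article; the point is only the two-phase structure: a run of independent loads, then computation on the buffered results.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum table[idx[i]] for a large, random idx.
   The obvious loop issues one load and immediately consumes it, so
   instruction-window resources limit how many misses are in flight.
   Doing a batch of N independent loads first lets the CPU keep up
   to N misses outstanding at once. */
enum { BATCH = 16 };            /* N: tune to the machine's miss capacity */

uint64_t gather_sum_batched(const uint64_t *table,
                            const uint32_t *idx, size_t n)
{
    uint64_t buf[BATCH];
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + BATCH <= n; i += BATCH) {
        /* Phase 1: pure loads -- no load depends on a prior load,
           so all BATCH misses can be in flight together. */
        for (size_t j = 0; j < BATCH; j++)
            buf[j] = table[idx[i + j]];
        /* Phase 2: computation -- buf should now be all L1 hits. */
        for (size_t j = 0; j < BATCH; j++)
            sum += buf[j];
    }
    for (; i < n; i++)          /* tail elements */
        sum += table[idx[i]];
    return sum;
}
```

Picking N too large defeats the purpose (the buffer spills out of L1 before phase 2 reads it back); too small and you don't saturate the miss-handling resources.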

A more sophisticated approach would still try to overlap computation and loads, at max MLP. You could do this by inserting additional prefetches or actual loads into the computation stream, enough to get the required MLP. This is actually kind of complicated and more dependent on the actual hardware parameters, which is why I mentioned the other way first.
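A minimal sketch of the prefetch-interleaving variant, under the same hypothetical gather-sum setup as above (the function name and distance are mine, not from the article). It uses `__builtin_prefetch`, a GCC/Clang builtin, to keep misses in flight ahead of the element currently being consumed:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: overlap computation with misses by prefetching
   DIST iterations ahead of the element being consumed. DIST must be
   tuned to the hardware: too small and the line hasn't arrived yet,
   too large and it may be evicted before use. */
enum { DIST = 16 };             /* prefetch distance: hardware dependent */

uint64_t gather_sum_prefetched(const uint64_t *table,
                               const uint32_t *idx, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            /* args: address, rw=0 (read), locality=0 (use once) */
            __builtin_prefetch(&table[idx[i + DIST]], 0, 0);
        sum += table[idx[i]];   /* ideally now an L1/L2 hit */
    }
    return sum;
}
```

Note the index array itself must be cheap to read ahead (here it's sequential, so the hardware prefetcher handles it); with true pointer chasing this trick doesn't apply, since the next address isn't known until the current load completes.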



