
What did you think of their explanation of "memory parallelism" in section 4.3? I read it quickly, but thought it was pretty accurate. I liked that they made explicit the link between the number of instructions and the degree of parallelism. I was a little disappointed by the final statement that they "suspect" that the CPU is capable of only 8 parallel requests.


The analysis looks exactly right to me (they don't explain how they calculated MLP, but the numbers look about as expected given x86 ROB and fill-buffer sizes).

One thing they don't mention is that the low MLP isn't inherent in the other algorithms! With the default/obvious implementation of a benchmark loop using each method you hit this issue, but in principle I think you could rewrite almost any algorithm that is "instruction count MLP limited" like this into another one that isn't. A simple sketch: do a batch of N loads up front, storing the results into a temporary buffer, then do the computation part.

The loading part will have maximum MLP (after all, it's pure loads), and then the computation part will get all L1 hits, unless you picked N wrong.

It loses a bit because the loads aren't as effectively overlapped with computation, but if the load time "dominates" as presented here, it would still work well.
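As a concrete illustration of the batching idea, here is a minimal sketch in C. The problem (summing a table through a random index array), the function names, and the batch size are all hypothetical choices, not from the linked article; the point is only the two-phase structure: a run of independent loads, then computation on the buffered results.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum table[idx[i]] for a large, random idx.
   The obvious loop issues one load and immediately consumes it, so
   instruction-window resources limit how many misses are in flight.
   Doing a batch of N independent loads first lets the CPU keep up
   to N misses outstanding at once. */
enum { BATCH = 16 };            /* N: tune to the machine's miss capacity */

uint64_t gather_sum_batched(const uint64_t *table,
                            const uint32_t *idx, size_t n)
{
    uint64_t buf[BATCH];
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + BATCH <= n; i += BATCH) {
        /* Phase 1: pure loads -- no load depends on a prior load,
           so all BATCH misses can be in flight together. */
        for (size_t j = 0; j < BATCH; j++)
            buf[j] = table[idx[i + j]];
        /* Phase 2: computation -- buf should now be all L1 hits. */
        for (size_t j = 0; j < BATCH; j++)
            sum += buf[j];
    }
    for (; i < n; i++)          /* tail elements */
        sum += table[idx[i]];
    return sum;
}
```

Picking N too large defeats the purpose (the buffer spills out of L1 before phase 2 reads it back); too small and you don't saturate the miss-handling resources.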

A more sophisticated approach would still try to overlap computation and loads, at max MLP. You could do this by inserting additional prefetches or actual loads into the computation stream, enough to get the required MLP. This is actually kind of complicated and more dependent on the actual hardware parameters, which is why I mentioned the other way first.
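A minimal sketch of the prefetch-interleaving variant, under the same hypothetical gather-sum setup as above (the function name and distance are mine, not from the article). It uses `__builtin_prefetch`, a GCC/Clang builtin, to keep misses in flight ahead of the element currently being consumed:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: overlap computation with misses by prefetching
   DIST iterations ahead of the element being consumed. DIST must be
   tuned to the hardware: too small and the line hasn't arrived yet,
   too large and it may be evicted before use. */
enum { DIST = 16 };             /* prefetch distance: hardware dependent */

uint64_t gather_sum_prefetched(const uint64_t *table,
                               const uint32_t *idx, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            /* args: address, rw=0 (read), locality=0 (use once) */
            __builtin_prefetch(&table[idx[i + DIST]], 0, 0);
        sum += table[idx[i]];   /* ideally now an L1/L2 hit */
    }
    return sum;
}
```

Note the index array itself must be cheap to read ahead (here it's sequential, so the hardware prefetcher handles it); with true pointer chasing this trick doesn't apply, since the next address isn't known until the current load completes.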



