When Phoronix tested POWER9 SMT4 a while back, single-thread performance seems d...

omikun · on Aug 18, 2020

Sounds like IBM is not wasting area and power on out of order scheduling to find independent instructions within one thread. If you're running a lot of threads anyway, you get more independent instructions to work with for free!

dragontamer · on Aug 18, 2020

When in SMT4 mode, various hardware resources are "partitioned off" in Power9.

The first and third threads use the "Left Superslice", while the second and fourth threads use the "Right Superslice". All four threads share a decoder (Bulldozer style).

Each of the 4 threads per core is given 1/4 of the branch predictor (EAT).

The register rename buffer is shared by 2 threads at a time (two threads use the "left superslice", the other two use the "right superslice"). In SMT1 mode, the single thread can use all 4 resources simultaneously.
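To make the partitioning above concrete, here's a tiny Python sketch of the thread-to-superslice mapping as I've described it. The function name and the "left"/"right" labels are just my own illustration, not anything from IBM's documentation, and I've only covered SMT1 and SMT4 since those are the modes discussed here.

```python
def superslices_for_thread(thread_index, smt_mode):
    """Return the superslices a hardware thread can use (hypothetical helper).

    thread_index is 0-based, so the "first and third" threads are 0 and 2.
    """
    if smt_mode == 1:
        if thread_index != 0:
            raise ValueError("SMT1 has exactly one thread (index 0)")
        # The single thread is not partitioned and can use both halves.
        return ("left", "right")
    if smt_mode == 4:
        if not 0 <= thread_index < 4:
            raise ValueError("SMT4 thread index must be 0..3")
        # First and third threads go left, second and fourth go right.
        return ("left",) if thread_index % 2 == 0 else ("right",)
    raise ValueError("only SMT1 and SMT4 are modelled in this sketch")


for t in range(4):
    print(t, superslices_for_thread(t, 4))
```

So in SMT4 each superslice (and its share of the rename resources) ends up with exactly two threads on it.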

Going by the Power9 user guide, at least in theory, a lot of the out-of-order stuff looks like it'd work as expected in 1-thread through 4-thread modes.

--------

Honestly, I think the weirdest thing about POWER9 is the 2-cycle minimum latency (even on simple instructions like ADD and XOR). With that kind of latency, I bet a number of inner loops need 2 threads loaded on the core just to stay fully fed.

That'd be my theory for why 2-threads seem to be needed before POWER9 cores feel like they're being utilized well.
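Here's a toy Python model of that theory. It is heavily simplified and the assumptions are mine: one issue slot per cycle, and every thread is a single chain of dependent instructions (each one needs the previous result), each with a 2-cycle latency. A real POWER9 core is superscalar and out-of-order, so this only illustrates the latency-hiding argument, not actual hardware behavior.

```python
def utilization(num_threads, latency=2, cycles=1000):
    """Fraction of cycles in which the single issue slot does useful work."""
    # ready_at[t] is the first cycle in which thread t's next
    # (dependent) instruction has its input available.
    ready_at = [0] * num_threads
    issued = 0
    for cycle in range(cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                # Issue from this thread; its next instruction must
                # wait `latency` cycles for the result.
                ready_at[t] = cycle + latency
                issued += 1
                break  # only one instruction issues per cycle
    return issued / cycles


print(utilization(1))  # one dependent chain: slot is idle every other cycle
print(utilization(2))  # a second thread fills the idle cycles
```

In this model one thread keeps the slot busy only half the time, while two threads keep it busy every cycle, which is the effect I'm guessing at.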

POWER10 will probably change some of these details, but I'd expect it to be largely the same as POWER9 (aside from being bigger, faster, and more efficient).