
When Phoronix tested POWER9 SMT4 a while back, single-thread performance seemed disappointing at first glance.

But the SMT4 implementation seems to make up for it. Adding a 2nd thread on a core caused very little slowdown, while the 3rd and 4th threads per core barely affected performance.

It seems like POWER9 benefits from running significantly more threads per core (at least compared to Xeon or AMD).

EDIT: It should be noted that IBM's 128-bit vector units are downright terrible compared to Intel's 512-bit or AMD's 256-bit vector units. SIMD compute is the weakest point of the Power9, and probably the Power10. Both will likely be weakest at multimedia workloads (or any other code that leans on SIMD units: raytracing, graphics, etc.).

Power9's best use case was highly-threaded 64-bit code without SIMD. Power10's SIMD units look like they're improving, but they're still grossly undersized compared to AMD or Intel SIMD units.
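
To put rough numbers on the width gap, here is a minimal sketch that only counts how many elements fit in one vector register at the widths mentioned above. The `lanes` helper and the labels are made up for illustration. It ignores how many vector units each core has and what clocks they run at, so it is not a throughput comparison.

```python
def lanes(vector_bits, element_bits):
    """Number of elements of a given size that fit in one vector register."""
    return vector_bits // element_bits

# Vector widths as quoted in the comment above (labels are illustrative).
WIDTHS = {
    "IBM 128-bit": 128,
    "AMD 256-bit": 256,
    "Intel 512-bit": 512,
}

for name, bits in WIDTHS.items():
    print(f"{name}: {lanes(bits, 32)} x 32-bit lanes, {lanes(bits, 64)} x 64-bit lanes")
```

So per instruction, a 512-bit unit touches 4x as many elements as a 128-bit one, before counting units per core.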



Sounds like IBM is not wasting area and power on out of order scheduling to find independent instructions within one thread. If you're running a lot of threads anyway, you get more independent instructions to work with for free!


When in SMT4 mode, various hardware resources are "partitioned off" in Power9.

The first and third threads use the "Left Superslice", while the second and fourth threads use the "Right Superslice". All four threads share a decoder (Bulldozer style).

1/4th of the branch predictor (EAT) is given to each of the 4x threads per core.

The register rename buffer is shared 2 threads at a time (two threads use the "left superslice", the other two use the "right superslice"). In SMT1 mode, the single thread can use all 4 resources simultaneously.
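
The thread-to-superslice split described above can be written down as a tiny lookup. This is only a sketch of my reading of the description, not anything taken from the user guide: the `superslices_for` name, the 0-based thread indices, and treating SMT1 as "gets both superslices" are all assumptions.

```python
def superslices_for(thread_index, smt_mode=4):
    """Which superslice(s) a hardware thread uses, per the description above.

    thread_index is 0-based (an assumption): 0 = first thread, 1 = second, ...
    Only SMT1 and SMT4 are modeled; other modes are not described here.
    """
    if smt_mode == 1:
        if thread_index != 0:
            raise ValueError("SMT1 has a single thread")
        return ("left", "right")  # assumed: the lone thread uses everything
    if smt_mode == 4:
        if not 0 <= thread_index < 4:
            raise ValueError("SMT4 has threads 0..3")
        # First and third threads -> left, second and fourth -> right.
        return ("left",) if thread_index % 2 == 0 else ("right",)
    raise ValueError("only SMT1 and SMT4 are modeled in this sketch")

for t in range(4):
    print(f"SMT4 thread {t + 1}: {superslices_for(t)}")
```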

A lot of the out-of-order stuff looks like it'd work as expected in 1-thread to 4-thread modes. At least, looking through the Power9 user guide / in theory.

--------

Honestly, I think the weirdest thing about POWER9 is the 2-cycle minimum latency (even on simple instructions like ADD and XOR). With that kind of latency, I bet that a number of inner loops need 2 threads loaded on the core just to keep it fully fed.

That'd be my theory for why 2 threads seem to be needed before POWER9 cores feel like they're being utilized well.
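
A toy model of that theory, under heavy assumptions: one issue slot per cycle, and every thread is a single chain of dependent ops, so an op can't issue until the previous op from the same thread has finished. Real cores are superscalar and real code has independent instructions within a thread, so this is the worst case, not a POWER9 simulation. The `ops_per_cycle` function is made up for illustration.

```python
def ops_per_cycle(threads, latency=2, cycles=1000):
    """Average ops issued per cycle for `threads` fully dependent chains.

    Toy model: one issue slot per cycle; each thread's next op becomes
    ready `latency` cycles after its previous op issued.
    """
    ready_at = [0] * threads  # cycle at which each thread's next op can issue
    issued = 0
    for cycle in range(cycles):
        for t in range(threads):
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency
                issued += 1
                break  # the single issue slot is used up for this cycle
    return issued / cycles

for n in (1, 2, 4):
    print(f"{n} thread(s): {ops_per_cycle(n)} ops/cycle")
```

In this model a lone dependent chain leaves the issue slot idle every other cycle, and a second thread fills the gap. Threads 3 and 4 add nothing here because the single slot is already busy.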

POWER10 will probably change some of these details, but I'd expect it to be largely the same as POWER9 (aside from being bigger, faster, more efficient).



