- They leapfrogged everyone else with PCIe v5 and DDR5
- 1 TB/s memory bandwidth, which is comparable to high-end NVIDIA GPUs, but for CPUs
- Socket-to-socket interconnect is 1 TB/s also.
- 120 GB/s/core L3 cache read rate sustained.
- Floating point rate comparable to GPUs
- 8-way SMT makes this into a hybrid between a CPU and a GPU in terms of the latency hiding and memory management, but programmable exactly like a full CPU, without the limitations of a GPU.
- Memory disaggregation similar to how most modern enterprise architectures separate disk from compute. You can have memory-less compute nodes talking to a central memory node!
- 16-socket glueless servers
- Has instructions for accelerating gzip.
I appreciate the slides, but you've got some critical errors.
> - Floating point rate comparable to GPUs
Nowhere close. POWER10 caps out at 60x SMT4 cores, and only has 2x 128-bit SIMD units per SMT4 core. That's 480 FLOPs per clock cycle. At 4 GHz, that's only 1.9 TFlops of single-precision compute.
An NVidia 2070 Super ($400 consumer GPU) hits 8.2 TFlops with 448 GB/s bandwidth.
> - 8-way SMT makes this into a hybrid between a CPU and a GPU in terms of the latency hiding and memory management, but programmable exactly like a full CPU, without the limitations of a GPU.
Note that 8-way SMT is the "big core", and I'm almost convinced that SMT8 is more about licensing than actual scaling. By supporting 8 threads per core (and doubling the size of the core), you get a 30-core system that supports 240 threads. That's a licensing hack, since most enterprise software is priced per core.
I'd expect SMT4 to be more popular with consumers (i.e. Talos II-sized systems), similar to how it was with POWER9.
> Note that 8-way SMT is the "big core", and I'm almost convinced that SMT8 is more about licensing than actual scaling. By supporting 8 threads per core (and doubling the size of the core), you get a 30-core system that supports 240 threads. That's a licensing hack, since most enterprise software is priced per core.
I don't think that's entirely fair: Massively multi-user transactional applications (read: databases) are right in the wheelhouse of POWER, and they're exactly the kind of applications that benefit most from SMT. Lots of opportunities for latency hiding as you're chasing pointers from indexes to data blocks.
While that's a fair point, databases are also among the most costly per-core-licensed software. So an SMT8 system will have half the per-core licensing costs of an SMT4 system, because it has half the "cores", even though an SMT8 core is essentially two SMT4 cores grafted together.
Oracle's core factor is 1.0 for any POWER9 core, which means that SMT8 cores (twice as big as SMT4 cores) get you more performance for lower licensing costs.
So what would you rather buy? A 24-core SMT4 Nimbus chip, or a 12-core SMT8 Cumulus chip?
The two chips have the same number of execution units. They both support 96 threads. But the 12-core SMT8 Cumulus chip will have half the license costs from Oracle.
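To make that concrete, here's the arithmetic as a tiny Python sketch. The per-license price is a made-up placeholder and the 1.0 core factor is the figure claimed above, so treat it as illustrative only:

```python
# Illustrative only: core factor per the comment above; the per-license
# price below is a hypothetical placeholder, not an actual quote.
core_factor = 1.0
price_per_processor_license = 47_500  # hypothetical USD list price

def license_cost(cores):
    return cores * core_factor * price_per_processor_license

print(license_cost(24))  # 24-core SMT4 Nimbus  -> 1,140,000
print(license_cost(12))  # 12-core SMT8 Cumulus ->   570,000
# Same execution units, same 96 threads, half the license bill.
```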
-------
For DB2...
The SMT8 model (E950) only has a 100x multiplier for DB2, while SMT4 models have a 70x multiplier. So you're only paying about 43% more per core on the E950, despite getting twice the execution resources.
Even the top end SMT8 model (E980) has a 120x multiplier. So you're still saving big bucks on licensing.
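Same idea as a rough sketch, using only the multipliers quoted in this thread:

```python
# Per-core DB2 multipliers as quoted above: 70 for SMT4 models, 100 for the E950.
pvu_per_core_smt4 = 70
pvu_per_core_e950 = 100

extra_cost_per_core = pvu_per_core_e950 / pvu_per_core_smt4 - 1
print(f"{extra_cost_per_core:.0%} more PVUs per core")                      # ~43%
print(f"{2 / (1 + extra_cost_per_core):.2f}x execution resources per PVU")  # ~1.40x
# Roughly 2x the execution resources for ~1.43x the per-core cost.
```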
The 12-core POWER10 will have 120MB of L3 cache and 1000GBps main-memory bandwidth with 96-threads.
That's unified L3 cache by the way, none of the "16MB/CCX with remote off-chip L3 caches being slower than DDR4 reads" that EPYC has.
Intel Xeon Platinum 8180 only has 38.5MB L3 across 28 cores / 56 threads. With 6 memory channels of DDR4-2666 (~21.3 GBps per channel), that's ~128 GBps of bandwidth.
AMD EPYC really only has 16MB L3 per CCX (because the other L3 cache is "remote" and slower than DDR4). With 8 memory channels of DDR4-2666, we're at ~170 GBps of bandwidth across 64 threads.
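For anyone wondering where the per-channel figure comes from, it's just transfer rate times bus width (a rough sketch, ignoring efficiency losses):

```python
# DDR4 moves 8 bytes per transfer, so bandwidth = MT/s * 8 B per channel.
per_channel_gbps = 2666e6 * 8 / 1e9   # ~21.3 GB/s for DDR4-2666

print(per_channel_gbps * 6)   # Xeon 8180: 6 channels -> ~128 GB/s
print(per_channel_gbps * 8)   # EPYC:      8 channels -> ~171 GB/s
# Compare with the ~1000 GB/s quoted for the 12-core POWER10 part above.
```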
If we're talking about "best CPU for an in-memory database", it's pretty clear that POWER9 / POWER10 is the winner. You get the fewest cores (the license-cost hack) with the highest L3 and RAM bandwidth, and the most threads supported.
--------
On the other hand, x86 has far superior single-threaded performance, far superior SIMD units, and is generally cheaper. For compute-heavy situations (raytracing, H265 encoding, etc. etc.) the x86 is superior.
But as far as being a thin processor supporting as much memory bandwidth as possible (with the lowest per-core licensing costs), POWER9 / POWER10 clearly wins.
And again: those SMT8 cores are no slouch. They can handle 4-threads with very little slowdown, and the full 8-threads only has a modest slowdown. They're designed to execute many threads, instead of speeding up a single thread (which happens to be really good for databases anyway, where your CPU will spend large amounts of time waiting on RAM to respond instead of computing).
> Nowhere close. POWER10 caps out at 60x SMT4 cores, and only has 2x 128-bit SIMD units per SMT4 core. That's 480 FLOPs per clock cycle. At 4 GHz, that's only 1.9 TFlops of single-precision compute.
Too late for me to edit this, but POWER10 has a full 128-bit vector unit per slice now. So that's one more x2, for 3.8 TFlops single-precision on 30x SMT8 or 60x SMT4.
So I was off by a factor of 2x in my earlier calculation. POWER10 has a dedicated matrix-multiplication unit, but I consider full matrix multiplication to be highly specialized (comparable to a TPU or a Tensor core), so it's not really something to compare flop-for-flop except against other matrix-multiplication units.
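Putting the corrected back-of-envelope numbers in one place (this only mirrors the assumptions in this thread -- 60 SMT4 cores at 4 GHz and 128-bit vectors -- not official IBM figures):

```python
cores = 60                        # 60x SMT4 (equivalently 30x SMT8), per the thread
clock_hz = 4.0e9                  # assumed 4 GHz
fp32_lanes = 128 // 32            # 128-bit vector -> 4 single-precision lanes
units_per_core = 4                # per the correction above (2x the original guess)

flops_per_cycle = cores * units_per_core * fp32_lanes
print(flops_per_cycle)                    # 960 FLOPs/cycle
print(flops_per_cycle * clock_hz / 1e12)  # ~3.8 TFLOPS single precision

# The original estimate used units_per_core = 2 -> ~1.9 TFLOPS.
# An RTX 2070 Super is ~8-9 TFLOPS FP32 for $400.
```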
The Talos computers are only compatible with SMT4 chips (aka: Sforza chips). You can't buy an SMT8 chip for Talos.
SMT8 chips are also known as "Scale up" chips. The Summit supercomputer was made with SMT4 chips by the way.
I've never used an SMT8, but from the documents... it's really similar to two SMT4 cores working together (at least on Power9). SMT8 cores are a completely different core from SMT4 cores, with double the execution resources, double the decoder width, double everything. SMT8 is a really, really fat core.
The scale-up chips (used in the E950 and E980 only) aren't the only SMT8 chips. The scale-out chips (Sforza, Monza, LaGrange) can be fused as SMT4 or SMT8, but IBM doesn't appear to sell SMT8-fused chips to other parties.
If a given IBM server runs PowerVM, it's SMT8. You may find this table of mine helpful (assembled from various sources and partially inferred, so accuracy not guaranteed, but it represents my understanding): https://www.devever.net/~hl/f/SERVERS
Where do you get the FP performance, exactly, and for what value of "FP"? It's unclear to me in the slides from El Reg, which appear to be about the MMA specifically, and it's not clear what the SIMD units actually are. (I don't know if that's specified by the ISA.)
One thing is that it presumably has a better chance of keeping the SIMD units fed than some competing designs.
I recently did a bunch of tests to see what the "ultimate bottlenecks" are for basic web applications. Think latency to the database and AES256 throughput.
Some rough numbers:
- Local latency to SQL Server from ASP.NET is about 150 μs, or about 6000 synchronous queries per second, max.
- Even with SR-IOV and Mellanox adapters, that rises to 250 μs if a physical network hop is involved.
- Typical networks have a latency floor around 500-600 μs, and it's not uncommon to see 1.3 ms VM-to-VM. Now we're down to 800 queries per second! (There's a quick latency-to-throughput conversion sketched after this list.)
- Similarly, older CPUs struggle to exceed 250 MB/s/core (2 Gbps) for AES256, which is the fundamental limit to HTTPS throughput for a single client.
- Newer CPUs, e.g. AMD EPYC or any recent Intel Xeon, can do about 1 GB/s/core, but I haven't seen any CPUs that significantly exceed that. That's not even 10 Gbps. If you have a high-spec cloud VM with 40 or 50 Gbps NICs, there is no way a single HTTPS stream can saturate that link. You have to parallelise somehow to get the full throughput (or drop encryption).
- HTTPS accelerators such as F5 BIG IP or Citrix ADC (NetScaler) are actually HTTPS decelerators for individual users, because even hardware models with SSL offload cards can't keep up with the 1 GB/s from a modern CPU. Their SSL cards are designed for improving the aggregate bandwidth of hundreds of simultaneous streams, and don't do well at all for a single stream, or even a couple of concurrent streams. This matters when "end-to-end encryption" is mandated, because back end connections are often pooled. So you end up with N users being muxed onto just one back-end connection which then becomes the bottleneck.
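The latency-to-throughput conversion mentioned above is just 1/latency, assuming one outstanding query at a time:

```python
# Max queries/second when each query must complete before the next is issued.
for label, latency_s in [("local (150 us)",           150e-6),
                         ("SR-IOV hop (250 us)",       250e-6),
                         ("typical VM-to-VM (1.3 ms)", 1.3e-3)]:
    print(f"{label:26s} -> {1 / latency_s:6.0f} queries/s max")
# ~6,700 / ~4,000 / ~770 per second -- matching the figures above.
```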
RTT between VMs in the same GCP zone is often well below 100 μs. Are you also measuring the latency of the query itself? I realize not all networks are built the same, but it seems like your benchmark case is much worse than what's possible, even without buying specialized hardware.
The test was to run "SELECT 1" using the low-level ADO.NET database query API in a tight loop. This is the relevant metric, as it represents the performance ceiling. It doesn't matter how fast the packets can get on the wire if the application can't utilise this because of some other bottleneck.
Of course, the underlying TCP latency is significantly lower. Using Microsoft's "Latte.exe" testing tool, I saw ~50 μs in Azure with "Accelerated Networking" enabled. As far as I know, they use Mellanox adapters.
Something I found curious is that no matter what I did, the local latency wouldn't go below about 125 μs. Neither shared memory nor named pipes had any benefit. This is on a 4 GHz computer, so in practice this is the "ultimate latency limit" for SQL Server, unless Intel and AMD start up the megahertz war again...
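For reference, a rough analog of that tight-loop test. The original used ADO.NET against SQL Server; this sketch uses Python with pyodbc instead, and the driver name and connection string are placeholders (the Python interpreter also adds some overhead of its own):

```python
import time
import pyodbc  # pip install pyodbc

# Placeholder connection details -- adjust for your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=master;Trusted_Connection=yes;TrustServerCertificate=yes;"
)
cur = conn.cursor()

N = 10_000
t0 = time.perf_counter()
for _ in range(N):
    cur.execute("SELECT 1").fetchone()  # one synchronous round trip per iteration
elapsed = time.perf_counter() - t0

print(f"{elapsed / N * 1e6:.0f} us per query, {N / elapsed:.0f} queries/s")
```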
It would be an interesting exercise comparing the various database engines to see what their latency overheads are, and what their response time is to trivial queries such as selecting a single row given a key.
Unfortunately, due to the DeWitt clauses in EULAs, this would be risky to publish...
It should be noted that AES-CBC encryption doesn't have any instruction-level parallelism available.
Both EPYC and Xeons have 2 AES units per core now, but CBC can only effectively use one of them at a time (block(n+1) cannot be computed until block(n) is done, because block(n) is used as input to block(n+1) in CBC mode).
AES-256-GCM can compute block(n) and block(n+1) simultaneously. So you need to use such a parallel algorithm if you actually want to use the 2x AES pipelines on EPYC or Xeon.
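You can see the effect from userspace. A minimal sketch with Python's `cryptography` package (which uses OpenSSL and AES-NI underneath); exact numbers vary by CPU, but a parallelizable mode like CTR/GCM should come out several times faster than CBC on anything recent:

```python
import os, time
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(32)                 # AES-256
iv = os.urandom(16)
data = os.urandom(64 * 1024 * 1024)  # 64 MiB buffer

def throughput(mode):
    enc = Cipher(algorithms.AES(key), mode).encryptor()
    t0 = time.perf_counter()
    enc.update(data)
    enc.finalize()
    return len(data) / (time.perf_counter() - t0) / 1e9  # GB/s

print(f"AES-256-CBC: {throughput(modes.CBC(iv)):.2f} GB/s")  # serial block chain
print(f"AES-256-CTR: {throughput(modes.CTR(iv)):.2f} GB/s")  # parallelizable
```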
More like ~3.9 GB/s for a desktop CPU, and somewhere north of 7 GB/s for AES-256-CTR. The performance difference between CBC and CTR is pretty much exactly what you'd expect: you can only use one of the two units, and you have to wait out the full instruction latency (a 1:4 disadvantage on recent cores) => ~8x slower.
Not for any single user or connection. High capacity doesn't equal low latency, and scalability to many users doesn't necessarily help one user get better performance!
Very often, you'll see n-tier applications where, for some reason (typically load balancers), the requests are muxed into a single TCP stream. In the past, this improved efficiency by eliminating the per-connection overhead.
Now, with high core counts and high bandwidths, some parallelism is absolutely required to even begin to approach the performance ceiling. If the application is naturally single-threaded, such as some ETL jobs, these single-stream bottlenecks are very difficult to overcome.
In the field, I very often see 8-core VMs with a "suspicious" utilisation graph flatlining at 12.5% because of issues like this. It boils my blood when people say that this is perfectly fine, because clearly the server has "adequate capacity". In reality, there's a problem: the server is at 100% of its effective capacity, and the other 7 cores are just heating the data centre air.
You and pretty much every modern data centre application, irrespective of the technology stack. The typical 1.3ms latency I see kills performance across the board, but very few people are aware of the magnitude of the issue.
Yeah, I had the same thought. Also, gzip has been around for decades, and newer lossless compression algorithms such as LZ4 and Zstandard are way better in terms of speed (on the same hardware) and, in Zstandard's case, compression ratio too.
It just doesn't make sense for the HW engineers and chip designers to optimize backwards for an older software algorithm instead of optimizing forward for the newer ones.
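A quick way to see the gap, assuming the `zstandard` package is installed (the sample file path is just a placeholder; any reasonably large file works):

```python
import gzip, time
import zstandard as zstd  # pip install zstandard

data = open("/usr/share/dict/words", "rb").read()  # placeholder sample input

t0 = time.perf_counter()
g = gzip.compress(data, compresslevel=6)
t1 = time.perf_counter()
z = zstd.ZstdCompressor(level=3).compress(data)
t2 = time.perf_counter()

print(f"gzip level 6: ratio {len(g) / len(data):.3f} in {t1 - t0:.3f}s")
print(f"zstd level 3: ratio {len(z) / len(data):.3f} in {t2 - t1:.3f}s")
```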
How specific are those "gzip instructions", though? Are there newer algorithms that could benefit from some of the same acceleration primitives, even if they do higher-level things differently?
They're not instructions; it's a dedicated hardware unit. Most of the area appears to be devoted to finding matches so it should be possible to add other LZ-style algorithms without much additional area.
How is the single-thread performance? Intel Xeon was always faster (thanks to competition from AMD Opteron...), but maybe that's changed with Intel's recent problems...
When Phoronix tested POWER9 SMT4 a while back, single-thread performance seemed disappointing at first glance.
But it seems to be made up for by their implementation of SMT4. The 2nd thread on a core didn't cause much slowdown at all, while threads 3 and 4 barely affected performance.
It seems like POWER9, at least, benefits from running significantly more threads per core (compared to Xeon or AMD).
EDIT: It should be noted that IBM's 128-bit vector units are downright terrible compared to Intel's 512-bit or AMD's 256-bit vector units. SIMD compute is the weakest point of the Power9, and probably the Power10. They'll be the worst at multimedia performance (or other code using SIMD units: Raytracing, graphics, etc. etc.).
Power9's best use case was highly-threaded 64-bit code without SIMD. Power10 looks like the SIMD units are improving, but they're still grossly undersized compared to AMD or Intel SIMD units.
Sounds like IBM is not wasting area and power on out of order scheduling to find independent instructions within one thread. If you're running a lot of threads anyway, you get more independent instructions to work with for free!
When in SMT4 mode, various hardware resources are "partitioned off" in Power9.
The first and third threads use the "Left Superslice", while the second and fourth threads use the "Right Superslice". All four threads share a decoder (Bulldozer style).
1/4th of the branch predictor (EAT) is given to each of the 4x threads per core.
The register rename buffer is shared two threads at a time (two threads use the "left superslice", the other two use the "right superslice"). In SMT1 mode, the single thread can use all 4 resources simultaneously.
A lot of the out-of-order stuff looks like it'd work as expected in 1-thread to 4-thread modes. At least, looking through the Power9 user guide / in theory.
--------
Honestly, I think the weirdest thing about POWER9 is the 2-cycle minimum latency (even on simple instructions like ADD and XOR). With that kind of latency, I bet a lot of inner loops need 2 threads loaded on the core just to keep it fully fed.
That'd be my theory for why 2-threads seem to be needed before POWER9 cores feel like they're being utilized well.
Obviously, POWER10 probably will change some of these details. But I'd expect POWER10 to largely be the same as POWER9 (aside from being bigger, faster, more efficient).
I am excited. But also sad that most of us won't ever get to play with it. Unlike Intel's Xeon and AMD's EPYC, getting hold of POWER doesn't seem like an easy task unless the corporation you work for has very specific needs.
Edit: Turns out there is a major section on it below.
“GAFAM
A more inclusive grouping referred to as GAFAM or "Big Five", defines Google, Amazon, Facebook, Apple, and Microsoft as the tech giants.[18][19][20][21] Besides Saudi Aramco, the GAFAM companies are the five most valuable public corporations in the world as measured by market capitalization.[3] Nikos Smyrnaios justified the GAFAM grouping as an oligopoly that appears to take control of the Internet by concentrating market power, financial power and using patent rights and copyright within a context of capitalism.[22]“
Those wouldn't be IBM's top competitors in semiconductors would they? I don't think Facebook or Amazon or Microsoft is spending more than 1% of their R&D budget on semiconductors.
CPU-SIMD is less about competing against GPUs and more about latency.
GPUs will always have more GFlops and memory bandwidth at a lower cost. They're purpose-built GFlop and memory-bandwidth machines. Case in point: the NVidia 2070 Super is 8 TFlops of compute at $400, a tiny fraction of what this POWER10 will cost.
If POWER10 costs anything like POWER9, we're looking at well over $2000 for the bigger chips and $1000 for reasonable multisocket motherboards. And holy moly: 602mm^2 at 7nm is going to be EXPENSIVE. EDIT: I'm only calculating ~2 TFlops from the hypothetical 60x SMT4 POWER10 at 4GHz. That's nowhere close to GPU-level Flops.
However, the CPU-GPU link is slow in comparison to the CPU-L1 cache (or even CPU-DDR4 / DDR5). A CPU can "win the race" by using its onboard SIMD, completing your task before it has even spent the ~5 microseconds needed to communicate with the GPU.
----------
With that being said: POWER10 also implements PCIe 5.0, which means it will be one of the fastest processors for communicating with future GPUs.
It's more appropriate to compare pricing of Tesla with a datacenter-grade CPU like POWER10 (or Epyc/Xeon/etc.).
A64FX (in Fugaku, the current #1 machine on all popular supercomputing benchmarks) has shown that CPUs can compete with top-shelf GPUs on bandwidth and floating point energy efficiency.
Fugaku has 158,976 nodes x2 chips each, or 317,952 A64FX chips.
Summit has 4,608 nodes x 6 GPUs each, or 27,648 V100 GPUs. It also was built back in 2018.
---------
While Fugaku is certainly an interesting design, it seems inevitable that a modern GPU (say, an A100 Ampere) would crush it in FLOPs. Really, Fugaku's most interesting point is its high HPCG score, showing that its interconnect is hugely efficient.
Per-node, Fugaku is weaker. They built an amazing interconnect to compensate for that weakness. Fugaku is also an HBM-based computer, meaning you cannot easily add or remove RAM (unlike a CPU + GPU system, which can be configured with more or less RAM by adding or removing sticks).
These are the little differences that matter in practice. A64FX is certainly an accomplishment, but I wouldn't go so far as to say it's proven that CPUs can keep up with GPUs in terms of raw FLOPs.
A100 has a 20% edge on energy efficiency for HPL, along with higher intrinsic latencies. It's also 6-12 months behind A64FX in deployment. https://www.top500.org/lists/green500/2020/06/
HPCG mostly tests memory bandwidth rather than interconnect, but Fugaku does have a great network.
Adding DRAM to a GPU-heavy machine has limited benefit due to the relatively low bandwidth to the device. They're effectively both HBM machines if you need the ~TB bandwidth per device (or per socket).
Normalizing per node (versus per energy or cost) isn't particularly useful unless your software doesn't work well with distributed memory.
> Adding DRAM to a GPU-heavy machine has limited benefit due to the relatively low bandwidth to the device. They're effectively both HBM machines if you need the ~TB bandwidth per device (or per socket).
This POWER10 chip under discussion has 1 TB/s of bandwidth to devices, with expandable RAM.
Yeah, I didn't think it was possible. But... congrats to IBM for getting this done. Within the context of this hypothetical POWER10, 1 TB/s interconnects to expandable RAM are on the table.
It's 410 GB/s peak for DDR5. The "up to 800 GB/s sustained" is for GDDR6, and POWER10 isn't slated to ship until Q4 2021, so it isn't really a direct comparison with hardware that was deployed in 2019.
Your point about latency is the reason why we can't use GPUs for realtime audio processing, even though they would be absolutely otherwise well-suited. Stuff like computing FFTs and convolution could be done much faster on GPUs, but the latency would be prohibitive in realtime audio.
GPU latency is ~5 microseconds per kernel, plus the time it takes to transfer data into and out of the GPU. (~15GBps on PCIe 3.0 x16 lanes). Given that audio is probably less than 1MBps, PCIe bandwidth won't be an issue at all.
Any audio system based on USB controls would be on the order of 1000 microseconds of latency (1,000,000 microseconds per second / 1000 USB updates per second == 1000 microseconds per update). Let's assume that we have a hard realtime cutoff of 1000 microseconds.
While GPU latency is an issue for tiny compute tasks, I don't think it's actually big enough to make a huge difference for audio applications (which usually use standard USB controllers at 1 ms specified latency).
I mean, you only have room for 200 CPU-GPU transfers (at 5 microseconds per CPU-GPU message), but if the entire audio calculation were completed inside the GPU, you wouldn't need more than one message there and one message back.
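Spelling out that budget arithmetic (the 5 μs launch overhead and 1 ms deadline are the figures assumed above; 48 kHz stereo float32 is an assumed audio format):

```python
budget_us = 1000        # ~1 ms hard realtime cutoff assumed above
round_trip_us = 5       # ~5 us per CPU->GPU message
print(budget_us // round_trip_us)      # ~200 round trips fit in the budget

# Data volume is tiny relative to PCIe bandwidth:
audio_mb_per_s = 48_000 * 2 * 4 / 1e6  # 48 kHz * 2 channels * 4 bytes
print(audio_mb_per_s, "MB/s vs ~15 GB/s for PCIe 3.0 x16")
```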
Depends on the application. Your brain can notice the delay when playing something like digital drums after ~10-15 ms. Something with less 'attack' like guitar has a bit more wiggle room, and ambient synths even more so.
edit: also, vocals are most latency-sensitive if you are the one singing and hearing it played back at the same time.
> They're specifically built GFlop and memory-bandwidth machines
That would indicate their theoretical peak performance is higher. Unless you can line up your data so that the GPU will be processing it all the time in all its compute units without any memory latency, you won't get the theoretical peak. In those cases, it's perfectly possible a beefy CPU will be able to out-supercompute a GPU-based machine. It's just that some problems are more amenable to some architectures.
For games and graphics? I don't think so, a GPU has a ton of dedicated hardware that is very costly to simulate in software: triangle setup, rasterizers, tessellation units, texture mapping units, ROPs... and now even raytracing units.
In a sense yes, and this has already happened. It's just that GPUs are under lock and key just like early processors were. There are interesting development leaks you can find, like NVIDIA cards supporting USB connectors on some models, implying you can use just the GPU as a full computer.