- They leapfrogged everyone else with PCIe v5 and DDR5
- 1 TB/s memory bandwidth, which is comparable to high-end NVIDIA GPUs, but for CPUs
- Socket-to-socket interconnect is 1 TB/s also.
- 120 GB/s/core L3 cache read rate sustained.
- Floating point rate comparable to GPUs
- 8-way SMT makes this into a hybrid between a CPU and a GPU in terms of the latency hiding and memory management, but programmable exactly like a full CPU, without the limitations of a GPU.
- Memory disaggregation similar to how most modern enterprise architectures separate disk from compute. You can have memory-less compute nodes talking to a central memory node!
I appreciate the slides, but you've got some critical errors.
> - Floating point rate comparable to GPUs
No where close. POWER10 caps out at 60x SMT4, and only has 2x 128-bit SIMD units per SMT4. That's 480 FLOPs per clock cycle. At 4GHz, that's only 1.9 TFlops single-precision compute.
An NVidia 2070 Super ($400 consumer GPU) hits 8.2 TFlops with 448 GB/s bandwidth.
> - 8-way SMT makes this into a hybrid between a CPU and a GPU in terms of the latency hiding and memory management, but programmable exactly like a full CPU, without the limitations of a GPU.
Note that 8-way SMT is the "big core", and I'm almost convinced that 8-SMT is more about licensing than actually scaling. By supporting 8 threads per core (and doubling the size of the core), you get a 30-core system that supports 240-threads. That's a licensing hack: since most enterprise software is paid per-core.
I'd expect that 4-SMT will be more popular with consumers (ie: Talos II sized systems). Similar to how Power9 was
Note that 8-way SMT is the "big core", and I'm almost convinced that 8-SMT is more about licensing than actually scaling. By supporting 8 threads per core (and doubling the size of the core), you get a 30-core system that supports 240-threads. That's a licensing hack: since most enterprise software is paid per-core.
I don't think that's entirely fair: Massively multi-user transactional applications (read: databases) are right in the wheelhouse of POWER, and they're exactly the kind of applications that benefit most from SMT. Lots of opportunities for latency hiding as you're chasing pointers from indexes to data blocks.
While that's a fair point, databases are also among the most costly software that is paid per-core. So the SMT8 system will have half the per-core costs as an SMT4 system (because an SMT8 system will have half the "cores" of any SMT4 system. Even though an SMT8 core is just two SMT4 cores grafted together).
Oracle charges 1.0 for any POWER9 core. Which means that SMT8 cores (which are twice as big as SMT4 cores) get you more performance with less licensing costs.
So what would you rather buy? A 24-core SMT4 Nimbus chip, or an 12-core SMT8 Cumulus chip?
The two chips have the same number of execution units. They both support 96 threads. But the 12-core SMT8 Cumulus chip will have half the license costs from Oracle.
-------
For DB2...
The SMT8 model (E950) only has a 100x multiplier for DB2, while SMT4 models have a 70x multiplier. So you're only paying 30% more per core (E950), despite getting twice the execution resources.
Even the top end SMT8 model (E980) has a 120x multiplier. So you're still saving big bucks on licensing.
The 12-core POWER10 will have 120MB of L3 cache and 1000GBps main-memory bandwidth with 96-threads.
That's unified L3 cache by the way, none of the "16MB/CCX with remote off-chip L3 caches being slower than DDR4 reads" that EPYC has.
Intel Xeon Platinum 8180 only has 38.5MB L3 across 28 cores / 56-threads. With 6-memory controllers at 2666 MHz (21GBps per stick), that's 126 GBps bandwidth.
AMD EPYC really only has 16MB L3 per CCX (because the other L3 cache is "remote" and slower than DDR4). With 8 memory controllers at 2666 MHz, we're at 170 GBps bandwidth on 64-threads.
If we're talking about "best CPU for in-memory database", its pretty clear that POWER9 / POWER10 is the winner. You get the fewest cores (license-cost hack) with the highest L3 and RAM bandwidths, with the most threads supported.
--------
On the other hand, x86 has far superior single-threaded performance, far superior SIMD units, and is generally cheaper. For compute-heavy situations (raytracing, H265 encoding, etc. etc.) the x86 is superior.
But as far as being a thin processor supporting as much memory bandwidth as possible (with the lowest per-core licensing costs), POWER9 / POWER10 clearly wins.
And again: those SMT8 cores are no slouch. They can handle 4-threads with very little slowdown, and the full 8-threads only has a modest slowdown. They're designed to execute many threads, instead of speeding up a single thread (which happens to be really good for databases anyway, where your CPU will spend large amounts of time waiting on RAM to respond instead of computing).
> No where close. POWER10 caps out at 60x SMT4, and only has 2x 128-bit SIMD units per SMT4. That's 480 FLOPs per clock cycle. At 4GHz, that's only 1.9 TFlops single-precision compute.
Too late for me to edit this: but POWER10 has a full 128-bit vector unit per slice now. So one more x2, for 3.8 TFlops single-precision on 30x SMT8 or 60x SMT4.
So I was off by a factor of 2x in my earlier calculation. Power10 has a dedicated matrix-multiplication unit, but I consider a full matrix-multiplication to be highly specialized (comparable to a TPU or Tensor-core), so its not really something to compare flop-per-flop except against other matrix-multiplication units.
The Talos computers are only compatible with SMT4 chips (aka: Sforza chips). You can't buy an SMT8 chip for Talos.
SMT8 chips are also known as "Scale up" chips. The Summit supercomputer was made with SMT4 chips by the way.
I never used an SMT8, but from the documents... its really similar to two SMT4 cores working together (at least on Power9). SMT8 chips are a completely different core than SMT4 chips, with double the execution resources, double the decoder width, double everything. SMT8 is a really, really fat core.
The scale-up chips (used in the E950 and E980 only) aren't the only SMT8 chips. The scale-out chips (Sforza, Monza, LaGrange) can be fused as SMT4 or SMT8, but IBM doesn't appear to sell SMT8-fused chips for other parties.
If a given IBM server runs PowerVM, it's SMT8. You may find this table of mine helpful (assembled from various sources and partially inferred, so accuracy not guaranteed, but it represents my understanding): https://www.devever.net/~hl/f/SERVERS
Where do you get the FP performance, exactly, and for what value of "FP"? It's unclear to me in the slides from El Reg, which appear to be about the MMA specifically, and it's not clear what the SIMD units actually are. (I don't know if that's specified by the ISA.)
One thing is that it presumably has a better chance of keeping the SIMD fed than some.
I recently did a bunch of tests to see what the "ultimate bottlenecks" are for basic web applications. Think latency to the database and AES256 throughput.
Some rough numbers:
- Local latency to SQL Server from ASP.NET is about 150 μs, or about 6000 synchronous queries per second, max.
- Even with SR-IOV and Mellanox adapters, that rises to 250 μs if a physical network hop is involved.
- Typical networks have a latency floor around 500-600 μs, and it's not uncommon to see 1.3 ms VM-to-VM. Now we're down to 800 queries per second!
- Similarly, older CPUs struggle to exceed 250 MB/s/core (2 Gbps) for AES256, which is the fundamental limit to HTTPS throughput for a single client.
- Newer CPUs, e.g.: AMD EPYC or any recent Intel Xeon can do about 1 GB/s/core, but I haven't seen any CPUs that significantly exceed that. That's not even 10 Gbps. If you have a high-spec cloud VM with 40 or 50 Gbps NICs, there is no way a single HTTPS stream can saturate that link. You have to parallelise somehow to get the full throughout (or drop encryption.)
- HTTPS accelerators such as F5 BIG IP or Citrix ADC (NetScaler) are actually HTTPS decelerators for individual users, because even hardware models with SSL offload cards can't keep up with the 1 GB/s from a modern CPU. Their SSL cards are designed for improving the aggregate bandwidth of hundreds of simultaneous streams, and don't do well at all for a single stream, or even a couple of concurrent streams. This matters when "end-to-end encryption" is mandated, because back end connections are often pooled. So you end up with N users being muxed onto just one back-end connection which then becomes the bottleneck.
RTT between VMs in the same GCP zone is often well below 100us. Are you also measuring the latency of the query itself? I realize not all networks are built the same, but it seems like your benchmark case is much worse than what's possible, even without buying specialized hardware.
The test was to run "SELECT 1" using the low-level ADO.NET database query API in a tight loop. This is the relevant metric, as it represents the performance ceiling. It doesn't matter how fast the packets can get on the wire if the application can't utilise this because of some other bottleneck.
Of course, the underlying TCP latency is significantly lower. Using Microsoft's "Latte.exe" testing tool, I saw ~50 μs in Azure with "Accelerated Networking" enabled. As far as I know, they use Mellanox adapters.
Something I found curious is that no matter what I did, the local latency wouldn't go below about 125 μs. Neither shared memory nor named pipes had any benefit. This is on a 4 GHz computer, so in practice this is the "ultimate latency limit" for SQL Server, unless Intel and AMD start up the megahertz war again...
It would be an interesting exercise comparing the various database engines to see what their latency overheads are, and what their response time is to trivial queries such as selecting a single row given a key.
Unfortunately, due to the DeWitt clauses in EULAs, this would be risky to publish...
It should be noted that AES-CBC doesn't have any instruction-level parallelism available.
Both EPYC and Xeons have 2-AES units per core now. But CBC can only effectively use one of them at a time. (Block(n+1) cannot be computed until block(n) is done computing. Because Block(n) is used as input into Block(n+1) in CBC mode).
AES-256-GCM can compute block(n) and block(n+1) simultaneously. So you need to use such a parallel algorithm if you actually want to use the 2x AES pipelines on EPYC or Xeon.
More like ~3.9 GB/s for a desktop CPU and somewhere north of 7 GB/s for AES-256-CTR. The performance difference between CBC and CTR is pretty much exactly what you'd expect, you can only use one out of two units, and you have to wait out the full latency (1:4 disadvantage on recent cores) => ~8x slower.
Not for any single user or connection. High capacity doesn't equal low latency, and scalability to many users doesn't necessarily help one user get better performance!
Very often, you'll see n-tier applications where for some reason (typically load-balancers), the requests are muxed into a single TCP stream. In the past, this improved efficiency by eliminating the per-connection overhead.
Now, with high core counts and high bandwidths, some parallelism is absolutely required to even begin to approach the performance ceiling. If the application is naturally single-threaded, such as some ETL jobs, these single-stream bottlenecks are very difficult to overcome.
In the field, I very often see 8-core VMs with a "suspicious" utilisation graph flatlining at 12.5% because of issues like this. It boils my blood when people say that this is perfectly fine, because clearly the server has "adequate capacity". In reality, there's a problem, and the server is at 100% of its capacity an the other 7 cores are just heating the data centre air.
You and pretty much every modern data centre application, irrespective of the technology stack. The typical 1.3ms latency I see kills performance across the board, but very few people are aware of the magnitude of the issue.
Yeah, I had the same thought. Also gzip has been around for many decades and newer lossless compression algorithms are way better both in terms of speed (comparing with the same hardware) and compression ratio, e.g. LZ4 algorithm and Zstandard.
It just doesn't make sense for the HW engineers and chip designers to optimize backwards for older software algorithm instead of optimizing forward for newer algorithms.
How specific are those "gzip instructions", though? Are there newer algorithms that could benefit from some of the same acceleration primitives, even if they do higher-level things differently?
They're not instructions; it's a dedicated hardware unit. Most of the area appears to be devoted to finding matches so it should be possible to add other LZ-style algorithms without much additional area.
How is the single thread performance? Intel Xeon was always faster (thanks to competition from AMD Opteron...), But maybe that's changed with Intel's recent problems...
When Phoronix tested POWER9 SMT4 a while back, single-thread performance seems disappointing at first glance.
But it seems to be made up with their implementation of SMT4. The 2nd thread on a core didn't have much slowdown at all, while thread3 and thread4 per core barely affected performance.
It seems like POWER9 at least, benefits from running significantly more threads per core (at least compared to Xeon or AMD).
EDIT: It should be noted that IBM's 128-bit vector units are downright terrible compared to Intel's 512-bit or AMD's 256-bit vector units. SIMD compute is the weakest point of the Power9, and probably the Power10. They'll be the worst at multimedia performance (or other code using SIMD units: Raytracing, graphics, etc. etc.).
Power9's best use case was highly-threaded 64-bit code without SIMD. Power10 looks like the SIMD units are improving, but they're still grossly undersized compared to AMD or Intel SIMD units.
Sounds like IBM is not wasting area and power on out of order scheduling to find independent instructions within one thread. If you're running a lot of threads anyway, you get more independent instructions to work with for free!
When in SMT4 mode, various hardware resources are "partitioned off" in Power9.
The first, and third, threads use the "Left Superslice", while the second and fourth threads use the "Right Superslice". All four threads share a decoder (Bulldozer style).
1/4th of the branch predictor (EAT) is given to each of the 4x threads per core.
Register rename buffer is shared 2-threads at a time. (Two threads use the "left superslice", two other threads use the "right superslice"). An SMT1 mode, the single thread can use all 4 resources simultaneously.
A lot of the out-of-order stuff looks like it'd work as expected in 1-thread to 4-thread modes. At least, looking through the Power9 user guide / in theory.
--------
Honestly, I think the weirdest thing about POWER9 is the 2-cycle minimum latency (even on simple instructions like ADD and XOR). With that kind of latency, I bet that a number of inner-loops and code needs 2-threads loaded on the core, just to stay fully fed.
That'd be my theory for why 2-threads seem to be needed before POWER9 cores feel like they're being utilized well.
Obviously, POWER10 probably will change some of these details. But I'd expect POWER10 to largely be the same as POWER9 (aside from being bigger, faster, more efficient).
I am excited. But also sad that most of us wont ever get to play with it. Unlike Intel and AMD's Xeon and EPYC, getting hold of POWER doesn't seems like an easy task unless the cooperation you work for have very specific needs.
Edit: Turns out there is a major section on it below.
“GAFAM
A more inclusive grouping referred to as GAFAM or "Big Five", defines Google, Amazon, Facebook, Apple, and Microsoft as the tech giants.[18][19][20][21] Besides Saudi Aramco, the GAFAM companies are the five most valuable public corporations in the world as measured by market capitalization.[3] Nikos Smyrnaios justified the GAFAM grouping as an oligopoly that appears to take control of the Internet by concentrating market power, financial power and using patent rights and copyright within a context of capitalism.[22]“
Those wouldn't be IBM's top competitors in semiconductors would they? I don't think Facebook or Amazon or Microsoft is spending more than 1% of their R&D budget on semiconductors.
CPU-SIMD is less about competing against GPUs and more about latency.
GPUs will always have more GFlops and memory bandwidth at a lower cost. They're specifically built GFlop and memory-bandwidth machines. Case in point: the NVidia 2070 Super is 8 TFlops of compute at $400, a tiny fraction of what this POWER10 will cost.
If POWER10 costs anything like POWER9, we're looking at well over $2000 for the bigger chips and $1000 for reasonable multisocket motherboards. And holy moly: 602mm^2 at 7nm is going to be EXPENSIVE. EDIT: I'm only calculating ~2 TFlops from the hypothetical 60x SMT4 Power10 at 4GHz. That's no where close to GPU-level Flops.
However, the CPU-GPU link is slow in comparison to CPU-L1 cache (or even CPU-DDR4 / DDR5). A CPU can "win the race" by using SIMD onboard, completing your task before it even spent the ~5-microseconds needed to communicate to the GPU.
----------
With that being said: POWER10 also implements PCIe 5.0, which means it will be one of the fastest processors for communicating with future GPUs.
It's more appropriate to compare pricing of Tesla with a datacenter-grade CPU like POWER10 (or Epyc/Xeon/etc.).
A64FX (in Fugaku, the current #1 machine on all popular supercomputing benchmarks) has shown that CPUs can compete with top-shelf GPUs on bandwidth and floating point energy efficiency.
Fugaku has 158,976 nodes x2 chips each, or 317,952 A64FX chips.
Summit has 4,608 nodes x 6 GPUs each, or 27,648 V100 GPUs. It also was built back in 2018.
---------
While Fugaku is certainly an interesting design, it seems inevitable that a modern GPU (say A100 Amperes) would crush it in FLOPs. Really, Fugaku's most interesting point is its high rate of HPCG, showing that its interconnect is hugely efficient.
Per-node, Fugaku is weaker. They built an amazing interconnect to compensate for that weakness. Fugaku also is an HBM-based computer, meaning you cannot easily add or remove RAM (like a CPU / GPU team can configure to more, or less RAM by adding sticks).
These are the little differences that make a difference in practicality. But yes, A64FX is certainly an accomplishment, but I wouldn't go so far as to say its proven that CPUs can keep up with GPUs in terms of raw FLOPs.
A100 has a 20% edge on energy efficiency for HPL, along with higher intrinsic latencies. It's also 6-12 months behind A64FX in deployment. https://www.top500.org/lists/green500/2020/06/
HPCG mostly tests memory bandwidth rather than interconnect, but Fugaku does have a great network.
Adding DRAM to a GPU-heavy machine has limited benefit due to the relatively low bandwidth to the device. They're effectively both HBM machines if you need the ~TB bandwidth per device (or per socket).
Normalizing per node (versus per energy or cost) isn't particularly useful unless your software doesn't work well with distributed memory.
> Adding DRAM to a GPU-heavy machine has limited benefit due to the relatively low bandwidth to the device. They're effectively both HBM machines if you need the ~TB bandwidth per device (or per socket).
This POWER10 chip under discussion has 1TB bandwidth to devices with expandable RAM.
Yeah, I didn't think it was possible. But... congrats to IBM for getting this done. Within the context of this hypothetical POWER10, 1TB bandwidth interconnects to expandable RAM is on the table.
It's 410 GB/s peak for DDR5. The "up to 800 GB/s sustained" is for GDDR6 and POWER10 isn't slated to ship until Q4 2021 so it isn't really a direct comparison with hardware that was deployed in 2019.
Your point about latency is the reason why we can't use GPUs for realtime audio processing, even though they would be absolutely otherwise well-suited. Stuff like computing FFTs and convolution could be done much faster on GPUs, but the latency would be prohibitive in realtime audio.
GPU latency is ~5 microseconds per kernel, plus the time it takes to transfer data into and out of the GPU. (~15GBps on PCIe 3.0 x16 lanes). Given that audio is probably less than 1MBps, PCIe bandwidth won't be an issue at all.
Any audio system based on USB controls would be on the order of 1000 microseconds of latency (1,000,000 microseconds/1000 USB updates per second == 1000 microseconds). Lets assume that we have a hard realtime cutoff of 1000 microseconds.
While GPU-latency is an issue for tiny compute tasks, I don't think its actually big enough to make a huge difference for audio applications (which usually use standard USB controllers at 1ms specified latency).
I mean, you only have room for 200 CPU-GPU transfers (5-microseconds per CPU-GPU message), but if the entire audio-calculation was completed inside the GPU, you wouldn't need any more than 1-message there and 1-message back.
Depends on the application. Your brain can notice the delay when playing something like digital drums after ~10-15 ms. Something with less 'attack' like guitar has a bit more wiggle room, and ambient synths even more so.
edit: also, vocals are most latency-sensitive if you are the one singing and hearing it played back at the same time.
> They're specifically built GFlop and memory-bandwidth machines
That would indicate their theoretical peak performance is higher. Unless you can line up your data so that the GPU will be processing it all the time in all its compute units without any memory latency, you won't get the theoretical peak. In those cases, it's perfectly possible a beefy CPU will be able to out-supercompute a GPU-based machine. It's just that some problems are more amenable to some architectures.
For games and graphics? I don't think so, a GPU has a ton of dedicated hardware that is very costly to simulate in software: triangle setup, rasterizers, tessellation units, texture mapping units, ROPs... and now even raytracing units.
In a sense yes, and this has already happened. It's just that GPUs are under lock and key just like early processors were. There are interesting development leaks you can find, like NVIDIA cards supporting USB connectors on some models, implying you can use just the GPU as a full computer.
It’s very cool, but unfortunately inaccessible for those without sky high budgets and time to talk to sales reps.
I would happily experiment with one of these at our HPC cluster(a small group at an University), but the idea of talking to a sales rep to even figure out what it would cost puts me off completely, ignoring the licenses for most interesting things to do with the hardware you buy.
I wish t. power boxes were as easy to buy as x86 boxes. Simply configure, get an idea of a price, talk to the distributor and place an order.
I worked with a financial trading firm that was interested in evaluating a POWER system. They called IBM, who arranged a meeting our our offices.
From memory, about 8 IBM people showed up. They didn't seem to actually know each other, but were from several different groups within IBM.
We sat down, and started by explaining what we did with our existing x86-64 systems, and what we thought we'd like to try with the POWER system. We asked for a single box to evaluate, roughly the equivalent to our existing HP DL380 dual-Xeon boxes.
The folks from IBM then spent the next 40 minutes arguing with each other about exactly which system we should be using. Five minutes before the meeting was scheduled to end, one of them took charge and said they'd figure it out offline, and get back to use with the details.
Several more rounds of email were exchanged, but we never actually got to the point of being told what system we could have, or what its specs were, let alone actually being able to physically get one and test it.
It was perhaps the most absurd situation I've seen in 30 years in the industry.
There used to be a joke that IBM sales reps don't have children because all they do is sit on the bed and tell their spouses how good it's going to be.
Rather more than 20 years ago, a DEC/COMPAQ salesperson cold-called me to see if the ISP I was working at would like to switch to Alpha servers. After about 20 minutes, he offered a six-month free loan of a mid-range server -- probably $10-15K, I don't recall. It arrived a week later. We determined the hardware was pretty nice but the operating system was a major PITA -- this was when you expected to compile a large fraction of your software -- so it mostly sat on someone's desk for the last four months of the loan before the salesperson came to claim it.
Compaq also used to provide publicly-available "testdrive" Alpha servers that you could just shell into. This was way before the idea of cloud computing was mainstream. A lot of gnu+linux/alpha development happened thanks to testdrive.compaq.com (which is dead now, of course).
This was before the Alpha port of Linux happened; if it had been available instead of OSF/1, we would have done more with the machine.
But, yes: if you want to market hardware that nobody else can supply, you need to get it into the hands of people who will use it and evangelize for it.
In a 30+ year professional IT career I've had exactly three interactions with IBM, and every one of them went very much as you described. How this company stays in businesss I'll never understand.
Big IT contracts and their VAR doing the actual selling.
[edit]We bought our new iSeries (Power9-based) through a VAR this year with an IBM rep helping us with what we actually needed. It was a bit of a drawn-out experience, but overall, I wasn't displeased. It was easier than dealing with some PC vendors (looking at you Dell and HP). I would imagine it will be another 15 years before we have to buy another one.[/edit]
Bingo. Going to an IBM VAR/Partner is really the only decent/sane way to navigate the IBM sales bureaucracy unless you're talking about US$(high 7-figure plus) orders. It's also really the best way to get the best price, because the VAR will work to figure out all the discounts and such a customer might be entitled to. On an IBM sales team, the left hand doesn't always know or care what the right hand is doing.
When I worked at IBM, I usually called in a favor from friends at a local VAR whenever I needed to order something to get a BoM, because even internally the process was opaque.
I had decent experience with an IBM reseller, back in 2005 or so. I forget the exact term. VAR? Solutions provider? Anyway, we were able to get the latest POWER system (POWER 6, I think) and evaluate our AIX-based app on it for about 2 weeks. They even set us up with a small SAN! Unfortunately, the reseller was not terribly skilled technically, so they were unable to configure the SAN. I had to figure that out myself.
A couple years ago, I worked at a company who's ERP system ran on an IBM System i (formerly known as AS/400). At a user conference, IBM had a table setup with a couple sales guys and a pair of servers on the table.
The first one was the latest server that could run Series i. No big deal, we were coming to the end of our maintenance contract and would probably be buying a new machine in the next year or so.
The second one was a 2U, 48-socket POWER server. They bragged about it running thousands of Linux VM's at a time. I found it a bit odd because nobody who would be running this particular ERP software would be running thousands of Linux VM's.
> It’s very cool, but unfortunately inaccessible for those without sky high budgets and time to talk to sales reps.
I've spent some time looking to see if it would be possible to spin up a POWER based VM in the cloud (just out of curiosity really). While it seems possible in theory in the IBM Cloud, it seems IBM themselves is only interested in offering this to their enterprise customers moving to the cloud, focusing it on getting people to move IBM's AIX/iSeries lock-in to the cloud. When looking at it before, I was not able to spin up a POWER VM from a regular IBM Cloud account at least.
There might be a bit more interest in POWER if it wasn't so damn inaccessible but it really is. If the easiest way to get into POWER is paying thousands of dollars to a company retrofitting IBM's decidedly non-desktop hardware into desktop hardware (talking about Raptor CS), your architecture is doomed to wait for all your enterprise customers to move to amd64 commodity hardware. Maybe that is IBM's goal even, I don't know.
Ah this seems to be new, thanks! That's exactly the kind of response I hoped I would get.
Unfortunately completing this flow - with the goal to get a temporary shell on a Power system running some Linux distro - requires me to authorize a payment of over 1300 dollars before I even select an image. That's for "reserving" 1 POWER core and 2GB of RAM... As an individual developer playing with this, that's way over my head. I'd understand very high end cloud pricing for that (tens of dollars per hour, I'd be happy to pay that for messing around a little), but this isn't even a cloud machine pricing model. For reference, $1300 is over half the price of getting a Talos motherboard with an 8-core POWER9 CPU you then actually own.
It seems IBM does not have the infrastructure or volume to permit cloud pricing here. I understand they might not have this, but in my opinion they need to work on this to make it accessible.
Completely agree. I've talked many times to some of the folks there and it seems to me the product is not for a developer or hobbyist, which is unfortunate. It is purely enterprise.
Thought this will open up a parallel ecosystem for certain applications but IBM does not seem to think that.
I just don't understand how IBM - being acutely aware of what happened to Itanium and SPARC + running a cloud platform themselves - still believes this notion of "enterprise hardware" is the way of the future. Surely IBM was/is even better entrenched in this area than HP and Sun/Oracle/Fujitsu, but come on. POWER has a lot of history in consumer accessible products even. This is just a recipe for fading into irrelevance ever more.
Sounds like a problem that should be easy to find on an org chart.
If you don’t reward good ideas (they don’t even need to be particularly good or novel, just common sense), you’ll have a company trying to grow something that’s reached its peak usage with motivational speeches, very inspiring leaders and good old pressure on employees as a form of local optimization.
It’s extremely easy to find. The problem is the org chart itself. The entire way they run this stuff is not setup to handle the kinds of things we expect (and get) from every other hardware vendor or cloud provider that could reasonably be considered a competitor to them in either category.
Mostly, but I think you could have an org chart and let individuals and teams override it if they find good ways to reuse assets.
I just don't think most companies run this way, maybe it's the military way of thinking of strict hierarchy rather than a market where the best ideas can develop or where you can at least break out of the hierarchy to get something started.
The MBAs and bean counters would probably argue that it's important to get certain unpopular things done, but I think a market would solve that too, as soon as something becomes a bottle neck, someone would step in.
My impression is that organizations are far too centered around VPs and directors who get to carry on without having to prove themselves again in the new situations they're in.
Beats me. Can't believe a company that was making processors for the most picky of all industries (gaming consoles) is now completely ignoring the everyday consumer. Org priorities and financial engineering it looks like.
They've made huge changes since those days, getting rid first of x86 laptops and desktops then servers then chip fab. None of those decisions ever seemed particularly wise to me, especially the last.
What makes you think that IBM sales people and management cares about "the way of the future"?
They care about the quarterly report and right now the best way to improve short-term performance is to milk your existing locked-in customers for as much money as possible.
Because they've just announced the next entry into a series of processors that by all accounts can take the fight to Intel and AMD's best (and has done so for some time), which requires huge investments and long term planning, roadmaps and funding commitments to work. Seems pretty committed to long term thinking to me.
If any management/exec thought of this filter, they are idiots. I'd imagine if they made these as accessible as x86 servers to me - a humble sysadmin, and I could prove by data that our legacy Fortran scientific application can run faster on it than the newest Intel servers (it most likely would, given SMT differences, higher memory bandwidth, wider cache lines, more efficient 7 nm process, higher clocks etc) I'd recommend to my boss to spend our annual compute budget on these boxes instead of the Intel servers because that one legacy application consumes a lot of our HPC compute capacity. My boss and I would happily pay IBM to give this critical application a boost. Too bad, IBM filtered me away.
Agreed. Lots of companies have this 'call our sales department' strategy prior to giving you any information at all. That's always been a great way to lose my business, and indirectly the businesses for which we consult. But that's perfectly ok with me. If a company is not willing to list their prices up front then that's a good indication that they are not competitive.
Besides that, some of the atrocities that I've seen IBM and their partners commit are a good warning that you want to stay far away from them. Lest you be Watsonized and made dependent on marketing-masquerading-as-technology.
The last time I checked (which was quite a while ago) you could configure and price at least some categories of POWER servers on IBM's website much like you can with Dell or any other x86 vendor. The big difference though was in the amount of sticker shock when you see the bottom line price.
Very much agree. When IBM acquired Red Hat I really hoped they were going to get serious about cloud. They still might, but at least for now me as an individual guy just can't play with their hardware. I admit I'm quite impatient and I get irritated when any amount of red tape blocks efficient allocation of my time, but I don't think (given how easy it is on other cloud platforms) it's too much to ask for if you want to be taken seriously as a public cloud offering.
Each of these do some filtering because they give out resources only to find people bitcoin mining. Also, a good number of experimenters give up on the first obstacle they encounter or aren't really well-versed in benchmarking / architectural differences (lots of folks running microbenchmarks). There are some incredible resources available (many free) for those looking for a partner and not just a box. Good luck in finding the right partner for your projects.
Of course I'm not. My entire point is that filtering out average Janes like me is antithetical to the long-term interests of POWER, and extremely weird for a cloud platform that allows you to spin up an amd64 VM in a few seconds for a few cents. You're left with an audience that has to use POWER (either because legacy or their purchasing colleagues thought it somehow was a good idea), not an audience that actually wants to.
It speaks volumes to me and says that IBM knows perfectly well that they are not interested in cost conscious customers but only want those for whom the IT department is a cost center and not a core strategic asset.
That's why you'll never see a Netflix or a Whatsapp on IBM infra. But banks, insurance companies, medical companies etc are still large contributors to IBMs revenue streams. If your idea of software development is agile teams and capable programmers churning out code to power your business then you're not an IBM customer or even a prospect.
If your idea of software is 5000 programmers as interchangeable cogs in a machine with an annual release and three month acceptance cycles then IBM is where you'll probably end up.
I understand that is how IBM thinks. I just don't like that they only think this way, as it feels like spoilt potential. IBM has all the pieces in place - they have great technology with high performance, ppc64le actually has pretty good apparent support from mainstream Linux distros, infrastructure and languages (for a non x86/arm architecture), and they have full "creative liberty" of where they focus their platform. It would be an awesome option to have for cloud infrastructure, but they keep it all to themselves.
They could be using that to do what AWS does with Graviton2 - cost control their completely integrated stack and make it a competitive advantage. Sell more performance per dollar despite having a non-amd64 architecture. Use it to give everyone more choice and competition. But instead they mostly use it to lock in their old (or new?) enterprise customers. The irony is that these two models could easily coexist, but they don't seem to understand the former and understand the latter very well.
And I can't help but think the latter is a losing model, as my strong impression is that the movement in the enterprise is away from "enterprise hardware", and towards commodity hardware. IBM needs to work on becoming commodity, in my opinion.
Oh, you are totally right, it is spoiled potential. But they've been doing this for so long it is impossible for them to change. Hence the very long and slow slide to the bottom. I'd see the Maffia change their tune before IBM ever will, way too much institutional inertia.
I know some of the IBM story from very close and it takes a certain attitude to even want to work there.
What's funny about this is that it used be be true - Whatsapp started on Softlayer, and didn't move away until Facebook bought them.
If IBM had even just kept pace and stayed behind AWS with Softlayer after they acquired them, they would have a healthy cloud business by now. It might be because I maintained a system on Softlayer both pre and post acquisition - so I was really close and able to see what was happening - but they squandered a huge opportunity there.
Ah yes, that's true, they in fact were hosted there at some point. I couldn't have picked a worse example :) Or; in a way it is proof that IBM is a bad choice for companies that operate at scale and are low margin and data heavy. Hosting costs must have been a substantial fraction of operating costs for Whatsapp (obviously, long after personnel).
To be fair, POWER is an open standard, and there's absolutely nothing stopping someone like Linode, DigitalOcean, or Hetzner from offering POWER-based systems at a smaller hourly price.
> To be fair, POWER is an open standard, and there's absolutely nothing stopping someone like Linode, DigitalOcean, or Hetzner from offering POWER-based systems at a smaller hourly price.
Well besides the fact that 1) IBM is the only party with both the capabilities and interest in making high performance Power based products [1], and 2) evidently does not understand how to (or why to) invest in bringing this to a general audience. I don't really care if they would do this in IBM Cloud or with other cloud infrastructure companies, but they don't seem to be doing either. In addition 3) why would other parties be interested in running Power when they can run amd64 or arm? It certainly doesn't look to have a price advantage...
IBM really needs to shepherd Power well - they're the only ones that can do it. But I can't help but thinking they seem to be leading it to the grave despite apparently very capable engineering.
[1]: And why would anyone but IBM go for Power at this point if they can have ARM too? An open license matters very little when compared to ARM's mindshare and momentum.
As it is, POWER is on a slow decline towards irrelevance by focusing only on milking their existing enterprise customers as long as it lasts.
Which is a huge shame, the POWER ISA per se is mostly fine, and their commitment to open firmware etc. for trustworthy computing (see Raptor) is a niche that for some reason interests nobody else.
If they want to turn it around and compete with x86, ARM and maybe even RISC-V on the low end, they need to commit to openpower. Get some interesting cores open sourced under the openpower umbrella, docs, open source bus interfaces for connecting stuff on a SOC, etc. And get some decent priced hardware into the hands of hobbyists and as dev boards for embedded.
How would filtering out potential developers benefit a platform’s viability? My imagination isn’t sufficient to envision a market situation where this could ever make sense. Can anyone here describe a hypothetical scenario where as a company pushing a platform you want less developers interested in it?
It could be your going about it a bit wrong. IBM does have trial POWER cloud instances. Or at least they did a couple years ago, as I managed to get my hands on one through the IBM developer program.
OTOH, getting into the developer program requires talking to their sales/marketing droids too. But, if your a software OEM or have a SAS product, they were (are?) hungry and actively doing their best to recruit companies to their platforms. So, you spend a couple hours answering their questions, and in return you get some pretty steep hardware/software discounts and they will loan you machines.
Well in our case (Fortune 500) we have iSeries and pSeries systems built on different generations of Power. They just aren't consumer grade machines out there, or better put commodity priced machines.
When your are already invested it is far safer to stay with what works for you. My downtime is measured in hours per year and all of it has been scheduled across the last seven years; that was the last time we had an unscheduled outage. You can achieve this will all types of hardware; well maybe not all types; but it is easier with some than compared to others and expectations are certainly much higher.
We joke at work that we get forgotten all the time because our daily ready for business meetings; all groups reporting in; never see us mentioned except to state plans for upcoming quarterly maintenance. A pleasant state to be in
This is probably a bit picky since you don't mention what those machines do, but that's not even "four nines" in the high-availability scale. I'm not sure that's something to boast about for a Fortune 500 company or any kind of praise for the IBM series. Maybe you meant hours over the last seven years?
Not op but it sounded like it was scheduled downtime so perhaps it was meant as a per-system number and either redundancy kept the availability higher or the service wasn't needed when downtime was scheduled (overnight, weekends, etc)
Juelich Supercomputing Center has a POWER9 system which I was able to experiment with. A lot of interesting differences from 4 way SMT, 128byte cache line size to faster main memory bandwidth and multi GPUs.
In the end, it was great but not magical, so the difficulty of acquiring offsets the benefits IMO.
I have a 1999 IBM Multiprise 3000 and I learned that the processor has a 256-byte cache line which dropped my jaw when I read about that. Sadly, I can't get the Service Element software so I can't actually do anything with it.
It's really cool they're making this, I should give this a try. This does look like it's the most accessible way of getting a POWER-based shell. This service's interface does not start out with making a great first impression though unfortunately, but if it's the only realistic option then so be it.
If you wanted to try ppc64le, you could reach out to OSU's OSL Lab. They have Power8/9 machines and you could build your software and possibly arrange some time for PoC.
Also you can build for ppc64le if you are, or become, a developer for a GNU/Linux distribution. Unfortunately getting ppc64le back online for Fedora copr has no ETA, but you can develop and build for Fedora/EPEL. I can't remember whether SuSE's OBS has it.
You can also run under qemu, but I don't know how solid that is, and there is at least one simulator.
Raptor Systems’s POWER9-based Talos system immediately came to mind when I was reading this. I definitely hope they’ll be bringing out a POWER10-supporting successor. The experience is now pretty perfect, even for the supposedly dicey desktop environment (provided you keep in mind that there’s a lot less hardware present than you might’ve come to expect: no integrated graphics, no in-built sound, even disk and network controllers can be a bit dicey). Their initial offerings had some notable teething problems, that I shan’t waste more pixels going into (I’ve railed against them and turned in their favour, so I shall let it rest with my partisanship and convert’s fervour). It’s certainly worth considering if you’re into exotic hardware and/or are super-security conscious AND are willing to pay some extra money beyond what you’d expect for comparable x86₆₄ commodity builds. Well worth hoping for and looking into.
The more you’re ready to do-it-yourself, the more you can save. You absolutely need the motherboard and IBM chip, and then you can get other things like memory, storage, graphic card, etc. used for much cheaper. Only the memory is a bit special as it has to be registered, but because of that you might be able to find good deals since it can’t be used in standard desktop computers.
You can get more performance for the same price with AMD/Intel, but for those interested this platform is not that out of reach. I use it as my workstation and I’m happy with it.
Blackbird™ Mainboard (Board Only)
Order online for $1,310.99
Current Status: Backordered
Not unreasonable. $500 for the 4-core processor or $800 for the 8 core. Basically what you would have paid for a decent system decades ago, without adjusting for inflation.
“Much lower” is still quite expensive compared to a mainstream platform: the absolute minimum if you already had all the other pieces would be the motherboard and CPU bundle for $2,133.77: https://secure.raptorcs.com/content/BK1B01/intro.html
That’s 8 cores, which since Power9 has SMT4 means 32 threads. The 4 core CPU bundle is a bit cheaper but when you add the 2u CPU cooler its price gets very close, so it’s not worth it unless your computer case is too narrow for the 3u (≃15 cm clearance, https://en.wikipedia.org/wiki/Rack_unit) cooler included with the 8 core CPU.
It won’t run AmigaOS unless on an emulator[1], and then you’re better off with AMD/Intel: you would lose more performance in emulating the CPU, but you would get more raw performance for your money anyway.
If you want to run AmigaOS software natively on PowerPC your best option is a PPC Mac running MorphOS. It even runs on the PowerBook G4: https://ddg.co/?q=morphos+powerbook+g4
I’m in the waiting list for Vampire4 Standalone, but for now I’ve got a MiSTeR.
So no, don’t buy a Power9 hoping to run AmigaOS on it any better than on a mainstream PC.
P.S.: the $1310.99 price mentioned by aww_dang must be from a few months ago, the current price displayed in Raptor’s website is $1732.07. Prices went up due to the current COVID-19-induced crisis, which also made Raptor cancel their upcoming Condor platform (ATX Power9 but LaGrange-based: twice the memory bandwidth): https://www.talospace.com/2020/07/condor-cancelled.html
If you run an HPC cluster, I'd have thought you have to talk to vendors as a matter of course, but I assume there's no point yet. At least one UK vendor will sell you AC922s (POWER9), I guess as one-off, if you wanted.
True, I do talk to many vendors related to hardware purchase. But, it usually is about the hardware I want, after I already made up my mind about what fits into our budget and matches our need. IBM is usually not even in consideration because I cant even get a ballpark estimate of what these things practically cost (I can get some $ figures from Googling forums but it is not really of my interest).
To give you an idea, I can log in into a local vendor's website, for example, https://www.atea.dk/eshop/products/?filters=S_sr650
and quickly get an idea of what an SR650 Lenovo rack server would cost me. My configurations would obviously change the cost, which is where I talk to the vendor to get the real costs.
That's not how IBM works or how they have ever worked. The whole idea here is to put you in touch with their sales organization who will put some effort into determining what the best way of fleecing you is.
But if you want a small GNU/Linux POWER system, you don't talk to IBM, surely. Ours came from OCF in the UK; they might even operate in Denmark. You won't see a price on the web site, but realistically you don't expect a list price from Dell for your x86 server systems -- at least in my experience. If you talk to them nicely, they may give you an evaluation unit. We once paid £1 for a potent Interlagos system that didn't go back.
I'm not saying talking to sales people is necessarily a pleasant and easy task, of course, especially as you typically need to know more about it than they do.
The only place I have heard about POWER being used in Denmark is at the military, but I guess it exists elsewhere as well. Though I could see that the big orders in the public sector comes through procurement. Compared to private companies asking around.
The time to talk to sales reps is a real problem. I was at a startup, and this was a factor in pretty much all of the software we selected. This was a financial app- we needed messaging of some sort. I called the leaders at that time, Tibco, 29West, etc. They all wanted to schedule meetings 3-5 days in the future to just talk about what we needed. My expectation is that we would have something basically working in 3-5 days. This is before we started haggling over contracts and pricing and all that.
We ended up using ActiveMQ, and literally that evening I had things talking to each other. No contracts, no hassle, just downloaded and got to work.
A bit more on-topic in regards to IBM though- we buy licenses from them directly, and the past few months they have sent kind of bizarre emails saying I could save money if I bought them through a third party partner. I just forwarded the mail to the finance guys, but I can't imagine why IBM would introduce a middle man into our relationship with them and how that additional layer in between could end up saving us money.
IBM does channel sales. They don’t want to maintain a sales force. They want to maintain channel partners and VARS. They will discount you because they need your money flowing thru VARS to keep everyone happy in their ecosystem. So they need to incentivize everyone to go through their partners. So they will charge you more if you are direct to them.
They had and have a large/enterprise customer sales culture. There were few SMB sales for IBM back in TWSr's days (how small a mainframe do you want? Still costs a mint.)[1] and now that IBM has SMB product to sell, they aren't interested in the building out a sales org to address that market[2]. So they've done what almost everyone else has done and outsourced that to a channel.
[1] Yes I know IBM used to sell everything down to the pens and pencils, but virtually always to support some very expensive other purchase.
[2] Yes I know IBM has things called "SMB sales" or some variation. From my experience at IBM, they were either targeting some specific product/market combo, or they were a bad joke; not exactly the A-team. YMMV.
I don't know what the plan is for POWER10 but POWER9 systems are available from other vendors than IBM. Notably this company: https://www.raptorcs.com/.
The systems are still quite expensive, but likely within the budget of a University research lab.
I've begun to think having to "contact sales" is an anti-pattern that actually hurts gross sales in the long-run.
I've recently tried to engage with EventMobi, a company that supports virtual events.
I'm ready to spend thousands if necessary.
However:
#1: There is no way to just sign up, which is off-putting.
#2: Requesting a demo just put me in their funnel with canned email messages. And after over a week, I have yet to be contacted by a live human, despite sending emails.
I feel companies leave a lot of money on the table by not having some sort of self-driven onboarding.
Sales reps are humans. They get busy. Forget to call back, etc.
I feel like there should ALWAYS be at least some sort of self-driven flow at the low end. Even if sales are required at the high-end.
Otherwise, it seems, money is always being left on the table.
The dual chip module has 30 SMT8 cores running at 3+GHz, capable of 64 FP64 FLOPS/cycle when using the matrix unit. That gives 5.7TF of peak FP64 performance (Compared to 19.5TF on NVIDIA A100 when using tensor cores, and 9.7TF on A100 when not using tensor cores).
They say it has 3x the "general purpose socket performance" of power9 in FP workloads. Trying to make sense of this from the other data, they have 15 SMT8 cores per chip (12 on Power9). The single chip module runs at "4+"GHz and dual chip at "3+"Ghz. (4GHz on Power9). Each SMT8 core has 30% additional performance compared to Power9 (slide 13). If I assume the lowest possible clock that gets me to 2.4x comparing the dual chip modules to the previous single chip modules, whereas assuming 3.75GHz clocks would give 3x.
Our institution runs multiple HPC clusters for all kinds of scientific use cases. I remember reading about the head of the HPC department making the switch from INTEL/AMD to IBM because it had a much larger memory / storage (I can't remember which) bandwidth in some astronomy application. This made the project feasible without having to invest in custom hardware.
It's good to see that POWER still excels in important use cases.
POWER9 has large cache and can take large memory (at least for when it appeared), but was particularly notable for bandwidth/latency. I think they were the first with PCIe 4, for instance, when PCIe 3 was a bottleneck for HPC interconnect.
For the uninitiated, what's the value prop of these processors?
Cheaper $ cost per TFLOPs to make up for the trouble of dealing with a specialty instruction set? Speed of certain specialized computations that cannot be matched by alternatives?
Open Architecture with no black boxes. Total control over your system of the kind we used to take for granted before the Intel Management Architecture days. Very focussed on throughput and centralised operation (for example homomorphic encryption and encrypted memory to forestall snooping).
And yes, when used ‘correctly’, these systems can be very fast... in the steady marathon kind of way rather than the spasmodic sprint-racer clock-boosting-and-throttling manner of today’s mainline chips.
> ‘Open’ means that you’re allowed to understand exactly how it works and that there’s no mysteries. It means having the blueprints of the machine, not a free machine.
No, this is 100% incorrect.
The Power ISA, i.e., the software/hardware interface of the CPU, is open source. This means that if you want to build a Power CPU that implements its software interface, you can do so "for free".
That's it. You don't get "the blueprints of the machine", you cannot look into how the CPU work internally and understand it, etc.
That's like having a standard API that anybody can implement, e.g., the C standard library, but which Apple, Microsoft, etc. ship as a black box binary blob, so you can't understand their implementation, search/fix bugs, etc.
So no, your claim is completely incorrect. The benefits of an open ISA only apply to those wanting to build their own CPUs, which for Power is just not even a handful of companies, none of them making their blueprints of their CPUs openly available...
For end users, your machine is as open/closed on a system with an open ISA like in one with a closed one. People paying 10k$ for a Raptor II in the name of openness are throwing their money away.
This is a completely different situation than, e.g., RISC-V, where not only the ISA is open-source, but the VHDL implementation of many RISC-V cores is also open source, and you can buy those cores today.
It's not about the ISA (I assume). The point is that these systems have essentially all free software firmware, as I understand it. You have remote management, but it's something you can presumably fix if you need to. Apart from trust issues, you know how valuable that is if, for instance, you've had to deal with BMCs' brokenness continually over the years.
It's a doomed project. There are no mobile PowerPC parts on the market anymore; they're trying to make do with a QorIQ networking part, which has an inappropriate power budget for a laptop. (The specifications are a little hazy on the matter, but by my reading of the datasheet, it idles around 7W and draws closer to 20W at full power.)
As far as I know, Power ISA implementations still required royalties until a year ago, so it hasn't had much time to mature in the current iteration of its "open" role.
How do you reconcile this comment with the one from reacharavindh?
What use is the "Open Architecture" part if they're super expensive (ok, maybe you can ignore this part) and you have to go through sales representatives for a simple sale?
Those are still barriers to entry, even if they're not technical.
I don’t even attempt to reconcile it because they’re totally different things... ‘Open’ means that you’re allowed to understand exactly how it works and that there’s no mysteries. It means having the blueprints of the machine, not a free machine.
1. Are any of these "Open" architectures actually used in production anywhere serious when not implemented by their creators? I'm actually interested to know of examples.
2. How do we know that the actual chip IBM provides is the thing in the spec? The comparison was with Intel; how can we prove that there are no backdoors for PowerPC? If we can't prove it, does it matter that it's "Open"?
I never claimed it was, and this is starting to reek of a straw-man argument where you’re opposing a statement I haven’t actually made.
(1) Yes, I am aware of situations where this architecture has been chosen by a body that isn’t a chief implementor, and no, I am not at liberty to discuss it.
(2) Having an open spec to compare against, even if I personally don't know how to do the comparison, is on another plane of existence compared to having nothing to compare against at all.
Decapping and microscopy? Pushing edge cases onto the chip and comparing expected outputs? Implementing all or part on an FPGA and seeing how they compare at a severely clock-reduced rate? It’s well beyond my technical ability, but it’s not beyond expert technicians’ abilities. That’s the key point.
EDIT: Also you can set your own keys for the root of trust, and remove others'. That's very important, and radically orthogonal to the ARM and x86_64 competition.
They seem to be the best "communicators" on the market. With 1 TB/s of DDR5 bandwidth and PCIe 5.0 support, a POWER10 will be the fastest "glue" between DDR5 and GPUs.
Which is pretty much how it works in the Summit supercomputer (POWER9).
--------
From a CPU perspective, it's going to be more costly than a Xeon or EPYC and not as fast at crunching numbers. But POWER9 (and I expect POWER10) usually had the best L3 cache and RAM performance.
The 1TB/sec OpenCAPI link to FPGAs or GPUs continues that tradition. That's an absurdly huge communication path between CPUs and/or GPUs or whatever else is on the motherboard.
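To put that in perspective, here's a rough C sketch of how long a hypothetical 100 GB working set takes to cross each link at peak rate; the PCIe figures are the usual per-direction x16 numbers, and everything here is peak, so treat it as order-of-magnitude only:

    #include <stdio.h>

    /* Time to move a hypothetical 100 GB working set at peak link rate.
       PCIe values are the usual per-direction x16 figures; the 1 TB/s number
       is the OpenCAPI figure quoted above. */
    int main(void) {
        struct { const char *name; double gb_per_s; } links[] = {
            { "PCIe 4.0 x16",            32.0 },
            { "PCIe 5.0 x16",            64.0 },
            { "OpenCAPI link (quoted)", 1000.0 },
        };
        const double working_set_gb = 100.0;

        for (int i = 0; i < 3; i++)
            printf("%-24s %6.2f s\n", links[i].name, working_set_gb / links[i].gb_per_s);
        return 0;
    }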
Since selling its x86 server business to Lenovo, IBM's enterprise server hardware is Power only. It's the flip side to some of the complaints in this thread: IBM enterprise sales is the single trusted source (or single throat to choke) for all of your business-critical systems.
If your organization has an existing relationship with IBM or Red Hat, Power CPUs are part of an integrated bundle moving forward.
Have they improved Load-Hit-Store penalties from previous generations?
Lots of transistors and opcodes have been sacrificed for fancy things like transactional memory, runtime instrumentation and other features, but the fundamentals haven't improved; working around them requires expensive compiler optimizations, which interpreters don't do and which are costly for JITs.
The Intel chips did the fundamentals better; has POWER caught up?
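For anyone wondering what a load-hit-store looks like in practice, here's a sketch of the classic pattern: a value is stored and then reloaded from the same address almost immediately, e.g. when data crosses register files through a stack slot. Whether current compilers and POWER cores still pay a large penalty here is exactly the open question above:

    #include <stdio.h>

    /* Historical load-hit-store shape: a store followed almost immediately by a
       load of the same address. Older POWER/PowerPC cores stalled badly on this;
       newer compilers may avoid the memory round trip entirely. */
    static inline float bits_to_float(unsigned int bits) {
        union { unsigned int u; float f; } pun;
        pun.u = bits;   /* conceptually a store to a stack slot ...          */
        return pun.f;   /* ... reloaded right away: the classic LHS pattern  */
    }

    int main(void) {
        float sum = 0.0f;
        for (unsigned int i = 0x3f800000u; i < 0x3f800000u + 1000u; i++)
            sum += bits_to_float(i);   /* store/reload on every iteration */
        printf("%f\n", sum);
        return 0;
    }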
IBM is dying a slow death; they still "milk" the market with old stuff like AS/400, because of a superb lock-in.
The "Power" business is still doing ok; but I'd bet in a few years one of the other big guys will go at it (maybe Nvidia?) and start eating at their market share.
I'm not an expert in this stuff, but I think they just get thrown away.
It's one of several reasons why smaller chips are more area-efficient to make, and one of several reasons why the major semiconductor manufacturers have been so interested lately in building chips out of smaller pieces manufactured separately rather than one big monolithic chip.
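A quick way to see the area-efficiency argument is the standard dies-per-wafer approximation; the die areas below are made-up illustrative numbers, not any vendor's actual sizes:

    #include <stdio.h>
    #include <math.h>

    /* Standard dies-per-wafer approximation: gross dies from wafer area, minus a
       correction for the partial dies lost around the circular edge. Smaller dies
       lose proportionally less to the edge. Link with -lm. */
    static double dies_per_wafer(double wafer_mm, double die_mm2) {
        const double pi = 3.141592653589793;
        double r = wafer_mm / 2.0;
        return pi * r * r / die_mm2 - pi * wafer_mm / sqrt(2.0 * die_mm2);
    }

    int main(void) {
        const double areas[] = { 600.0, 300.0, 150.0, 75.0 };  /* die area, mm^2 */
        for (int i = 0; i < 4; i++)
            printf("%6.0f mm^2 die: ~%4.0f dies per 300 mm wafer\n",
                   areas[i], dies_per_wafer(300.0, areas[i]));
        return 0;
    }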
Even dumber question, why do they need to be circular?
CDs and DVDs write in a circular pattern starting from the middle going outwards, but the actual chips on these wafers seem to be their own individual squares.
Because of the process used to make the initial cylindrical crystal from which the wafers are sliced: https://en.wikipedia.org/wiki/Boule_(crystal) It involves spinning a seed crystal and drawing a cylinder out of a bath of molten ultra-pure silicon. Spinning => cylinder => circular wafers. To reduce waste of the "edge bits" the industry has moved over time to larger and larger wafers. I have some wafers from the 90s which are 6" and 8" in diameter (amazing what you can buy on eBay), but modern ones are all 12" (actually 300mm).
I would guess that dies are built from modular sections (e.g. SRAM cells), and it’s important that two identical modules perform identically - signal propagation time is relevant at this scale, so the shape and layout of each module must be identical. I would further guess that rectangular layouts are easiest to reason about, easiest to make masks for, easiest to pack efficiently at the transistor level, and easiest to test.
But I don’t know of a fundamental reason why a sufficiently advanced VHDL “compiler” couldn’t produce hex-cell or even circular layouts.
Chip dicing hardware can produce hex-cells, or any other cell with straight edges. (Not circular - that's not a good shape to expect from crystalline silicon.)
But - as you say - the modular sections are rectangular, and for most applications there's no good reason to make the dies any other shape.
There's actually a patent for hex-cell chips, but it doesn't seem to have been used for any significant projects.
> Even dumber question, why do they need to be circular?
The wafer is round because it's cut from a cylinder of silicon. And the cylinder is a cylinder because spinning is involved in the process to make it. Hence, thanks to centripetal force, it ends up being round!
>Ribbon solar cells are a 1970s technology most recently sold by Evergreen Solar (which is now in receivership, i.e. bankrupt and liquidated), among other manufacturers.
I'm not sure if they can be 'just' recycled into new wafers. Either way, given that it's just silicon, I'm sure it can be recycled or safely disposed of (silicon isn't toxic as far as I know, unless you breathe it in as a powder).
Not sure how easy it would be to recycle those as chips, given it will have dopants [1] inside. It will likely be unfit for computing applications unless purified, but since it's on the order of one dopant atom per 1e12 silicon atoms, it would basically be 100% pure in other industries.
Some metallic contacts (mostly aluminium), silicon oxide and other residues are likely present as well, depending on the masking process.
I've wondered for quite some time why not triangles or hexagons. I guess the yield (percentage of wafer thrown away) improvements would be minimal. Plus, the temperature is often better controlled at the center, which would make the edge parts less performant anyway.
Those are the other advantages of chiplet design: maximized yield (a small defect renders a much smaller chip unusable) and much more granular binning (it's easier to sort out good/worse chips, due to placement and random issues during fabrication). Not to mention you end up with a much more modular design, where you only have to change the cheaper (not 7nm) silicon interposer.
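The yield half of that argument can be sketched with a simple Poisson defect model; the defect density used here is an assumed illustrative figure, not a published number for any real 7nm process:

    #include <stdio.h>
    #include <math.h>

    /* Simple Poisson yield model: P(die has zero random defects) = exp(-D * A).
       D is an assumed illustrative defect density. Smaller dies yield better,
       which is the chiplet argument in one line. Link with -lm. */
    int main(void) {
        const double d0 = 0.1 / 100.0;                        /* 0.1 defects/cm^2, in mm^-2 */
        const double areas[] = { 600.0, 300.0, 150.0, 75.0 }; /* die areas, mm^2 */
        for (int i = 0; i < 4; i++)
            printf("%6.0f mm^2 die: yield ~%4.1f%%\n", areas[i], 100.0 * exp(-d0 * areas[i]));
        return 0;
    }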
Partly it's "path dependency" - everything is set up for rectangular dies, so everything would need to change for uncertain benefits. Not just the tooling, but also the design software. While I was looking at this I found that Intel has a patent for an octagonal die with a smaller square one fitted in the gaps between: https://patents.google.com/patent/US20060278956
BTW the dicing used (in the 1970s) to be done partly by hand. You can see a video of someone doing it here: https://youtu.be/HW5Fvk8FNOQ?t=978
I know you're probably joking, but please mind Poe's law.
For the uninitiated, yeah, some dies can die during dicing. But I think you'd have trouble finding the cracks, and then it's just infeasible to precisely cut both halves where they would need to be cut, then reattach them. The issues would be the cut thickness, not damaging the circuits near it, precisely aligning the circuits, and then electrically connecting the circuits.
Alignment is probably the hardest part; we can barely do it for flip-chip wafers/silicon interposers at the µm scale. Imagine doing it at the scale of the transistors themselves, on a 7nm-class process.
I think they do sometimes put test features in the corners if there's space. The electrical properties of the die can vary in interesting ways [1], but the edges are usually worse than the center.
So, the main result of designing a CPU is a series of masks that essentially indicate where to put what. For example: in this layer, inject boron anywhere the mask doesn't block. The masks aren't wafer-sized; they are pretty small, and a machine moves the mask from position to position across the wafer to re-use it. But, at least when I was working on this, some masks would be larger than an individual square (die), so maybe the mask could expose 2x2 dies at a time. In that case, one application of the mask near the edge might get you one complete die and three dies off the edge.
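A toy model of that stepping, counting how many dies land fully inside the wafer versus partially off the edge (die size and wafer size are illustrative assumptions, not real numbers):

    #include <stdio.h>

    /* Toy model of stepping dies across a 300 mm wafer: a die is "complete" if
       all four corners fall inside the circle, otherwise it's edge scrap. */
    static int inside(double x, double y, double r) {
        return x * x + y * y <= r * r;
    }

    int main(void) {
        const double R = 150.0;                    /* 300 mm wafer radius */
        const double die_w = 15.0, die_h = 20.0;   /* assumed die size, mm */
        int complete = 0, partial = 0;

        for (double y = -R; y < R; y += die_h)
            for (double x = -R; x < R; x += die_w) {
                int corners = inside(x, y, R) + inside(x + die_w, y, R)
                            + inside(x, y + die_h, R) + inside(x + die_w, y + die_h, R);
                if (corners == 4) complete++;
                else if (corners > 0) partial++;
            }

        printf("complete dies: %d, partial dies lost at the edge: %d\n", complete, partial);
        return 0;
    }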
So who is going to actually fabricate these chips, Samsung?
IBM transferred its own chip fab business to GlobalFoundries several years ago, and it was my understanding that they were tied to them for the following 10 years. But GlobalFoundries announced they were abandoning EUV, so I don't think they're going to be producing 7nm chips.
> Samsung Electronics will manufacture the IBM POWER10 processor, combining Samsung's industry-leading semiconductor manufacturing technology with IBM's CPU designs.
It's interesting to see that they are using Samsung's 7nm process. I thought that, apart from the work they do for Apple, Samsung kept their high-end fabbing mostly to themselves.
Samsung, Glofo and IBM were members of the Common Platform. My ex-roommate used to work at IBM Upstate where they trained Samsung engineers.
Apple moved to TSMC in Taiwan not too long after Tim Cook appeared on CBS claiming that the engines of their mobile devices were made in the US, almost 6-7 years ago. Apple's share of Samsung's production probably isn't much these days. But Samsung is still #2 behind TSMC, and they also announced recently that they are investing $100B over the next 10 years in their logic business, which includes the foundry.
Apple and Samsung had been partners for a long time. Some iPods and the original iPhone, iPhone 3G, and 3GS used Samsung ARM processors. The Apple A4 through A7 were made by Samsung too.
But the A8 was made by TSMC, and the A9 had two versions: the APL0898 by Samsung and the APL1022 by TSMC. There was some debate about which one was better.
It's bittersweet reading these IBM announcements. They clearly have amazing hardware, but I'll probably never get to play with it, since they make no effort to sell to consumers (unlike Intel, AMD, Nvidia, etc).
> transparent memory encryption designed to support end-to-end security
Does this work with process isolation? I.e., can I make it so that each process's memory is encrypted with a different key, to prevent snooping by other processes? How (if at all) does that work with debuggers?
I'm not sure about POWER, but in AMD EPYC it is implemented at the hypervisor level. So each VM can have encrypted memory with a unique key, but within a VM the processes see unencrypted memory.
It's typically implemented as an extension of the virtual memory page table, and conceptually it wouldn't be too difficult to have finer-grained keys, such as one for the kernel and one for user mode processes, or even one per process.
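To illustrate the idea (and only the idea; this is a made-up sketch of a key-ID extension, not the actual POWER10 or AMD SEV mechanism), it could look roughly like this:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Conceptual sketch: a page-table entry carries a small key-ID field, and the
       memory controller uses that ID to pick which key encrypts/decrypts a cache
       line on its way to DRAM. Finer granularity (per process instead of per VM)
       "just" means handing out more key IDs. */
    typedef struct {
        uint64_t frame  : 40;    /* physical frame number                 */
        uint64_t key_id : 6;     /* selects one of 64 hardware-held keys  */
        uint64_t flags  : 18;    /* present / writable / user / ...       */
    } pte_t;

    typedef struct { uint8_t key[32]; } key_slot_t;
    static key_slot_t key_table[64];   /* e.g. 0 = hypervisor, 1 = kernel, 2+ = VMs or processes */

    /* Stand-in for the memory controller's write-back path. */
    static void writeback_line(pte_t pte, const uint8_t *line, uint8_t *dram) {
        const key_slot_t *k = &key_table[pte.key_id];
        for (int i = 0; i < 64; i++)          /* placeholder "cipher": XOR with the key */
            dram[i] = line[i] ^ k->key[i % 32];
    }

    int main(void) {
        uint8_t line[64] = "secret cache line", dram[64];
        pte_t pte = { .frame = 0x12345, .key_id = 2, .flags = 0x3 };
        memset(key_table[2].key, 0xA5, sizeof key_table[2].key);
        writeback_line(pte, line, dram);
        printf("first byte in DRAM: 0x%02x (was '%c')\n", dram[0], line[0]);
        return 0;
    }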
Interesting. Does that allay the concerns about speculative execution side channel leaks in cloud VMs? (Because even if you can leak data from other VMs running on the same physical device, that data will be garbage without the other VM's encryption key.)
Can anyone give a description of what these chips are used for and by whom? And who writes software for these architectures? Seems like a totally different side of the industry that I know nothing about.
Is there a more informative writeup somewhere? I couldn't find any data on performance outside AI inference workloads. There is a footnote about 30 cores but very little detail even on that.
Ha! I wouldn't hold my breath - indeed they are ramping up to use their own systems-on-chip (SoCs) - not just CPUs - in future Macs.
If they deliver a significant increase in performance - even if it's only for a few specific use cases - the ripples will be interesting to watch play out for decades to come.
>"With hardware co-optimized for Red Hat OpenShift, IBM POWER10-based servers will deliver the future of the hybrid cloud when they become available in the second half of 2021."
Can someone say what "co-optimized" means here? Is this just bad marketing speak? If so, what is it intended to mean?
I had really high hopes for a CPU with native float128 (quad precision) support with POWER9, but after testing it turned out to be native only for addition and multiplication ops. We'll see what the new generation brings to the table.
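If anyone wants to repeat that kind of test, here's a rough micro-benchmark sketch; it assumes GCC on a POWER9/POWER10 target with the __float128 extension enabled (compile with -mfloat128), and the loop counts and values are arbitrary rather than a rigorous benchmark:

    #include <stdio.h>
    #include <time.h>

    /* Crude quad-precision timing: compare add, mul and div throughput.
       __float128 and -mfloat128 are GCC extensions on POWER targets. */
    static __float128 add_op(__float128 a, __float128 b) { return a + b; }
    static __float128 mul_op(__float128 a, __float128 b) { return a * b; }
    static __float128 div_op(__float128 a, __float128 b) { return a / b; }

    static double bench(__float128 (*op)(__float128, __float128)) {
        __float128 acc = 1.0, x = 1.0000001;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < 10 * 1000 * 1000L; i++)
            acc = op(acc, x);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        volatile double sink = (double)acc;   /* keep the loop from being optimized away */
        (void)sink;
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void) {
        printf("add: %.3f s  mul: %.3f s  div: %.3f s\n",
               bench(add_op), bench(mul_op), bench(div_op));
        return 0;
    }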
In this moment I do feel like Apple has tackled its goal of making computing more accessible. I just didn't realize IBM wouldn't change their playbook.
I used to work on an IBM AIX system as a data warehouse developer. That was on Power architecture but unfortunately I didn't know much about this architecture back then.
- They leapfrogged everyone else with PCIe v5 and DDR5
- 1 TB/s memory bandwidth, which is comparable to high-end NVIDIA GPUs, but for CPUs
- Socket-to-socket interconnect is 1 TB/s also.
- 120 GB/s/core L3 cache read rate sustained.
- Floating point rate comparable to GPUs
- 8-way SMT makes this into a hybrid between a CPU and a GPU in terms of the latency hiding and memory management, but programmable exactly like a full CPU, without the limitations of a GPU.
- Memory disaggregation similar to how most modern enterprise architectures separate disk from compute. You can have memory-less compute nodes talking to a central memory node!
- 16-socket glueless servers
- Has instructions for accelerating gzip.
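On that last point: from the application side the gzip acceleration is (as I understand it) reached through an ordinary zlib-style API, so plain zlib code like the sketch below should run unchanged; the assumption here is that a hardware-backed zlib (e.g. IBM's libnxz, preloaded or linked in place of software zlib) transparently provides the acceleration:

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    /* Plain zlib usage; any hardware acceleration would come from a
       zlib-compatible library being substituted underneath. Link with -lz. */
    int main(void) {
        const char *msg = "The quick brown fox jumps over the lazy dog. ";
        unsigned char src[4096] = {0}, dst[8192];
        for (size_t i = 0; i + strlen(msg) < sizeof(src); i += strlen(msg))
            memcpy(src + i, msg, strlen(msg));        /* build a compressible buffer */

        uLongf dst_len = sizeof(dst);
        int rc = compress2(dst, &dst_len, src, sizeof(src), Z_DEFAULT_COMPRESSION);
        if (rc != Z_OK) {
            fprintf(stderr, "compress2 failed: %d\n", rc);
            return 1;
        }
        printf("%zu bytes -> %lu bytes\n", sizeof(src), (unsigned long)dst_len);
        return 0;
    }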