Next-Generation IBM POWER10 Processor (ibm.com)
377 points by mbrobbel on Aug 17, 2020 | 260 comments


There's some great info in these slides: https://regmedia.co.uk/2020/08/17/ibm_power10_summary.pdf

- They leapfrogged everyone else with PCIe v5 and DDR5

- 1 TB/s memory bandwidth, which is comparable to high-end NVIDIA GPUs, but for CPUs

- Socket-to-socket interconnect is 1 TB/s also.

- 120 GB/s/core L3 cache read rate sustained.

- Floating point rate comparable to GPUs

- 8-way SMT makes this into a hybrid between a CPU and a GPU in terms of the latency hiding and memory management, but programmable exactly like a full CPU, without the limitations of a GPU.

- Memory disaggregation similar to how most modern enterprise architectures separate disk from compute. You can have memory-less compute nodes talking to a central memory node!

- 16-socket glueless servers

- Has instructions for accelerating gzip.


I appreciate the slides, but you've got some critical errors.

> - Floating point rate comparable to GPUs

Nowhere close. POWER10 caps out at 60x SMT4, and only has 2x 128-bit SIMD units per SMT4. That's 480 FLOPs per clock cycle. At 4 GHz, that's only 1.9 TFlops of single-precision compute.

An NVidia 2070 Super ($400 consumer GPU) hits 8.2 TFlops with 448 GB/s bandwidth.

> - 8-way SMT makes this into a hybrid between a CPU and a GPU in terms of the latency hiding and memory management, but programmable exactly like a full CPU, without the limitations of a GPU.

Note that 8-way SMT is the "big core", and I'm almost convinced that SMT8 is more about licensing than actual scaling. By supporting 8 threads per core (and doubling the size of the core), you get a 30-core system that supports 240 threads. That's a licensing hack, since most enterprise software is priced per core.

I'd expect SMT4 to be more popular with consumers (i.e. Talos II sized systems), similar to how POWER9 was.


> Note that 8-way SMT is the "big core", and I'm almost convinced that SMT8 is more about licensing than actual scaling. By supporting 8 threads per core (and doubling the size of the core), you get a 30-core system that supports 240 threads. That's a licensing hack, since most enterprise software is priced per core.

I don't think that's entirely fair: Massively multi-user transactional applications (read: databases) are right in the wheelhouse of POWER, and they're exactly the kind of applications that benefit most from SMT. Lots of opportunities for latency hiding as you're chasing pointers from indexes to data blocks.


While that's a fair point, databases are also among the most costly software that is paid per core. So the SMT8 system will have half the per-core licensing costs of an SMT4 system (because an SMT8 system has half the "cores" of an equivalent SMT4 system, even though an SMT8 core is essentially two SMT4 cores grafted together).


> That's a licensing hack: since most enterprise software is paid per-core.

Shit. That's actually awesome and could easily pay for itself.


Until the licensing terms are updated.


Sure, long term maybe you have to pivot or whatever. But companies spend millions on these multicore licenses today.


I think you’ll find that the licensing terms can change faster than you can re-engineer your systems.



Oracle charges a core factor of 1.0 for any POWER9 core, which means that SMT8 cores (which are twice as big as SMT4 cores) get you more performance with less licensing cost.

So what would you rather buy? A 24-core SMT4 Nimbus chip, or a 12-core SMT8 Cumulus chip?

The two chips have the same number of execution units. They both support 96 threads. But the 12-core SMT8 Cumulus chip will have half the license costs from Oracle.

-------

For DB2...

The SMT8 model (E950) only has a 100x multiplier for DB2, while SMT4 models have a 70x multiplier. So you're paying roughly 43% more per core on the E950 despite getting twice the execution resources per core, which works out to about 30% less in total license units for the same compute.

Even the top end SMT8 model (E980) has a 120x multiplier. So you're still saving big bucks on licensing.
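To make the licensing arithmetic concrete, here's a rough sketch using the multipliers quoted above (the core counts and license units are illustrative only; the real figures come from IBM's PVU tables and Oracle's core factor table):

    # Illustrative "same execution resources, fewer cores" licensing arithmetic.
    smt4_cores, smt4_unit = 24, 70    # scale-out SMT4 box, 70 license units per core
    smt8_cores, smt8_unit = 12, 100   # E950-class SMT8 box, 100 license units per core

    smt4_total = smt4_cores * smt4_unit   # 1680 units
    smt8_total = smt8_cores * smt8_unit   # 1200 units
    print(f"SMT4: {smt4_total} units, SMT8: {smt8_total} units")
    print(f"SMT8 saves {1 - smt8_total / smt4_total:.0%} in total license units")  # ~29%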


The point is that a POWER core costs twice as much as an Intel/AMD core so SMT8 doesn't save as much money as you might think.

What would you rather buy? A 24-core x86 or a 12-core SMT8 POWER?


The 12-core POWER10 will have 120 MB of L3 cache and 1,000 GB/s of main-memory bandwidth with 96 threads.

That's unified L3 cache by the way, none of the "16MB/CCX with remote off-chip L3 caches being slower than DDR4 reads" that EPYC has.

Intel Xeon Platinum 8180 only has 38.5 MB of L3 across 28 cores / 56 threads. With six memory channels at DDR4-2666 (21.3 GB/s per channel), that's about 128 GB/s of bandwidth.

AMD EPYC really only has 16 MB of L3 per CCX (because the other L3 cache is "remote" and slower than DDR4). With eight memory channels at DDR4-2666, we're at about 170 GB/s of bandwidth across 64 threads.

If we're talking about "best CPU for in-memory database", it's pretty clear that POWER9 / POWER10 is the winner. You get the fewest cores (license-cost hack) with the highest L3 and RAM bandwidths, with the most threads supported.
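For reference, the bandwidth figures above are just the channel count multiplied by the per-channel DDR4-2666 rate; a quick sketch (per-socket channel counts assumed, numbers rounded):

    # Peak DRAM bandwidth = channels * transfers/s * 8 bytes per transfer.
    def peak_gbs(channels, mt_per_s=2666e6):
        return channels * mt_per_s * 8 / 1e9

    print(f"Xeon 8180, 6 channels: {peak_gbs(6):.0f} GB/s")  # ~128 GB/s
    print(f"EPYC,      8 channels: {peak_gbs(8):.0f} GB/s")  # ~171 GB/s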

--------

On the other hand, x86 has far superior single-threaded performance, far superior SIMD units, and is generally cheaper. For compute-heavy situations (raytracing, H265 encoding, etc. etc.) the x86 is superior.

But as far as being a thin processor supporting as much memory bandwidth as possible (with the lowest per-core licensing costs), POWER9 / POWER10 clearly wins.

And again: those SMT8 cores are no slouch. They can handle 4-threads with very little slowdown, and the full 8-threads only has a modest slowdown. They're designed to execute many threads, instead of speeding up a single thread (which happens to be really good for databases anyway, where your CPU will spend large amounts of time waiting on RAM to respond instead of computing).


> Nowhere close. POWER10 caps out at 60x SMT4, and only has 2x 128-bit SIMD units per SMT4. That's 480 FLOPs per clock cycle. At 4 GHz, that's only 1.9 TFlops of single-precision compute.

Too late for me to edit this: but POWER10 has a full 128-bit vector unit per slice now. So one more x2, for 3.8 TFlops single-precision on 30x SMT8 or 60x SMT4.

So I was off by a factor of 2x in my earlier calculation. Power10 has a dedicated matrix-multiplication unit, but I consider a full matrix-multiplication unit to be highly specialized (comparable to a TPU or Tensor core), so it's not really something to compare flop-per-flop except against other matrix-multiplication units.
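For anyone checking the numbers, the corrected estimate works out as below (the per-core unit count is my reading of the comments above, the 4 GHz clock is an assumption, and FMA is counted the same way as in the original estimate):

    # Rough peak-FP32 estimate for a hypothetical 60-core SMT4 POWER10 at 4 GHz.
    cores = 60
    vec_units = 4      # 128-bit vector units per SMT4 core (the corrected count)
    fp32_lanes = 4     # 128 bits / 32 bits per lane
    clock_hz = 4e9

    tflops = cores * vec_units * fp32_lanes * clock_hz / 1e12
    print(f"~{tflops:.1f} TFLOPS FP32")  # ~3.8, vs ~8.2 quoted for an RTX 2070 Super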


There are also the 512-bit matrix units, which would be more like 3840 FLOP/cycle (single precision) for workloads that can be expressed that way.


I'm not sure if that plays to the advantage of POWER10.

NVidia 2070 Super also has FP16 matrix-multiplication units, achieving 57 FP16-matrix TFLOPs. These are the NVidia "Tensor Cores".

Ampere (rumored to be released within a few weeks...) even has sparse matrix-multiplication (!!) units being added in.


Having a Talos board myself, it seems like if you run Linux you only get SMT4. SMT8 is reserved for IBM.


The Talos computers are only compatible with SMT4 chips (aka: Sforza chips). You can't buy an SMT8 chip for Talos.

SMT8 chips are also known as "Scale up" chips. The Summit supercomputer was made with SMT4 chips by the way.

I never used an SMT8, but from the documents... it's really similar to two SMT4 cores working together (at least on Power9). SMT8 chips use a completely different core than SMT4 chips, with double the execution resources, double the decoder width, double everything. SMT8 is a really, really fat core.


The scale-up chips (used in the E950 and E980 only) aren't the only SMT8 chips. The scale-out chips (Sforza, Monza, LaGrange) can be fused as SMT4 or SMT8, but IBM doesn't appear to sell SMT8-fused chips for other parties.

If a given IBM server runs PowerVM, it's SMT8. You may find this table of mine helpful (assembled from various sources and partially inferred, so accuracy not guaranteed, but it represents my understanding): https://www.devever.net/~hl/f/SERVERS


> - Floating point rate comparable to GPUs

Where do you get the FP performance, exactly, and for what value of "FP"? It's unclear to me in the slides from El Reg, which appear to be about the MMA specifically, and it's not clear what the SIMD units actually are. (I don't know if that's specified by the ISA.)

One thing is that it presumably has a better chance of keeping the SIMD fed than some.


> - Has instructions for accelerating gzip.

Has there been any research into how much these custom instructions actually accelerate things in practice?

There are a lot of these instructions in modern CPUs for AES, gzip, SSL, etc but you have to have a library coded to support them.

I'm curious what the performance improvements look like in practice.


Glad you asked!

I recently did a bunch of tests to see what the "ultimate bottlenecks" are for basic web applications. Think latency to the database and AES256 throughput.

Some rough numbers:

- Local latency to SQL Server from ASP.NET is about 150 μs, or about 6000 synchronous queries per second, max.

- Even with SR-IOV and Mellanox adapters, that rises to 250 μs if a physical network hop is involved.

- Typical networks have a latency floor around 500-600 μs, and it's not uncommon to see 1.3 ms VM-to-VM. Now we're down to 800 queries per second! (See the quick arithmetic sketch after this list.)

- Similarly, older CPUs struggle to exceed 250 MB/s/core (2 Gbps) for AES256, which is the fundamental limit to HTTPS throughput for a single client.

- Newer CPUs, e.g.: AMD EPYC or any recent Intel Xeon can do about 1 GB/s/core, but I haven't seen any CPUs that significantly exceed that. That's not even 10 Gbps. If you have a high-spec cloud VM with 40 or 50 Gbps NICs, there is no way a single HTTPS stream can saturate that link. You have to parallelise somehow to get the full throughput (or drop encryption).

- HTTPS accelerators such as F5 BIG IP or Citrix ADC (NetScaler) are actually HTTPS decelerators for individual users, because even hardware models with SSL offload cards can't keep up with the 1 GB/s from a modern CPU. Their SSL cards are designed for improving the aggregate bandwidth of hundreds of simultaneous streams, and don't do well at all for a single stream, or even a couple of concurrent streams. This matters when "end-to-end encryption" is mandated, because back end connections are often pooled. So you end up with N users being muxed onto just one back-end connection which then becomes the bottleneck.
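As mentioned in the list above, the query-rate ceilings follow directly from the round-trip latencies; a tiny sketch of that arithmetic:

    # Max synchronous queries/sec over one connection is roughly 1 / round-trip latency.
    for label, latency_us in [("local", 150), ("SR-IOV hop", 250), ("typical VM-to-VM", 1300)]:
        print(f"{label}: ~{1e6 / latency_us:,.0f} queries/sec max")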


RTT between VMs in the same GCP zone is often well below 100us. Are you also measuring the latency of the query itself? I realize not all networks are built the same, but it seems like your benchmark case is much worse than what's possible, even without buying specialized hardware.


The test was to run "SELECT 1" using the low-level ADO.NET database query API in a tight loop. This is the relevant metric, as it represents the performance ceiling. It doesn't matter how fast the packets can get on the wire if the application can't utilise this because of some other bottleneck.
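The original loop was ADO.NET; a rough Python equivalent using pyodbc (the connection string and driver name are placeholders for a hypothetical local SQL Server instance) would look like this, if you want to reproduce the measurement:

    # Measure round-trip latency of a trivial query in a tight synchronous loop.
    import time
    import pyodbc

    conn = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};"
                          "Server=localhost;Trusted_Connection=yes;")
    cur = conn.cursor()

    n = 10_000
    start = time.perf_counter()
    for _ in range(n):
        cur.execute("SELECT 1").fetchone()
    elapsed = time.perf_counter() - start
    print(f"{elapsed / n * 1e6:.0f} us per query, {n / elapsed:,.0f} queries/sec")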

Of course, the underlying TCP latency is significantly lower. Using Microsoft's "Latte.exe" testing tool, I saw ~50 μs in Azure with "Accelerated Networking" enabled. As far as I know, they use Mellanox adapters.

Something I found curious is that no matter what I did, the local latency wouldn't go below about 125 μs. Neither shared memory nor named pipes had any benefit. This is on a 4 GHz computer, so in practice this is the "ultimate latency limit" for SQL Server, unless Intel and AMD start up the megahertz war again...

It would be an interesting exercise comparing the various database engines to see what their latency overheads are, and what their response time is to trivial queries such as selecting a single row given a key.

Unfortunately, due to the DeWitt clauses in EULAs, this would be risky to publish...


> Newer CPUs, e.g.: AMD EPYC or any recent Intel Xeon can do about 1 GB/s/core, but I haven't seen any CPUs that significantly exceed that.

Modern processors can do ~2.1GB/s per core for AES-256-GCM (about 2x what they can do for AES-CBC).


It should be noted that AES-CBC doesn't have any instruction-level parallelism available.

Both EPYC and Xeon have two AES units per core now. But CBC can only effectively use one of them at a time (block n+1 cannot be computed until block n is finished, because block n's ciphertext is an input to block n+1 in CBC mode).

AES-256-GCM can compute block(n) and block(n+1) simultaneously. So you need to use such a parallel algorithm if you actually want to use the 2x AES pipelines on EPYC or Xeon.
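If you want to see the difference yourself, here's a minimal single-core sketch using Python's 'cryptography' package (which calls into OpenSSL and uses AES-NI where available); exact numbers will vary by CPU:

    # Rough AES-256-CBC vs AES-256-GCM single-stream throughput comparison.
    import os, time
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key = os.urandom(32)
    data = os.urandom(64 * 1024 * 1024)  # 64 MiB buffer

    def throughput(mode):
        enc = Cipher(algorithms.AES(key), mode).encryptor()
        start = time.perf_counter()
        enc.update(data)
        enc.finalize()
        return len(data) / (time.perf_counter() - start) / 1e9  # GB/s

    print("CBC:", round(throughput(modes.CBC(os.urandom(16))), 2), "GB/s")  # serial block chain
    print("GCM:", round(throughput(modes.GCM(os.urandom(12))), 2), "GB/s")  # parallelisable blocks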


More like ~3.9 GB/s for a desktop CPU and somewhere north of 7 GB/s for AES-256-CTR. The performance difference between CBC and CTR is pretty much exactly what you'd expect: you can only use one of the two units, and you have to wait out the full latency (a 1:4 disadvantage on recent cores) => ~8x slower.


Given that server processors have up to 64 cores now, it sounds like the NIC (even 200 Gbps) is squarely the bottleneck.


Not for any single user or connection. High capacity doesn't equal low latency, and scalability to many users doesn't necessarily help one user get better performance!

Very often, you'll see n-tier applications where for some reason (typically load-balancers), the requests are muxed into a single TCP stream. In the past, this improved efficiency by eliminating the per-connection overhead.

Now, with high core counts and high bandwidths, some parallelism is absolutely required to even begin to approach the performance ceiling. If the application is naturally single-threaded, such as some ETL jobs, these single-stream bottlenecks are very difficult to overcome.

In the field, I very often see 8-core VMs with a "suspicious" utilisation graph flatlining at 12.5% because of issues like this. It boils my blood when people say that this is perfectly fine, because clearly the server has "adequate capacity". In reality, there's a problem, and the server is at 100% of its capacity and the other 7 cores are just heating the data centre air.


Funny. 800 queries per second is pretty much exactly where my SQL Server applications are running.


You and pretty much every modern data centre application, irrespective of the technology stack. The typical 1.3ms latency I see kills performance across the board, but very few people are aware of the magnitude of the issue.


My workaround is to add more to each query. Instead of doing inserts, I do bulk copy statements with thousands of rows. Not particularly fun.


Yeah, I had the same thought. Also gzip has been around for many decades and newer lossless compression algorithms are way better both in terms of speed (comparing with the same hardware) and compression ratio, e.g. LZ4 algorithm and Zstandard.

It just doesn't make sense for the HW engineers and chip designers to optimize backwards for older software algorithms instead of optimizing forward for newer ones.


This depends on the prevalence of the different algorithms. If, let's say, gzip is 90% of the decompression load, then it would make good sense.


One extremely common workload that does frequent gzip compression is HTTP serving, because gzip is the standard compression method for HTTP.


GZip is a widely supported compression type for HTTP connections. It makes complete sense to have it be hardware accelerated.


Zstandard is not actually that fast!

I recently benchmarked it versus the ancient "DEFLATE" algorithm, as seen in the .NET Framework.

They're virtually identical in speed using typical settings.

Zstandard only has a performance advantage if using the prepared dictionaries, in which case it is genuinely faster.

If you don't go to the effort of "training" a dictionary and using it, there's zero benefit to Zstandard.
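For anyone who wants to re-run this kind of comparison, here's a rough sketch using the zlib module and the third-party 'zstandard' package (the payload file is a placeholder, and results depend heavily on the data and levels chosen):

    # Quick DEFLATE (zlib) vs Zstandard comparison on the same payload.
    import time, zlib
    import zstandard as zstd

    data = open("sample.json", "rb").read()  # hypothetical test payload

    def bench(name, compress):
        start = time.perf_counter()
        out = compress(data)
        dt = time.perf_counter() - start
        print(f"{name}: ratio {len(out)/len(data):.3f}, {len(data)/dt/1e6:.0f} MB/s")

    bench("zlib level 6", lambda d: zlib.compress(d, 6))
    bench("zstd level 3", zstd.ZstdCompressor(level=3).compress)

    # Dictionary mode really wants a corpus of many small, similar samples to train on.
    samples = [data[i:i + 4096] for i in range(0, len(data), 4096)]
    d = zstd.train_dictionary(112_640, samples)
    bench("zstd + dict", zstd.ZstdCompressor(level=3, dict_data=d).compress)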


IBM tech is for large old heavy-footed enterprises. I expect much GZIP, less Zstandard there.


How specific are those "gzip instructions", though? Are there newer algorithms that could benefit from some of the same acceleration primitives, even if they do higher-level things differently?


They're not instructions; it's a dedicated hardware unit. Most of the area appears to be devoted to finding matches so it should be possible to add other LZ-style algorithms without much additional area.


How is the single-thread performance? Intel Xeon was always faster (thanks to competition from AMD Opteron...), but maybe that's changed with Intel's recent problems...


When Phoronix tested POWER9 SMT4 a while back, single-thread performance seemed disappointing at first glance.

But it seems to be made up for by their implementation of SMT4. The 2nd thread on a core didn't cause much slowdown at all, and threads 3 and 4 barely affected performance.

It seems like POWER9, at least, benefits from running significantly more threads per core (at least compared to Xeon or AMD).

EDIT: It should be noted that IBM's 128-bit vector units are downright terrible compared to Intel's 512-bit or AMD's 256-bit vector units. SIMD compute is the weakest point of the Power9, and probably the Power10. They'll be the worst at multimedia performance (or other code using SIMD units: Raytracing, graphics, etc. etc.).

Power9's best use case was highly-threaded 64-bit code without SIMD. Power10 looks like the SIMD units are improving, but they're still grossly undersized compared to AMD or Intel SIMD units.


Sounds like IBM is not wasting area and power on out of order scheduling to find independent instructions within one thread. If you're running a lot of threads anyway, you get more independent instructions to work with for free!


When in SMT4 mode, various hardware resources are "partitioned off" in Power9.

The first and third threads use the "Left Superslice", while the second and fourth threads use the "Right Superslice". All four threads share a decoder (Bulldozer-style).

1/4th of the branch predictor (EAT) is given to each of the 4x threads per core.

Register rename buffers are shared two threads at a time (two threads use the "left superslice", the other two use the "right superslice"). In SMT1 mode, the single thread can use all of these resources simultaneously.

A lot of the out-of-order stuff looks like it'd work as expected in 1-thread to 4-thread modes. At least, looking through the Power9 user guide / in theory.

--------

Honestly, I think the weirdest thing about POWER9 is the 2-cycle minimum latency (even on simple instructions like ADD and XOR). With that kind of latency, I bet that a number of inner-loops and code needs 2-threads loaded on the core, just to stay fully fed.

That'd be my theory for why 2-threads seem to be needed before POWER9 cores feel like they're being utilized well.

Obviously, POWER10 probably will change some of these details. But I'd expect POWER10 to largely be the same as POWER9 (aside from being bigger, faster, more efficient).


I am excited. But also sad that most of us won't ever get to play with it. Unlike Intel and AMD's Xeon and EPYC, getting hold of POWER doesn't seem like an easy task unless the corporation you work for has very specific needs.

Edit: Turns out there is a major section on it below.


Good to see they're keeping pace. For some reason I keep cheering for the old beast but I don't see how they can keep up with the big 5


Who are the Big 5? Googling turns up no definitive list.


“GAFAM A more inclusive grouping referred to as GAFAM or "Big Five", defines Google, Amazon, Facebook, Apple, and Microsoft as the tech giants.[18][19][20][21] Besides Saudi Aramco, the GAFAM companies are the five most valuable public corporations in the world as measured by market capitalization.[3] Nikos Smyrnaios justified the GAFAM grouping as an oligopoly that appears to take control of the Internet by concentrating market power, financial power and using patent rights and copyright within a context of capitalism.[22]“

https://en.m.wikipedia.org/wiki/Big_Tech


Those wouldn't be IBM's top competitors in semiconductors would they? I don't think Facebook or Amazon or Microsoft is spending more than 1% of their R&D budget on semiconductors.


Does this mean GPUs might become less relevant in the future if CPUs can perform similar tasks at same speed?


CPU-SIMD is less about competing against GPUs and more about latency.

GPUs will always have more GFlops and memory bandwidth at a lower cost. They're specifically built GFlop and memory-bandwidth machines. Case in point: the NVidia 2070 Super is 8 TFlops of compute at $400, a tiny fraction of what this POWER10 will cost.

If POWER10 costs anything like POWER9, we're looking at well over $2000 for the bigger chips and $1000 for reasonable multisocket motherboards. And holy moly: 602mm^2 at 7nm is going to be EXPENSIVE. EDIT: I'm only calculating ~2 TFlops from the hypothetical 60x SMT4 Power10 at 4GHz. That's nowhere close to GPU-level FLOPs.

However, the CPU-GPU link is slow in comparison to CPU-L1 cache (or even CPU-DDR4 / DDR5). A CPU can "win the race" by using its onboard SIMD, completing your task before it has even spent the ~5 microseconds needed to communicate with the GPU.
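A back-of-envelope way to see where the crossover sits, with rough placeholder numbers for launch overhead, PCIe bandwidth, and throughput (a compute-heavy kernel is assumed; the exact figures will vary by system):

    # When does offloading to the GPU beat staying on the CPU's SIMD units?
    launch_s = 5e-6              # assumed CPU->GPU launch/sync overhead
    pcie_bytes_per_s = 15e9      # ~15 GB/s effective on PCIe 3.0 x16
    cpu_flops, gpu_flops = 2e12, 8e12
    flop_per_byte = 1000         # compute-heavy kernel, e.g. a large matmul

    def cpu_time(nbytes):
        return nbytes * flop_per_byte / cpu_flops

    def gpu_time(nbytes):
        return launch_s + 2 * nbytes / pcie_bytes_per_s + nbytes * flop_per_byte / gpu_flops

    for nbytes in (1e4, 1e6, 1e8):
        print(f"{nbytes:.0e} B: CPU {cpu_time(nbytes)*1e6:9.1f} us, GPU {gpu_time(nbytes)*1e6:9.1f} us")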

----------

With that being said: POWER10 also implements PCIe 5.0, which means it will be one of the fastest processors for communicating with future GPUs.


It's more appropriate to compare the pricing of a Tesla with a datacenter-grade CPU like POWER10 (or EPYC/Xeon/etc.).

A64FX (in Fugaku, the current #1 machine on all popular supercomputing benchmarks) has shown that CPUs can compete with top-shelf GPUs on bandwidth and floating point energy efficiency.


Fugaku has 158,976 nodes x2 chips each, or 317,952 A64FX chips.

Summit has 4,608 nodes x 6 GPUs each, or 27,648 V100 GPUs. It also was built back in 2018.

---------

While Fugaku is certainly an interesting design, it seems inevitable that a modern GPU (say A100 Amperes) would crush it in FLOPs. Really, Fugaku's most interesting point is its high rate of HPCG, showing that its interconnect is hugely efficient.

Per-node, Fugaku is weaker. They built an amazing interconnect to compensate for that weakness. Fugaku is also an HBM-based computer, meaning you cannot easily add or remove RAM (unlike a CPU/GPU system, which can be configured with more or less RAM by adding or removing sticks).

These are the little differences that matter in practice. A64FX is certainly an accomplishment, but I wouldn't go so far as to say it's proven that CPUs can keep up with GPUs in terms of raw FLOPs.


A100 has a 20% edge on energy efficiency for HPL, along with higher intrinsic latencies. It's also 6-12 months behind A64FX in deployment. https://www.top500.org/lists/green500/2020/06/

HPCG mostly tests memory bandwidth rather than interconnect, but Fugaku does have a great network.

Adding DRAM to a GPU-heavy machine has limited benefit due to the relatively low bandwidth to the device. They're effectively both HBM machines if you need the ~TB bandwidth per device (or per socket).

Normalizing per node (versus per energy or cost) isn't particularly useful unless your software doesn't work well with distributed memory.


> Adding DRAM to a GPU-heavy machine has limited benefit due to the relatively low bandwidth to the device. They're effectively both HBM machines if you need the ~TB bandwidth per device (or per socket).

This POWER10 chip under discussion has 1 TB/s of bandwidth to devices with expandable RAM.

Yeah, I didn't think it was possible. But... congrats to IBM for getting this done. Within the context of this hypothetical POWER10, 1 TB/s of interconnect bandwidth to expandable RAM is on the table.


It's 410 GB/s peak for DDR5. The "up to 800 GB/s sustained" is for GDDR6 and POWER10 isn't slated to ship until Q4 2021 so it isn't really a direct comparison with hardware that was deployed in 2019.


IIRC, the new memory connection they introduced with the latest POWER9 chips traded bandwidth and physical distance for a bit of latency.


Slides posted earlier in this thread say the OMI (the memory interface) adds about 10ns compared to straight DDR5.


Your point about latency is the reason why we can't use GPUs for realtime audio processing, even though they would be absolutely otherwise well-suited. Stuff like computing FFTs and convolution could be done much faster on GPUs, but the latency would be prohibitive in realtime audio.


Hmmm...

GPU latency is ~5 microseconds per kernel, plus the time it takes to transfer data into and out of the GPU (~15 GB/s on PCIe 3.0 x16 lanes). Given that audio is probably less than 1 MB/s, PCIe bandwidth won't be an issue at all.

Any audio system based on USB controllers has on the order of 1000 microseconds of latency (1,000,000 microseconds per second / 1000 USB updates per second = 1000 microseconds per update). Let's assume that we have a hard realtime cutoff of 1000 microseconds.

While GPU latency is an issue for tiny compute tasks, I don't think it's actually big enough to make a huge difference for audio applications (which usually use standard USB controllers at 1 ms specified latency).

I mean, you only have room for 200 CPU-GPU transfers (5 microseconds per CPU-GPU message), but if the entire audio calculation were completed inside the GPU, you wouldn't need more than one message there and one message back.
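A quick budget sketch, assuming a 48-frame buffer at 48 kHz (which matches the 1000 μs cutoff above), the ~5 μs launch cost, and a stereo float32 stream; all of these are illustrative numbers:

    # Back-of-envelope latency budget for doing one audio buffer's work on a GPU.
    buffer_frames, sample_rate = 48, 48_000
    deadline_us = buffer_frames / sample_rate * 1e6          # 1000 us per buffer

    launch_us, transfers = 5, 2                              # one upload, one download
    bytes_per_buffer = buffer_frames * 2 * 4                 # stereo float32
    copy_us = bytes_per_buffer / 15e9 * 1e6                  # ~15 GB/s PCIe 3.0 x16

    overhead_us = transfers * launch_us + copy_us
    print(f"deadline {deadline_us:.0f} us, GPU transfer/launch overhead ~{overhead_us:.1f} us")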


What's the latency threshold for "real time" when it comes to audio processing?


Depends on the application. Your brain can notice the delay when playing something like digital drums after ~10-15 ms. Something with less 'attack' like guitar has a bit more wiggle room, and ambient synths even more so.

edit: also, vocals are most latency-sensitive if you are the one singing and hearing it played back at the same time.


Recording instruments while monitoring them starts becoming really difficult after ~10ms. 2-8ms is ideal in my experience


> They're specifically built GFlop and memory-bandwidth machines

That would indicate their theoretical peak performance is higher. Unless you can line up your data so that the GPU will be processing it all the time in all its compute units without any memory latency, you won't get the theoretical peak. In those cases, it's perfectly possible a beefy CPU will be able to out-supercompute a GPU-based machine. It's just that some problems are more amenable to some architectures.


For AI/ML and scientific computations? Maybe.

For games and graphics? I don't think so, a GPU has a ton of dedicated hardware that is very costly to simulate in software: triangle setup, rasterizers, tessellation units, texture mapping units, ROPs... and now even raytracing units.


In a sense yes, and this has already happened. It's just that GPUs are under lock and key just like early processors were. There are interesting development leaks you can find, like NVIDIA cards supporting USB connectors on some models, implying you can use just the GPU as a full computer.

Many ARM devices boot from GPU.


They might be focusing on efficient inference, that's the main use case in ML.


CPUs and GPUs have been converging for a long time.


It leaves me salivating


It’s very cool, but unfortunately inaccessible for those without sky high budgets and time to talk to sales reps.

I would happily experiment with one of these at our HPC cluster (a small group at a university), but the idea of talking to a sales rep just to figure out what it would cost puts me off completely, never mind the licenses for most of the interesting things to do with the hardware you buy.

I wish POWER boxes were as easy to buy as x86 boxes. Simply configure, get an idea of the price, talk to the distributor and place an order.


I worked with a financial trading firm that was interested in evaluating a POWER system. They called IBM, who arranged a meeting at our offices.

From memory, about 8 IBM people showed up. They didn't seem to actually know each other, but were from several different groups within IBM.

We sat down, and started by explaining what we did with our existing x86-64 systems, and what we thought we'd like to try with the POWER system. We asked for a single box to evaluate, roughly the equivalent to our existing HP DL380 dual-Xeon boxes.

The folks from IBM then spent the next 40 minutes arguing with each other about exactly which system we should be using. Five minutes before the meeting was scheduled to end, one of them took charge and said they'd figure it out offline, and get back to us with the details.

Several more rounds of email were exchanged, but we never actually got to the point of being told what system we could have, or what its specs were, let alone actually being able to physically get one and test it.

It was perhaps the most absurd situation I've seen in 30 years in the industry.


There used to be a joke that IBM sales reps don't have children because all they do is sit on the bed and tell their spouses how good it's going to be.


That's really amazing.

Rather more than 20 years ago, a DEC/COMPAQ salesperson cold-called me to see if the ISP I was working at would like to switch to Alpha servers. After about 20 minutes, he offered a six-month free loan of a mid-range server -- probably $10-15K, I don't recall. It arrived a week later. We determined the hardware was pretty nice but the operating system was a major PITA -- this was when you expected to compile a large fraction of your software -- so it mostly sat on someone's desk for the last four months of the loan before the salesperson came to claim it.


Compaq also used to provide publicly-available "testdrive" Alpha servers that you could just shell into. This was way before the idea of cloud computing was mainstream. A lot of gnu+linux/alpha development happened thanks to testdrive.compaq.com (which is dead now, of course).


This was before the Alpha port of Linux happened; if it had been available instead of OSF/1, we would have done more with the machine.

But, yes: if you want to market hardware that nobody else can supply, you need to get it into the hands of people who will use it and evangelize for it.


In a 30+ year professional IT career I've had exactly three interactions with IBM, and every one of them went very much as you described. How this company stays in business I'll never understand.


Big IT contracts and their VAR doing the actual selling.

[edit]We bought our new iSeries (Power9-based) through a VAR this year with an IBM rep helping us with what we actually needed. It was a bit of a drawn-out experience, but overall, I wasn't displeased. It was easier than dealing with some PC vendors (looking at you Dell and HP). I would imagine it will be another 15 years before we have to buy another one.[/edit]


Bingo. Going to an IBM VAR/Partner is really the only decent/sane way to navigate the IBM sales bureaucracy unless you're talking about US$(high 7-figure plus) orders. It's also really the best way to get the best price, because the VAR will work to figure out all the discounts and such a customer might be entitled to. On an IBM sales team, the left hand doesn't always know or care what the right hand is doing.

When I worked at IBM, I usually called in a favor from friends at a local VAR whenever I needed to order something to get a BoM, because even internally the process was opaque.


I had decent experience with an IBM reseller, back in 2005 or so. I forget the exact term. VAR? Solutions provider? Anyway, we were able to get the latest POWER system (POWER 6, I think) and evaluate our AIX-based app on it for about 2 weeks. They even set us up with a small SAN! Unfortunately, the reseller was not terribly skilled technically, so they were unable to configure the SAN. I had to figure that out myself.


A couple years ago, I worked at a company whose ERP system ran on an IBM System i (formerly known as AS/400). At a user conference, IBM had a table set up with a couple of sales guys and a pair of servers on the table.

The first one was the latest server that could run System i. No big deal, we were coming to the end of our maintenance contract and would probably be buying a new machine in the next year or so.

The second one was a 2U, 48-socket POWER server. They bragged about it running thousands of Linux VMs at a time. I found it a bit odd because nobody who would be running this particular ERP software would be running thousands of Linux VMs.


This is why VARs exist


> It’s very cool, but unfortunately inaccessible for those without sky high budgets and time to talk to sales reps.

I've spent some time looking to see if it would be possible to spin up a POWER-based VM in the cloud (just out of curiosity, really). While it seems possible in theory in the IBM Cloud, it seems IBM itself is only interested in offering this to enterprise customers moving to the cloud, focusing on moving existing AIX/iSeries lock-in to the cloud. When looking at it before, I was not able to spin up a POWER VM from a regular IBM Cloud account at least.

There might be a bit more interest in POWER if it wasn't so damn inaccessible but it really is. If the easiest way to get into POWER is paying thousands of dollars to a company retrofitting IBM's decidedly non-desktop hardware into desktop hardware (talking about Raptor CS), your architecture is doomed to wait for all your enterprise customers to move to amd64 commodity hardware. Maybe that is IBM's goal even, I don't know.


I think you can get Power systems on the cloud here https://cloud.ibm.com/catalog/services/power-systems-virtual...


Ah this seems to be new, thanks! That's exactly the kind of response I hoped I would get.

Unfortunately completing this flow - with the goal to get a temporary shell on a Power system running some Linux distro - requires me to authorize a payment of over 1300 dollars before I even select an image. That's for "reserving" 1 POWER core and 2GB of RAM... As an individual developer playing with this, that's way over my head. I'd understand very high end cloud pricing for that (tens of dollars per hour, I'd be happy to pay that for messing around a little), but this isn't even a cloud machine pricing model. For reference, $1300 is over half the price of getting a Talos motherboard with an 8-core POWER9 CPU you then actually own.

It seems IBM does not have the infrastructure or volume to permit cloud pricing here. I understand they might not have this, but in my opinion they need to work on this to make it accessible.


Completely agree. I've talked many times to some of the folks there and it seems to me the product is not for a developer or hobbyist, which is unfortunate. It is purely enterprise. I thought this would open up a parallel ecosystem for certain applications, but IBM does not seem to think that.


I just don't understand how IBM - being acutely aware of what happened to Itanium and SPARC + running a cloud platform themselves - still believes this notion of "enterprise hardware" is the way of the future. Surely IBM was/is even better entrenched in this area than HP and Sun/Oracle/Fujitsu, but come on. POWER has a lot of history in consumer accessible products even. This is just a recipe for fading into irrelevance ever more.


Sounds like a problem that should be easy to find on an org chart.

If you don’t reward good ideas (they don’t even need to be particularly good or novel, just common sense), you’ll have a company trying to grow something that’s reached its peak usage with motivational speeches, very inspiring leaders and good old pressure on employees as a form of local optimization.


It’s extremely easy to find. The problem is the org chart itself. The entire way they run this stuff is not setup to handle the kinds of things we expect (and get) from every other hardware vendor or cloud provider that could reasonably be considered a competitor to them in either category.


Mostly, but I think you could have an org chart and let individuals and teams override it if they find good ways to reuse assets.

I just don't think most companies run this way, maybe it's the military way of thinking of strict hierarchy rather than a market where the best ideas can develop or where you can at least break out of the hierarchy to get something started.

The MBAs and bean counters would probably argue that it's important to get certain unpopular things done, but I think a market would solve that too; as soon as something becomes a bottleneck, someone would step in.

My impression is that organizations are far too centered around VPs and directors who get to carry on without having to prove themselves again in the new situations they're in.


Beats me. Can't believe a company that was making processors for the most picky of all industries (gaming consoles) is now completely ignoring the everyday consumer. Org priorities and financial engineering it looks like.


They've made huge changes since those days, getting rid first of x86 laptops and desktops then servers then chip fab. None of those decisions ever seemed particularly wise to me, especially the last.


The most frustrating part is they made really good products (even if slightly expensive) but somehow didn't follow through on any of them.


What makes you think that IBM sales people and management cares about "the way of the future"?

They care about the quarterly report and right now the best way to improve short-term performance is to milk your existing locked-in customers for as much money as possible.


Because they've just announced the next entry into a series of processors that by all accounts can take the fight to Intel and AMD's best (and has done so for some time), which requires huge investments and long term planning, roadmaps and funding commitments to work. Seems pretty committed to long term thinking to me.


At a guess you are not the target audience and this up front commitment is their way of filtering you out.


If any management/exec thought of this filter, they are idiots. I'd imagine that if they made these as accessible as x86 servers to me - a humble sysadmin - and I could prove with data that our legacy Fortran scientific application runs faster on them than on the newest Intel servers (it most likely would, given the SMT differences, higher memory bandwidth, wider cache lines, more efficient 7 nm process, higher clocks, etc.), I'd recommend to my boss to spend our annual compute budget on these boxes instead of the Intel servers, because that one legacy application consumes a lot of our HPC compute capacity. My boss and I would happily pay IBM to give this critical application a boost. Too bad, IBM filtered me away.


Agreed. Lots of companies have this 'call our sales department' strategy prior to giving you any information at all. That's always been a great way to lose my business, and indirectly the businesses for which we consult. But that's perfectly ok with me. If a company is not willing to list their prices up front then that's a good indication that they are not competitive.

Besides that, some of the atrocities that I've seen IBM and their partners commit are a good warning that you want to stay far away from them. Lest you be Watsonized and made dependent on marketing-masquerading-as-technology.


The last time I checked (which was quite a while ago) you could configure and price at least some categories of POWER servers on IBM's website much like you can with Dell or any other x86 vendor. The big difference though was in the amount of sticker shock when you see the bottom line price.


Very much agree. When IBM acquired Red Hat I really hoped they were going to get serious about cloud. They still might, but at least for now me as an individual guy just can't play with their hardware. I admit I'm quite impatient and I get irritated when any amount of red tape blocks efficient allocation of my time, but I don't think (given how easy it is on other cloud platforms) it's too much to ask for if you want to be taken seriously as a public cloud offering.


There are other options (one is offline, being upgraded this week):

https://developer.ibm.com/linuxonpower/cloud-resources/

Each of these do some filtering because they give out resources only to find people bitcoin mining. Also, a good number of experimenters give up on the first obstacle they encounter or aren't really well-versed in benchmarking / architectural differences (lots of folks running microbenchmarks). There are some incredible resources available (many free) for those looking for a partner and not just a box. Good luck in finding the right partner for your projects.


Of course I'm not. My entire point is that filtering out average Janes like me is antithetical to the long-term interests of POWER, and extremely weird for a cloud platform that allows you to spin up an amd64 VM in a few seconds for a few cents. You're left with an audience that has to use POWER (either because legacy or their purchasing colleagues thought it somehow was a good idea), not an audience that actually wants to.


It speaks volumes to me and says that IBM knows perfectly well that they are not interested in cost conscious customers but only want those for whom the IT department is a cost center and not a core strategic asset.

That's why you'll never see a Netflix or a Whatsapp on IBM infra. But banks, insurance companies, medical companies etc are still large contributors to IBMs revenue streams. If your idea of software development is agile teams and capable programmers churning out code to power your business then you're not an IBM customer or even a prospect.

If your idea of software is 5000 programmers as interchangeable cogs in a machine with an annual release and three month acceptance cycles then IBM is where you'll probably end up.


I understand that is how IBM thinks. I just don't like that they only think this way, as it feels like spoilt potential. IBM has all the pieces in place - they have great technology with high performance, ppc64le actually has pretty good apparent support from mainstream Linux distros, infrastructure and languages (for a non x86/arm architecture), and they have full "creative liberty" of where they focus their platform. It would be an awesome option to have for cloud infrastructure, but they keep it all to themselves.

They could be using that to do what AWS does with Graviton2 - cost control their completely integrated stack and make it a competitive advantage. Sell more performance per dollar despite having a non-amd64 architecture. Use it to give everyone more choice and competition. But instead they mostly use it to lock in their old (or new?) enterprise customers. The irony is that these two models could easily coexist, but they don't seem to understand the former and understand the latter very well.

And I can't help but think the latter is a losing model, as my strong impression is that the movement in the enterprise is away from "enterprise hardware", and towards commodity hardware. IBM needs to work on becoming commodity, in my opinion.


Oh, you are totally right, it is spoiled potential. But they've been doing this for so long it is impossible for them to change. Hence the very long and slow slide to the bottom. I'd see the Mafia change their tune before IBM ever will, way too much institutional inertia.

I know some of the IBM story from very close and it takes a certain attitude to even want to work there.


What's funny about this is that it used to be true - Whatsapp started on Softlayer, and didn't move away until Facebook bought them.

If IBM had even just kept pace and stayed behind AWS with Softlayer after they acquired them, they would have a healthy cloud business by now. It might be because I maintained a system on Softlayer both pre and post acquisition - so I was really close and able to see what was happening - but they squandered a huge opportunity there.


Ah yes, that's true, they in fact were hosted there at some point. I couldn't have picked a worse example :) Or; in a way it is proof that IBM is a bad choice for companies that operate at scale and are low margin and data heavy. Hosting costs must have been a substantial fraction of operating costs for Whatsapp (obviously, long after personnel).


To be fair, POWER is an open standard, and there's absolutely nothing stopping someone like Linode, DigitalOcean, or Hetzner from offering POWER-based systems at a smaller hourly price.

edit: oh hey, you're ahead of me :D https://news.ycombinator.com/item?id=24185759


> To be fair, POWER is an open standard, and there's absolutely nothing stopping someone like Linode, DigitalOcean, or Hetzner from offering POWER-based systems at a smaller hourly price.

Well besides the fact that 1) IBM is the only party with both the capabilities and interest in making high performance Power based products [1], and 2) evidently does not understand how to (or why to) invest in bringing this to a general audience. I don't really care if they would do this in IBM Cloud or with other cloud infrastructure companies, but they don't seem to be doing either. In addition 3) why would other parties be interested in running Power when they can run amd64 or arm? It certainly doesn't look to have a price advantage...

IBM really needs to shepherd Power well - they're the only ones that can do it. But I can't help thinking they seem to be leading it to the grave despite apparently very capable engineering.

[1]: And why would anyone but IBM go for Power at this point if they can have ARM too? An open license matters very little when compared to ARM's mindshare and momentum.


As it is, POWER is on a slow decline towards irrelevance by focusing only on milking their existing enterprise customers as long as it lasts.

Which is a huge shame, the POWER ISA per se is mostly fine, and their commitment to open firmware etc. for trustworthy computing (see Raptor) is a niche that for some reason interests nobody else.

If they want to turn it around and compete with x86, ARM and maybe even RISC-V on the low end, they need to commit to openpower. Get some interesting cores open sourced under the openpower umbrella, docs, open source bus interfaces for connecting stuff on a SOC, etc. And get some decent priced hardware into the hands of hobbyists and as dev boards for embedded.


How would filtering out potential developers benefit a platform’s viability? My imagination isn’t sufficient to envision a market situation where this could ever make sense. Can anyone here describe a hypothetical scenario where, as a company pushing a platform, you want fewer developers interested in it?


It could be you're going about it a bit wrong. IBM does have trial POWER cloud instances. Or at least they did a couple years ago, as I managed to get my hands on one through the IBM developer program.

OTOH, getting into the developer program requires talking to their sales/marketing droids too. But if you're a software OEM or have a SaaS product, they were (are?) hungry and actively doing their best to recruit companies to their platforms. So, you spend a couple hours answering their questions, and in return you get some pretty steep hardware/software discounts and they will loan you machines.


Doesn't Google Cloud offer Power as a platform? Likely only to deep-pocketed customers, but IIRC they did a partnership a while back.

yup -- not turn key at all -- "contact sales" but it exists: https://cloud.google.com/ibm



Well, in our case (Fortune 500) we have iSeries and pSeries systems built on different generations of Power. They just aren't consumer-grade machines, or better put, commodity-priced machines.

When you are already invested it is far safer to stay with what works for you. My downtime is measured in hours per year, and all of it has been scheduled across the last seven years; that was the last time we had an unscheduled outage. You can achieve this with all types of hardware (well, maybe not all types), but it is easier with some than with others, and expectations are certainly much higher.

We joke at work that we get forgotten all the time, because our daily ready-for-business meetings (all groups reporting in) never see us mentioned except to state plans for upcoming quarterly maintenance. A pleasant state to be in.


> My downtime is measured in hours per year (...)

This is probably a bit picky since you don't mention what those machines do, but that's not even "four nines" in the high-availability scale. I'm not sure that's something to boast about for a Fortune 500 company or any kind of praise for the IBM series. Maybe you meant hours over the last seven years?


Not op but it sounded like it was scheduled downtime so perhaps it was meant as a per-system number and either redundancy kept the availability higher or the service wasn't needed when downtime was scheduled (overnight, weekends, etc)


Planned and scheduled downtime usually doesn't count against SLO/SLA.

If the "hours per year" is true, they're actually achieving 100% uptime.

Fucking big deal, if you ask me.


Juelich Supercomputing Center has a POWER9 system which I was able to experiment with. A lot of interesting differences, from 4-way SMT and a 128-byte cache line size to faster main-memory bandwidth and multiple GPUs.

In the end, it was great but not magical, so the difficulty of acquiring one offsets the benefits IMO.


I have a 1999 IBM Multiprise 3000 and I learned that the processor has a 256-byte cache line, which made my jaw drop when I read about it. Sadly, I can't get the Service Element software so I can't actually do anything with it.


Try https://www.integricloud.com/ - it's the cloud of Raptor (the POWER workstation builder).


It's really cool they're making this, I should give this a try. This does look like it's the most accessible way of getting a POWER-based shell. This service's interface does not start out with making a great first impression though unfortunately, but if it's the only realistic option then so be it.


Top UP? ¿?

Why make it so difficult to navigate? Just give resources and a price.

I have to invest so much effort into an infrastructure I don't even know if I'd like to have.


Presumably this helps them plan out capacity better. Not everyone can have the extra slack that AWS has.


If you wanted to try ppc64le, you could reach out to OSU's OSL Lab. They have Power8/9 machines and you could build your software and possibly arrange some time for PoC.

https://osuosl.org/services/powerdev/request_hosting/

There are also these https://raptorcs.com/TALOSII/


Also you can build for ppc64le if you are, or become, a developer for a GNU/Linux distribution. Unfortunately getting ppc64le back online for Fedora copr has no ETA, but you can develop and build for Fedora/EPEL. I can't remember whether SuSE's OBS has it.

You can also run under qemu, but I don't know how solid that is, and there is at least one simulator.


Yeah, openSUSE has ppc64le builders. I think there are openSUSE ports to ppc64le; they definitely exist, because we support SLE on ppc64le.

(Disclaimer: I work for SUSE.)


Raptor sells POWER9 servers. You can buy one to tinker with for ~$2k. Presumably, they will sell POWER10 too.


It looks like it will be a few years before they have that https://twitter.com/RaptorCompSys/status/1295364416469377026


That seems a bit disappointing. It might suggest that IBM is backtracking on their openPOWER approach.


Do you know about Raptor Computing? [1] Some of those are affordable and all of them are configurable. Maybe they'll support POWER10 in the future.

[1]. https://www.raptorcs.com/


Raptor Systems’s POWER9-based Talos system immediately came to mind when I was reading this. I definitely hope they’ll be bringing out a POWER10-supporting successor. The experience is now pretty perfect, even for the supposedly dicey desktop environment (provided you keep in mind that there’s a lot less hardware present than you might’ve come to expect: no integrated graphics, no in-built sound, even disk and network controllers can be a bit dicey). Their initial offerings had some notable teething problems that I shan’t waste more pixels going into (I’ve railed against them and turned in their favour, so I shall let it rest with my partisanship and convert’s fervour). It’s certainly worth considering if you’re into exotic hardware and/or are super-security conscious AND are willing to pay some extra money beyond what you’d expect for comparable x86-64 commodity builds. Well worth hoping for and looking into.


Looks like about 10k USD for the POWER9 recommended config

https://secure.raptorcs.com/content/TL2WK2/purchase.html


That’s true, but if you base the system on the smaller BlackBird motherboard (as I did) you can go much lower. https://wiki.raptorcs.com/wiki/Blackbird https://secure.raptorcs.com/content/base/products.html

You might also not need the Radeon WX7100 Pro; if a consumer-level graphics card is enough, there are several options: https://wiki.raptorcs.com/wiki/POWER9_Hardware_Compatibility...

The more you’re ready to do-it-yourself, the more you can save. You absolutely need the motherboard and IBM chip, and then you can get other things like memory, storage, graphic card, etc. used for much cheaper. Only the memory is a bit special as it has to be registered, but because of that you might be able to find good deals since it can’t be used in standard desktop computers.

You can get more performance for the same price with AMD/Intel, but for those interested this platform is not that out of reach. I use it as my workstation and I’m happy with it.


How much is ‘much lower’? I would love to build a workstation for myself using PowerPC. Would it run AmigaOS?


Blackbird™ Mainboard (Board Only) Order online for $1,310.99 Current Status: Backordered

Not unreasonable. $500 for the 4-core processor or $800 for the 8 core. Basically what you would have paid for a decent system decades ago, without adjusting for inflation.


“Much lower” is still quite expensive compared to a mainstream platform: the absolute minimum if you already had all the other pieces would be the motherboard and CPU bundle for $2,133.77: https://secure.raptorcs.com/content/BK1B01/intro.html

That’s 8 cores, which since Power9 has SMT4 means 32 threads. The 4 core CPU bundle is a bit cheaper but when you add the 2u CPU cooler its price gets very close, so it’s not worth it unless your computer case is too narrow for the 3u (≃15 cm clearance, https://en.wikipedia.org/wiki/Rack_unit) cooler included with the 8 core CPU.

It won’t run AmigaOS unless on an emulator[1], and then you’re better off with AMD/Intel: you would lose more performance in emulating the CPU, but you would get more raw performance for your money anyway.

[1] https://forums.raptorcs.com/index.php/topic,75.0.html

If you want to run AmigaOS software natively on PowerPC your best option is a PPC Mac running MorphOS. It even runs on the PowerBook G4: https://ddg.co/?q=morphos+powerbook+g4

Or if a well-integrated emulator suits you better, take a look at AmiKit: https://www.amikit.amiga.sk/

I don’t think that this CPU would be a net advantage for those use cases. I did use Amiga computers even as Commodore went bust, hanging on, but the approach I’m going for now is FPGA implementations: https://misterfpga.org/ http://www.apollo-core.com/index.htm?page=preorder

I’m in the waiting list for Vampire4 Standalone, but for now I’ve got a MiSTeR.

So no, don’t buy a Power9 hoping to run AmigaOS on it any better than on a mainstream PC.

P.S.: the $1310.99 price mentioned by aww_dang must be from a few months ago, the current price displayed in Raptor’s website is $1732.07. Prices went up due to the current COVID-19-induced crisis, which also made Raptor cancel their upcoming Condor platform (ATX Power9 but LaGrange-based: twice the memory bandwidth): https://www.talospace.com/2020/07/condor-cancelled.html

P.S.#2: for those in Europe, Vikings (https://vikings.net/) has been trying for a while to offer them here and they might be able do so it soon: https://www.talospace.com/2020/08/vikings-upcoming-openpower...

P.S.#3: used BlackBird with 8C Power9 for sale in Germany: https://forums.raptorcs.com/index.php/topic,153.0.html


There are some issues with Power10, although Raptor hasn’t revealed them yet until they clarify things with IBM: https://www.talospace.com/2020/07/condor-cancelled.html?show...


You can buy POWER9 boxes without talking to humans. With any luck, Raptor will integrate POWER10 into one of their upcoming Talos[1] workstations.

[1] https://raptorcs.com/content/TL2WK2/intro.html


> With any luck, Raptor will integrate POWER10 into one of their upcoming Talos[1] workstations.

Not for a while: https://twitter.com/RaptorCompSys/status/1295364416469377026


If you run an HPC cluster, I'd have thought you have to talk to vendors as a matter of course, but I assume there's no point yet. At least one UK vendor will sell you AC922s (POWER9), I guess as a one-off, if you wanted.


True, I do talk to many vendors about hardware purchases. But it's usually about the hardware I want, after I've already made up my mind about what fits our budget and matches our needs. IBM is usually not even in consideration because I can't even get a ballpark estimate of what these things practically cost (I can get some $ figures from Googling forums, but that's not really what I'm after).

To give you an idea, I can log in into a local vendor's website, for example, https://www.atea.dk/eshop/products/?filters=S_sr650 and quickly get an idea of what an SR650 Lenovo rack server would cost me. My configurations would obviously change the cost, which is where I talk to the vendor to get the real costs.


That's not how IBM works or how they have ever worked. The whole idea here is to put you in touch with their sales organization who will put some effort into determining what the best way of fleecing you is.


But if you want a small GNU/Linux POWER system, you don't talk to IBM, surely. Ours came from OCF in the UK; they might even operate in Denmark. You won't see a price on the web site, but realistically you don't expect a list price from Dell for your x86 server systems -- at least in my experience. If you talk to them nicely, they may give you an evaluation unit. We once paid £1 for a potent Interlagos system that didn't go back.

I'm not saying talking to sales people is necessarily a pleasant and easy task, of course, especially as you typically need to know more about it than they do.


"Consultancy"


The only place I have heard about POWER being used in Denmark is the military, but I guess it exists elsewhere as well. Though I could see the big orders in the public sector coming through procurement, compared to private companies asking around.


Banking tends to be a big Power user. I bet even in Denmark that's true. :-)



The time to talk to sales reps is a real problem. I was at a startup, and this was a factor in pretty much all of the software we selected. This was a financial app- we needed messaging of some sort. I called the leaders at that time, Tibco, 29West, etc. They all wanted to schedule meetings 3-5 days in the future to just talk about what we needed. My expectation is that we would have something basically working in 3-5 days. This is before we started haggling over contracts and pricing and all that.

We ended up using ActiveMQ, and literally that evening I had things talking to each other. No contracts, no hassle, just downloaded and got to work.

A bit more on-topic in regards to IBM though- we buy licenses from them directly, and the past few months they have sent kind of bizarre emails saying I could save money if I bought them through a third party partner. I just forwarded the mail to the finance guys, but I can't imagine why IBM would introduce a middle man into our relationship with them and how that additional layer in between could end up saving us money.


IBM does channel sales. They don’t want to maintain a sales force. They want to maintain channel partners and VARs. They will discount you because they need your money flowing through VARs to keep everyone happy in their ecosystem. So they need to incentivize everyone to go through their partners, and they will charge you more if you go direct to them.


> They don’t want to maintain a sales force.

Wow, I didn't realize their culture had changed that much. They were a sales oriented company since the days of Thomas Watson Sr.


They had and have a large/enterprise customer sales culture. There were few SMB sales for IBM back in TWSr's days (how small a mainframe do you want? Still costs a mint.)[1] and now that IBM has SMB product to sell, they aren't interested in building out a sales org to address that market[2]. So they've done what almost everyone else has done and outsourced that to a channel.

[1] Yes I know IBM used to sell everything down to the pens and pencils, but virtually always to support some very expensive other purchase.

[2] Yes I know IBM has things called "SMB sales" or some variation. From my experience at IBM, they were either targeting some specific product/market combo, or they were a bad joke; not exactly the A-team. YMMV.


Many big B2B sales orgs are all channel or heavily channel now. Just the way it is these days.


I don't know what the plan is for POWER10 but POWER9 systems are available from other vendors than IBM. Notably this company: https://www.raptorcs.com/.

The systems are still quite expensive, but likely within the budget of a University research lab.


I've begun to think having to "contact sales" is an anti-pattern that actually hurts gross sales in the long-run.

I've recently tried to engage with EventMobi, a company that supports virtual events.

I'm ready to spend thousands if necessary.

However:

#1: There is no way to just sign up, which is off-putting.

#2: Requesting a demo just put me in their funnel with canned email messages. And after over a week, I have yet to be contacted by a live human, despite sending emails.

I feel companies leave a lot of money on the table by not having some sort of self-driven onboarding.

Sales reps are humans. They get busy. Forget to call back, etc.

I feel like there should ALWAYS be at least some sort of self-driven flow at the low end. Even if sales are required at the high-end.

Otherwise, it seems, money is always being left on the table.


The "low" end is often grossly unprofitable (or, frankly, aggravating) to the point of dragging down the product and the support personnel.


I don't know if it's accessible to everyone, but IBM does have a cloud offering: https://www.ibm.com/cloud. Their bare metal offering doesn't seem to include Power processors though: https://www.ibm.com/cloud/bare-metal-servers


My take on performance using the slides mbrobbel posted, from https://regmedia.co.uk/2020/08/17/ibm_power10_summary.pdf

The dual-chip module has 30 SMT8 cores running at 3+ GHz, capable of 64 FP64 FLOPs/cycle when using the matrix unit. That gives 5.7 TF of peak FP64 performance (compared to 19.5 TF on an NVIDIA A100 when using tensor cores, and 9.7 TF on the A100 when not using tensor cores).

They say it has 3x the "general purpose socket performance" of Power9 in FP workloads. Trying to make sense of this from the other data: they have 15 SMT8 cores per chip (12 on Power9), the single-chip module runs at "4+" GHz and the dual-chip module at "3+" GHz (4 GHz on Power9), and each SMT8 core has 30% additional performance compared to Power9 (slide 13). If I assume the lowest possible clock (3 GHz), that gets me to 2.4x comparing the dual-chip module to the previous single-chip module, whereas assuming a 3.75 GHz clock would give 3x.
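
Putting that arithmetic in one place, under the same assumptions (30 SMT8 cores on the DCM, 64 FP64 FLOPs/cycle/core via the matrix unit, the clock figures from the slides, and a 12-core, 4 GHz Power9 baseline; these are slide-derived guesses, not official specs):

    #include <stdio.h>

    int main(void) {
        /* Assumed: 30 SMT8 cores, 64 FP64 FLOPs/cycle/core, "3+" GHz */
        double peak_tf = 30 * 64 * 3.0 / 1000.0;
        printf("POWER10 DCM peak FP64: ~%.2f TFLOPS\n", peak_tf);  /* ~5.76 */

        /* Socket-level FP ratio vs. a 12-core, 4 GHz Power9, assuming 30%
           better per-core, per-cycle FP throughput on Power10. */
        double p9 = 12 * 4.0;
        printf("vs Power9: %.1fx at 3.00 GHz, %.1fx at 3.75 GHz\n",
               30 * 3.00 * 1.3 / p9,    /* ~2.4x */
               30 * 3.75 * 1.3 / p9);   /* ~3.0x */
        return 0;
    }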


Our institution runs multiple HPC clusters for all kinds of scientific use cases. I remember reading about the head of the HPC department making the switch from INTEL/AMD to IBM because it had a much larger memory / storage (I can't remember which) bandwidth in some astronomy application. This made the project feasible without having to invest in custom hardware.

It's good to see that POWER still excels in important use cases.


POWER9 has large cache and can take large memory (at least for when it appeared), but was particularly notable for bandwidth/latency. I think they were the first with PCIe 4, for instance, when PCIe 3 was a bottleneck for HPC interconnect.

The native interconnect in 10 looks interesting.


For the uninitiated, what's the value prop of these processors?

Cheaper $ cost per TFLOPs to make up for the trouble of dealing with a specialty instruction set? Speed of certain specialized computations that cannot be matched by alternatives?

Or how would one summarize it?


Open Architecture with no black boxes. Total control over your system of the kind we used to take for granted before the Intel Management Engine days. Very focussed on throughput and centralised operation (for example homomorphic encryption and encrypted memory to forestall snooping).

And yes, when used ‘correctly’, these systems can be very fast... in the steady marathon kind of way rather than the spasmodic sprint-racer clock-boosting-and-throttling manner of today’s mainline chips.


> Open Architecture with no black boxes.

> ‘Open’ means that you’re allowed to understand exactly how it works and that there’s no mysteries. It means having the blueprints of the machine, not a free machine.

No, this is 100% incorrect.

The Power ISA, i.e., the software/hardware interface of the CPU, is open source. This means that if you want to build a Power CPU that implements its software interface, you can do so "for free".

That's it. You don't get "the blueprints of the machine", you cannot look into how the CPU works internally and understand it, etc.

That's like having a standard API that anybody can implement, e.g., the C standard library, but which Apple, Microsoft, etc. ship as a black box binary blob, so you can't understand their implementation, search/fix bugs, etc.

So no, your claim is completely incorrect. The benefits of an open ISA only apply to those wanting to build their own CPUs, which for Power is just not even a handful of companies, none of them making their blueprints of their CPUs openly available...

For end users, your machine is just as open or closed on a system with an open ISA as on one with a closed ISA. People paying $10k for a Raptor II in the name of openness are throwing their money away.

This is a completely different situation than, e.g., RISC-V, where not only the ISA is open-source, but the VHDL implementation of many RISC-V cores is also open source, and you can buy those cores today.


It's not about the ISA (I assume). The point is that these systems have essentially all free software firmware, as I understand it. You have remote management, but it's something you can presumably fix if you need to. Apart from trust issues, you know how valuable that is if, for instance, you've had to deal with BMCs' brokenness continually over the years.


Reading between the lines on statements made by Raptor I'm thinking that POWER10 will not be open immediately upon release.


Yes, that does seem plausible, which would be unfortunate.


Yeah, check out the PowerPC laptop project. Some folks are trying to build a laptop with PowerPC.


It's a doomed project. There are no mobile PowerPC parts on the market anymore; they're trying to make do with a QorIQ networking part, which has an inappropriate power budget for a laptop. (The specifications are a little hazy on the matter, but by my reading of the datasheet, it idles around 7W and draws closer to 20W at full power.)


So how come all the spotlights are on RISC-V? Another case of "worse is better"?


As far as I know, Power ISA implementations still required royalties until a year ago, so it hasn't had much time to mature in the current iteration of its "open" role.


> Open Architecture with no black boxes.

How do you reconcile this comment with the one from reacharavindh?

What use is the "Open Architecture" part if they're super expensive (ok, maybe you can ignore this part) and you have to go through sales representatives for a simple sale?

Those are still barriers to entry, even if they're not technical.


I don’t even attempt to reconcile it because they’re totally different things... ‘Open’ means that you’re allowed to understand exactly how it works and that there’s no mysteries. It means having the blueprints of the machine, not a free machine.


Yeah, but "Open" isn't everything.

1. Are any of these "Open" architectures actually used in production anywhere serious when not implemented by their creators? I'm actually interested to know of examples.

2. How do we know that the actual chip IBM provides is the thing in the spec? The comparison was with Intel, so how can we prove that there are no backdoors for PowerPC? If we can't prove it, does it matter if it's "Open"?


> but “Open” isn’t everything

I never claimed it was, and this is starting to reek of a straw-man argument where you’re opposing a statement I haven’t actually made.

(1) Yes, I am aware of situations where this architecture has been chosen by a body that isn’t a chief implementor, and no, I am not at liberty to discuss it.

(2) Having an open spec to compare against, even though I actually don’t know how, is already another plane of existence compared to not having something to compare against.

Decapping and microscopy? Pushing edge cases onto the chip and comparing expected outputs? Implementing all or part on an FPGA and seeing how they compare at a severely clock-reduced rate? It’s well beyond my technical ability, but it’s not beyond expert technicians’ abilities. That’s the key point.

EDIT: Also you can set your own keys for the root of trust, and remove others’. That’s very important, and radically orthogonal to the competition of ARM and x86_64.


They seem to be the best "communicators" on the market. With 1TB/s of DDR5 bandwidth and PCIe 5.0 support, a POWER10 will be the fastest "glue" between DDR5 and GPUs.

Which is pretty much how it works in the Summit supercomputer (POWER9).

--------

From a CPU perspective, it's going to be more costly than a Xeon or EPYC and not as fast at crunching numbers. But POWER9 (and I expect POWER10) usually had the best L3 cache and RAM performance.

The 1TB/sec OpenCAPI link to FPGAs or GPUs continues that tradition. That's an absurdly huge communication path between CPUs and/or GPUs or whatever else is on the motherboard.


Since selling its Intel server business to Lenovo, IBM enterprise server hardware is Power only. It is the flip side to some of the complaints in this thread: IBM enterprise sales is the single trusted source (or single throat to choke) for all of your business-critical systems.

If your organization has an existing relationship with IBM or Red Hat, Power CPUs are part of an integrated bundle moving forward.


They also have NVLink, which means 9x more bandwidth to NVIDIA GPUs compared to PCIe 3.0 x16.


Vertical scaling for database servers.


Have they improved Load-Hit-Store penalties from previous generations?

Lots of transistors and opcodes have been sacrificed for fancy things like transactional memory, runtime instrumentation and other features, but the fundamentals haven't improved, requiring expensive compiler optimizations that interpreters don't do and that are expensive for JITs.

The Intel chips did the fundamentals better, has POWER caught up?
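
For context, this is roughly the pattern I mean; a minimal C sketch (function names made up for illustration):

    #include <stdint.h>

    /* Accumulating directly through a pointer forces a store followed by a
       reload of the same address every iteration; without `restrict` the
       compiler can't assume `out` doesn't alias `in`, so it can't keep the
       sum in a register. On cores with a long store-to-load forwarding
       penalty this loop crawls -- and it's the kind of code interpreters
       and cheap JITs emit all the time. */
    void sum_slow(const int32_t *in, int32_t *out, int n) {
        *out = 0;
        for (int i = 0; i < n; i++)
            *out += in[i];          /* store, then immediately reload */
    }

    /* Keeping the accumulator in a register sidesteps the hazard. */
    void sum_fast(const int32_t *in, int32_t *out, int n) {
        int32_t acc = 0;
        for (int i = 0; i < n; i++)
            acc += in[i];
        *out = acc;
    }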


I noticed recently that OpenBLAS has gained some code using the POWER10 matrix-multiplication units: https://github.com/xianyi/OpenBLAS/tree/develop/kernel/power
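
From the application side nothing changes: a plain GEMM call goes through the same CBLAS interface, and an OpenBLAS build targeting POWER10 should dispatch to those MMA kernels. A minimal sketch (link with -lopenblas; sizes and values are arbitrary):

    #include <stdio.h>
    #include <cblas.h>   /* OpenBLAS CBLAS header */

    int main(void) {
        enum { N = 4 };
        double a[N * N], b[N * N], c[N * N];
        for (int i = 0; i < N * N; i++) { a[i] = i; b[i] = 1.0; c[i] = 0.0; }

        /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes.  On POWER10
           the library should pick the MMA-accelerated double-precision
           kernel; the caller never sees the difference. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0, a, N, b, N, 0.0, c, N);

        printf("c[0][0] = %g\n", c[0]);   /* 0 + 1 + 2 + 3 = 6 */
        return 0;
    }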


IBM is dying a slow death; they still "milk" the market with old stuff like AS/400, because of a superb lock-in.

The "Power" business is still doing ok; but I'd bet in a few years one of the other big guys will go at it (maybe Nvidia?) and start eating at their market share.


What do they do with the edge bits of the silicon wafer that are not squares?


I'm not an expert in this stuff, but I think they just get thrown away.

It's one of several reasons why smaller chips are more area-efficient to make, and one of several reasons why the major semiconductor manufacturers have been so interested lately in building chips out of smaller pieces manufactured separately rather than one big monolithic chip.


Even dumber question, why do they need to be circular?

CDs and DVDs write in a circular pattern starting from the middle going outwards, but the actual chips on these wafers seem to be their own individual squares.


Because of the process used to make the initial cylindrical crystal from which the wafers are sliced: https://en.wikipedia.org/wiki/Boule_(crystal) It involves spinning a seed crystal and drawing a cylinder out of a bath of molten ultra-pure silicon. Spinning => cylinder => circular wafers. To reduce waste of the "edge bits" the industry has moved over time to larger and larger wafers. I have some wafers from the 90s which are 6" and 8" in diameter (amazing what you can buy on eBay), but modern ones are all 12" (actually 300mm).


And migration to the next proposed size of 450mm may never happen, as the economics are not favorable:

https://en.wikipedia.org/wiki/Wafer_(electronics)#Proposed_4...


Just looked on EBay, there are indeed many different wafers for sale.

https://www.ebay.com/sch/i.html?_nkw=wafer+chip&_trksid=p238...

Cast in acrylic, some of them would make cool display plates.


Also dumb question: Why do dies need to be square? Wouldn't you get less waste at the edges with a hex shape?


That’s a great question.

I would guess that dies are built from modular sections (e.g. SRAM cells), and it’s important that two identical modules perform identically - signal propagation time is relevant at this scale, so the shape and layout of each module must be identical. I would further guess that rectangular layouts are easiest to reason about, easiest to make masks for, easiest to pack efficiently at the transistor level, and easiest to test.

But I don’t know of a fundamental reason why a sufficiently advanced VHDL “compiler” couldn’t produce hex-cell or even circular layouts.


Chip dicing hardware can produce hex-cells, or any other cell with straight edges. (Not circular - that's not a good shape to expect from crystalline silicon.)

But - as you say - the modular sections are rectangular, and for most applications there's no good reason to make the dies any other shape.

There's actually a patent for hex-cell chips, but it doesn't seem to have been used for any significant projects.

https://patents.google.com/patent/US6030885A/en


Could do triangular chips instead, they would be easier to dice.


I expect that would be too difficult to cut up


> Even dumber question, why do they need to be circular?

The wafer is round because it's cut from a cylinder of silicon. And the cylinder is a cylinder because spinning is involved in the process to make it. Hence, thanks to the spinning, it ends up being round!


Because the crystals as grown are circular (just first hit about the process https://youtu.be/XbBc4ByimY8)

From that one slices the wafers and then the processors get made. The "extra" ones at the edges have pretty much zero marginal cost.


Here's another video that shows different ways silicon wafers are manufactured, including "silicon ribbon" which makes rectangular ribbons!

https://www.youtube.com/watch?v=8QKzS_w_Ko0

Silicon Ribbons begin at 10:10.

https://en.wikipedia.org/wiki/String_ribbon

>Ribbon solar cells are a 1970s technology most recently sold by Evergreen Solar (which is now in receivership, i.e. bankrupt and liquidated), among other manufacturers.

https://en.wikipedia.org/wiki/Crystalline_silicon#PV_industr...

>ribbon silicon (ribbon-Si), has currently no market


They're grown as a giant cylinder, which is then sliced into round wafers.

https://en.wikipedia.org/wiki/Boule_(crystal)


Because that's the way you grow them as a single crystal.


I'm not sure if they can be 'just' recycled into new wafers. Either way, given that it's just silicon I'm sure it can be recycled or safely disposed of (silicon isn't toxic as far as I know, unless you breathe it in a powder form)


Not sure how easy it would be to recycle those as chips, given it will have dopants [1] inside. It will likely be unfit for computing applications unless purified, but since it's on the order of one dopant atom per 1e12 silicon atoms, it would basically be 100% pure in other industries.

Some metallic contacts (mostly aluminium), silicon oxide and other residues are likely present as well, depending on the masking process.

[1] https://en.wikipedia.org/wiki/Doping_(semiconductor)#Silicon...


There's stuff deposited on top. Depending on the stage, it's not silicon only.


I've wondered for quite some time why not triangles or hexagons. I guess the yield (percentage of wafer thrown away) improvements would be minimal. Plus, the temperature is often better controlled at the center, which would make the edge parts less performant anyway.

Those are the other advantages of chiplet design: maximizing yield (a small defect renders a much smaller chip unusable), and much more granular binning (easier to sort out good/worse chips, due to placement and random issues during fabrication). Not to mention you get a much more modular design at the end, where you only have to change the cheaper (not 7nm) silicon interposer.
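
To put rough numbers on the yield argument, here's a back-of-the-envelope sketch using the standard gross-dies-per-wafer approximation and a simple Poisson defect model. The die areas and defect density below are made-up illustration figures, not numbers from the article (compile with -lm):

    #include <math.h>
    #include <stdio.h>

    /* Gross-die approximation: wafer area / die area, minus an edge-loss
       term proportional to the wafer circumference. */
    static double dies_per_wafer(double wafer_d_mm, double die_mm2) {
        const double PI = 3.141592653589793;
        double r = wafer_d_mm / 2.0;
        return PI * r * r / die_mm2 - PI * wafer_d_mm / sqrt(2.0 * die_mm2);
    }

    /* Poisson yield model: probability a die has zero defects. */
    static double yield(double die_mm2, double defects_per_mm2) {
        return exp(-die_mm2 * defects_per_mm2);
    }

    int main(void) {
        const double wafer = 300.0;              /* mm, current standard    */
        const double d0 = 0.001;                 /* defects/mm^2 (assumed)  */
        const double areas[] = { 600.0, 75.0 };  /* big die vs. chiplet     */

        for (int i = 0; i < 2; i++)
            printf("%5.0f mm^2 die: ~%4.0f dies/wafer, ~%2.0f%% defect yield\n",
                   areas[i], dies_per_wafer(wafer, areas[i]),
                   100.0 * yield(areas[i], d0));
        return 0;
    }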


Partly it's "path dependency" - everything is set up for rectangular dies, so everything would need to change for uncertain benefits. Not just the tooling, but also the design software. While I was looking at this I found that Intel has a patent for an octagonal die with a smaller square one fitted in the gaps between: https://patents.google.com/patent/US20060278956

BTW the dicing used (in the 1970s) to be done partly by hand. You can see a video of someone doing it here: https://youtu.be/HW5Fvk8FNOQ?t=978


Hexagonal dies would be difficult to cut from the wafer; rectangles can be cut with a relatively simple circular saw.


The periphery is used for wafer level test circuitry.


Replacement corners in case one of the full-size dies in the middle cracks during dicing.


I know you're probably joking, but please mind Poe's law.

For the uninitiated, yeah, some dies can die during dicing. But I think you'd have trouble finding the cracks, and then it's just infeasible to precisely cut both halves where they would need to be cut, then reattach them. The issues would be the cut thickness, not damaging the circuits near it, precisely aligning the circuits, and then electrically connecting the circuits.

Alignment is probably the hardest part: we can barely do it for flip-chip wafers/silicon interposers on the order of a µm, imagine doing it at the feature scale of the 7nm-class process used here.


Yes, I was joking. :-)

I think they do sometimes put test features in the corners if there's space. The electrical properties of the die can vary in interesting ways [1], but the edges are usually worse than the center.

[1] https://www.google.com/search?tbm=isch&q=wafer+defect+patter...


Bull. Everybody knows that is because they've been cutting corners like it was crunchtime at the circle factory.


Another question: why do they “engrave” the squares that are going to be discarded anyway?


So, the main result of designing a CPU is a series of masks that essentially indicate where to put what. For example: in this layer, inject boron anywhere the mask doesn't block. The masks aren't wafer-sized; they are pretty small, and a machine moves the mask from position to position across the wafer to re-use it. But, at least when I was working on this, some masks would be larger than an individual square (die), maybe the mask could do 2x2 at a time. In that case, maybe one application of the mask would get you one complete die and three dies off the edge.


Awesome explanation, thanks! Kinda like dual-cavity injection molds I guess


36% of the TOP500 supercomputer list uses IBM POWER CPUs, including the top two supercomputers.

https://www.top500.org/


Japan's Fugaku with Fujitsu Spark CPUs has just taken the crown.

IBM Power CPUs are now 2nd and 3rd.

https://www.top500.org/lists/top500/list/2020/06/

https://en.wikipedia.org/wiki/Fujitsu_A64FX


That's an ARM(v8.2-A) chip, not SPARC.

Also, SPARC is spelled SPARC.


And SPARC spelled backwards is CRAPS.


So who is going to actually fabricate these chips? Samsung?

IBM transferred its own chip fab business to Global Foundries several years ago and it was my understanding that they were tied to them for the following 10 years. But Global Foundries announced they were abandoning EUV so I don't think they're going to be producing 7nm chips.


From the link

> Samsung Electronics will manufacture the IBM POWER10 processor, combining Samsung's industry-leading semiconductor manufacturing technology with IBM's CPU designs.


Thanks, I missed that, didn't read the PR all the way to the end.

I wonder how they got out of their deal with Global Foundries.


GloFo probably triggered all sorts of clauses when they announced they were stopping R&D on newer nodes, if I had to guess.


Makes sense. That announcement certainly surprised me, especially so soon after they took over IBM's chip fab business.


Are they barred from defense work now?


It's interesting to see that they are using Samsung's 7nm process. I thought that, apart from the work they do for Apple, Samsung kept their high-end fabbing mostly to themselves.


Samsung, Glofo and IBM were members of the Common Platform. My ex-roommate used to work at IBM Upstate where they trained Samsung engineers.

Apple moved to TSMC in Taiwan not too long after Tim Cook appeared on CBS claiming that the engines of their mobile devices were made in the US, almost 6-7 years ago. Apple's share of Samsung's production probably isn't much these days. But they are still #2 behind TSMC, and Samsung also announced recently that they are investing $100B over the next 10 years in their logic business, which includes their foundry.


Apple and Samsung had been partners for a long time. Some iPods and the original iPhone, iPhone 3G, and 3GS used Samsung ARM processors. The Apple A4 through A7 were made by Samsung too.

But the A8 was made by TSMC, and the A9 had two versions: the APL0898 by Samsung and the APL1022 by TSMC. There were some debates on which one was better.

After that, all Ax processors have been made by TSMC.


One of Samsung's many core businesses is fabbing on request. They are not doing this only for their own smartphones and a few exceptions.


For their most advanced process, too?

I wonder because NVIDIA, AMD and others requiring that process all seem to land at TSMC. Qualcomm, too, but that's hardly surprising.


Qualcomm is currently using Samsung's 7 nm and nVidia will too.


So I'm far off on both counts. Thank you for the correction.


It's bittersweet reading these IBM announcements. They clearly have amazing hardware, but I'll probably never get to play with it, since they make no effort to sell to consumers (unlike Intel, AMD, Nvidia, etc).


> transparent memory encryption designed to support end-to-end security

Does this work with process isolation? I.e., can I make it so that each process's memory is encrypted with a different key, to prevent snooping by other processes? How (if at all) does that work with debuggers?


I'm not sure about POWER, but in AMD EPYC it is implemented at the hypervisor level. So each VM can have encrypted memory with a unique key, but within a VM the processes see unencrypted memory.

It's typically implemented as an extension of the virtual memory page table, and conceptually it wouldn't be too difficult to have finer-grained keys, such as one for the kernel and one for user mode processes, or even one per process.


Interesting. Does that allay the concerns about speculative execution side channel leaks in cloud VMs? (Because even if you can leak data from other VMs running on the same physical device, that data will be garbage without the other VM's encryption key.)


Is IBM POWER the modern day Cray?


A little bit recursive perhaps. But IBM is modern-day IBM.


Can anyone give a description of what these chips are used for and by whom? And who writes software for these architectures? Seems like a totally different side of the industry that I know nothing about.


Is there a more informative writeup somewhere? I couldn't find any data on performance outside AI inference workloads. There is a footnote about 30 cores but very little detail even on that.



Looks cool, but seems to be pretty unavailable unless you're a large business. Also, very weird that they write about themselves in third person.


The third person is standard format for press releases. It’s so that journalists can copy bits verbatim without having to rewrite who did what.


It's so various news orgs can just publish it in all or part with no editing.


Considering the performance of Power9, it's likely to be slower than the modern x86 cpus, with some specialized exceptions.


Any chance Apple will return to POWER CPUs and be a real "supercomputer" again?

https://jeffhendricks.net/wp-content/uploads/2019/04/Powerma...


Ha! I wouldn't hold my breath - indeed they are ramping up to use their own system on chips (SoC) - not just CPUs - in future Macs.

If they deliver a significant increase in performance - even if it's only for a few specific use cases - the ripples will be interesting to watch play out for decades to come.


>"With hardware co-optimized for Red Hat OpenShift, IBM POWER10-based servers will deliver the future of the hybrid cloud when they become available in the second half of 2021."

Can someone say what "co-optimized" means here? Is this just bad marketing speak? If so, what is it intended to mean?


Is anybody buying non-ARM/non-x86 systems nowadays? Seems like a dying market.


Yes, the financial sector loves them.


I had really high hopes for a CPU with native float128 (quad precision) support with POWER9, but after testing it turned out it's only native for addition and multiplication ops. We'll see what the new generation brings to the table.


Addition and multiplication ops are the vast majority.
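
If you want to poke at it yourself, GCC exposes IEEE binary128 as __float128 (hardware-backed on POWER9's VSX; depending on the GCC version you may need -mfloat128, and -lquadmath for printing). A minimal sketch, nothing POWER-specific in the source:

    #include <stdio.h>
    #include <quadmath.h>

    int main(void) {
        __float128 a = 1.0Q / 3.0Q;           /* division: per the comments above, may not be a fast native op */
        __float128 b = a * 3.0Q + 1.0e-34Q;   /* multiplication and addition are the hardware-native ops */

        char buf[64];
        quadmath_snprintf(buf, sizeof buf, "%.33Qg", b);  /* ~34 significant decimal digits */
        printf("%s\n", buf);
        return 0;
    }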


At this moment I do feel like Apple has tackled its goal of making computing more accessible. I just didn’t realize IBM wouldn’t change their playbook.



So, will you be able to install 'Blue Hat' on it? Or do they have another niche OS for it instead?


Yes, RHEL (and Ubuntu) are supported.


+ SLES also


Now if I can get one of these puppies in a future Raptor Computing build... That would be the dream.


I used to work on an IBM AIX system as a data warehouse developer. That was on Power architecture but unfortunately I didn't know much about this architecture back then.


Anyone know if it will be littleBig endian?


Nice. Where are the main HPC jobs located?


National Labs in the US


Look at how huuuuuuge those dies are!


Imagine a Beowulf cluster of these.


Wait. IBM still makes CPUs? IBM still makes anything?


They design them but do not actually make CPUs. They are fabless.



