This is not correct. Despite being a single die, the Zen2/Zen3 IOD is internally divided into four quadrants; each chiplet (or pair of chiplets) has preferential access to its own quadrant and has to cross an internal interconnect to reach the other quadrants (and their memory). This is still a non-uniform topology.
This is explicitly called out in the BIOS of these systems: the terminology is "NPS4" (Nodes Per Socket), and it is documented in AMD's reference materials. The system can also be run in "NPS1" mode, where all memory channels are interleaved into a single node; since accesses are then striped across all four quadrants rather than staying local to one, average memory latency goes up somewhat.

https://developer.amd.com/wp-content/resources/56338_1.00_pu...
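To make the NPS1-vs-NPS4 distinction concrete, here's a toy sketch of the address-to-quadrant mapping. This is not AMD's actual interleave hash - the stripe size and mapping function are assumptions purely for illustration - but it shows why a linear scan under NPS1 lands on remote quadrants most of the time, while NPS4 keeps a node's allocations in one quadrant:

```python
# Toy model (NOT AMD's real hash): how NPS1 vs NPS4 might map physical
# addresses to IOD quadrants. STRIPE is an assumed interleave granularity.
STRIPE = 256  # bytes per interleave stripe (illustrative only)

def quadrant_nps1(addr: int) -> int:
    """NPS1: consecutive stripes rotate across all four quadrants,
    so a linear scan touches remote quadrants ~3/4 of the time."""
    return (addr // STRIPE) % 4

def quadrant_nps4(addr: int, node: int) -> int:
    """NPS4: memory allocated on a NUMA node stays in that node's quadrant."""
    return node

# A 4 KiB linear scan under NPS1 spreads evenly over the quadrants:
counts = [0] * 4
for addr in range(0, 4096, STRIPE):
    counts[quadrant_nps1(addr)] += 1
print(counts)  # [4, 4, 4, 4]
```

Under NPS4, that same scan (allocated on one node) would hit a single quadrant 16 times - all local, hence the lower latency.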
AnandTech did some really good coverage of this in their 3rd Gen Epyc (Milan) deep-dive, including latency measurements; Zen2 is largely similar architecturally in this area. Comparing the NPS1 and NPS4 charts, Zen2 pays around 12.8% higher memory latency in NPS1 mode, and Zen3 narrows that penalty to about 6.4%.

https://www.anandtech.com/show/16529/amd-epyc-milan-review/4
If you ignore the physical placement of functionality and simply look at a bird's-eye view of the data paths, it's not that different from Zen1. Zen1 had four "chiplets", each of which had its own memory controller (and associated uncore). In Zen2, the uncore has simply been pulled out of the chiplets onto the standalone IOD, but it is still implemented as four quadrants - just as Naples was four monolithic dies interconnected and packaged together.
As mentioned, since effectively the entire uncore is divided into quadrants, this has a few other quirks. The one that comes up most often is memory channel population - you really, really want to populate all memory channels on Epyc, even when running in NPS1 mode. If you're not going to populate sets of 8, make sure it's one of the "balanced" configurations; otherwise some quadrants have no local memory at all, and the performance hit can be substantial. For example, Lenovo's documentation shows that populating only 6 of 8 channels costs 29% (relative to the theoretical potential of a 6-DIMM configuration) even in the "correct" configuration (two quadrants each lose one channel), while an improper configuration (one quadrant with no attached memory) costs 60%. Populating 7 channels costs 65% relative to the theoretical maximum - you are losing about 2/3rds of your performance largely due to the NUMA topology!

https://lenovopress.com/lp1268.pdf
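A crude way to see why unbalanced configurations hurt: if NPS1 interleaving spreads traffic evenly across the four quadrants, the quadrant with the least local bandwidth bottlenecks everyone. This min-quadrant model is my own simplification - it captures the direction of the effect, not Lenovo's exact measured figures (real penalties also depend on interleave granularity, and a memory-less quadrant falls back to remote access rather than stalling entirely):

```python
# Toy model (assumption: NPS1 interleaves traffic evenly across the four
# quadrants, so the weakest quadrant limits aggregate throughput).
def effective_channels(channels_per_quadrant: list) -> int:
    """Interleaved bandwidth is capped at 4x the weakest quadrant's channels."""
    return 4 * min(channels_per_quadrant)

configs = {
    "8 DIMMs, balanced":   [2, 2, 2, 2],  # all channels populated
    "6 DIMMs, balanced":   [2, 2, 1, 1],  # two quadrants each lose one channel
    "7 DIMMs":             [2, 2, 2, 1],  # one quadrant down a channel
    "6 DIMMs, improper":   [2, 2, 2, 0],  # one quadrant with no memory at all
}
for name, cfg in configs.items():
    print(f"{name}: {effective_channels(cfg)} / 8 channels' worth")
```

In this model the memory-less quadrant drives interleaved throughput to zero; on real hardware remote access rescues it, at the large (~60%) cost Lenovo measured.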
PCIe latency is also slightly higher when crossing quadrants. Not the sort of thing most people will be paying attention to, but the guy above doing HFT and worrying about NUMA affinity is probably also watching which cores sit in the same quadrant as the PCIe lanes his FPGAs hang off of, because it does matter. Netflix also ran into similar issues with bandwidth - needlessly pushing data across NUMA domains will eventually bottleneck performance if you're moving enough of it; keeping it inside the quadrant avoids that bottleneck.
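On Linux you can actually do this mapping yourself: sysfs exposes each PCIe device's NUMA node and each node's local CPU list. A minimal sketch (the BDF in the usage comment is hypothetical; `numa_node` reads -1 when the kernel doesn't know or the system is effectively UMA):

```python
# Sketch: find the CPUs local to a PCIe device's NUMA node via Linux sysfs,
# so the process talking to that device can be pinned to same-quadrant cores.
import os

def parse_cpulist(s: str) -> set:
    """Parse a sysfs cpulist like '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in s.strip().split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def cpus_for_pci_device(bdf: str) -> set:
    """Return the CPUs local to the device's NUMA node."""
    with open(f"/sys/bus/pci/devices/{bdf}/numa_node") as f:
        node = int(f.read())
    if node < 0:  # kernel reports -1 when locality is unknown
        return set(range(os.cpu_count()))
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())

# Usage (hypothetical device address; Linux-only):
#   os.sched_setaffinity(0, cpus_for_pci_device("0000:41:00.0"))
```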
It really is a testament to how well AMD made NUMA work that it doesn't "feel" like NUMA - and I think they even turn the NPS1 mode on by default now. But architecturally, it is NUMA underneath, and you can extract a small amount of additional performance by pulling the veneer of UMA away and addressing the hardware as it is actually implemented.
Thanks for the clarifications. I wasn't aware the EPYC IOD was so severely sliced, and just assumed the NPS4 mode would be for isolating neighbour VMs and improving DRAM row buffer locality, both mostly by reducing channel interleaving and setting up somewhat-explicit NUMA.
Yeah! Most people don't realize it because it does pretty much just behave like UMA until you get to the extremes of performance tuning. The one gotcha that does potentially affect the general public is that thing about making sure you populate sets of 8 sticks if at all possible, but most server users will be populating sets of 8 anyway.
It's actually stunning how good a job AMD did there, I'm not dumping on it at all - for 99% of users it might as well be UMA. Naples very much acted like a four-socket system, while Rome's quadrants more or less Just Work. I've always been very curious about what changed to make it behave so differently, whether it's the off-chip interconnects being that much higher-latency than the on-chip interconnects, or what.