Failing to reach DDR4 bandwidth (unum.cloud)
101 points by ashvardanian on Feb 2, 2022 | hide | past | favorite | 71 comments


You are missing one critical element: NUMA awareness.

Because of the chiplet design, Threadrippers and EPYC chips are basically NUMA systems, like you would get with a multi-socket motherboard. This has an impact on both memory latency and bandwidth.

Redo your test, but pin each thread to a CPU thread, and its memory pool to a stick of RAM attached directly to the CCX of said thread. The bandwidth you are measuring is the Infinity Fabric's bandwidth, NOT the DDR4 bandwidth.

You can read more about infinity fabric and its bandwidth here: https://en.wikichip.org/wiki/amd/infinity_fabric


Came to say exactly this.

A lot of code is wasting a ton of performance on servers because of not being NUMA aware. We should also test high throughput/performance code on NUMA systems.

Remember to allocate local memory for your threads — IOW, be NUMA aware. Going through interconnect(s) is slow.


I'm in HFT and a huge part of our job for the FPGA drivers is managing the NUMA awareness of our software. And we have to constantly educate new devs coming into the low level software teams about this. Luckily, the SW teams are starting to do this training for us but it's always a major headache when there's issues. You can easily lose 50%+ of your performance by just failing to set a core affinity without even considering anything else.


How is that generally done? Platform specific libraries? Utilities that ask the OS to pin a program to a node?


In Linux you can use the numactl command to pin a process, e.g. numactl --cpunodebind=0 --membind=0 ./app to keep both the threads and their allocations on node 0.


Windows has a NUMA API[1] to help with these things. For pinning a thread you can for example use the SetThreadSelectedCpuSetMasks call[2] from what I can gather.

[1]: https://docs.microsoft.com/en-us/windows/win32/procthread/nu...

[2]: https://docs.microsoft.com/en-us/windows/win32/api/processth...


libnuma probably does the job.

https://github.com/numactl/numactl


If you want to set thread-core affinity in a cross-platform way, hwloc is a good approach: https://www.open-mpi.org/projects/hwloc/doc/v2.7.0/a00166.ph...


Hadn't even opened the article yet and I was thinking exactly this, 'bet they forgot NUMA'


On second thought, are you sure Infinity Fabric would be that slow? I would expect it to be comparable to L3 bandwidth, not RAM. Plus, the article you linked suggests 40 GB/s P2P IFOP with slower 2.6 GHz RAM. I can't find the right diagram, but as I remember, this CPU has 8 chiplets, still allowing for much more bandwidth, depending on the topology.


> I would expect it to be compared to L3 bandwidth not RAM.

Infinity Fabric is well known to be slower / lower-bandwidth than DRAM actually.

Just one of the peculiarities of this architecture.

------

That's why understanding MESI is important. If Core#10 has written to a cache line, Core#45 will not be able to write to it until Core#10's copy is released (i.e., reaches the Invalid state of MESI).

In contrast, writing to DRAM that's "not owned" by anybody is one-and-done. No need to communicate to other cores or chiplets.

------

No _singular_ core can reach the 204 GB/s spec of the architecture. You only reach that 204 GB/s spec if all the cores (or chiplets, really) are visiting different sections of RAM independently.

This happens in practice often: the whole EPYC / Threadripper design is really aimed at virtual machines doing independent tasks. If you're trying to write high-performance code on EPYC / Threadripper, it's important to keep this model in your brain.

-------

Once you fix this problem, your next problem is probably going to be the page-table / virtual memory system. Maybe you should be looking up huge-pages in preparation?


A singular CCD on Zen3 can max out the read bandwidth of two DDR4 channels at the latency-optimal 1:1 gearing between the IFOP clock and the DDR4 clock (i.e., 1600 MHz clock for DDR4-3200), but its write bandwidth is only half as much (IFOP has asymmetric bandwidth).

I haven't looked into the detailed limits of the server northbridge, but the consumer northbridge has heaps of PCIe bandwidth compared to the relatively low DDR4 bandwidth. Even though PCIe4 is full-duplex, just the read bandwidth of the x16+x4 northbridge PCIe4 lanes, plus the x4 PCIe4 lanes connecting the southbridge to the northbridge (re-emitted on X570, with the southbridge acting as a PCIe switch), exceeds the DDR4-2666 bandwidth (after accounting for the in-practice unavoidable losses from DDR4's shared address bus).

So, assuming DDR4-2666 RAM with the Infinity Fabric geared 1:1 at 1333 MHz, a Ryzen 9 5950X's two CCDs can together read 42 GB/s from DRAM, read 42 GB/s from PCIe, and write 42 GB/s to PCIe.

This should break even with the peak L1$ bandwidth if the core is running at half the IFOP clock.

Granted, in reality DDR4-3800 runs fine on most R9-5950Xs. Some chips don't work at the 1800 MHz IFOP; those will still run at DDR4-3600 in almost all cases, given high-quality DRAM. (16 Gbit Micron Rev.E dies are used in e.g. Kingston Server Premier 32 GB 2Rx8 "DDR4-3200 CL22" ECC-UDIMMs; my two sticks run at 3600 CL18 normally, and even reached 3600 with CL 30+ in the same channel in a signal-integrity test, suggesting DDR4-3600 should work with 128 GiB of RAM.) And that memory/IFOP speed is 42% above the 2666/1333 speed I equated to the available PCIe4 x4 bandwidth.

It's still damning to see that for sheer full-duplex bandwidth, DDR4 quite easily wins, considering 16 lanes full-duplex match the practical full-duplex ceiling of 2 DDR4-3600 channels if you measure bandwidth at the CPU's L2$.


Somehow, similarly bad numbers were achieved with 56 cores working on disjoint parts of a 512 GB array. I didn't pin RAM to cores, though. Maybe that's the missing piece, but it's sad if that's the only way to meet the spec. Almost no application can do that in modern cloud deployment realities.


https://people.freebsd.org/~gallatin/talks/euro2019.pdf Here's a talk about a lot of the NUMA issues, brought up by Netflix + FreeBSD. It's 200 Gb/s (bits, not bytes), but I'd expect you to be running into a lot of similar problems.

It's 200 Gb/s because they're reading from SSDs. But still, careful attention to NUMA was the only way they could reach full PCIe bandwidth, which is a similar issue to yours (where NUMA is needed to reach full memory/DRAM bandwidth).

--------

Did you set your BIOS into NUMA mode? With default BIOS settings, your RAM is interleaved across all 4 nodes (uniform-memory-access mode). At least, that's how it was for my Threadripper 1950X (I know you've got a totally different model, but who knows?)

I forget the exact name of the BIOS setting, but you've got to boot your computer with 4 nodes per socket (or was it 8 nodes per socket for your computer?) or some similar setting. There are programs on Linux (and Windows) that help you see what the OS is seeing.


See also a more recent talk[0] on progress made on the same topic. It's on EPYC too, so perhaps more applicable to OP?

[0]: https://people.freebsd.org/~gallatin/talks/euro2021.pdf


The original talk was on 2019-era hardware. Xeon and EPYC are both discussed; it's somewhat dated now, but still relevant IMO.

The newer talk is 2021 hardware. IMO, the newer talk seems to focus more on the specific details of kTLS and PCIe TLS... but there's still stuff in there about NUMA for sure.

But yeah, the 2021 talk definitely is on more recent hardware: 2021 Xeons and EPYCs.


Ah, dang, that's on me, I forgot the earlier talk discussed EPYC chips too.


Are you thinking of "NUMA per socket" setting in the BIOS?

Dell has a short overview of this focused on Epyc 7000 series: https://downloads.dell.com/manuals/common/dell-emc-dfd-numa-...


That sounds like the one.

Different motherboards call these settings different names in their BIOS programs / setup however. So just keep an eye out for it.


NUMA is part of the reality. There's lots of modern computing that doesn't acknowledge reality, but that doesn't tend to do good things for performance.

Communication isn't cheap, and implicit communication makes it hard to tell what's expensive. Reading from a pointer could be from L1 cache, or it could be from a pci-e card attached to another socket. The code looks the same, but the cost is radically different. Organizing how your code operates in memory to avoid cross-numa and cross-cpu in general isn't always easy, but it's required if you want to get the performance you paid for. OTOH, smaller nodes get this for free: my dual core desktop 'servers' don't have to worry about NUMA and don't have a lot of cross-cpu contention either.


> Reading from a pointer could be from L1 cache, or it could be from a pci-e card attached to another socket.

The fun one is TLB (translation lookaside buffers) and the virtual memory system.

Today's AMD cores have more L3 cache than the TLB can cover with 4K pages. You need to enable 2MB hugepages or 1GB hugepages to even access L3 cache at full speed in practice...

EDIT: Milan-X has 96MB L3 cache per CCX. 4 KiB pages would require ~24,500 TLB entries to cover it. IIRC, Milan only has about 2,000 TLB entries. Hurraaahhhhhh....

------

CPUs are devilishly complicated. It makes optimization "fun". Apparently, running "memcpy" requires Ph.D levels of study before you can "memcpy" at full speeds these days.


In the same kind of funny subject, GPUs nowadays have full MMUs, with TLBs and all present too…


Hugepages FTW


> Somehow, similarly bad numbers were achieved with 56 cores working with disjoint parts of 512 GB array

Oh, that's not NUMA at all, now that I'm reading your post more carefully. Proper NUMA usage would involve a "copy" step, ensuring that those elements are in NUMA-local memory before reading.

Much like in GPU programming, where you have to worry about the physicality of memory, in NUMA-aware programming you have to memcpy data to the right location before it achieves high speeds. Each of the 56 cores needs its ~10 GB in "NUMA-local" memory _BEFORE_ you start the benchmark.

Yeah, I realize this isn't practical. But... who ever said that NUMA use cases are practical? Lol. In a lot of cases, it makes more sense to just take advantage of Infinity Fabric for simplicity (although it's slower, it's definitely more convenient).


On the contrary. Having multiple independent services running on a single physical machine will likely give you full performance. The easiest way is probably just splitting the physical machine into the number of VMs/pods equal or larger than CCX count.


AFAIK each CCD gets 50 GB/s over Infinity Fabric which adds up to 200 GB/s which is effectively the same as memory bandwidth. IF should only be a bottleneck if you want to get high bandwidth from few CCDs.


I upvoted this knowing the explanation but not feeling myself confident enough to actually write it down in the hope that someone more authoritative will. Thank you!


No, these are Zen2. Both Zen2 and Zen3 single-socket systems are not NUMA systems in the traditional view (i.e., w.r.t. DRAM).

The L3 is CCX-local, but with Zen2 they re-introduced a separate northbridge die for each socket, integrated with the CCDs as a separate chiplet.

Dual-socket systems remain NUMA, and for workloads that feel L3$ more than DRAM, treating CCX-as-NUMA still helps (but that's a scheduler thing, where kernel-level threads/processes (Linux barely distinguishes) that do fine-grained shared memory with _collective_ temporal locality should be accessed from within a single CCX).


But isn't Zen2 an MCM? It's not a single uniform processor, and there is another level of abstraction on top of a socket.


It's an MCM with only one I/O die, but that die has multiple "quadrants", which create multiple NUMA domains depending on configuration. See https://developer.amd.com/wp-content/resources/56827-1-0.pdf


This is not correct. Despite only having a single I/O die, the Zen2/Zen3 IOD is actually divided into four quadrants; each chiplet (or pair of chiplets) has preferential access to its quadrant and has to cross an interconnect to get to the other quadrants (and their memory). This is still a non-uniform topology.

This is explicitly called out in the BIOS of these systems: the terminology is "NPS4 mode" (Nodes Per Socket) and is documented in AMD's reference materials. The system can also be run in what's called "NPS1 mode", where all memory channels are interleaved, but since this obviously requires waiting for all four quadrants to finish their memory access, this increases latency somewhat.

https://developer.amd.com/wp-content/resources/56338_1.00_pu...

Anandtech did some really good coverage of this in their 3rd Gen Epyc (Milan) deep-dive, including latency measurements. Zen2 is largely similar architecturally in this area. You can see from comparing the NPS1 and NPS4 charts that Zen2 has around 12.8% higher memory latency in NPS1 mode and this is reduced to 6.4% in Zen3.

https://www.anandtech.com/show/16529/amd-epyc-milan-review/4

If you ignore the physical placement of functionality and simply look at a birds-eye view of the data paths, it's not that different from Zen1. Zen1 had four "chiplets" that each have their own memory controller (and associated uncore). In Zen2, the uncore has simply been pulled from the chiplets onto the standalone IOD, but it is still implemented as four quadrants - just as Naples was four monolithic dies interconnected and packaged together.

As mentioned, since effectively the entire uncore is divided into quadrants, this has a few other quirks. The one that most frequently comes up is populating memory channels - you really really want to populate all memory channels on Epyc, even when running in NPS1 mode. If you're not going to populate sets of 8, you really really want to make sure it's one of the "balanced" configurations, as otherwise some quadrants don't have access to memory at all, and the performance hit can be substantial. For example, Lenovo's documentation shows that populating only 6 out of 8 channels can result in a 29% performance hit (relative to the theoretical potential of a 6-DIMM configuration) even in the "correct" configuration (two quadrants each lose one channel), and an improper configuration (one quadrant with no attached memory) will have a 60% performance hit. Populating 7 channels will result in a 65% performance hit from the theoretical maximum - you are losing 2/3rds of your performance largely due to the NUMA topology!

https://lenovopress.com/lp1268.pdf

PCIe latency is also slightly higher when crossing quadrants. Not the sort of thing most people will be paying attention to, but to use an example here, the guy above who's doing HFT and paying attention to NUMA affinity is probably paying attention to what cores go to what PCIe lanes for talking to his FPGAs, because it does matter. Netflix also ran into similar issues around bandwidth - needlessly pushing data across the NUMA domains will eventually bottleneck performance, if you are pushing enough data. Keeping it inside the quadrant doesn't incur that bottleneck.

It really is a testament to how well AMD made NUMA work that it doesn't "feel" like NUMA - and I think they even turn the NPS1 mode on by default now. But architecturally, it is NUMA underneath, and you can extract a small amount of additional performance by pulling the veneer of UMA away and addressing the hardware as it is actually implemented.


Thanks for the clarifications. I wasn't aware the EPYC IOD was so severely sliced, and just assumed the NPS4 mode would be for isolating neighbour VMs and improving DRAM row buffer locality, both mostly by reducing channel interleaving and setting up somewhat-explicit NUMA.


Yeah! Most people don't realize it because it does pretty much just behave like UMA until you get to the extremes of performance tuning. The one gotcha that does potentially affect the general public is that thing about making sure you populate sets of 8 sticks if at all possible, but most server users will be populating sets of 8 anyway.

It's actually stunning how good a job AMD did there, I'm not dumping on it at all, for 99% of users it might as well be UMA. Naples very much acted like a four-socket system and Rome's quadrants more or less Just Work. I've always been very curious about what changed that it's so different, whether it's the off-chip interconnects being that much higher-latency than the on-chip interconnects, or what.


I would add that seeing somebody "forget" NUMA makes it likely there are a lot of other problems.

NUMA isn't something you "forget" when you try to test bandwidth limits. It suggests a lack of understanding of how that stuff works.

It is as if somebody failed to reach hard drive transfer speeds and they "forgot" they need to be looking at the access pattern.


I think that's fair - this reads heavily like someone who thinks throwing hardware at the problem is all it takes, and that the marketing materials are guarantees. That kind of thing.

They even quoted direct AMD statements indicating that the memory access pattern isn't uniform, but glossed over it when they saw the numbers.

I used the term 'forget' in one of my posts, but it's not a true representation - NUMA isn't the kind of thing you forget when you're actually doing HPC. It's the bread and butter.

It's like your mechanic forgetting your vehicle has a motor that needs consideration...


Thanks, that's a valid suggestion! But in that case I would be forced to split my array into smaller chunks in different disjoint region of address space, right?


but it only proves his point


It is not exactly clear to me what is going on with the threads (I guess you are using all of them?). I haven't done much in this space, but anecdotally I've had better luck when my summation is explicitly split into sub-summation tasks, and it's not clear that's being done here. It looks like a single summation loop that the author is expecting the computer to magically split across multiple threads. I'd be interested in seeing what this looks like if instead each thread added its own chunk of the original dataset into a per-thread result (e.g., first 8000 samples on the first thread, next 8000 on the 2nd, etc.), with a final accumulation loop across all threads. Again, the author may be trying this and this is not my area of expertise, but I've had decent luck saturating the memory bus with a similar approach.


Here is the source and the threads: https://github.com/unum-cloud/ParallelReductions/blob/fd16d9...

OFC we don't expect the compiler to instantiate them for us, it's not OpenMP :) That one we covered in previous articles. OpenMP gave us about 50 GB/s with all cores enabled and 80 GB/s with part of them disabled.


Is there an advantage to using taskflow for parallel for, if you already have another threadpool implementation? I recently removed taskflow in a project that was only being used for a parallel for loop (as part of a larger refactor, the code had a number of issues...), and I'm wondering if that was a mistake now that I see that pattern somewhere else. :)


Nope, don't worry :) I did it out of laziness. I didn't want to implement a task queue for std::thread-s, so I took TaskFlow, as one of the most famous solutions. You can definitely get better async task management with enough C++ experience and time.


My C++ is not great (so it is hard for me to tell what is going on), and I'm used to OpenMP, where my understanding has always been that you tend to get a single thread per processor (or per hyper-thread) -- not sure if that is guaranteed with the way your code is laid out? Perhaps it really is a NUMA issue, as others suggest. I will note one other variation I had (as it looks like you are already splitting across threads): the chunks were small enough that there were more chunks than threads, which meant a faster thread would take more chunks rather than everyone waiting on the slowest thread. Good luck!


Doug Lea, of Java Memory Model and concurrency fame, went pretty far down this rabbit hole. Not only do you use separate counters/queues per thread/core, but you also put empty space around them so that you don't accidentally share cache lines. I don't know what they do now, but at the time some of the data structures in that library used arrays where only every 8th or 16th entry is used, to avoid two cores trying to read from the same cache line.

Typically allocating a separate data structure per actor also accomplishes this as a happy accident. If the thread does the allocation, then it has a better chance of being in the right bank as well.


Yes, that's needed when you have counters in global memory. In that case, instead of just having vector<double>, you would put each double into a structure aligned to 64-byte addresses. Here all the counters are on the local stack, so that trick unfortunately won't help.


For the single threaded version, I believe they have a similar problem with

    auto sums = _mm256_set1_ps(0);
    for (; it + 8 < end; it += 8)
        sums = _mm256_add_ps(_mm256_loadu_ps(it), sums);
where each SIMD op is overwriting the same compact data structure.

But in the threaded version https://github.com/unum-cloud/ParallelReductions/blob/fd16d9... they have separate slots for the accumulators, but they're still in a shared vector, which most likely has the issue I described.


The spreadsheet showing a bunch of different approaches:

> Attempt Bandwidth Max Bandwidth Saturation Time to Code

> Parallel STL 87 GB/s 204 GB/s 42.6% 1m

> Best CPU run 122 GB/s 204 GB/s 59.8% 60m

> Thrust 743 GB/s 936 GB/s 79.4% 1m

> Custom CUDA 817 GB/s 936 GB/s 87.3% 30m

> CUB 879 GB/s 936 GB/s 93.9% 5m

Then, benchmarks output:

> 8,115,337,000,378 instructions # 0.18 insn per cycle # 4.71 stalled cycles per insn (83.33%)

> 1,820,092,697,347 branches # 168.291 M/sec (83.33%)

It's not clear to me which approach the benchmark output represents. But that's a _lot_ of branches for the number of instructions executed. I suspect that might be a reason the CPU didn't reach high RAM throughput.


The numbers are for the last approach, committed to the public repo. I agree that it's a lot of branches, but expectedly so. Only 0.13% of all branches were missed, so the speculative execution almost always works. Even manual 4x unrolling didn't help.


Indeed? I wonder if you've tried different memory performance settings in your BIOS (which "might" invalidate your warranty) or different memory modules altogether


And regarding the unrolling, I also removed the data dependency by accumulating into 4 different YMM registers. So it's most likely just the Infinity Fabric bottleneck limiting our access to memory.


I didn't change anything in the BIOS this time. But the entire system runs on liquid in a very cold, properly vented room. We must have cranked up the BIOS settings during the first boot.


On the other hand, high branch usage is the ideal scenario for using CPUs vs GPUs.


Not necessarily.

GPUs are outstandingly good at uniform branches. Only divergent branches are GPUs bad at.


I would further emphasize - locally uniform branches. Mostly the cores within the same warp should be well synchronized.


It also should be noted that CPUs are _ALSO_ bad at divergent branches.

It's just that in the CPU world, "divergent branches" are called "branch mispredictions". In the GPU world, we have a better idea of how the equivalent of the branch predictor works: it's called SIMD execution / branch divergence.

-------

That being said: CPUs are way better than GPUs at divergent scenarios. Not only is the CPU branch predictor able to guess patterns, but it's also able to execute in parallel with the rest of the CPU (out-of-order speculative execution and all).

So in highly divergent, branchy code, CPUs work, but somewhat slowly (if the branch predictor can't predict, it's useless). But if there's a simple pattern, GPUs are great.

If there's a pattern that can be detected at runtime, but is too difficult to program into a GPU (ex: a binary search over a million elements will probably loop roughly 20 times), that's where CPUs win exceptionally over GPUs.

The CPU branch predictor will predict 20 loops in your binary search. It might be a little bit wrong (19 loops needed, or 21), but speeding up those 20 loops is a huge benefit.


Wrt branches, GPUs want spatial locality, and CPUs want temporal locality :)


GPUs are in fact easier to reach high performance on these days, IMO. Sure, you've got to learn how to stride memory, possibly across workgroups, but GPUs really do have patterns that easily hit their specs (be it FLOPS or memory bandwidth).

GPUs are dumb, but that makes them kinda simpler for these kinds of simple tests.

CPUs, in particular the Threadripper Pro, have very complex memory hierarchies.

Unless they experiment with NUMA mode (4 or 8 nodes, maybe), I wouldn't expect much improvement. To max out the memory bandwidth on a CPU, you really have to understand MESI / cache snooping / false sharing, split up your accesses across the chiplets (easier to do with NUMA turned on), and code in a NUMA-aware manner.

---------

I'm not saying GPUs are always better. But this particular test is really bloody simple and ideal for GPUs.


Yeah, completely agree. Sadly we live in the convergence era. Now GPUs are also multi-chip (starting with MI200) and the latencies will likely become unpredictable once again :)


Does AMD have folks that you can reach out to regarding this? I know Intel has MKL and all the work around its own compiler for maximum speed. This seems like it should be trivial for someone at AMD to put together as an example of how to do things like this correctly ...


I would try something like this on Graviton2/3 or a recent POWER machine; in my experience it's a lot easier to get reasonable memory perf out of !intel machines.


The threadripper is also a !intel chip.


Some folks are just not TradeMarksters. They'll teach you that Intel makes processors that implement x86 and amd64 instruction sets. You can Google things on DuckDuckGo. They'll blow their nose on the Kleenex that they bought from GenericPaperProductsCo. At some restaurants in the US South you'll have to let your server know what kind of Coke you want: cola, lemon-lime, or the Doctor one.


Threadripper is by AMD


Yeah, that was what I was pointing out -- I was mirroring the previous poster, who I think meant !intel as "not Intel."


I bet AMD itself is not able to get even close to the 204 GB/s bandwidth that they "promise" for these CPUs.

AMD's instructions for measuring peak bandwidth are here: https://developer.amd.com/spack/stream-benchmark/ (essentially a multi-threaded memcpy). The script there pins 1 thread per CCD, and with 8 threads you'll get close to peak bandwidth on CPUs with 8 CCDs (adding more threads doesn't help).

I've run it on the same CPUs that AMD uses there, following their instructions to the letter, which should be able to achieve 204 GB/s, but the best I've seen is about 165 GB/s. That's 80% of what AMD advertises.

HOWEVER, THIS NUMBER IS A LIE. It is obtained using the AMD compiler with a flag called "-fnt-store" that turns all stores into non-temporal stores. This is something no real-world application would actually do, because it does not make sense in general.

Without this flag, the peak bandwidth that AMD's scripts report is about 124 GB/s. That's 60% of what AMD advertises, and the best that most applications can hope to obtain in practice (the blog post author is really close to it).

The fine print (https://www.amd.com/en/products/cpu/amd-epyc-7742) does say that 204 GB/s is a "theoretical" number.

So I bought a "theoretical" 204 GB/s CPU that delivers 60% of what it "theoretically" promised.

I feel scammed.

---

Instead, if I run STREAM on my GPU, I get close to 100% of what the GPU promises, and about 90% or so of it in practice.


You seem surprised, but this is pretty much the rule for ALL CPUs. The 204 GB/s number is simply DDR4-3200 x 64 bits (width of one memory channel) x 8 (number of memory channels) = 204,800 MB/s, or "roughly" 204 GB/s (assuming GB = 10^9, not 2^30). This is called "peak" bandwidth, and it is a never-to-exceed number.

60-80% of peak is common with tuned code, and less without. Sure, GPUs, which are all about bandwidth, do better: much of a GPU's design is about throughput and hiding latency with multiple requests/threads/whatever. CPUs have different design goals, worry much more about latency, and often have to deal with NUMA and memory coherency. Not to mention that in the general case you often have one cache miss per core, which isn't necessarily enough to saturate the memory bus.

Try an Apple M1, other ARMs, Alpha, PA-RISC, SPARC, Intel, etc., and you'll find the same story.


> This is called "peak" bandwidth, as is a never to exceed number.

This is the peak bandwidth of DDR4, saying that would be fine.

AMD claims that this is the theoretical peak bandwidth of their CPU, not of DDR4, yet their CPUs' memory subsystems are too poor to even get close to this number.

> Try apple M1, other ARMs, Alpha, PA-risc, Sparc, Intel, etc and you'll find the same story.

I tried Ampere Altra and Graviton, and they achieve ~170-180 GB/s without non-temporal stores. You can get an AWS Graviton instance and verify it yourself.

That's 90% of "theoretical DDR4 peak" for ARM vs 60% for AMD.

So no, I disagree. This is not the case for ALL CPUs; it is the case for AMD CPUs. And no, 60% vs 90% "out of the box" perf is not the same.


Arm does have an advantage with a looser memory model, which makes it easier to get good bandwidth. Graviton2 and Ampere Altra tend to have more cores than common x86-64 servers (64-128), which helps as well.

90% is more than I'd expect though, can you post the numbers and which code you used? Enough to replicate (compiler, compiler flags, code, etc).

I've tracked down lower-than-expected numbers before; there are many causes: compilers (temporal vs non-temporal stores; GCC vs Intel vs others), surprisingly large differences from using new or malloc instead of static arrays, having two DIMMs per channel (lowering the speed to DDR4-2666), and various BIOS settings, including a NUMA-disable that stripes across all channels and apparently makes some Windows code run better. In a surprising number of cases, just having DIMMs in the wrong slots will do it.


Virginia Tech STREAM, stock, with OMP_NUM_THREADS set to the number of cores (these don't have SMT), clang-12 with -O3 -fopenmp -DNDEBUG.

> Arm does have an advantage

As I said, claiming that the theoretical peak of a CPU matches the theoretical peak of DDR4 is OK as long as the CPU can actually achieve it.

Getting 50-60% just means AMD is lying.

Particularly when on other vendors like Amazon and Ampere one gets 90+% of what they advertise with stock compilers and stock flags.

Having to use AMD's proprietary compiler, with flags that are bad for programs in general (essentially a STREAM cheat mode), to get 70% of what they advertise is just sad in 2022.


I'm somewhat surprised that AMD's STREAM benchmark uses transparent hugepages rather than explicitly allocating hugepages.

Has Linux's transparent_hugepage feature gotten good enough to rely on in today's environments? Maybe it's good enough for STREAM benchmarking (simple memory patterns)?

Or is STREAM unable to use hugepages directly, so we need the OS to guess the appropriate usage?


Yes, it works quite well. Explicit 1 GB hugepages also provide significant improvements for the right workloads. You pretty much have to take advantage of features like these to get anywhere near theoretical maximum throughput in real-world usage, even on the higher-core-count desktop chips.


Those are great links, thanks for sharing! My memcpy-s were even slower :)



