Xilinx HBM2 Internals (2023) (lovehindpa.ws)
85 points by hasheddan on May 9, 2024 | 20 comments


I wonder if the author is doing anything to overclock the HBM here or if this is within the ratings of the Samsung HBM stacks. It's nice to be able to do this when you have a few cards, but if you are working with hundreds, it may not be practical to push the HBM this far without overvolting them a bit.


I automated the tuning of 150k GPUs that were being used to mine Ethereum.

The trick was that, as a whole, you knew the limits of the hardware: you knew how to set the knobs for max performance. Due to the silicon lottery, cards that can't perform at max end up crashing.

So what I did was kind of the opposite of what everyone else was doing: I first set everything to max, watched for a crash, then tuned the knobs a bit lower. All of this was done with an automated piece of software that I built. The cards we used essentially had three knobs to twist, which resulted in hundreds of combinations. Eventually a card stops crashing, and you know you're at the right settings for that individual piece of hardware.
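
A minimal sketch of that back-off loop in Python, purely illustrative: the knob names, ranges, and the run_and_watch() crash detector are all hypothetical, not from the actual setup described above.

    import itertools

    # Hypothetical knob ranges, most aggressive settings first (MHz, MHz, mV).
    CORE_CLOCKS = [1200, 1150, 1100, 1050]
    MEM_CLOCKS = [2100, 2050, 2000, 1950]
    VOLTAGES = [750, 762, 775]

    def run_and_watch(card, core, mem, mv, soak_s=3600):
        # Placeholder: apply the settings, run the workload for soak_s
        # seconds, and return True only if the card never crashed or hung.
        raise NotImplementedError("hook up your own crash detection here")

    def autotune(card):
        # Walk the combinations roughly from fastest to slowest
        # (lexicographically); the first one that survives the soak
        # test becomes the setting for THIS card.
        for core, mem, mv in itertools.product(CORE_CLOCKS, MEM_CLOCKS, VOLTAGES):
            if run_and_watch(card, core, mem, mv):
                return core, mem, mv
        raise RuntimeError(f"{card}: no stable combination found")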

We were running in seasonal climates too... so each winter/summer, I'd reset things and let it auto-tune again. Heat is a huge factor in stability.

Of course, each workload has different optimal settings too... so that plays into it, but with everything else static, this ended up being a great way to do things.


That seems great if a failure always results in a crash. There are a ton of failure modes where the result is just silently wrong.


To my knowledge, HPC rarely tunes cards for max performance. My MI300Xs are at stock settings, and I doubt I'll ever modify them.


Interesting, I generally assumed Eth miners would undervolt their GPUs to get more life out of them rather than overclocking them for absolute max performance.


Undervolt / overclock / memory timings


Author here. I did overclock it - that was one of the points of the writeup: when you modify the memory clock, you should change the timings along with it, because they are often specified in tCK (ticks of the memory clock), and as such they will change when the clock changes.
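
To make the tCK point concrete (my numbers, not from the writeup): a timing that must cover a fixed number of nanoseconds needs more ticks at a higher clock, so a tick count tuned for 900MHz is too small at 1200MHz. A quick Python sketch with a made-up 45ns timing:

    import math

    def ns_to_tck(t_ns, clock_mhz):
        # Minimum whole clock ticks needed to span t_ns at clock_mhz.
        return math.ceil(t_ns * clock_mhz / 1000.0)

    # Hypothetical 45ns DRAM timing requirement at three memory clocks:
    for mhz in (900, 1100, 1200):
        print(mhz, "MHz ->", ns_to_tck(45, mhz), "tCK")
    # 900 MHz -> 41 tCK
    # 1100 MHz -> 50 tCK
    # 1200 MHz -> 54 tCK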

I have reliable information from folks with several thousand of these FPGAs that they dependably clock to 1100MHz - 1150MHz on the HBM2 at stock voltage (or a bit less). This falls in line with my personal experience - I have seven XCVU35P FPGAs, and they range from handling only 1100MHz, through 1150MHz, to a few managing 1200MHz.

Samsung's documentation specifies this HBM2 for 1000MHz to 1100MHz, based on binning - this is why I was annoyed that Xilinx limited it to 900MHz, and worked to learn how to change the PLL settings.


I am also aware that Xilinx sets their own clock specs annoyingly conservatively, and I think they do it to preserve device lifetime or something similar. However, I did want to clarify whether you were overvolting these things or just raising the clock frequency.

I have run into issues where you get a dud FPGA that is just a lot slower than other FPGAs in its speed bin (it must have come from the edge of the wafer or something), and debugging that is pretty annoying.


As I said explicitly, stock voltage (or less).


Correction: 1000MHz or 1200MHz, depending on binning.


I'm not an expert on memory interfaces. How do you use HBM2's 1024-bit interface when you have ~200 I/O pins on a Zynq UltraScale+? Are these pseudo-channels a SerDes for the HBM2 bus?


The HBM stacks are on-package for these parts, so you don't have to use any external I/O to interface with them.

You end up with a similar challenge accessing that much bandwidth internally from your FPGA logic, though; it looks like the Xilinx HBM IP presents a set of 16 or 32 separate AXI interfaces, each of which gives you about 14.4GB/s of bandwidth (https://docs.amd.com/r/en-US/pg276-axi-hbm/Introduction).
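
Sanity-checking those figures (my arithmetic, and the 450MHz AXI clock is my assumption of the IP's maximum): a 256-bit AXI port at 450MHz moves 32 bytes per cycle, which lines up with the ~14.4GB/s per-port number, and 32 such ports give roughly 460GB/s aggregate.

    # Per-port and aggregate HBM bandwidth from the figures above.
    axi_width_bits = 256
    axi_clock_hz = 450e6   # assumed max AXI clock for the HBM IP
    ports = 32

    per_port = axi_width_bits / 8 * axi_clock_hz
    print(per_port / 1e9)          # 14.4 GB/s per port
    print(ports * per_port / 1e9)  # 460.8 GB/s aggregate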


Look at the (non-Zynq) VCU128 board for an example. The HBM2 is on the PL side, and the interconnect is via a die-to-die interface. So the 32 AXI3 interfaces to HBM2 here are hard silicon, not FPGA I/O pins.


I feel like domains are pretty cheap, so it would be easy to separate your fetishes from your work life.


I made the mistake of looking at the gallery. NSFW.


There are no mistakes, just happy little accidents.


You’re right, I shouldn’t have said mistake.

The context switch nearly gave me whiplash, tho.


> Conventions

> MiB = Megabytes (2^20 bytes)

> Gb = Gigabits (2^27 bytes, or 128MiB)

> GiB = Gibibytes (2^30 bytes)

Shouldn't MiB be Mebibytes then?


Yeah, I found the Conventions section baffling. It defines both "giga" and "gibi" as 2^30! Why define both if they have the same value? Then it confuses the issue further by using "gibi" when the underlying unit is bytes, and "giga" when it is bits, which, given those definitions, doesn't convey any difference in meaning.
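
For reference, the standard binary-prefix definitions the article presumably intended (my summary, not from the article): mebi = 2^20, gibi = 2^30, applied to bits (b) or bytes (B) - so the article's "Gb = 2^27 bytes" is really a gibibit. A quick check in Python:

    MiB = 2**20        # mebibyte: 2^20 bytes
    Gib = 2**30 // 8   # gibibit, expressed in bytes: 2^27 bytes
    GiB = 2**30        # gibibyte: 2^30 bytes

    print(Gib // MiB)  # 128 -> one gibibit is the 128MiB the article cites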


Yeah, I messed that part up.



