
Reliability also depends strongly on current density and applied voltage, perhaps even more than on thermal density itself. So "slowing down" your average GPU use in a long-term sustainable way ought to improve those reliability figures via multiple mechanisms. Jetsons are great for very small-scale self-contained tasks (including on a performance-per-watt basis), but their limits are just as obvious, especially given the recently announced advances in clustering the big server GPUs at the rack and perhaps multi-rack level.
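
For concreteness, one way to "slow down" sustainably is to lock the SM clocks below the boost ceiling, which also lowers voltage and current density. A minimal sketch using the pynvml NVML bindings; the clock numbers are arbitrary illustrations, not recommendations, and changing the limits needs admin/root:

    # Minimal sketch: lock a GPU's SM clocks below the boost ceiling via NVML
    # (pynvml). The 210/1200 MHz range is an arbitrary illustration.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

    # Lock the graphics/SM clock range to [210, 1200] MHz; the driver will not
    # boost above the upper bound.
    pynvml.nvmlDeviceSetGpuLockedClocks(handle, 210, 1200)

    # ... run workloads ...

    # Restore default clock behaviour when done.
    pynvml.nvmlDeviceResetGpuLockedClocks(handle)
    pynvml.nvmlShutdown()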


I don't have first-hand knowledge of HBM GPUs, but on the RTX Blackwell 6000 Pro Server the perf difference between the card running free up to 600W and the same GPU capped at 300W is less than 10% on any workload I could throw at it (including Tensor Core-heavy ones).

That's a very expensive 300W. I wonder what tradeoff made them go for this, and whether capping here is a way to increase reliability. ...

Wonder whether there's any writeup on those additional 300 Watts...
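
For anyone who wants to try the same 600W-vs-300W comparison, here is one way to do it (not necessarily how I ran mine): a sketch assuming pynvml for the power cap and PyTorch for a Tensor Core-heavy matmul. Matrix size, dtype and GPU index are placeholders, and changing the limit needs root.

    # Sketch of a 600W-vs-300W comparison: pynvml sets the board power limit,
    # PyTorch drives a Tensor Core-heavy FP16 matmul. Sizes are placeholders.
    import time
    import pynvml
    import torch

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    def bench_matmul(n=8192, iters=50):
        a = torch.randn(n, n, device="cuda", dtype=torch.float16)
        b = torch.randn(n, n, device="cuda", dtype=torch.float16)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        return iters / (time.perf_counter() - t0)  # matmuls per second

    for watts in (600, 300):
        # NVML takes the limit in milliwatts; requires root privileges.
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, watts * 1000)
        print(f"{watts} W cap: {bench_matmul():.1f} matmul/s")

    pynvml.nvmlShutdown()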


> whether capping is here a way to increase reliability

Almost certainly so, and you wouldn't even need to halve the wattage; even a smaller drop ought to bring a very clear improvement. The performance profile you mention is something you see all the time on CPUs when pushed to their extremes; it's crazy to see that pro-level GPUs are seemingly being tuned the same way out of the box.


It sounds like those workloads are memory bandwidth bound. In my experience with generative models, the compute units end up waiting on VRAM throughput, so throwing more wattage at the cores hits diminishing returns very quickly.
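
A quick sanity check for that intuition is arithmetic intensity (FLOPs per byte moved) against the card's ridge point. A back-of-envelope sketch; the peak numbers are placeholder assumptions, not the specs of the card discussed above:

    # Back-of-envelope roofline check: memory-bound or compute-bound?
    # Peak numbers are placeholder assumptions; substitute your datasheet values.
    PEAK_TFLOPS = 200.0   # assumed dense FP16 tensor-core peak, TFLOP/s
    PEAK_BW_GBS = 1000.0  # assumed memory bandwidth, GB/s
    ridge = PEAK_TFLOPS * 1e12 / (PEAK_BW_GBS * 1e9)  # FLOPs per byte at the ridge

    def intensity_gemv(n, bytes_per_elem=2):
        # Matrix-vector product (typical of token-by-token decoding):
        # ~2*n*n FLOPs while streaming ~n*n weights from memory once.
        return (2 * n * n) / (n * n * bytes_per_elem)

    def intensity_gemm(n, bytes_per_elem=2):
        # Square matrix-matrix product: 2*n^3 FLOPs over ~3*n^2 elements moved.
        return (2 * n**3) / (3 * n * n * bytes_per_elem)

    print(f"ridge point    : {ridge:7.1f} FLOP/byte")
    print(f"GEMV (n=8192)  : {intensity_gemv(8192):7.1f} FLOP/byte  -> memory-bound")
    print(f"GEMM (n=8192)  : {intensity_gemm(8192):7.1f} FLOP/byte  -> compute-bound")

Anything far below the ridge point is stuck waiting on VRAM regardless of how much power the cores are allowed to draw.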


If they were memory bandwidth bound, wouldn't that in itself push the wattage and thermals down comparatively, even on a "pegged to 100%" workload? That's the very clear pattern on CPUs, at least.


That's my experience as well, after monitoring frequency and temperature on lots of kernels across the spectrum from memory-bound to L2-bound to compute-bound. It's hard to reach 600W with a memory-bound kernel. TensorRT manages it somehow with some small-to-mid networks, but the perf increase seems capped at around 10% there too, even with all the magic inside.
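
A minimal sketch of that kind of monitoring via pynvml; poll interval and GPU index are arbitrary choices:

    # Poll SM clock, temperature and board power while a kernel runs.
    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    try:
        while True:
            clk = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)            # MHz
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)  # deg C
            power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0                      # W
            print(f"SM {clk:4d} MHz  {temp:3d} C  {power:6.1f} W")
            time.sleep(0.5)
    except KeyboardInterrupt:
        pynvml.nvmlShutdown()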


I thought so too, but no: these are iterative small-matrix-multiplication kernels on the tensor cores, or pure (generative) compute with an ultra-late reduction and an ultra-small working set. nsight-compute says everything lives in L1 or the small register file, no spilling, and that I'm compute bound with good ILP. I can't find a way to get more than 10% out of the 300W difference. Hence the question whether anyone did better, how, and how reliable the HW stays.
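
For anyone wanting to run the same check, a rough sketch of the corresponding nsight-compute query: the metric names come from ncu's "Speed of Light" section and may vary by version (run `ncu --query-metrics` to confirm), and ./my_app is a placeholder for the actual binary.

    # Compare compute (SM) vs DRAM "Speed of Light" throughput for a kernel.
    import subprocess

    METRICS = ",".join([
        "sm__throughput.avg.pct_of_peak_sustained_elapsed",    # compute pipes
        "dram__throughput.avg.pct_of_peak_sustained_elapsed",  # device memory
    ])

    # A high SM percentage with low DRAM throughput suggests compute-bound;
    # the reverse suggests memory-bound.
    subprocess.run(["ncu", "--metrics", METRICS, "--csv", "./my_app"], check=True)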



