It's not an increase in the capacity of per-GPU "GPU memory" (the HBM directly attached to the H100 here is up to 96GB, versus 80GB in the previous generation); rather, it reflects the product of two things:
1. Each node here is a more tightly coupled CPU+GPU two-chip pairing, and the CPU side has a significantly larger pool of 480GB of LPDDR ("regular" RAM). So each GPU is part of a node that includes up to 480+96GB of total memory.
2. There are way more nodes: 256, up from 8 (rough totals sketched below).
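For scale, here is a quick back-of-the-envelope on those two factors. It uses only the figures quoted above; everything else in the snippet is illustrative:

```cuda
#include <cstdio>

int main() {
    // Figures from the comment above (not a spec sheet): 480GB of LPDDR on the
    // CPU side plus 96GB of HBM on the GPU side, times 256 nodes.
    const double lpddr_gb = 480.0;
    const double hbm_gb   = 96.0;
    const int    nodes    = 256;

    const double per_node_gb = lpddr_gb + hbm_gb;    // 576 GB per CPU+GPU node
    const double total_gb    = per_node_gb * nodes;  // 147,456 GB across the machine
    const double total_tb    = total_gb / 1024.0;    // ~144 TB if you pool everything

    std::printf("per node: %.0f GB, system total: %.0f GB (~%.0f TB)\n",
                per_node_gb, total_gb, total_tb);
    return 0;
}
```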
Is this memory unified like Apple Silicon? Meaning, can a model be deployed onto 576GB of total memory? Can the GPU read directly from the 480GB pool? Same question for the CPU being able to directly access the 96GB of HBM.
It should be mapped as one address space, so yes to the question about loading a model across the full pool. It's not fully unified, though; at this scale of computer it's simply impossible to put hundreds of GB on an SoC like that. Instead, the GPU and CPU have DMA over PCIe and NVLink, which is plenty fast for AI and scientific compute purposes. "Unified memory" doesn't make much sense for supercomputers this large.
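For a concrete picture of what "mapped as one address space" looks like to the programmer, here's a minimal sketch using plain CUDA managed memory: one pointer is valid on both the CPU and GPU, and the driver migrates or maps pages over the interconnect behind the scenes. Nothing here is GH200-specific; the sizes and names are arbitrary:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel that touches every element from the GPU side.
__global__ void scale(float* data, size_t n, float factor) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));    // one pointer, valid on CPU and GPU

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // CPU writes through the shared mapping

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // GPU reads/writes the same pointer
    cudaDeviceSynchronize();

    std::printf("data[0] = %f\n", data[0]);         // CPU reads the GPU's result back
    cudaFree(data);
    return 0;
}
```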
This device has a fully switched fabric allowing comms between any of the 256 "superchip" nodes at 900GB/s. That is dramatically faster than a direct host-to-GPU 32-lane PCIe connection (which is crazy), and obviously dwarfs any existing machine-to-machine connectivity. The actual usability of shared memory across the array is improved significantly.
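To make "shared memory across the array" a bit more concrete: within a single host, direct GPU-to-GPU transfers over NVLink are exposed through CUDA's peer-access API. A minimal sketch follows (device indices and buffer size are arbitrary, error checking is omitted, and this is the single-host API; across a multi-node fabric you'd normally reach for NCCL or NVSHMEM instead):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Whether the copy actually rides NVLink (rather than PCIe) depends on the
    // topology of the box; this only demonstrates the programming model.
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 2) {
        std::printf("need at least two GPUs for this sketch\n");
        return 0;
    }

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) {
        std::printf("no peer access between devices 0 and 1 on this box\n");
        return 0;
    }

    const size_t bytes = 256ull << 20;  // 256 MB test buffer
    float *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // let device 0 address device 1's memory
    cudaMalloc(&src, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);   // and vice versa
    cudaMalloc(&dst, bytes);

    // Device-to-device DMA: no bounce through host memory once P2P is enabled.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();
    std::printf("copied %zu MB from GPU 0 to GPU 1\n", bytes >> 20);

    cudaSetDevice(0);
    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return 0;
}
```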
I mean... Nvidia has obviously been using DMA for decades. This isn't just DMA.
No, I mean the fact that Nvidia is now claiming that the memory the CPU has access to can be counted as memory for the GPU. The fabric is neat; the "we have 500GB of RAM per GPU" claim is questionable.