Some workloads are data parallel, and those scale much like the GPU-heavy scale-out nodes you describe: each node holds a full copy of the model and trains on its own slice of the data.
The other approach, which you use when the model itself is too large to fit on one node, is model parallelism: you split the model into pieces that run on different nodes.
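A minimal JAX sketch of the two approaches (the shapes, layer sizes, and device setup are made up for illustration, not taken from any particular system):

```python
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()

# Data parallelism: every device gets the full weight matrix (replicated by
# pmap) and a different shard of the batch.
w = jnp.ones((128, 128))
batch = jnp.ones((n_dev, 32, 128))        # leading axis = one shard per device

@jax.pmap
def forward_data_parallel(x):
    return x @ w                          # w is broadcast to each device

y = forward_data_parallel(batch)          # shape (n_dev, 32, 128)

# Model parallelism: the weight matrix itself is sliced across devices; each
# device computes only its column slice of the output.
w_shards = jnp.ones((n_dev, 128, 128 // n_dev))
x_repl = jnp.broadcast_to(jnp.ones((32, 128)), (n_dev, 32, 128))

@jax.pmap
def forward_model_parallel(x, w_shard):
    return x @ w_shard                    # partial output, one slice per device

y_parts = forward_model_parallel(x_repl, w_shards)  # slices are gathered/concatenated afterwards
```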
In both cases you need to distribute weight updates through the network, although the traffic patterns can be quite different.
To maximize performance in both scenarios, systems designers optimize for all-reduce performance and bisection bandwidth.
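To make the data-parallel case concrete, here is a hedged sketch of where the all-reduce shows up in a training step. It uses JAX's lax.pmean as the all-reduce; the loss function, shapes, and learning rate are placeholders I picked for the example:

```python
from functools import partial

import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()

def loss_fn(w, x, y):
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="batch")
def train_step(w, x, y):
    # Each device computes gradients on its own batch shard...
    grads = jax.grad(loss_fn)(w, x, y)
    # ...then an all-reduce (mean over the 'batch' axis) synchronizes them so
    # every replica applies the identical weight update.
    grads = jax.lax.pmean(grads, axis_name="batch")
    return w - 0.1 * grads

w = jnp.broadcast_to(jnp.zeros((8, 1)), (n_dev, 8, 1))   # replicated weights
x = jnp.ones((n_dev, 16, 8))                             # sharded inputs
y = jnp.ones((n_dev, 16, 1))                             # sharded targets
w = train_step(w, x, y)
```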
There are also other tricks: for example, the TPUv4 ICI network is optically switched, and it is configured when a workload starts to maximize bandwidth for the requested topology ("twisting the torus" in the published paper).