This paragraph is strange; it’s an attempt at a comparison, but nothing is really held constant:
> Training took 1.82 minutes with 256 fourth-gen TPUs, only slightly slower than the 0.39 minutes it took with 4,096 third-gen TPUs. Meanwhile, achieving a 0.81-minute training time with Nvidia hardware required 2,048 A100 cards and 512 AMD Epyc 7742 CPU cores.
And “0.81 minutes” is one of the stranger things I’ve read today.
MLPerf runs are rarely done with a single device; they’re done with a system or cluster, and these seem to just be the best MLPerf 0.7 runs for each type of system (ASIC, GPU, CPU), not an attempt at a normalized "points per something" metric. The results are in minutes because this is top-of-the-line gear in each case; MLPerf results normally range from days down to minutes, so pulled out of that context the numbers look a bit odd.
you are most likely correct!
I read ‘Training took 1.82 minutes with 256 fourth-gen TPUs’ to mean that using all 256 TPUs, the training took 1.82 minutes of wall-clock time, not 466 minutes.
That means each TPU contributed 1/256 of the work, assuming it scales linearly.
Even if it doesn’t, the 256 TPUs ran in parallel, not sequentially, right?
Note that this assumes training speed is perfectly linear with the number of accelerators, which is not true as you get to very large counts (like 4,096!). So the true number should be smaller than the 3.43x above, and the reported 2.7x makes sense.
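For what it’s worth, here’s the back-of-the-envelope arithmetic behind those numbers; a rough sketch that assumes perfectly linear scaling, which (as noted) overstates the real per-chip gain at very large chip counts:

```python
# Per-chip comparison from the MLPerf 0.7 numbers quoted above.
# Assumes perfectly linear scaling with accelerator count.

tpu_v4_chips, tpu_v4_minutes = 256, 1.82
tpu_v3_chips, tpu_v3_minutes = 4096, 0.39

# Total "chip-minutes" of work, i.e. what a single chip would take sequentially.
v4_chip_minutes = tpu_v4_chips * tpu_v4_minutes   # ~466 chip-minutes
v3_chip_minutes = tpu_v3_chips * tpu_v3_minutes   # ~1597 chip-minutes

# Naive per-chip speedup of TPU v4 over TPU v3 under linear scaling.
naive_speedup = v3_chip_minutes / v4_chip_minutes  # ~3.43x

print(f"TPU v4: {v4_chip_minutes:.0f} chip-minutes")
print(f"TPU v3: {v3_chip_minutes:.0f} chip-minutes")
print(f"Naive per-chip speedup: {naive_speedup:.2f}x (reported: ~2.7x)")
```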