This paragraph is strange; it’s an attempt at a comparison, but nothing is really held constant:
> Training took 1.82 minutes with 256 fourth-gen TPUs, only slightly slower than the 0.39 minutes it took with 4,096 third-gen TPUs. Meanwhile, achieving a 0.81-minute training time with Nvidia hardware required 2,048 A100 cards and 512 AMD Epyc 7742 CPU cores.
And “0.81 minutes” is one of the stranger things I’ve read today.
MLPerf runs are rarely done with a single device; they’re done with a system or cluster, and these seem to just be the best MLPerf 0.7 runs for each type of system (ASIC, GPU, CPU), not an attempt at a normalized "points per something" metric. The results are in minutes because this is top-of-the-line gear in each case; MLPerf results normally range from days down to minutes, so pulled out of that context the numbers look a bit odd.
you are most likely correct!
I read ‘Training took 1.82 minutes with 256 fourth-gen TPUs’ to mean that using all 256 TPUs, the training took 1.82 minutes of wall-clock time, not 466 minutes.
That means each TPU contributed 1/256 of the work, assuming it scales linearly.
Even if it doesn’t, the 256 TPUs ran in parallel, not sequentially, right?
Note that this assumes training speed is perfectly linear with the number of accelerators, which is not true as you get to very large counts (like 4,096!). So the true number should be smaller than the 3.43x above, and the reported 2.7x makes sense.
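For what it’s worth, here’s the back-of-the-envelope arithmetic behind those numbers; a rough sketch that assumes perfectly linear scaling, which (as noted) overstates the real per-chip gain at very large chip counts:

```python
# Per-chip comparison from the MLPerf 0.7 numbers quoted above.
# Assumes perfectly linear scaling with accelerator count.

tpu_v4_chips, tpu_v4_minutes = 256, 1.82
tpu_v3_chips, tpu_v3_minutes = 4096, 0.39

# Total "chip-minutes" of work, i.e. what a single chip would take sequentially.
v4_chip_minutes = tpu_v4_chips * tpu_v4_minutes   # ~466 chip-minutes
v3_chip_minutes = tpu_v3_chips * tpu_v3_minutes   # ~1597 chip-minutes

# Naive per-chip speedup of TPU v4 over TPU v3 under linear scaling.
naive_speedup = v3_chip_minutes / v4_chip_minutes  # ~3.43x

print(f"TPU v4: {v4_chip_minutes:.0f} chip-minutes")
print(f"TPU v3: {v3_chip_minutes:.0f} chip-minutes")
print(f"Naive per-chip speedup: {naive_speedup:.2f}x (reported: ~2.7x)")
```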