Yes, if you want to optimize for a worst-case MPI cluster, then a Pi (4) might be optimal for you (because sadly, 4 measly ARM cores with 100 Mbit/s are orders of magnitude removed from 100 cores and 100 Gbit/s InfiniBand). But then you could also use a stack of old desktops, which is cheaper, and you can just throw on a standard image and everything (including CUDA and MKL) can work.
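For a rough sense of the gap, here is a back-of-the-envelope sketch using only the figures quoted above. Treating them as per-node values, and all variable names, are my own assumptions for illustration, not measurements.

```python
import math

# Figures quoted in the comment; treating them as per-node values is an assumption.
pi_cores, pi_mbit = 4, 100             # Pi: 4 ARM cores, 100 Mbit/s
hpc_cores, hpc_mbit = 100, 100_000     # HPC node: 100 cores, 100 Gbit/s in Mbit/s

# Raw link bandwidth gap between the two nodes.
bandwidth_ratio = hpc_mbit / pi_mbit

# Core-count gap between the two nodes.
core_ratio = hpc_cores / pi_cores

# Bandwidth available per core, HPC node relative to the Pi.
per_core_ratio = (hpc_mbit / hpc_cores) / (pi_mbit / pi_cores)

# The bandwidth gap expressed in orders of magnitude.
orders_of_magnitude = math.log10(bandwidth_ratio)

print(f"bandwidth: {bandwidth_ratio:.0f}x, cores: {core_ratio:.0f}x, "
      f"per-core bandwidth: {per_core_ratio:.0f}x")
```

So the network gap is about three orders of magnitude on its own, and even per core the Pi has far less bandwidth to work with, which is what makes it a "worst case" to optimize against.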
Virtualising thousands of CPUs on a single machine is still not trivial.
Never mind simulating real-world network issues.
https://www.bitscope.com/blog/FM/?p=GF19A