That doesn't follow. Shallow networks can be harder to train than deep ones, which is one of the old arguments for training a deep NN despite its many disadvantages (like latency - often a matter of life and death for a biological organism): the depth makes learning easier.
This is why today, if you need a low-latency NN - which means a shallow one - your best bet is often to train a deep one first and then distill or prune it down into a shallow one. Training the deep one is so much easier, while training a shallow one from scratch, without relying on depth, remains an open research question and is often effectively impossible.
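The "distill" step can be sketched as Hinton-style knowledge distillation: the shallow student is trained to match the deep teacher's temperature-softened output distribution rather than the hard labels. A minimal numpy sketch of the distillation loss (the function names and the temperature value are illustrative, not from any particular library):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T softens the distribution,
    # exposing the teacher's "dark knowledge" about wrong classes.
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # KL divergence between the temperature-softened teacher and student
    # distributions, scaled by T^2 so gradient magnitudes stay comparable
    # to a hard-label cross-entropy term when the two are mixed.
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return (T ** 2) * kl.mean()

# The loss is zero when the student reproduces the teacher's logits,
# and grows as the student's distribution drifts away from the teacher's.
teacher = np.array([[2.0, 1.0, 0.1]])
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss(np.array([[0.1, 1.0, 2.0]]), teacher)
```

In practice this term is usually combined with a small amount of ordinary hard-label loss, and the student is optimized by gradient descent as usual; the point is only that the shallow network gets a much richer training signal from the teacher's full distribution than from one-hot labels alone.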