> Since the bigger network contains the smaller network, it is perfectly capable of achieving the same performance, so the only reason why this does not happen is that SGD cannot find it.

This may be true in the limit of infinite data, but it isn't true in any practical sense, and I don't think it has anything to do with SGD. E.g. polynomial basis functions have the same nesting property (a degree-n polynomial contains every lower-degree polynomial as a special case), but you can't use an arbitrarily high polynomial order or you'll eventually overfit. Polynomial regression has a closed-form least-squares solution, so no SGD is involved.
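To make the polynomial point concrete, here is a minimal sketch, not a benchmark. Everything in it is my own assumption for illustration: the target function `sin(pi * x)`, the noise level, the 10 training points, and the helper names `fit_poly` and `mse`. The fit uses NumPy's `lstsq`, which computes the least-squares solution directly, so there is no iterative optimizer that could "fail to find" the smaller model.

```python
import numpy as np


def fit_poly(x, y, degree):
    """Least-squares polynomial fit; coefficients are returned lowest order first."""
    # Columns of X are 1, x, x^2, ..., x^degree, so lower degrees are nested in higher ones.
    X = np.polynomial.polynomial.polyvander(x, degree)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def mse(x, y, coef):
    """Mean squared error of the polynomial with coefficients `coef` on (x, y)."""
    pred = np.polynomial.polynomial.polyval(x, coef)
    return float(np.mean((pred - y) ** 2))


# Assumed toy setup: noisy samples of sin(pi * x) on [-1, 1].
rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 10)
x_test = np.linspace(-1.0, 1.0, 200)
y_train = np.sin(np.pi * x_train) + rng.normal(scale=0.3, size=x_train.shape)
y_test = np.sin(np.pi * x_test) + rng.normal(scale=0.3, size=x_test.shape)

results = {}
for degree in (1, 3, 9):
    coef = fit_poly(x_train, y_train, degree)
    results[degree] = (mse(x_train, y_train, coef), mse(x_test, y_test, coef))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

Training error can only go down as the degree grows, because each larger model contains the smaller ones, and with 10 points the degree-9 fit interpolates the noise exactly. Held-out error typically gets worse at that point, even though the solver found the exact optimum of the training objective.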