I really do not understand your post. One doesn't train on the test set, so of course it's reasonable to think that increasing the number of parameters will cause more overfitting.
And yet that is not what is observed in practice. See figures 1 and 2 of that paper [1].
What I am complaining about is that the authors are confusing "what can be represented" by a neural network with "what can be easily learned" via SGD. I argued that the peak in test loss in those figures is observed because SGD struggles to find a solution that generalizes well, not because those networks are intrinsically less powerful (as the paper seems to imply).
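To be concrete about the kind of peak I mean, here is a rough toy sketch (not the paper's setup; just minimum-norm least squares on random Fourier features, as a crude stand-in for an overparameterized network with a fixed feature map). In this kind of setup the test error tends to spike when the number of features is close to the number of training points and then drop again for much wider models; the exact shape depends on the noise level and the random seed.

    # Toy sketch of a test-error peak near the interpolation threshold.
    # Not the paper's architecture: min-norm least squares on random
    # Fourier features, with only the linear readout being fit.
    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n, noise=0.1):
        x = rng.uniform(-1, 1, size=(n, 1))
        y = np.sin(3 * x).ravel() + noise * rng.standard_normal(n)
        return x, y

    def features(x, omegas, phases):
        # Fixed random frequencies and phases; only the readout weights are learned.
        return np.cos(x @ omegas + phases)

    x_train, y_train = make_data(40)
    x_test, y_test = make_data(1000)

    for n_feat in [5, 10, 20, 35, 40, 45, 80, 200, 1000]:
        omegas = rng.standard_normal((1, n_feat)) * 5.0
        phases = rng.uniform(0, 2 * np.pi, size=n_feat)
        Phi_train = features(x_train, omegas, phases)
        Phi_test = features(x_test, omegas, phases)
        # lstsq returns the minimum-norm solution in the overparameterized regime
        w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
        mse = np.mean((Phi_test @ w - y_test) ** 2)
        print(f"{n_feat:5d} features: test MSE = {mse:.3f}")

The test MSE typically peaks around 40 features (the number of training points) and falls again for much larger feature counts, which is the shape I'm attributing to the optimization/estimation procedure rather than to the smaller models being unable to represent a good solution.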
> And yet that is not what is observed in practice. See figures 1 and 2 of that paper
Yeah... that's why the paper is a good contribution, contrary to what you're saying. Not sure why you're repeating this information.
> What I am complaining about is that the authors are confusing "what can be represented" by a neural network with "what can be easily learned" via SGD. I argued that the peak in test loss in those figures is observed because SGD struggles to find a solution that generalizes well, not because those networks are intrinsically less powerful (as the paper seems to imply).
I mean... those models are by definition less powerful, as they have fewer parameters. The main point of the paper (to me) is to point out an interesting symptom. Their explanation for the symptom being (maybe) wrong doesn't detract from the important work of showing that the symptom exists.
A legitimate criticism would be that there have been earlier papers showing the same thing.
Sidenote: it seems having a larger model does make it easier for SGD to find good solutions [1].