I really do not understand your post. One doesn't train on the test set, so of course it's reasonable to think that increasing the number of parameters will cause more overfitting.
And yet that is not what is observed in practice. See figures 1 and 2 of that paper [1].
What I am complaining about is that the authors are confusing "what can be represented" by a neural network with "what can be easily learned" via SGD. I argued that the peak in test loss in those figures is observed because SGD struggles to find a solution that generalizes well, not because those networks are intrinsically less powerful (as the paper seems to imply).
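To be concrete about the kind of peak I mean, here is a rough toy sketch (not the paper's setup; just minimum-norm least squares on random Fourier features, as a crude stand-in for an overparameterized network with a fixed feature map). In this kind of setup the test error tends to spike when the number of features is close to the number of training points and then drop again for much wider models; the exact shape depends on the noise level and the random seed.

    # Toy sketch of a test-error peak near the interpolation threshold.
    # Not the paper's architecture: min-norm least squares on random
    # Fourier features, with only the linear readout being fit.
    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n, noise=0.1):
        x = rng.uniform(-1, 1, size=(n, 1))
        y = np.sin(3 * x).ravel() + noise * rng.standard_normal(n)
        return x, y

    def features(x, omegas, phases):
        # Fixed random frequencies and phases; only the readout weights are learned.
        return np.cos(x @ omegas + phases)

    x_train, y_train = make_data(40)
    x_test, y_test = make_data(1000)

    for n_feat in [5, 10, 20, 35, 40, 45, 80, 200, 1000]:
        omegas = rng.standard_normal((1, n_feat)) * 5.0
        phases = rng.uniform(0, 2 * np.pi, size=n_feat)
        Phi_train = features(x_train, omegas, phases)
        Phi_test = features(x_test, omegas, phases)
        # lstsq returns the minimum-norm solution in the overparameterized regime
        w, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
        mse = np.mean((Phi_test @ w - y_test) ** 2)
        print(f"{n_feat:5d} features: test MSE = {mse:.3f}")

The test MSE typically peaks around 40 features (the number of training points) and falls again for much larger feature counts, which is the shape I'm attributing to the optimization/estimation procedure rather than to the smaller models being unable to represent a good solution.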
> And yet that is not what is observed in practice. See figures 1 and 2 of that paper
Yeah... that's why the paper is a good contribution, contrary to what you're saying. Not sure why you're repeating this information.
> What I am complaining about is that the authors are confusing "what can be represented" by a neural network with "what can be easily learned" via SGD. I argued that the peak in test loss in those figures is observed because SGD struggles to find a solution that generalizes well, not because those networks are intrinsically less powerful (as the paper seems to imply).
I mean... those models are by definition less powerful, as they have fewer parameters. The main point of the paper (to me) is to point out an interesting symptom. Their explanation for the symptom being (maybe) wrong doesn't detract from the important work of showing that the symptom exists.
A legitimate criticism would be that there have been earlier papers showing the same thing.
Sidenote: it seems having a larger model does make it easier for SGD to find good solutions [1].