Excellent article and a really clear view of statistical model training dynamics. This perspective will no doubt contribute to the development of deep learning theory.

I'm especially interested in the lessons we can learn about the success of overparametrization. As mentioned at the beginning of the article:

> To use the picturesque idea of a "loss landscape" over parameter space, our problem will have a ridge of equally performing parameters rather than just a single optimal peak.

It has always been my intuition that overparametrization makes this ridge an overwhelming statistical majority of the parameter space, which would explain why training succeeds. What is less clear, as mentioned at the end, is why it also hedges against overfitting. Could it be that "simple" function combinations are also overwhelmingly statistically likely versus more complicated ones? I'm imagining a hypersphere-in-many-dimensions kind of situation, where the "corners" are just too sharp to stay in for long before descending back into the "bulk".
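
As a toy check of that hypersphere intuition (my own sketch, not from the article; it assumes only numpy): in high dimensions almost all of a ball's volume sits in a thin shell near its boundary, so a random point is overwhelmingly likely to land in the "bulk" of that shell rather than anywhere special.

    # Toy Monte Carlo illustration (mine, not the article's): in high
    # dimensions nearly all of the unit ball's volume lies in the thin
    # shell between radius 0.9 and 1.
    import numpy as np

    rng = np.random.default_rng(0)

    def fraction_in_outer_shell(dim, n_samples=10_000, inner_radius=0.9):
        # Uniform samples from the unit ball: a random direction times a
        # radius drawn as u**(1/dim), which gives the correct radial law.
        directions = rng.normal(size=(n_samples, dim))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = rng.uniform(size=(n_samples, 1)) ** (1.0 / dim)
        points = directions * radii
        return np.mean(np.linalg.norm(points, axis=1) > inner_radius)

    for dim in (2, 10, 100, 500):
        print(dim, fraction_in_outer_shell(dim))
    # The fraction tends to 1 - 0.9**dim, i.e. essentially 1 for large dim.

Whether that concentration-of-measure picture actually transfers to the geometry of the loss ridge is exactly the open question, of course.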

Interested to hear others' perspectives or pointers to research on this in the context of a kernel-based interpretation. I hope understanding overparametrization may also go some way toward explaining the unreasonable effectiveness of analog-based learning systems such as human brains.



Reflecting a bit more, I think the key lies close to this notion, quoted from the article:

> Since ker Π can be described as the orthogonal complement to the set {K_{t_i}}, the orthogonal complement to ker Π is exactly the closure of the span of the vectors K_{t_i}.

The set {K_{t_i}} is going to be very large in the overparametrized case, so its orthogonal complement, ker Π, will be correspondingly small. Note also this part:

> Because v is chosen with minimal norm [in the context of the corresponding RKHS], it cannot be made smaller by adjusting it by an element of ker Π...

So it sounds like all the "capacity" is taken up by representing the function itself, and, seemingly paradoxically, the coefficients λ_i are further constrained by the implicit regularization imposed by gradient descent (which hypothetically enforces the minimal-norm constraint). The space of functions that can possibly fit is therefore tiny. The rub in practical applications is that many combinations of NN parameters p can correspond to a single set of coefficients λ_i in this kernel space, so the connection between p and λ (via f?) seems key to understanding the core of the issue.
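
To make the minimal-norm / implicit-regularization point concrete, here is a hedged sketch (my own toy setup, not from the article, assuming numpy and a fixed linear feature map Phi standing in for the kernel/lazy regime): gradient descent started from zero on an overparametrized least-squares problem never leaves the span of the data, so it lands on the minimum-norm interpolant, which in kernel form has coefficients lambda = K^{-1} y.

    # Hedged sketch (my notation): gradient descent from w = 0 on an
    # overparametrized linear least-squares problem converges to the
    # minimum-norm interpolant, i.e. w = Phi^T lambda with
    # lambda = K^{-1} y and K = Phi Phi^T.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, 200                       # n data points, p >> n parameters
    Phi = rng.normal(size=(n, p))        # fixed feature map (linear stand-in)
    y = rng.normal(size=n)

    # Plain gradient descent on 0.5 * ||Phi w - y||^2, starting from zero.
    w = np.zeros(p)
    lr = 1e-3
    for _ in range(10_000):
        w -= lr * Phi.T @ (Phi @ w - y)

    # Minimum-norm interpolant written in kernel form.
    K = Phi @ Phi.T
    lam = np.linalg.solve(K, y)
    w_min_norm = Phi.T @ lam

    print(np.max(np.abs(Phi @ w - y)))       # ~0: the fit interpolates
    print(np.max(np.abs(w - w_min_norm)))    # ~0: GD picked the min-norm solution

In the NN case the map from p to f is nonlinear, which is where (as far as I understand it) the p-versus-λ correspondence stops being this clean.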



