Excellent article and a really clear view of statistical model training dynamics. This perspective will no doubt contribute to the development of deep learning theory.

I'm especially interested in the lessons we can learn about the success of overparametrization. As mentioned at the beginning of the article:

> To use the picturesque idea of a "loss landscape" over parameter space, our problem will have a ridge of equally performing parameters rather than just a single optimal peak.

It has always been my intuition that overparametrization makes this ridge an overwhelming statistical majority of the parameter space, which would explain why training succeeds. What is less clear, as mentioned at the end, is why it also hedges against overfitting. Could it be that "simple" function combinations are also overwhelmingly statistically likely versus more complicated ones? I'm imagining a hypersphere-in-many-dimensions kind of situation, where the "corners" are just too sharp to stay in for long before descending back into the "bulk".
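
As a toy check of that hypersphere intuition (my own sketch, not from the article; it assumes only numpy): in high dimensions almost all of a ball's volume sits in a thin shell near its boundary, so a random point is overwhelmingly likely to land in the "bulk" of that shell rather than anywhere special.

    # Toy Monte Carlo illustration (mine, not the article's): in high
    # dimensions nearly all of the unit ball's volume lies in the thin
    # shell between radius 0.9 and 1.
    import numpy as np

    rng = np.random.default_rng(0)

    def fraction_in_outer_shell(dim, n_samples=10_000, inner_radius=0.9):
        # Uniform samples from the unit ball: a random direction times a
        # radius drawn as u**(1/dim), which gives the correct radial law.
        directions = rng.normal(size=(n_samples, dim))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = rng.uniform(size=(n_samples, 1)) ** (1.0 / dim)
        points = directions * radii
        return np.mean(np.linalg.norm(points, axis=1) > inner_radius)

    for dim in (2, 10, 100, 500):
        print(dim, fraction_in_outer_shell(dim))
    # The fraction tends to 1 - 0.9**dim, i.e. essentially 1 for large dim.

Whether that concentration-of-measure picture actually transfers to the geometry of the loss ridge is exactly the open question, of course.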

Interested to hear others' perspectives or pointers to research on this in the context of a kernel-based interpretation. I hope understanding overparametrization may also go some way toward explaining the unreasonable effectiveness of analog-based learning systems such as human brains.



Reflecting a bit more, I think the key lies close to this notion, quoted from the article:

> Since ker Π can be described as the orthogonal complement to the set {K_{t_i}}, the orthogonal complement to ker Π is exactly the closure of the span of the vectors K_{t_i}.

The set {K_{t_i}} is going to be very large in the overparametrized case, so its orthogonal complement, ker Π, will be correspondingly small. Note also this part:

> Because v is chosen with minimal norm [in the context of the corresponding RKHS], it cannot be made smaller by adjusting it by an element of ker Π...

So it sounds like all the "capacity" is taken up by representing the function itself, and, seemingly paradoxically, the coefficients λ_i are further constrained by the implicit regularization imposed by gradient descent (which hypothetically enforces the minimal-norm constraint). The space of functions that can possibly fit is therefore tiny. The rub in practical applications is that many combinations of NN parameters p can correspond to a single set of coefficients λ_i in this kernel space, so the connection between p and λ (via f?) seems key to understanding the core of the issue.
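
To make the minimal-norm / implicit-regularization point concrete, here is a hedged sketch (my own toy setup, not from the article, assuming numpy and a fixed linear feature map Phi standing in for the kernel/lazy regime): gradient descent started from zero on an overparametrized least-squares problem never leaves the span of the data, so it lands on the minimum-norm interpolant, which in kernel form has coefficients lambda = K^{-1} y.

    # Hedged sketch (my notation): gradient descent from w = 0 on an
    # overparametrized linear least-squares problem converges to the
    # minimum-norm interpolant, i.e. w = Phi^T lambda with
    # lambda = K^{-1} y and K = Phi Phi^T.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 20, 200                       # n data points, p >> n parameters
    Phi = rng.normal(size=(n, p))        # fixed feature map (linear stand-in)
    y = rng.normal(size=n)

    # Plain gradient descent on 0.5 * ||Phi w - y||^2, starting from zero.
    w = np.zeros(p)
    lr = 1e-3
    for _ in range(10_000):
        w -= lr * Phi.T @ (Phi @ w - y)

    # Minimum-norm interpolant written in kernel form.
    K = Phi @ Phi.T
    lam = np.linalg.solve(K, y)
    w_min_norm = Phi.T @ lam

    print(np.max(np.abs(Phi @ w - y)))       # ~0: the fit interpolates
    print(np.max(np.abs(w - w_min_norm)))    # ~0: GD picked the min-norm solution

In the NN case the map from p to f is nonlinear, which is where (as far as I understand it) the p-versus-λ correspondence stops being this clean.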



