I would need a minimum of 20 hours of hard study to understand it at any level. I need the Quanta Magazine version (or a Karpathy video) before I can comment.
Although excellent, the article could start off with the practical implications (maybe it's faster GPT training?) so that at least there is some motivation.
There are quite a few takeaways to be had without fully understanding the, ah... esoterica.
1. Gradient descent is path-dependent and doesn't forget its initial conditions. Intuitively reasonable: the method can only make local decisions, and it decides it is 'done' by looking at the size of its steps. There's no single 'right answer' to discover, and each initial condition follows a subtly different path to 'slow enough' (see the first sketch after this list)...
because...
2. With enough simplification, the path taken by each optimization run can be modelled using a matrix (their covariance matrix, K) with well-defined properties. This acts like a curvature of the underlying mathematical space, and it has useful side-effects: eigen-magic on K justifies why the optimization locks some parameters in place quickly while others take a long time to settle (again, first sketch below).
which is fine, but doesn't help explain why wild over-fitting doesn't plague high-dimensional models (would you even notice if it did?). Enter implicit regularization, stage left. And mostly passing me by on the way in, but:
3. Because they used random noise to generate the functions they combine to solve the optimization problem, the properties of the aforementioned matrix admit an additional layer of interpretation, which implies the result will only use each constituent function 'as necessary' (i.e. it is regularized, rather than wildly amplifying pairs of coefficients; second sketch below).
And then something something Bayesian, which I'm happy to admit I'm not across.
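To make 1 and 2 concrete, here's a toy numpy sketch of my own (not the article's setup, just the simplest over-parameterized linear model trained with plain gradient descent). The Gram/kernel matrix K = X Xᵀ governs how quickly each direction of the training residual decays, and two different initializations land on two genuinely different zero-training-error solutions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-parameterized linear model: more parameters (d) than data points (n),
# so many weight vectors fit the data exactly.
n, d = 10, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# The n x n kernel / Gram matrix K = X X^T controls the training dynamics:
# the residual r_t = X w_t - y evolves as r_{t+1} = (I - eta * K) r_t,
# so each eigen-direction of K decays at its own rate (1 - eta * lambda_i).
K = X @ X.T
eigvals = np.linalg.eigvalsh(K)
print("eigenvalues of K (slowest to fastest modes):", np.round(eigvals, 1))

def gradient_descent(w0, eta=1e-3, steps=50_000):
    """Full-batch gradient descent on the squared loss 0.5 * ||X w - y||^2."""
    w = w0.copy()
    for _ in range(steps):
        w -= eta * X.T @ (X @ w - y)
    return w

# Two different initializations, same data, same learning rate.
w_a = gradient_descent(rng.normal(size=d))
w_b = gradient_descent(rng.normal(size=d))

# Both fit the training data essentially perfectly...
print("train error A:", np.linalg.norm(X @ w_a - y))
print("train error B:", np.linalg.norm(X @ w_b - y))

# ...but they are different functions: gradient descent never touches the
# component of the initialization lying in the null space of X, so the final
# weights remember where they started (takeaway 1).
print("distance between the two solutions:", np.linalg.norm(w_a - w_b))
```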
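And for 3, the same toy model makes the implicit regularization visible (again my own sketch, and only an analogy for what the article does with its random features): started from zero, gradient descent can only ever move inside the row space of X, so out of the infinitely many interpolating solutions it reaches the one with the smallest norm, i.e. no coefficient gets amplified beyond what fitting the data requires.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def gradient_descent(w0, eta=1e-3, steps=50_000):
    """Full-batch gradient descent on the squared loss 0.5 * ||X w - y||^2."""
    w = w0.copy()
    for _ in range(steps):
        w -= eta * X.T @ (X @ w - y)
    return w

# From a zero initialization, the iterates stay in the row space of X,
# so gradient descent converges to the minimum-norm interpolant.
w_gd = gradient_descent(np.zeros(d))

# The minimum-norm interpolating solution has a closed form via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("||w_gd - w_min_norm||:", np.linalg.norm(w_gd - w_min_norm))  # ~0, up to convergence tolerance
print("||w_gd||:", np.linalg.norm(w_gd))
```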
It's pretty close to the theory end of the theory-application spectrum, so it could eventually touch many aspects of implementation. My guess is it may inform architectural improvements and/or better justifications for gradient-descent training schedules.
Thanks. I think when I saw 'kernel' my mind jumped to the conclusion that it was a GPU kernel, so I was expecting something a bit more software-engineering-y!
Kernel is an overloaded term, and there are actually two different usages of the word in the article itself: one referring to the kernel operator[1][2] (the primary subject of the article) and another referring to the kernel of a homomorphism[3] (in the linear case, essentially the complement of a subspace in a given coordinate system).
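For the record, the two standard definitions side by side (textbook statements, nothing article-specific): the first is a positive semi-definite similarity function on pairs of inputs, the second is the set of things a map sends to zero.

```latex
% "Kernel" as in kernel methods / the kernel operator:
% a symmetric, positive semi-definite similarity function.
k : \mathcal{X} \times \mathcal{X} \to \mathbb{R},
\qquad \sum_{i,j} c_i \, c_j \, k(x_i, x_j) \ge 0
\quad \text{for all finite } \{x_i\} \text{ and } c \in \mathbb{R}^m .

% "Kernel" as in the kernel of a homomorphism:
% everything the map sends to zero.
\ker(T) = \{\, x : T(x) = 0 \,\} .
```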