
Chinchilla's death has been greatly exaggerated. This article makes the same mistake as the original GPT-3 scaling law: extrapolating from mid-training loss curves. But most of the loss improvement in the middle of training comes from simply dropping the learning rate, which reduces the effective noise level of the stochastic gradients.

If we want to judge the effectiveness of training small models for longer, we need to look at _final_ loss as a function of compute, adjusting the LR schedule to the token budget as we spend more compute, and then extrapolate on that curve, _not_ on the training curve for a fixed budget. Another way to put it: you can't drop the LR below 0, and the LR schedule drives the shape of the training curve, so it makes no sense to extrapolate a training curve beyond the end of training.
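
To make that concrete, here is a minimal sketch (assuming a standard cosine decay; the peak/min learning rates are illustrative, not from any particular lab). The decay horizon is tied to the token budget, so a run's mid-training loss is taken at a still-high LR and isn't comparable to the final loss of a shorter, fully-decayed run:

    import math

    # Sketch: tie the cosine decay horizon to the token budget, so every
    # budget gets a full schedule and a genuinely final loss to compare.
    def cosine_lr(tokens_seen, token_budget, peak_lr=3e-4, min_lr=3e-5):
        """Cosine decay from peak_lr to min_lr over the whole token budget."""
        progress = min(tokens_seen / token_budget, 1.0)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

    # Halfway through a 2T-token run the LR is still high, so the loss there
    # is noise-inflated relative to what a dedicated 1T-token run achieves.
    print(cosine_lr(1e12, 2e12))  # mid-training LR of the 2T run: ~1.65e-4
    print(cosine_lr(1e12, 1e12))  # final LR of a dedicated 1T run: 3e-5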

Of course, the overall point that longer training produces gains holds true, and Chinchilla says nothing against it: Chinchilla only aims to minimize training compute.



I find the article's assumption that labs haven't tried training smaller models for longer almost insulting. Everybody tries that first, of course; it was the common wisdom for decades. No, it doesn't work better, or even as well: the loss curve flattens out.


On the other hand, a model is trained once and then used a lot, so you can argue that extra training cost can be traded for future gains at inference time.
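
A rough back-of-the-envelope version of that trade-off, using the common ~6ND approximation for training FLOPs and ~2N FLOPs per served token (all the specific model sizes, token counts, and serving volumes below are made up for illustration):

    # Lifetime compute: ~6*N*D FLOPs for training plus ~2*N per served token.
    def lifetime_flops(n_params, train_tokens, served_tokens):
        return 6 * n_params * train_tokens + 2 * n_params * served_tokens

    served = 10e12  # hypothetical lifetime serving volume: 10T tokens
    big   = lifetime_flops(70e9, 1.4e12, served)  # Chinchilla-ish 70B
    small = lifetime_flops(13e9, 7.5e12, served)  # over-trained 13B, similar training compute
    print(f"70B: {big:.2e} FLOPs, 13B: {small:.2e} FLOPs")
    # -> 70B: 1.99e+24, 13B: 8.45e+23; the small model wins once serving dominates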


The discussion at the end of this article starts to get to the problem with extrapolating.

Llama1-65b (roughly Chinchilla-optimal) and Llama2-34b used similar training compute. Although Llama2 isn't directly comparable, it is the closest comparator available without extrapolating, and it illustrates the point.
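
For reference, plugging the published token counts (~1.4T for Llama1-65b, ~2T for Llama2) into the rough C ≈ 6ND approximation puts the two runs in the same ballpark (a crude estimate, not the labs' own accounting):

    # Rough training compute via C ≈ 6*N*D (parameters x tokens).
    llama1_65b = 6 * 65e9 * 1.4e12  # ~1.4T tokens -> ~5.5e23 FLOPs
    llama2_34b = 6 * 34e9 * 2.0e12  # ~2.0T tokens -> ~4.1e23 FLOPs
    print(f"{llama1_65b:.1e} vs {llama2_34b:.1e}")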



