
Chinchilla's death has been greatly exaggerated. This article makes the same mistake as the original GPT-3 scaling law: extrapolating from mid-training loss curves. But most of the loss improvement in the middle of training comes from simply dropping the learning rate, which reduces the effective noise level of the stochastic gradients.

If we want to judge the effectiveness of training small models for longer, we need to look at _final_ loss as a function of compute, adjusting the LR schedule to the token budget as we spend more compute, and then extrapolate on that curve, _not_ on the training curve for a fixed budget. Another way to put it: you can't drop the LR below 0, and the LR schedule drives the shape of the training curve, so it makes no sense to extrapolate a training curve beyond the end of training.
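
To make that concrete, here is a minimal sketch (assuming a standard cosine decay; the peak/min learning rates are illustrative, not from any particular lab). The decay horizon is tied to the token budget, so a run's mid-training loss is taken at a still-high LR and isn't comparable to the final loss of a shorter, fully-decayed run:

    import math

    # Sketch: tie the cosine decay horizon to the token budget, so every
    # budget gets a full schedule and a genuinely final loss to compare.
    def cosine_lr(tokens_seen, token_budget, peak_lr=3e-4, min_lr=3e-5):
        """Cosine decay from peak_lr to min_lr over the whole token budget."""
        progress = min(tokens_seen / token_budget, 1.0)
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

    # Halfway through a 2T-token run the LR is still high, so the loss there
    # is noise-inflated relative to what a dedicated 1T-token run achieves.
    print(cosine_lr(1e12, 2e12))  # mid-training LR of the 2T run: ~1.65e-4
    print(cosine_lr(1e12, 1e12))  # final LR of a dedicated 1T run: 3e-5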

Of course, the overall point that longer training produces gains holds true, and Chinchilla says nothing against it: Chinchilla only aims to minimize training compute.



I find the article's assumption that labs haven't tried training smaller models for longer almost insulting. Everybody tries that first, of course; it was the common wisdom for decades. No, it doesn't work better, or even as well: the loss curve flattens out.


On the other hand, a model is trained once and then used a lot, so you can argue that extra training cost can be traded for future gains at inference time.
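
A rough back-of-the-envelope version of that trade-off, using the common ~6ND approximation for training FLOPs and ~2N FLOPs per served token (all the specific model sizes, token counts, and serving volumes below are made up for illustration):

    # Lifetime compute: ~6*N*D FLOPs for training plus ~2*N per served token.
    def lifetime_flops(n_params, train_tokens, served_tokens):
        return 6 * n_params * train_tokens + 2 * n_params * served_tokens

    served = 10e12  # hypothetical lifetime serving volume: 10T tokens
    big   = lifetime_flops(70e9, 1.4e12, served)  # Chinchilla-ish 70B
    small = lifetime_flops(13e9, 7.5e12, served)  # over-trained 13B, similar training compute
    print(f"70B: {big:.2e} FLOPs, 13B: {small:.2e} FLOPs")
    # -> 70B: 1.99e+24, 13B: 8.45e+23; the small model wins once serving dominates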


The discussion at the end of this article starts to get to the problem with extrapolating.

Llama1-65b (roughly Chinchilla-optimal) and Llama2-34b used similar training compute. Although Llama2 isn't directly comparable, it is the closest comparator available without extrapolating, and it illustrates the point.
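
For reference, plugging the published token counts (~1.4T for Llama1-65b, ~2T for Llama2) into the rough C ≈ 6ND approximation puts the two runs in the same ballpark (a crude estimate, not the labs' own accounting):

    # Rough training compute via C ≈ 6*N*D (parameters x tokens).
    llama1_65b = 6 * 65e9 * 1.4e12  # ~1.4T tokens -> ~5.5e23 FLOPs
    llama2_34b = 6 * 34e9 * 2.0e12  # ~2.0T tokens -> ~4.1e23 FLOPs
    print(f"{llama1_65b:.1e} vs {llama2_34b:.1e}")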



