
While the article makes good observations, this would appear to be a major oversight by leading research labs if they could have just kept the gas pedal down on simpler models for longer and gotten better performance. This is HackerNews – can we get someone from OpenAI, DeepMind, or MetaAI to respond and explain why cutting off the smaller models at a lower total compute budget is justified?


But...they did that, with Llama 2, and apparently did get better results, at least up to a point.

My big WTF is: if you're feeding the same amount of data through all of them, then to use the same amount of compute for the smaller models you need to run multiple epochs (over the same data). One thing that's always bothered me a bit about "foundation model" LLM training is that it sounds like traditionally they essentially just run a single epoch, and with stochastic gradient descent that's certainly leaving something on the table, probably a lot (and it also introduces a lot of path-dependence on the order in which data is presented, what with the cosine learning rate schedules).
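
Back-of-envelope version of that epoch math, assuming the common "compute ~ 6 * params * tokens" approximation (the model sizes and token count here are just illustrative, not anyone's actual configs):

  # Rough compute for one pass over the data: C ~ 6 * params * tokens.
  def flops(params, tokens):
      return 6 * params * tokens

  unique_tokens = 2e12            # ~2T unique tokens, Llama-2-ish
  big, small = 70e9, 7e9          # 70B vs 7B params (illustrative)

  budget = flops(big, unique_tokens)    # fix the budget at the big model's cost
  small_tokens = budget / (6 * small)   # tokens the small model can afford
  print(small_tokens / unique_tokens)   # -> 10.0 epochs over the same data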

I really want to know what would happen if, like in smaller models where we can actually get there (convnets for ImageNet classification, e.g.), we ran enough epochs on each of these models to hit the point where validation loss started increasing even as training loss kept decreasing. It seems like we're always squarely in the regime where both are still decreasing, so everything is severely undertrained, even given the available datasets. It's easy to come up with "laws" for that regime, but they mean nothing other than that we don't have enough compute to properly handle the data.
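
Concretely, the stopping point I mean is the standard early-stopping rule; a minimal sketch, where the two loss functions are toy stand-ins for real training and held-out evaluation passes:

  # Train until validation loss turns up even while training loss keeps falling.
  def train_loss(epoch):
      return 1.0 / (epoch + 1)                 # keeps decreasing forever

  def val_loss(epoch):
      return 1.0 / (epoch + 1) + 0.01 * epoch  # eventually turns back up

  best = float("inf")
  for epoch in range(100):
      if val_loss(epoch) > best:               # validation got worse: stop here
          print("stopping at epoch", epoch)
          break
      best = val_loss(epoch)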

Big takeaway: if the results from this article are legit, it would suggest that we should really be looking at even smaller models, wouldn't it? And actually be training them to the "risk overtraining" point?


You might want to give a read to "Scaling Data-Constrained Language Models" [1]. They basically generalized the Chinchilla scaling law by investigating behavior on multi-epoch runs.
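
If I'm recalling that paper right, the core idea is that repeated tokens are worth less than fresh ones, with the "effective" amount of data saturating exponentially in the number of repeats. Very roughly (functional form from memory, and r_star is a fitted constant whose value I'm not vouching for):

  import math

  # Effective data from U unique tokens repeated R extra times; the value of a
  # repeat decays with characteristic scale r_star (a fitted parameter in the
  # paper; the number used here is a placeholder, not theirs).
  def effective_tokens(U, R, r_star=15.0):
      return U + U * r_star * (1 - math.exp(-R / r_star))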

[1] https://arxiv.org/abs/2305.16264


The Llama 1 paper [1] was one of the earlier works to question the assumption that more params = better model. Since then they've released Llama 2, and this post offers more evidence reinforcing their hypothesis.

I wouldn't say it was an oversight by other labs that they missed this. It's easier to just increase the params of a model over the same training set than to gather the larger training set a smaller model needs. And at first, increasing model size did seem to be the way forward, but we've since hit diminishing returns. Now that we've hit that point, we've begun exploring other options, and the Llamas are early evidence of another way forward.

[1] https://arxiv.org/abs/2302.13971


I work with LLMs (won't say where), but smaller models stop performing better on benchmarks after a certain point, i.e. they seem to hit their learning capacity (at least with current techniques). Small models struggle to keep context the way larger models do; their outputs are impressive but lack a certain amount of logical consistency and flow.

Whether this is a fundamental issue with the model size or some combination of training technique and model size is yet to be known. But for now, we know what works and are exploring that until we squeeze all the 'easy' perf we can.


One noteworthy thing is that no one is posting validation curves, only training curves. Given enough compute, all these models will happily drive training loss to near zero as they overfit to the dataset -- there are no regularizers in any modern LLMs. The validation curves would be considerably more convincing.

The counterargument to the above is that none of these models were really trained for multiple epochs: it's hard to overfit data you've only seen once. But to get to 70T tokens you'd inevitably have to use many epochs (roughly 35 passes over a ~2T-token corpus).


The validation curves will look identical. These models are far too small to overfit to the training set.

With a large enough model and many epochs, you can certainly get overfitting, but for one epoch val/train curves look exactly the same and I'd expect that a 7B model will never overfit on 2T tokens no matter how many epochs you do.


> data you've only seen once

Is this still true given that they're upsampling in the pretraining dataset? I don't recall any details on how and to what extent they did this in the Llama2 paper but presumably some fraction of those 2T training tokens is repeated data.
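
For what "upsampling" means concretely here, something like weighted mixture sampling is the usual setup (the sources and weights below are made up, not Llama's actual mix):

  import random

  # Sample pretraining documents from several sources with different weights.
  # If a source's sampling weight exceeds its share of the raw data, its
  # documents get repeated even within a "single epoch" of the mixture.
  sources = ["web_crawl", "wikipedia", "code"]
  weights = [0.80, 0.15, 0.05]    # illustrative mixture proportions

  def next_source():
      return random.choices(sources, weights=weights, k=1)[0]

  counts = {s: 0 for s in sources}
  for _ in range(100_000):
      counts[next_source()] += 1
  print(counts)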

MetaAI hasn't been as averse to repeated tokens as other groups; they trained the now-forgotten Galactica for multiple epochs with good results.

> The validation curves would be considerably more convincing.

What are they validating on? I was under the impression they weren't splitting the pretraining corpus.


The llama1 team did not have a validation set. I don’t know what the Llama2 team did - I left before seeing any of the details.

My guess is Llama2 upsamples Wikipedia a good bit, but given they didn’t report any information about training data, it’s hard to say.


> there are no regularizers in any modern LLMs.

Using a large & diverse training set is the best regulariser, but I think transformers also use weight decay and dropout
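
For reference, both of those knobs in a stock PyTorch setup; the hyperparameters are illustrative, not from any particular Llama config:

  import torch
  from torch import nn

  # Dropout inside the transformer block, decoupled weight decay via AdamW.
  block = nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.1)
  optim = torch.optim.AdamW(block.parameters(), lr=3e-4, weight_decay=0.1)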


RWKV also uses some sort of L2-esque regularization, which was supposedly an idea taken from PaLM (although I can't find a source on this point, other than some message in the RWKV discord)
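
If that's the z-loss trick described in the PaLM paper (an auxiliary penalty keeping the softmax normalizer close to 1), a sketch would look roughly like this; the coefficient is from memory, so treat it as an assumption:

  import torch

  # Auxiliary "z-loss": penalize the squared log of the softmax normalizer Z,
  # which keeps logit magnitudes from drifting (L2-ish on log Z).
  def z_loss(logits, coeff=1e-4):
      log_z = torch.logsumexp(logits, dim=-1)   # log of the softmax denominator
      return coeff * (log_z ** 2).mean()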



