A separate comment about the conclusions on why the results are worse than OpenAI's GPT2 - which to me feel like they miss the point.
One main point is batch size - I'd agree with Gemini here. A batch size <= 5 with 1024 seq len is really tiny. Nowadays models are trained with an effective batch size of millions of tokens in total. Of course, this won't fit into memory; one uses gradient accumulation for that purpose, again as mentioned by Gemini.
Training duration is definitely also a reason - models do get better over time, otherwise people wouldn't train so long wasting millions :-) Just how long is optimal is unclear, but certainly < 2 days is not optimal even at this "small" scale.
The optimizer could also play a role. As the author mentions, a fixed learning rate is hardly optimal: it is typically both increased in the beginning ("warm up", but that's for stability, so if training works without it, that's not an issue) and scaled down at the end ("cool down", that is, annealing, with cosine as mentioned in the article). This generally squeezes out a bit more performance. Also, while it's true that dropout was used back then (it might be useful for many epochs, but is likely only harmful for < 1 epoch), using _both_ dropout _and_ weight_decay > 0, as the author does, is probably wrong and makes training too slow and careful to get good results. And even if weight decay is used, a "good" implementation should skip some parameters like embeddings and biases (GPT2 did that, and it's relatively important to do so).
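For concreteness, here is a minimal sketch of what both of those could look like in PyTorch - the parameter-name heuristic, learning rate, and step counts are just illustrative placeholders, not what the author or GPT2 actually used:

```
import math
import torch

def build_optimizer(model, lr=3e-4, weight_decay=0.1):
    # Skip weight decay for biases, norms (1-D params) and embeddings.
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (no_decay if p.ndim < 2 or "emb" in name.lower() else decay).append(p)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )

def warmup_cosine(step, warmup_steps=1000, total_steps=100_000):
    # Linear warm-up, then cosine decay towards zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# optimizer = build_optimizer(model)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
# ...then call scheduler.step() once per optimizer step.
```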
On the other hand, I'm pretty sure that using mixed precision and TF32 has absolutely no downsides. It's really standard nowadays to use either mixed precision (FP16 compute and gradients + an FP32 master copy of the weights) or directly BF16 ("brain" float 16, a bit like the TF32 described there, but with only 16 bits in total), and I have almost never seen either one fail... and when it does, it typically fails spectacularly, with NaN losses or the model degenerating to trivial performance.
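For reference, a minimal mixed-precision training step in PyTorch could look like the sketch below (`model`, `loss_fn`, `optimizer` and `dataloader` are stand-ins for whatever the training script defines); with BF16 you can drop the GradScaler entirely:

```
import torch

torch.backends.cuda.matmul.allow_tf32 = True   # allow TF32 matmuls on Ampere+
torch.backends.cudnn.allow_tf32 = True

scaler = torch.cuda.amp.GradScaler()           # only needed for FP16, not BF16

for inputs, targets in dataloader:
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):  # or torch.bfloat16
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()   # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)          # unscales the gradients, then steps
    scaler.update()
```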
OP here -- thanks! I'm in the process of doing some trains using the same code plus DDP on big Lambda Labs machines, and (within the bounds of what I can afford) will hopefully have some interesting results about all of those shortly.
OK, early indicators support both you and Gemini quite strongly re: batch size. On my (somewhat ad-hoc) test dataset, I get losses like this:
* OpenAI medium weights: 3.231
* OpenAI small weights: 3.500
* My locally trained model, FineWeb Chinchilla, batch size 6: 3.944
* My locally trained model, FineWeb-Edu Chinchilla, batch size 6: 4.167
* My locally trained model, FineWeb-Edu double Chinchilla, batch size 6: 4.135
* My cloud trained model, FineWeb Chinchilla, batch size 13 × 8 = 104: 3.674
That last one was trained on an 8x A100 machine with 40 GiB per GPU, with the same code as before, just converted to DDP. It certainly looks like the much larger batch size has improved the model significantly.
I'll be trying on larger machines. No gradient accumulation yet, but it's certainly looking like a valuable lever to pull for local training runs (and, I suspect, might also be useful on "small" cloud machines like the one I used -- will have to see what things look like with the bigger mini-batches I can squeeze onto 80 GiB and 160 GiB GPUs).
Thanks, very nice to see these results! Certainly using GPUs with more RAM makes things simpler to scale. Gradient accumulation is as easy as adding a step counter and an `if step % gradient_accumulation_steps == 0:` around `optimizer.step()`, so it can also be tried simply on a single GPU / cheaper GPUs. But if you can just use 8xA100 and your pipeline parallelizes well, you also get results (almost) 8 times faster, which is certainly nicer for experimenting, of course!
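Spelled out, a gradient accumulation loop could look roughly like this (`model`, `loss_fn`, `optimizer` and `dataloader` are placeholders for whatever the training script uses); it also shows where `zero_grad` goes - in the same `if` block:

```
accum_steps = 16   # effective batch = micro-batch size * accum_steps

optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(dataloader, start=1):
    loss = loss_fn(model(inputs), targets)
    (loss / accum_steps).backward()   # scale so the accumulated gradient matches one big batch
    if step % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)   # zero only after the optimizer has actually stepped
```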
Exactly! If I can get it down to an hour or two (seems very plausible on an 8x H200 with 160 GiB VRAM per GPU, though those are almost never available on Lambda Labs), I'll do the experiments with dropout and the other possible causes of issues, then see if I can bake that all into a new train on the RTX 3090 and confirm it repros there. Looks like I'll definitely need gradient accumulation there.
I assume the zero_grad would need to go in the same if block?
Sorry, I came a bit late to this reply. Interesting - well, nobody says it's a monotonic function :-) In the limit of _very_ large batches you are of course worse off, because you spend a very large amount of computation before taking a single step, so if you stop after a fixed amount of time your model just didn't have the time to learn properly. So certainly there is a sweet spot somewhere.
I suppose the real "function" is a bit more complicated, because:
(1) If you put 2x more data through the same GPU with large enough memory, it will take less than 2x the time to compute (but certainly not 1x).
(2) At some point, empirically, increasing batch size makes it _worse_ even if you ignore the additional runtime cost (i.e. stop after n gradient update steps, not after x seconds). To my knowledge, the accepted reason for that is that a bit of noise helps regularize learning, because overly smooth learning trajectories end up stagnating in local loss minima more easily. In truth, I think nobody exactly understands how deep learning models work :-)
And to your other question - sorry again for the late answer. Yes, `optimizer.zero_grad()` should always be called directly after `optimizer.step()`, therefore with gradient accumulation once every `n` steps (otherwise, you'd be zeroing out the gradients, so just throwing away all the compute you did in previous steps).
Thanks re: gradient accumulation, I'm glad to hear my intuition was right!
As part of the upcoming post I'm running the DDP train on A100s with 40 GiB and 80 GiB, H100s with 80 GiB, and B200s with 160 GiB, so I'll have at least three loss vs. batch size points to plot. So that might be interesting.
I guess a full test would be to train at various batch sizes on the 160 GiB machine and plot the resulting loss. That would be very expensive as a hobby project (the bs=64 train cost a bit more than $40 excluding overhead) so I won't do it.
But perhaps a shorter train would still be of value? That is, train for 300M tokens at a tenth of the cost and see where the loss lands? The problem with that would be if the impact of batch size varied with the length of the train, e.g. if batch size 64 was better than 512 for short trains but weaker for longer ones.
Yes exactly, I fear that shortening the training time would skew the results. In the very short term, smaller batch size is typically better just because you need a certain amount of gradient updates to move away from the original random, hence pretty terrible, weight distribution. Larger batch size gives a steadier, but slower, convergence, so it's hard to say for sure what is better for a given compute budget.
I'm definitely _not_ encouraging you to spend more money on a side topic just for the sake of optimizing this one parameter - there will always be another parameter after that that you'll feel an urge to optimize :-) I'd say it's already a pretty neat result to have come so close to the original GPT2's score starting from scratch!
P.S. If you want to push it a bit further, rather than optimizing parameters for this model, last week at EurIPS I heard that a current "very good" modern repo to start from in order to train a good LLM is this: https://github.com/Niccolo-Ajroldi/plainLM. I haven't investigated it closely (I'm not working on LLMs), but it might be interesting to you for a sample run. The (N)EurIPS paper that was discussed at the conference claimed that the only important change needed was to modify the hyperparameters of the Adam optimizer, setting e.g. beta1=beta2=0.95 (the default values are beta1=0.9 and beta2=0.999, which are apparently outdated).
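In PyTorch terms that change is a one-liner (assuming `model` is the GPT2-style module from the post; the learning rate and weight decay here are just placeholder values, not a recommendation from that paper):

```
import torch

# Defaults are betas=(0.9, 0.999); the suggestion above is to try 0.95 for both.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.95, 0.95), weight_decay=0.1)
```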
Awesome, thanks! I'm still doing trains on the big machines right now (hopefully will write up over xmas), but I think once I've worked out the sweet spot for megatokens per dollar for this model, it's time to start tweaking the other controls -- LR and cosine variation of it, as you said, and also dropout, bias, weight tying, and definitely gradient clipping (which should at least get better bang for the buck from the time/$ spent). I'll leave it to Google to follow up Chinchilla with a "best batch size across a thousand trained models" paper ;-)
> Nowadays models are trained with an effective batch size of millions of tokens in total. Of course, this won't fit into memory; one uses gradient accumulation for that purpose, again as mentioned by Gemini.
I would be surprised if there is much/any gradient acc in modern large-scale pretraining runs. You can always just recruit more GPUs with DP/PP/TP rather than training for longer.
Mmh, not really. As OP shows, speed increases with larger batch size, but only initially, until GPU utilization is high enough; then the speed improvements flatten out (although you might get OOM before that and not "really" see the flat part). Using a smaller batch size increases _noise_, so it quite literally decreases stability. That can sometimes be good: in the limit case, if the batch is as large as your training set, you'll end up in local minima and not be able to get out of them. But that's mostly a concern for toy datasets like MNIST; here it's an entirely different beast.
With such large corpora as the ones used here, and very noisy ones at that, gradient updates are very noisy and that can harm quality. Or anyway, common lore is that one needs pretty large batch size to have the language model improve steadily.
Sorry, I only just opened that file now and browsed through it very quickly, but my eye fell on this excerpt:
```
However, we did not observe any speedup by increasing the batch size from
65536 to 131072 for the first stage, thus, we restrict the batch size to 65536 for this stage.
```
which I think is more or less my point: increasing batch size essentially always helps, but the speedup shrinks the more you push it. Provided that your dataset is large enough, a larger batch size will always make you run a bit faster without sacrificing accuracy, but less and less so as you increase it, until you are maxing out the power of your GPU anyway and can't see any measurable speedup anymore.
This is a very nice, detailed post! I have a few minor comments though (maybe a few are discussed somewhere, it's a _long_ article and I can't claim 100% coverage :-) ):
Calling it "training LLM" is a bit misleading. This is a small GPT-2-sized model (~160M params), while the "L" in "LLM" stands for large...
The early discussion and worries about truncating strings look a bit weird. The author then realizes they're not even going to use 30% of the total available data anyway, so who cares if for each given string we only use the first 1024 tokens? (And even when doing more epochs, they don't discuss the obvious solution to avoid throwing away data, i.e. not always clipping the tail but starting from a random point each epoch - maybe after a punctuation mark or something.)
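Just to illustrate the idea, something along these lines would do it (function and variable names are made up; refining the start position to the next punctuation mark is left out):

```
import random

def sample_window(token_ids, seq_len=1024):
    # Take a random seq_len-token window instead of always the first seq_len tokens.
    if len(token_ids) <= seq_len:
        return token_ids
    start = random.randrange(len(token_ids) - seq_len)
    return token_ids[start:start + seq_len]
```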
At this level of simplicity, setting up a validation loop might be an unneeded complication (for the autoregressive pretraining part, not the instruction tuning, of course). That's because the model is anyway training for < 1 epoch, so no data is seen twice (*). One might as well just track the training loss; it's slightly less "clean" because it's evaluated each time on different data, but the sheer size of the data makes up for that. The final plot shows that the two curves are similar - train is noisier of course, but nothing a bit of rolling smoothing couldn't solve.
The choice to load all tokenized text into RAM feels odd... it works, and it's possibly slightly faster than loading on the fly, but only if you have enough RAM to "waste". PyTorch loads data in separate processes in a non-blocking way, so keeping it on disk and loading on the fly would be safer and wouldn't take any real hit on runtime. But well, if it fits, it's certainly easier that way (although, as the author remarks, it only works if you can store it as a numpy array or torch tensor of some internally supported dtype like int or float; if there are any Python "object" dtypes, they get replicated per dataloader worker, and OOM is guaranteed).
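For example, if the tokens were written out as a flat binary file, a memory-mapped Dataset would avoid holding everything in RAM. A rough sketch, with the file path and dtype as assumptions about how the data was stored:

```
import numpy as np
import torch
from torch.utils.data import Dataset

class TokenFileDataset(Dataset):
    """Serves fixed-length windows from a pre-tokenized .bin file without loading it all into RAM."""
    def __init__(self, path, seq_len=1024):
        self.tokens = np.memmap(path, dtype=np.uint16, mode="r")  # dtype must match how the file was written
        self.seq_len = seq_len

    def __len__(self):
        return (len(self.tokens) - 1) // self.seq_len

    def __getitem__(self, idx):
        start = idx * self.seq_len
        chunk = self.tokens[start:start + self.seq_len + 1].astype(np.int64)
        return torch.from_numpy(chunk[:-1]), torch.from_numpy(chunk[1:])
```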
The choice to concatenate everything into one long string is a bit outdated nowadays, because it trains with attention between different documents that have nothing to do with each other, which could introduce bias or at least suboptimal results. Nowadays people use masked attention ("document masking"), which is so popular it's even supported by FlashAttention: https://github.com/Dao-AILab/flash-attention/issues/654
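A naive way to express document masking in plain PyTorch (FlashAttention's varlen interface does the same thing far more efficiently, without materializing the full mask); here `doc_ids` is an assumed per-token document index:

```
import torch
import torch.nn.functional as F

def document_causal_mask(doc_ids):
    # doc_ids: (batch, seq) integer document index per token.
    # True = attention allowed: same document AND causal (j <= i).
    same_doc = doc_ids[:, :, None] == doc_ids[:, None, :]              # (batch, seq, seq)
    seq = doc_ids.shape[-1]
    causal = torch.ones(seq, seq, device=doc_ids.device).tril().bool()
    return same_doc & causal

# With q, k, v shaped (batch, heads, seq, head_dim):
# out = F.scaled_dot_product_attention(q, k, v, attn_mask=document_causal_mask(doc_ids)[:, None])
```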
(*) Of course, the data is dirty enough that there _will_ be some duplicated stuff here or there, but the same is true for a random train/validation split. Also, such a small model would have very little risk of memorizing, even if some data were replicated.
> Calling it "training LLM" is a bit misleading. This is a small GPT-2-sized model (~160M params), while the "L" in "LLM" stands for large...
I've always felt the natural way of referring to smaller LLMs would be Medium Language Models and Small Language Models, but I guess MLM is an inauspicious acronym.
MLM is masked language modelling, another phrase for training models on the cloze task. It's the most common way to train encoder-only models.
CLM (causal language modelling) is the other common task where you autoregressively predict the next token given the previous ones. It's the most common way to train decoder-only models.
Aside from the weirdness of calling something that was released 17 months ago "good old" :-D I mean, deep learning is evolving at a crazy pace, but you just can't assume a good paper gets written in days.
That said, as others have pointed out, and as it's also written on the blog post, they are entirely different methods. QLoRA requires access to the full training data, while theoretically you can apply SpinQuant to any given model. For example, they also apply it to Mistral, not only to their LLaMA.
(QLoRA also takes some time and compute to apply, but since SpinQuant also involves learning some weights, I don't know whether it's actually faster/cheaper either.)
I know nothing about what makes an industry succeed or fail, and also nothing about web tech, but working in the field I can comment on:
> tensorflow looks like currently loosing to pytorch - seems like google got bored and more development is for JAX, Keras wrapper
Well, TensorFlow doesn't "look like currently losing", it already lost a long time ago. I haven't seen a decent paper release code in TensorFlow in years, and all the references I see to TF online are job posts from "older" companies (to the point that, if you are looking for a job in data science, seeing TF mentioned in the job post is kind of a red flag for a place you don't want to be).
That said, I am quite certain that this has only a small impact on why Google is losing ground, and even on why it is behind in AI (which is also debatable: narrative aside, Gemini is not lagging that far behind competitors). Certainly if TensorFlow + TPUs had turned out to be better than PyTorch + GPUs they would have had a lead to start from, but if that were so important, Meta or NVIDIA would have created the first LLM, not OpenAI.
Simply, sometimes stuff happens, you can't predict it all.
I know this is HN and it's not a popular opinion here, but maximum security is _not_ always a good idea. Even setting aside the problem of many different actors having to access these details, mentioned below, there's value in a simple login process. Specifically for airplane tickets, the situations I've had to struggle with multiple times are retrieving reservations bought from a different computer, or by a travel agency. In all these situations, it was exactly the simple approach that saved me. If 2FA were mandatory, the best case scenario is that the travel agency would have to send you a separate e-mail with details about how to access their portal where this 2FA would somehow work. The number of systems multiplies, and so does the number of credentials to remember. If you are not at your usual workplace (and chances are, if you are travelling, you are not) or are on a shaky connection (same), you have a real problem. And it's often a time-critical scenario, which makes it all worse.
Implementing a "secure" connection here would be a sure road to pain: at the very least it would require the airline to scale up customer support a lot, and it would likely bring a lot of bad publicity every time something fails. Delays cost money, especially in this industry. And what would you get for that? The assurance that, if you publish a picture of your reservation / boarding pass online, nobody can log in with your credentials and cancel your flight? That's a rather niche and very targeted risk, which is better handled by a single customer support agent who simply issues you a new ticket.
(by the way, by the time you have checked in and your boarding pass has been issued, a lot of companies just don't allow you to cancel anymore, so it's really a non-issue?)
> (by the way, by the time you have checked in and your boarding pass has been issued, a lot of companies just don't allow you to cancel anymore, so it's really a non-issue?)
Which companies have a cancellation policy that is contingent upon getting a boarding pass? I've cancelled checked-in tickets before. If the flight is operated by a different airline than the ticket issuer, you just have to call the operating airline first to undo the check-in (a few airlines can even do this online). After that it should be possible to cancel the ticket with the ticket issuer without any problems.
Do you have sources for "The MFU can be above 40% and certainly well above the 35 % in the estimate"?
Looking at [1], the authors there claim that their improvements were needed to push BERT training beyond 30% MFU, and that the "default" training only reaches 10%. Certainly numbers don't translate exactly, it might well be that with a different stack, model, etc., it is easier to surpass, but 35% doesn't seem like a terribly off estimate to me. Especially so if you are training a whole suite of different models (with different parameters, sizes, etc.) so you can't realistically optimize all of them.
It might be that the real estimate is around 40% instead of the 35% used here (frankly it might be that it is 30% or less, for that matter), but I would doubt it's so high as to make the estimates in this blog post terribly off, and I would doubt even more that you can get that "also for small models with plain pytorch and trivial tuning".
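For what it's worth, the usual back-of-the-envelope MFU calculation uses the ~6N FLOPs-per-token approximation; the numbers below are purely illustrative, not measurements from the article:

```
n_params = 160e6            # model size
tokens_per_second = 150e3   # hypothetical measured training throughput
peak_flops = 312e12         # e.g. A100 dense BF16 peak

mfu = 6 * n_params * tokens_per_second / peak_flops
print(f"MFU ~ {mfu:.1%}")   # ~46% with these made-up numbers
```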
I'm into AI but not into sound, so I might be saying something stupid here, but I think using something like this for very high volumes like concerts might well be outright impossible and, even if not, would certainly be quite dangerous and therefore not commercializable.
My understanding is that to "mute" a sound, you need to inject another wave that is exactly its opposite, with the exact same volume and in perfect sync, so that the two waves interfere destructively. However, in general but especially with AI, you can never guarantee 100% accuracy. If you use this technology to "silence" a background fountain and something goes wrong, at worst you get a lot of noise that makes you grimace and take the headphones off. If at a concert with 100+ dB of music you get an error and your headphones start producing a similarly loud, but not perfectly aligned, noise right into your ears, you probably won't have time to remove them before damaging your hearing.
In general, I think that having a tool that drives 100+ dB straight into your head is probably not a wise idea :-)
You could probably achieve the same outcome by combining two approaches though. Use traditional timing and phase management that existing noise cancelling headphones do. Then, using the data from that same set of microphones use AI to extract the conversation of interest (maybe using timing differences from left/right to determine who's "in front" of you) and inject that as the thing to overlay on top of the inversion. This way there's no risk of AI error on the noise cancellation and you can rely on existing solutions.
Even putting 50 dB of sound in the opposite direction might help take something from the volume of a nightclub to the volume of a refrigerator [1]. Not perfectly muting it, but perhaps good enough for many scenarios.
Disclaimer - I also have no technical experience of sound
It probably wouldn't work for in-ear setups. However, if you have over-ear headphones with good passive noise cancelling (~35 dB), then you would need less of the active cancelling (~65 dB) to make it quiet and safe.
You can get earplugs with ~30 dB reduction and builtin in-ear monitors. Slap some microphones and such on the outside, and you can probably work with it.
Yep, that also sounded weird to me. I had, IIRC, three of my wisdom teeth removed as a teenager, when I was living in Italy. I think two of them in a single session. General anaesthesia wasn't even an option; the whole thing happened in a normal dentist's office with a local anaesthetic on the relevant half of the mouth. I distinctly recall the dentist complaining that for one of the teeth my roots were particularly strongly attached to the bone, and he had to push and lean on it, _hard_; it didn't really feel painful, except that my jaw was aching on the opposite side (the mostly non-sedated one) due to the pressure he put on it.
In fact, I think people and doctors alike tend to sedate much less in Italy - maybe not completely unjustified, given a few things I've read in this thread. Back then, the normal drilling and filling of tooth cavities mostly happened without any anaesthesia at all, local or otherwise. Frankly, that was quite painful whenever the drilling happened to touch a nerve, and I really don't feel like experiencing it again :-) and I think at least this has changed since.
Once an old dentist lady told me that she had noticed patients complaining about pain on the other side in these situations. She didn't have an explanation for it.
Variety matters a lot. If you pay 1000 trained labellers, you get 1000 POVs for a good amount of money, and likely can't even think of 1000 good questions to have them ask. If you let 1,000,000 people give you feedback on random topics for free, and then pay 100 trained people to go through all of that and only retain the most useful 1%, you get ten times more variety for a tenth of the cost.
Of course the numbers are pretty arbitrary, but it's just to give an idea of how these things scale. This is my experience from my company's own internal models (deep learning, but not LLMs), for which we had to buy data instead of collecting it. If you can't tap into data "from the wild" - in our case, for legal reasons - you can still get enough data (if measured in GB), but it's depressingly more repetitive, and that's not quite the same thing when you want to generalize.
If climate change were visible at that scale (tiny resolution between 0 and 40 degrees), we'd all have been boiled a while ago.
Still, you can see signs: the maximum temperature until 1990 or so seems to be around 35 degrees, since then there are several peaks above that value, and in 2016 (?) it looks to be 38-39. It's certainly less visible in the lows, because the absolute lowest values appear to be in the 1990-2000 decade, but then again, all years in the 2010-2020 decade seem to have slightly higher minimum temperatures than any other decade.
That said, there is massive downscaling involved at such a scale, so I wouldn't be too surprised if some details were just skipped and not visible. I wouldn't trust this interpretation much - if a visualization it needs to be, I'd rather plot a moving average with a window of at least 6 months (or even 1 year to rule out seasonality entirely) and see whether that has an upward trend or not (I bet it does).
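With pandas that's just a couple of lines; the file and column names here are made-up placeholders for the Durban daily data:

```
import pandas as pd

df = pd.read_csv("durban_daily.csv", parse_dates=["date"], index_col="date")
smoothed = df["temperature"].rolling("365D").mean()   # 1-year window averages out the seasonal cycle
smoothed.plot()
```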
[EDIT] I now see the post below with the yearly averages since 1979. It does indeed seem that 1995-1997 were abnormally cold years, and also that 2010-2020 is the warmest decade since then (and likely since quite a bit longer). So the outlier analysis above seems to hold :-)
Tech lead for WEMC here - see https://tealtool.earth for straightforward charts of climate-related data for different countries and regions around the globe.
For temperature and a few other variables, it shows historical data from the EU Copernicus service (C3S) along with three different projected series out to 2100.
For CO2, it shows the latest historical data.
The charts are concerning and I am sure my co-workers are not hell bent on faking data to scare people just to get more funding; they work too much and go to too many meetings.
I have not analyzed any data yet and the purpose of plotting the time series was to show an example of the data as a function of time. As others have already mentioned, the swing in Durban temperatures over the seasonal cycle is ~25°C while global temperature increases due to climate change so far are on the order of 1°C.
Plus, weather data tends to be quite noisy - just think how variable the weather can be day-to-day, and we're squishing 80 years of that into one plot. Also worth noting that different places may experience climate change differently. Some places may see the average temperature go up, some maybe only in the summer, so you'll have to look at averages. Some places may see more extreme summer highs, so then you can't just look at averages but at the average extremes or the tail end of the temperature distribution.
So it'll be hard to discern any climate change from just a cursory glance. I'm not saying it's there, just that it requires more analysis.