Yes exactly, I fear that shortening the training time would skew the results. In...

gpjt · 2025-12-17T21:28:20 1766006900

Awesome, thanks! I'm still doing trains on the big machines right now (hopefully will write up over xmas) but I think once I've worked out the sweet spot for memgatokens per dollar for this model, it's time to start tweaking the other controls -- LR and cosine variation of it, as you said, and also dropout, bias, weight tying, and definitely gradient clipping (which should at least get better bang for the buck from time/$ spent). I'll leave it to Google to follow up Chinchilla with a "best batch size across a thousand trained models" paper ;-)