A related learning rate observation that is obvious in hindsight but not if you are just "tweaking" the learning rate: if you decay the learning rate exponentially, then you can only travel a bounded distance in parameter space (the step sizes form a convergent geometric series), which may not be enough to reach a minimum. In practice that doesn't seem to be a problem, but then again, a cosine learning rate schedule doesn't look like a problem either.
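To make that concrete: assuming the updates themselves have bounded norm (say, at most 1), the total distance traveled under exponential decay can never exceed lr0 / (1 - gamma), no matter how long you train. A quick sketch with made-up numbers:

    # Exponential LR decay: lr_t = lr0 * gamma**t.
    # If each update has at most unit norm, the distance traveled is
    # bounded by the geometric series sum lr0 / (1 - gamma).
    lr0, gamma = 0.1, 0.999

    traveled = sum(lr0 * gamma**t for t in range(100_000))
    bound = lr0 / (1 - gamma)

    print(round(traveled, 2))  # ~100.0 after 100k steps
    print(round(bound, 2))     # 100.0, even with infinitely many more steps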


Adam/AdamW doesn't follow any intuitive logic; pretty much everyone arrives at different learning rates and schedules from their own experiments.


That's fine - you can just use SGDR (SGD with warm restarts). https://arxiv.org/abs/1608.03983
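For anyone who wants to try it, PyTorch ships that schedule as CosineAnnealingWarmRestarts. A minimal sketch (the model, dummy loss, and hyperparameters below are placeholders, not anything from the paper):

    import torch
    from torch import nn

    # Placeholder model/optimizer; swap in your own.
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # SGDR: cosine annealing with warm restarts (Loshchilov & Hutter, 2016).
    # T_0 = length of the first cycle in epochs; T_mult=2 makes each
    # subsequent cycle twice as long as the previous one.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=10, T_mult=2, eta_min=1e-5
    )

    steps_per_epoch = 100  # placeholder
    for epoch in range(30):
        for i in range(steps_per_epoch):
            optimizer.zero_grad()
            loss = model(torch.randn(32, 10)).pow(2).mean()  # dummy loss
            loss.backward()
            optimizer.step()
            # Fractional epochs let the LR restart mid-training.
            scheduler.step(epoch + i / steps_per_epoch)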


These LLMs with a cosine learning rate schedule are usually decayed only to 1/10th of the peak LR. Even if you used exponential decay, the LR on your final step could still be an arbitrarily large fraction of the peak, depending on how fast you configure it to decay.
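Concretely, if you want an exponential schedule to land on the same 1/10-of-peak floor over a fixed run, you can solve for the per-step decay factor directly: gamma = 0.1 ** (1 / T). Illustrative numbers, not from the article:

    peak_lr = 3e-4
    T = 100_000                      # total training steps
    target_ratio = 0.1               # final LR = 10% of peak, like the cosine runs

    gamma = target_ratio ** (1 / T)  # per-step exponential decay factor
    final_lr = peak_lr * gamma ** T

    print(gamma)     # ~0.99997697
    print(final_lr)  # ~3e-05, i.e. peak_lr * 0.1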



