A related learning rate observation that is obvious in hindsight but easy to miss if you are just "tweaking" the learning rate: if you decay the learning rate exponentially, then you can only travel a bounded distance in parameter space, which may not be enough to reach a minimum (each step moves the parameters by at most the LR times the gradient norm, and the sum of exponentially decaying LRs is a convergent geometric series). In practice that doesn't seem to be a problem, but then again, a cosine learning rate schedule doesn't look like a problem either.
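To make the "bounded distance" point concrete, here is a minimal sketch. The numbers are made up (a peak LR of 3e-4, a per-step decay factor of 0.999, and the step counts), but the shape of the result isn't: summing the per-step LRs upper-bounds total parameter movement when gradient norms are bounded (e.g. by clipping), and the exponential sum saturates at peak_lr / (1 - gamma) no matter how long you train, while the cosine sum keeps growing with the number of steps.

```python
import math

# Made-up numbers for illustration only.
peak_lr = 3e-4
gamma = 0.999  # assumed per-step exponential decay factor

# Geometric series: sum of peak_lr * gamma**t over all t converges to this cap.
exp_budget = peak_lr / (1 - gamma)

for steps in (10_000, 100_000, 1_000_000):
    # Exponential schedule: eta_t = peak_lr * gamma**t
    exp_sum = sum(peak_lr * gamma**t for t in range(steps))
    # Cosine schedule decaying from peak_lr down to peak_lr / 10 over `steps` steps.
    floor = peak_lr / 10
    cos_sum = sum(
        floor + (peak_lr - floor) * 0.5 * (1 + math.cos(math.pi * t / steps))
        for t in range(steps)
    )
    print(f"{steps:>9} steps   exp sum = {exp_sum:.3f} (cap {exp_budget:.3f})   "
          f"cosine sum = {cos_sum:.3f}")
```

The exponential sum is stuck near 0.3 regardless of training length, whereas the cosine sum scales linearly with the step count.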
These LLMs with a cosine learning rate schedule are usually decayed only to about 1/10th of the peak LR. And even if you used exponential decay, the LR on your final step could still be as high as you like, up to the peak itself, depending on how fast you configure it to decay.
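For instance, you can solve for the decay factor that lands at the same 1/10th-of-peak floor at the end of training. A toy calculation, with an assumed peak LR and step count:

```python
# Made-up numbers: pick the per-step decay factor so an exponential schedule
# ends at 1/10th of the peak after T steps, matching the cosine setups above.
peak_lr = 3e-4
final_fraction = 0.1   # target: end at peak_lr / 10
T = 100_000            # assumed total training steps

gamma = final_fraction ** (1 / T)   # so that peak_lr * gamma**T == peak_lr / 10
final_lr = peak_lr * gamma**T
print(gamma, final_lr)  # gamma ~= 0.999977, final_lr = 3e-5
```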