This gap makes sense to me. The academic point of the Cerebras paper is to show their nice empirical scaling law for compute-optimal training, whereas the academic point of the LLaMA paper is to show that you can make small models punch above their weight by training them in a way that is deliberately not compute-optimal. Of course, both publications also served other academic and marketing purposes.
From the Cerebras blog post: "Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget."
From the LLaMA paper: "The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used."
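To make the contrast concrete, here is a rough back-of-envelope sketch (my own numbers, not from either paper's tables, except LLaMA-7B's roughly 1T training tokens), assuming the common C ≈ 6·N·D FLOPs approximation and the ~20 tokens-per-parameter rule of thumb that comes out of Chinchilla:

```python
# Back-of-envelope comparison of compute-optimal vs. LLaMA-style training.
# Assumptions: training compute C ≈ 6 * N * D (N = parameters, D = tokens),
# and Chinchilla's roughly 20-tokens-per-parameter compute-optimal heuristic.

def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

N = 7e9                     # ~7B parameters, LLaMA-7B scale
chinchilla_tokens = 20 * N  # ~140B tokens would be compute-optimal
llama_tokens = 1.0e12       # LLaMA-7B was actually trained on ~1T tokens

print(f"Compute-optimal budget: {train_flops(N, chinchilla_tokens):.2e} FLOPs")
print(f"LLaMA-7B actual budget: {train_flops(N, llama_tokens):.2e} FLOPs")
print(f"Extra training compute: {llama_tokens / chinchilla_tokens:.1f}x")
```

The point of the arithmetic: LLaMA-7B spends roughly 7x more training compute than a Chinchilla-style recipe would prescribe for that parameter count, trading training efficiency for a smaller, cheaper-to-serve model, which is exactly the trade-off the two quotes describe.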