
I wonder what led to such a gap between llama 7b and Cerebras 13b. I hope they discuss it in the paper.


This gap makes sense to me. The academic point of the Cerebras paper is to show their nice empirical scaling law for compute-optimal training, whereas the academic point of the LLaMA paper was to show that you can make small models punch above their weight by training them in a way that is deliberately not compute-optimal. Of course both of those publications had other academic and marketing purposes.

From the Cerebras blog post: "Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget."

From the LLaMA paper: "The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used."
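For a rough sense of how different those two regimes are, here's a back-of-the-envelope sketch. It assumes the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter and the standard C ≈ 6·N·D approximation for training FLOPs; the 1T-token figure for LLaMA 7B is from the LLaMA paper, and the ~260B figure for Cerebras-GPT 13B just follows from the 20x heuristic, so treat the exact numbers as approximate:

  # Back-of-the-envelope comparison of the two training regimes.
  # Assumptions: Chinchilla heuristic of ~20 tokens per parameter,
  # training compute approximated as C ~= 6 * N * D FLOPs.

  def chinchilla_tokens(n_params):
      """Roughly compute-optimal token count for a model with n_params parameters."""
      return 20 * n_params

  def train_flops(n_params, n_tokens):
      """Standard approximation of total training compute."""
      return 6 * n_params * n_tokens

  cerebras_13b_params = 13e9
  cerebras_13b_tokens = chinchilla_tokens(cerebras_13b_params)  # ~2.6e11 (~260B)

  llama_7b_params = 7e9
  llama_7b_tokens = 1e12  # 1T tokens, per the LLaMA paper

  print(f"Cerebras-GPT 13B: ~{cerebras_13b_tokens:.2e} tokens, "
        f"~{train_flops(cerebras_13b_params, cerebras_13b_tokens):.2e} FLOPs")
  print(f"LLaMA 7B:         ~{llama_7b_tokens:.2e} tokens, "
        f"~{train_flops(llama_7b_params, llama_7b_tokens):.2e} FLOPs")
  print(f"LLaMA 7B tokens vs. compute-optimal 13B: "
        f"{llama_7b_tokens / cerebras_13b_tokens:.1f}x")

So the smaller LLaMA model actually burns comparable or more training compute than the compute-optimally trained 13B, which is exactly the trade the LLaMA paper says it's making.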


Is there a benchmark comparing the two that I missed?

Edit: The Hugging Face page has 0-shot benchmarks, which you can compare against the LLaMA paper:

https://huggingface.co/cerebras/Cerebras-GPT-13B

https://arxiv.org/pdf/2302.13971.pdf
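If you want to poke at the Cerebras model yourself, the checkpoint loads through the standard transformers classes. A minimal sketch (note the full-precision 13B weights are tens of GB, so you may want a half-precision dtype, a GPU, or one of the smaller Cerebras-GPT variants):

  from transformers import AutoTokenizer, AutoModelForCausalLM

  # Load the 13B checkpoint and generate a short continuation.
  model_name = "cerebras/Cerebras-GPT-13B"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)

  inputs = tokenizer("Generative AI is ", return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))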


I'm on mobile and struggled to compare these two tables properly. Would you mind posting a summary of your findings?

Here are some values, but I don't know what they mean. LLaMA 65B on the left, Cerebras 13B on the right.

PIQA: 82.8 / 76.6, WinoGrande: 77.0 / 64.6, ARC-e: 78.9 / 71.4


Really short summary: LLaMA is better, even the smaller LLaMA models.

Zero-shot accuracy (higher is better):

  Benchmark    Cerebras 13B   LLaMA 7B   LLaMA 13B   LLaMA 65B
  HellaSwag    51.3           76.1       79.2        84.2
  PIQA         76.6           79.8       80.1        82.8
  WinoGrande   64.6           70.1       73.0        77.0
  ARC-e        71.4           72.8       74.8        78.9
  ARC-c        36.7           47.6       52.7        56.0
  OpenBookQA   28.6           57.2       56.4        60.2
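For context on what these 0-shot numbers measure: most of these benchmarks (PIQA, HellaSwag, ARC, etc.) are multiple-choice, and the usual zero-shot protocol is to score each candidate answer by its log-likelihood under the model and pick the highest. Here's a rough sketch of that scoring loop with transformers; the actual prompt templates, length normalization, and tokenization handling differ per benchmark and per eval harness, so treat this as illustrative only:

  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM

  # Illustrative zero-shot multiple-choice scoring: pick the answer whose
  # tokens get the highest total log-probability when appended to the question.
  model_name = "cerebras/Cerebras-GPT-13B"  # any causal LM works here
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)
  model.eval()

  def answer_logprob(question, answer):
      """Sum of log-probs of the answer tokens, conditioned on the question.
      Simplification: assumes tokenizing question+answer keeps the question's
      token boundaries intact, which real harnesses handle more carefully."""
      q_ids = tokenizer(question, return_tensors="pt").input_ids
      full_ids = tokenizer(question + answer, return_tensors="pt").input_ids
      with torch.no_grad():
          logits = model(full_ids).logits
      # logprobs[t] is the distribution over the token at position t+1
      logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
      positions = range(q_ids.shape[1] - 1, full_ids.shape[1] - 1)
      targets = full_ids[0, q_ids.shape[1]:]
      return sum(logprobs[pos, tok].item() for pos, tok in zip(positions, targets))

  question = "Q: To keep a drink cold on a hot day, you should\nA:"
  choices = [" put it in the sun.", " put it in a cooler with ice."]
  scores = [answer_logprob(question, c) for c in choices]
  print(choices[scores.index(max(scores))])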


Looks like LLaMA 7B was trained on about 4 times more tokens than Cerebras-GPT 13B (1T vs. ~260B).



