
They have a baseline transformer of max size 6B in the tables. The other models are trained on very different data, and probably trained differently.


All the MQA transformers, Hawk, and Griffin are trained on the same MassiveText dataset, so no.


Yes, but the MQA baseline is limited to 6B, while the "other" larger non-RNN models in the table (Llama-2) are not trained on the same dataset, and Hawk and Griffin are 7B. Sorry, I don't understand your point.


The point is that it also beats the baseline at every other size (1B and 3B), so it wouldn't be surprising if it beat a 7B transformer just as it beats the 6B model. Note 2 on page 5 probably explains why the sizes differ.



