
They have a baseline transformer of max size 6B in the tables. The other models are trained on very different data, and probably trained differently.


All the MQA transformers, Hawk, and Griffin are trained on the same MassiveText dataset, so no.


Yes, but the MQA baseline is limited to 6B, while the "other" larger non-RNN models in the table (Llama-2) are not trained on the same dataset, and Hawk and Griffin are 7B. Sorry, I don't understand your point.


The point is that it also beats the baseline at every other size (1B and 3B), so it wouldn't be surprising if it beat a 7B transformer just as it beats the 6B model. Note 2 on page 5 probably explains why the sizes differ.



