Yes, but the MQA baseline is limited to 6B, while the "other" larger non-RNN models in the table (Llama-2) are not trained on the same dataset, and Hawk and Griffin are 7B. Sorry, I don't understand your point.
The point is that it also beats the baseline at every other size (1B and 3B). So it wouldn't be surprising to see it beat a 7B transformer baseline just as it beats the 6B one. Note 2 on page 5 probably explains why the sizes differ.