GPT-4 is much stronger, though. You're comparing apples to oranges.

verdverm · on May 2, 2024

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

1250 vs 1253 Elo, is that really "much stronger"?
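
For a sense of scale, here is a rough sketch that assumes the standard Elo expected-score formula with a 400-point scale (the leaderboard's actual rating method may differ, and `expected_score` is just an illustrative helper name). Under that assumption, a 3-point gap works out to roughly a 50.4% expected win rate for the higher-rated model:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula.

    Assumption: the usual logistic curve with base 10 and a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# The two ratings quoted above; which model holds which rating is not the point here.
higher, lower = 1253, 1250
print(f"{expected_score(higher, lower):.4f}")  # 0.5043
```

In other words, if the ratings behaved like plain Elo, the higher-rated model would be preferred in only slightly more than half of head-to-head comparisons.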

ein0p · on May 2, 2024

Yes, it is much stronger in my personal experience on real-world queries: far fewer hallucinations, more ability to answer nontrivial questions, and much better coverage of the tail. That is not surprising for a much larger model, but it is unlikely to make much of a difference in largely superficial evals. Source: I personally use both, as well as Anthropic models, multiple times daily, and I use them in batch use cases as well.