Claude 3 is a very clear improvement on GPT-4, but where GPT-4 does have the edge is that it doesn't rate limit you as quickly or as harshly... I find myself running out of Claude prompts very fast. Not because I'm asking questions better suited to a smaller model, but because when I'm debugging a prompt or a hard problem, I'll quickly run out of requests if I get stuck in a corner.
Yes, and they basically conclude that OpenAI might be a better choice for you despite Claude 3 Opus technically performing better.
> While Opus got the highest score, it was only a few points higher than the GPT-4 Turbo results. Given the extra costs of Opus and the slower response times, it remains to be seen which is the most practical model for daily coding use.
> ... snip ...
> Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI’s models. You can get almost the same coding skill faster and cheaper with OpenAI’s models.
It's an interesting time for AI. Is this the first sign of a launched commercial product hitting diminishing returns given current LLM design? I'm going to be very interested in seeing where OpenAI is headed next, and "GPT-5" performance.
Also, given these indicators, the real news here might not be that Opus just barely has an edge on GPT-4 at a high cost, but what's going on at the lower/cheaper end, where both Sonnet and Haiku now beat some current versions of GPT-4 on LMSys Chatbot Arena. https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
Considering that Sonnet is offered for free on claude.ai, ChatGPT 3.5 in particular now looks hopelessly behind.
I don't care about Opus; it's way overpriced unless you're using it through the web interface.
Sonnet and Haiku are absolute achievements for the speed/cost though.
I recently read research demonstrating that having multiple AIs answer a question, then treating their answers as votes to select the final answer, significantly improves question answering performance (https://arxiv.org/pdf/2402.05120.pdf). While this approach isn't really cost effective or fast enough in most cases, I think with Claude 3 Haiku it might just work, as you can have it answer a question 10 times for the cost of a single GPT-3.5/Sonnet API call.
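The voting idea is simple enough to sketch. A minimal version, assuming an `ask` callable that wraps whatever model API you're using (simulated here with a toy function rather than a real Haiku call):

```python
from collections import Counter
import itertools

def majority_vote(ask, question, n=10):
    """Ask the model the same question n times and return the most
    common answer plus the fraction of samples that agreed with it."""
    answers = [ask(question) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Toy stand-in for a real API call: a "model" that answers
# correctly 7 times out of 10 samples.
responses = itertools.cycle(["4", "4", "5", "4", "4", "3", "4", "4", "5", "4"])
result, agreement = majority_vote(lambda q: next(responses), "What is 2 + 2?")
# result == "4", agreement == 0.7
```

In practice you'd also need some normalization step (exact string matching only works for short, constrained answers), and with temperature > 0 the samples actually differ, which is the whole point of voting.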
Claude 3 is a clear improvement in stylish writing, because it hasn't been turbo-aligned to produce "helpful short form article answers." Coding wise, it depends on the language and a lot of other factors, but I don't think it's a clear winner over GPT4.
I've noticed that Claude likes to really ham up its writing though, and you have to actively prompt it to be less hammy. GPT4's writing is less hammy, but sounds vaguely like marketing material even when it's clearly not supposed to be.
I am curious how Perplexity handles the rate limiting. I use it a lot during the course of the day and hit no rate limits, even with Claude 3 Opus set as the default model.