Claude 3 is a very clear improvement on GPT-4, but where GPT-4 does have the edge is that it doesn't rate limit you as quickly or as harshly... I find myself running out of Claude prompts very fast. Not because I'm asking questions better suited to a smaller model, but because when I'm debugging a prompt or a hard problem, I'll quickly run out of requests if I get stuck in a corner.
Yes, and they basically conclude that OpenAI might be a better choice for you despite Claude 3 Opus technically performing better.
> While Opus got the highest score, it was only a few points higher than the GPT-4 Turbo results. Given the extra costs of Opus and the slower response times, it remains to be seen which is the most practical model for daily coding use.
> ... snip ...
> Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI’s models. You can get almost the same coding skill faster and cheaper with OpenAI’s models.
It's an interesting time for AI. Is this the first sign of a launched commercial product hitting diminishing returns given current LLM design? I'm going to be very interested in seeing where OpenAI is headed next, and "GPT-5" performance.
Also, given these indicators, the real news here might not be that Opus just barely has an edge on GPT-4 at a high cost, but what's going on at the lower/cheaper end, where both Sonnet and Haiku now beat some current versions of GPT-4 on LMSys Chatbot Arena. https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
Considering that Sonnet is offered for free on claude.ai, ChatGPT 3.5 in particular now looks hopelessly behind.
I don't care about Opus; it's way overpriced unless you're using it through the web interface.
Sonnet and Haiku are absolute achievements for the speed/cost though.
I recently read research demonstrating that having multiple AIs answer a question, then treating their answers as votes to select the final answer, significantly improves question answering performance (https://arxiv.org/pdf/2402.05120.pdf). While this approach isn't really cost effective or fast enough in most cases, I think with Claude 3 Haiku it might just work, as you can have it answer a question 10 times for the cost of a single GPT-3.5/Sonnet API call.
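The voting idea is simple enough to sketch. A minimal version, assuming an `ask` callable that wraps whatever model API you're using (simulated here with a toy function rather than a real Haiku call):

```python
from collections import Counter
import itertools

def majority_vote(ask, question, n=10):
    """Ask the model the same question n times and return the most
    common answer plus the fraction of samples that agreed with it."""
    answers = [ask(question) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Toy stand-in for a real API call: a "model" that answers
# correctly 7 times out of 10 samples.
responses = itertools.cycle(["4", "4", "5", "4", "4", "3", "4", "4", "5", "4"])
result, agreement = majority_vote(lambda q: next(responses), "What is 2 + 2?")
# result == "4", agreement == 0.7
```

In practice you'd also need some normalization step (exact string matching only works for short, constrained answers), and with temperature > 0 the samples actually differ, which is the whole point of voting.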
Claude 3 is a clear improvement in stylish writing, because it hasn't been turbo-aligned to produce "helpful short form article answers." Coding wise, it depends on the language and a lot of other factors, but I don't think it's a clear winner over GPT4.
I've noticed that Claude likes to really ham up its writing though, and you have to actively prompt it to be less hammy. GPT4's writing is less hammy, but sounds vaguely like marketing material even when it's clearly not supposed to be.
I am curious how Perplexity handles the rate limiting. I use it a lot during the course of the day and hit no rate limits, even with Claude 3 Opus set as the default model.