
In a brief test, I found that the bigger context window only meant that I could stuff a whole schema into the input. It still hallucinated a value. When I plugged in a call to a vector embedding to only use the top k most "relevant" fields it did exactly what I wanted: https://twitter.com/_cartermp/status/1657037648400117760

YMMV.
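The top-k trick described above can be sketched roughly as follows. This is a minimal, self-contained sketch: `embed` is a hypothetical stand-in for a real embedding model (in practice you'd call an embedding API), implemented here as normalized bag-of-words vectors so the example runs on its own.

```python
import math

def embed(text, vocab):
    # Toy stand-in for a real embedding model: normalized word counts.
    counts = [text.lower().split().count(w) for w in vocab]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def top_k_fields(query, fields, k=2):
    """Return the k schema fields most similar to the query, so only
    those fields get stuffed into the prompt instead of the whole schema."""
    vocab = sorted({w for f in fields for w in f.lower().split()}
                   | set(query.lower().split()))
    q = embed(query, vocab)
    scored = [(sum(a * b for a, b in zip(q, embed(f, vocab))), f)
              for f in fields]
    scored.sort(reverse=True)
    return [f for _, f in scored[:k]]

fields = [
    "user_id: unique identifier for the user",
    "created_at: timestamp the row was inserted",
    "email: user email address",
    "last_login: timestamp of most recent login",
]
print(top_k_fields("most recent login timestamp", fields, k=2))
```

With a real embedding model the similarity is semantic rather than lexical, but the shape of the retrieval step is the same: score every field against the query, keep the top k, and prompt with only those.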



The fundamental problem seems to be that it's still slightly sub-GPT-3.5-quality, and even a long context window can't fix that. It will remember things from many, many tokens ago, but it still doesn't reliably produce passable work.

The combination of a GPT-4-quality model and a long context window will unlock a lot of applications that now rely on somewhat lossy window-prying hacks (e.g. summarizing chunks). But any model quality below that won't move the needle much in terms of what useful work is possible, with the exception of fairly simple summarization and text analysis tasks.
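The "summarizing chunks" workaround mentioned above usually looks like a map-reduce over the document. A minimal sketch, where `summarize` is a hypothetical stand-in for an LLM call (here it just keeps each chunk's first sentence so the example is runnable):

```python
def summarize(text):
    # Placeholder for an LLM summarization call: keep the first sentence.
    return text.strip().split(". ")[0].rstrip(".") + "."

def chunked_summary(document, chunk_size=200):
    # Map: split the document into window-sized chunks and summarize each.
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    partials = [summarize(c) for c in chunks]
    # Reduce: summarize the concatenated partial summaries.
    return summarize(" ".join(partials))
```

The lossiness is visible in the structure itself: anything the per-chunk summaries drop is gone before the final pass ever sees it, which is exactly what a genuinely long context window would avoid.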


Maybe! I certainly look forward to that. Although in my testing GPT-4 also hallucinates a bit (less than gpt-3.5), and the latency is so poor that it's unworkable for our product.


Agreed. My heuristic is that GPT-4 is good for compile time tasks but bad for runtime tasks for both cost and speed reasons.


> The fundamental problem seems to be that it's still slightly sub-GPT-3.5-quality

It really depends on what you use it for.

I've found Claude better than GPT-4 and even Claude+ at creative writing.

It also tends to give more comprehensive explanations without additional prompting. So I prefer to have it, rather than GPT-3.5 or GPT-4, explain things to me.

It's also free, which is another big win over GPT-4.


I find Claude significantly better than 3.5. I’d love to be able to make the case for that with data…


Since Chatbot Arena Leaderboard https://lmsys.org/blog/2023-05-10-leaderboard/ agrees with you, it's not just you.


There are two main Claude models. I'm guessing it's claude-v1.3 (aka Claude+) that you find much better than 3.5? That tracks if so.


I've found for my use case that both claude-instant-* and claude-* are roughly on par with each other and gpt-3.5. claude-* seems to be the least inaccurate, but we also haven't put it into production like gpt-3.5, so it's hard to say for sure.

In either case, the Claude models are very good. I think they'd do fine in a real product. But there are definitely issues that they all have (or that my prompt engineering has).


I am very impressed with the quality of GPT-4, even with the 8k model. However, I have started reaching the limit of what the 8k model can do. I am eagerly awaiting the release of the 32k model.

Claude's 100k model is nowhere near that in terms of quality, in my experience.



