
In a brief test, I found that the bigger context window only meant that I could stuff a whole schema into the input. It still hallucinated a value. When I plugged in a call to a vector embedding to only use the top k most "relevant" fields it did exactly what I wanted: https://twitter.com/_cartermp/status/1657037648400117760

YMMV.
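The top-k trick described above can be sketched roughly as follows. This is a minimal, self-contained sketch: `embed` is a hypothetical stand-in for a real embedding model (in practice you'd call an embedding API), implemented here as normalized bag-of-words vectors so the example runs on its own.

```python
import math

def embed(text, vocab):
    # Toy stand-in for a real embedding model: normalized word counts.
    counts = [text.lower().split().count(w) for w in vocab]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def top_k_fields(query, fields, k=2):
    """Return the k schema fields most similar to the query, so only
    those fields get stuffed into the prompt instead of the whole schema."""
    vocab = sorted({w for f in fields for w in f.lower().split()}
                   | set(query.lower().split()))
    q = embed(query, vocab)
    scored = [(sum(a * b for a, b in zip(q, embed(f, vocab))), f)
              for f in fields]
    scored.sort(reverse=True)
    return [f for _, f in scored[:k]]

fields = [
    "user_id: unique identifier for the user",
    "created_at: timestamp the row was inserted",
    "email: user email address",
    "last_login: timestamp of most recent login",
]
print(top_k_fields("most recent login timestamp", fields, k=2))
```

With a real embedding model the similarity is semantic rather than lexical, but the shape of the retrieval step is the same: score every field against the query, keep the top k, and prompt with only those.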



The fundamental problem seems to be that it's still slightly sub-GPT-3.5-quality, and even a long context window can't fix that. It will remember things from many, many tokens ago, but it still doesn't reliably produce passable work.

The combination of a GPT-4-quality model and a long context window will unlock a lot of applications that now rely on somewhat lossy window-prying hacks (e.g. summarizing chunks). But any model quality below that won't move the needle much in terms of what useful work is possible, with the exception of fairly simple summarization and text analysis tasks.
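The "summarizing chunks" workaround mentioned above usually looks like a map-reduce over the document. A minimal sketch, where `summarize` is a hypothetical stand-in for an LLM call (here it just keeps each chunk's first sentence so the example is runnable):

```python
def summarize(text):
    # Placeholder for an LLM summarization call: keep the first sentence.
    return text.strip().split(". ")[0].rstrip(".") + "."

def chunked_summary(document, chunk_size=200):
    # Map: split the document into window-sized chunks and summarize each.
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    partials = [summarize(c) for c in chunks]
    # Reduce: summarize the concatenated partial summaries.
    return summarize(" ".join(partials))
```

The lossiness is visible in the structure itself: anything the per-chunk summaries drop is gone before the final pass ever sees it, which is exactly what a genuinely long context window would avoid.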


Maybe! I certainly look forward to that. Although in my testing GPT-4 also hallucinates a bit (less than gpt-3.5), and the latency is so poor that it's unworkable for our product.


Agreed. My heuristic is that GPT-4 is good for compile time tasks but bad for runtime tasks for both cost and speed reasons.


> The fundamental problem seems to be that it's still slightly sub-GPT-3.5-quality

It really depends on what you use it for.

I've found Claude better than GPT-4 and even Claude+ at creative writing.

It also tends to give more comprehensive explanations without additional prompting. So I prefer to have it, rather than GPT-3.5 or GPT-4, explain things to me.

It's also free, which is another big win over GPT-4.


I find Claude significantly better than 3.5. I’d love to be able to make the case for that with data…


Since Chatbot Arena Leaderboard https://lmsys.org/blog/2023-05-10-leaderboard/ agrees with you, it's not just you.


There are two main Claude models. I'm guessing it's claude-v1.3 (aka Claude+) that you find much better than 3.5? That tracks if so.


I've found for my use case that both claude-instant-* and claude-* are roughly on par with each other and gpt-3.5. claude-* seems to be the least inaccurate, but we also haven't put it into production like gpt-3.5, so it's hard to say for sure.

In either case, the Claude models are very good. I think they'd do fine in a real product. But there are definitely issues that they all have (or that my prompt engineering has).


I am very impressed with the quality of GPT-4, even with the 8k model. However, I have started reaching the limit of what the 8k model can do. I am eagerly awaiting the release of the 32k model.

Claude's 100k model is nowhere near that in terms of quality, in my experience.



