Hacker News

Guessing you used 4o and not 4o-mini. For stuff like this you're better off letting it use mini, which is practically free, and then having it double- and triple-check everything.


This assumes that the model knows it is wrong. It doesn't.

It only knows, statistically, which sequence of words is most likely to match your query.

This gets worse for rarer datasets. E.g., when I had Claude/OpenAI help out with an IntelliJ plugin, they would continually invent methods on classes that never existed, and could never articulate why.


This is where supporting machinery & RAG are very useful.

You can auto-lint and test code before you set eyes on it, then re-run the prompt with more context or an altered prompt. With local models there are also options like steering vectors, fine-tuning, and constrained decoding.

There's also evidence that multiple models of different lineages, when you rate their outputs and take the best one at each step, can surpass the performance of stronger individual models. So if one model knows something the others don't, you can automatically fail over to the one that can actually handle the problem, and once that knowledge is in the chat the other models will typically pick it up.
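That best-of-several-models step can be sketched like this. Both `models` (callables standing in for different LLM APIs) and `rate` (a scoring function, e.g. a reward model or test-pass count) are hypothetical placeholders in this sketch:

```python
def best_of_models(models, rate, prompt, history=""):
    """models: list of callables (prompt -> answer).
    rate: callable (answer -> score), higher is better.
    At each step every model answers; the top-rated answer is kept
    and appended to the shared context, so on the next step the
    weaker models see it and can build on it."""
    answers = [m(history + prompt) for m in models]
    best = max(answers, key=rate)
    new_history = history + prompt + "\n" + best
    return best, new_history
```

Calling this once per turn and threading `new_history` back in is what lets knowledge from one model propagate to the others.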

Not saying any readily available software solves your specific problem, but there are approaches to it that go beyond what current off-the-shelf tools do.


It doesn't make sense that the solution here is to put more load on the user to continually adjust the prompt or try different models.

I asked Claude and OpenAI models over 30 times to generate code. Both failed every time.


If Claude and OpenAI are so useless why does every company ban it during interviews?


Managers make most of those decisions, and they have no idea what is achievable, reasonable, or even particularly likely.


Do you think that says more about the tools or the interview process?


This is a really complicated (and more expensive) setup that doesn't fundamentally fix any of the problems with these systems.


Yep when I read stuff like this I think, "nah I'll just write the damn code." Looking forward to being replaced by a robot, myself.


Popular programming in a nutshell.

It’s the new pop psych.


4o-mini is cheap, but it is not practically free. At scale it will still rack up costs, although I acknowledge we are currently in the honeymoon phase with it. Computing is the kind of thing we simply do more of when it becomes cheaper, with the budget staying constant.


It doesn't work like that. You're more likely to end up with a fractal pattern of token waste, potentially veering off into hallucinations, than with any actual progress from "double" or "triple checking" everything.



