Hacker News

Guessing you used 4o and not 4o-mini. For stuff like this you're better off letting it use mini, which is practically free, and then having it double- and triple-check everything.


This assumes that the model knows it is wrong. It doesn't.

It only knows, statistically, which sequence of words is most likely to match your query.

This gets worse for rarer datasets. E.g., when I had Claude/OpenAI help out with an IntelliJ plugin, they would continually invent methods on classes that never existed, and could never articulate why.


This is where supporting machinery & RAG are very useful.

You can auto-lint and test code before you set eyes on it, then re-run the prompt with more context or an altered prompt. With local models there are also options like steering vectors, fine-tuning, and constrained decoding.

There's also evidence that multiple models of different lineages, when you rate their outputs and take the best one at each step, can surpass the performance of stronger individual models. So if one model knows something the others don't, you can automatically fail over to the one that can actually handle the problem, and once that knowledge is in the chat the other models will typically pick it up.
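That best-of-several-models step can be sketched like this. Both `models` (callables standing in for different LLM APIs) and `rate` (a scoring function, e.g. a reward model or test-pass count) are hypothetical placeholders in this sketch:

```python
def best_of_models(models, rate, prompt, history=""):
    """models: list of callables (prompt -> answer).
    rate: callable (answer -> score), higher is better.
    At each step every model answers; the top-rated answer is kept
    and appended to the shared context, so on the next step the
    weaker models see it and can build on it."""
    answers = [m(history + prompt) for m in models]
    best = max(answers, key=rate)
    new_history = history + prompt + "\n" + best
    return best, new_history
```

Calling this once per turn and threading `new_history` back in is what lets knowledge from one model propagate to the others.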

Not saying any readily available software solves your specific problem, but there are approaches to it that go beyond what current off-the-shelf tools do.


It doesn't make sense that the solution here is to put more load on the user to continually adjust the prompt or try different models.

I asked Claude and OpenAI models over 30 times to generate code. Both failed every time.


If Claude and OpenAI are so useless why does every company ban it during interviews?


Managers make most of those decisions, and they have no idea what is achievable, reasonable, or even particularly likely.


Do you think that says more about the tools or the interview process?


This is a really complicated (and more expensive) setup that doesn't fundamentally fix any of the problems with these systems.


Yep when I read stuff like this I think, "nah I'll just write the damn code." Looking forward to being replaced by a robot, myself.


Popular programming in a nutshell.

It’s the new pop psych.


4o-mini is cheap, but it is not practically free. At scale it will still rack up costs, although I acknowledge we are currently in the honeymoon phase with it. Computing is the kind of thing we simply do more of when it becomes cheaper, with the budget staying constant.


It doesn't work like that. You're more likely to end up with a fractal pattern of token waste, potentially veering off into hallucinations, than with any actual progress from "double" or "triple checking" everything.



