
On a more important level, I've found that they still do really badly at even mildly complex tasks without extreme babysitting.

I wanted it to refactor a parser in a small project (2.5K lines total) because it'd gotten a bit too interconnected. It made a plan, which looked reasonable, so I told it to do this in stages, with checkpoints. It said it'd done so. I asked it "so is the old architecture also removed?" "No, it has not been removed." "Is the new structure used in place of the old one?" "No, it has not." After it did so, 80% of the test suite failed because nothing it'd written was actually right.

I tried this three times with increasingly more babysitting, but it failed at the abstract task of "refactor this" no matter what, with pretty much the same failure mode. I feel like I have to tell it exactly which changes to make: change X and Y in class Z, remove class A, etc. At that point I can't let it do stuff unsupervised, which is half the reason for letting an LLM do this in the first place.



> I wanted it to refactor a parser in a small project

This expression tree parser (a TypeScript-to-SQL query builder - https://tinqerjs.org/) has zero lines of hand-written code. It was made with Codex + Claude over two weeks (part-time, on the side). Having worked on ORMs previously, I know it would have taken me 4x-10x the time to get to the same state (which also has hundreds of tests, with some repetition). That's a massive saving in time.

I did not have to babysit the LLMs at all. So the answer is, I think, that it depends on what you use it for and how you use it. Like every tool, it takes a really long time to find a process that works for you. In my conversations with other developers who use LLMs extensively, they all have their own unique, custom workflows. All of them, however, focus on test suites, documentation, and methodical review processes.


Hum yeah, it shows. Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.


> Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here.

How does the API look completely different for pg and sqlite? Can you share an example?

It's an implementation of LINQ's IQueryable, with some bells and whistles that are missing in .NET's Queryable, like window functions (RANK queries etc.), which I find quite useful.

Add: What you've mentioned is largely incorrect. But in any case, it is a query builder, meaning an ORM-like database abstraction is not the goal. This allows us to support pg's extensions, which aren't applicable to other databases.


I guess the interesting question is whether @jeswin could have created this project at all if AI tools were not involved. And if yes, would the quality even have been better?


Very true. However, to claim that the "API looks completely different for Postgre and SQLite" is disingenuous. What was he looking at?


There are two examples on the landing page, and they both look quite different. Surely if the API is the same for both, there'd be just one example that covers both cases, or two examples deliberately made as identical as possible? (Like, just a different `new` somewhere, or a different import directive at the top, and everything else exactly the same?) I think that's the point.

Perhaps experienced users of relevant technologies will just be able to automatically figure this stuff out, but this is a general discussion - people not terribly familiar with any of them, but curious about what a big pile of AI code might actually look like, could get the wrong impression.


If you're referring to the first two examples, they're doing different things: the pg example does an orderBy, and the sqlite example does a join. You can switch the client (i.e., better-sqlite and pg-promise) in either statement, and the same query will work on the other database.
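
To illustrate the idea (a toy sketch only; the class and function names here are invented for illustration and are not the actual tinqerjs API): the query is defined once as data, and only the execution step differs per client.

```typescript
// Toy sketch (not the real tinqerjs API): a query is recorded as data,
// so the same definition can later be compiled for pg or sqlite.
type Predicate<T> = (row: T) => boolean;
type KeySelector<T> = (row: T) => unknown;

interface QueryPlan<T> {
  wheres: Predicate<T>[];
  orderBys: KeySelector<T>[];
}

class ToyQueryable<T> {
  constructor(readonly plan: QueryPlan<T> = { wheres: [], orderBys: [] }) {}
  where(p: Predicate<T>): ToyQueryable<T> {
    return new ToyQueryable({ ...this.plan, wheres: [...this.plan.wheres, p] });
  }
  orderBy(k: KeySelector<T>): ToyQueryable<T> {
    return new ToyQueryable({ ...this.plan, orderBys: [...this.plan.orderBys, k] });
  }
}

interface User { id: number; name: string; age: number; }

// Defined once, with no reference to any database driver:
const adults = new ToyQueryable<User>()
  .where(u => u.age >= 18)
  .orderBy(u => u.name);

// Only the execution step would change: the same plan could be handed to a
// pg-promise client or a better-sqlite3 client and compiled to that dialect.
console.log(adults.plan.wheres.length, adults.plan.orderBys.length); // -> 1 1
```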

Maybe I should use the same example repeated for clarity. Let me do that.

Edit: Fixed. Thank you.


Actually the interesting question is whether this library not existing would have been a loss for humanity. I'll posit that it would not.


Quite impressive, thank you for sharing!

Question: this loads a 2 MB JS parser written in Rust to turn `x => x.foo` into `{ op: 'project', field: 'foo', target: 'x' }`. But you don't actually allow any complex expressions (and you certainly don't seem to recursively parse references or allow return uplift, e.g. I can't extract out `isOver18` or `isOver(age: int)(Row: IQueryable): IQueryable`). Why did you choose the AST route instead of doing the same thing with a handful of regular expressions?


Parsing code with regex is a minefield. You can get it to work for simpler cases, but even that gets complex very quickly with all the formatting preferences people have. In fact, I'd be very surprised if it could be done with a few regular expressions, so I never gave it much consideration. Additionally, improved subquery support etc. is coming, which involves deeper recursion.
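
A quick illustration of why regexes get hairy (illustrative only, not tinqerjs code): the same one-property projection can be written in several shapes, and a naive pattern only catches the first.

```typescript
// Illustrative only: a naive regex for "x => x.foo" versus the many ways
// the same lambda can legitimately be written.
const naive = /^\s*(\w+)\s*=>\s*\1\.(\w+)\s*$/;

const variants = [
  "x => x.foo",                  // matches
  "(x) => x.foo",                // parentheses around the parameter: no match
  "x => (x.foo)",                // parentheses around the body: no match
  "x => {\n  return x.foo;\n}",  // block body: no match
  "x /* row */ => x.foo",        // comment in the middle: no match
];

for (const src of variants) {
  console.log(JSON.stringify(src), naive.test(src));
}

// An AST-based parser normalizes every variant into the same node shape
// (roughly { op: "project", field: "foo", target: "x" }), which is what
// makes the heavier dependency worth it.
```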

I could have allowed (I did consider it) functions external to the expression, like isOver18 in your example. But it would have come at the cost of the parser having to look across the code base, and would have required tinqerjs to attach via build-time plugins. The only other way (without plugins) might be to identify callers via Error.stack and attempt to find the calling JS file.
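
For reference, the Error.stack idea looks roughly like this (a sketch assuming V8/Node-style stack frames, which are engine-specific; this is not how tinqerjs works today):

```typescript
// Sketch: find the file path of whoever called us by parsing Error.stack.
// Stack formats are engine-specific (this assumes V8/Node), and the frame
// index is fragile, which is part of why this approach is unattractive.
function callingFile(): string | undefined {
  const stack = new Error().stack ?? "";
  const frames = stack.split("\n").slice(1); // drop the "Error" header line
  // frames[0] is this function; frames[1] is its immediate caller.
  const callerFrame = frames[1] ?? "";
  // Typical frame: "    at someFn (/path/to/caller.ts:12:5)"
  const match = callerFrame.match(/\(?([^()\s]+):\d+:\d+\)?$/);
  return match?.[1];
}

function someLibraryEntryPoint(): void {
  // A library could then read and parse the caller's source file to resolve
  // external references like isOver18.
  console.log("called from:", callingFile());
}

someLibraryEntryPoint();
```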


Development tools and libraries seem like one of the absolute easiest use cases for LLMs, since they generally have far less ambiguous requirements than other software, and the LLMs generally have an enormous amount of data in their training set to help them understand the domain.


I have tried several. Overall I've now settled on strict TDD (which it still seems not to do unless I explicitly tell it to, even though I have it as a hard requirement in claude.md).


Claude forgets claude.md after a while, so you need to keep reminding it. I find that Codex does a better job at design than Claude at the moment, but it's 3x slower, which I don't mind.


>I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place.

The reason had better turn into "it can do stuff faster than I ever could if I give it step-by-step, high-level instructions" instead.


That would be a solution, yes. But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.

I hate this idea of "well, you just need to understand all the arcane ways in which to use it to proper effect".

It's like a car that has a gear shifter, but the shifter isn't fully functional yet, so instead you switch gears by spelling out the gear you want in Morse code, using L as short and R as long. Furthermore, you shouldn't listen to 105-112 on the FM band, because those frequencies are used to control the brakes and ABS, and if you tune into them the brakes no longer work.

We would rightfully stone any engineer who designed this and then said "well, obvious user error" when the user complains that they crash whenever they listen to Arrow FM.


>But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.

Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)

>We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM.

We might curse the company and engineer who did it, but we would still use that car and do those workarounds, if doing so allowed us to get to our destination in 1/10 the regular time...


> >But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways.

> Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :)

But we do, though. You can't just say "yeah, they left all the footguns in but we ought to know not to use them", especially not when the industry shills tell you those footguns are actually rocket boosters to get you to the fucking moon and back.


I was hoping that LLMs being able to access strict tools, like Gemini using Python libraries, would finally give reliable results.

So today I asked Gemini to simplify a mathematical expression with sympy. It did and explained to me how some part of the expression could be simplified wonderfully as a product of two factors.

But it was all a lie. Even though I explicitly asked it to use sympy in order to avoid such hallucinations and get results that are actually correct, it used its own flawed reasoning on top and again gave me a completely wrong result.

You still cannot trust LLMs. And that is a problem.


The obvious point has to be made: generating formal proofs might be a partial fix for this. Coding, by contrast, is too informal for that approach to be as effective.


Might be related to what the article was talking about: AI can't cut-paste. It deletes the code and then regenerates it at another location instead of cutting and pasting.

Obviously the regenerated code drifts a little from what was deleted.


Interesting. What model and tool were used?

I have seen similar failure modes in Cursor and VSCode Copilot (using gpt5) where I have to babysit relatively small refactors.


Claude Code. Whichever model it started up with automatically last weekend; I didn't explicitly check.


This feels like a classic Sonnet issue. In my experience, Opus or GPT-5-high are less likely than Sonnet to do the "narrow instruction following without making sensible wider decisions based on context" thing.


This is "just use another Linux distro" all over again


Yes and no; it's a fair criticism to some extent, inasmuch as I would agree that different models of the same type have superficial differences.

However, I also think that models which focus on higher reasoning effort in general are better at taking into account the wider context and not missing obvious implications from instructions. Non-reasoning or low-reasoning models serve a purpose, but to suggest they are akin to different flavours misses what is actually quite an important distinction.



