
I'm in the camp of 'no good for existing code'. I try to get ~1000-line files refactored to use different libraries, design paradigms, etc., and it usually outputs garbage: everything from pulling DB logic into the UI and grabbing unrelated API/function calls to outright corrupting the output.

I'm sure there is a way to correctly use this tool, so I'm feeling like I'm "just holding it wrong".



Which LLM are you using? What LLM tool are you using? What tech stack are you generating code for? And, without sharing anything you can't, what prompts are you using?


It was more of a general comment - I'm surprised there is significant variation between any of the frontier models?

To answer, though: VS Code, with various Python frameworks/libraries (Dash, FastAPI, pandas, etc.). Typically I pass the 4-5 relevant files in as context.

I'm developing via Docker, so I haven't found a nice way for agents to work.


> I'm surprised there is significant variation between any of the frontier models?

This comment of mine is a bit dated, but even the same model can have significant variation if you change the prompt by just a few words.

https://news.ycombinator.com/item?id=42506554


I would suggest using an agentic system like Cline, so that the LLM can wander through the codebase by itself and do research and build a "mental model" and then set up an implementation plan. Then you iterate on that plan and hand it off for implementation. This flow works significantly better than what you're describing.


> LLM can wander through the codebase by itself and do research and build a "mental model"

It can't really do that due to context length limitations.


It doesn't need the entire codebase, it just needs the call map, the function signatures, etc. It doesn't have to include everything in a call - but having access to all of it means it can pick what seems relevant.
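
A rough sketch of what I mean by that overview, in Python with the standard-library ast module (the src/ directory here is just a placeholder):

    # Build a compact "repo map": class names and function signatures only.
    import ast
    from pathlib import Path

    def signatures(path: Path) -> list[str]:
        tree = ast.parse(path.read_text(), filename=str(path))
        sigs = []
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                sigs.append(f"def {node.name}({args})")
            elif isinstance(node, ast.ClassDef):
                sigs.append(f"class {node.name}")
        return sigs

    for f in Path("src").rglob("*.py"):
        print(f"# {f}")
        for sig in signatures(f):
            print("  " + sig)

That map is usually a tiny fraction of the full source, so it fits in context even when the code itself would not.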


Yes, that's exactly right. The LLM gets a rough overview of the project (as you said, including function signatures and such) and will then decide what to open and use to complete/implement the objective.


In a real project the call map and function signatures are millions of tokens themselves.


For sufficiently large values of real.


Anything less is not a "project", it's a "file".


That's right, there is no true Scotsman!


An incorrect attempt at fallacy baiting.

If your repo map fits into 1000 tokens then your repo is small enough that you can just concatenate all the files together and feed the result as one prompt to the LLM.

No, current LLM technology does not allow processing actual (i.e. large) repos.
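
To make "just concatenate" concrete, here's a crude sketch for a small Python repo (the ~4 characters per token figure is only a ballpark heuristic):

    # Concatenate every source file into one prompt and roughly gauge its size.
    from pathlib import Path

    files = sorted(Path(".").rglob("*.py"))
    prompt = "\n\n".join(f"# FILE: {f}\n{f.read_text()}" for f in files)
    approx_tokens = len(prompt) // 4  # very rough; use a real tokenizer for accuracy
    print(f"{len(files)} files, ~{approx_tokens} tokens")
    # Anything much past a model's context window won't fit this way.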


Where's your cutoff for "large"?


1k LOC is perfectly fine; I haven't had issues with Claude on most (not all) projects around ~1k LOC.


Actual projects where you'd want some LLM help start with millions of lines of code, not thousands.

With 1k lines of code you don't need an LLM, the entire source code can fit in one intern's head.


The OP mentioned having LLM issues with 1k LOC, so I suppose he would have problems with millions. :D


Have you tried Claude Code yet?

Even with its 200,000-token limit it's still really impressive at diving through large codebases using find and grep.
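
The trick is that search hits are tiny compared to whole files. A toy illustration of the idea in Python (not Claude Code's actual implementation; the search pattern is made up):

    # A grep-style tool: the model sees only matching lines plus locations,
    # never entire files, so the 200k context goes a long way.
    from pathlib import Path

    def grep(pattern: str, root: str = ".") -> list[str]:
        hits = []
        for f in Path(root).rglob("*.py"):
            for i, line in enumerate(f.read_text(errors="ignore").splitlines(), 1):
                if pattern in line:
                    hits.append(f"{f}:{i}: {line.strip()}")
        return hits

    print("\n".join(grep("def create_user")[:20]))  # cap output to keep it small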


I guess people are talking about different kinds of projects here in terms of project size.


I've refactored some files over 6,000 LOC. It was necessary to do it iteratively, with smaller patches ("Do not attempt to modify more than one function per iteration"), because otherwise it would just gloss over stuff. I would tell it repeatedly, "I noticed you missed something, can you find it?", and kept doing that until it couldn't find anything. Then I had to manually review and ask for more edits. I also gave it lots of style guidelines and scope-limit instructions. In the end it worked fine and saved me hours of really boring work.


I'll back this up. I feel constantly gaslit by people who claim they get good output.

I was hacking on a new project and wanted to see if LLMs could write some of it. So I picked an LLM friendly language (python). I picked an LLM friendly DB setup (sqlalchemy and postgres). I used typing everywhere. I pre-made the DB tables and pydantic schema. I used an LLM-friendly framework (fastapi). I wrote a few example repositories and routes.

I then told it to implement a really simple repository and routes (users stuff) from a design doc that gave strict requirements. I got back a steaming pile of shit. It was utterly broken. It ignored my requirements. It fucked with my DB tables. It fucked with (and broke) my pydantic. It mixed db access into routes which is against the repository pattern. Etc.
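
For reference, this is roughly the shape I was asking for, boiled down to a toy example (the model and names here are made up, not my actual code): routes only talk to a repository, and only the repository touches the session.

    # Toy sketch of the repository pattern with FastAPI + SQLAlchemy + Pydantic.
    from fastapi import APIRouter, Depends, FastAPI, HTTPException
    from pydantic import BaseModel
    from sqlalchemy import String, create_engine
    from sqlalchemy.orm import (DeclarativeBase, Mapped, Session,
                                mapped_column, sessionmaker)

    class Base(DeclarativeBase):
        pass

    class User(Base):
        __tablename__ = "users"
        id: Mapped[int] = mapped_column(primary_key=True)
        email: Mapped[str] = mapped_column(String(255), unique=True)

    class UserOut(BaseModel):
        id: int
        email: str
        model_config = {"from_attributes": True}  # pydantic v2

    class UserRepository:
        """All DB access for users lives here, never in the routes."""
        def __init__(self, session: Session):
            self.session = session

        def get(self, user_id: int) -> User | None:
            return self.session.get(User, user_id)

    engine = create_engine("sqlite:///example.db")  # stand-in for postgres
    Base.metadata.create_all(engine)
    SessionLocal = sessionmaker(bind=engine)

    def get_session():
        with SessionLocal() as session:
            yield session

    app = FastAPI()
    router = APIRouter()

    @router.get("/users/{user_id}", response_model=UserOut)
    def read_user(user_id: int, session: Session = Depends(get_session)):
        user = UserRepository(session).get(user_id)
        if user is None:
            raise HTTPException(status_code=404, detail="User not found")
        return user

    app.include_router(router)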

I tried several of the best models from claude, oai, xai, and google. I tried giving it different prompts. I tried pruning unnecessary context. I tried their web interfaces and I tried cursor and windsurf and cline and aider. This was a pretty basic task I expect an intern could handle. It couldn't.

Every LLM enthusiast I've since talked to just gives me the run-around on tooling and prompting and whatever. "Well maybe if you used this eighteenth IDE/extension." "Well maybe if you used this other prompt hack." "Well maybe if you'd used a different design pattern."

The fuck?? Can vendors not produce a coherent set of usage guidelines? If this is so, why isn't there a set of known best practices? Why can't I ever replicate this? Why don't people publish public logs of their interactions to prove it can do this beyond a "make a bouncing ball web game" or basic to-do list app?


> Why don't people publish public logs of their interactions to prove it can do this beyond a "make a bouncing ball web game" or basic to-do list app?

It's possible I've published more of those than anyone else. I share links to Gists with transcripts of how I use the models all the time.

You can browse a lot of my collection here: https://simonwillison.net/search/?q=Gist&sort=date

Look for links that say things like "transcript".



