Hacker News | mulmboy's comments

LLMs aren't like you or me. They can comprehend large quantities of code quickly and piece things together easily from scattered fragments, so go-to-reference and the like become much less important. Of course, things change as the number of usages of a symbol becomes large, but in most cases the LLM can just make perfect sense of things via grep.

Providing it with refactoring as a tool also risks confusing it by adding too many tools.

It's the same reason that waffling for a few minutes via speech-to-text, with tangents and corrections and chaos, is just about as good as a carefully written prompt for coding agents.


> AI seems to have caught up to my own intelligence even in those narrow domains where I have some expertise. What is there left that AI can’t do that I would be able to verify?

The last few days I've been working on some particularly tricky problems, tricky both in the domain and in backwards compatibility with our existing codebase. For both of these problems GPT 5.2 has been able to come to the same ideas as my best, which took quite a bit of brain-racking for me to get to. Granted, it's required a lot of steering and context management from me, as well as judgement to discard other options. But it's really getting to the point that LLMs are a good sparring partner for (isolated technical) problems at the 99th percentile of difficulty.


You steered a sycophantic LLM to the same idea that you had already had & think that's worth bragging about?


I'm well aware that they can be sycophantic, and I structure things to avoid that, like asking "what do you think of this problem" and seeing whether the idea falls out, rather than providing anything that would suggest it. In one of these two cases it took an idea that I had an inkling of, fleshed it out, and expanded it to be much better than what I had.

And I'm not bragging. I'm expressing awe, and humility that I am finding a machine can match me on things that I find quite difficult. Maybe those things aren't so difficult after all.

By steering I mean steering to flesh out the context of the problem, to find relevant code, and to perform domain-specific research - not steering toward a specific solution.


Is it just me or is codex slow?

With claude code I'll ask it to read a couple of files and do x similar to existing thing y. It takes a few moments to read files and then just does it. All done in a minute or so.

I tried something similar with codex and it took 20 minutes reading around bits of files and this and that. I didn't bother letting it finish. Is this normal? Do I have something misconfigured? This was a couple of months ago.


What do these look like?


  1. Take every single function, even private ones.
  2. Mock every argument and collaborator.
  3. Call the function.
  4. Assert the mocks were called in the expected way.
These tests help you find inadvertent changes, yes, but they also create constant noise about changes you intend to make.
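
A minimal sketch of the pattern in Python with unittest.mock - hypothetical names (send_welcome, mailer, templates), not taken from the article:

  from unittest.mock import MagicMock

  # Hypothetical function under test: renders a template and sends an email.
  def send_welcome(user, mailer, templates):
      body = templates.render("welcome", name=user.name)
      mailer.send(to=user.email, body=body)

  def test_send_welcome_calls_collaborators():
      user = MagicMock()
      mailer = MagicMock()
      templates = MagicMock()

      send_welcome(user, mailer, templates)

      # These assertions restate the implementation's call shape rather than any
      # observable behaviour, so renaming a kwarg or inlining the template call
      # breaks the test even though the contract is unchanged.
      templates.render.assert_called_once_with("welcome", name=user.name)
      mailer.send.assert_called_once_with(
          to=user.email, body=templates.render.return_value
      )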


These tests also break encapsulation in many cases because they're not testing the interface contract, they're testing the implementation.


Juniors on one of the teams I work with only write this kind of test. It’s tiring, and I have to tell them to test the behaviour, not the implementation. And yet every time they do the same thing. Or rather, their AI IDE spits these out.


You beat me to it, and yep these are exactly it.

“Mock the world, then test your mocks.” After nearly two decades of doing this professionally, I'm simply not convinced these have any value at all.


> Everything it does can be done reasonable well with list comprehensions and objects that support type annotations and runtime type checking (if needed).

I see this take somewhat often, and usually with a similar lack of nuance. How do you come to this? In other cases where I've seen it, it's come from people who haven't worked in any context where performance or interoperability with the scientific computing ecosystem matters, which misses a massive part of the picture. I've struggled to get through to them before. Genuine question.
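
To make the performance point concrete, here's a rough comparison, assuming the library in question is NumPy (the parent doesn't name it, and the same point holds for pandas):

  import timeit
  import numpy as np

  xs = list(range(1_000_000))
  arr = np.arange(1_000_000, dtype=np.float64)

  # Pure Python: one interpreted multiplication per element.
  t_list = timeit.timeit(lambda: [x * 2.5 for x in xs], number=10)

  # NumPy: a single vectorised operation over a contiguous buffer, which is
  # also the format other scientific libraries expect to interoperate with.
  t_numpy = timeit.timeit(lambda: arr * 2.5, number=10)

  print(f"list comprehension: {t_list:.3f}s, numpy: {t_numpy:.3f}s")

On typical hardware the vectorised version comes out one to two orders of magnitude faster; that gap, plus the interoperability, is the nuance that usually gets missed.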


It does largely avoid the issue if you configure it to allow only specific environments AND you require reviews before pushing/merging to branches in that environment.

https://docs.pypi.org/trusted-publishers/adding-a-publisher/

Publishing a malicious version would then require a full merge, which is a fairly high bar.

AWS allows something similar.


As we're seeing, properly configuring GitHub Actions is rather hard. By default, force pushes are allowed on any branch.


Yes, and anyone who knows anything about software dev knows that the first thing you should do with an important repo is set up branch protections to disallow that, require reviews, etc. Basic CI/CD.

This incident reflects extremely poorly on PostHog because it demonstrates a lack of thought about security beyond the surface level. It tells us that any dev at PostHog has access at any time to publish packages, without review (because we know the secret to do this is stored as a plain GHA secret, which can be read from any GHA run, and those runs presumably happen on any internal dev's PR). The most charitable interpretation is that they consciously justify this because it reduces friction, in which case I would say it demonstrates poor judgement, a bad balance.

A casual audit would have revealed this and suggested something like restricting the secret to a specific GHA environment and requiring reviews to push to that env. Or something like that.


Nobody understands GitHub. I guess someone at Microsoft did, but they probably got fired at some point.

You can't really fault people for this.

It's literally the default settings.


Along with a bunch of limitations that make it useless for anything but trivial use cases: https://docs.claude.com/en/docs/build-with-claude/structured...

I've found structured output APIs to be a pain across various LLMs. Now I just ask for JSON output and pick it out between the first and last curly braces. If validation fails, I retry with details about why it was invalid. This works very reliably for complex schemas, and it works across all LLMs without having to think about their individual limitations.

And then you can add complex pydantic validators (or whatever, I use pydantic) with super helpful error messages to be fed back into the model on retry. Powerful pattern.
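
A minimal sketch of what I mean, assuming pydantic v2 and a hypothetical schema; call_llm stands in for whatever client you use:

  import json
  from pydantic import BaseModel, ValidationError, field_validator

  class Invoice(BaseModel):  # hypothetical schema
      vendor: str
      total_usd: float

      @field_validator("total_usd")
      @classmethod
      def total_positive(cls, v: float) -> float:
          if v <= 0:
              # This message goes straight back to the model on retry.
              raise ValueError("total_usd must be positive; re-read the invoice total")
          return v

  def extract(prompt: str, call_llm, max_retries: int = 3) -> Invoice:
      """Ask for JSON, slice between the first '{' and last '}', validate, retry."""
      for _ in range(max_retries):
          text = call_llm(prompt)
          raw = text[text.find("{"): text.rfind("}") + 1]
          try:
              return Invoice.model_validate(json.loads(raw))
          except (json.JSONDecodeError, ValidationError) as err:
              prompt = (
                  f"{prompt}\n\nYour previous output was invalid:\n{err}\n"
                  "Return corrected JSON only."
              )
      raise RuntimeError("LLM output failed validation after retries")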


Yeah, the pattern of "kick the error message back to the LLM" is powerful. Even more so with all the newer AIs trained for programming tasks.


Big missing piece - what was the impact of the degraded quality?

Was it 1% worse / unnoticeable? Did it become useless? The engineering is interesting, but I'd like to see it tied to actual impact.


Significant: check any Claude-related thread here over the last month, or the Claude Code subreddit. Anecdotally, the degradation has been so bad that I had to downgrade to a month-old version, which has helped a lot. I think part of the problem lies there as well (in Claude Code itself).


We operate a SaaS where a common step is inputting rates of widgets in $/widget, $/widget/day, $/1kwidgets, etc. These are incredibly tedious and error-prone to enter. And usually the source of these rates is an invoice which presents them in ambiguous ways, e.g. rows with "quantity" and "charge" from which you have to back-calculate the rate. And these invoices are formatted in all different ways.
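
The back-calculation itself is just division plus unit normalisation; a toy sketch with made-up units (the hard part is the LLM extraction and review flow, not this arithmetic):

  from decimal import Decimal

  # Made-up example units: how many base widgets one priced unit covers.
  UNIT_SCALE = {
      "$/widget": Decimal(1),
      "$/1kwidgets": Decimal(1000),
  }

  def back_calculate_rate(charge: Decimal, quantity: Decimal, unit: str) -> Decimal:
      """Recover a per-widget rate from an invoice row's total charge and quantity."""
      if quantity == 0:
          raise ValueError("quantity must be non-zero to back-calculate a rate")
      # Assumes quantity is expressed in the priced unit; normalise to per-widget.
      return (charge / quantity) / UNIT_SCALE[unit]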

We offer a feature to upload the invoice and we pull out all the rates for you. Uses LLMs under the hood. Fundamentally it's a "chatgpt wrapper" but there's a massive amount of work in tweaking the prompts based on evals, splitting things up into multiple calls, etc.

And it works great! Niche software, but for power users we're saving them tens of minutes of monotonous work per day and in all likelihood entering things more accurately. This complements the manual entry process, with full ability to review the results. Accuracy is around 98-99 percent.


I gave it a shot just now with a fairly simple refactor. +19 lines, -9 lines, across two files. Totally ballsed it up. It defined only one of the two variables it was meant to and referred to the one it hadn't defined. I told it "hey you forgot the second variable" and then it went and added it in twice. It added comments (after I prompted it to) which were half-baked and ambiguous when read in context.

Never had anything like this with claude code.

I've used Gemini 2.5 Pro quite a lot and like most people I find it's very intelligent. I've bent over backwards to use Gemini 2.5 Pro in another piece of work because it's so good. I can only assume it's the Gemini CLI itself that's using the model poorly. Keen to try again in a month or two and see if this poor first impression is just a teething issue.

I told it that it did a pretty poor job and asked it why it thinks that is, and told it that I know it's pretty smart. It gave me a wall of text, so I asked for the short summary:

> My tools operate on raw text, not the code's structure, making my edits brittle and prone to error if the text patterns aren't perfect. I lack a persistent, holistic view of the code like an IDE provides, so I can lose track of changes during multi-step tasks. This led me to make simple mistakes like forgetting a calculation and duplicating code.


I noticed a significant degradation of Gemini's coding abilities in the last couple of checkpoints of 2.5. The benchmarks say it should be better, but that doesn't jibe with my personal experience.


Oh interesting. I have yet to try it. I love Gemini 2.5 Pro, so I expect the same here...but if not, wow. That would be a big whoops on their part.

