> The argument I'd dismiss isn't the improvement, it's that there's a whole load of sudden economic factors, or use cases, that have been unlocked in the last 6 months because of the improvements in LLMs.
Yes, I agree in principle here, at least in some cases: there are certainly problems that LLMs are now better at but that still don't reach the critical reliability threshold where you can say "it can do this". E.g. hallucinations, handling long context well (it's still best practice to reset the context window frequently), long-running tasks, etc.
> That's kind of a fuzzier point, and a hard one to know until we all have hindsight. But I think OP is right that people have been claiming "LLMs are fundamentally in a different category to where they were 6 months ago" for the last 2 years - and as yet, none of those big improvements have yet unlocked a whole new category of use cases for LLMs.
This is where I disagree (but again you are absolutely right for certain classes of capabilities and problems).
- Claude Code did not exist until 2025
- We have gone from people using coding agents for maybe ~10% of their workflow to more like 90-100%, pretty typically, i.e. from code completion --> a reasonably good SWE (with caveats and pain points I know all too well). That's a big step change in what you can actually do; it's not that we're still doing only code completion and it's marginally better.
- Long-horizon task success rates have now gotten good enough to enable the above (a good SWE) for things like refactors and complicated debugging with competing hypotheses, looping attempts until success (see the sketch after this list)
- We have nascent UI agents now; they are fragile, but they will likely follow a similar path to coding agents, which opens up yet another universe of things you can only do through a UI
- Enterprise voice agents (e.g. for frontline support) now have a low enough bounce rate that you can actually deploy them
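To make concrete what I mean by "looping attempts until success": the control loop is roughly the sketch below. This is a minimal, hypothetical harness (the function names and the pytest invocation are my own illustration, not any particular product's internals): the model proposes a patch, the harness runs the tests, and the failure output gets fed back in until the tests pass or an attempt budget runs out.

```python
# Minimal sketch of a "loop attempts until success" coding-agent harness.
# Hypothetical illustration: propose_patch is a stand-in for a real model
# call plus patch application; run_tests just shells out to pytest.
import subprocess
from typing import Callable

def run_tests() -> tuple[bool, str]:
    """Run the test suite; return (passed, combined output)."""
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def agent_loop(
    propose_patch: Callable[[str, str], None],  # (task, feedback) -> edits the working tree
    task: str,
    max_attempts: int = 10,
) -> bool:
    feedback = ""
    for _ in range(max_attempts):
        propose_patch(task, feedback)   # model proposes and applies a change
        passed, output = run_tests()    # harness verifies it
        if passed:
            return True                 # done: tests pass
        feedback = output               # otherwise retry with the failure log
    return False                        # attempt budget exhausted
```

The point is that the harness only needs a reliable verifier (the tests); once the model's per-attempt success rate is high enough, this loop converges instead of thrashing.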
So we've gone from "this looks promising" to production deployment and very serious usage. This may be, as you say, "the same capabilities just getting gradually better", but at some point that becomes a step change. Above a certain failure rate (which may be hard to pin down explicitly) it's not tolerable to deploy; as evidenced by adoption alone, we've crossed that threshold, especially for coding agents. Even Sonnet 4 -> Opus 4.5 has, for me personally (beyond just benchmark numbers), made full project loops possible in a way that Sonnet 4 would have convinced you it could do and then wasted two whole days of your time banging your head against the wall. The same failure mode still exists for Opus 4.5, but only on much larger tasks.
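To illustrate why "gradually better" eventually reads as a step change (made-up numbers, and a toy model that assumes independent steps): per-step reliability compounds over the length of a task, so a modest per-step improvement can move a 50-step task from "almost never finishes" to "usually finishes", which is exactly the deployability threshold I mean.

```python
# Toy model with made-up numbers: whole-task success under independent steps.
# A small per-step improvement crosses a "worth deploying" threshold.
def task_success(per_step: float, steps: int) -> float:
    """Probability the whole task succeeds if every step must succeed."""
    return per_step ** steps

for per_step in (0.95, 0.98, 0.99, 0.995):
    print(f"per-step {per_step:.3f}: "
          f"20 steps -> {task_success(per_step, 20):.0%}, "
          f"50 steps -> {task_success(per_step, 50):.0%}")

# per-step 0.950: 20 steps -> 36%, 50 steps -> 8%
# per-step 0.980: 20 steps -> 67%, 50 steps -> 36%
# per-step 0.990: 20 steps -> 82%, 50 steps -> 61%
# per-step 0.995: 20 steps -> 90%, 50 steps -> 78%
```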
> To be honest, it's a very tricky thing to weight into, because the claims being made around LLMs are very varied from "we're 2 months away from all disease being solved" to "LLMs are basically just a bit better than old school Markov chains". I'd argue that clearly neither of those are true, but it's hard to orient stuff when both those sides are being claimed at the same time.
Precisely. Lots and lots of hyperbole, some of it with varying degrees of underlying truth. But I would say: the underlying reality is reasonably easy to follow with hard numbers if you look for them. Epoch.ai is one of my favorite sources for industry analysis, and Dwarkesh Patel is a true gift to the industry. Benchmarks are really quite terrible and shaky, so I don't fault people for "checking the vibes"; Simon Willison's pelican task, for example, is exactly the sort of thing that's both fun and important!