> It programs with the skill of the median programmer on GitHub
This is a common intuition but it's provably false.
The fact that LLMs are trained on a corpus does not mean their output represents the median skill level of the corpus.
Eighteen months ago, GPT-4 was outperforming 85% of human participants in coding contests. And people who participate in coding contests are already well above the median skill level on GitHub.
And capability has gone way up in the last 18 months.
The best argument I've yet heard against the effectiveness of AI tools for SW dev is the absence of an explosion of shovelware over the past 1-2 years.
Basically, if the tools are even half as good as some proponents claim, wouldn't you expect at least a significant increase in simple games on Steam or apps in app stores over that time frame? But we're not seeing that.
Interesting approach. I can think of one more explanation the author didn't consider: what if software development time wasn't the bottleneck to what he analyzed? The chart for Google Play app submissions, for example, goes down because Google made it much more difficult to publish apps on their store in ways unrelated to software quality. In that case, it wouldn't matter whether AI tools could write a billion production-ready apps, because the limiting factor is Google's submission requirements.
There are other charts besides Google Play. Particularly insightful is the Steam chart, since Steam is already full of shovelware and, in my experience, many developers wish they were making games but the pay is bad.
The GitHub repos chart is pretty interesting too, but it could be that people just aren't committing this stuff. Showing zero increase is unexpected though.
I've had this same thought for some time. There should have been an explosion in startups, new products from established companies, new apps by the dozen every day. If LLMs can now reliably turn an idea into an application, where are they?
The argument against this is that shovelware has a distinctly different distribution model now.
App stores have quality hurdles that didn't exist in the diskette days. The types of people making low-quality software now can self-publish (and in fact do, often), but if they aren't where the attention is, they get drowned out by the established big dogs or by the ever-shifting firehose of our social zeitgeist.
Anyone who has been on Reddit this year in any software adjacent sub has seen hundreds (at minimum) of posts about “feedback on my app” or slop posts doing a god awful job of digging for market insights on pain points.
The core problem with this guy’s argument is that he’s looking in the wrong places - where a SWE would distribute their stuff, not a normie - and then drawing the wrong conclusions. And I am telling you, normies are out there, right now, upchucking some of the sloppiest of slop software you could ever imagine with wanton abandon.
Or even solving problems that businesses need to solve, generally speaking.
This complete misunderstanding of what software engineering even is is a major reason so many engineers are fed up with the clueless leaders foisting AI tools upon their orgs, leaders who apparently lack the critical reasoning skills to distinguish marketing speak from reality.
I don't think this disproves my claim, for several reasons.
First, I don't know where those human participants came from, but if you pick people off the street or from a college campus, they aren't going to be the world's best programmers. On the other hand, GitHub users are on average more skilled than the average CS student. Even students and beginners who use GitHub usually don't have much code there. If the LLMs are weighted to treat every line of code about the same, they'd pick up more lines of code from prolific developers (who are often more experienced) than they would from beginners.
Also, in a coding contest you're under time pressure. Even when your code works, it's often ugly and thrown together. On GitHub, the only code I check in is code that solves whatever problem I set out to solve. I suspect everyone writes better code on GitHub than they do in programming competitions. And I suspect that if you gave the competitors functionally unlimited time, many more of them would outperform GPT-4.
Programming contests also usually require that you write a fully self-contained program that has been very well specified. The program usually doesn't need any error handling, or to be maintained. (And if it does need error handling, the cases are all fully specified in the problem description.) Relatively speaking, LLMs are pretty good at this kind of problem - where I want some throwaway code that'll work today and get deleted tomorrow.
But most software I write isn't like that. LLMs struggle to write maintainable software in large projects. Most problems aren't so well specified. And for most code, you end up spending more effort maintaining it over its lifetime than it takes to write it in the first place. ChatGPT usually writes code that is a headache to maintain. It doesn't write or use local utility functions. It doesn't factor its code well. The code is often overly verbose, and often very poorly optimized. Or it contains quite obvious bugs for unexpected input - like overflow errors or boundary conditions. And the code it produces very rarely handles errors correctly. None of these problems really matter in programming competitions, but they matter a lot more when writing real software. They make LLMs much less useful at work.
> The fact that LLMs are trained on a corpus does not mean their output represents the median skill level of the corpus.
It does, by default. Try asking ChatGPT to implement quicksort in JavaScript, the result will be dogshit. Of course it can do better if you guide it, but that implies you recognize dogshit, or at least that you use some sort of prompting technique that will veer it off the beaten path.
I asked the free version of ChatGPT to implement quicksort in JS. I can't really see much wrong with it, but maybe I'm missing something? (Ugh, I just can't get HN to format code right... pastebin here: https://pastebin.com/tjaibW1x)
----
function quickSortInPlace(arr, left = 0, right = arr.length - 1) {
  if (left < right) {
    const pivotIndex = partition(arr, left, right);
    quickSortInPlace(arr, left, pivotIndex - 1);
    quickSortInPlace(arr, pivotIndex + 1, right);
  }
  return arr;
}

function partition(arr, left, right) {
  const pivot = arr[right];
  let i = left;
  for (let j = left; j < right; j++) {
    if (arr[j] < pivot) {
      [arr[i], arr[j]] = [arr[j], arr[i]];
      i++;
    }
  }
  [arr[i], arr[right]] = [arr[right], arr[i]]; // Move pivot into place
  return i;
}
This is exactly the level of code I've come to expect from ChatGPT. It's about the level of code I'd want from a smart CS student. But I'd hope to never use this in production:
- It always uses the last item as the pivot, which gives it pathological O(n^2) performance if the list is already sorted. Passing an already-sorted list to a sort function is a very common case. Good quicksort implementations use a random pivot, or at least the middle element, so re-sorting lists is fast.
- If you pass already-sorted data, the recursive calls to quickSortInPlace take up stack space proportional to the size of the array. So if you pass a large sorted array, not only does the function take O(n^2) time, it might also generate a stack overflow and crash.
- This code: ... = [arr[j], arr[i]]; creates an array and immediately destructures it. This is - or at least used to be - quite slow. I'd avoid doing that in the body of quicksort's inner loop.
- There's no way to pass a custom comparator, which is essential in real code. (A sketch addressing the pivot and comparator points follows this list.)
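Purely as a sketch of what I mean (my own code, not ChatGPT's; the helper name partitionCmp is mine): the same Lomuto-style partition, but with a random pivot, a comparator parameter, and recursion only into the smaller half so stack depth stays logarithmic. It still uses the destructuring swap, so it doesn't address that point.

function quickSort(arr, cmp = (a, b) => a - b, left = 0, right = arr.length - 1) {
  while (left < right) {
    // Pick a random pivot and move it to the end so the partition scheme stays the same.
    const r = left + Math.floor(Math.random() * (right - left + 1));
    [arr[r], arr[right]] = [arr[right], arr[r]];
    const p = partitionCmp(arr, cmp, left, right);
    // Recurse into the smaller half, loop on the larger one, so stack depth stays O(log n).
    if (p - left < right - p) {
      quickSort(arr, cmp, left, p - 1);
      left = p + 1;
    } else {
      quickSort(arr, cmp, p + 1, right);
      right = p - 1;
    }
  }
  return arr;
}

function partitionCmp(arr, cmp, left, right) {
  const pivot = arr[right];
  let i = left;
  for (let j = left; j < right; j++) {
    if (cmp(arr[j], pivot) < 0) {
      [arr[i], arr[j]] = [arr[j], arr[i]];
      i++;
    }
  }
  [arr[i], arr[right]] = [arr[right], arr[i]]; // Move pivot into place
  return i;
}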
I just tried in Firefox:
// Sort an array of 1 million sorted elements
arr = Array(1e6).fill(0).map((_, i) => i)
console.time('x')
quickSortInPlace(arr)
console.timeEnd('x')
My computer ran for about a minute, then the JavaScript virtual machine crashed:
Uncaught InternalError: too much recursion
This is about the quality of quicksort implementation I'd expect to see in a CS class, or in a random package in npm. If someone on my team committed this, I'd tell them to go rewrite it properly. (Or just use a standard library function - which wouldn't have these flaws.)
OK, you just added requirements the previous poster had not mentioned. Firstly, how often do you really need to sort a million elements in a browser anyway? I expect that sort of heavy lifting would usually be done on the server, where you'd also want to do things like paging.
Secondly, if a standard implementation was to be used, that's essentially a no-op: AI will reuse library functions where possible by default, and agents will even "npm install" them for you. This is purely the result of my prompt, which was simply "Can you write a QuickSort implementation in JS?"
In any case, to incorporate your feedback, I simply added "that needs to sort an array of a million elements and accepts a custom comparator?" to the initial prompt and reran in a new session, and this is what I got in less than 5 seconds. It runs in about 160ms on Chrome:
How long would your team-mate have taken? What else would you change? If you have further requirements, seriously, you can just add those to the prompt and try it for yourself for free. I'd honestly be very curious to see where it fails.
However, this exchange is very illustrative: I feel like a lot of the negativity is because people expect AI to read their minds and then hold it against it when it doesn't.
> OK, you just added requirements the previous poster had not mentioned.
Lol of course! The real requirements for a piece of software are never specified in full ahead of time. Figuring out the spec is half the job.
> Firstly, how often do you really need to sort a million elements in a browser anyway? I expect that sort of heavy lifting would usually be done on the server
Who said anything about the browser? I run JavaScript on the server all the time.
Don't defend these bugs. 1 million items just isn't very many for a sort function. On my computer, the built-in JavaScript sort can sort 1 million already-sorted items in 9ms. I'd expect any competent quicksort implementation to do something similar. Hanging for a minute and then crashing is a bug.
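Roughly what I timed, for reference (exact numbers will obviously vary by machine and JS engine):

// Same shape as the earlier test, but using the built-in sort on already-sorted input.
const data = Array.from({ length: 1e6 }, (_, i) => i); // 1 million already-sorted numbers
console.time('builtin sort');
data.sort((a, b) => a - b);
console.timeEnd('builtin sort'); // ~9ms here; the point is it doesn't hang or crash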
If you want a use case, consider the very common case of sorting user-supplied data. If I can send a JSON payload to your server and make it hang for 1 minute then crash, you've got a problem.
> If you have further requirements, seriously, you can just add those to the prompt and try it for yourself for free. [..] How long would your team-mate have taken?
We've gotta compare like for like here. How long does it take to debug code like this when an AI generates it? It took me about 25 minutes to discover and verify those problems, and that was careful work. Then you reprompted it, and then you tested the new code to see if it fixed the problems. How long did that take, all added together? We also haven't tested the new code for correctness or checked whether it has new bugs. Given it's a complete rewrite, there's a good chance ChatGPT introduced new issues. I've also had plenty of instances where I've spotted a problem and ChatGPT apologises, then completely fails to fix the problem I pointed out. Especially lifetime issues in Rust - it's really bad at those!
The question is this: is this back-and-forth process faster or slower than programming quicksort by hand? I'm really not sure. Once we've reviewed and tested this code, and fixed any other problems in it, we're probably looking at about an hour of work all up. I could probably implement quicksort at a similar quality in a similar amount of time. I find writing code is usually less stressful than reviewing code, because mistakes while programming are usually obvious, but mistakes while reviewing are invisible. Neither you nor anyone else in this thread spotted the pathological behavior this implementation had with sorted data. Finding problems like that by just looking is hard.
Quicksort is also the best case for LLMs. It's a well-understood, well-specified problem with a simple, well-known solution. There isn't any existing code it needs to integrate with. But those aren't the sorts of problems I want ChatGPT's help solving. If I could just use a library, I'm already doing that. I want ChatGPT to solve problems it's probably never seen before, with all the context of the problem I'm trying to solve, fitting in with all the code we've already written. It often takes 5-10 minutes of typing and copy+pasting just to write a suitable prompt. And in those cases, the code ChatGPT produces is often much, much worse.
> I feel like a lot of the negativity is because people expect AI to read their minds and then hold it against it when it doesn't.
Yes exactly! As a senior developer, my job is to solve the problem people actually have, not the problem they tell me about. So yes of course I want it to read my mind! Actually turning a clear spec into working software is the easy part. ChatGPT is increasingly good at doing the work of a junior developer. But as a senior dev / tech lead, I also need to figure out what problems we're even solving, and what the best approach is. ChatGPT doesn't help much when it comes to this kind of work.
(By the way, that is basically a perfect definition of the difference between a junior and senior developer. Junior devs are only responsible for taking a spec and turning it into working software. Senior devs are responsible for reading everyone's mind, and turning that into useful software.)
And don't get me wrong, I'm not anti-ChatGPT. I use it all the time, for all sorts of things. I'd love to use it more for production-grade code in large codebases if I could. But bugs like this matter. I don't want to spend my time babysitting ChatGPT. Programming is easy. By the time I have a clear specification in my head, it's often easier to just type out the code myself.
That's where we come in of course! Look into spec-driven development. You basically encourage the LLM to ask questions and hash out all these details.
> Who said anything about the browser?... Don't defend these bugs.
A problem of insufficient specification... didn't expect an HN comment to turn into an engineering exercise! :-) But these are the kinds of things you'd put in the spec.
> How long does it take to debug code like this when an AI generates it? It took me about 25 minutes to discover & verify those problems.
Here's where it gets interesting: before reviewing any code, I basically ask it to generate tests, which always all pass. Then I review the main code and the test code, at which point I usually add even more test cases (e.g. https://news.ycombinator.com/item?id=46143454). And because codegen is so cheap, I can even include performance tests (which, statistically speaking, nobody ever does)!
Here's a one-shot result of that approach (I really don't mean to take up more of your time, this is just so you can see what it is capable of):
https://pastebin.com/VFbW7AKi
While I do review the code (a habit -- I always review my own code before a PR), I review the tests more closely because, while boring, I find them a) much easier to review, and b) more confidence-inspiring than manual review of intricate logic.
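To make that concrete, here's a rough hand-written sketch of the shape those generated checks take, assuming the quickSortInPlace from upthread (the generated versions are more thorough than this):

// Correctness: compare against the built-in sort on random and already-sorted input.
function assertSortsCorrectly(input) {
  const expected = [...input].sort((a, b) => a - b);
  const actual = quickSortInPlace([...input]);
  if (JSON.stringify(actual) !== JSON.stringify(expected)) {
    throw new Error('sort mismatch');
  }
}

const randomInput = Array.from({ length: 1e5 }, () => Math.floor(Math.random() * 1e6));
const sortedInput = Array.from({ length: 1e5 }, (_, i) => i);
assertSortsCorrectly(randomInput);
assertSortsCorrectly(sortedInput);

// Performance: already-sorted input is the pathological case discussed upthread.
// Against the original implementation this is exactly the check that surfaces the
// problem (quadratic time or a stack overflow).
console.time('sorted input');
quickSortInPlace(sortedInput.slice());
console.timeEnd('sorted input');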
> I want chatgpt to solve problems its probably never seen before, with all the context of the problem I'm trying to solve, to fit in with all the code we've already written.
Totally, and again, this is where we come in! Still, it is a huge productivity booster even in uncommon contexts. E.g. I'm using it to do computer vision stuff (where I have no prior background!) with opencv.js, for a problem not well represented in the literature. It still does amazingly well... with the right context. Its initial suggestions were overindexed on the common case, but over many conversations it "understood" my use-case and now consistently gives appropriate suggestions. And because it's vision stuff, I can instantly verify results by sight.
Big caveat: success depends heavily on the use-case. I have had more mixed results in other cases, such as concurrency issues in an LD_PRELOAD library in C. One reason for the mixed sentiments we see.
> ChatGPT is increasingly good at doing the work of a junior developer.
Yes, and in fact, I've been rather pessimistic about the prospects of junior developers, a personal concern given I have a kid who wants to get into software engineering...
I'll end with a note that my workflow today is extremely different from before AI, and it took me many months of experimentation to figure out what worked for me. Most engineers simply haven't had the time to do so, which is another reason we see so many mixed sentiments. But I would strongly encourage everybody to invest the time and effort because this discipline is going to change drastically really quickly.
How "real" are iPhone photos? They're also computationally generated, not just the light that came through the lens.
Even without any other post-processing, iPhones generate gibberish text when attempting to sharpen blurry images, delete actual textures and replace them with smooth, smeared surfaces that look like watercolor or oil paintings, and combine data from multiple frames to give dogs five legs.
The fact that Gemini 3 is so far ahead of every other frontier model in math might be telling us something more general about the model itself.
It scored 23.4% on MathArena Apex, compared with 0.5% for Gemini 2.5 Pro, 1.6% for Claude Sonnet 4.5, and 1.0% for GPT-5.1.
This is not an incremental advance. It is a step change. This indicates a new discovery, not just more data or more compute.
To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.
You need to verify what you're doing, detect when you make a mistake, and backtrack to try a different approach.
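To be clear about what I mean, here's a purely illustrative sketch of generate-verify-backtrack search in general; nobody outside Google knows how Gemini actually implements its reasoning, and propose/verify/apply/isDone are stand-ins I made up:

// Illustrative only: generic propose/verify/backtrack search.
// The propose, verify, apply and isDone functions are caller-supplied stand-ins.
function solveWithBacktracking(state, { propose, verify, apply, isDone }, depth = 0) {
  if (isDone(state)) return state;          // verified solution found
  if (depth > 50) return null;              // give up past some step budget
  for (const step of propose(state)) {      // candidate next steps, best first
    if (!verify(state, step)) continue;     // reject steps that fail the checker
    const result = solveWithBacktracking(apply(state, step),
                                         { propose, verify, apply, isDone },
                                         depth + 1);
    if (result !== null) return result;     // success somewhere down this branch
  }
  return null;                              // dead end: the caller tries another step
}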
The SimpleQA benchmark is another datapoint that we're probably looking at a research breakthrough, not just more data or more compute. Gemini 3 Pro achieved more than double the reliability of GPT-5.1 (72.1% vs. 34.9%).
This isn't an incremental gain, it's a step-change leap in reducing hallucinations.
And it's exactly what you'd expect to see if there's an underlying shift from probabilistic token prediction to verified search, with better error detection and backtracking when it finds an error.
That could explain the breakout performance on math, and reliability, and even operating graphical user interfaces (ScreenSpot-Pro at 72.7% vs. 3.5% for GPT-5.1).
I usually ask a simple question that ALL the models get wrong: list the mayors of my city [Londrina]. ALL the models (offline) get it wrong. And I mean all the models. The best I got was o3, I believe, saying it couldn't give a good answer for that and telling me to check the city website.
Gemini 3 somehow is able to give a list of mayors, including details on who got impeached, etc.
This should be a simple answer, because all the data is on Wikipedia, which the models are certainly trained on, but somehow most models don't manage to get it right, because... it's just an irrelevant city in a huge dataset.
But somehow, Gemini 3 did it.
Edit: Just asked "Cool places to visit in Londrina" (in Portuguese), and it was also 99% right, unlike other models, which just make stuff up. The only thing wrong: it mentioned sakuras by a lake... Maybe it confused them with Brazilian ipês, which are similar, and indeed the city is full of them.
Ha, I just did the same with my hometown (Guaiba, RS), a city 1/6th the size of Londrina, whose Wikipedia page in English hasn't been updated in years and still lists the wrong mayor (!).
Gemini 3 nailed it on the first try, included political affiliation, and added some context on who they competed against and beat in each of the last three elections. And I just built a fun application with AI Studio, and it worked on the first shot. Pretty impressive.
(disclaimer: Googler, but no affiliation with Gemini team)
Pure fact-based, niche questions like that aren't really the focus of most providers any more from what I've heard, since they can be solved more reliably by integrating search tools (and all providers now have search).
I wouldn't be surprised if the smallest models can answer fewer such (fact-only) questions over time offline as they distill/focus them more thoroughly on logic etc.
Funny, I just asked "Ask Brave", which uses a cheap LLM connected directly to its search engine, and it got it right without any issues.
It shows once again that for common searches, (indexed) data is king, and that's where I expect even a simple LLM connected directly to a huge indexed dataset to win against much more sophisticated LLMs that have to use agents for searching.
The one thing I got out of the MIT OpenCourseWare AI course by Patrick Winston was that all of AI could be framed as a problem of search. Interesting to see Demis echo that here.
It tells me that the benchmark is probably leaking into training data. And going to the benchmark site:
> Model was published after the competition date, making contamination possible.
Aside from eval on most of these benchmarks being stupid most of the time, these guys have every incentive to cheat - these aren't some academic AI labs, they have to justify hundreds of billions being spent/allocated in the market.
Actually trying the model on a few of my daily tasks and reading the reasoning traces all I'm seeing is same old, same old - Claude is still better at "getting" the problem.
> To succeed this well in math, you can't just do better probabilistic generation, you need verifiable search.
You say "probabilistic generation" like it's some kind of a limitation. What is exactly the limiting factor here? [(0.9999, "4"), (0.00001, "four"), ...] is a valid probability distribution. The sampler can be set to always choose "4" in such cases.
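Toy version of what I mean, using that hypothetical distribution:

// Greedy (argmax / temperature-0) decoding over a toy token distribution:
// "4" wins every time, even though other tokens have nonzero probability.
const dist = [
  { token: '4', p: 0.9999 },
  { token: 'four', p: 0.00001 },
  // ...remaining probability mass spread over other tokens
];
const pick = dist.reduce((best, cur) => (cur.p > best.p ? cur : best));
console.log(pick.token); // "4"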
I'll give you that the style is like an LLM's, but the thoughts seem a bit unlike one. I mean, the MathArena Apex results indicating a new discovery rather than just more data is definitely a hypothesis.
From my understanding, Google put online the largest RL cluster in the world not long ago. It's not surprising they do really well on things that are "easy" to RL, like math or SimpleQA.
I’ll take you at your word, sorry for the incorrect callout. Your comment format appeared malicious, so my response wasn’t an attempt at being “snarky”, just acting defensively. I like the HN Rules/Guidelines.
You mentioned "step change" twice. Maybe a once-over next time? My favorite Mark Twain quote is (very paraphrased): "My apologies, had I more time, I would have written a shorter letter."
This is something that is happening to me too, and frankly I'm a little concerned. English is not my first language, so I use AI for checking and writing many things. And I spend a lot of time with coding tools. And now I sometimes need to make a conscious effort to avoid mimicking some LLM patterns...
You seem very comfortable making unfounded claims. I don't think this is very constructive or adds much to the discussion. While we can debate the stylistic changes of the previous commenter, you seem to be discounting the rate at which the writing style of various LLMs has backpropagated into many people's brains.
I can sympathize with being mistakenly accused of using LLM output, but as a reader, the above format of "It's not X - it's Y" repeated multiple times for artificial dramatic emphasis, to make a pretty mundane point that could use a third of the length, grates on me like reading LinkedIn or marketing voice, whether it's AI or not (and it's almost always AI anyway).
I've seen fairly niche subreddits go from enjoyable and interesting to ruined by being clogged with LLM spam that sounds exactly like this so my tolerance for reading it is incredibly low, especially on HN, and I'll just dismiss it.
I probably lose the occasional legitimate original observation now and then, but in a world where our attention is being hijacked by zero-effort spam everywhere you look, I just don't have the time or energy to avoid that heuristic.
Also discounting the fact that people actually do talk like that. In fact, these days I have to modify my prose to be intentionally less LLM-like lest the reader thinks it's LLM output.
1) Models learn these patterns from common human usage. They are in the wild, and as such there will be people who use them naturally.
2) Now, given that models have for some reason made it their ubiquitous choice, it is also a phrasing that many more people are exposed to, every day.
Language is contagious. This phrasing is approaching herd levels, meaning models trained from up-to-the-moment web content will start to see it as less distinctly salient. Eventually, there will be some other high-signal novel phrase with high salience, and the attention heads will latch on to it from the surrounding context, and then that will be the new AI shibboleth.
It's just how language works. We see it between generations: our kids pick up new lingo, and then it stops being in-group for them when it spreads too far... Skibidi, 6 7, etc.
It's just how language works, and a generation ago the internet put it on steroids. Now? Even faster.
The jump in ARC-AGI and MathArena suggests Google has solved the data scarcity problem for reasoning, maybe with synthetic data self-play??
This was the primary bottleneck preventing models from tackling novel scientific problems they haven't seen before.
If Gemini 3 Pro has transcended "reading the internet" (knowledge saturation), and made huge progress in "thinking about the internet" (reasoning scaling), then this is a really big deal.
The thing about bubbles is that it's devilishly difficult to tell the difference between a total sham and a genuine regime shift while it's happening because the hype level is similar for both.
Sure, it certainly could be difficult. But this is often because of information asymmetry — the purveyors are making big claims about the dragon in their garage. [1]
If your intuition told you that there ought to be some evidence for grand claims, you would be right to be suspicious of the continuing absence of suitable evidence.
93% of American drivers think they're better drivers than the median driver [0].
This overconfidence causes humans to take unnecessary risks that not only endanger themselves, but everyone else on the road.
After taking several dozen Waymo rides and watching them negotiate complex and ambiguous driving scenarios, including many situations where overconfident drivers are driving dangerously, I realize that Waymo is a far better driver than I am.
Waymos don't just prevent a large percentage of accidents by making fewer mistakes than human drivers; they also avoid a lot of accidents caused by other distracted and careless human drivers.
Now when I have to drive a car myself, my goal is to try to drive as much like a Waymo as I can.
Speeding feels like "I'm more important than everyone else and the safety of others and rules don't apply to me" personally. It's one thing to match the speed of traffic and avoid being a nuisance (that I'm fine with) - a lot of people just think they're the main character and everyone else is just in their way.
It's a problem that goes way beyond driving, sadly.
Eh this doesn't mean much. The quality of drivers is pretty bimodal.
You have the group that's really bad and does things like drive drunk, weave in and out of traffic, do their makeup and so on.
The other group generally pays attention and tries to drive safely. This is larger than the first group and realistically there's not all that much difference within the group.
If you're in group two you will think you're above average because the comparison is to the crap drivers in group one.
If a science experiment that works and is transformational can be worth a trillion dollars, how much is it worth if it has a 5% chance of being transformational?