The key insight is that LLMs can 'reason' when they've seen similar solutions in training data, but this breaks down on truly novel problems. This isn't reasoning exactly, but close enough to be useful in many circumstances. Repeating solutions on demand can be handy, just like repeating facts on demand is handy. Marcus gets this right technically but focuses too much on emotional arguments rather than clear explanation.
If that were the case, it would already be great, but these tools can’t even do that reliably. They frequently make mistakes when repeating the very solutions available everywhere during their “reasoning” process, and they fabricate plausible hallucinations which you then have to inspect carefully to catch.
That alone would be revolutionary - but it's still aspirational for now. The other day Gemini mixed up left and right on me in response to a basic textbook problem.
I’m so tired of hearing this repeated, like the whole “LLMs are _just_ parrots” thing.
It’s patently obvious to me that LLMs can reason and solve novel problems not in their training data. You can test this out in so many ways, and there are so many examples out there.
______________
Edit for responders, instead of replying to each:
We obviously have to define what we mean by "reasoning" and "solving novel problems". From my point of view, reasoning != general intelligence. I also consider reasoning to be a spectrum. Just because it cannot solve the hardest problem you can think of does not mean it cannot reason at all. Do note, I think LLMs are generally pretty bad at reasoning. But I disagree with the point that LLMs cannot reason at all or never solve any novel problems.
In terms of some backing points/examples:
1) Next token prediction can itself be argued to be a task that requires reasoning
2) You can construct a variety of language translation tasks, with completely made-up languages, that LLMs can complete successfully (see the sketch after this list). There's tons of research on in-context learning and zero-shot performance.
3) Even though they start to fail at some complexity threshold, it's incredibly impressive that LLMs can solve any of these difficult puzzles at all! GPT-3.5 couldn't do that. We're making incremental progress in terms of reasoning. Bigger, smarter models get better at zero-shot tasks, and I think that correlates with reasoning.
4) Regarding point 4 ("Bigger models might do better"): I think this is very dismissive. The paper itself shows a huge variance in the performance of different models. For example, in figure 8, we see Claude 3.7 significantly outperforming DeepSeek and maintaining stable solutions for a much longer sequence length. Figure 5 also shows that better models and more tokens improve performance on "medium" difficulty problems. Just because it cannot solve the "hard" problems does not mean it cannot reason at all, nor does it necessarily mean it will never get there. Many people were saying we'd never be able to solve problems like the medium ones a few years ago, but now the goalposts have just shifted.
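To make point 2 concrete, here's a minimal sketch of the kind of made-up-language probe I mean. The toy vocabulary, the prompt format, and the expected answer are all invented on the spot for illustration; plug the printed prompt into whatever model you actually use.

```python
# Sketch of a made-up-language translation probe (illustrative only).
# The toy vocabulary below is invented on the spot, so it cannot be in
# any training set; the model has to infer the mapping in-context.

TOY_VOCAB = {
    "blorn": "river",
    "skelt": "cold",
    "vram": "runs",
    "quopa": "quickly",
    "mizzen": "the",
}

def build_prompt(vocab: dict[str, str], test_sentence: list[str]) -> str:
    """Show the model word pairs in-context, then ask it to translate
    a sentence it has never seen assembled this way."""
    examples = "\n".join(f"{fake} = {real}" for fake, real in vocab.items())
    return (
        "Here is a dictionary for an invented language:\n"
        f"{examples}\n\n"
        f"Translate into English: {' '.join(test_sentence)}"
    )

if __name__ == "__main__":
    test = ["mizzen", "skelt", "blorn", "vram", "quopa"]
    print(build_prompt(TOY_VOCAB, test))
    # expected: "the cold river runs quickly" -- recoverable purely from
    # the in-context mapping, not from anything memorized during training
```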
> It’s patently obvious that LLMs can reason and solve novel problems not in their training data.
Would you care to tell us more?
« It’s patently obvious » is not really an argument; I could just as well say that everyone knows LLMs can’t reason or think (in the way we living beings do).
I'm working on a new API. I asked the LLM to read the spec and write tests for it. It does. I don't know if that's "reasoning". I know that no tests exist for this API. I know that the internet is not full of training data for this API because it's a new API. It's also not a CRUD API or some other API that's got a common pattern. And yet, with a very short prompt, Gemini Code Assist wrote valid tests for a new feature.
It certainly feels like more than fancy auto-complete. That is not to say I haven't run into issues, but I'm still often shocked at how far it gets. And that's today. I have no idea what to expect in 6 months, 12, 2 years, 4, etc.
> I know that the internet is not full of training data for this API because it's a new API.
1) Are you sure? That's a bold guess. It was also a really stupid assumption made by the HumanEval benchmark authors: that if you "hand write" simple LeetCode-style questions, then you can train on all of GitHub. Go ahead, go look at what kinds of questions are in that benchmark...
2) LLMs aren't discrete databases. They are curve-fitting functions. Compression. They work in very, very high dimensions. They can generate new data, but that is limited. People mostly aren't saying that LLMs can't create novel things, but that they can't reason in the way that humans can. Humans can't memorize half of what an LLM can, yet are able to figure out lots of crazy shit.
I just made up this scenario and these words, so I'm sure it wasn't in the training data.
Kwomps can zark but they can't plimf. Ghirns are a lot like Kwomps, but better zarkers. Plyzers have the skills the Ghirns lack.
Quoning, a type of plimfing, was developed in 3985. Zhuning was developed 100 years earlier.
I have an erork that needs to be plimfed. Choose one group and one method to do it.
> Use Plyzers and do a Quoning procedure on your erork.
If that doesn't count as reasoning or generalization, I don't know what does.
It’s just a truth table. I had a hunch that it was a truth table, and when I asked the AI how it figured it out, it confirmed it built a truth table (sketched below). Still impressive either way.
* Goal: Pick (Group ∧ Method) such that Group can plimf ∧ Method is a type of plimfing
* Only one group (Plyzers) passes the "can plimf" test
* Only one method (Quoning) is definitely plimfing
Therefore, the only valid (Group ∧ Method) combo is: → (Plyzer ∧ Quoning)
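For what it's worth, the whole thing fits in a few lines once you write the constraints down. This is just my reconstruction of that truth-table framing, not a claim about what the model computed internally:

```python
# Reconstruction of the "truth table" reading of the Kwomp/Plyzer puzzle.
# The boolean assignments encode my reading of the made-up facts above.

groups = {
    "Kwomps":  {"can_plimf": False},   # "can zark but they can't plimf"
    "Ghirns":  {"can_plimf": False},   # "a lot like Kwomps, but better zarkers"
    "Plyzers": {"can_plimf": True},    # "have the skills the Ghirns lack"
}
methods = {
    "Quoning": {"is_plimfing": True},  # "a type of plimfing"
    "Zhuning": {"is_plimfing": False}, # only dated, never described as plimfing
}

# Goal: pick (group, method) such that the group can plimf AND the method is plimfing.
valid = [
    (g, m)
    for g, gp in groups.items() if gp["can_plimf"]
    for m, mp in methods.items() if mp["is_plimfing"]
]
print(valid)  # [('Plyzers', 'Quoning')]
```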
If anything you'd think that the neurosymbolic people would be pleased that the LLMs do in fact reason by learning circuits representing boolean logic and truth tables. In a way they were right, it's just that starting with logic and then feeding in knowledge grounded in that logic (like Cyc) seems less scalable than feeding in knowledge and letting the model infer the underlying logic.
Right, that’s my point. LLMs are doing pattern abstraction and in this way can mimic logic. They are not trained explicitly to do just truth tables, even though truth tables are fundamental.
So far they cannot even answer questions that are straight-up fact-checking, search-engine-like queries. Reasoning means they would be able to work through a problem and generate a proof the way a student might.
You're mistaking pattern matching and the modeling of relationships in latent space for genuine reasoning.
I don't know what you're working on, but while I'm not curing cancer, I am solving problems that aren't in the training data and can't be found on Google. Just a few days ago, Gemini 2.5 Pro literally told me it didn’t know what to do and asked me for help. The other models hallucinated incorrect answers. I solved the problem in 15 minutes.
If you're working on yet another CRUD app, and you've never implemented transformers yourself or understood how they work internally, then I understand why LLMs might seem like magic to you.
It's definitely not true in any meaningful sense. There are plenty of us practitioners in software engineering wishing it was true, because if it was, we'd all have genius interns working for us on Mac Studios at home.
It's not true. It's plainly not true. Go have any of these models, paid or local, try to build you novel solutions to hard, existing problems, despite being, in some cases, trained on literally the entire compendium of open knowledge in not just one but multiple adjacent fields. Not to mention the fact that being able to abstract general knowledge would mean it would be able to reason.
They. Cannot. Do it.
I have no idea what you people are talking about because you cannot be working on anything with real substance that hasn't been perfectly line fit to your abundantly worked on problems, but no, these models are obviously not reasoning.
I built a digital employee and gave it menial tasks, the same kind that current cloud vendors claim their paid AI employees can handle, and these things are stupider than fresh college grads.
>> 1) Next token prediction can itself be argued to be a task that requires reasoning
That is wishful thinking popularised by Ilya Sutskever and Greg Brockman of OpenAI to "explain" why LLMs are a different class of system than smaller language models or other predictive models.
I'm sorry to say that (John Mearsheimer voice) that's simply not a serious argument. Take a multivariate regression model that predicts blood pressure from demographic data (age, sex, weight, etc). You can train a pretty accurate model for that kind of task if you have enough data (a few thousand data points). Does that model need to "reason" about human behaviour in order to be good at predicting BP? Nope. All it needs is a lot of data. That's how statistics works. So why is it different for a predictive model of BP and different for a next-token prediction model? The only answer seems to be "because language is magickal and special", but without any attempt to explain why, in terms of sequence prediction, language is special. Unless the, er, reasoning is that humans can produce language, humans can reason, LLMs can produce language, therefore LLMs can reason; which obviously doesn't follow.
But I have to guess here, because neither Sutskever nor Brockman has ever tried to explain why next-token prediction needs reasoning (or, more precisely, "understanding", the term they have used).
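To make the BP analogy concrete, here's a minimal sketch with synthetic data standing in for the real thing; the coefficients and noise level are made up for illustration. A plain least-squares fit gets good at prediction without anything anyone would call reasoning:

```python
# Minimal sketch of the blood-pressure analogy: a plain least-squares fit
# on synthetic demographic data. Nothing here "reasons" about humans;
# it just fits coefficients to data.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

age    = rng.uniform(20, 80, n)
weight = rng.uniform(50, 120, n)
sex    = rng.integers(0, 2, n)          # 0/1 encoding, purely illustrative

# Synthetic "ground truth" relationship plus noise (made up for this sketch).
bp = 90 + 0.4 * age + 0.2 * weight + 3 * sex + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), age, weight, sex])
coef, *_ = np.linalg.lstsq(X, bp, rcond=None)

pred = X @ coef
print("mean absolute error:", np.mean(np.abs(pred - bp)))  # ~4 mmHg on this toy data
```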
> That is wishful thinking popularised by Ilya Sutskever
Ilya and Hinton have claimed even crazier things
> to understand next token prediction you must understand the causal reality
This is objectively false; it's been known in physics for centuries to be wrong. You can probably reason out a weaker case yourself: I'm sure you can make accurate predictions about some things without fully understanding them.
But the stronger version is the entire difficulty of physics and causal modeling: distinguishing a confounding variable is very, very hard. You can still make accurate predictions without access to the underlying causal graph.
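A toy simulation makes the point: with a hidden confounder Z driving both X and Y, a regression of Y on X predicts well even though X has no causal effect on Y at all. The numbers are made up for the sketch.

```python
# Toy illustration: accurate prediction without the causal graph.
# A hidden confounder Z drives both X and Y; X does not cause Y,
# yet a regression of Y on X predicts well. (Synthetic numbers, just a sketch.)
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

z = rng.normal(size=n)                 # unobserved confounder
x = 2 * z + rng.normal(0.0, 0.3, n)    # X caused by Z
y = 3 * z + rng.normal(0.0, 0.3, n)    # Y caused by Z, not by X

slope, intercept = np.polyfit(x, y, 1)
pred = slope * x + intercept
r2 = 1 - np.var(y - pred) / np.var(y)
print(f"R^2 of predicting Y from X: {r2:.3f}")  # high, despite no X -> Y causation
```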
Hinton and Sutskever are victims of their own success: they can say whatever they like and nobody dares criticise them, or tell them how they're wrong.
I recently watched a video of Sutskever speaking to some students; I'm not sure where, and I can't dig out the link now. To summarise, he told them that the human brain is a biological computer. He repeated this a couple of times, then said that this is why we can create a digital computer that can do everything a brain can.
This is the computational theory of mind, reduced to a pin-point with all context removed. Two seconds of thought suffice to show how that doesn't work: if a digital computer can do everything the brain can do, because the brain is a biological computer, then how come the brain can't do everything a digital computer can do? Is it possible that two machines can be both computers, and still not equivalent in every sense of the term? Nooooo!!! Biological computers!! AGI!!
Those guys really need to stop and think about what they're talking about before someone notices what they're saying and the entire field becomes a laughing stock.
> Two seconds of thought suffice to show how that doesn't work: if a digital computer can do everything the brain can do, because the brain is a biological computer, then how come the brain can't do everything a digital computer can do? Is it possible that two machines can be both computers, and still not equivalent in every sense of the term? Nooooo!!! Biological computers!! AGI!!
Another two seconds of thought would suffice to answer that: because you can freely change neither the hardware nor the software of the brain, the way you can with computers.
Obviously, Angry Birds on the phone can't do everything digital computers can do, but that doesn't mean a smartphone isn't a digital computer.
Another 2 seconds of thought might have told you only a magic genie can "freely" change hardware and software capability.
Humans have to work within whatever constraints accompany being physical things with physical bodies trying to invent software and hardware in the physical world.
I'm fine with calling the brain a computer. A computer is a very vague term. But yes, I agree that the conclusion does not necessarily follow. It's possible, just not necessary.
> Take a multivariate regression model that predicts blood pressure from demographic data (age, sex, weight, etc). You can train a pretty accurate model for that kind of task if you have enough data (a few thousand data points). Does that model need to "reason" about human behaviour in order to be good at predicting BP? Nope. All it needs is a lot of data. That's how statistics works. So why is it different for a predictive model of BP and different for a next-token prediction model?
For one, because the goal function for the latter is "predict output that makes sense to humans", in the fully broad, fully general sense of that statement.
It's not just one thing, like parse grocery lists, XOR write simple code, XOR write a story, XOR infer sentiment, XOR be a lossy cache for Wikipedia. It's all of them, separate or together, plus much more, plus correctly handling humor, sarcasm, surface-level errors (e.g. typos, naming), implied rules, shorthands, deep errors (think of a user being confused and using terminology wrong; LLMs can handle that fine), and an uncountable number of other things (because language is special, see below). It's quite obvious this is a different class of thing than a narrowly specialized model like a BP predictor.
And yes, language is special. Despite Chomsky's protestations to the contrary, it's not really formally structured; all the grammar and syntax and vocabulary is merely a classification of high-level patterns that tend to occur (though the invention of print and public education definitely strengthened them). Any experience with learning a language, or actually talking to other people, makes it obvious that grammar and vocabulary are neither necessary nor sufficient for communication. At the same time, though, once established, the particular choices become another dimension that packs meaning (as becomes apparent when e.g. pondering why some books or articles seem better than others).
Ultimately, language is not a set of easy patterns you can learn (or code symbolically!); it's a dance people do when communicating, whose structure is fluid and bound by the reasoning capabilities of humans. Being able to reason this way is required to communicate with real humans in real, generic scenarios. Now, this isn't proof LLMs can do it, but the degree to which they excel at this is at least a strong suggestion that they qualitatively could be.
I've done this exercise dozens of times because people keep saying it, but I can't find an example where this is true. I wish it was. I'd be solving world problems with novel solutions right now.
People make a common mistake by conflating "solving problems with novel surface features" with "reasoning outside training data." This is exactly the kind of binary thinking I mentioned earlier.
"Solving novel problems" does not mean "solving world problems that even humans are unable to solve", it simply means solving problems that are "novel" compared to what's in the training data.
Can you reason? Yes? Then why haven't you cured cancer? Let's not have double standards.
I think that "solving world problems with novel solutions" is a strawman for an ability to reason well. We cannot solve world problems with reasoning, because pure reasoning has no relation to reality. We lack data and models about the world to confirm and deny our hypotheses about the world. That is why the empirical sciences do experiments instead of sit in an armchair and mull all day.
They can't create anything novel and it's patently obvious if you understand how they're implemented.
But I'm just some anonymous guy on HN, so maybe this time I will just cite the opinion of the DeepMind CEO, who said in a recent interview with The Verge (available on YouTube) that LLMs based on transformers can't create anything truly novel.
Since when is reasoning synonymous with invention? All humans with a functioning brain can reason, but only a tiny fraction have or will ever invent anything.
Read what the OP said: "It’s patently obvious to me that LLMs can ... solve novel problems"; that is what I was replying to. I see everyone here is smarter than researchers at DeepMind, without any proof or credentials to back their claims.
"I don't think today's systems can invent, you know, do true invention, true creativity, hypothesize new scientific theories. They're extremely useful, they're impressive, but they have holes."
Demis Hassabis On The Future of Work in the Age of AI (@ 2:30 mark)
He doesn't say "that LLMs based on transformers can't create anything truly novel". Maybe he thinks that, maybe not, but what he says is that "today's systems" can't do that. He doesn't make any general statement about what transformer-based LLMs can or can't do; he's saying: we've interacted with these specific systems we have right now and they aren't creating genuinely novel things. That's a very different claim, with very different implications.
Again, for all I know maybe he does believe that transformer-based LLMs as such can't be truly creative. Maybe it's true, whether he believes it or not. But that interview doesn't say it.
That’s the opposite of reasoning though. AI bros want to make people believe LLMs are smart, but they’re not capable of intelligence and reasoning.
Reasoning means you can take on a problem you’ve never seen before and think of innovative ways to solve it.
An LLM can only replicate what is in its data; it can in no way think or guess or estimate what will likely be the best solution. It can only output a solution based on a probability calculation over how frequently it has seen this solution linked to this problem.
You're assuming we're saying LLMs can't reason. That's not what we're saying. They can execute reasoning-like processes when they've seen similar patterns, but this breaks down when true novel reasoning is required. Most people do the same thing. Some people can come up with novel solutions to new problems, but LLMs will choke. Here's an example:
Prompt: "Let's try a reasoning test. Estimate how many pianos there are at the bottom of the sea."
I tried this on three advanced AIs* and they all choked on it without further hints from me. Claude then said:
Roughly 3 million shipwrecks on ocean floors globally
Maybe 1 in 1000 ships historically carried a piano (passenger ships, luxury vessels)
So ~3,000 ships with pianos sunk
Average maybe 0.5 pianos per ship (not all passenger areas had them)
Estimate: ~1,500 pianos
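As a sanity check on the arithmetic (the input numbers are Claude's guesses, not verified facts), the estimate reproduces in a few lines:

```python
# Reproducing Claude's Fermi estimate; the inputs are the model's guesses,
# not verified facts.
shipwrecks          = 3_000_000   # "roughly 3 million shipwrecks on ocean floors"
piano_ship_fraction = 1 / 1000    # ships that historically carried a piano
pianos_per_ship     = 0.5         # not every such ship had one aboard when it sank

estimate = shipwrecks * piano_ship_fraction * pianos_per_ship
print(f"~{estimate:,.0f} pianos")  # ~1,500, matching the quoted answer
```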
GPT4o isn't considered an "advanced" LLM at this point. It doesn't use reasoning.
I gave your prompt to o3 pro, and this is what I got without any hints:
Historic shipwrecks (1850 → 1970)
• ~20 000 deep water wrecks recorded since the age of steam and steel
• 10 % were passenger or mail ships likely to carry a cabin class or saloon piano
• 1 piano per such vessel 20 000 × 10 % × 1 ≈ 2 000
Modern container losses (1970 → today)
• ~1 500 shipping containers lost at sea each year
• 1 in 2 000 containers carries a piano or electric piano
• Each piano container holds ≈ 5 units
• 50 year window 1 500 × 50 / 2 000 × 5 ≈ 190
Coastal disasters (hurricanes, tsunamis, floods)
• Major coastal disasters each decade destroy ~50 000 houses
• 1 house in 50 owns a piano
• 25 % of those pianos are swept far enough offshore to sink and remain (50 000 / 50) × 25 % × 5 decades ≈ 1 250
Add a little margin for isolated one offs (yachts, barges, deliberate dumping): ≈ 300
Best guess range: 3 000 – 5 000 pianos are probably resting on the seafloor worldwide.
What does "choked on it" mean for you? Gemini 2.5 pro gives this, even estimating what amouns of those 3m ships that sank after pianos became common item. Not pasting the full reasoning here since it's rather long.
Combining our estimates:
From Shipwrecks: 12,500
From Dumping: 1,000
From Catastrophes: 500
Total Estimated Pianos at the Bottom of the Sea ≈ 14,000
Also I have to point out that 4o isn't a reasoning model and neither is Sonnet 4, unless thinking mode was enabled.
I think you missed the part where I had to give them hints to solve it. All 3 initially couldn't, or refused, saying it was not a real problem on their first try.
You must be on the wrong side of an A/B test or very unlucky.
Because I gave your exact prompt to o3, Gemini, and Claude and they all produced reasonable answers like above on the first shot, with no hints, multiple times.
FWIW I just gave a similar question to Claude Sonnet 4 (I asked about something other than pianos, just in case they're doing some sort of constant fine-tuning on user interactions[1] and to make it less likely that the exact same question is somewhere in its training data[2]) and it gave a very reasonable-looking answer. I haven't tried to double-check any of its specific numbers, some of which don't match my immediate prejudices, but it did the right sort of thing and considered more ways for things to end up on the ocean floor than I instantly thought of. No hints needed or given.
[1] I would bet pretty heavily that they aren't, at least not on the sort of timescale that would be relevant here, but better safe than sorry.
[2] I picked something a bit more obscure than pianos.