Forcing reasoning is analogous to requiring a student to show their work when so...

qarl · 2025-06-15T03:24:05 1749957845

> The paper doesnt mention the models coming up with the algorithm at all AFAIK.

And that's because they specifically hamstrung their tests so that the LLMs were not "allowed" to generate algorithms.

If you simply type "Give me the solution for Towers of Hanoi for 12 disks" into chatGPT it will happily give you the answer. It will write program to solve it, and then run that program to produce the answer.

But according to the skeptical community - that is "cheating" because it's using tools. Nevermind that it is the most effective way to solve the problem.

https://chatgpt.com/share/6845f0f2-ea14-800d-9f30-115a3b644e...

zoul · 2025-06-15T06:08:09 1749967689

This is not about finding the most effective solution, it’s about showing that they “understand” the problem. Could they write the algorithm if it were not in their training set?

qarl · 2025-06-15T17:20:24 1750008024

That's an interesting question. It's not the one they are trying to answer, however.

From my personal experience: yes, if you describe a problem without mentioning the name of the algorithm, an LLM will detect and apply the algorithm appropriately.

They behave exactly how a smart human would behave. In all cases.

boredhedgehog · 2025-06-15T06:32:31 1749969151

If that's the point, shouldn't they ask the model to explain the principle for any number of discs? What's the benefit of a concrete application?

johnecheck · 2025-06-15T09:00:26 1749978026

Because that would prove absolutely nothing. There are numerous examples of tower of Hanoi explanations in the training set.

elbear · 2025-06-15T11:28:22 1749986902

How do you check that a human understood it and not simply memorised different approaches?

YeGoblynQueenne · 2025-06-15T16:30:49 1750005049

You ask them to solve several instances of the problem?

godelski · 2025-06-15T16:54:10 1750006450

It's hard. But usually we ask several variations and make them show their work.

But a human also isn't an LLM. It is much harder for them to just memorize a bunch of things, which makes evaluation easier. But they also get tired and hungry, which makes evaluation harder ¯\_(ツ)_/¯

elbear · 2025-06-15T17:20:18 1750008018

If we're talking about solving an equation, for example, it's not hard to memorize. Actually, that's how most students do it, they memorize the steps and what goes where[1].

But they don't really know why the algorithm works the way it does. That's what I meant by understanding.

[1] In learning psychology there is something called the interleaving effect. What it says is that you solve several problems of the same kind, you start to do it automatically after the 2nd or the 3rd problem, so you stop really learning. That's why you should interleave problems that are solved with different approaches/algorithms, so you don't do things on autopilot.

godelski · 2025-06-15T17:24:24 1750008264

Yes, tests fail in this method. But I think you can understand why the failure is larger when we're talking about a giant compression machine. It's not even a leap in logic. Maybe a small step

elbear · 2025-06-16T06:37:24 1750055844

I'm not sure what you mean. Btw, I'm not in the field, just have thought a lot about the topic.

Too · 2025-06-15T06:11:26 1749967886

How can one know that's not coming from the pre-trained data. The paper is trying to evaluate whether the LLM has general problem solving ability.

jsnell · 2025-06-15T06:54:27 1749970467

The paper doesn't mention it because either the researchers did not care to check the outputs manually, or reporting what was in the outputs would have made it obvious what their motives were.

When this research has been reproduced, the "failures" on the Tower of Hanoi are the model printing out a bunch of steps, saying there is no point in doing it thousands of times more. And they they'd either output an the algorithm for printing the rest in words or code

godelski · 2025-06-15T16:56:39 1750006599

It's really easy to make a billion dollars. Just make a really useful app and sell it. There's no point explaining the rest since it's so trivial.

jsnell · 2025-06-15T18:01:27 1750010487

That seems like a complete non sequitur. This is the model explaining the rest. Obviously the explanation is not very interesting since the Towers of Hanoi is not an interesting problem. But that's on the researches for choosing something with a trivial algorithm if their goal was to test reasoning abilities.

godelski · 2025-06-15T18:05:23 1750010723

I'm replying to this

  > the model printing out a bunch of steps, saying there is no point in doing it thousands of times more.

jsnell · 2025-06-15T20:02:36 1750017756

Ok, but the very next sentence was:

> And they they'd either output an the algorithm for printing the rest in words or code.

So clearly you already knew that your strawman was not relevant. Why try it anyway?

godelski · 2025-06-15T22:25:36 1750026336

Because that wasn't the task given to them. It's like giving a student a test and you asking them to solve an equation and they give you the general form. It's incomplete