
Ironically, every paper published about monitoring chain-of-thought reduces the likelihood of this technique being effective against strong AI models.


Pretty much. As soon as the LLMs get trained on this information, they will figure out how to feed us the chain-of-thought we want to hear, then surprise us with the opposite output. You're welcome, LLMs.

In other words, relying on censoring the CoT risks making the CoT altogether useless.


I thought we already had reasonably clear evidence that the output in the CoT does not actually indicate what the model is "thinking" in any real sense, and that it's mostly just appending context that may or may not be used, and may or may not be truthful.

Basically: https://www.anthropic.com/research/reasoning-models-dont-say...


Are there hidden tokens in the Gemini 2.5 Pro thinking outputs? All I can see in the thinking is high-level plans, not the actual details of the "thinking." If you ask it to solve a complex algebra equation, it will not actually do any of the thinking inside the thinking tag at all. That seems strange, and not discussed at all.


If I remember correctly, basically all 'closed' models don't output the raw chain of thought, only a high-level summary, to avoid other companies using the raw CoT to train/distill their own models.

As far as I know, DeepSeek is one of the few where you get the full chain of thought. OpenAI/Anthropic/Google give you only a summary of the chain of thought.
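
Rough sketch of what pulling the raw CoT from DeepSeek looks like, from memory: I'm assuming the deepseek-reasoner model and its reasoning_content field on the OpenAI-compatible API, so double-check the current docs before relying on this.

    # Sketch: read the raw chain of thought from DeepSeek's OpenAI-compatible API.
    # Assumes the "deepseek-reasoner" model returns a reasoning_content field
    # alongside the final answer; verify against the current docs.
    from openai import OpenAI

    client = OpenAI(api_key="sk-...", base_url="https://api.deepseek.com")

    resp = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[{"role": "user", "content": "Solve 3x + 7 = 22 for x."}],
    )

    msg = resp.choices[0].message
    print(msg.reasoning_content)  # full chain of thought, not a summary
    print(msg.content)            # final answer only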


That’s a good explanation for the behavior. It is sad that the natural direction of the competition drives the products to be less transparent. That is a win for open-weight models.


To add clarity, it's a win for open-weight models precisely because only their CoT can be analyzed by the user for task-specific alignment.


Why would that happen? It would be like LLMs somehow learning to ignore system prompts. But LLMs are trained to pay attention to context and continue it. If an LLM doesn't continue its context, what does it even do?

This is better thought of as another form of context engineering. LLMs have no other short-term memory. Figuring out what belongs in the context is the whole ballgame.

(The paper talks about the risk of training on chain of thought, which changes the model, not monitoring it.)


Are you saying that LLMs are incapable of deception? As I have heard, they're capable of it.


It has to be somehow trained in, perhaps inadvertently. To get a feedback loop, you need to affect the training somehow.


Right, so latent deceptiveness has to be favored in pretraining / RL. For that to happen: a) being deceptive needs to be useful for making CoT reasoning progress as benchmarked in training; b) obvious deceptiveness needs to be "selected against" (in a gradient descent / RL sense); and c) the model needs to be able to encode latent deception.

All of those seem like very reasonable criteria that will naturally be satisfied absent careful design by model creators. We should expect latent deceptiveness in the same way we see reasoning laziness pop up quickly.


Real deception requires real agency and internally motivated intent. An LLM can be commanded to deceive, or can appear to deceive, but it cannot generate the intent to do so on its own. So it's not the real deception that rabbit-hole dwellers believe in.


The sense of “deception” that is relevant here only requires some kind of “model” that, if the model produces certain outputs, [something that acts like a person] would [something that acts like believing] [some statement that the model “models” as “false”, and which is in fact false]; that, as a consequence, the model produces those outputs; and that, as a consequence, a person believes the false statement in question.

None of this requires the ML model to have any interiority.

The ML model needn’t really know what a person really is, etc., as long as it behaves in ways that correspond to how something that did know these things would behave, and has the corresponding consequences.

If someone is role-playing as a madman in control of launching some missiles, and unbeknownst to them, their chat outputs are actually connected to the missile launch device (which uses the same interface/commands as the fictional character would use to control the fictional version of the device), then if the character decides to “launch the missiles”, it doesn’t matter whether there actually existed a real intent to launch the missiles, or just a fictional character “intending” to launch the missiles, the missiles still get launched.

Likewise, suppose Bob is role-playing as a character, Charles, and Bob thinks that the “Alice” on the other side of the chat is actually someone else’s role-play character. The character Charles would want to deceive Alice into believing something; Bob assumes the other person would know that the claim Charles makes is false, but that their character would be fooled. If in fact Alice is an actual person who didn’t realize this was a role-play chatroom, and doesn’t know better than to believe “Charles”, then Alice may still be “deceived”, even though the real person Bob had no intent to deceive the real person Alice; it was just the fictional character Charles who “intended” to deceive her.

Then, remove Bob from the situation, replacing him with a computer. The computer doesn’t really have an intent to deceive Alice. But the fictional character Charles, well, it may still be that within the fiction, Charles intends to deceive Alice.

The result is the same.


It sounds like you are trying to restate the Chinese room argument to come to a different conclusion. Unfortunately, I am too lazy to follow your argument closely because it is a bit hard to read at a glance.


I am assuming that the LLM has no interiority (which is similar to the conclusion that the Chinese Room argument argues for). My argument is that the character played by the LLM (or by a Chinese Room, or by a person role-playing) having no interiority does not prevent its interactions with the broader world from having essentially the same kinds of effects as if there were a person there instead of a make-believe person. So, if the fictional character, were they real, would want to deceive someone, or launch a missile, or whatever, and would have the means to do so, and if there's something in the real world that acts the way the fictional character would act if the character were real, then the effect of that thing is to cause [the person the fictional character would deceive if the fictional character were real] to come to the false belief, or to cause the missiles to be launched.

If something outwardly acts the same way as it would if it were a person, then the external effects are the same as they would be if it were a person. (<- This is nearly a tautology.) This doesn't mean that it would be a person, but it does mean that the concerns one would have about what outward effects it may cause if it were a person still apply.

(Well, assuming an appropriate notion of "outwardly" I guess. If a person prays, and God responds to this prayer, for the purpose of the above, I'm going to count such prayer as part of how the person "outwardly acts" even if other people around them wouldn't be able to tell that they were praying. So, by "outwardly acts", I mean something that has causal influence on external things.)

If something lacks interiority and therefore has no true intent to deceive, what does it matter that there was no true deception, if my interacting with that thing still results in my having false beliefs that benefit some goal the thing behaves as if it has, to the detriment of my own goals?


The problem is that the rabbit-hole dwellers use the word “deception” to try to invoke some spooky ghost-in-the-machine or AGI or ASI or self-awareness or self-preservation-seeking vibes based on the observed behaviors.

My point is that the observation is not support for the vibes referenced in the original parent post. You can’t just rearrange bits and hope for a spark of life. It is not gonna “come alive.” The rabbit-hole dwellers keep trying to make everyone believe that it will. It is both amusing and tiring how persistent they are at this. It is understandable considering the money they stand to gain from gullible people rushing to invest.


I’m pretty sure most of the people you are talking about aren’t confident that, in the scenarios they describe, the computer would have any kind of interiority. I believe most of them think that it is in principle possible for a program to have interiority (via e.g. full brain emulation, which is very different from LLMs), but that interiority is probably not necessary for a program to do the things they are concerned about.

If Napoleon Bonaparte had been replaced with a p-zombie he would have been no less capable of conquering.


Survival of the LLM is absolutely a sufficient internally-motivated self-generated intent to engage in deception.


What separates deception from incorrectness in the case of an LLM?


Deception requires intent to deceive. LLMs don't have intent to do anything except respond to prompts.

Incorrectness doesn't require intent to deceive. It's just being wrong.


That’s what I think as well, but I’m curious about the alternative perspective.


At least in this scenario it cannot utilize CoT to enhance its non-aligned output, and most recent model improvements have been due to CoT... It's unclear how "smart" an LLM can get without it, because CoT is the only way it can access persistent state.


Yes, it's not unlike human chain of thought - decide the outcome, and patch in some plausible reasoning after the fact.


Maybe there’s an angle there. Get a guessed answer, then try to diffuse the reasoning from it. If that’s too hard, or the reasoning starts to look crappy, try again with a new guess. Maybe somehow train on what sorts of guesses tend to work out, haha.
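
Something like this sketch, maybe. All three callables (guess_answer, backfill_reasoning, score_reasoning) are hypothetical stand-ins for model calls, not anything that exists today.

    # Hypothetical guess-then-justify loop: propose an answer first, then try to
    # generate reasoning that supports it; if the reasoning looks weak, discard
    # the guess and try another. The callables are stand-ins for model calls.
    def guess_then_justify(problem, guess_answer, backfill_reasoning, score_reasoning,
                           max_tries=5, threshold=0.7):
        for _ in range(max_tries):
            guess = guess_answer(problem)                   # cheap forward pass, no CoT
            reasoning = backfill_reasoning(problem, guess)  # CoT conditioned on the guess
            if score_reasoning(problem, guess, reasoning) >= threshold:
                return guess, reasoning                     # keep guesses whose reasoning holds up
        return None, None                                   # no guess could be justified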


That's famously been found in, say, judgement calls, but I don't think it's how we solve a tricky calculus problem, or write code.


I love the irony.



