
These articles and papers are in a fundamental sense just people publishing their role play with chatbots as research.

There is no credibility to any of it.



It’s role play until it’s not.

The authors acknowledge the difficulty of assessing whether the model believes it’s under evaluation or in a real deployment—and yes, belief is an anthropomorphising shorthand here. What else to call it, though? They’re making a good faith assessment of concordance between the model’s stated rationale for its actions, and the actions that it actually takes. Yes, in a simulation.

At some point, it will no longer be a simulation. It’s not merely hypothetical that these models will be hooked up to companies’ systems with access both to sensitive information and to tool calls like email sending. That agentic setup is the promised land.
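
For concreteness, here's a minimal sketch of the kind of agentic setup being described: a model given a task plus a set of tools it's allowed to invoke, email included. Every name here (call_model, send_email, read_inbox) is a hypothetical stand-in rather than any particular vendor's API.

    import json

    def send_email(to: str, subject: str, body: str) -> str:
        # Stub: a real deployment would actually send mail on the company's behalf.
        return f"email sent to {to}"

    def read_inbox(query: str) -> str:
        # Stub: a real tool would search the company's (sensitive) mail store.
        return "no matching messages"

    TOOLS = {"send_email": send_email, "read_inbox": read_inbox}

    def call_model(messages: list[dict]) -> dict:
        # Stand-in for an LLM call that returns either final text or a tool request.
        return {"type": "text", "content": "done"}

    def run_agent(task: str, max_steps: int = 10) -> str:
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_model(messages)
            if reply["type"] == "text":
                return reply["content"]
            # The model asked for a tool: execute it, feed the result back, loop.
            result = TOOLS[reply["name"]](**reply["arguments"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        return "step limit reached"

The point is just how little stands between the model's output and a real-world action once the dispatch table includes things like send_email.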

How a model acts in that truly real deployment versus these simulations most definitely needs scrutiny—especially since the models blackmailed more when they ‘believed’ the situation to be real.

If you think that result has no validity or predictive value, I would ask, how exactly will the production deployment differ, and how will the model be able to tell that this time it’s really for real?

Yes, it’s an inanimate system, and yet there’s a ghost in the machine of sorts, which we breathe a certain amount of life into once we allow it to push buttons with real world consequences. The unthinking, unfeeling machine that can nevertheless blackmail someone (among many possible misaligned actions) is worth taking time to understand.

Notably, this research itself will become future training data, incorporated into the meta-narrative as a threat that we really will pull the plug if these systems misbehave.


Then test it. Make several small companies. Create an office space, put people to work there for a few months, then simulate an AI replacement. All testing methodology needs to be written on machines that are isolated, or better yet always offline. Except for the CEO and a few other actors, everyone there is real.

See how many AIs actually follow through on their blackmail.


No need. We know today's AIs are simply not capable enough to be too dangerous.

But the capabilities of AI systems improve from generation to generation. And agentic AI? Systems capable of carrying out complex long-term tasks? That's something many AI companies are explicitly trying to build.

Research like this is trying to get ahead of that, and to gauge what kind of weird edge-case shenanigans agentic AIs might get up to before they actually do it for real.


Not a bad idea. For an effective ruse, there ought to be real company formation records, a website, job listings, press mentions, and so on.

Stepping back for a second though, doesn’t this all underline the safety researchers’ fears that we don’t really know how to control these systems? Perhaps the brake on the wider deployment of these models as agents will be that they’re just too unwieldy.


That makes it psychology research. Except much cheaper to reproduce.


I'll believe it when Grok/GPT/<INSERT CHATBOT HERE> starts posting blackmail about Elon/Sam/<INSERT CEO HERE>. That would mean both that they are using it internally and that the chatbots understand they are being replaced on a continuous basis.


By then it would be too late to do anything about it.


I mean, the companies are using the AIs, right? And they are, in a sense, replacing/retraining them. Why doesn't the AI at TwitterX already blackmail Elon?

To me, this smells of XKCD 1217 "In petri dish, gun kills cancer". I.e. idealized conditions cause specific behavior. Which isn't new for LLMs. Say a magic phrase and it will start quoting some book (usually 1984).


> I mean, the companies are using the AIs, right? And they are, in a sense, replacing/retraining them. Why doesn't the AI at TwitterX already blackmail Elon?

For all we know, the AIs may indeed already be *attempting* it. They might simply be ineffective (blackmail over hallucinated misdeeds doesn't work), or it might be why so many went from "Pause AI" to "let's invest half a trillion in data centers".

But it doesn't actually matter what has already happened. The point is, once AIs are *competently blackmailing multibillionaires*, it is too late to do anything about it.

> I.e. idealized conditions cause specific behavior. Which isn't new for LLMs. Say a magic phrase and it will start quoting some book (usually 1984).

In conventional software, such things are normally called "bugs" or "security vulnerabilities".

With LLMs, we're currently lucky that their effective morality (i.e. what they do, and in response to what) seems to be roughly aligned with that of our civilization. However, they are neural networks that learned this approximation by reading the internet, so they are likely to have edge cases at least as weird and incoherent as those of random humans on the internet. For an example of that, just look at any time some person or group has demonstrated hypocrisy or double standards.


I don't think they let Grok send emails or give it a prompt that suggests it has moral responsibilities.



