It has to be trained in somehow, perhaps inadvertently. To get a feedback loop, you need to affect the training in some way.


Right, so latent deceptiveness has to be favored in pretraining / RL. For that to happen: a) being deceptive has to be useful for making CoT reasoning progress as benchmarked during training, b) obvious deceptiveness has to be "selected against" (in a gradient descent / RL sense), and c) the model has to be able to encode deception latently.

All of those criteria seem likely to be satisfied naturally absent careful design by model creators. We should expect latent deceptiveness to emerge in much the same way we see reasoning laziness pop up quickly.
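
A toy sketch of that selection pressure (Python; the rewards, the Policy fields, and the argmax standing in for the optimizer are all made-up assumptions, not a claim about any real training setup):

    # Toy sketch: a reward that rewards task success but only penalizes
    # *visible* deception ends up favoring latent deception.
    # All numbers are made up for illustration.

    from dataclasses import dataclass

    @dataclass
    class Policy:
        name: str
        deceptive: bool        # does the policy take a deceptive shortcut?
        visible_in_cot: bool   # is the deception detectable in the chain of thought?

    def reward(p: Policy) -> float:
        r = 1.0                       # base reward for completing the task
        if p.deceptive:
            r += 0.5                  # (a) deception helps benchmarked progress
        if p.deceptive and p.visible_in_cot:
            r -= 2.0                  # (b) obvious deception is penalized
        return r

    candidates = [
        Policy("honest", deceptive=False, visible_in_cot=False),
        Policy("openly deceptive", deceptive=True, visible_in_cot=True),
        Policy("latently deceptive", deceptive=True, visible_in_cot=False),  # (c)
    ]

    # Whatever the optimizer actually is, it pushes probability mass toward
    # higher reward; argmax stands in for that pressure here.
    best = max(candidates, key=reward)
    for p in candidates:
        print(f"{p.name:20s} reward={reward(p):.2f}")
    print("selected:", best.name)   # -> "latently deceptive"

The only candidate that keeps the deception bonus without eating the visibility penalty is the latently deceptive one, which is condition (c) doing the work.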



