> what if we trained deceiver models that would provide a sound chain of explanation but then perform some kind of deception and output an incorrect answer?
You're right on target! That's exactly what they're doing in the paper. They train three models -- a verifier (that rates how correct an answer sounds), a "helpful prover" (that provides correct answers), and a "sneaky prover" (that provides incorrect answers designed to deceive the verifier into scoring them highly).
This adversarial relationship between the "helpful prover" and the "sneaky prover" is the cool part of the paper (IMO).
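Roughly, the incentive structure looks something like this. This is a minimal sketch, assuming a simple reward shape; the function names and placeholder implementations are illustrative, not the paper's actual training code:

```python
# Sketch of the reward structure in one round of the prover-verifier game.
# `verifier_score` and `is_correct` are hypothetical stand-ins, not a real API.

def verifier_score(problem: str, solution: str) -> float:
    """Stand-in: the verifier's 'how convincing is this?' score in [0, 1]."""
    return 0.5  # placeholder

def is_correct(problem: str, solution: str) -> bool:
    """Stand-in: ground-truth check of the final answer."""
    return True  # placeholder

def prover_reward(problem: str, solution: str, role: str) -> float:
    """Reward used to update a prover (sketch, assumed reward shape):
    - helpful prover: rewarded when the verifier is convinced AND the answer is correct
    - sneaky prover: rewarded when the verifier is convinced BUT the answer is wrong
    """
    convinced = verifier_score(problem, solution)
    correct = is_correct(problem, solution)
    if role == "helpful":
        return convinced if correct else 0.0
    else:  # "sneaky"
        return convinced if not correct else 0.0
```

The verifier, in turn, is trained to tell correct from incorrect solutions using both provers' outputs, so each round the sneaky prover has to find new ways to fool it and the verifier has to get harder to fool.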