> If CoT improves performance, then CoT improves performance, however the naively obvious read of "it improves performance because it is 'thinking' the 'thoughts' it tell us it is thinking, for the reasons it gives" is not completely accurate.

I can't imagine why anyone who knows even a little about how these models work would believe otherwise.

The "chain of thought" is text generated by the model in response to a prompt, just like any other text it generates. It then consumes that as part of a new prompt, and generates more text. Those "thoughts" are obviously going to have an effect on the generated output, simply by virtue of being present in the prompt. And the evidence shows that it can help improve the quality of output. But there's no reason to expect that the generated "thoughts" would correlate directly or precisely with what's going on inside the model when it's producing text.
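The mechanics described above can be sketched in a few lines. This is an illustrative toy, not a real API: `generate` is a hypothetical stand-in for an actual model call, with canned responses so the loop is runnable. The point is only the plumbing: the "thoughts" are ordinary generated text that gets concatenated into the next prompt.

```python
def generate(prompt: str) -> str:
    # Hypothetical stand-in for a model call; returns canned text so the
    # chain-of-thought loop below is runnable end to end.
    if "Thought:" not in prompt:
        return "Thought: 17 + 25 = 42."
    return "Answer: 42"

def answer_with_cot(question: str) -> str:
    # Step 1: prompt the model to emit its "thoughts" as plain text.
    prompt = f"{question}\nLet's think step by step."
    thoughts = generate(prompt)
    # Step 2: the thoughts are simply appended to form a new prompt.
    # They influence the output by being present in the context, whether
    # or not they reflect what actually happened inside the model.
    followup = f"{prompt}\n{thoughts}\nNow give the final answer."
    return generate(followup)

print(answer_with_cot("What is 17 + 25?"))  # prints "Answer: 42"
```

Nothing in this loop requires the intermediate text to be a faithful trace of the model's computation; it only has to be useful context for the second call.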
