CoT is a poor debugging tool, even when it's readable. Non-prompted CoT deceptively looks like natural language, but it's really written in a "bird language" the model learned through reward hacking: a "yes" in the CoT can correspond to a "no" in the reply, which makes perfect sense to the model but not to a human. In fact, it's pretty common for the CoT to say one thing and the reply to say the complete opposite. You can probably train the CoT to be readable enough, but its efficiency drops with each additional constraint that isn't related to producing a correct reply. CoT is not an explanation; ideally it just stores intermediate results.
The right tool is proper interpretability of the internal state of the model, so you can tap into it directly and produce a usable explanation of what it does.
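To make that concrete, here's a minimal sketch of one common way to "tap into" internal state: a linear probe trained on hidden activations to read out a property the model represents internally, independent of whatever the CoT text claims. This isn't a specific method from the post; the model name, layer choice, and toy true/false dataset are all illustrative assumptions.

```python
# Sketch: linear probe over hidden activations (illustrative, not the
# definitive approach). Assumes a HuggingFace causal LM that can return
# hidden states; "gpt2", layer 6, and the toy dataset are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_activation(text: str, layer: int = 6) -> torch.Tensor:
    """Hidden state of the final token at a chosen layer."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

# Tiny illustrative dataset: statements labeled true (1) / false (0).
examples = [
    ("Paris is the capital of France.", 1),
    ("Berlin is the capital of France.", 0),
    ("Two plus two equals four.", 1),
    ("Two plus two equals five.", 0),
]
X = torch.stack([last_token_activation(t) for t, _ in examples]).numpy()
y = [label for _, label in examples]

# The probe itself: if a simple linear readout separates the classes,
# the internal state already encodes the property, regardless of what
# the CoT says on the way to the reply.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on its own toy data:", probe.score(X, y))
```

A toy probe like this obviously isn't "proper interpretability" by itself, but it illustrates the direction: explanations grounded in activations rather than in the CoT transcript.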