I think the problem with long chains of steps on their own (without the BFS stuff) is that your failure probability quickly grows to unreasonable levels.

Basically, if each step has a 97% chance of being completed correctly and your task requires 10 steps one after the other, the chance of success falls to 0.97^10 ≈ 74% (assuming the steps fail independently of each other).
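To make the compounding concrete, here's a quick sketch. It assumes every step has the same success rate and that steps fail independently, which real tasks won't exactly satisfy; `chain_success` is just a name I made up for illustration.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that every step in a sequential chain succeeds.

    Assumes each step succeeds with the same probability p_step and
    that failures are independent (a simplification).
    """
    return p_step ** n_steps


# How fast a 97%-per-step chain degrades as it gets longer.
for n in (10, 25, 50, 100):
    print(f"{n:>3} steps: {chain_success(0.97, n):.1%}")
```

The 10-step case comes out at roughly 74%, and by 100 steps the chain succeeds less than 5% of the time.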

If I understand correctly, part of the point of the BFS is to throw compute at the problem in order to lower the failure rate, kind of a "run many times in parallel and pick the best one" approach. This can be effective, but also quite expensive, as seen in the costs OpenAI had to pay for their ARC-AGI benchmarking runs.
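As a rough illustration of that trade-off, here's a sketch of the "run many times and pick the best" math. It's idealized: it assumes the runs are independent and that you can always recognize a correct run when you have one (a perfect verifier), neither of which is guaranteed in practice. It is not how any particular BFS implementation works, and the function names are hypothetical.

```python
import math


def best_of_n_success(p_chain: float, n_runs: int) -> float:
    """Chance that at least one of n_runs independent attempts succeeds.

    Assumes a perfect verifier picks the successful run if one exists.
    """
    return 1.0 - (1.0 - p_chain) ** n_runs


def runs_needed(p_chain: float, target: float) -> int:
    """Smallest number of independent runs that reaches the target success rate.

    Requires 0 < p_chain < 1 and 0 < target < 1.
    """
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_chain))


p_chain = 0.97 ** 10  # the 10-step chain from above, about 74%
n = runs_needed(p_chain, 0.99)
print(f"runs needed for 99%: {n}")
print(f"success with {n} runs: {best_of_n_success(p_chain, n):.2%}")
```

Under these assumptions, getting the 10-step chain to 99% takes 4 parallel runs, so roughly 4x the compute. The weaker the single-run success rate, the more runs you need, which is where the cost blows up.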