That's not the pattern I see when looking at the op-code/cycle chart. I recently implemented part of a C64-emulator in JavaScript and it seems very much like every step takes a cycle.
For example the instructions NOP(or CLI, STI, INX etc), 1 byte, 2 cycles. 1 cycle for fetching the instruction and one for executing the fetched instruction.
LDA addr,x seems to be pipelined a bit though. It's "AD lo hi" in memory and takes 4 cycles unless lo+x > 255, then it takes 5 cycles. The lo+x calculation seems to occur while hi is being read.
I will have to dig up my 6502 documentation, but, IIRC, by the time the processor executed the NOP (CLI, INX etc) it already fetched the next instruction, so, if it's another NOP, it will complete in one cycle instead of two. Unless you crossed a page boundary, which implies a one-cycle penalty.
Since I never wrote timing-critical code for the 6502 (apart from "make it as fast as possible") I cannot recall many specifics. Since you did, you certainly have a better understanding of how it worked.
I am restoring a 65c02-based //e clone, so, I may be able to properly measure instruction timings, but I won't hold my breath.
Yes you are right, the instruction timings were very exact as far as I remember. The only cases where there was an option was in the case of a branch taken or not.
For example the instructions NOP(or CLI, STI, INX etc), 1 byte, 2 cycles. 1 cycle for fetching the instruction and one for executing the fetched instruction.
LDA addr,x seems to be pipelined a bit though. It's "AD lo hi" in memory and takes 4 cycles unless lo+x > 255, then it takes 5 cycles. The lo+x calculation seems to occur while hi is being read.