
I'm not sure that's entirely true. According to this (see "Why can’t Intel and AMD add more instruction decoders?"):

https://debugger.medium.com/why-is-apples-m1-chip-so-fast-32...

...a big part of the reason the M1 is so fast is the large reorder buffer, which is enabled by the fact that ARM instructions are all the same size, which makes parallel instruction decoding far easier. Because x86 instructions are variable length, the processor has to do some amount of work just to find out where the next instruction starts, and I can see how it would be difficult to do that work in parallel, especially compared to an architecture with a fixed instruction size.
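The boundary problem can be sketched in a few lines. This is a toy model, not real x86 or ARM decoding: with variable lengths each boundary depends on the previous one, so the scan is a serial chain, while with a fixed width every boundary is known up front.

```python
def boundaries_variable(lengths):
    """With variable-length instructions, each boundary depends on the
    previous instruction's length, so the scan is inherently sequential."""
    offsets, pc = [], 0
    for n in lengths:
        offsets.append(pc)
        pc += n
    return offsets

def boundaries_fixed(num_insns, width=4):
    """With fixed 4-byte instructions (as on AArch64), every boundary is
    known immediately, so all decoders can start in parallel."""
    return [i * width for i in range(num_insns)]

print(boundaries_variable([1, 3, 2, 5]))  # [0, 1, 4, 6]
print(boundaries_fixed(4))                # [0, 4, 8, 12]
```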



Well, if we can have speculative execution, why not speculative decode? You could decode the stream as if the next instruction started at $CURRENT_PC+1, $CURRENT_PC+2, etc. When you know how many bytes the instruction at $CURRENT_PC takes, you could keep the right decode and throw the rest away.

Sure, it would mean multiple duplicate decoders, which eats up transistors. On the other hand, we've got to find something useful for all those transistors to do, and this looks useful...
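The idea above can be modeled in a few lines. The instruction format here is invented for illustration (the first byte encodes a 1-4 byte length); real x86 length determination is far more involved. The point is the shape of the scheme: decode optimistically at every offset, then keep only the decodes that land on real boundaries.

```python
def insn_length(byte):
    # Toy encoding: the low two bits of the first byte give length 1-4.
    return (byte & 0b11) + 1

def speculative_decode(code, window=8):
    # Start a decode at every byte offset in the window...
    candidates = {off: insn_length(code[off])
                  for off in range(min(window, len(code)))}
    # ...then walk the true boundaries and keep only the correct decodes;
    # everything else is discarded, just like a mispredicted branch.
    kept, pc = [], 0
    while pc < len(code):
        length = candidates.get(pc, insn_length(code[pc]))
        kept.append((pc, length))
        pc += length
    return kept

print(speculative_decode(bytes([0b01, 0b00, 0b11, 0, 0, 0, 0b10, 0])))
# [(0, 2), (2, 4), (6, 3)]
```

Of the eight speculative decodes started, only three survive; the rest are wasted work, which is the transistor (and power) cost the comment mentions.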


According to the article I linked, that's basically how they do it:

"The brute force way Intel and AMD deal with this is by simply attempting to decode instructions at every possible starting point. That means x86 chips have to deal with lots of wrong guesses and mistakes which have to be discarded. This creates such a convoluted and complicated decoder stage that it is really hard to add more decoders. But for Apple, it is trivial in comparison to keep adding more.

In fact, adding more causes so many other problems that, according to AMD itself, four decoders is basically an upper limit for them."


That doesn't make any sense. The ROB comes after instructions have been cracked into uops; the internal format and length of uops are "whatever is easiest for the design", since they're not visible to the outside world.

This argument does apply to the L1 cache, which sits before decode. (It does not apply to uop caches/L0 caches, but is related to them anyway, as they are most useful for CISCy designs, with instructions that decode in complicated ways into many uops.)


Maybe it wasn't clear, but the article I linked is saying that compared to M1, x86 architectures are decode-limited, because parallel decoding with variable-length instructions is tricky. Intel and AMD (again according to the linked article) have at most 4 decoders, while M1 has 8.

So yes, the ROB comes after decoding, but surely there's little point in having a ROB larger than the decoders can keep relatively full.
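A rough way to quantify that link is Little's Law: useful ROB capacity is roughly sustained decode width times the number of cycles an instruction may sit in flight (e.g. waiting out a cache miss). The latency figure below is illustrative, not a measured value.

```python
def useful_rob_entries(decode_width, inflight_cycles):
    # Little's Law: occupancy = arrival rate x time in system.
    return decode_width * inflight_cycles

# A 4-wide front end covering an ~80-cycle miss window:
print(useful_rob_entries(4, 80))   # 320
# An 8-wide front end over the same window:
print(useful_rob_entries(8, 80))   # 640
```

For what it's worth, third-party estimates put the M1's ROB in the ~600-entry range, which is at least consistent with an 8-wide front end under this kind of back-of-envelope math.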


Well-intentioned as that article may be, it makes plenty of mistakes. For a rather glaring one: no, uops are not linked to OoO.

Secondly, it ignores the existence of uop caches that x86 designs use in order to not need such wide decoders. Some ARM designs also use uop caches, FWIW, since it can be more power efficient.

That doesn't mean that fixed width decoding like on aarch64 isn't an advantage; it certainly is. Also, M1 is certainly a very impressive design, though of course it also helps that it's fabbed on the latest and greatest process.



