This vulnerability was not caused by OoO or speculative execution. It was caused by the fact that x86 was designed 45 years ago, and has had feature after feature piled on the same base, which has never been adequately rebuilt.
The more proximate cause is that some instructions with multiple redundant prefixes (which is legal, but pointless) have their length miscalculated by some Intel CPUs, which results in wrong outcomes.
> It was caused by the fact that x86 was designed 45 years ago, and has had feature after feature piled on the same base, which has never been adequately rebuilt.
Itanic would like to object! Unfortunately it can’t get through the door.
A more sensible approach for that use case, IMO, would be to have well-defined, specialized prefixes for padding instead of relying on the case-by-case behavior of redundant prefixes. (However, I understand that there's almost certainly a good historical reason why it wasn't done that way.)
The easiest way of doing padding is to add a bunch of `nop` instructions which are one byte each.
If you read the manual, Intel encourages minor variations of the `nop` instruction that can be stretched to different lengths (like `nop dword ptr [eax]` or `nop dword ptr [eax + eax*1 + 00000000h]`).
To my knowledge, relying on redundant prefixes of random non-nop instructions is never recommended anywhere.
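To make the multi-byte `nop` idea concrete, here is a rough sketch of how an assembler might fill an n-byte gap. The byte sequences are the recommended multi-byte NOP encodings as I recall them from the NOP entry in Intel's manual; the `NOPS` table name and the `nop_padding` helper are made up for illustration, and real assemblers pick sequences based on the target CPU.

```python
# Recommended NOP encodings by length (1 to 9 bytes), per the NOP entry
# in Intel's instruction set reference. Table and helper names are
# hypothetical, chosen only for this sketch.
NOPS = {
    1: bytes.fromhex("90"),                  # nop
    2: bytes.fromhex("6690"),                # 66 nop
    3: bytes.fromhex("0f1f00"),              # nop dword ptr [eax]
    4: bytes.fromhex("0f1f4000"),            # nop dword ptr [eax + 00h]
    5: bytes.fromhex("0f1f440000"),          # nop dword ptr [eax + eax*1 + 00h]
    6: bytes.fromhex("660f1f440000"),        # 66 nop word ptr [eax + eax*1 + 00h]
    7: bytes.fromhex("0f1f8000000000"),      # nop dword ptr [eax + 00000000h]
    8: bytes.fromhex("0f1f840000000000"),    # nop dword ptr [eax + eax*1 + 00000000h]
    9: bytes.fromhex("660f1f840000000000"),  # 66 nop word ptr [eax + eax*1 + 00000000h]
}


def nop_padding(n: int) -> bytes:
    """Return n bytes of padding built from as few NOP instructions as possible."""
    if n < 0:
        raise ValueError("padding length must be non-negative")
    out = bytearray()
    while n > 0:
        k = min(n, 9)  # greedily take the longest available encoding
        out += NOPS[k]
        n -= k
    return bytes(out)
```

Fewer, longer instructions cost the decoder less than a run of one-byte `nop`s, which is the whole point of having the longer forms.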
Note that this technique is really only legitimate where the prefix used already has defined behavior with the given instruction ("Use of repeat prefixes and/or undefined opcodes with other Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable behavior."), and of course the REX prefix has special limitations. The key is redundant, not spurious. It is not a good idea to be doing `rep add`, for example. But otherwise, there is no issue.
The prefixes are redundant so it's not really case-by-case behavior. You're just repeating the prefix you would be using anyway in that location.
Using specialized prefixes wastes encoding space for no real gain.
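For anyone who hasn't seen redundant-prefix padding, here is a minimal sketch of the idea: take an instruction that already starts with a legacy prefix and repeat that same prefix until the instruction reaches the length you want. The helper name and the conservative prefix whitelist are my own assumptions for illustration; the only hard facts baked in are the prefix byte values and the 15-byte architectural limit on instruction length.

```python
MAX_INSN_LEN = 15  # architectural limit on x86 instruction length

# Conservative whitelist (an assumption for this sketch): operand-size,
# address-size, and segment-override prefixes. Repeat/lock prefixes are
# deliberately left out.
REPEATABLE_PREFIXES = {0x66, 0x67, 0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65}


def pad_with_redundant_prefix(insn: bytes, target_len: int) -> bytes:
    """Lengthen insn to target_len by repeating the prefix it already starts with."""
    if not insn or insn[0] not in REPEATABLE_PREFIXES:
        raise ValueError("instruction does not start with a repeatable prefix")
    if not len(insn) <= target_len <= MAX_INSN_LEN:
        raise ValueError("target length out of range")
    # Prepending keeps any REX byte adjacent to the opcode, as required.
    return bytes([insn[0]]) * (target_len - len(insn)) + insn


# 66 89 D8 is `mov ax, bx` in 32/64-bit mode; pad it from 3 to 5 bytes.
padded = pad_with_redundant_prefix(bytes.fromhex("6689d8"), 5)
```

The helper refuses to invent a prefix the instruction didn't already carry, which is the "redundant, not spurious" distinction.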
You realize that on most common processors NOP itself is a pseudo-instruction? Even on the apparently meme-worthy (see sibling comment) RISC-V, it's `ADDI x0, x0, 0`.
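To illustrate the RISC-V case, a small sketch that assembles `ADDI x0, x0, 0` from the standard I-type field layout and shows it comes out as the canonical NOP word. The function and constant names are mine; the bit positions and the OP-IMM opcode are from the base ISA spec.

```python
OP_IMM = 0b0010011   # major opcode for register-immediate ALU ops
FUNCT3_ADDI = 0b000


def encode_i_type(opcode: int, rd: int, funct3: int, rs1: int, imm: int) -> int:
    """Pack an RV32I I-type instruction: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


# NOP is just ADDI with x0 as both source and destination and a zero immediate.
NOP = encode_i_type(OP_IMM, rd=0, funct3=FUNCT3_ADDI, rs1=0, imm=0)
assert NOP == 0x00000013
```

Since x0 is hardwired to zero, the write is discarded and the instruction has no architectural effect, so no dedicated NOP opcode is needed.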
Usually, the historical reason is that adding the logic to do something well-defined when unexpected prefixes are used costs ten more transistors per chip, which adds cost to handle a corner case that almost nobody will ever hit anyway. Far better to let whatever the implementation does happen, as long as what happens doesn't break the system.
The issue here is their verification of possible internal CPU states didn't account for this one.
(There is, perhaps, an argument to be made that the x86 architecture has become so complex that the emulator between its embarrassingly stupid PDP-11-style single-thread codeflow and the embarrassingly parallel computation it does under the hood to give the user more performance than a really fast PDP-11 cannot be reliably tested to exhaustion, so perhaps something needs to give on the design or the cost of the chips).
Both approaches are viable, but RISC-V's approach is better, as it provides higher code density without imposing a significant increase in complexity in exchange.
Higher code density is valuable. E.g.:
- The decoders can see more instructions in a code window of the same size, or we can use a narrower window.
- We can have less cache and save area and power. We can also clock the cache higher, enabled by it being smaller, lowering latency cycles.
- Smaller binaries or ROM images.
Large, high-performance implementations, due to become available soon (2024), will demonstrate RISC-V's advantages well.