There is a more-or-less comprehensive archive[1] up to 2019 (which should probably be scraped and hoarded). The article in question is there, comments included[2].
Side note: Michael Kaplan’s blog, which Microsoft took down in a remarkably shameful manner, has also been archived[3], while Eric Lippert reposted (most of?) his old articles on his personal WordPress instance[4].
It’s still there[1], although a number of the posts seem to be missing. Inexplicably, (most of) the comments are in place. (There are other Microsoft blogs that have disappeared without a trace, but not this one.)
I don't understand the reasoning behind this. Why do you need 5 bytes of unexecuted patch space before the function _and_ 2 bytes of patch space at the beginning of the function?
Wouldn't it be the same to have a single 5-byte no-op that could be patched into a single long jump, instead of needing space for two jumps?
Because you can’t atomically replace a 5-byte run of NOPs. There’s nothing to prevent you from inserting your patch while a thread is partway through consuming the NOPs; that thread would resume in the middle of your new jump instruction and decode a portion of your patch out of order.
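To make the two-step dance concrete, here’s a sketch (in Python, just simulating the bytes; the offsets and opcodes are the standard x86 encodings, but the layout and names are my own, not Windows code). The 5-byte long jump is written first into the padding, which no thread can be executing; then the 2-byte MOV EDI, EDI is flipped to a 2-byte short jump with a single small write:

```python
# Simulated image: 5 padding bytes, then the function entry point.
image = bytearray(
    b"\x90" * 5          # 5 bytes of padding before the function
    + b"\x8b\xff"        # MOV EDI, EDI -- the 2-byte "pointless" NOP
    + b"\x55"            # push ebp (start of the real prologue)
)

FUNC = 5  # offset of the function entry within `image`

def apply_hotpatch(mem, func, target_rel32):
    # Step 1: write the 5-byte long jump (E9 rel32) into the padding.
    # Safe even if done byte-by-byte: no thread ever executes this region.
    mem[func - 5:func] = b"\xe9" + target_rel32.to_bytes(4, "little", signed=True)
    # Step 2: atomically overwrite MOV EDI,EDI with a 2-byte short jump.
    # EB F9 = JMP rel8 -7: measured from the end of this 2-byte
    # instruction, -7 lands exactly at the start of the padding.
    mem[func:func + 2] = b"\xeb\xf9"

apply_hotpatch(image, FUNC, 0x1234)
```

A thread can only ever observe the old MOV EDI, EDI or the new short jump, never a half-written instruction, because the patch that threads actually execute through is a single 2-byte store.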
Modern x86 processors decode multiple instructions per clock. By “slots”, I’m assuming he means entries in the dispatcher or reservation stations. But NOPs don’t even make it that far; as I said, the decoder that encounters one will probably swallow it and emit nothing.
Besides, it sounds like premature optimization. This isn’t the 1980s; an extra clock cycle per function call is not going to make or break your program.
There is a very good chance this dates back to 16-bit Windows. Even Windows 98 supported the 486, which was not capable of superscalar execution (that arrived with the P5 Pentium) or of decoupling decode from execution (the P6 Pentium Pro).
Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?
https://devblogs.microsoft.com/oldnewthing/20110921-00/?p=95...