This is fantastic for reminding us of how the "programmer's model" of a CPU differs from what's actually happening, which can be considerably more complicated. We humans basically cannot handle anything other than an in-order model where the processor fixes everything up to maintain that appearance.
It seems like you could also model this mentally (although I'm not sure you could instrument it) as: "On receiving an interrupt, do not give any more instructions to the execution units; wait for execution to drain, then transfer control flow to the interrupt handler, and mark the instruction after the last drained instruction as the interrupt return point."
Well, one of the key points is that interrupts (apparently) do not drain currently executing instructions: almost all outstanding instructions are thrown away, including already-executed but unretired instructions. Only the next-to-retire instruction is apparently allowed to complete (possibly because interrupts are only checked every time an instruction retires).
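To make the difference between the two models concrete, here's a toy C++ sketch (my own, not anything measured; the names are made up for illustration):

    #include <deque>

    // Toy model only: in-flight instructions in the ROB, oldest first.
    struct Uop { int seq; };

    // "Drain" model (the mental model above): everything already in flight
    // finishes, and we resume after the last drained instruction.
    int interrupt_drain(std::deque<Uop>& rob) {   // assumes rob is non-empty
        int resume = rob.back().seq + 1;
        rob.clear();
        return resume;
    }

    // "Flush" model (what the measurements suggest): only the next-to-retire
    // instruction completes; everything younger is thrown away, even if it
    // already executed, and we resume right after that one instruction.
    int interrupt_flush(std::deque<Uop>& rob) {   // assumes rob is non-empty
        int resume = rob.front().seq + 1;
        rob.clear();
        return resume;
    }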
I'm not all the way through it yet, but I'm having trouble understanding the "eligible" column. There are a few nops with nonzero entries, and I haven't been able to figure out the pattern yet. I'd appreciate a clarification if you have a moment to spare.
Sure. Here's the section from the post for context:
<quote>
Here, the eligible column indicates when the instruction is eligible for retirement in the sense that it has
finished executing. nop instructions have no dependencies
hence basically execute instantly (in fact, do not execute
at all) and hence are eligible on cycle 0. The other
instructions become eligible as defined by their position in
the mov -> add -> ... dependency chain. The actual column
shows when the instruction actually retires. You can build
it by basically scanning down the eligible column and
retiring up to 4 uops if their eligible cycle is less than
or equal to the current cycle.
</quote>
Basically instructions [1] need to execute before they can retire. Instructions execute out-of-order when their inputs become ready and when some other constraints are met (e.g., their execution port is available). This post [2] is a good introduction.
So here "eligible" is the cycle at which an instruction would be expected to complete execution (starting arbitrarily at cycle 0), if you have a chain of mov instructions they will be eligible at 0, 4, 8, 12, ... since this chain has latency 4. If you have a bunch of independent instructions with throughput 4/cycle, they would be become eligible at 0,0,0,0,1,1,1,1,2,2,2,2 since 4 can start every cycle and they take 1 cycle each.
"Eligible" is relevant here because it defines when the instruction can retire. Unlike execution, retirement happens in order, so you can imagine the retirement unit maintaining a pointer to the oldest (i.e., "higher up" in source order) instruction, and asking "is it eligible yet?" every cycle. If yes, that instruction will be retired in addition to up to the 3 subsequent in-order instructions if they were also eligible.
So from eligible cycle analysis we can reconstruct the retirement pattern (sometimes more than one pattern is possible, I didn't touch on this yet in this post).
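To make that scan concrete, here's a minimal sketch (made-up names, ignoring the front end just like the post does) that reconstructs retire cycles from eligible cycles:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Reconstruct retire cycles from eligible cycles: retirement is in order
    // and at most 4 wide, and an instruction can retire on any cycle >= its
    // eligible cycle. Front-end/allocation limits are ignored, as in the post.
    std::vector<int> retire_cycles(const std::vector<int>& eligible) {
        std::vector<int> retired(eligible.size());
        std::size_t next = 0;                    // oldest unretired instruction
        for (int cycle = 0; next < eligible.size(); ++cycle) {
            for (int n = 0; n < 4 && next < eligible.size()
                             && eligible[next] <= cycle; ++n) {
                retired[next++] = cycle;
            }
        }
        return retired;
    }

    int main() {
        // e.g. four nops (eligible at 0) feeding a latency-4 dependency chain
        std::vector<int> eligible = {0, 0, 0, 0, 4, 8, 12};
        for (int c : retire_cycles(eligible)) std::printf("%d ", c);
        std::printf("\n");                       // prints: 0 0 0 0 4 8 12
    }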
Let me know if that makes sense?
---
[1] Actually uops are what execute, but they are equivalent for this example.
I would think that the fourth nop is eligible only at cycle 1 because only four uops can start each cycle, but then the fifth nop is also eligible at cycle 4. What's different about the fourth nop from the third and fifth?
You are not being thick. It was a typo on my part, I intended for all the nops to have 0 for their eligible column.
As you point out, putting them all at 0 isn't even really accurate, because only 4 can allocate (rename) per cycle, so the true eligible pattern should be 0-0-0-0-1-1-1-1-2-2-2-2 or something like that (it's more complicated than that because the mov/add also needs an allocation slot). I'm basically ignoring that, assuming the front-end (including allocation) can keep up, or basically that the scheduler is infinitely large and always totally full.
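i.e., under just that 4-wide allocation limit, for fully independent zero-latency nops it's roughly (name made up for illustration):

    // Toy formula: eligible cycle of the i-th of a stream of independent,
    // zero-latency nops when only 4 uops can allocate (rename) per cycle.
    int eligible_nop(int i) { return i / 4; }   // 0,0,0,0,1,1,1,1,2,2,2,2,...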
I am going to fix this to be more accurate but I'll have to unroll the loop a bit more for the difference between retire and eligible to make sense.
BTW, this is just a simpler version of what llvm-mca's -timeline view does:
The timeline view simulates a simple CPU pipeline, and with -dispatch=4 -mcpu=skylake it's pretty close to Skylake. Unfortunately, llvm-mca models "unlimited retire bandwidth", so if 100 instructions all become eligible in the same cycle, it will retire all 100. So it's not too helpful in any case where it would retire more than 4 per cycle (but it works for this example because that doesn't happen due to the dep chains).
I didn't find any argument you can pass to limit retire bandwidth. In practice this is not a problem for most performance-oriented investigations because the retire limit isn't really the bottleneck, so the performance reported by the tool is likely to be the same even though it simulates unlimited bandwidth.
I fixed the typo, and also tried to improve the way cycles are accounted for, adding two columns in support of that. If you have a moment, can you check out:
Coincidentally, this recent HN post also contains a fine introduction to processor internals of this sort: https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll.... I know you (BeeOnRope) don't need the introductory parts, but you might appreciate the meat of his post.
Yes, I've read it and have been thinking about how to do this kind of synchronized counter reading from userspace, where it would be easier to access (even if you wanted to go the kernel route, his kernel is not publicly available).
I really like this way of displaying per-cycle information from the performance counters. Very cool.
As a degenerate case, can anyone design a simple 'meta machine' or constraint-set in the instructions which imposes specific costs on the interrupt decision logic? Can one code to this? Is this useful or just entertaining?
It might (for instance) be analogous to designing solutions which fit in L1 cache, for the outcome in time to execute.
Or it might affect the side channel consequences of some forms of crypto.
(or am I missing the obvious "this is, in fact, how it works" answer. I am sure I recall in simpler times reading that writing interrupt handlers demanded some consideration of what was fixed or mutable in the throw)
One thing I have noticed with statistical profilers is that conditional jumps tend to be highlighted a lot.
On the other hand, if I can massage the compiler into using CMOV instead of e.g. JNE, I often end up with much faster code, so the clustering around that instruction ends up being useful sometimes anyways.
I think that's an indication that the branch has a non-negligible misprediction rate. A mispredict means the branch will often sit at the head of the ROB while recovery happens, so it will be heavily represented in the samples. That's also consistent with cmov providing a good speedup: that usually means the underlying branches were hard to predict (otherwise cmov usually ends up slower).
In this post I don't find any particular clustering around branches, but they are all predicted essentially 100% of the time, so that's not surprising.
Out of curiosity, how have you managed to do that? I've noticed even with GCC or Clang's -O3 option I'm still getting jumps rather than conditional moves.
Clang/LLVM has some heuristics where it will convert cmovs generated by earlier passes back into jumps https://reviews.llvm.org/D34769. Also, the compilers can't always prove it's safe to evaluate both sides of a ternary operator, which would be needed to use a cmov.
Here's a compiler explorer sample for clang and GCC that shows all these problems https://godbolt.org/z/IpZfZv; you can change the -x86-cmov-converter option to true to re-enable clang's cmov converter, which will disable the cmov in the last function.
Yeah I have spent a long time fighting to get cmov into code that wants it.
The cmov1 case in your example ends up getting a cmov because the compiler was able to apply the cmov to the pointer rather than to the read value (essentially transforming it to `return (cond ? v1 : v2)->i`).
It's fragile though; if you change it a bit it stops working:
It isn't clear if it stops working because the heuristics no longer favor it (you'd need 2 cmov-y things to handle the +1), or because the compiler no longer sees that the transformation is possible at all.
gcc even fails if you add +1 to both sides, which should be almost as simple as the original:
I have tried to use the "unconditional dereference" trick many times to convince the compiler to emit a cmov, but it seems it doesn't always work because the compiler might re-optimize it back to doing the read only inside the branch/conditional, so later phases don't see the read as safe.
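The pattern I mean is something like this (just a sketch; whether it actually lowers to a cmov depends on the compiler and version):

    // Read both sides up front so the compiler can see both loads are safe,
    // then select; the hope is that the select becomes a cmov instead of a
    // branch. Later passes may still sink one load back under the condition.
    int select_val(bool cond, const int* a, const int* b) {
        int va = *a;
        int vb = *b;
        return cond ? va : vb;
    }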
I was using SBCL[1], not GCC or clang. For that I found that a very short case statement with trivial expressions seems to work, e.g. in a UTF-8 decoder, masking the bits and then using a case statement rather than an if/then/else (i.e., COND in Lisp) construct on them.
As another side note, for the UTF-8 decoder case, it was easy to make ASCII-only text fairly fast with the jumps (it's trivial for the branch predictor), but mixed text suffered greatly.