This is fantastic for reminding us of how the "programmer's model" of a CPU differs from what's actually happening, which can be considerably more complicated. We humans basically cannot handle anything other than an in-order model where the processor fixes everything up to maintain that appearance.
It seems like you could also model this mentally (although I'm not sure you could instrument it) as: "On receiving an interrupt, do not give any more instructions to the execution units; wait for execution to drain, then transfer control flow to the interrupt handler, and mark the instruction after the last drained instruction as the interrupt return point."
Well, one of the key points is that interrupts (apparently) do not drain currently executing instructions: almost all outstanding instructions are thrown away, including already-executed but unretired instructions. Only the next-to-retire instruction is apparently allowed to complete (possibly because interrupts are only checked every time an instruction retires).
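To make the difference between the two models concrete, here's a toy C++ sketch (my own, not anything measured; the names are made up for illustration):

    #include <deque>

    // Toy model only: in-flight instructions in the ROB, oldest first.
    struct Uop { int seq; };

    // "Drain" model (the mental model above): everything already in flight
    // finishes, and we resume after the last drained instruction.
    int interrupt_drain(std::deque<Uop>& rob) {   // assumes rob is non-empty
        int resume = rob.back().seq + 1;
        rob.clear();
        return resume;
    }

    // "Flush" model (what the measurements suggest): only the next-to-retire
    // instruction completes; everything younger is thrown away, even if it
    // already executed, and we resume right after that one instruction.
    int interrupt_flush(std::deque<Uop>& rob) {   // assumes rob is non-empty
        int resume = rob.front().seq + 1;
        rob.clear();
        return resume;
    }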
I'm not all the way through it yet, but I'm having trouble understanding the "eligible" column. There are a few nops with nonzero entries, and I haven't been able to figure out the pattern yet. I'd appreciate a clarification if you have a moment to spare.
Sure. Here's the section from the post for context:
<quote>
Here, the eligible column indicates when the instruction is eligible for retirement in the sense that it has
finished executing. nop instructions have no dependencies
hence basically execute instantly (in fact, do not execute
at all) and hence are eligible on cycle 0. The other
instructions become eligible as defined by their position in
the mov -> add -> ... dependency chain. The actual column
shows when the instruction actually retires. You can build
it by basically scanning down the eligible column and
retiring up to 4 uops if their eligible cycle is less than
or equal to the current cycle.
</quote>
Basically instructions [1] need to execute before they can retire. Instructions execute out-of-order when their inputs become ready and when some other constraints are met (e.g., their execution port is available). This post [2] is a good introduction.
So here "eligible" is the cycle at which an instruction would be expected to complete execution (starting arbitrarily at cycle 0), if you have a chain of mov instructions they will be eligible at 0, 4, 8, 12, ... since this chain has latency 4. If you have a bunch of independent instructions with throughput 4/cycle, they would be become eligible at 0,0,0,0,1,1,1,1,2,2,2,2 since 4 can start every cycle and they take 1 cycle each.
"Eligible" is relevant here because it defines when the instruction can retire. Unlike execution, retirement happens in order, so you can imagine the retirement unit maintaining a pointer to the oldest (i.e., "higher up" in source order) instruction, and asking "is it eligible yet?" every cycle. If yes, that instruction will be retired in addition to up to the 3 subsequent in-order instructions if they were also eligible.
So from eligible cycle analysis we can reconstruct the retirement pattern (sometimes more than one pattern is possible, I didn't touch on this yet in this post).
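To make that scan concrete, here's a minimal sketch (made-up names, ignoring the front end just like the post does) that reconstructs retire cycles from eligible cycles:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Reconstruct retire cycles from eligible cycles: retirement is in order
    // and at most 4 wide, and an instruction can retire on any cycle >= its
    // eligible cycle. Front-end/allocation limits are ignored, as in the post.
    std::vector<int> retire_cycles(const std::vector<int>& eligible) {
        std::vector<int> retired(eligible.size());
        std::size_t next = 0;                    // oldest unretired instruction
        for (int cycle = 0; next < eligible.size(); ++cycle) {
            for (int n = 0; n < 4 && next < eligible.size()
                             && eligible[next] <= cycle; ++n) {
                retired[next++] = cycle;
            }
        }
        return retired;
    }

    int main() {
        // e.g. four nops (eligible at 0) feeding a latency-4 dependency chain
        std::vector<int> eligible = {0, 0, 0, 0, 4, 8, 12};
        for (int c : retire_cycles(eligible)) std::printf("%d ", c);
        std::printf("\n");                       // prints: 0 0 0 0 4 8 12
    }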
Let me know if that makes sense?
---
[1] Actually uops are what execute, but they are equivalent for this example.
I would think that the fourth nop is eligible only at cycle 1 because only four uops can start each cycle, but then the fifth nop is also eligible at cycle 4. What's different about the fourth nop from the third and fifth?
You are not being thick. It was a typo on my part, I intended for all the nops to have 0 for their eligible column.
As you point out, putting them all at 0 isn't even really accurate, because only 4 can allocate (rename) per cycle, so the true eligible pattern should be 0-0-0-0-1-1-1-1-2-2-2-2 or something like that (it's more complicated than that because the mov/add also needs an allocation slot). I'm basically ignoring that, assuming the front-end (including allocation) can keep up, or basically that the scheduler is infinitely large and always totally full.
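i.e., under just that 4-wide allocation limit, for fully independent zero-latency nops it's roughly (name made up for illustration):

    // Toy formula: eligible cycle of the i-th of a stream of independent,
    // zero-latency nops when only 4 uops can allocate (rename) per cycle.
    int eligible_nop(int i) { return i / 4; }   // 0,0,0,0,1,1,1,1,2,2,2,2,...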
I am going to fix this to be more accurate but I'll have to unroll the loop a bit more for the difference between retire and eligible to make sense.
BTW, this is just a simpler version of what llvm-mca's -timeline view does:
The timeline view simulates a simple CPU pipeline, and with -dispatch=4 -mcpu=skylake it's pretty close to Skylake. Unfortunately, llvm-mca models "unlimited retire bandwidth", so if 100 instructions all become eligible in the same cycle, it will retire all 100. So it's not too helpful in any case where it would retire more than 4 per cycle (but it works for this example because that doesn't happen due to the dep chains).
I didn't find any argument you can pass to limit retire bandwidth. In practice this is not a problem for most performance-oriented investigations because the retire limit isn't really the bottleneck, so the performance reported by the tool is likely to be the same even though it simulates unlimited bandwidth.
I fixed the typo, and also tried to improve the way cycles are accounted for, adding two columns in support of that. If you have a moment, can you check out:
Coincidentally, this recent HN post also contains a fine introduction to processor internals of this sort: https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll.... I know you (BeeOnRope) don't need the introductory parts, but you might appreciate the meat of his post.
Yes, I've read it and have been thinking about how to do this kind of synchronized counter reading from userspace, where it would be easier to access (even if you wanted to go the kernel route, his kernel is not publicly available).
I really like this way of displaying per-cycle information from the performance counters. Very cool.
As a degenerate case, can anyone design a simple 'meta machine' or constraint-set in the instructions which imposes specific costs on the interrupt decision logic? Can one code to this? Is this useful or just entertaining?
It might (for instance) be analogous to designing solutions which fit in L1 cache, for the outcome in time to execute.
Or it might affect the side channel consequences of some forms of crypto.
(or am I missing the obvious "this is, in fact, how it works" answer. I am sure I recall in simpler times reading that writing interrupt handlers demanded some consideration of what was fixed or mutable in the throw)
One thing I have noticed with statistical profilers is that conditional jumps tend to be highlighted a lot.
On the other hand, if I can massage the compiler into using CMOV instead of e.g. JNE, I often end up with much faster code, so the clustering around that instruction ends up being useful sometimes anyways.
I think that's an indication that the branch has a non-negligible misprediction rate. A mispredict means the branch will often sit at the head of the ROB while recovery happens, so it will be heavily represented in the samples. That's also consistent with cmov providing a good speedup: that usually means the underlying branches were hard to predict (otherwise cmov usually ends up slower).
In this post I don't find any particular clustering around branches, but they are all predicted essentially 100% of the time, so that's not surprising.
Out of curiosity, how have you managed to do that? I've noticed even with GCC or Clang's -O3 option I'm still getting jumps rather than conditional moves.
Clang/LLVM has some heuristics where it will convert cmovs generated by earlier passes back into jumps https://reviews.llvm.org/D34769. Also, the compilers can't always prove it's safe to evaluate both sides of a ternary operator, which would be needed to use a cmov.
Here's a compiler explorer sample for clang and GCC that shows all these problems https://godbolt.org/z/IpZfZv; you can change the -x86-cmov-converter option to true to re-enable clang's cmov converter, which will disable the cmov in the last function.
Yeah I have spent a long time fighting to get cmov into code that wants it.
The cmov1 case in your example ends up getting a cmov because the compiler was able to apply the cmov to the pointer rather than to the read value (essentially transforming it to `return (cond ? v1 : v2)->i`).
It's fragile though; if you change it a bit it stops working:
It isn't clear if it stops working because the heuristics no longer favor it (you'd need 2 cmov-y things to handle the +1), or because the compiler no longer sees that the transformation is possible at all.
gcc even fails if you add +1 to both sides, which should be almost as simple as the original:
I have tried to use the "unconditional dereference" trick many times to convince the compiler to emit a cmov, but it seems it doesn't always work because the compiler might re-optimize it back to doing the read only inside the branch/conditional, so later phases don't see the read as safe.
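The pattern I mean is something like this (just a sketch; whether it actually lowers to a cmov depends on the compiler and version):

    // Read both sides up front so the compiler can see both loads are safe,
    // then select; the hope is that the select becomes a cmov instead of a
    // branch. Later passes may still sink one load back under the condition.
    int select_val(bool cond, const int* a, const int* b) {
        int va = *a;
        int vb = *b;
        return cond ? va : vb;
    }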
I was using SBCL[1], not GCC or clang. For that I found that a very short case statement with trivial expressions seems to work, e.g. in a UTF-8 decoder, masking the bits and then using a case statement rather than an if/then/else (i.e., COND in Lisp) construct on them.
As another side note, for the UTF-8 decoder case, it was easy to make ASCII-only text fairly fast with the jumps (it's trivial for the branch predictor), but mixed text suffered greatly.