"Conditional move is slow" is not the right takeaway. (In fact, conditional move can be much /faster/ than a branch in certain circumstances (e.g. hard-to-predict branches, like binary search).)
The reason why this code is slow is because the conditional move is on the critical path for resolving a loop-carried data dependency. In other words, to execute the assignment `maxV = (v if (v > maxV) else maxV)`, the CPU has to test v > maxV. But in order to make this test, the CPU has to wait until the conditional move in the previous loop iteration finishes (to know what the new value of maxV is). So the overall execution time of the loop is worse, because each iteration depends on the previous iteration. (This is analogous to why traversing a linked list is slower than an array.)
However, for the branch version, there is no data dependency on the previous loop, so the CPU really can (speculatively) execute multiple loop iterations in parallel. The the next loop iteration can start before the previous finishes. And with the large reorder buffers on modern CPUs, maybe up to 4-8 loop iterations can execute in parallel!
On the other hand, conditional moves would be much faster for a loop like:
bigs := make([]int, len(numsA))
for idx, v : range numsA {
if numsA[idx] > numsB[idx] {
bigs[idx] = numsA[idx]
} else {
bigs[idx] = numsB[idx]
}
}
There is no loop-carried data dependency, so the conditional moves can resolve in parallel. The above code with a conditional move is likely to be much faster than code with an equivalent branch. (And even if branches are perfectly predicted, the conditional move version will only be slightly slower, not 2X slower.)
But why does the print statement matter in how the compiler decides this? In both cases, the execution still depends on the vMax value calculated in the previous iteration.
The version with the print statement requires the branch because it's conditionally executed; the 'continue' statement in the if-block will skip it if the loop finds a new maximum.
Since the function-with-if version requires a branch, there's no advantage (regardless of predictability) in using a conditional move for the assignment to maximum.
How exactly does that change anything? It's being run on an increasing array, so the data is assigned every loop. Surely the comparison creates a dependency on the value of the variable which was assigned in the last loop.
For branch instructions but not for conditional move instructions, the processor will happily execute the next instruction before the values are actually ready, then invalidate the work it has done later if it turns out retrospectively to have been wrong. This is called branch prediction.
Technically my understanding of terminology is that the branch predictor is responsible for determining which branch to take when speculative execution encounters a branch. Speculative execution is the thing that will invalidate work if the branch predictor was wrong.
Also my understanding based on reading this thread is that a conditional move is harder to speculate across because of the data dependency carried across iteration loops, preventing speculation. With a conditional branch, speculation can keep executing the next loop and most of the time the branch predictor will guess correctly how to execute the next iteration for this problem.
I wouldn't say it's harder to speculate the result of a conditional move, more that processors don't do it. Or even stronger, that a conditional move instruction is intentionally opting out of prediction for one reason or another.
For a strictly increasing array, assuming the branch predictor always predicts correctly (guesses taken), the CPU does not have to wait for the test to resolve (v > maxV) before it executes the assignment (maxV = v). So effectively the CPU is just running
maxV = v
maxV = v
maxV = v
maxV = v
over and over (with tests running in parallel as well, but those matter less, because they are mostly just "checking" whether the branch predictor guessed correctly.).
Then, in that above sequence, there is no /read/ of maxV after each write, so there is no true data dependency. (We don't need to wait for the last write to finish before we write again, unlike the read after write case.)
So would you say in the general case, i.e. the compiler obviously doesn't know how predictable a branch may be, using conditional move is a poor choice when writing to an address which is repeatedly written to inside the loop?
What I am trying to get a feel for, is whether this represents a bad choice of assembly in general for this kind of loop.
Nit: If a variable/object is only being written, that’s fine. There is a data dependency here because there is a read after a write.
I don’t have enough compiler knowledge to be able to definitively answer your question, but I think it would be a reasonable heuristic to favor emitting a branch over a cmov in cases where the cmov participates in a loop-carried data dependency.
The other advantage of emitting a branch is that branch prediction can better adapt to runtime conditions. Branch prediction tends to struggle when data is perfectly random, but most data is not. So in general it’s also reasonable for compilers to default to emitting branches.
Yeah, after reading the blog post I felt reasonably confident that this is a compiler bug (in terms of perf).
Hopefully someone with a deeper understanding can verify or discredit this here. I'm curious to see the fix for this (I assume it will generate a fix).
I don't know if I would classify it as a bug, but definitely a heuristic that could use tuning.
At least (at a high level) in LLVM, branches and cmovs are represented with the exact same construct, and one of the codegen passes looks at the condition and the two sides and heuristically determines whether to emit a branch or a cmov.
I don't know how Go codegen works, but I assume they do something similar.
Confused, why do you think this is a compiler bug? You should definitely expect a predictable branch to be faster than a conditional move, and the insertion of a conditional move in the original is totally sensible (albeit not intuitive if you haven't seen it).
> and the insertion of a conditional move in the original is totally sensible (albeit not intuitive if you haven't seen it).
Would you mind expanding? If the conditional move isn't needed, and given that it's costly in terms of perfs, how is that “totally sensible” to have one here?
The performance of branches is data-dependent. The performance of conditional moves is data-independent. In this case the branches are predictable, so they perform better here. In general, though, the compiler has no idea whether that's the case, so it makes sense to insert a conditional move to avoid branch misprediction penalties.
But still, why the difference? With or without the "lol", the data dependence is identical, the predictability is identical. Yet with the "lol" it optimises differently.
My best guess is that it's because of the added cost to the other branch: printing is expensive, so the compiler really doesn't want to do that, and would prefer to incorrectly predict not-printing than to incorrectly predict printing.
If that tradeoff is what causes the difference in behaviour here, then I understand. But I don't like it, because it's ridiculous and we shouldn't have to con the compiler into doing what's right. If we're going to have to do this, I'd rather have the language provide a way for us to make explicit what we expect here.
Ah, so it's the function call (which could just as easily have been something other than print), that forces the compiler to use a normally-less-optimal branch rather than the normally-more-optimal cmov. Only in this particular use case, cmov is actually less optimal, but the compiler doesn't know that.