Ah there you go, that's the conditional move on ARM. Yeah, those are slow.

tylerhou · on Aug 24, 2023

"Conditional move is slow" is not the right takeaway. (In fact, conditional move can be much /faster/ than a branch in certain circumstances (e.g. hard-to-predict branches, like binary search).)

The reason why this code is slow is because the conditional move is on the critical path for resolving a loop-carried data dependency. In other words, to execute the assignment `maxV = (v if (v > maxV) else maxV)`, the CPU has to test v > maxV. But in order to make this test, the CPU has to wait until the conditional move in the previous loop iteration finishes (to know what the new value of maxV is). So the overall execution time of the loop is worse, because each iteration depends on the previous iteration. (This is analogous to why traversing a linked list is slower than an array.)

However, for the branch version, there is no data dependency on the previous loop, so the CPU really can (speculatively) execute multiple loop iterations in parallel. The the next loop iteration can start before the previous finishes. And with the large reorder buffers on modern CPUs, maybe up to 4-8 loop iterations can execute in parallel!

On the other hand, conditional moves would be much faster for a loop like:

  bigs := make([]int, len(numsA))
  for idx, v : range numsA {
    if numsA[idx] > numsB[idx] {
      bigs[idx] = numsA[idx]
    } else {
      bigs[idx] = numsB[idx]
    }
  }

There is no loop-carried data dependency, so the conditional moves can resolve in parallel. The above code with a conditional move is likely to be much faster than code with an equivalent branch. (And even if branches are perfectly predicted, the conditional move version will only be slightly slower, not 2X slower.)

mcv · on Aug 24, 2023

But why does the print statement matter in how the compiler decides this? In both cases, the execution still depends on the vMax value calculated in the previous iteration.

Majromax · on Aug 24, 2023

The version with the print statement requires the branch because it's conditionally executed; the 'continue' statement in the if-block will skip it if the loop finds a new maximum.

Since the function-with-if version requires a branch, there's no advantage (regardless of predictability) in using a conditional move for the assignment to maximum.

kaba0 · on Aug 24, 2023

So basically the print prevents an optimization, that so happens to hinder performance in this particular case?

James_K · on Aug 24, 2023

How exactly does that change anything? It's being run on an increasing array, so the data is assigned every loop. Surely the comparison creates a dependency on the value of the variable which was assigned in the last loop.

anonymoushn · on Aug 24, 2023

For branch instructions but not for conditional move instructions, the processor will happily execute the next instruction before the values are actually ready, then invalidate the work it has done later if it turns out retrospectively to have been wrong. This is called branch prediction.

vlovich123 · on Aug 24, 2023

Technically my understanding of terminology is that the branch predictor is responsible for determining which branch to take when speculative execution encounters a branch. Speculative execution is the thing that will invalidate work if the branch predictor was wrong.

Also my understanding based on reading this thread is that a conditional move is harder to speculate across because of the data dependency carried across iteration loops, preventing speculation. With a conditional branch, speculation can keep executing the next loop and most of the time the branch predictor will guess correctly how to execute the next iteration for this problem.

Dylan16807 · on Aug 24, 2023

I wouldn't say it's harder to speculate the result of a conditional move, more that processors don't do it. Or even stronger, that a conditional move instruction is intentionally opting out of prediction for one reason or another.

tylerhou · on Aug 24, 2023

For a strictly increasing array, assuming the branch predictor always predicts correctly (guesses taken), the CPU does not have to wait for the test to resolve (v > maxV) before it executes the assignment (maxV = v). So effectively the CPU is just running

  maxV = v
  maxV = v
  maxV = v
  maxV = v

over and over (with tests running in parallel as well, but those matter less, because they are mostly just "checking" whether the branch predictor guessed correctly.).

Then, in that above sequence, there is no /read/ of maxV after each write, so there is no true data dependency. (We don't need to wait for the last write to finish before we write again, unlike the read after write case.)

fmstephe · on Aug 24, 2023

So would you say in the general case, i.e. the compiler obviously doesn't know how predictable a branch may be, using conditional move is a poor choice when writing to an address which is repeatedly written to inside the loop?

What I am trying to get a feel for, is whether this represents a bad choice of assembly in general for this kind of loop.

tylerhou · on Aug 24, 2023

Nit: If a variable/object is only being written, that’s fine. There is a data dependency here because there is a read after a write.

I don’t have enough compiler knowledge to be able to definitively answer your question, but I think it would be a reasonable heuristic to favor emitting a branch over a cmov in cases where the cmov participates in a loop-carried data dependency.

The other advantage of emitting a branch is that branch prediction can better adapt to runtime conditions. Branch prediction tends to struggle when data is perfectly random, but most data is not. So in general it’s also reasonable for compilers to default to emitting branches.

fmstephe · on Aug 24, 2023

Yeah, after reading the blog post I felt reasonably confident that this is a compiler bug (in terms of perf).

Hopefully someone with a deeper understanding can verify or discredit this here. I'm curious to see the fix for this (I assume it will generate a fix).

tylerhou · on Aug 24, 2023

I don't know if I would classify it as a bug, but definitely a heuristic that could use tuning.

At least (at a high level) in LLVM, branches and cmovs are represented with the exact same construct, and one of the codegen passes looks at the condition and the two sides and heuristically determines whether to emit a branch or a cmov.

I don't know how Go codegen works, but I assume they do something similar.

dataflow · on Aug 24, 2023

Confused, why do you think this is a compiler bug? You should definitely expect a predictable branch to be faster than a conditional move, and the insertion of a conditional move in the original is totally sensible (albeit not intuitive if you haven't seen it).

littlestymaar · on Aug 24, 2023

> and the insertion of a conditional move in the original is totally sensible (albeit not intuitive if you haven't seen it).

Would you mind expanding? If the conditional move isn't needed, and given that it's costly in terms of perfs, how is that “totally sensible” to have one here?

dataflow · on Aug 24, 2023

The performance of branches is data-dependent. The performance of conditional moves is data-independent. In this case the branches are predictable, so they perform better here. In general, though, the compiler has no idea whether that's the case, so it makes sense to insert a conditional move to avoid branch misprediction penalties.

mcv · on Aug 24, 2023

But still, why the difference? With or without the "lol", the data dependence is identical, the predictability is identical. Yet with the "lol" it optimises differently.

My best guess is that it's because of the added cost to the other branch: printing is expensive, so the compiler really doesn't want to do that, and would prefer to incorrectly predict not-printing than to incorrectly predict printing.

If that tradeoff is what causes the difference in behaviour here, then I understand. But I don't like it, because it's ridiculous and we shouldn't have to con the compiler into doing what's right. If we're going to have to do this, I'd rather have the language provide a way for us to make explicit what we expect here.

anonymoushn · on Aug 24, 2023

When compiling code like

  if a > b
    b = a
    continue
  else
    continue

you can replace the entire if-else with a cmov. but if there is a function call in one of the prongs and not in the other, you cannot.

mcv · on Aug 24, 2023

Ah, so it's the function call (which could just as easily have been something other than print), that forces the compiler to use a normally-less-optimal branch rather than the normally-more-optimal cmov. Only in this particular use case, cmov is actually less optimal, but the compiler doesn't know that.

fmstephe · on Aug 24, 2023

This is a very good answer. Thanks.

tylerhou · on Aug 24, 2023

https://news.ycombinator.com/item?id=36623759

littlestymaar · on Aug 24, 2023

Thank you.