Yes, that is why I say it should not be done if something very close by depends on i.
But in many algorithms, and in a lot of user loops, you don't need data from previous iterations; or if you do, the dependencies may be far enough apart, e.g. if you are running something complex in each iteration.
So you have to be in the case where you are implementing 1) a non-trivial loop that is 2) small enough and in which 3) the branches can be predicted. That does happen, but I bet most code is not like that. (I have no hard numbers, though.)
We also have to remember that every added branch may be degrading performance for some other branch out there (and this is much harder to know unless you measure your particular program). That is why I feel removing a branch is almost always a good idea unless proven otherwise, rather than the other way around.
The partition used in quicksort provides a textbook example of a result--the possibly swapped elements--that is not used in the next iteration. (The conditional increment is used immediately, though.) But replacing the innards of partition's loop with
bool c = *left <= pivot;
T v[2] = { std::move(*left), std::move(*right) };
*right = std::move(v[1-c]), *left = std::move(v[c]);
left += c;
almost doubles the speed. Persuade the compiler to substitute cmov instructions, and you get more than double.
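For reference, here is a runnable sketch of a whole partition built around that trick. This is my own reconstruction with plain ints, not the poster's code; the poster's version moves a general T:

```cpp
#include <cassert>

// Lomuto-style partition using the two-element-array trick (a sketch with
// int; the original uses std::move on a general T). Elements <= pivot end
// up before the returned pointer. Each iteration writes both slots
// unconditionally, so there is no data-dependent branch in the loop body.
inline int* partition_branchless(int* first, int* last, int pivot) {
    int* store = first;                  // boundary of the <= pivot region
    for (int* scan = first; scan != last; ++scan) {
        bool c = *scan <= pivot;         // 0 or 1
        int v[2] = { *store, *scan };
        *scan  = v[1 - c];               // c: gets old *store; else unchanged
        *store = v[c];                   // c: gets old *scan;  else unchanged
        store += c;                      // conditional increment, branch-free
    }
    return store;
}
```

Whether the selects actually lower to cmov still takes persuasion, as noted above; the conditional increment is just an add of the flag.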
GCC cannot be persuaded, under any circumstances, to put more than one cmov in a basic block. Clang can be persuaded, with a bit of subterfuge:
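One guess at the kind of subterfuge meant (an assumption on my part, not the poster's code) is clang's documented `__builtin_unpredictable` hint, which marks a condition as effectively random and steers the optimizer away from a branch toward cmov:

```cpp
#include <cassert>

// Sketch: __builtin_unpredictable is a clang extension; guarded so the
// snippet still compiles on non-clang compilers, where it is a no-op hint.
#if defined(__clang__)
#  define UNPREDICTABLE(c) __builtin_unpredictable(c)
#else
#  define UNPREDICTABLE(c) (c)
#endif

// Two data-dependent conditions in one basic block.
inline int clamp_to(int x, int lo, int hi) {
    if (UNPREDICTABLE(x < lo)) x = lo;
    if (UNPREDICTABLE(x > hi)) x = hi;
    return x;
}
```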
or (less fun) with ?: expressions. The optimizers probably ought to recognize the array-index trick, which is formally correct for all types.
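The ?: version of the same conditional exchange would look something like this (a sketch with int; `c` selects which value lands where, mirroring the array-index form above):

```cpp
#include <cassert>

// One loop step of the exchange written with ?: expressions instead of the
// two-element array (sketch with int; the original moves a general T).
inline void exchange_step(int*& left, int* right, int pivot) {
    bool c = *left <= pivot;
    int a = *left, b = *right;
    *right = c ? a : b;      // each ?: is a candidate for a cmov
    *left  = c ? b : a;
    left  += c;              // conditional increment, branch-free
}
```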
Pressure on the branch predictor's cache is hard to quantify: CPU vendors don't tell us how many branch-prediction slots we get. But you definitely can blow that cache.
We don't know all the dodgy gimcracks in modern CPU cores. I do know that loops with FILE getc/putc and streambuf sgetc/sputc are faster than they have any business being. If you can get your loops to look enough like those, they will be improbably fast.