Yes, that is why I say it should not be done if something very close by depends on i.
But in many algorithms, and in a lot of user loops, you don't need data from previous iterations; or if you do, the dependencies may be far enough apart, e.g. if you are running something complex in each iteration.
So you have to be in the case where you are implementing 1) a non-trivial loop that is 2) small enough and in which 3) the branches can be predicted. That does happen, but I bet most code is not like that. (I have no hard numbers, though.)
We also have to remember that every added branch may be degrading performance for some other branch out there (and this is much harder to know unless you measure your particular program). That is why I feel removing a branch is almost always a good idea unless proven otherwise, rather than the other way around.
The partition used in quicksort provides a textbook example of a result--the possibly swapped elements--that is not used in the next iteration. (The conditional increment is used immediately, though.) But replacing the innards of partition's loop with
bool c = *left <= pivot;
T v[2] = { std::move(*left), std::move(*right) };
*right = std::move(v[1-c]), *left = std::move(v[c]);
left += c;
almost doubles the speed. Persuade the compiler to substitute cmov instructions, and you get more than double.
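For reference, here is a runnable sketch of a whole partition built around that trick. This is my own reconstruction with plain ints, not the poster's code; the poster's version moves a general T:

```cpp
#include <cassert>

// Lomuto-style partition using the two-element-array trick (a sketch with
// int; the original uses std::move on a general T). Elements <= pivot end
// up before the returned pointer. Each iteration writes both slots
// unconditionally, so there is no data-dependent branch in the loop body.
inline int* partition_branchless(int* first, int* last, int pivot) {
    int* store = first;                  // boundary of the <= pivot region
    for (int* scan = first; scan != last; ++scan) {
        bool c = *scan <= pivot;         // 0 or 1
        int v[2] = { *store, *scan };
        *scan  = v[1 - c];               // c: gets old *store; else unchanged
        *store = v[c];                   // c: gets old *scan;  else unchanged
        store += c;                      // conditional increment, branch-free
    }
    return store;
}
```

Whether the selects actually lower to cmov still takes persuasion, as noted above; the conditional increment is just an add of the flag.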
GCC cannot be persuaded, under any circumstances, to put more than one cmov in a basic block. Clang can be persuaded, with a bit of subterfuge:
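One guess at the kind of subterfuge meant (an assumption on my part, not the poster's code) is clang's documented `__builtin_unpredictable` hint, which marks a condition as effectively random and steers the optimizer away from a branch toward cmov:

```cpp
#include <cassert>

// Sketch: __builtin_unpredictable is a clang extension; guarded so the
// snippet still compiles on non-clang compilers, where it is a no-op hint.
#if defined(__clang__)
#  define UNPREDICTABLE(c) __builtin_unpredictable(c)
#else
#  define UNPREDICTABLE(c) (c)
#endif

// Two data-dependent conditions in one basic block.
inline int clamp_to(int x, int lo, int hi) {
    if (UNPREDICTABLE(x < lo)) x = lo;
    if (UNPREDICTABLE(x > hi)) x = hi;
    return x;
}
```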
or (less fun) with ?: expressions. The optimizers probably ought to recognize the array-index trick, which is formally correct for all types.
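The ?: version of the same conditional exchange would look something like this (a sketch with int; `c` selects which value lands where, mirroring the array-index form above):

```cpp
#include <cassert>

// One loop step of the exchange written with ?: expressions instead of the
// two-element array (sketch with int; the original moves a general T).
inline void exchange_step(int*& left, int* right, int pivot) {
    bool c = *left <= pivot;
    int a = *left, b = *right;
    *right = c ? a : b;      // each ?: is a candidate for a cmov
    *left  = c ? b : a;
    left  += c;              // conditional increment, branch-free
}
```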
Pressure on the branch predictor's cache is hard to quantify: CPU vendors don't tell us how many branch-prediction slots we get. But you definitely can blow that cache.
We don't know all the dodgy gimcracks in modern CPU cores. I do know that loops with FILE getc/putc and streambuf sgetc/sputc are faster than they have any business being. If you can get your loops to look enough like those, they will be improbably fast.