In my experience, the greatest optimization gains come from introducing more domain knowledge into the code: heuristics that cut the number of calls to some expensive function, shortcuts that introduce good-enough approximations, pre-computing and memoizing values that are actually constant in the data being processed. Things like that.
And in programs that have never been optimized it's usually very easy to fire up the profiler and find some obviously terrible things, like reinitializing some expensive object in a tight loop when it could easily be a singleton (or at least a thread-local object). I once spent a month on a decent piece of code and brought the CPU cost down by literally 100x (nobody had optimized the code before).
Also, always create a reliable framework for measuring progress first (instructions-retired counters on Intel CPUs are very good and don't need many repeated runs to bring the p-values down), and only then start optimizing.
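A minimal Linux-only sketch of such a measurement harness, assuming perf_event_open is available and permitted (everything beyond the kernel API itself, e.g. the function names, is made up for illustration):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    // Open a counter for instructions retired on the calling thread.
    static int open_instructions_counter() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;   // instructions retired
        attr.disabled = 1;                          // start stopped; enable around the region
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0 /*this process*/,
                            -1 /*any cpu*/, -1 /*no group*/, 0);
    }

    int main() {
        int fd = open_instructions_counter();
        if (fd < 0) { std::perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile std::uint64_t sink = 0;
        for (int i = 0; i < 1000000; ++i) sink += i;   // code under measurement goes here

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        std::uint64_t count = 0;
        read(fd, &count, sizeof(count));               // default read_format: just the count
        std::printf("instructions retired: %llu\n", (unsigned long long)count);
        close(fd);
    }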
Low-level, hardcore optimization rarely pays off (not never, but rarely).
Domain-specific optimizations do often yield order-of-magnitude improvements, but once those opportunities have been exploited, there is very often another factor of two or ten remaining.
For example, a three-line change to quicksort, not changing the algorithm at all, more than doubles its speed.
The key insight not taught is that the familiar order notation -- O(N), O(N^2) -- does not dictate what N must count. Traditionally, it counts comparisons, swaps, or other algorithmic events. Nowadays, these operations are practically free, and what matters are cache misses and pipeline stalls.
Because each miss is so devastating, it is essential to keep plenty of work in the pipeline. Interfaces should provide batched operations, because often it takes no more time to, say, produce four hashes than one.
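As an illustration of that batching point, a hedged sketch (hash4 is a made-up name, and FNV-1a is just a convenient stand-in): hashing four equal-length keys in one call lets four independent dependency chains overlap in the pipeline, instead of each hash waiting on the previous one.

    #include <cstddef>
    #include <cstdint>

    // Hash four equal-length keys at once (FNV-1a). The four chains share no
    // results, so the CPU can execute their multiplies in parallel.
    void hash4(const std::uint8_t* keys[4], std::size_t len, std::uint64_t out[4]) {
        std::uint64_t h0 = 14695981039346656037ull, h1 = h0, h2 = h0, h3 = h0;
        for (std::size_t i = 0; i < len; ++i) {
            h0 = (h0 ^ keys[0][i]) * 1099511628211ull;
            h1 = (h1 ^ keys[1][i]) * 1099511628211ull;
            h2 = (h2 ^ keys[2][i]) * 1099511628211ull;
            h3 = (h3 ^ keys[3][i]) * 1099511628211ull;
        }
        out[0] = h0; out[1] = h1; out[2] = h2; out[3] = h3;
    }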
Measurement is necessary, but since it is so hard to know whether a thing is as fast as it could be, reading out the hardware performance counters is essential. How many pipeline stalls, how many cache misses, how many branch mispredictions?
Transforming "if (c) ++i;" to "i += c;" can make a huge difference, in an inner loop, when c is unpredictable. Similarly, "if (c) a -= b;" may become "a -= (-c & b);". Compilers usually will not make these transformations because they guess that c is predictable. When it isn't, you get a pipeline stall every other iteration, with 10-15 cycles down the drain each time, enough for ~40-60 useful instructions.
Branches break dependency chains, so correctly predicted branches enable more instruction-scheduling parallelism, essentially running multiple loop iterations in parallel, whereas the branchless form chains the result into a data dependency. So reasoning about this is very uncertain; you have to measure both ways and see.
Yes, that is why I say it should not be done if something that depends on i is very close by.
But in many algorithms and in a lot of user loops you don't need data from previous iterations; or if you do, the dependency may be far enough away, e.g. if you are running something complex in each iteration.
So you have to be in the case where you are implementing 1) a non-trivial loop that is 2) small enough and in which 3) branches can be predicted. It does happen, but I bet most code is not like that. (I have no clue, though.)
We also have to remember that every added branch may be destroying performance for some other branch out there (and this is way harder to know unless you measure your particular program). That is why I feel removing a branch is almost always a good idea unless proven otherwise, rather than the other way around.
The partition step in quicksort is a textbook example of a result -- the possibly swapped elements -- not being used in the next iteration. (The conditional increment is used immediately, though.) But replacing the innards of the partition loop with
    bool c = *left <= pivot;                              // did *left compare as <= pivot?
    T v[2] = { std::move(*left), std::move(*right) };     // stash both candidates
    *right = std::move(v[1-c]);                           // unconditional "swap" via array index
    *left  = std::move(v[c]);
    left  += c;                                           // conditional advance, no branch
almost doubles the speed. Persuade the compiler to substitute cmov instructions, and you get more than double.
Gcc cannot be persuaded, under any circumstances, to put more than one cmov in a basic block. Clang can be persuaded with a bit of subterfuge, or (less fun) with ?: expressions. Probably the optimizers should recognize the array-index trick, which is formally correct for all types.
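A hedged sketch of what the ?: form might look like (partition_step is a made-up name, and whether the ternaries actually lower to cmov depends on the compiler and flags):

    #include <utility>

    // Same semantics as the array-index snippet above: stash both elements,
    // then select where each goes with ?: instead of v[c] / v[1-c].
    template <typename T>
    T* partition_step(T* left, T* right, const T& pivot) {
        bool c = *left <= pivot;
        T vl = std::move(*left);
        T vr = std::move(*right);
        *left  = c ? std::move(vr) : std::move(vl);   // hope for cmov, not a branch
        *right = c ? std::move(vl) : std::move(vr);
        return left + c;                              // branch-free conditional advance
    }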
Pressure on the branch predictor's cache is hard to quantify. The vendors don't tell us how many branch-prediction slots we get, but you definitely can blow that cache.
We don't know all the dodgy gimcracks that are in modern CPU cores. I do know that loops using FILE getc/putc and streambuf sgetc/sputc are faster than they have any business being. If you can get your loops to look enough like those, they will be improbably fast.
It is all powered by hardware support in CPUs. Intel CPUs have 3 programmable counters; you can set them to count events like instructions retired or Ln cache misses.
Domain knowledge is great and I think 30 years ago this was much more true. These days I think most software has so many mind melting inefficiencies that there is an avalanche of fat waiting to be shaken loose.
Even without accidental O(n^2) nonsense lurking, or huge amounts of time going to fine-grained operations that each depend on something external with large latency, there are usually still gratuitous memory allocations and, of course, pointer chasing.
Taking out allocations from inner loops can easily be a 7x speedup. Taking out pointer chasing, linked lists, virtual pointers etc. and just looping through an array of structs can be another 25x for the areas that use them.
From there it's off to SIMD and multi-threading but I don't expect that from most software.
Interestingly, none of these rely on programming directly in assembly.
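A hedged sketch of that pointer-chasing point, with made-up types: the same traversal over a linked list (one dependent load per element) versus a contiguous array of structs that the prefetcher can stream through.

    #include <vector>

    struct Particle { float x, y, z, mass; };

    // Pointer-chasing version: each iteration waits on the previous load.
    struct Node { Particle p; Node* next; };
    float total_mass_list(const Node* head) {
        float total = 0;
        for (const Node* n = head; n != nullptr; n = n->next)
            total += n->p.mass;
        return total;
    }

    // Contiguous version: sequential access, no per-element allocation.
    float total_mass_array(const std::vector<Particle>& ps) {
        float total = 0;
        for (const Particle& p : ps)
            total += p.mass;
        return total;
    }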
It is a good point that software has a lot of fat these days, and therefore optimization is an easier task (because it is done much more rarely). Windows 98, for however many bugs it had, was so much more responsive.
Low-level optimization (assembly) is mostly only useful for algorithms that have instruction-level parallelism; modern compilers are incredibly good at optimizing serial algorithms.
Making use of compile-time branch reduction can be massive: for example, template parameters that are evaluated at compile time, skipping lots of work at runtime.
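A minimal sketch of that idea, assuming C++17 (the names process and kWithChecksum are made up): the bool is a template parameter, so the untaken path is removed at compile time and never costs a runtime branch.

    #include <cstddef>
    #include <cstdint>

    template <bool kWithChecksum>
    std::uint64_t process(const std::uint8_t* data, std::size_t n) {
        std::uint64_t sum = 0, checksum = 0;
        for (std::size_t i = 0; i < n; ++i) {
            sum += data[i];
            if constexpr (kWithChecksum) {          // resolved at compile time
                checksum = checksum * 31 + data[i];
            }
        }
        return kWithChecksum ? sum ^ checksum : sum;
    }

    // The caller picks the variant once; each instantiation's inner loop is branch-free.
    // auto fast    = process<false>(buf, len);
    // auto checked = process<true>(buf, len);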
I recently tried to optimise a loop looking for a value in a small array of short integers (8 elements). Not so easy -- there are so many choices: a tight loop (a priori not an optimal choice, but it uses only one instruction cache line), an unrolled loop, cmovs, SIMD?
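A hedged sketch of the SIMD candidate, assuming the 8 shorts are contiguous and SSE2 plus C++20 are available (find_short8 is a made-up name; whether it beats a plain unrolled loop still has to be measured):

    #include <cstdint>
    #include <bit>            // std::countr_zero (C++20)
    #include <emmintrin.h>    // SSE2

    // Compare all 8 lanes at once; return the index of the first match, or 8.
    int find_short8(const std::int16_t* a, std::int16_t value) {
        __m128i v   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i key = _mm_set1_epi16(value);
        int mask    = _mm_movemask_epi8(_mm_cmpeq_epi16(v, key));  // 2 bits per matching lane
        if (mask == 0) return 8;                                   // not found
        return std::countr_zero(static_cast<unsigned>(mask)) / 2;  // byte index -> lane index
    }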