In my experience, the greatest optimization gains come from introducing more domain knowledge into the code: heuristics that cut the number of calls to some expensive function, shortcuts that introduce good-enough approximations, pre-computing and memoizing values that are actually constant in the data being processed. Things like that.
And in programs that have never been optimized it's usually very easy to fire up the profiler and find some obviously terrible things, like reinitializing some expensive object in a tight loop when it could easily be a singleton (or at least a thread-local object). I once spent a month on a decent piece of code and brought the CPU cost down by literally 100x (nobody had optimized the code before).
Also, always create a reliable framework for measuring progress first (instructions-retired counters on Intel CPUs are very good and don't need many repeated runs to bring the p-values down), and only then start optimizing.
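A minimal Linux-only sketch of such a measurement harness, assuming perf_event_open is available and permitted (everything beyond the kernel API itself, e.g. the function names, is made up for illustration):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    // Open a counter for instructions retired on the calling thread.
    static int open_instructions_counter() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;   // instructions retired
        attr.disabled = 1;                          // start stopped; enable around the region
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0 /*this process*/,
                            -1 /*any cpu*/, -1 /*no group*/, 0);
    }

    int main() {
        int fd = open_instructions_counter();
        if (fd < 0) { std::perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile std::uint64_t sink = 0;
        for (int i = 0; i < 1000000; ++i) sink += i;   // code under measurement goes here

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        std::uint64_t count = 0;
        read(fd, &count, sizeof(count));               // default read_format: just the count
        std::printf("instructions retired: %llu\n", (unsigned long long)count);
        close(fd);
    }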
Low-level, hardcore optimization rarely pays off (not never, but rarely).
Domain-specific optimizations do often yield order-of-magnitude improvements, but once those opportunities have been exploited, there is very often another factor of two or ten remaining.
For example, a three-line change to quicksort, not changing the algorithm at all, more than doubles its speed.
The key insight not taught is that the familiar order notation -- O(N), O(N^2) -- does not dictate what N must count. Traditionally, it counts comparisons, swaps, or other algorithmic events. Nowadays, these operations are practically free, and what matters are cache misses and pipeline stalls.
Because each miss is so devastating, it is essential to keep plenty of work in the pipeline. Interfaces should provide batched operations, because often it takes no more time to, say, produce four hashes than one.
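As an illustration of that batching point, a hedged sketch (hash4 is a made-up name, and FNV-1a is just a convenient stand-in): hashing four equal-length keys in one call lets four independent dependency chains overlap in the pipeline, instead of each hash waiting on the previous one.

    #include <cstddef>
    #include <cstdint>

    // Hash four equal-length keys at once (FNV-1a). The four chains share no
    // results, so the CPU can execute their multiplies in parallel.
    void hash4(const std::uint8_t* keys[4], std::size_t len, std::uint64_t out[4]) {
        std::uint64_t h0 = 14695981039346656037ull, h1 = h0, h2 = h0, h3 = h0;
        for (std::size_t i = 0; i < len; ++i) {
            h0 = (h0 ^ keys[0][i]) * 1099511628211ull;
            h1 = (h1 ^ keys[1][i]) * 1099511628211ull;
            h2 = (h2 ^ keys[2][i]) * 1099511628211ull;
            h3 = (h3 ^ keys[3][i]) * 1099511628211ull;
        }
        out[0] = h0; out[1] = h1; out[2] = h2; out[3] = h3;
    }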
Measurement is necessary, but since it is so hard to know whether a thing is as fast as it could be, reading out the hardware performance counters is essential. How many pipeline stalls, how many cache misses, how many branch mispredictions?
Transforming "if (c) ++i;" to "i += c;" can make a huge difference, in an inner loop, when c is unpredictable. Similarly, "if (c) a -= b;" may become "a -= (-c & b);". Compilers usually will not make these transformations because they guess that c is predictable. When it isn't, you get a pipeline stall every other iteration, with 10-15 cycles down the drain each time, enough for ~40-60 useful instructions.
Branches break dependency chains, so correctly predicted branches enable more instruction-scheduling parallelism, essentially running multiple loop iterations in parallel, whereas the branchless form chains the result into a data dependency. So reasoning about this is very uncertain; you have to measure both ways and see.
Yes, that is why I say it should not be done if something that depends on i is very close by.
But in many algorithms and in a lot of user loops you don't need data from previous iterations; or if you do, the dependency may be far enough away, e.g. if you are running something complex in each iteration.
So you have to be in the case where you are implementing 1) a non-trivial loop that is 2) small enough and in which 3) branches can be predicted. It does happen, but I bet most code is not like that. (I have no clue, though.)
We also have to remember that every added branch may be destroying performance for some other branch out there (and this is way harder to know unless you measure your particular program). That is why I feel removing a branch is almost always a good idea unless proven otherwise, rather than the other way around.
The partition step in quicksort is a textbook example of a result -- the possibly swapped elements -- not being used in the next iteration. (The conditional increment is used immediately, though.) But replacing the innards of the partition loop with
    bool c = *left <= pivot;                              // did *left compare as <= pivot?
    T v[2] = { std::move(*left), std::move(*right) };     // stash both candidates
    *right = std::move(v[1-c]);                           // unconditional "swap" via array index
    *left  = std::move(v[c]);
    left  += c;                                           // conditional advance, no branch
almost doubles the speed. Persuade the compiler to substitute cmov instructions, and you get more than double.
Gcc cannot be persuaded, under any circumstances, to put more than one cmov in a basic block. Clang can be persuaded with a bit of subterfuge, or (less fun) with ?: expressions. Probably the optimizers should recognize the array-index trick, which is formally correct for all types.
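A hedged sketch of what the ?: form might look like (partition_step is a made-up name, and whether the ternaries actually lower to cmov depends on the compiler and flags):

    #include <utility>

    // Same semantics as the array-index snippet above: stash both elements,
    // then select where each goes with ?: instead of v[c] / v[1-c].
    template <typename T>
    T* partition_step(T* left, T* right, const T& pivot) {
        bool c = *left <= pivot;
        T vl = std::move(*left);
        T vr = std::move(*right);
        *left  = c ? std::move(vr) : std::move(vl);   // hope for cmov, not a branch
        *right = c ? std::move(vl) : std::move(vr);
        return left + c;                              // branch-free conditional advance
    }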
Pressure on the branch predictor's cache is hard to quantify. The vendors don't tell us how many branch-prediction slots we get, but you definitely can blow that cache.
We don't know all the dodgy gimcracks that are in modern CPU cores. I do know that loops using FILE getc/putc and streambuf sgetc/sputc are faster than they have any business being. If you can get your loops to look enough like those, they will be improbably fast.
It is all powered by hardware support in CPUs. Intel CPUs have 3 programmable counters; you can set them to count events like instructions retired or Ln cache misses.
Domain knowledge is great and I think 30 years ago this was much more true. These days I think most software has so many mind melting inefficiencies that there is an avalanche of fat waiting to be shaken loose.
Even without accidental O(n^2) nonsense lurking, or huge amounts of time going to fine-grained operations that each depend on something external with large latency, there are usually still gratuitous memory allocations and, of course, pointer chasing.
Taking out allocations from inner loops can easily be a 7x speedup. Taking out pointer chasing, linked lists, virtual pointers etc. and just looping through an array of structs can be another 25x for the areas that use them.
From there it's off to SIMD and multi-threading but I don't expect that from most software.
Interestingly, none of these rely on programming directly in assembly.
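A hedged sketch of that pointer-chasing point, with made-up types: the same traversal over a linked list (one dependent load per element) versus a contiguous array of structs that the prefetcher can stream through.

    #include <vector>

    struct Particle { float x, y, z, mass; };

    // Pointer-chasing version: each iteration waits on the previous load.
    struct Node { Particle p; Node* next; };
    float total_mass_list(const Node* head) {
        float total = 0;
        for (const Node* n = head; n != nullptr; n = n->next)
            total += n->p.mass;
        return total;
    }

    // Contiguous version: sequential access, no per-element allocation.
    float total_mass_array(const std::vector<Particle>& ps) {
        float total = 0;
        for (const Particle& p : ps)
            total += p.mass;
        return total;
    }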
It is a good point that software has a lot of fat these days, and therefore optimization is an easier task (because it is done much more rarely). Windows 98, for however many bugs it had, was so much more responsive.
Low-level optimization (assembly) is mostly only useful for algorithms that have instruction-level parallelism; modern compilers are incredibly good at optimizing serial algorithms.
Making use of compile-time branch reduction can be massive: for example, template parameters that are evaluated at compile time, skipping lots of work at runtime.
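A minimal sketch of that idea, assuming C++17 (the names process and kWithChecksum are made up): the bool is a template parameter, so the untaken path is removed at compile time and never costs a runtime branch.

    #include <cstddef>
    #include <cstdint>

    template <bool kWithChecksum>
    std::uint64_t process(const std::uint8_t* data, std::size_t n) {
        std::uint64_t sum = 0, checksum = 0;
        for (std::size_t i = 0; i < n; ++i) {
            sum += data[i];
            if constexpr (kWithChecksum) {          // resolved at compile time
                checksum = checksum * 31 + data[i];
            }
        }
        return kWithChecksum ? sum ^ checksum : sum;
    }

    // The caller picks the variant once; each instantiation's inner loop is branch-free.
    // auto fast    = process<false>(buf, len);
    // auto checked = process<true>(buf, len);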
I recently tried to optimise a loop looking for a value in a small array of short integers (8 elements). Not so easy -- there are so many choices: a tight loop (a priori not an optimal choice, but it uses only one instruction cache line), an unrolled loop, cmovs, SIMD?
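A hedged sketch of the SIMD candidate, assuming the 8 shorts are contiguous and SSE2 plus C++20 are available (find_short8 is a made-up name; whether it beats a plain unrolled loop still has to be measured):

    #include <cstdint>
    #include <bit>            // std::countr_zero (C++20)
    #include <emmintrin.h>    // SSE2

    // Compare all 8 lanes at once; return the index of the first match, or 8.
    int find_short8(const std::int16_t* a, std::int16_t value) {
        __m128i v   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i key = _mm_set1_epi16(value);
        int mask    = _mm_movemask_epi8(_mm_cmpeq_epi16(v, key));  // 2 bits per matching lane
        if (mask == 0) return 8;                                   // not found
        return std::countr_zero(static_cast<unsigned>(mask)) / 2;  // byte index -> lane index
    }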