Taking the average is a bad idea for performance benchmarks like this running under an operating system. The minimum is much more stable. (Easy to verify experimentally: do 10x10 runs of a short benchmark and compare the variance of 10 minima to the variance of 10 averages.)
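The 10x10 experiment is quick to sketch. Here is a minimal version (the benchmark body and repetition counts are placeholders; any short deterministic workload will do):

```python
import time
import statistics

def bench(reps=200_000):
    """One run of a short benchmark: time a tight arithmetic loop."""
    t0 = time.perf_counter()
    s = 0
    for i in range(reps):
        s += i * i
    return time.perf_counter() - t0

# 10 groups of 10 runs each; summarize every group two ways.
groups = [[bench() for _ in range(10)] for _ in range(10)]
minima = [min(g) for g in groups]
averages = [statistics.fmean(g) for g in groups]

# The spread across groups measures how stable each estimator is.
print(f"stdev of 10 minima:   {statistics.stdev(minima):.2e}")
print(f"stdev of 10 averages: {statistics.stdev(averages):.2e}")
```

On a loaded system the averages absorb every scheduling hiccup, while the minima cluster tightly, which is the claim being made above.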
True, but the minimum might also depend on the various caches being primed by previous runs of the same code - a condition that will not appear during normal use.
Caches being primed is a concern, but not when choosing between minimum and average. Neither is affected much if the first run takes longer than the other nine.
... and since you don't know how big the constant factor is, you still have to do the experiment a few times.
Also, there might be factors you are not aware of that differ between executions of the same program (maybe address space randomization got extremely lucky or unlucky in one execution, or your OS has a bug that prioritizes processes differently when the PID is a power of two - who knows?), or some background process (cron?) woke up. So even if your number of in-process repetitions is sufficiently high, you still run the thing a few times and check whether the numbers are stable.
Though sometimes you want to benchmark with 'ondemand' just to see how CPU frequency scaling affects your code.
It is known that 'ondemand' may make multi-threaded programs run slower on a multi-core system. The problem, which I have observed and reproduced myself, occurs when two threads are alternately scheduled: one thread runs for some time, then wakes the other and goes to sleep, ad infinitum. With such a workload, the scheduler/governor may enter a situation where the work migrates from one CPU core to the other (*), with no single core reaching enough average load to be switched to a higher frequency.
I can't find a reference for this, but it was easy to reproduce with a simple pthreads program doing what is described above. The actual application where I encountered the problem consisted of an event-loop thread and a worker thread doing computational tasks (encryption). The most interesting observation was that disabling one of the two CPU cores (yes, Linux can do that at runtime) sped up my application and benchmark considerably.
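The alternating workload is easy to sketch. The original reproducer used pthreads; the same ping-pong pattern looks like this in Python (thread names and iteration count are illustrative, and because of the GIL this version only demonstrates the scheduling pattern, not the per-core frequency effect):

```python
import threading

ITERATIONS = 1000
a_turn = threading.Event()
b_turn = threading.Event()
log = []  # records which thread ran each step

def worker(name, my_turn, other_turn):
    for _ in range(ITERATIONS):
        my_turn.wait()     # sleep until woken by the other thread
        my_turn.clear()
        log.append(name)   # stand-in for a chunk of real work
        other_turn.set()   # wake the other thread, go back to waiting

ta = threading.Thread(target=worker, args=("A", a_turn, b_turn))
tb = threading.Thread(target=worker, args=("B", b_turn, a_turn))
ta.start(); tb.start()
a_turn.set()               # kick off thread A first
ta.join(); tb.join()

print(log[:6])  # ['A', 'B', 'A', 'B', 'A', 'B']
```

At any instant exactly one thread is runnable, so from the governor's point of view no core is ever more than about 50% loaded, even though the program as a whole is fully CPU-bound.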
Hence, with the 10^6 trials done in the article, we should expect a deviation of a couple of points in the third digit.