Taking the average is a bad idea for performance benchmarks like this running under an operating system. The minimum is much more stable. (Easy to verify experimentally: do 10x10 runs of a short benchmark and compare the variance of 10 minima to the variance of 10 averages.)
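The 10x10 experiment is quick to sketch. Here is a minimal version (the benchmark body and repetition counts are placeholders; any short deterministic workload will do):

```python
import time
import statistics

def bench(reps=200_000):
    """One run of a short benchmark: time a tight arithmetic loop."""
    t0 = time.perf_counter()
    s = 0
    for i in range(reps):
        s += i * i
    return time.perf_counter() - t0

# 10 groups of 10 runs each; summarize every group two ways.
groups = [[bench() for _ in range(10)] for _ in range(10)]
minima = [min(g) for g in groups]
averages = [statistics.fmean(g) for g in groups]

# The spread across groups measures how stable each estimator is.
print(f"stdev of 10 minima:   {statistics.stdev(minima):.2e}")
print(f"stdev of 10 averages: {statistics.stdev(averages):.2e}")
```

On a loaded system the averages absorb every scheduling hiccup, while the minima cluster tightly, which is the claim being made above.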
True, but the minimum might also depend on the various caches being primed by previous runs of the same code - a condition that will not appear during normal use.
Caches being primed is a concern, but not when choosing between minimum and average. Neither is affected much if the first run takes longer than the other nine.
... and since you don't know how big the constant factor is, you still have to do the experiment a few times.
Also, there might be factors you are not aware of that differ between executions of the same program (maybe address space randomization got extremely lucky or unlucky in one execution, or your OS has a bug that prioritizes processes differently when the PID is a power of two - who knows?), or some background process (cron?) woke up. So even if your number of in-process repetitions is sufficiently high, you still run the thing a few times and check whether the numbers are stable.
Though sometimes you want to benchmark with 'ondemand' just to see how CPU frequency scaling affects your code.
It is known that 'ondemand' may make multi-threaded programs run slower on a multi-core system. The problem, which I have observed and reproduced myself, occurs when two threads are alternately scheduled: one thread runs for some time, then wakes the other and goes to sleep, ad infinitum. With such a workload, the scheduler/governor may enter a situation where the work migrates from one CPU core to the other (*), with no single core reaching enough average load to be switched to a higher frequency.
I can't find a reference for this, but it was easy to reproduce with a simple pthreads program doing what is described above. The actual application where I encountered the problem consisted of an event-loop thread and a worker thread doing computational tasks (encryption). The most interesting observation was that disabling one of the two CPU cores (yes, Linux can do that at runtime) sped up my application and benchmark considerably.
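The alternating workload is easy to sketch. The original reproducer used pthreads; the same ping-pong pattern looks like this in Python (thread names and iteration count are illustrative, and because of the GIL this version only demonstrates the scheduling pattern, not the per-core frequency effect):

```python
import threading

ITERATIONS = 1000
a_turn = threading.Event()
b_turn = threading.Event()
log = []  # records which thread ran each step

def worker(name, my_turn, other_turn):
    for _ in range(ITERATIONS):
        my_turn.wait()     # sleep until woken by the other thread
        my_turn.clear()
        log.append(name)   # stand-in for a chunk of real work
        other_turn.set()   # wake the other thread, go back to waiting

ta = threading.Thread(target=worker, args=("A", a_turn, b_turn))
tb = threading.Thread(target=worker, args=("B", b_turn, a_turn))
ta.start(); tb.start()
a_turn.set()               # kick off thread A first
ta.join(); tb.join()

print(log[:6])  # ['A', 'B', 'A', 'B', 'A', 'B']
```

At any instant exactly one thread is runnable, so from the governor's point of view no core is ever more than about 50% loaded, even though the program as a whole is fully CPU-bound.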
Hence, with the 10^6 trials done in the article, we should expect a deviation of a couple of points in the third digit.