Hmm, that seems a bit misleading. The SSE implementation will of course be much ...

onan_barbarian · on Nov 2, 2011

It is your comment that is (mildly) misleading.

The SSE version is 16x faster than the x87 FPU version and almost 4x faster than the marvelous method. Further, the marvelous method is a loop, and it is far more resource-hungry than a single instruction that goes away for 5-10 cycles (reciprocal throughput is now 1 cycle) and leaves most of your execution resources free to do other useful work.

Vectorizing the marvelous method is very likely to be painful, as it treats a value as both a float and an int. While the xmm registers are dual-purpose, operating on one successively as an int and as a float causes some extra latency. The use of bit shifts and float operations requires these cross-domain transfers...
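
To make the int/float point concrete, here is a rough sketch in C. It assumes "the marvelous method" means the well-known 0x5f3759df inverse square root trick with one Newton-Raphson step; the function names fast_rsqrt and fast_rsqrt4 are made up for illustration, and the SSE2 part is only compiled where the compiler defines __SSE2__. It is not a benchmark and says nothing about the timings quoted above.

```c
#include <stdint.h>
#include <string.h>

/* Scalar sketch (hypothetical name). The same 32 bits are handled first as
   an integer (shift, subtract) and then as a float (multiply, subtract).
   Assumes the 0x5f3759df variant with a single Newton-Raphson step. */
static float fast_rsqrt(float x)
{
    uint32_t i;
    float y;
    memcpy(&i, &x, sizeof i);          /* reinterpret the float's bits as an integer */
    i = 0x5f3759dfu - (i >> 1);        /* integer domain: shift and subtract */
    memcpy(&y, &i, sizeof y);          /* reinterpret the result as a float */
    y = y * (1.5f - 0.5f * x * y * y); /* float domain: one Newton-Raphson step */
    return y;
}

#if defined(__SSE2__)
#include <emmintrin.h>

/* Four lanes at once (hypothetical name). The casts emit no instructions;
   they only change the type the compiler sees. The integer shift/subtract
   and the float multiplies still operate on the same register in turn,
   which is the int-then-float pattern described above. */
static void fast_rsqrt4(const float *in, float *out)
{
    __m128  x = _mm_loadu_ps(in);
    __m128i i = _mm_castps_si128(x);               /* view the lanes as integers */
    i = _mm_sub_epi32(_mm_set1_epi32(0x5f3759df),
                      _mm_srli_epi32(i, 1));       /* integer shift and subtract */
    __m128  y = _mm_castsi128_ps(i);               /* view the lanes as floats again */
    __m128  half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
    y = _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f),
                                 _mm_mul_ps(half_x, _mm_mul_ps(y, y))));
    _mm_storeu_ps(out, y);                         /* float multiply and subtract above */
}
#endif
```

With one refinement step the scalar version should land within a fraction of a percent of 1/sqrt(x) for ordinary positive inputs, e.g. fast_rsqrt(4.0f) comes out close to 0.5.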