Hmm, that seems a bit misleading. The SSE implementation will of course be much faster than a non-vectorized implementation... it's working on 4 floats at a time.
A vectorized version of the marvelous method would be a fairer comparison. If your data is not very vectorizable and you don't need high precision, it's still fairly marvelous.
The SSE version is 16x faster than the x87 FPU version and almost 4x faster than the marvelous method. Further, the marvelous method is a loop and is much more resource-hungry than an instruction that goes away for 5-10 cycles (reciprocal throughput is now 1 cycle) but leaves you with most of your execution resources to do other useful work.
Vectorizing the marvelous method is very likely to be painful, as it has the treatment of a value as both a float and an int - while the xmm registers are dual-purpose operating on one successively as a int and a float causes some extra latency. The use of bit shifts and float operations require these cross-domain transfers...
A vectorized version of the marvelous method would be a fairer comparison. If your data is not very vectorizable and you don't need high precision, it's still fairly marvelous.