It was faster before 2000, but is it still faster now?
On my computer it is indeed four times faster, but the marvellous method is on its own again four times slower than SSE, using the rsqrtps instruction, with the advantage of a way lower maximum error.
To take the square root of an array of 1024 floats:
Naive sqrtf()
CPU cycles used: 48160, error: 0.000000
Vectorized SSE
CPU cycles used: 2970, error: 0.000392
Marvellous Carmack method
CPU cycles used: 11330, error: 0.002186
No, it's not faster now and it's no longer really used probably (don't know if smartphones, tablets, various consoles, etc. offer access to any equivalent of rsqrtps), but it was brilliant at its time (well, still is, it's just not very useful).
Remember, Quake was released in 1996, SSE came out in 1999.
Hmm, that seems a bit misleading. The SSE implementation will of course be much faster than a non-vectorized implementation... it's working on 4 floats at a time.
A vectorized version of the marvelous method would be a fairer comparison. If your data is not very vectorizable and you don't need high precision, it's still fairly marvelous.
The SSE version is 16x faster than the x87 FPU version and almost 4x faster than the marvelous method. Further, the marvelous method is a loop and is much more resource-hungry than an instruction that goes away for 5-10 cycles (reciprocal throughput is now 1 cycle) but leaves you with most of your execution resources to do other useful work.
Vectorizing the marvelous method is very likely to be painful, as it has the treatment of a value as both a float and an int - while the xmm registers are dual-purpose operating on one successively as a int and a float causes some extra latency. The use of bit shifts and float operations require these cross-domain transfers...
I didn't make the code, nor do I want to take credit for it. I only changed a few lines to make it compile. The code explicitly say one can redistribute the code as one wants, so here it is
On my computer it is indeed four times faster, but the marvellous method is on its own again four times slower than SSE, using the rsqrtps instruction, with the advantage of a way lower maximum error.
To take the square root of an array of 1024 floats:
Naive sqrtf() CPU cycles used: 48160, error: 0.000000
Vectorized SSE CPU cycles used: 2970, error: 0.000392
Marvellous Carmack method CPU cycles used: 11330, error: 0.002186