It was faster before 2000, but is it still faster now? On my computer it is inde...

Tomek_ · on Nov 2, 2011

No, it's not faster now and it's no longer really used probably (don't know if smartphones, tablets, various consoles, etc. offer access to any equivalent of rsqrtps), but it was brilliant at its time (well, still is, it's just not very useful).

Remember, Quake was released in 1996, SSE came out in 1999.

liamkf · on Nov 2, 2011

Hmm, that seems a bit misleading. The SSE implementation will of course be much faster than a non-vectorized implementation... it's working on 4 floats at a time.

A vectorized version of the marvelous method would be a fairer comparison. If your data is not very vectorizable and you don't need high precision, it's still fairly marvelous.

onan_barbarian · on Nov 2, 2011

It is your comment that is (mildly) misleading.

The SSE version is 16x faster than the x87 FPU version and almost 4x faster than the marvelous method. Further, the marvelous method is a loop and is much more resource-hungry than an instruction that goes away for 5-10 cycles (reciprocal throughput is now 1 cycle) but leaves you with most of your execution resources to do other useful work.

Vectorizing the marvelous method is very likely to be painful, as it has the treatment of a value as both a float and an int - while the xmm registers are dual-purpose operating on one successively as a int and a float causes some extra latency. The use of bit shifts and float operations require these cross-domain transfers...

jemfinch · on Nov 2, 2011

Would you mind posting your bench framework for this?

kmm · on Nov 2, 2011

I didn't make the code, nor do I want to take credit for it. I only changed a few lines to make it compile. The code explicitly say one can redistribute the code as one wants, so here it is

http://pastebin.com/EnJi86hF

Would you mind posting your results here? I'm quite interested if this has the same pattern everywhere.

jemfinch · on Nov 3, 2011

Here's what I get:

Input array size is 1024

Naive sqrtf() CPU cycles used: 54514, error: 0.000000

Vectorized SSE CPU cycles used: 900, error: inf

Marvelous CPU cycles used: 21514, error: inf

I'm not sure how to interpret the errors. I had to make a few changes myself for this to compile on my machine.