Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

It was faster before 2000, but is it still faster now?

On my computer it is indeed four times faster, but the marvellous method is on its own again four times slower than SSE, using the rsqrtps instruction, with the advantage of a way lower maximum error.

To take the square root of an array of 1024 floats:

Naive sqrtf() CPU cycles used: 48160, error: 0.000000

Vectorized SSE CPU cycles used: 2970, error: 0.000392

Marvellous Carmack method CPU cycles used: 11330, error: 0.002186



No, it's not faster now and it's no longer really used probably (don't know if smartphones, tablets, various consoles, etc. offer access to any equivalent of rsqrtps), but it was brilliant at its time (well, still is, it's just not very useful).

Remember, Quake was released in 1996, SSE came out in 1999.


Hmm, that seems a bit misleading. The SSE implementation will of course be much faster than a non-vectorized implementation... it's working on 4 floats at a time.

A vectorized version of the marvelous method would be a fairer comparison. If your data is not very vectorizable and you don't need high precision, it's still fairly marvelous.


It is your comment that is (mildly) misleading.

The SSE version is 16x faster than the x87 FPU version and almost 4x faster than the marvelous method. Further, the marvelous method is a loop and is much more resource-hungry than an instruction that goes away for 5-10 cycles (reciprocal throughput is now 1 cycle) but leaves you with most of your execution resources to do other useful work.

Vectorizing the marvelous method is very likely to be painful, as it has the treatment of a value as both a float and an int - while the xmm registers are dual-purpose operating on one successively as a int and a float causes some extra latency. The use of bit shifts and float operations require these cross-domain transfers...


Would you mind posting your bench framework for this?


I didn't make the code, nor do I want to take credit for it. I only changed a few lines to make it compile. The code explicitly say one can redistribute the code as one wants, so here it is

http://pastebin.com/EnJi86hF

Would you mind posting your results here? I'm quite interested if this has the same pattern everywhere.


Here's what I get:

Input array size is 1024

Naive sqrtf() CPU cycles used: 54514, error: 0.000000

Vectorized SSE CPU cycles used: 900, error: inf

Marvelous CPU cycles used: 21514, error: inf

I'm not sure how to interpret the errors. I had to make a few changes myself for this to compile on my machine.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: