"Reading from L1 cache is like grabbing a piece of paper from your desk (3 seconds), L2 cache is picking up a book from a nearby shelf (14 seconds), main system memory is taking a 4-minute walk down the hall to buy a Twix bar, and waiting for a hard drive seek is like leaving the building to roam the earth for one year and three months." — Gustavo Duarte (http://duartes.org/gustavo/blog/post/what-your-computer-does...)
One of the main points from Herb Sutter's talk is that your prefetcher is effectively a cache of infinite size. This is why Radix sort destroys other in-memory sorting algorithms, and why the way you access data is the single most important performance factor to focus on.
It's also incredibly hard to retrofit when you find out you really need that performance back.
> One of the main points from Herb Sutter's talk is that your prefetcher is effectively a cache of infinite size. This is why Radix sort destroys other in-memory sorting algorithms
Can you elaborate? Radix sort seems like the most unpredictable sorting algorithm out there last time I checked... it jumps all over memory.
Radix sort on a fixed-width key is actually one of the most predictable sorting algorithms there is. In a traditional Radix sort the complexity is O(kn), where k is the number of passes, i.e. the key width divided by the digit width you pick. Each pass uses 2^n buckets for an n-bit digit, so fewer buckets = less space but more passes, and more buckets = fewer passes but more space.
Say for sorting 32-bit values you might split the key into 4 digits aligned to byte boundaries. This gives you 4 passes, each with a 256-entry bucket table.
The sort then becomes a linear walk over the data doing the standard Radix sort passes, and this is where the prefetcher and packed arrays really start working to your advantage: the memory access pattern is trivially predictable, so the prefetcher can hide essentially all of the DRAM misses. Some implementations issue explicit prefetch hints, but it usually takes only a few loop iterations for the hardware prefetcher to do just as well on its own. If you want to, some of the work can be batched across passes (for example, building all the histograms in a single read of the data) to cut the constant factor on that O(kn).
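For concreteness, here's a minimal C++ sketch of that byte-wide pass structure. This is my own code under the assumptions above (32-bit keys, 8-bit digits, a simple ping-pong scratch buffer), not anything lifted from the talk:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal LSD radix sort for 32-bit keys using 8-bit digits: 4 passes, 256
// buckets per pass. Each pass is a sequential histogram read, a tiny prefix
// sum, and a sequential scatter read -- all the big reads are linear, so the
// prefetcher can stay far ahead of them.
void radix_sort_u32(std::vector<uint32_t>& keys) {
    std::vector<uint32_t> scratch(keys.size());
    uint32_t* src = keys.data();
    uint32_t* dst = scratch.data();

    for (int pass = 0; pass < 4; ++pass) {
        const int shift = pass * 8;
        size_t hist[256] = {0};

        // Histogram pass: sequential read of src; the counter array is tiny
        // and stays hot in L1.
        for (size_t i = 0; i < keys.size(); ++i)
            ++hist[(src[i] >> shift) & 0xFF];

        // Exclusive prefix sum turns counts into starting offsets.
        size_t offset = 0;
        for (size_t b = 0; b < 256; ++b) {
            size_t count = hist[b];
            hist[b] = offset;
            offset += count;
        }

        // Scatter pass: sequential read of src; each bucket's writes advance
        // sequentially through its own region of dst.
        for (size_t i = 0; i < keys.size(); ++i)
            dst[hist[(src[i] >> shift) & 0xFF]++] = src[i];

        std::swap(src, dst);  // ping-pong the buffers between passes
    }
    // 4 passes = an even number of swaps, so the sorted data ends up in `keys`.
}
```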
Now if you want to get fancy you can take your key, work out what digit width makes the bucket tables line up nicely with your cache lines, and watch Radix really scream.
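Back-of-the-envelope, assuming a 64-byte cache line, a 32 KB L1 data cache, and 32-bit counters (typical numbers, but check your target):

```cpp
#include <cstddef>
#include <cstdint>

// Rough footprint math for picking a digit width.
constexpr size_t kCacheLine = 64;        // assumed cache line size
constexpr size_t kL1Bytes   = 32 * 1024; // assumed L1D size

constexpr size_t buckets(int digit_bits)         { return size_t{1} << digit_bits; }
constexpr size_t histogram_bytes(int digit_bits) { return buckets(digit_bits) * sizeof(uint32_t); }
constexpr int    passes(int key_bits, int digit_bits) { return (key_bits + digit_bits - 1) / digit_bits; }

//  8-bit digits: 256 counters   =   1 KB  (16 cache lines), 4 passes over 32-bit keys
// 11-bit digits: 2048 counters  =   8 KB (128 cache lines), 3 passes
// 16-bit digits: 65536 counters = 256 KB -- blows past L1, the histograms themselves start missing
static_assert(histogram_bytes(8)  == 1024, "");
static_assert(histogram_bytes(11) == 8 * 1024, "");
static_assert(passes(32, 11) == 3, "");
```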
This is used a lot in computer graphics for determining draw order, intersection tests, etc. (In fact the book "Real-Time Collision Detection" is some of the best data structure literature I've read; it just happens to be about 2D/3D problem spaces.)
This is why understanding the implications of memory access is so important. Paraphrased from the talk: Big O notation is really a blunt tool and sometimes those constant factors can be incredibly painful.
[edit]
Radix sorts also have the nice property of being stable, which can have some useful benefits (multi-key sorts fall out for free: sort by the secondary key first, then by the primary).
Strings are going to be bad anyway because there's a high chance you're hitting a cache miss just to get access to the heap allocated string data.
Yes, only the histogram pass has random writes, and if you pick your bucket size appropriately (like I mentioned above) then the surface area for those random writes can land within a cache line or two.
Summing the histograms is a very nice linear read/write pass.
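Here's a sketch of that batched histogram build plus the summing pass, again assuming byte-wide digits on 32-bit keys (the function name and layout are mine):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build all four byte histograms in a single linear read of the data, then
// turn each one into running offsets with a small prefix-sum pass. Only the
// per-digit scatter passes remain, so the data is read k+1 times instead of 2k.
void build_offsets_u32(const std::vector<uint32_t>& keys, size_t hist[4][256]) {
    for (int p = 0; p < 4; ++p)
        for (size_t b = 0; b < 256; ++b)
            hist[p][b] = 0;

    // One sequential pass over the keys fills all four histograms at once.
    for (uint32_t k : keys) {
        ++hist[0][ k        & 0xFF];
        ++hist[1][(k >> 8)  & 0xFF];
        ++hist[2][(k >> 16) & 0xFF];
        ++hist[3][(k >> 24) & 0xFF];
    }

    // Summing the histograms: a tiny, purely linear read/write pass that
    // converts counts into each bucket's starting offset.
    for (int pass = 0; pass < 4; ++pass) {
        size_t offset = 0;
        for (size_t b = 0; b < 256; ++b) {
            size_t count = hist[pass][b];
            hist[pass][b] = offset;
            offset += count;
        }
    }
}
```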
Name me another sorting algorithm where every read can be predicted several hundred cycles in advance, and writes only go to O(1) locations that were read within the last several hundred cycles. I don't know of any.
Remember: a cache miss costs easily several hundred cycles.
And, to ask a question: if radix sort is faster, and not due to better cache performance, what is it due to?
http://blog.codinghorror.com/the-infinite-space-between-word...
Operation               Latency    Scaled (1 CPU cycle = 1 s)
1 CPU cycle             0.3 ns     1 s
Level 1 cache access    0.9 ns     3 s
Level 2 cache access    2.8 ns     9 s
Level 3 cache access    12.9 ns    43 s
Main memory access      120 ns     6 min