"Reading from L1 cache is like grabbing a piece of paper from your desk (3 seconds), L2 cache is picking up a book from a nearby shelf (14 seconds), main system memory is taking a 4-minute walk down the hall to buy a Twix bar, and waiting for a hard drive seek is like leaving the building to roam the earth for one year and three months." — Gustavo Duarte (http://duartes.org/gustavo/blog/post/what-your-computer-does...)
One of the main points from Herb Sutter's talk is that your prefetcher is effectively a cache of infinite size. This is why Radix sort destroys other in-memory sorting algorithms, and why the way you access data is the single most important performance factor to focus on.
It's also incredibly hard to retrofit when you find out you really need that performance back.
> One of the main points from Herb Sutter's talk is that your prefetcher is effectively a cache of infinite size. This is why Radix sort destroys other in-memory sorting algorithms
Can you elaborate? Radix sort seems like the most unpredictable sorting algorithm out there last time I checked... it jumps all over memory.
Radix sort on a fixed-width key is actually one of the most predictable sorting algorithms there is. In a traditional Radix sort the complexity is O(kn), where k is the number of passes, i.e. the key width divided by the digit width you pick. Each pass uses 2^n buckets for an n-bit digit, so fewer buckets = less space but more passes, and more buckets = fewer passes but more space.
Say for sorting 32-bit values you might split the key into 4 digits aligned to byte boundaries. This gives you 4 passes, each with a 256-entry bucket table.
The sort then becomes a linear walk over the data doing the standard Radix sort passes, and this is where the prefetcher and packed arrays really start working to your advantage: the memory access pattern is trivially predictable, so the prefetcher can hide essentially all of the DRAM misses. Some implementations issue explicit prefetch hints, but it usually takes only a few loop iterations for the hardware prefetcher to do just as well on its own. If you want to, some of the work can be batched across passes (for example, building all the histograms in a single read of the data) to cut the constant factor on that O(kn).
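For concreteness, here's a minimal C++ sketch of that byte-wide pass structure. This is my own code under the assumptions above (32-bit keys, 8-bit digits, a simple ping-pong scratch buffer), not anything lifted from the talk:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal LSD radix sort for 32-bit keys using 8-bit digits: 4 passes, 256
// buckets per pass. Each pass is a sequential histogram read, a tiny prefix
// sum, and a sequential scatter read -- all the big reads are linear, so the
// prefetcher can stay far ahead of them.
void radix_sort_u32(std::vector<uint32_t>& keys) {
    std::vector<uint32_t> scratch(keys.size());
    uint32_t* src = keys.data();
    uint32_t* dst = scratch.data();

    for (int pass = 0; pass < 4; ++pass) {
        const int shift = pass * 8;
        size_t hist[256] = {0};

        // Histogram pass: sequential read of src; the counter array is tiny
        // and stays hot in L1.
        for (size_t i = 0; i < keys.size(); ++i)
            ++hist[(src[i] >> shift) & 0xFF];

        // Exclusive prefix sum turns counts into starting offsets.
        size_t offset = 0;
        for (size_t b = 0; b < 256; ++b) {
            size_t count = hist[b];
            hist[b] = offset;
            offset += count;
        }

        // Scatter pass: sequential read of src; each bucket's writes advance
        // sequentially through its own region of dst.
        for (size_t i = 0; i < keys.size(); ++i)
            dst[hist[(src[i] >> shift) & 0xFF]++] = src[i];

        std::swap(src, dst);  // ping-pong the buffers between passes
    }
    // 4 passes = an even number of swaps, so the sorted data ends up in `keys`.
}
```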
Now if you want to get fancy you can take your key, work out what digit width makes the bucket tables line up nicely with your cache lines, and watch Radix really scream.
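Back-of-the-envelope, assuming a 64-byte cache line, a 32 KB L1 data cache, and 32-bit counters (typical numbers, but check your target):

```cpp
#include <cstddef>
#include <cstdint>

// Rough footprint math for picking a digit width.
constexpr size_t kCacheLine = 64;        // assumed cache line size
constexpr size_t kL1Bytes   = 32 * 1024; // assumed L1D size

constexpr size_t buckets(int digit_bits)         { return size_t{1} << digit_bits; }
constexpr size_t histogram_bytes(int digit_bits) { return buckets(digit_bits) * sizeof(uint32_t); }
constexpr int    passes(int key_bits, int digit_bits) { return (key_bits + digit_bits - 1) / digit_bits; }

//  8-bit digits: 256 counters   =   1 KB  (16 cache lines), 4 passes over 32-bit keys
// 11-bit digits: 2048 counters  =   8 KB (128 cache lines), 3 passes
// 16-bit digits: 65536 counters = 256 KB -- blows past L1, the histograms themselves start missing
static_assert(histogram_bytes(8)  == 1024, "");
static_assert(histogram_bytes(11) == 8 * 1024, "");
static_assert(passes(32, 11) == 3, "");
```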
This is used a lot in computer graphics for determining draw order, intersection tests, etc. (In fact the book "Real-Time Collision Detection" is some of the best data structure literature I've read; it just happens to be about 2D/3D problem spaces.)
This is why understanding the implications of memory access is so important. Paraphrased from the talk: Big O notation is really a blunt tool and sometimes those constant factors can be incredibly painful.
[edit]
Radix sorts also have the nice property of being stable, which can have some useful benefits (multi-key sorts fall out for free: sort by the secondary key first, then by the primary).
Strings are going to be bad anyway because there's a high chance you're hitting a cache miss just to get access to the heap allocated string data.
Yes, only the histogram pass has random writes, and if you pick your bucket size appropriately (like I mentioned above) then the surface area for those random writes can land within a cache line or two.
Summing the histograms is a very nice linear read/write pass.
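Here's a sketch of that batched histogram build plus the summing pass, again assuming byte-wide digits on 32-bit keys (the function name and layout are mine):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build all four byte histograms in a single linear read of the data, then
// turn each one into running offsets with a small prefix-sum pass. Only the
// per-digit scatter passes remain, so the data is read k+1 times instead of 2k.
void build_offsets_u32(const std::vector<uint32_t>& keys, size_t hist[4][256]) {
    for (int p = 0; p < 4; ++p)
        for (size_t b = 0; b < 256; ++b)
            hist[p][b] = 0;

    // One sequential pass over the keys fills all four histograms at once.
    for (uint32_t k : keys) {
        ++hist[0][ k        & 0xFF];
        ++hist[1][(k >> 8)  & 0xFF];
        ++hist[2][(k >> 16) & 0xFF];
        ++hist[3][(k >> 24) & 0xFF];
    }

    // Summing the histograms: a tiny, purely linear read/write pass that
    // converts counts into each bucket's starting offset.
    for (int pass = 0; pass < 4; ++pass) {
        size_t offset = 0;
        for (size_t b = 0; b < 256; ++b) {
            size_t count = hist[pass][b];
            hist[pass][b] = offset;
            offset += count;
        }
    }
}
```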
Name me another sorting algorithm where every read can be predicted several hundred cycles in advance, and writes only go to O(1) locations that were read within the last several hundred cycles. I don't know of any.
Remember: a cache miss costs easily several hundred cycles.
And, to ask a question: if radix sort is faster, and not due to better cache performance, what is it due to?
http://blog.codinghorror.com/the-infinite-space-between-word...
Operation               Latency    Scaled (1 CPU cycle = 1 s)
1 CPU cycle             0.3 ns     1 s
Level 1 cache access    0.9 ns     3 s
Level 2 cache access    2.8 ns     9 s
Level 3 cache access    12.9 ns    43 s
Main memory access      120 ns     6 min