Sorry for the blatant plug, but it might be relevant: in my M.Sc. thesis I surveyed main-memory-optimized index techniques and also gave some background on why traditional binary search is a poor fit for modern hardware. There are lots of illustrations and I cite some great material. :)
"Reading from L1 cache is like grabbing a piece of paper from your desk (3 seconds), L2 cache is picking up a book from a nearby shelf (14 seconds), main system memory is taking a 4-minute walk down the hall to buy a Twix bar, and waiting for a hard drive seek is like leaving the building to roam the earth for one year and three months." — Gustavo Duarte (http://duartes.org/gustavo/blog/post/what-your-computer-does...)
One of the main points from Herb Sutter's talk is that your prefetcher is a cache of infinite size. This is why radix sort destroys other in-memory sorting algorithms, and why how you access data is the single most important thing to focus on for performance.
It's also incredibly hard to retrofit when you find out you really need that performance back.
> One of the main points from Herb Sutter's talk is that your prefetcher is a cache of infinite size. This is why radix sort destroys other in-memory sorting algorithms
Can you elaborate? Radix sort seems like the most unpredictable sorting algorithm out there last time I checked... it jumps all over memory.
Radix sort on a fixed-width key is actually one of the most predictable sorting algorithms there is. In a traditional radix sort the complexity is O(kn), where k is the number of digits you split the key into. Each digit is b bits wide and gets 2^b buckets, so wider digits = fewer passes but bigger count tables, and narrower digits = less table space but more passes.
Say for sorting 32-bit values you might split the key into 4 byte-aligned digits. That gives you 4 passes, each with a 256-entry bucket table.
The sort then becomes a linear walk of the data doing the standard radix sort passes. This is where the prefetcher and packed arrays really start working to your advantage: the memory access pattern is trivially predictable, and the prefetcher eliminates essentially all of the DRAM cache misses. Some implementations add explicit prefetch hints, but it usually takes the hardware prefetcher only a few loop iterations to do just as well on its own. If you want, parts of the passes can be batched (e.g. building all of the histograms in one initial walk) to cut down the constant factor in that O(kn).
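To make that concrete, here's a minimal sketch of the byte-wise LSD variant described above (my own illustration, assuming unsigned 32-bit keys and a scratch buffer; the names are made up):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Byte-wise LSD radix sort for unsigned 32-bit keys: 4 passes, each
// consisting of one linear histogram read and one linear scatter read.
void radix_sort_u32(std::vector<uint32_t>& keys) {
    const std::size_t n = keys.size();
    std::vector<uint32_t> scratch(n);
    uint32_t* src = keys.data();
    uint32_t* dst = scratch.data();

    for (int pass = 0; pass < 4; ++pass) {
        const int shift = pass * 8;

        // Histogram pass: linear read of src; the 256 counters (~2 KB)
        // stay resident in L1.
        std::array<std::size_t, 256> count{};
        for (std::size_t i = 0; i < n; ++i)
            ++count[(src[i] >> shift) & 0xFF];

        // Prefix sum: turn counts into starting offsets for each bucket.
        std::size_t offset = 0;
        for (std::size_t b = 0; b < 256; ++b) {
            const std::size_t c = count[b];
            count[b] = offset;
            offset += c;
        }

        // Scatter pass: another linear read of src; writes go to 256
        // monotonically advancing cursors in the destination buffer.
        for (std::size_t i = 0; i < n; ++i)
            dst[count[(src[i] >> shift) & 0xFF]++] = src[i];

        std::swap(src, dst);
    }
    // An even number of passes leaves the sorted result back in `keys`.
}
```

Every pass reads the input strictly front to back, which is exactly the pattern the prefetcher loves; the only scattered writes are the 256 bucket cursors, and each of them only ever moves forward.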
Now if you want to get fancy you can look at your key and work out which digit width lines up nicely with your cache line size, and watch radix sort really scream.
This is used a lot in computer graphics for determining draw order, intersection tests, etc. (In fact, the book "Real-Time Collision Detection" is some of the best data structure literature I've read; it just happens to be framed around 2D/3D problem spaces.)
This is why understanding the implications of memory access is so important. Paraphrased from the talk: Big O notation is really a blunt tool and sometimes those constant factors can be incredibly painful.
[edit]
They also have the nice property of being stable which can have some useful benefits.
Strings are going to be bad anyway because there's a high chance you're hitting a cache miss just to get access to the heap allocated string data.
Yes, only the histogram pass has random writes, and if you pick your bucket size appropriately (like I mentioned above) the surface area for those random writes can land within a cache line or two.
Summing the histograms is a very nice linear read/write pass.
Name me another sorting algorithm where every read can be predicted several hundred cycles in advance, and writes go to O(1) locations that are read from within several hundred cycles. I don't know of any.
Remember: a cache miss costs easily several hundred cycles.
And, to ask a question: if radix sort is faster, and not due to better cache performance, what is it due to?
Another thing that matters enormously is branch misprediction. For example, super scalar sample sort beats GNU's (stupefyingly fast) std::sort by 35-50% on average. I benchmarked skewed quicksort vs std::sort vs super scalar sample sort in a totally artificial setting (sorting a permutation of 0 to n-1) for a lecture: https://github.com/lorenzhs/quicksort-pivot-imbalance
(Repository does not contain the super scalar sample sort code, I don't have the rights to publish that unfortunately. I used the authors' old Pentium 4 optimized code from the paper linked at the bottom (incl PDF).)
Edit: super scalar sample sort also has large data locality benefits over quicksort, the "wavy" line for ssssort shows the caches' impact. I haven't studied that aspect in detail yet.
Very cool. In my experience, using a skewed pivot doesn't actually help in practice. Outside of the case you mention (where the input is a random permutation), sampling must be used to choose a skewed pivot, which has a fairly significant overhead.
But using a skewed pivot is not my point. It would not be particularly clever to respond to the problem of branch mispredictions by doing suboptimal things. The solution is to use algorithms that avoid hard-to-predict branching and exploit data locality. Thus, super scalar sample sort, which is significantly faster than skewed quicksort even in the case where quicksort gets the pivot for free.
Regarding overhead for sampling: super scalar sample sort samples sqrt(n) elements, sorts them, and picks a fixed number of pivots from them (e.g. 256 in the implementation I used). But that is worth it, because no branches are needed to find which "bucket" (between which two pivots) an element belongs in; predicated instructions handle it. So the overhead is not the problem, and as you can see in the other plots (in /plots), ssssort executes more instructions. It is still significantly faster than the other algorithms because it exploits the properties of the CPU.
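To make the "no branches" part concrete, here's a rough sketch of the bucket-finding step (my own from-memory illustration of the implicit splitter tree, not the authors' code; assume 255 splitters defining 256 buckets):

```cpp
#include <cstdint>

// Branch-free bucket lookup: the 255 splitters are stored as an
// implicit binary search tree in splitter[1..255] (index 0 unused).
// The comparison result feeds an index update rather than a
// conditional jump, so the compiler can emit a conditional move /
// predicated instruction and there is nothing to mispredict.
inline unsigned find_bucket(uint32_t key, const uint32_t* splitter) {
    unsigned j = 1;                          // root of the implicit tree
    for (int level = 0; level < 8; ++level)  // log2(256) levels
        j = 2 * j + (key > splitter[j]);
    return j - 256;                          // leaves map to buckets 0..255
}
```

The splitter table is tiny, so it sits in L1, and because the loop has a fixed trip count it can be unrolled and interleaved across several elements for extra instruction-level parallelism.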
Sure. I didn't actually mean ssssort. Sometimes it's worth distinguishing between comparison-based sorts and those that make use of key structure, assume a fixed-size key universe, etc. For that reason, I think the sampling overhead is worth thinking about.
Deciding memory layout is a sub-task of designing cache-oblivious algorithms, which optimize the number of memory transfers between levels of the memory hierarchy. I found this very interesting: http://erikdemaine.org/papers/BRICS2002/paper.pdf
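For a flavor of the recursive style described there, here's a toy sketch (mine, not from the linked paper) of a cache-oblivious matrix transpose: keep splitting the larger dimension until a sub-block comfortably fits in whatever caches exist, without the code ever knowing the cache sizes.

```cpp
#include <cstddef>
#include <vector>

// Writes the transpose of the [r0, r1) x [c0, c1) block of the n x n
// row-major matrix `a` into `b` (also n x n, row-major).
void transpose_rec(const std::vector<double>& a, std::vector<double>& b,
                   std::size_t n, std::size_t r0, std::size_t r1,
                   std::size_t c0, std::size_t c1) {
    const std::size_t dr = r1 - r0, dc = c1 - c0;
    if (dr <= 16 && dc <= 16) {             // base case: block is cache-resident
        for (std::size_t r = r0; r < r1; ++r)
            for (std::size_t c = c0; c < c1; ++c)
                b[c * n + r] = a[r * n + c];
    } else if (dr >= dc) {                  // split the longer dimension
        transpose_rec(a, b, n, r0, r0 + dr / 2, c0, c1);
        transpose_rec(a, b, n, r0 + dr / 2, r1, c0, c1);
    } else {
        transpose_rec(a, b, n, r0, r1, c0, c0 + dc / 2);
        transpose_rec(a, b, n, r0, r1, c0 + dc / 2, c1);
    }
}

void transpose(const std::vector<double>& a, std::vector<double>& b, std::size_t n) {
    transpose_rec(a, b, n, 0, n, 0, n);
}
```

A naive row-by-row transpose of a large matrix misses the cache on nearly every write to `b`; the recursive version is transfer-efficient at every level of the hierarchy at once, which is the whole point of the cache-oblivious model.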
A while back I did some rough tests on searching arrays laid out in depth-first order. The hope was that the better contiguity (the next element in the array is the next element to test 50% of the time) would lead to better search performance, but I wasn't able to observe much of a difference in practice.
I also found that while writing the search operation was very easy for arrays in this order, insertion was much more difficult.
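For reference, the search itself looked roughly like this (a sketch from memory, assuming n = 2^k - 1 sorted keys written out in pre-order of a perfectly balanced BST):

```cpp
#include <cstddef>
#include <cstdint>

// Search an array holding a perfect BST in pre-order (depth-first)
// layout: root first, then the whole left subtree, then the right
// subtree. `n` must be 2^k - 1.
bool dfs_layout_contains(const uint32_t* a, std::size_t n, uint32_t key) {
    std::size_t pos = 0;        // index of the current subtree's root
    std::size_t size = n;       // number of nodes in the current subtree
    while (size > 0) {
        if (key == a[pos]) return true;
        const std::size_t half = size / 2;  // size of each child subtree
        if (key < a[pos])
            pos += 1;           // left child is the adjacent element
        else
            pos += 1 + half;    // right child sits after the left subtree
        size = half;
    }
    return false;
}
```

Going left always touches the very next element, but going right jumps over the entire left subtree, and near the root those jumps span half the array, which is probably why the wins over plain binary search were so small.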
See here: https://www.researchgate.net/profile/Florian_Gross/publicati...
Along those lines:
* CSS Trees: Pointerless b-Trees with a layout optimized for cache lines (http://www.vldb.org/conf/1999/P7.pdf)
* Intel & Oracle's fast architecture-sensitive tree search (combines huge pages, cache line blocking, and SIMD in an optimal layout): http://www.researchgate.net/profile/Jatin_Chhugani/publicati...
* Adaptive radix trees (http://codematch.muehe.org/~leis/papers/ART.pdf)