Sorry for the blatant plug, but it might be relevant: in my M.Sc. thesis I surveyed main-memory-optimized index techniques and also gave some background on why traditional binary search is a poor fit for modern hardware. There are lots of illustrations and I cite some great material. :)
"Reading from L1 cache is like grabbing a piece of paper from your desk (3 seconds), L2 cache is picking up a book from a nearby shelf (14 seconds), main system memory is taking a 4-minute walk down the hall to buy a Twix bar, and waiting for a hard drive seek is like leaving the building to roam the earth for one year and three months." — Gustavo Duarte (http://duartes.org/gustavo/blog/post/what-your-computer-does...)
One of the main points from Herb Sutter's talk is that your prefetcher is a cache of infinite size. This is why radix sort destroys other in-memory sorting algorithms, and why how you access data is the single most important thing to focus on for performance.
It's also incredibly hard to retrofit when you find out you really need that performance back.
> One of the main points from Herb Sutter's talk is that your prefetcher is a cache of infinite size. This is why radix sort destroys other in-memory sorting algorithms
Can you elaborate? Radix sort seems like the most unpredictable sorting algorithm out there last time I checked... it jumps all over memory.
Radix sort on a fixed-width key is actually one of the most predictable sorting algorithms there is. In a traditional radix sort the complexity is O(kn), where k is the number of digits you split the key into. Each digit is b bits wide and gets 2^b buckets, so wider digits = fewer passes but bigger count tables, and narrower digits = less table space but more passes.
Say for sorting 32-bit values you might split the key into 4 byte-aligned digits. That gives you 4 passes, each with a 256-entry bucket table.
The sort then becomes a linear walk of the data doing the standard radix sort passes. This is where the prefetcher and packed arrays really start working to your advantage: the memory access pattern is trivially predictable, and the prefetcher eliminates essentially all of the DRAM cache misses. Some implementations add explicit prefetch hints, but it usually takes the hardware prefetcher only a few loop iterations to do just as well on its own. If you want, parts of the passes can be batched (e.g. building all of the histograms in one initial walk) to cut down the constant factor in that O(kn).
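To make that concrete, here's a minimal sketch of the byte-wise LSD variant described above (my own illustration, assuming unsigned 32-bit keys and a scratch buffer; the names are made up):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Byte-wise LSD radix sort for unsigned 32-bit keys: 4 passes, each
// consisting of one linear histogram read and one linear scatter read.
void radix_sort_u32(std::vector<uint32_t>& keys) {
    const std::size_t n = keys.size();
    std::vector<uint32_t> scratch(n);
    uint32_t* src = keys.data();
    uint32_t* dst = scratch.data();

    for (int pass = 0; pass < 4; ++pass) {
        const int shift = pass * 8;

        // Histogram pass: linear read of src; the 256 counters (~2 KB)
        // stay resident in L1.
        std::array<std::size_t, 256> count{};
        for (std::size_t i = 0; i < n; ++i)
            ++count[(src[i] >> shift) & 0xFF];

        // Prefix sum: turn counts into starting offsets for each bucket.
        std::size_t offset = 0;
        for (std::size_t b = 0; b < 256; ++b) {
            const std::size_t c = count[b];
            count[b] = offset;
            offset += c;
        }

        // Scatter pass: another linear read of src; writes go to 256
        // monotonically advancing cursors in the destination buffer.
        for (std::size_t i = 0; i < n; ++i)
            dst[count[(src[i] >> shift) & 0xFF]++] = src[i];

        std::swap(src, dst);
    }
    // An even number of passes leaves the sorted result back in `keys`.
}
```

Every pass reads the input strictly front to back, which is exactly the pattern the prefetcher loves; the only scattered writes are the 256 bucket cursors, and each of them only ever moves forward.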
Now if you want to get fancy you can look at your key and work out which digit width lines up nicely with your cache line size, and watch radix sort really scream.
This is used a lot in computer graphics for determining draw order, intersection tests, etc. (In fact, the book "Real-Time Collision Detection" is some of the best data structure literature I've read; it just happens to be framed around 2D/3D problem spaces.)
This is why understanding the implications of memory access is so important. Paraphrased from the talk: Big O notation is really a blunt tool and sometimes those constant factors can be incredibly painful.
[edit]
They also have the nice property of being stable which can have some useful benefits.
Strings are going to be bad anyway because there's a high chance you're hitting a cache miss just to get access to the heap allocated string data.
Yes, only the histogram pass has random writes, and if you pick your bucket size appropriately (like I mentioned above) the surface area for those random writes can land within a cache line or two.
Summing the histograms is a very nice linear read/write pass.
Name me another sorting algorithm where every read can be predicted several hundred cycles in advance, and writes go to O(1) locations that are read from within several hundred cycles. I don't know of any.
Remember: a cache miss costs easily several hundred cycles.
And, to ask a question: if radix sort is faster, and not due to better cache performance, what is it due to?
Another thing that matters enormously is branch misprediction. For example, super scalar sample sort beats GNU's (stupefyingly fast) std::sort by 35-50% on average. I benchmarked skewed quicksort vs std::sort vs super scalar sample sort in a totally artificial setting (sorting a permutation of 0 to n-1) for a lecture: https://github.com/lorenzhs/quicksort-pivot-imbalance
(Repository does not contain the super scalar sample sort code, I don't have the rights to publish that unfortunately. I used the authors' old Pentium 4 optimized code from the paper linked at the bottom (incl PDF).)
Edit: super scalar sample sort also has large data locality benefits over quicksort, the "wavy" line for ssssort shows the caches' impact. I haven't studied that aspect in detail yet.
Very cool. In my experience, using a skewed pivot doesn't actually help in practice. Outside of the case you mention (where the input is a random permutation), sampling must be used to choose a skewed pivot, which has a fairly significant overhead.
But using a skewed pivot is not my point. It would not be particularly clever to respond to the problem of branch mispredictions by doing suboptimal things. The solution is to use algorithms that avoid hard-to-predict branching and exploit data locality. Thus, super scalar sample sort, which is significantly faster than skewed quicksort even in the case where quicksort gets the pivot for free.
Regarding overhead for sampling: super scalar sample sort samples sqrt(n) elements, sorts them, and picks a fixed number of pivots from them (e.g. 256 in the implementation I used). But that is worth it, because no branches are needed to find which "bucket" (between which two pivots) an element belongs in; predicated instructions handle it. So the overhead is not the problem, and as you can see in the other plots (in /plots), ssssort executes more instructions. It is still significantly faster than the other algorithms because it exploits the properties of the CPU.
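To make the "no branches" part concrete, here's a rough sketch of the bucket-finding step (my own from-memory illustration of the implicit splitter tree, not the authors' code; assume 255 splitters defining 256 buckets):

```cpp
#include <cstdint>

// Branch-free bucket lookup: the 255 splitters are stored as an
// implicit binary search tree in splitter[1..255] (index 0 unused).
// The comparison result feeds an index update rather than a
// conditional jump, so the compiler can emit a conditional move /
// predicated instruction and there is nothing to mispredict.
inline unsigned find_bucket(uint32_t key, const uint32_t* splitter) {
    unsigned j = 1;                          // root of the implicit tree
    for (int level = 0; level < 8; ++level)  // log2(256) levels
        j = 2 * j + (key > splitter[j]);
    return j - 256;                          // leaves map to buckets 0..255
}
```

The splitter table is tiny, so it sits in L1, and because the loop has a fixed trip count it can be unrolled and interleaved across several elements for extra instruction-level parallelism.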
Sure. I didn't actually mean ssssort. Sometimes it's worth distinguishing between comparison-based sorts and those that make use of key structure, assume a fixed-size key universe, etc. For that reason, I think the sampling overhead is worth thinking about.
Deciding memory layout is a sub-task of designing cache-oblivious algorithms, which optimize the number of memory transfers between levels of the memory hierarchy. I found this very interesting: http://erikdemaine.org/papers/BRICS2002/paper.pdf
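For a flavor of the recursive style described there, here's a toy sketch (mine, not from the linked paper) of a cache-oblivious matrix transpose: keep splitting the larger dimension until a sub-block comfortably fits in whatever caches exist, without the code ever knowing the cache sizes.

```cpp
#include <cstddef>
#include <vector>

// Writes the transpose of the [r0, r1) x [c0, c1) block of the n x n
// row-major matrix `a` into `b` (also n x n, row-major).
void transpose_rec(const std::vector<double>& a, std::vector<double>& b,
                   std::size_t n, std::size_t r0, std::size_t r1,
                   std::size_t c0, std::size_t c1) {
    const std::size_t dr = r1 - r0, dc = c1 - c0;
    if (dr <= 16 && dc <= 16) {             // base case: block is cache-resident
        for (std::size_t r = r0; r < r1; ++r)
            for (std::size_t c = c0; c < c1; ++c)
                b[c * n + r] = a[r * n + c];
    } else if (dr >= dc) {                  // split the longer dimension
        transpose_rec(a, b, n, r0, r0 + dr / 2, c0, c1);
        transpose_rec(a, b, n, r0 + dr / 2, r1, c0, c1);
    } else {
        transpose_rec(a, b, n, r0, r1, c0, c0 + dc / 2);
        transpose_rec(a, b, n, r0, r1, c0 + dc / 2, c1);
    }
}

void transpose(const std::vector<double>& a, std::vector<double>& b, std::size_t n) {
    transpose_rec(a, b, n, 0, n, 0, n);
}
```

A naive row-by-row transpose of a large matrix misses the cache on nearly every write to `b`; the recursive version is transfer-efficient at every level of the hierarchy at once, which is the whole point of the cache-oblivious model.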
A while back I did some rough tests on searching arrays laid out in depth-first order. The hope was that the better contiguity (the next element in the array is the next element to test 50% of the time) would lead to better search performance, but I wasn't able to observe much of a difference in practice.
I also found that while writing the search operation was very easy for arrays in this order, insertion was much more difficult.
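For reference, the search itself looked roughly like this (a sketch from memory, assuming n = 2^k - 1 sorted keys written out in pre-order of a perfectly balanced BST):

```cpp
#include <cstddef>
#include <cstdint>

// Search an array holding a perfect BST in pre-order (depth-first)
// layout: root first, then the whole left subtree, then the right
// subtree. `n` must be 2^k - 1.
bool dfs_layout_contains(const uint32_t* a, std::size_t n, uint32_t key) {
    std::size_t pos = 0;        // index of the current subtree's root
    std::size_t size = n;       // number of nodes in the current subtree
    while (size > 0) {
        if (key == a[pos]) return true;
        const std::size_t half = size / 2;  // size of each child subtree
        if (key < a[pos])
            pos += 1;           // left child is the adjacent element
        else
            pos += 1 + half;    // right child sits after the left subtree
        size = half;
    }
    return false;
}
```

Going left always touches the very next element, but going right jumps over the entire left subtree, and near the root those jumps span half the array, which is probably why the wins over plain binary search were so small.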
See here: https://www.researchgate.net/profile/Florian_Gross/publicati...
Along those lines:
* CSS Trees: Pointerless b-Trees with a layout optimized for cache lines (http://www.vldb.org/conf/1999/P7.pdf)
* Intel & Oracle's fast architecture-sensitive tree search (combines huge pages, cache line blocking, and SIMD in an optimal layout): http://www.researchgate.net/profile/Jatin_Chhugani/publicati...
* Adaptive radix trees (http://codematch.muehe.org/~leis/papers/ART.pdf)