Log Structured Merge Trees (benstopford.com)
271 points by kushti on July 25, 2016 | 26 comments


>> Yahoo, for example, reports a steady progression from read-heavy to read-write workloads, driven largely by the increased ingestion of event logs and mobile data.

> Sorry if this is an ignorant question. Are these applications really that sensitive to write throughput? Event logs are only read out-of-band, so what's the rush? Seems like any benefits are offset by the GC stalls anyway.

There are two variables here: latency and volume.

I'm not sure how much of an issue latency is for event logs / analytics metrics (re: "what's the rush"), but if you've written, say, Pokemon Go with several million installs, there's a sheer volume of inbound data that - no matter how little you care about latency - leaves you with one of two options: throttle it and throw some of it away, or build something that can scale to handle that volume.

And the first option isn't really an option: throwing away players' gameplay progression will piss them off.


Hi. First author of the bLSM (Yahoo) paper here. You've pretty much nailed it: The trick is getting latency and volume at the same time. If you're willing to wait an hour to look at event logs, then Hadoop is good enough, but where's the fun in that?

At the time, we were looking for the best of both worlds: Hadoop and log processing give high throughput writes, but latencies are much too high to act on user intent within a single session. We had a lot of applications in mind that needed low latency access to event log data, but didn't have a suitable storage engine to back them.

Existing LSM trees suffered from write stalls and excessive read amplification. We were targeting hard disks back then, and didn't want to pay for the extra seeks. We got random read access down to about one seek per read, and figured out how to eliminate write stalls. Surprisingly, we still ended up with good write throughput.
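For context on the "one seek per read" point: the usual way an LSM tree avoids seeking into on-disk runs that don't contain a key is to keep a Bloom filter per run - a negative answer is definite, so most runs can be skipped without touching the disk. A toy sketch of the idea in Python (not the bLSM code):

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: k hash probes into a bit array.
        A 'no' answer is definite, so a negative lets a reader skip
        seeking into an on-disk run entirely."""
        def __init__(self, size_bits=8192, num_hashes=4):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _probes(self, key):
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.size

        def add(self, key):
            for p in self._probes(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._probes(key))

    bf = BloomFilter()
    bf.add("user:42")
    assert bf.might_contain("user:42")  # always True for added keys
    print(bf.might_contain("user:99"))  # usually False -> skip the seek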

More importantly, from an application perspective, we provide efficient range scans, which lets you reason about groups of data with matching or contiguous keys. If you squint hard enough, this is all MapReduce really does, except it precomputes all the answers up front, and we do it in real time. On the one hand we have lower throughput, since we perform more random reads / sequential writes, and also do more thread synchronization. On the other, we only perform the computations for data that's actually being used for something, which means we do a lot less work for the right applications.
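The range scan itself is conceptually simple: every run (memtable or on-disk) is sorted, so a scan is a k-way merge in which the newest run wins on duplicate keys. A toy sketch of the idea, not bLSM's actual code:

    import heapq

    def range_scan(runs, lo, hi):
        """k-way merge over sorted runs; runs[0] is newest.
        Each run is a sorted list of (key, value) pairs.
        On duplicate keys the newest run wins."""
        streams = []
        for age, run in enumerate(runs):
            streams.append((k, age, v) for k, v in run if lo <= k < hi)
        last_key = object()
        for key, age, value in heapq.merge(*streams):
            if key != last_key:  # first hit is the newest version
                yield key, value
                last_key = key

    memtable = [("b", 2), ("d", 9)]
    run1     = [("a", 1), ("d", 4)]  # older: d=4 is shadowed by d=9
    print(list(range_scan([memtable, run1], "a", "z")))
    # [('a', 1), ('b', 2), ('d', 9)]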

Thanks!


Why is sequential access still faster than random access on an SSD (the author claims this)? I understand why for mechanical magnetic media, but I don't understand why this is still a thing for SSDs. I remember reading about LSM trees a long time ago and thinking that, with the advent of cheap SSDs, this kind of optimization technique wouldn't be as relevant.

Is the advantage of sequential reading a product of the interface and protocol used to speak to the SSD device? Or is it actually a product of the physical media?

EDIT: I had written "sequential slower than random"; I meant to write the opposite.


Regardless of the physical medium (SSD vs. spinning disk), data is read in pages, in much the same way data from main memory is read into the CPU caches in chunks (called cache lines). Reading from either source byte by byte is extremely slow. Intuitively, then, reading adjacent bytes on a disk will be faster. No matter how fast reading a page is, reading one page is always going to be faster than reading two. SSDs make random access faster compared to a spinning disk (lower latency / seek time), but they can't break the laws of physics and make two seeks cost the same as one.
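A quick, unscientific way to see the page granularity from userspace is to time page-sized sequential vs. random reads (caveat: if the file fits in the page cache both will look fast, so use a file larger than RAM, or O_DIRECT, for a real measurement):

    import os, random, time, tempfile

    BLOCK = 4096
    N = 4096  # 16 MiB scratch file; use something larger than RAM for real numbers

    tmp_fd, path = tempfile.mkstemp()
    os.write(tmp_fd, os.urandom(BLOCK * N))
    os.close(tmp_fd)

    fd = os.open(path, os.O_RDONLY)

    def read_blocks(offsets):
        t0 = time.perf_counter()
        for off in offsets:
            os.pread(fd, BLOCK, off)  # one page-sized read per offset
        return time.perf_counter() - t0

    seq = [i * BLOCK for i in range(N)]
    rnd = seq[:]
    random.shuffle(rnd)

    print("sequential:", read_blocks(seq))
    print("random:    ", read_blocks(rnd))
    os.close(fd)
    os.remove(path)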


Just to clarify, I think you meant to ask why random access is still slower than sequential on SSDs.

Writes smaller than the SSD's erase size are more expensive per byte because of write amplification. This is not nearly as big a factor as for disks, but it does seem to matter even on enterprise flash if you benchmark properly (truly random writes, and enough data that clever controllers can't buffer it). And as my sibling comment mentioned, the write amplification wastes flash lifetime, too!

On enterprise flash, I've found that random reads top out at the same bandwidth as sequential reads, but it takes a LOT of parallelism to get there, so sequential reading, if possible, is still nice (let the controller do the parallelism for you!).


There are two effects at play. If you're reading a multiple of a flash page at a time, then you can get random bandwidth to match sequential. Under the hood, decent SSDs actually round robin sequential reads across all of the underlying hardware, so they can run as fast as whatever their bottleneck resource is. The bottleneck could be error correction, the SATA link, or the channels (copper traces) between the NAND and the controller, for example. Random reads with enough parallelism give the SSD enough scheduling freedom to have the same throughput as sequential.

The second effect comes from the fact that application data is rarely exactly the same size as a flash page. If you group all the related data on one page, then you don't waste time transferring bits you won't look at. B-Trees sort of do this, but they leave holes in pages so they can support efficient inserts (a good SSD or database will compress the holes, but that doesn't help random reads much, since it's still going to transfer entire flash pages to the SSD controller).

LSM-Trees push things a bit further, since new data ends up being grouped on higher levels of the tree. When I was benchmarking bLSM's predecessor, I worked out the expected throughput on the back of an envelope while I was waiting for results. It did much better than expected, since the benchmark heavily favored recently created data! YMMV, of course.
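If you want to reproduce the parallelism effect, a rough harness like this usually shows random-read throughput climbing with the number of in-flight requests (the path is a placeholder; the file should be much larger than RAM, and results vary a lot by device):

    import os, random, time
    from concurrent.futures import ThreadPoolExecutor

    BLOCK = 4096

    def bench(path, num_reads=20000, workers=32):
        """Issue independent 4 KiB preads from many threads; Python
        releases the GIL during the syscall, so the reads overlap and
        give the SSD controller scheduling freedom."""
        fd = os.open(path, os.O_RDONLY)
        size = os.fstat(fd).st_size
        offsets = [random.randrange(size // BLOCK) * BLOCK
                   for _ in range(num_reads)]
        t0 = time.perf_counter()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(lambda off: os.pread(fd, BLOCK, off), offsets))
        dt = time.perf_counter() - t0
        os.close(fd)
        return num_reads * BLOCK / dt / 1e6  # MB/s

    for w in (1, 4, 16, 64):
        print(w, "threads:", round(bench("/path/to/big_file", workers=w)), "MB/s")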


Oops, yeah, thanks - I meant to write the opposite; edited.


SSDs write at the page level. If you are flipping bits in a page, the drive has to read the page, update the bits you asked it to update, and then write that page somewhere else (for wear-leveling purposes).

Thus the cost of a write that doesn't completely replace a page is a per-page cost, not a per-byte cost, so if you issue very small writes you are going to get terrible performance. By extension, this also leads to write amplification from a wear perspective.
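The back-of-the-envelope arithmetic makes the point. Assuming a 16 KiB page (real NAND pages are typically 4-16 KiB, and erase blocks are far larger):

    def write_amplification(write_bytes, page_bytes=16 * 1024):
        """Bytes the device must physically program per byte the
        application asked to write, given per-page programming."""
        pages_touched = -(-write_bytes // page_bytes)  # ceiling division
        return pages_touched * page_bytes / write_bytes

    print(write_amplification(100))        # 163.84x for a 100-byte write
    print(write_amplification(16 * 1024))  # 1.0x for a page-aligned write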


Thanks, makes sense. Though there isn't the seek-time cost associated with magnetic media, there is the cost of rewriting the whole page.


I am curious how an inverted index could be implemented efficiently atop an LSM tree. Maybe batch parts of the posting list under a numeric key, and when performing an intersection between two lists, iterate using a range scan. Could work, but I wonder how fast read operations would be. Has anybody attempted this yet?


LSM trees are ideal for posting lists, because every time you add a new document, you need to touch potentially hundreds of different lists. LSM trees are able to batch all of these writes together. That's why Google originally developed BigTable, AFAIK.
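The write path batches naturally: new postings accumulate in a memtable and get flushed as one sorted, sequential run, instead of hundreds of in-place list updates. A toy sketch of the idea:

    from collections import defaultdict

    class PostingMemtable:
        """Buffer (term -> doc ids) updates in memory; one flush
        writes a single sorted run instead of touching hundreds of
        on-disk posting lists per document."""
        def __init__(self):
            self.postings = defaultdict(list)

        def add_document(self, doc_id, terms):
            for term in set(terms):
                self.postings[term].append(doc_id)

        def flush(self):
            # one sequential write of all buffered postings, sorted by term
            run = sorted((term, sorted(docs))
                         for term, docs in self.postings.items())
            self.postings.clear()
            return run

    mt = PostingMemtable()
    mt.add_document(1, ["log", "structured", "merge"])
    mt.add_document(2, ["merge", "tree"])
    print(mt.flush())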


In fact, Lucene (the library behind Solr and Elasticsearch) is basically a special-purpose inverted-index-as-LSM-tree database engine.


https://github.com/kragen/500lines/tree/master/search-engine explains a simple version of this in Python.


Done this years ago; it's still being used in production. In addition to the inverted lists, we have a bitmap per level, which tells whether an id is present in that level. All the bitmaps are merged in memory into a structure mapping each id to the last level it is valid in. Then the lists from different levels are merged using a binary min-heap, and entries that are not valid per the map are kept off the merge heap.
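For anyone trying to picture that, here's a rough sketch of the merge as described (the names and data layout are illustrative, not the production code):

    import heapq

    def merge_postings(levels, valid_level):
        """levels[i] is a sorted posting list (doc ids) for one term,
        level 0 newest. valid_level maps doc id -> the single level
        whose entry for that id is current (built up front by merging
        the per-level presence bitmaps). Entries from any other level
        are stale and get dropped during the heap merge."""
        streams = [((doc, lvl) for doc in postings)
                   for lvl, postings in enumerate(levels)]
        for doc, lvl in heapq.merge(*streams):
            if valid_level.get(doc) == lvl:
                yield doc

    level0 = [2, 5, 9]        # newest
    level1 = [2, 3, 9, 11]    # 2 and 9 here are stale versions
    valid = {2: 0, 3: 1, 5: 0, 9: 0, 11: 1}
    print(list(merge_postings([level0, level1], valid)))  # [2, 3, 5, 9, 11]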


Thanks for this. I really feel like I understand the algorithm, and more importantly, the tradeoffs involved, after reading that.

My company makes a product with a primary feature that's insanely read-heavy (something like a 10,000:1 read-to-write ratio) and a secondary feature that's exactly the opposite. We just use Postgres for both (and it's obviously the winner for the read-heavy workload), but I think I'll have to look more into this stuff.

One interesting thing to note about our write-heavy workload is that the reads are usually in chronological order, with a filter. Anyone know what that implies for LSM-based stores?


Tune the LSM tree's merge-ratio parameter.

If the LSM tree merges levels at a ratio of 1:10 or more, the result is fewer levels and faster reads, but merges will happen quite often relative to reads. If it merges levels at a ratio of 1:1.0001, there will be more levels and fewer merges - fewer delays in a write-heavy application.
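Roughly, the number of levels grows as the log of the data size, with the merge ratio as the base. A quick way to put numbers on the tradeoff (sizes are made up):

    import math

    def num_levels(total_entries, memtable_entries, fanout):
        """Approximate level count of a leveled LSM tree: each level
        is `fanout` times larger than the one above it."""
        return math.ceil(math.log(total_entries / memtable_entries, fanout))

    # 1 billion entries, 1M-entry memtable:
    print(num_levels(1e9, 1e6, 10))  # 3 levels -> fewer seeks per read
    print(num_levels(1e9, 1e6, 2))   # 10 levels -> more read amplification,
                                     # but each merge moves less data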


If reads are chronological, a plain journal is both simpler and faster. See Kafka for an example of this in practice.
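A minimal sketch of that degenerate case - an append-only journal scanned in write order with a filter (illustrative only, not how Kafka actually works):

    import json

    class Journal:
        """Append-only log; reads come back in write (chronological) order."""
        def __init__(self, path):
            self.path = path

        def append(self, record):
            with open(self.path, "a") as f:
                f.write(json.dumps(record) + "\n")

        def scan(self, predicate=lambda r: True):
            with open(self.path) as f:
                for line in f:
                    record = json.loads(line)
                    if predicate(record):
                        yield record

    log = Journal("/tmp/events.jsonl")
    log.append({"ts": 1, "user": "a", "event": "click"})
    log.append({"ts": 2, "user": "b", "event": "view"})
    print(list(log.scan(lambda r: r["user"] == "a")))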


> insanely read heavy

Then LMDB is the engine for you.


I wrote some Python bindings for the SQLite4 LSM store:

http://lsm-db.readthedocs.io
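From the docs there, usage looks roughly like this (sketched from the documented API - double-check against the README):

    from lsm import LSM  # pip install lsm-db

    db = LSM('/tmp/test.ldb')
    db['key'] = 'value'   # writes land in the in-memory tree first
    print(db['key'])

    # range scans come almost for free in an LSM store:
    for key, value in db['a':'z']:
        print(key, value)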


It would be more interesting to see more modern approaches with KV stores like ForestDB [1]. Couchbase has already replaced its storage engine with ForestDB [2].

[1]: https://www.computer.org/csdl/trans/tc/preprint/07110563.pdf
[2]: https://github.com/couchbase/forestdb


What's modern about it?


Not sure why you are getting downvoted, but thanks for the reference to the ForestDB paper.


In the section on read workloads the author states:

"External file: leave the data as a log/heap and create a separate hash or tree index into it."

I am confused by this, because a heap with a separate hash or tree index would be a "clustered index" and not a heap at all, correct? Or am I misunderstanding?


Wow, that thin font weight is hard to read. In Chrome: hit F12, then in the Styles tab change the font-weight to 400.


I use this bookmarklet to fix pages that use a thin font weight, which results in hard-to-read text:

javascript:(function(){var%20all=document.getElementsByTagName("*");for%20(var%20i=0,max=all.length;i<max;i++){all[i].style.fontWeight='400';};void(0);})();

Fixes those broken pages to be quite readable.


I sent the guy a note last night and he changed it. Very easy to read now.



