You're asking for two major tech steppings in a single product. We don't even have many mass-market NVMe drives yet, and that aside, they're undoubtedly pushing the limits of their existing controller tech just to handle the capacity. Has NVMe seen much enterprise adoption yet? It's no surprise they chose an interface to match their intended customers' existing product lines.
We put a couple of Intel 750s in our primary DB server and so many of our issues just went away instantly. Reading at 2GB/s is amazing. On a 10Gb network, our network backups now run at 1+GB/s. Of course we optimize our DB queries as much as possible, but sometimes you just hit a brick wall and can't speed things up because of how the data is structured. Instead of spending 100 developer/DBA hours reorganizing our tables for some query that runs once a week, we put in $2,500 of drives and solved more problems than I imagined.
The easiest problems are the ones that go away if you throw money at them, and NVMe drastically expands the set of such problems. Most small-to-mid-sized companies have DBs in the 10GB-1TB range. If you have a single table that's 100GB in size, you can parse through every single row in just under a minute! This means you can actually use an easy-to-implement O(n) algorithm instead of trying to make O(1) or O(log n) fit your problem. NVMe SSDs are not that advantageous for companies that are built to scale horizontally on AWS. They are amazing when you have a monolithic DB that you can't partition/shard/cluster easily.
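To put numbers on it: 100GB at ~2GB/s sequential read is about 50 seconds. A minimal sketch of that kind of brute-force O(n) scan in C (the file path and chunk size are placeholders; it assumes the table is stored as one flat file):

```c
/* Back-of-envelope: 100 GB / ~2 GB/s sequential ≈ 50 s.
 * Sketch of an O(n) full scan, assuming the table is a flat file.
 * "/data/table.dat" is a placeholder path. Error handling is minimal. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(void) {
    int fd = open("/data/table.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    size_t bufsz = 8 << 20;              /* 8 MB chunks keep the device busy */
    char *buf = malloc(bufsz);
    long long total = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    ssize_t n;
    while ((n = read(fd, buf, bufsz)) > 0)
        total += n;                      /* parse/scan the rows here */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lld bytes in %.1f s = %.2f GB/s\n", total, secs, total / secs / 1e9);
    free(buf);
    close(fd);
    return 0;
}
```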
> If you have a single table that's 100GB in size, you can parse through every single row in just under a minute!
You're forgetting seek latency. It's orders of magnitude better with SSD, but it's still not necessarily zero. Depending on how the data is laid out and queried, you can pay the seek cost per row, which, multiplied across the rows of a 100GB+ table, isn't trivial: even at ~80us per random read, 100 million rows works out to roughly 8,000 seconds of pure seeking.
It's a throughput problem, not a latency problem. Normally, since rows are a fixed size, the DBMS will lay them out sequentially on disk. So when the database reads from disk, unless you read every column in the row, the DBMS has to skip over data. This is a high-level picture, but hopefully it illustrates the point.
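Roughly something like this (a sketch, not any particular DBMS's actual layout; the field names are made up): with fixed-width rows you can compute the offset of any column, but "skipping" at the logical level doesn't mean the disk skips anything.

```c
/* Sketch of a fixed-width row layout. Field names are hypothetical. */
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

struct row {                  /* 64 bytes, fixed width */
    int64_t id;
    int64_t user_id;
    int64_t created_at;
    char    payload[40];
};

/* Read only the user_id column of row i. The pread pattern logically
 * skips the other columns, but underneath, the kernel still transfers
 * whole pages (or readahead-sized chunks) from the device. */
int64_t read_user_id(int fd, long i) {
    int64_t v = 0;
    off_t off = (off_t)i * sizeof(struct row) + offsetof(struct row, user_id);
    pread(fd, &v, sizeof v, off);
    return v;
}
```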
Nearly all of the tick-data (>4GB/day) databases I've used aren't laid out row-oriented.
Even in the absence of variable-width fields, the presence of nullable fields causes the majority of database tables to have variable-width rows. In any case, neither of these are reasons why common databases do or do not lay rows out sequentially on disk (some do, some don't).
Even if the DB server selectively read columns of each row (none of the common open-source SQL databases do), it would do so via the OS, which works in terms of pages. Reading a single byte of a page will cause a minimum of 4KB of IO to the disk.
Now, unless you're using a DB server that uses O_DIRECT or POSIX_FADV_RANDOM (I just checked, and Postgres doesn't), Linux will aggressively read ahead at least 128KB (it's tunable) for any random read by default. So even if userspace issues a one-byte read to the kernel, device IO will still only occur in minimum 128KB chunks, with the remainder living in the page cache until userspace requests it.
Database servers are additionally very likely to have their own larger-than-a-byte buffers in order to avoid system call overhead, so the requests they make are never going to be quite that small.
The logic being that in the days of spinning media, evicting 124KB of cold page cache in favour of avoiding a seek a few microseconds later was definitely worth it (a seek being a ~14ms stall on rotating disks).
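For the curious, the two opt-outs look roughly like this (a sketch; "table.dat" is a placeholder and error handling is omitted):

```c
/* Two ways to opt out of the kernel's default ~128 KB readahead
 * (the default lives in /sys/block/<dev>/queue/read_ahead_kb). */
#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* Option 1: keep the page cache, but tell the kernel the access
     * pattern is random, which disables readahead on this fd. */
    int fd = open("table.dat", O_RDONLY);
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    /* Option 2: bypass the page cache entirely. O_DIRECT requires
     * block-aligned buffers, offsets, and lengths. */
    int dfd = open("table.dat", O_RDONLY | O_DIRECT);
    void *buf;
    posix_memalign(&buf, 4096, 4096);   /* 4 KB aligned buffer */
    pread(dfd, buf, 4096, 0);           /* exactly one page of device IO */

    free(buf);
    close(fd);
    close(dfd);
    return 0;
}
```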
> Even if the DB server selectively read columns of each row (none of the common open-source SQL databases do), it would do so via the OS, which works in terms of pages. Reading a single byte of a page will cause a minimum of 4KB of IO to the disk.
This is why I said it was high level, but hopefully it illustrated the point. In addition to the disk page size, you also have all the various metadata associated with the file(s), so reading a byte from a page can imply reading even more data than the block size (currently 4KiB).
> Now, unless you're using a DB server that uses O_DIRECT or POSIX_FADV_RANDOM (I just checked, and Postgres doesn't), Linux will aggressively read ahead at least 128KB (it's tunable) for any random read by default. So even if userspace issues a one-byte read to the kernel, device IO will still only occur in minimum 128KB chunks, with the remainder living in the page cache until userspace requests it.
AFAIK, Linux only reads ahead if it detects a sequential access pattern, or if you specify POSIX_FADV_SEQUENTIAL (which doubles the normal readahead window). But as far as the query is concerned, all of the data read that isn't needed is effectively subtracted from the overall throughput.
I was trying to illustrate the importance of seek latency (~80us vs. ~9-14ms), but yes, there are a myriad of other concerns when you're trying to maximize disk throughput.
It doesn't have to skip over data if that would slow it down. I would expect your typical database to have some kind of index or bitmap that can tell it what to grab, fast enough to saturate the disk while avoiding unused data, but if it has to fall back to vacuuming up 1GB at a time, then so be it.
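Something in the spirit of a bitmap-driven scan, sketched out (not any real database's code; the page size and maximum run length are arbitrary): adjacent matching pages get coalesced into one big sequential read, so the disk stays saturated while unused pages are skipped.

```c
/* Sketch: an index produces a bitmap of matching pages; runs of
 * adjacent set bits are coalesced into single large reads. */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE 8192             /* typical database page size */

void bitmap_scan(int fd, const uint8_t *bitmap, long npages) {
    char *buf = malloc(64 * PAGE);       /* read up to 64 pages at once */
    for (long i = 0; i < npages; ) {
        if (!(bitmap[i / 8] & (1 << (i % 8)))) { i++; continue; }
        long run = i;                    /* extend across adjacent matches */
        while (run < npages && run - i < 64 &&
               (bitmap[run / 8] & (1 << (run % 8))))
            run++;
        pread(fd, buf, (size_t)(run - i) * PAGE, (off_t)i * PAGE);
        /* ... process pages i .. run-1 from buf here ... */
        i = run;
    }
    free(buf);
}
```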
If you want to go even faster, simply add as much RAM to your machine as your tables will consume when they're all in cache. That too is one of those tricks that makes problems just 'go away'; it does still require a periodic flush, but that can happen in a totally transparent fashion.
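You can even make the warm-up after a restart transparent by asking the kernel to prefetch the whole file into the page cache. A sketch (the path is a placeholder; it assumes RAM comfortably exceeds the file size):

```c
/* Sketch: pre-warm the page cache so the "everything in RAM" steady
 * state arrives sooner after a restart. The path is hypothetical. */
#include <fcntl.h>
#include <sys/stat.h>

int prewarm(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    fstat(fd, &st);
    /* Ask the kernel to fetch the whole file into the page cache;
     * this returns immediately and the IO happens in the background. */
    posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
    return fd;
}
```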
Have you tried to buy one in the past year? If you're not in the Fortune 100, fat effing chance! Here's how it goes: Samsung announces an NVMe product, companies beat down their door screaming "take my money!", and Samsung conveniently "cancels" the product for sale. They made it; they just sold their entire production run to a handful of customers. Maybe, just maybe, you can get a couple hundred units if you're willing to wait 3-4 months and someone returns some stock or changes an order after delivery (and you get the returns).
Even second-tier (non-Intel, non-Samsung) suppliers are sold out. About the only thing you can buy right now is HGST, because no one wanted their stuff in the first place, and they've jacked up their prices in response to other vendors' product shortages.
Yes, NVMe is on fire right now. Everyone wants it. I wouldn't put new tech like this into a 3-4 year old system, though; that's sunk cost fallacy territory. NVMe is also not exactly "new": the spec is already at version 1.2 (or 1.3), and Intel has gone through two major NVMe product revisions (with the third out in three months).
Also, Samsung isn't exactly breaking 3D NAND ground here. Novachips did it last year, and with an NVMe interface, too.
Yep, exactly this. You can generally find the add-on cards available, but most shops don't like them, for replacement and drives-per-box reasons. They also don't fit in some of today's high-density, multi-node server chassis.
SFF-8639 (or U.2, as the branding goes) is the way forward.