You're asking for two major tech steppings in a single product. We don't even have many mass-market NVMe drives yet, and that aside, they're undoubtedly pushing the limits of their existing controller tech just to handle the capacity. Has NVMe seen much enterprise adoption yet? It's no surprise they chose an interface to match their intended customers' existing product lines.
We put a couple of Intel 750s in our primary DB server and so many of our issues just went away instantly. Reading at 2GB/s is amazing. On a 10Gb network, our network backups now run at 1+GB/s. Of course we optimize our DB queries as much as possible, but sometimes you just hit a brick wall and can't speed things up because of how the data is structured. Instead of spending 100 developer/DBA hours reorganizing our tables for some query that runs once a week, we put in $2,500 of drives and solved more problems than I imagined.
The easiest problems are the ones that go away if you throw money at them, and NVMe drastically expands the set of such problems. Most small-to-mid-sized companies have DBs in the 10GB-1TB range. If you have a single table that's 100GB in size, you can parse through every single row in just under a minute! This means you can actually use an easy-to-implement O(n) algorithm instead of trying to make O(1) or O(log n) fit your problem. NVMe SSDs are not that advantageous for companies that are built to scale horizontally on AWS. They are amazing when you have a monolithic DB that you can't partition/shard/cluster easily.
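To put numbers on it: 100GB at ~2GB/s sequential read is about 50 seconds. A minimal sketch of that kind of brute-force O(n) scan in C (the file path and chunk size are placeholders; it assumes the table is stored as one flat file):

```c
/* Back-of-envelope: 100 GB / ~2 GB/s sequential ≈ 50 s.
 * Sketch of an O(n) full scan, assuming the table is a flat file.
 * "/data/table.dat" is a placeholder path. Error handling is minimal. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

int main(void) {
    int fd = open("/data/table.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    size_t bufsz = 8 << 20;              /* 8 MB chunks keep the device busy */
    char *buf = malloc(bufsz);
    long long total = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    ssize_t n;
    while ((n = read(fd, buf, bufsz)) > 0)
        total += n;                      /* parse/scan the rows here */

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lld bytes in %.1f s = %.2f GB/s\n", total, secs, total / secs / 1e9);
    free(buf);
    close(fd);
    return 0;
}
```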
> If you have a single table that's 100GB in size, you can parse through every single row in just under a minute!
You're forgetting seek latency. It's orders of magnitude better with SSD, but it's still not necessarily zero. Depending on how the data is laid out and queried, you can pay the seek cost per row, which, multiplied across the rows of a 100GB+ table, isn't trivial: even at ~80us per random read, 100 million rows works out to roughly 8,000 seconds of pure seeking.
It's a throughput problem, not a latency problem. Normally, since rows are a fixed size, the DBMS will lay them out sequentially on disk. So when the database reads from disk, unless you read every column in the row, the DBMS has to skip over data. This is a high-level picture, but hopefully it illustrates the point.
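Roughly something like this (a sketch, not any particular DBMS's actual layout; the field names are made up): with fixed-width rows you can compute the offset of any column, but "skipping" at the logical level doesn't mean the disk skips anything.

```c
/* Sketch of a fixed-width row layout. Field names are hypothetical. */
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

struct row {                  /* 64 bytes, fixed width */
    int64_t id;
    int64_t user_id;
    int64_t created_at;
    char    payload[40];
};

/* Read only the user_id column of row i. The pread pattern logically
 * skips the other columns, but underneath, the kernel still transfers
 * whole pages (or readahead-sized chunks) from the device. */
int64_t read_user_id(int fd, long i) {
    int64_t v = 0;
    off_t off = (off_t)i * sizeof(struct row) + offsetof(struct row, user_id);
    pread(fd, &v, sizeof v, off);
    return v;
}
```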
Nearly all of the tick-data (>4GB/day) databases I've used aren't laid out row-oriented.
Even in the absence of variable-width fields, the presence of nullable fields causes the majority of database tables to have variable-width rows. In any case, neither of these are reasons why common databases do or do not lay rows out sequentially on disk (some do, some don't).
Even if the DB server selectively read columns of each row (none of the common open-source SQL databases do), it would do so via the OS, which works in terms of pages. Reading a single byte of a page will cause a minimum of 4KB of IO to the disk.
Now, unless you're using a DB server that uses O_DIRECT or POSIX_FADV_RANDOM (I just checked, and Postgres doesn't), Linux will aggressively read ahead at least 128KB (it's tunable) for any random read by default. So even if userspace issues a one-byte read to the kernel, device IO will still only occur in minimum 128KB chunks, with the remainder living in the page cache until userspace requests it.
Database servers are additionally very likely to have their own larger-than-a-byte buffers in order to avoid system call overhead, so the requests they make are never going to be quite that small.
The logic being that in the days of spinning media, evicting 124KB of cold page cache in favour of avoiding a seek a few microseconds later was definitely worth it (a seek being a ~14ms stall on rotating disks).
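For the curious, the two opt-outs look roughly like this (a sketch; "table.dat" is a placeholder and error handling is omitted):

```c
/* Two ways to opt out of the kernel's default ~128 KB readahead
 * (the default lives in /sys/block/<dev>/queue/read_ahead_kb). */
#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    /* Option 1: keep the page cache, but tell the kernel the access
     * pattern is random, which disables readahead on this fd. */
    int fd = open("table.dat", O_RDONLY);
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    /* Option 2: bypass the page cache entirely. O_DIRECT requires
     * block-aligned buffers, offsets, and lengths. */
    int dfd = open("table.dat", O_RDONLY | O_DIRECT);
    void *buf;
    posix_memalign(&buf, 4096, 4096);   /* 4 KB aligned buffer */
    pread(dfd, buf, 4096, 0);           /* exactly one page of device IO */

    free(buf);
    close(fd);
    close(dfd);
    return 0;
}
```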
> Even if the DB server selectively read columns of each row (none of the common open-source SQL databases do), it would do so via the OS, which works in terms of pages. Reading a single byte of a page will cause a minimum of 4KB of IO to the disk.
This is why I said it was high level, but hopefully it illustrated the point. In addition to the disk page size, you also have all the various metadata associated with the file(s), so reading a byte from a page can imply reading even more data than the block size (currently 4KiB).
> Now, unless you're using a DB server that uses O_DIRECT or POSIX_FADV_RANDOM (I just checked, and Postgres doesn't), Linux will aggressively read ahead at least 128KB (it's tunable) for any random read by default. So even if userspace issues a one-byte read to the kernel, device IO will still only occur in minimum 128KB chunks, with the remainder living in the page cache until userspace requests it.
AFAIK, Linux only reads ahead if it detects a sequential access pattern, or if you specify POSIX_FADV_SEQUENTIAL (which doubles the normal readahead window). But as far as the query is concerned, all of the data read that isn't needed is effectively subtracted from the overall throughput.
I was trying to illustrate the importance of seek latency (~80us vs. ~9-14ms), but yes, there are a myriad of other concerns when you're trying to maximize disk throughput.
It doesn't have to skip over data if that would slow it down. I would expect your typical database to have some kind of index or bitmap that can tell it what to grab, fast enough to saturate the disk while avoiding unused data, but if it has to fall back to vacuuming up 1GB at a time, then so be it.
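Something in the spirit of a bitmap-driven scan, sketched out (not any real database's code; the page size and maximum run length are arbitrary): adjacent matching pages get coalesced into one big sequential read, so the disk stays saturated while unused pages are skipped.

```c
/* Sketch: an index produces a bitmap of matching pages; runs of
 * adjacent set bits are coalesced into single large reads. */
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE 8192             /* typical database page size */

void bitmap_scan(int fd, const uint8_t *bitmap, long npages) {
    char *buf = malloc(64 * PAGE);       /* read up to 64 pages at once */
    for (long i = 0; i < npages; ) {
        if (!(bitmap[i / 8] & (1 << (i % 8)))) { i++; continue; }
        long run = i;                    /* extend across adjacent matches */
        while (run < npages && run - i < 64 &&
               (bitmap[run / 8] & (1 << (run % 8))))
            run++;
        pread(fd, buf, (size_t)(run - i) * PAGE, (off_t)i * PAGE);
        /* ... process pages i .. run-1 from buf here ... */
        i = run;
    }
    free(buf);
}
```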
If you want to go even faster, simply add as much RAM to your machine as your tables will consume when they're all in cache. That too is one of those tricks that makes problems just 'go away'; it does still require a periodic flush, but that can happen in a totally transparent fashion.
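You can even make the warm-up after a restart transparent by asking the kernel to prefetch the whole file into the page cache. A sketch (the path is a placeholder; it assumes RAM comfortably exceeds the file size):

```c
/* Sketch: pre-warm the page cache so the "everything in RAM" steady
 * state arrives sooner after a restart. The path is hypothetical. */
#include <fcntl.h>
#include <sys/stat.h>

int prewarm(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    fstat(fd, &st);
    /* Ask the kernel to fetch the whole file into the page cache;
     * this returns immediately and the IO happens in the background. */
    posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
    return fd;
}
```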
Have you tried to buy one in the past year? If you're not in the Fortune 100, fat effing chance! Here's how it goes: Samsung announces an NVMe product, companies beat down their door screaming "take my money!", and Samsung conveniently "cancels" the product for sale. They made it; they just sold their entire production run to a handful of customers. Maybe, just maybe, you can get a couple hundred units if you're willing to wait 3-4 months and someone returns some stock or changes an order after delivery (and you get the returns).
Even second-tier (non-Intel, non-Samsung) suppliers are sold out. About the only thing you can buy right now is HGST, because no one wanted their stuff in the first place, and they've jacked up their prices in response to other vendors' product shortages.
Yes, NVMe is on fire right now. Everyone wants it. I wouldn't put new tech like this into a 3-4 year old system, though; that's sunk cost fallacy territory. NVMe is also not exactly "new": the spec is already at version 1.2 (or 1.3), and Intel has gone through two major NVMe product revisions (with the third out in three months).
Also, Samsung isn't exactly breaking 3D NAND ground here. Novachips did it last year, and with an NVMe interface, too.
Yep, exactly this. You can generally find the add-on cards available, but most shops don't like them, for replacement and drives-per-box reasons. They also don't fit in some of today's high-density, multi-node server chassis.
SFF-8639 (or U.2, as the branding goes) is the way forward.