It's both. Basic things like submitting a command to the drive require fewer round trips with NVMe than with AHCI+SATA, allowing for lower latency and lower CPU overhead. But the raw throughput advantage of multiple PCIe lanes each running at 8Gbps or higher over a single SATA link at 6Gbps is far more noticeable.
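For a rough sense of scale, here's a back-of-the-envelope comparison; the Gen3 x4 link is an assumption (a common NVMe configuration), and real-world throughput comes in somewhat lower once protocol overhead beyond line encoding is accounted for:

    # Simplified throughput comparison: SATA III vs a PCIe 3.0 x4 link.
    # Only line-encoding overhead is modeled; actual numbers run lower.
    sata3_payload_gbps = 6.0 * 8 / 10            # 8b/10b encoding -> ~4.8 Gbps
    pcie3_lane_payload_gbps = 8.0 * 128 / 130    # 128b/130b encoding per lane
    lanes = 4                                    # assumed link width
    print(f"SATA III:    ~{sata3_payload_gbps / 8:.2f} GB/s")
    print(f"PCIe 3.0 x4: ~{lanes * pcie3_lane_payload_gbps / 8:.2f} GB/s")

That works out to roughly 0.6 GB/s vs ~3.9 GB/s before any higher-level protocol overhead.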
I get that, but with NVMe being designed from the ground up specifically for SSDs, wouldn't using it for an HDD add extra overhead for the controller to deal with, negating any theoretical protocol advantages?
NVMe as originally conceived was still based around the block storage abstraction implemented by hard drives. Any SSD you can buy at retail is still fundamentally emulating classic hard drive behavior, with some optional extra functionality to allow the host and drive to cooperate better (e.g. Trim/Deallocate). But out of the box, you're still dealing with reading and writing to 512-byte LBAs, so there's not actually much that needs to be added back in to make NVMe work well for hard drives.
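To illustrate how thin the abstraction is: from the host's point of view, reading a logical block from an NVMe namespace looks exactly like reading one from a SATA disk; only the transport underneath differs. A minimal sketch (the Linux device paths are just examples, and it needs root to run against real devices):

    import os

    LBA_SIZE = 512      # classic logical block size; many drives also offer 4096
    lba = 2048          # arbitrary example block

    # The same code path works for NVMe and SATA devices alike: the kernel's
    # block layer exposes both as a flat array of logical blocks.
    for dev in ("/dev/nvme0n1", "/dev/sda"):
        try:
            fd = os.open(dev, os.O_RDONLY)
            data = os.pread(fd, LBA_SIZE, lba * LBA_SIZE)
            print(f"{dev}: read {len(data)} bytes at LBA {lba}")
            os.close(fd)
        except OSError as err:
            print(f"{dev}: {err}")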
The low-level advantages of NVMe 1.0 were mostly about reducing overhead and improving scalability in ways that were not strictly necessary when dealing with mechanical storage and were not possible without breaking compatibility with old storage interfaces. Nothing about, e.g., the command submission and completion queue structures inherently favors SSDs over hard drives, except that allowing multiple queues per drive, each supporting queue depths of hundreds or thousands of commands, is a bit silly in the context of a single hard drive (because you never actually want the OS to enqueue 18 hours worth of IO at once).
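Where the "hours of IO" figure comes from, roughly: the NVMe spec allows up to 65,535 I/O queues with up to 65,536 entries each. The queue count and HDD IOPS below are illustrative assumptions, but they show the order of magnitude:

    # How long would a hard drive take to drain a heavily-provisioned NVMe queue setup?
    queues = 128                 # e.g. one queue per CPU core (assumption)
    depth = 65536                # maximum entries per queue (per the NVMe spec)
    hdd_random_iops = 150        # ballpark 7200rpm random-read rate (assumption)

    commands = queues * depth
    hours = commands / hdd_random_iops / 3600
    print(f"{commands:,} queued commands ~ {hours:.0f} hours of random IO")

That prints on the order of 16 hours; tweak the assumptions and the number moves around, but it stays absurdly long relative to what a single hard drive can usefully schedule.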
> because you never actually want the OS to enqueue 18 hours worth of IO at once
As a thought experiment, I think there are use cases for this kind of thing for a hard drive.
The very nature of a hard drive is that sometimes accessing certain data happens to be very cheap - for example, if the head just happens to pass over a block of data on the way to another block of data I asked to read. In that case, the read of the block it passed over was 'free'.
If the drive API could represent this, then very low-priority operations, like reading and compressing dormant data, defragmentation, error-checking existing data, rebuilding RAID arrays, etc., might benefit from such a long queue. Pretty much, a super long queue of "read this data only if you can do so without delaying the actual high-priority queue".
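A toy sketch of what that policy might look like (entirely hypothetical - it models the platter as a one-dimensional sweep of LBAs and treats anything lying along the path to the next high-priority request as 'free'):

    # Hypothetical elevator-style pass: low-priority reads are served only if
    # their LBA lies on the path the head already has to travel to reach the
    # next high-priority request, so they cost (almost) nothing extra.
    def plan_sweep(head_pos, next_urgent_lba, low_prio_lbas):
        lo, hi = sorted((head_pos, next_urgent_lba))
        freebies = sorted(lba for lba in low_prio_lbas if lo <= lba <= hi)
        if next_urgent_lba < head_pos:
            freebies.reverse()           # head is sweeping downward instead
        return freebies + [next_urgent_lba]

    # Head at LBA 1000, urgent read at 9000; a scrubber wants 500, 4000, 7500.
    print(plan_sweep(1000, 9000, [500, 4000, 7500]))   # -> [4000, 7500, 9000]

Real drives have to worry about rotational position and per-track geometry rather than a flat LBA line, so this is only the shape of the idea.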
When a drive only has one actuator for all of the heads, there's only a little bit of throughput to be gained from Native Command Queueing, and that only requires a dozen or so commands in the queue. What you're suggesting goes a little further than just plain NCQ, but I'd be surprised if it could yield more than another 5% throughput increase even in the absence of high-priority commands.
But the big problem with having the drive's queue contain a full second's worth of work or more (let alone the hours possible with NVMe at hard drive speeds) is that you start needing the ability to cancel or re-order/re-prioritize commands that have already been sent to the drive, unless you're working in an environment with absolutely no QoS targets whatsoever. The drive is the right place for scheduling IO at the millisecond scale, but over longer time horizons it's better to leave things to the OS, which may be able to fulfill a request using a different drive in the array, or provide some feedback/backpressure to the application, or simply have more memory available for buffering and combining operations.
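A rough sketch of that host-side arrangement: keep the long queue in host memory, where requests can still be reordered or dropped, and hand the drive only a small window of work at a time (the in-flight budget below is an assumed figure, not anything from a spec):

    import heapq

    MAX_INFLIGHT = 16      # assumed per-drive budget: milliseconds of work, not hours

    class HostScheduler:
        """Holds the long-horizon queue in host memory; only a small window
        of commands is ever outstanding at the drive."""
        def __init__(self):
            self.pending = []          # min-heap of (priority, seq, request)
            self.inflight = 0
            self.seq = 0               # tie-breaker keeps submission order stable

        def submit(self, request, priority):
            heapq.heappush(self.pending, (priority, self.seq, request))
            self.seq += 1

        def dispatch(self, send_to_drive):
            # Top up the drive's queue without ever exceeding the budget.
            while self.inflight < MAX_INFLIGHT and self.pending:
                _, _, req = heapq.heappop(self.pending)
                send_to_drive(req)     # actual submission (io_uring, libaio, ...)
                self.inflight += 1

        def complete(self):
            self.inflight -= 1         # call from the completion path, then dispatch again

Anything still sitting in `pending` can be cancelled, re-prioritized, or redirected to another drive in the array, which is exactly what you lose once a command has been pushed down to the device.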
There are definitely use cases, but they're quite niche. Magnetic storage is still far cheaper per TB than solid state. Also, depending on workload, magnetic can handle heavy writes better. SATA is a dead man walking, with no plans for SATA IV or V.
HDD manufacturers get to keep selling the same tech with a different interface. From an end user's perspective, a drive like this lets you buy future-proof server equipment with the newer interfaces. You can take the plunge to full SSDs once the market provides what you need.
Curiously, this appears to already exist, but hard drives implement it kind of backwards. You might find comments on https://reviews.freebsd.org/D26912 interesting.