Anatomy of a Solid-state Drive

Luyt · on Oct 20, 2012

TRIM/UNMAP. A common problem with SSDs is that a host file system could erase free data but didn't have a way of telling the storage device it no longer needed that data. The TRIM/UNMAP interface lets the SSD clear the LBA (logical block addressing) entries in the FTL, giving it more free space to use in garbage collection and reducing write amplification. OS/X, Microsoft Windows, and Linux have implemented TRIM.

FreeBSD has TRIM support, too.

jethroalias97 · on Oct 20, 2012

I feel like what they are describing is more defragmentation than garbage collection. The article uses garbage collection to mean taking sparsely filled blocks and consolidating them into a single block to free the sparsely filled ones. Garbage collection in the OO context is where you free memory the program can no longer interact with. Perhaps it means different things in different contexts?

wmf · on Oct 20, 2012

A sparsely-filled block contains some valid data and some old invalid data, i.e. garbage.

JoeAltmaier · on Oct 19, 2012

A picture would have been worth 1000 words here.

revelation · on Oct 20, 2012

I wonder why there aren't more products that use DDR3 memory for storage. Constant power supply and a reasonable small emergency battery seem no-brainers for server operations and you don't have these creeping failure modes as with SSD. Is it the lack of error correction? The need to refresh the state?

wmf · on Oct 20, 2012

DRAM is so low-density and complex to use safely that it's really not worth it; flash works fine.

weaksauce · on Oct 20, 2012

lack of permanence really. I have seen servers that go down for more than a few hours or even a few days. It's basically a game of chicken with your data. What happens when your power goes out for longer than you can provide backup supply to? You lose all your data that was on the DDR3.

The basic way that they deal with that is to use a two phase system. Lots of DDR3 and servers running a caching layer on them(memcached, etc...). You can buffer your writes in ram as well but that as always is a tradeoff.

ghshephard · on Oct 20, 2012

"What happens when your power goes out for longer than you can provide backup supply to?" - You flush to Spinning disk.

tisme · on Oct 20, 2012

Tough to flush to spinning disk after your power goes down. So you'd have to do it continuously, which would remove most of the performance gains.

ghshephard · on Oct 20, 2012

You include a battery in the package. You only need enough power to spin up the disk, flush from your RAM cache, and then power down. In fact, the DDR3 Disk Drive could include all three, RAM+Battery+Disk. Say, 64 GB RAM in DDR3 PC3-10600 1024Meg x 64 ( Crucial Part #: CT2KIT102464BA1339) for $600, 2.5" 80 Seagate ST980815A for $44, and a 10.8v, 4800mah (overkill for flushing, but they are cheap) PA3534U-1BRS battery for $19.81.

Add a charging element for battery $3.75, a controller for $5, Ram Sockets board for $4.25, Disk Interface connector for $2.00, case for $7.00, assembly for $6.50, assorted screws/packaging for $1.00.

You could sell a 64 GB DDR3 disk with backup disk for $693.5, 87% of which would be the cost of the memory itself - large such systems would be mostly the cost of the memory, as the other components (disk) don't increase in cost much, and except for the ram sockets, none of the others increase in cost at all.

Battery would need to be swapped out every four years, or so, but that would only be $20 cost each time.

gus_massa · on Oct 21, 2012

Looks good, but people will forget to change the batteries and then fill the forums with complains when they lost all their data.

wmf · on Oct 20, 2012

You keep a capacitor charged with enough energy to write the DRAM back to flash after the power goes out.

akandiah · on Oct 20, 2012

There is an application of DDR RAM in storage. It was developed by Gigabyte: http://en.wikipedia.org/wiki/I-RAM

ck2 · on Oct 19, 2012

Does it make sense anymore for SSD to go through a hard drive interface?

Even SATA3 cannot keep up with a raid of SSD.

wmf · on Oct 20, 2012

Not really, hence native PCIe SSDs. http://www.anandtech.com/show/6371/micron-p320h-pcie-ssd-700...

The hard disk form factor does make hot swapping easier, though.

ChuckMcM · on Oct 20, 2012

No it doesn't. It never did. However from a marketing / market acceptance perspective, it was a requirement. When you introduce a technology like flash you have to ask "Who is going to use this?" and "How?" The first mass market winner for flash was smartphones and digital cameras. Remember when Apple bought all the available Flash to launch the second generation iPhone? That flash capacity wasn't needed when people weren't building a new brand of camera or phone so some folks started putting this tech into USB sticks, and those were really popular. So the USB sticks and digital cameras and USB based card readers all made the flash appear like a disk drive and that meant that it could immediately be put to use by consumers. And that sort of cemented the idea that "Flash is for disks" into the minds of many people and folks have built a huge market around that.

Of course what started this revolution was memory, or EPROM to be precise. Back when dinosaurs walked the earth you stored firmware in a chip that physically had a window on the top of it. This was a "memory" chip that you would write by injecting charge into a transistor gate by forcing electrons to tunnel across to it (or out of it depending on the technology). But you needed a special programmer to do that, and to erase it, you had to get rid of those charges so you literally shined an ultraviolet light thru the window and the photons kicked the electrons right out of dodge. It was painful and the chip companies responded with something called "EEPROM" or electrically erasable programmable readonly memory. Which you could erase with a special high voltage signal on the motherboard doing the work of the ultraviolet lamp in previous generations. As density grew and the erase time shortened, manufacturers added the ability to only erase part of the memory but they could do that reasonably quickly "in a flash" as it were, and to distinguish memories that could be quickly erased from those which used older, and slower, technologies they started calling them "Flash" memories.

Of course as the article points out, Flash is nothing at all like a Hard drive, that people use it that way is an artifact. It much more closely represents the characteristics of something called "Drum Memory" [1] which, back when actual random access memory was very expensive to produce, made computers better. The reason was that a drum had a lot of heads and spun a piece of ferro magnetic material under those fixed heads. What this meant was that there was no 'seek' time, you picked the head you wanted electronically and you could read and write a few hundred to a couple of kilobytes of data. This enabled virtual memory in a big way because if you matched the amount of data on the drum with a 'page' of memory you could simply write out a page of memory or read in a page of memory faster than either tape or disk. The only problem was that you had to read all of the drum's track and write all of the track so a read-modify-write cycle meant reading in the track, modifying it, and the rewriting the entire track. Sound familiar? It should that is exactly how flash ended up working.

So here you have a random access memory that started life as a memory, but gained commercial acceptance as a pseudo disk drive, and a generation of programmers and system designers who had never heard about drum memory or considered it as something other than a curious artifact of the "before time" when people stored data as dots on a cathode ray tube for heavens sake. But they should have paid attention, because that is exactly where flash belongs. Sitting "beside" really fast dynamic ram, and even faster static ram (which is on chip and usually called level 1, level 2, or level 3 cache memory. If you remember Jeff Dean's observation about latencies every programmer should know [2], you would notice that reading 4K bytes from DRAM was on the order of .5 to 1 micro-seconds, and reading / writing 4K of memory from the network was 10 uS, and 4k read/write from disk - closer to 15000 microseconds (or 15 milleseconds). Reading 4K from flash on the PCI bus is on the order of 4 uS. Somewhere between getting it from the network and getting it from RAM. But the reason it is so much faster from the PCI bus is that you just map the PCI address space into memory space and memcpy from flash to RAM.

If you compare latency and bandwidth to going through an SSD interface you find that not only do you have a whole bunch of kernel between you and the SATA chip, which has its own protocol and drivers, you are constrained to a 6 gbit pipe which is probably shared with other SATA ports on the same SATA controller chip. From a systems architecture perspective attaching flash to your machine through an SSD plug is lame.

Now that said folks have started figuring this out. So people like Intel make PCI Flash cards, big chunks o memory like things. And all of the wear leveling is built into the flash controller just like the dynamic memory refresh logic is built into the DRAM controller. The processor sees something that looks like memory, occasionally operations take longer than expected if the controller is in the middle of something. The current challenge though is that so far folks think selling you the same flash chips that you can buy for $2/GB as an SSD should cost $200/GB as a PCI card. That math is seriously holding back flash, and the fact that the best slot to use for flash is the one your video card is sitting in (16x PCIe) and Intel has yet to add another 16x PCIe port for non-volatile memory cards, or added architectural support for putting PCI address resources into the page table. That will happen though, when I can't predict but it will because people keep asking for it and it makes some really killer server architectures possible.

[1] A pretty decent write up on drum memory from Wikipedia - http://en.wikipedia.org/wiki/Drum_memory

[2] A github copy of Jeff Dean's observation updated with Flash SSDs - https://gist.github.com/2841832

confluence · on Oct 20, 2012

This comment is God like - do you have a blog or the like where you keep details of your thoughts?

I think I just got a blast of the future of server architecture, future latency and the end of HDDs reading this - many thanks :D

tisme · on Oct 20, 2012

Thank you very much for this comment.

Comments like these are one of the main reasons I keep coming back to HN. Keep it up!

HarryHirsch · on Oct 20, 2012

It's not just the interface, it's the file system as well. A harddisk can be written to infinitely many times, but the cells in an SSD eventually wear out. That's why you have wear levelling, trim, an understanding of common file systems in the SSD firmware, the works. I keep asking: when will the low-level access be moved to a low-level SSD driver, where it belongs.

jbri · on Oct 20, 2012

The file system doesn't do wear-levelling. Trim has nothing to do with having a finite number of write cycles. The low-level SSD internals are already part of the low-level SSD firmware like you seem to be asking for.

r00fus · on Oct 20, 2012

It does economically for storage vendors and will for the next several years.

rll · on Oct 20, 2012

Mojibake from acm.org?

Confusion · on Oct 20, 2012

I don't see any mojibake?

rll · on Oct 20, 2012

Yeah, it looks like they fixed it. All the apostrophes were misconverted most likely from Windows-1252 when I read this yesterday but they are fine now.

BIair · on Oct 20, 2012

When I see anatomy in a title, I expect to see pictures.