The gotchas of NAND
orbist
While catching up on some old and new posts out here I came across an interesting article from Chris Mellor discussing a report about various PCIe card benchmarks performed by the Swiss National Computing Centre (CSCS).
I was going to reply to his post, but as I started typing I realised I had a lot to say, and that it was worthy of a blog post in itself.
SSD, and in particular NAND flash, has various foibles, as we all know: it has wear-out characteristics, it has distinctly different read and write characteristics, and basically it grows old.
While we, as vendors, do everything we can to make sure the SSD and PCIe card devices we sell will last for at least as long as the warranty period, not to mention perform as well as possible, you need to think about what you are trying to test and how you will use the device over its lifetime.
SLC (Single-Level Cell) devices are generally disappearing, mainly because of their relatively high cost and low density when compared with MLC and eMLC (enterprise Multi-Level Cell) NAND. For those that aren't aware of the technical difference, SLC means each cell in the NAND flash stores one bit. With MLC, each cell stores multiple bits, using different voltage levels to distinguish the bit values within the cell. Most MLC devices are dual-bit cells; however, tri- and quad-bit devices are in the works.
The Achilles heel of NAND as a storage technology is the way we write to it. Note I specifically state NAND rather than Solid State - NAND just happens to be the best non-volatile Solid State technology we have at our disposal today. In order to write a cell, or even a bit, we have to re-write all the bits in the cell. Not only that, you can't write just one cell: you have to write a whole block, typically 128KB. Before you can write that block, you first need to erase the block and its contents. While you erase that block, the whole chip is frozen for other operations... The number of times each block can be erased and written is limited, to something around 10,000 writes per cell in today's 20-30nm cell technology.
All of these "nuances" mean that the devil really is in the detail - or more to the point - in the quality of the algorithms that are managing the raw NAND cells and chips themselves. This means that each SSD or PCIe card needs a controller. So where standard HDD devices are accessed through drive firmware that manages the device and accesses it in a semi-sequential manner (tracks, sectors, etc.), a NAND storage device is much more randomly accessed, as there is no concept of a contiguous LBA. Generally NAND is managed via a log-structured array, or directory.
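To put that erase limit in perspective, here is a back-of-envelope sketch in Python. Only the 10,000 figure comes from the paragraph above; the function name, the 200GB capacity, the 50MB/s host write rate and the "write amplification" factor (how many bytes the NAND ends up writing for each byte the host sends, because of the whole-block rewrites) are my own assumptions for illustration, not numbers from any product.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, host_write_mb_s, write_amplification):
    """Rough years until every block has used up its erase/write cycles.

    Assumes perfect wear leveling, so all blocks age at the same rate.
    """
    total_nand_writes_mb = capacity_gb * 1024 * pe_cycles
    nand_write_mb_s = host_write_mb_s * write_amplification
    seconds = total_nand_writes_mb / nand_write_mb_s
    return seconds / (365 * 24 * 3600)


# Hypothetical device: 200GB, 10,000 cycles per block, 50MB/s sustained host writes.
for wa in (1.0, 3.0, 10.0):
    years = estimated_lifetime_years(200, 10_000, 50, wa)
    print(f"write amplification {wa:>4}: about {years:.2f} years")
```

The point is less the absolute numbers than the shape: the lifetime shrinks in direct proportion to how much extra writing the controller's algorithms cause.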
We are now therefore embedding the "traditional storage controller functions" into the raw devices themselves. The SSD or PCIe card itself has its own data placement, management, free block, bad block, defrag, wear leveling, QoS and even wear out algorithms. Today, the most successful vendors of NAND devices have either a custom ASIC to run these functions, or dedicated firmware running on an FPGA. Maybe this is the future for storage ASICs Nigel :)
All of this leads me to where I started, things to think about when benchmarking, or choosing your next NAND storage device... and the same thing applies to storage controllers....
All storage array and device vendors quote measured, but possibly unrealistic, performance numbers.... 100% read cache hits, for example. Yes, it shows you the capability of the device, but is that really the workload your applications will be driving? Nope.
Like cache hits, benchmark a NAND flash device out of the box and it will appear to be super super super fast. (Especially if you read from it before you've written it!) Think of the device as thin provisioned by its very nature: it only uses capacity when you write to it, so read something that doesn't yet exist in the directory and you get zeros returned without any data read happening.
Before running any kind of synthetic benchmark against a NAND device, make sure you've written a random small block pattern to the device. Generally you want to write the entire capacity of the device at least 3 or 4 times to ensure any over-provisioning has been taken into account. Say 4KB random writes, or if you want to be really mean, 3KB or 7KB random writes! This ensures the maximum fragmentation on the device. Thus any reads back from the device will be almost worst case, rather than best case (all zeros!).
Normal disk systems spend most of their time on the seek, and the data xfer is a small part of the latency. Thus most operations <32KB will return almost the same IOPS; only when you get to 64KB and above does the actual data read or write latency start to become the bigger part of the picture.
NAND devices generally have a linear response time based on the data xfer size. Thus a 512 byte read will give double the IOPS of a 1KB read, and so on. Find what your average application xfer size is and run your tests at that size. If it's writing at that size, then "format" the device as described above to the same block size.
When it comes to MB/s, an SSD device may only give you 2 or 3x the MB/s of an equivalent HDD. Now PCIe cards do have much more bandwidth available, and so you should be expecting GB/s per device, but it's still only 5 to 10x the bandwidth of an HDD. Compare that with IOPS, however: where an enterprise 15K SFF drive may get 500 IOPS at a push, an SSD can easily achieve 50,000 IOPS - 100x the performance.
Thus, make sure you understand your workload, and test the device accordingly. Your mileage will vary and that's the key point. If you know what your application asks of its storage, then you can work out, even on paper, if SSD will help. In most cases there are three use cases for NAND today.
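A toy model makes the "thin provisioned" behaviour easier to see. This is only a sketch of the idea, not how any real controller is written: the class name, the 4KB block size and the `media_reads` counter are all invented for the example.

```python
class ToyFlashDirectory:
    """Toy log-structured directory: maps logical blocks to positions in a log."""

    BLOCK_SIZE = 4096

    def __init__(self):
        self.log = []          # stands in for the NAND itself (append only)
        self.directory = {}    # logical block address -> index into self.log
        self.media_reads = 0   # how many reads actually touched the "NAND"

    def write(self, lba, data):
        if len(data) != self.BLOCK_SIZE:
            raise ValueError("this sketch only handles whole blocks")
        self.log.append(bytes(data))
        self.directory[lba] = len(self.log) - 1   # any older copy is now garbage

    def read(self, lba):
        index = self.directory.get(lba)
        if index is None:
            # Never written: answered from the directory alone, no media access.
            return bytes(self.BLOCK_SIZE)
        self.media_reads += 1
        return self.log[index]


dev = ToyFlashDirectory()
dev.read(7)                                           # fresh device: zeros, no media read
dev.write(7, b"\xab" * ToyFlashDirectory.BLOCK_SIZE)
dev.read(7)                                           # now the read has to fetch real data
print("reads that touched the media:", dev.media_reads)
```

A benchmark that only ever issues the first kind of read is measuring the directory lookup, not the NAND.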
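As a rough illustration of that preconditioning pass, the sketch below only generates the list of writes; it does not touch a device. The function name, its parameters and the tiny 1MB "device" are assumptions made up for the example.

```python
import random


def precondition_plan(capacity_bytes, block_size=4096, passes=4, seed=None):
    """Yield (offset, length) pairs for random small-block writes.

    Total bytes written is roughly capacity_bytes * passes. Offsets are
    multiples of block_size, so a 3KB or 7KB block_size gives writes that
    do not line up with 4KB boundaries.
    """
    rng = random.Random(seed)
    slots = capacity_bytes // block_size
    if slots == 0:
        raise ValueError("capacity is smaller than one block")
    for _ in range(slots * passes):
        yield rng.randrange(slots) * block_size, block_size


# Hypothetical tiny 1MB "device", just to show the shape of the plan.
plan = list(precondition_plan(1024 * 1024, block_size=4096, passes=4, seed=1))
print(len(plan), "writes,", sum(length for _, length in plan), "bytes in total")
```

In practice you would hand this job to a load generator rather than write your own, but the arithmetic is the same: block size times number of writes should come to several times the device capacity.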
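The difference between the two response curves can be sketched with a very simple model. The constants are assumptions chosen to be plausible rather than measured: 5ms of positioning time and 100MB/s media rate for the HDD, 20 microseconds per 512 bytes for the NAND device, and a single outstanding I/O in both cases.

```python
def hdd_iops(xfer_bytes, positioning_ms=5.0, media_mb_s=100.0):
    """IOPS for one outstanding I/O: fixed seek/rotate time plus transfer time."""
    transfer_ms = xfer_bytes / (media_mb_s * 1024 * 1024) * 1000
    return 1000 / (positioning_ms + transfer_ms)


def nand_iops(xfer_bytes, us_per_512_bytes=20.0):
    """IOPS when response time scales linearly with transfer size."""
    latency_us = us_per_512_bytes * xfer_bytes / 512
    return 1_000_000 / latency_us


for size in (512, 1024, 4096, 32 * 1024, 64 * 1024, 256 * 1024):
    print(f"{size:>7} bytes  HDD {hdd_iops(size):6.0f} IOPS  NAND {nand_iops(size):8.0f} IOPS")
```

With these made-up constants the HDD column barely moves below 32KB, while the NAND column halves every time the transfer size doubles, which is why the test transfer size matters so much more for NAND.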
1. You need improved latency. Then using a feature like Easy Tier can cost-effectively improve latency across the whole estate, with a minimal cost outlay.
2. You need more TPS or IOPS. If you are lucky and you can fit the data on a few SSDs, then you will see much higher IOPS than traditional disk, but the ultimate gain is IOPS for small block (<16KB) random workloads.
3. You need very high MB/s from a small capacity. Then a direct attached PCIe card may be the answer, but does the data only need accessing by one system, and does the cost of the device mean you are gaining vs 10 or 20 HDD devices?
When you go direct attached, you need to think about backup, replication, and so on.
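The "on paper" exercise can be as simple as the sketch below. The per-device IOPS figures are the ones quoted earlier (500 for a 15K drive, 50,000 for an SSD); the workload, the per-device capacities and the function name are hypothetical, and the sketch ignores RAID overhead, cache effects and cost.

```python
import math


def devices_needed(required_iops, required_capacity_gb, iops_per_device, capacity_gb_per_device):
    """Devices needed to satisfy both the IOPS target and the capacity target."""
    for_iops = math.ceil(required_iops / iops_per_device)
    for_capacity = math.ceil(required_capacity_gb / capacity_gb_per_device)
    return max(for_iops, for_capacity)


# Hypothetical workload: 20,000 small-block random IOPS over 2TB of data.
hdds = devices_needed(20_000, 2048, iops_per_device=500, capacity_gb_per_device=300)
ssds = devices_needed(20_000, 2048, iops_per_device=50_000, capacity_gb_per_device=200)
print(f"HDDs needed: {hdds} (limited by IOPS), SSDs needed: {ssds} (limited by capacity)")
```

If the answer for HDDs is driven by IOPS and the answer for SSDs is driven by capacity, as it is here, that is usually a sign the workload is a candidate for NAND, or for tiering just the hot part of the data.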
Writing this, I think I may be painting NAND in a bad light, but I'm just trying to explain where and when NAND works and, most importantly, the kind of things to look out for. Ultimately it doesn't surprise me that the devices in the tests performed by CSCS didn't meet the vendors' claims, as the tests didn't fit into the small window that shows each device at its best.
IBM today provides SLC based SSD drives in SVC, eMLC drives in Storwize V7000 and DS8000, and Easy Tier automated migration across all those products. Our Power and X Systems can be configured with locally attached PCIe cards, and any performance numbers we provide, benchmark or estimate, are always based on the real life, sustained performance of the devices.
I'd love to hear end user experiences of NAND devices: where it's working for you, where it's not lived up to expectations, and where you think you could make it work.