1M IOPs from Flash - actions speak louder than words
orbist
It's been so difficult to keep all this to myself over the last few months, especially last month when we actually reached our goal: 1 million real-life IOPs at less than 1ms - no caching, just pure raw performance.
You may or may not have seen the IBM press release covering the first part of IBM's flash and SSD strategy. Unlike some vendors, we didn't want to just jump on the bandwagon and make some vague statements about following in EMC's footsteps. Simply plugging these devices into an existing box that wasn't designed for them limits the top-end performance potential, and almost certainly will not scale to many hundreds of devices.
There are three plays here. The first is SSDs (drive form factor) that plug into a traditional storage controller, which all the major vendors are working on supporting - that's great: you get a much reduced response time (assuming the controller can cope) and a new tier of storage in the box that you can manually manage. The second play is the addition of non-standard storage form factors to a host, most likely PCIe adapters, maybe even DIMM form factors or the like; the Systems folks in IBM are actively investigating how this can best benefit our customers, as are Sun and Intel. The third play is to build a storage controller that is truly optimized for this game-changing technology. That's what we have prototyped, and it is described in today's press release.
A few SVC developers in Hursley and I have been working closely with the storage research team in Almaden (yes, some of the same people who helped us bring you SVC in the first place) to build a highly scalable, modular controller system that is not only optimised for flash-based devices, but, used in addition to SVC, can also attach traditional HDD-based controllers - with, of course, the ability to migrate data seamlessly, online, between the tiers/controllers.
One of the great things about the SVC software base is that it is extremely flexible, both in how we can add new features and functions (as discussed before) and in the hardware configuration. We needed to build a cluster capable of sustaining 1 million cache-miss IOPs. So we did, and once the hardware was on-site it took the team less than a week to get it running. This gave us the host attachment we needed, and the ability to attach flash-based storage to our high-end Power system hosts (which would be needed to generate that much I/O).
Now we have the functionality, features and host attachment; all we need is some flash storage. One thing EMC won't have told you: the Achilles' heel of flash can be mixed workloads. We all know that reads are stunning, but writes are much more complicated. That is why most 'laptop' SSDs don't actually give much of a performance benefit over HDDs (someone told me recently that Windows does hundreds of thousands of writes during boot...). Write endurance - I'm not going to go there; it's not an issue. Any problem can be solved, and most enterprise flash vendors have worked around these problems in one way or another, but at what cost? Usually performance. Depending on how good the low-level ASIC or FPGA code is, mixing reads and writes can be problematic. The search was on for a suitable flash device.
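To make the mixed-workload point concrete, here is a back-of-the-envelope model. The per-operation latencies below are illustrative assumptions, not measurements of any particular device; the point is simply that even a modest fraction of slow writes drags the blended average well above the pure-read figure.

```python
# Toy model of how a read/write mix affects blended flash latency.
# READ_US and WRITE_US are assumed figures for illustration only -
# real devices vary widely, especially under erase/garbage-collection load.

READ_US = 50.0     # assumed flash read service time (microseconds)
WRITE_US = 250.0   # assumed flash write service time, erase overhead included

def blended_latency(read_fraction: float) -> float:
    """Weighted-average service time for a given read/write mix."""
    return read_fraction * READ_US + (1.0 - read_fraction) * WRITE_US

for mix in (1.0, 0.9, 0.7, 0.5):
    print(f"{mix:.0%} reads -> {blended_latency(mix):.0f} us average")
```

With these assumed numbers, dropping from 100% reads to a 70/30 mix more than doubles the average service time, which is why read-only benchmark figures say so little about database-style workloads.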
Robin Harris posted an interesting article last year that sent me off investigating the FusionIO ioDrive, and after some discussions at a corporate level we got hold of a couple of cards to evaluate. They provide an interesting angle from a storage controller perspective, and IBM has been working closely with FusionIO over the last few months to help both companies squeeze every last drop of performance out of the low-level flash hardware.
Now we have the flash and we have the virtualizer; we need to connect them together. Internally on our test floor we use a huge number of different vendors' storage controllers, but we also use SVC to test SVC (using SVC software and hardware that, to SVC, 'looks' like a storage controller). Take this a step further: we used the Fibre-Channel host attachment side of the SVC code to build a prototype modular flash controller. Take a high-performance System x server, put some ioDrives in the PCIe slots, add our Fibre-Channel HBA, and modify the SVC code to read and write from the flash devices... et voila - you have a very high performance flash controller. (SVC has already proved that the Intel-based System x hardware can handle huge numbers of IOPs and MB/s.)
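The shape of that prototype - an FC target front end dispatching block reads and writes straight to PCIe flash cards - can be sketched roughly as follows. Every class and method name here is a hypothetical illustration; the real SVC code path is nothing this simple.

```python
# Structural sketch (hypothetical names throughout) of the prototype's
# I/O path: a Fibre-Channel target front end striping block requests
# across several PCIe flash cards.

class FlashDevice:
    """Stand-in for a single PCIe flash card exposed as a block device."""
    def __init__(self, blocks: int, block_size: int = 4096):
        self.blocks = blocks
        self.block_size = block_size
        self.store = {}  # sparse block map: LBA -> data

    def read(self, lba: int) -> bytes:
        # Unwritten blocks read back as zeros.
        return self.store.get(lba, b"\x00" * self.block_size)

    def write(self, lba: int, data: bytes) -> None:
        assert len(data) == self.block_size
        self.store[lba] = data

class FlashTarget:
    """FC target front end routing LBAs across the flash cards."""
    def __init__(self, devices):
        self.devices = devices

    def _route(self, lba: int):
        # Simple round-robin striping by LBA across the cards.
        dev = self.devices[lba % len(self.devices)]
        return dev, lba // len(self.devices)

    def read(self, lba: int) -> bytes:
        dev, local = self._route(lba)
        return dev.read(local)

    def write(self, lba: int, data: bytes) -> None:
        dev, local = self._route(lba)
        dev.write(local, data)

target = FlashTarget([FlashDevice(1 << 20) for _ in range(4)])
target.write(7, b"\xab" * 4096)
print(target.read(7) == b"\xab" * 4096)  # True
```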
Now our critics tell us SVC adds latency... and that it's going to cause you problems. So with a device that gives super-low latency, if that were true we'd be in trouble. Not so.
The benchmark we ran was a pretty typical Open Systems database-style workload: 70% read, 30% write at a 4K transfer size, with no cache hits. We actually disabled the SVC cache (by turning it off, or creating the vdisks as cache-disabled), guaranteeing the I/O was going down to flash storage. With this workload we sustained over 1 million IOPs (1.1M at peak) with a response time maintained under 700us (microseconds) for many hours - so much for SVC being a slouch or adding latency. I have seen BarryB's customer pitch regarding EMC flash drives (I love the way they call them EMC flash drives; it gives the impression they designed them, when they are actually made by STEC)... It's probably not politically correct of me to divulge details of what he discusses, but if you are thinking about SSDs and are contacted by EMC, ask them for a comparison: how many EFDs would they need to maintain the same rate of IOPs, and can they even match the same response time (with or without cache)? More importantly, how many DMX quadrants would be needed to reach 1M IOPs at such a low response time? We did this in just over one and a half EIA racks - about 71U to be precise, including the SVC nodes and the UPSs needed for SVC.
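The benchmark mix itself is easy to state precisely. A driver that reproduces the 70/30 split at 4K random offsets looks something like the sketch below; this is an illustration of the workload definition, not the actual test harness, and the LUN capacity is a made-up figure.

```python
# Sketch of a 70/30 read/write, 4K random-access workload generator.
# CAPACITY_BLOCKS is a hypothetical LUN size; uniform random offsets
# over a large LUN are what guarantee the "no cache hits" condition.
import random

BLOCK_SIZE = 4096          # 4K transfers, as in the benchmark
READ_FRACTION = 0.70       # 70% read / 30% write mix
CAPACITY_BLOCKS = 1 << 24  # hypothetical LUN size in blocks

def next_op(rng: random.Random):
    """Pick the next operation in the 70/30 random-access mix."""
    op = "read" if rng.random() < READ_FRACTION else "write"
    lba = rng.randrange(CAPACITY_BLOCKS)
    return op, lba, BLOCK_SIZE

rng = random.Random(42)
ops = [next_op(rng)[0] for _ in range(100_000)]
print(f"reads: {ops.count('read') / len(ops):.1%}")  # ~70%
```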
The obvious comparison to draw, and one that would be recognised by all, is SPC-1 - but one of the rules of the Storage Performance Council benchmarks is that the product must be GA'd. As this is a proof-of-concept system at present, we cannot publish any SPC results for it. SVC is audited by the SPC and holds the world-record disk-based benchmark1. To ground the flash-based 70/30 4K test against a known, industry-recognised product, we ran the same 70/30 4K 100%-miss test against the SVC configuration used to produce that SPC result. This gave us the comparisons used in the press release: the modular flash controller system provided 3.5x the IOPs at 1/20th of the response time.
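Working backwards from the two stated multipliers gives a rough sense of the disk-based baseline. This is pure arithmetic on the numbers quoted above, not an additional measurement:

```python
# Back out the approximate disk-based baseline from the stated
# multipliers: the flash system gave 3.5x the IOPs at 1/20th the
# response time, sustaining ~1M IOPs at under 700 microseconds.
flash_iops = 1_000_000   # sustained rate quoted above
flash_rt_us = 700        # response time ceiling, microseconds

disk_iops = flash_iops / 3.5        # implied baseline IOPs
disk_rt_ms = flash_rt_us * 20 / 1000  # implied baseline response time, ms

print(f"implied disk baseline: ~{disk_iops:,.0f} IOPs at ~{disk_rt_ms:.0f} ms")
```

That is, the disk-based SVC configuration lands in the high-200-thousands of IOPs at roughly 14 ms - respectable for spinning disk, and a measure of how far flash moves the goalposts.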
I'm sure our critics will be along to say that's all very well and good, but is it shipping now... The whole point of this work was to look to the future and show that a modular, flash-optimised controller approach (rather than a monolithic one) is capable of providing the best that flash can bring. Combining this with storage virtualization enables existing SATA, SAS or FC-AL devices to be added to the picture, and most importantly you gain the ability to migrate hot/cold data online, at will, from flash to traditional storage (and vice versa) - maybe even autonomically... An investment in a product like SVC is an investment in the future of your infrastructure... the future is bright... the future is virtual... the future is modular...
1 : Details of the SAN Volume Controller SPC-1 Results are available at: http