IBM Support

Performance Myth Busting

Technical Blog Post



There are a number of myths out there about performance that I hear over and over, so I thought I would talk about some of the more common ones. If anyone else has heard any strange rules, let me know and I can put together another post.

 

The rule of 4 must be obeyed - Busted

For anyone unaware, the rule of 4 comes from the fact that the Storwize V7000 Gen1 had 4 cores for handling IO. Volumes (which were still called vdisks back then) were assigned a core, as were RAID arrays. Having a single volume running from a single RAID array could result in the CPU maxing out at about 25%, with poor performance as a result. So to get optimal performance you wanted to have 4 arrays and ideally 8 volumes (the volumes are also round-robined between the 2 nodes in the IO group). This got published in a Redbook and has caused concern ever since.
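To make the mechanism concrete, here is a toy sketch of that core assignment. This is my own illustration, not Spectrum Virtualize code; the modulo assignment is just a stand-in for however the product actually maps objects to cores.

```python
# Hypothetical sketch (not product code): why one volume on one array
# can pin all of its work to a single one of the 4 Gen1 cores.

NUM_CORES = 4  # Storwize V7000 Gen1 had 4 IO-processing cores

def assign_core(object_id: int) -> int:
    """Round-robin-style assignment of a volume or array to a core."""
    return object_id % NUM_CORES

# One volume on one array: every IO lands on the same core, so total
# CPU utilization tops out near 1/NUM_CORES = 25%.
busy = {core: 0 for core in range(NUM_CORES)}
for io in range(1000):
    busy[assign_core(0)] += 1   # all IO hits core 0

print(busy)  # → {0: 1000, 1: 0, 2: 0, 3: 0}

# Four arrays and eight volumes spread the same amount of work evenly:
busy = {core: 0 for core in range(NUM_CORES)}
for volume_id in range(8):
    for io in range(125):
        busy[assign_core(volume_id)] += 1

print(busy)  # → {0: 250, 1: 250, 2: 250, 3: 250}
```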

 

So why is it not important? A lot of workloads are not CPU limited; a sequential write stream, for example, is rarely CPU limited. If you run random small IO on a performance test stand then you will manage to max out the CPU, but very few customers are doing that. The rule is also out of date: with the arrival of the Gen2 there are 7 cores used for volumes and arrays, so even when following the rule of 4 you don't have a fully distributed system. Then came distributed RAID. In order to make a single large distributed RAID array faster, we changed the code so that some of the processing for a RAID array is done on cores other than the one the array is assigned to. This means that while the core the array is assigned to is used more heavily, the others get used a lot more too. This enhancement applies to all RAID arrays, not just distributed arrays. And in 7.7.1 we have just announced multi-threaded distributed RAID, which means a single distributed array uses all of the cores. Finally, for sequential workloads a smaller number of larger RAID arrays gives better performance, as it manages to keep more drives spinning.

 

If you want to eke out the absolute best performance, then having a volume count that matches the number of IO-processing cores still gives you slightly better results. But if you stick rigidly to the idea that everything must be in the correct multiple, you can end up with systems that are harder to maintain. Think of the rule as more of a bonus: if your setup happens to fit the rule nicely, then congratulate yourself with a doughnut.

 

CPU utilization must stay below 50% - Busted

When compression was added to the Storwize V7000 Gen1 there was a rule that CPU utilization must be below 25% before turning on compression. This is because compression uses 3 cores, leaving only 1 core available for processing IO. This does change the rule of 4 into a rule of 1, which is much easier to follow, but it also made people focus more on CPU usage. So around the same time I started hearing mutterings that if your CPU was above 50%, then if a node failed the remaining node wouldn't be able to cope, as its CPU usage would have to double.

 

The problem with that idea is that a lot of the CPU usage is actually due to the fact that the data has to be mirrored between the two nodes; if the IO comes in and simply gets written to the backend controller or RAID array, things are a lot simpler. Even with RAID arrays there is a lot of processing taking place to make sure that both nodes aren't trying to write to the same drive LBAs at the same time: messages are sent back and forth, little ticks are made in tables just in case one of the nodes does die, and so on. There is obviously a point where a single node won't be able to cope with a workload that required two nodes, but it isn't at 50%. I don't know exactly where that cut-off is, but I've seen systems running at 75% cope fine when a node fails.

 

RAID 6 is too slow - Busted

You will hear the write penalty for RAID-6 thrown around a lot. People worry that RAID-6 is too slow and that they need to stick with RAID-5. The truth is that RAID-6 is typically slower than RAID-5. The difference is the amount of parity data being written: a small write to a RAID-5 array requires 1 data read, 1 parity read, 1 data write and 1 parity write. The same small write to a RAID-6 array requires 1 data read, 2 parity reads, 1 data write and 2 parity writes. But the important thing is that those 2 parity reads happen in parallel with the data read, and the 2 parity writes happen in parallel and can complete after the write has already been acknowledged to the system issuing it (typically in Spectrum Virtualize that write is coming from cache, but when running with a single controller or with cache disabled it comes from the host).
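The arithmetic above can be sketched in a few lines. This is purely illustrative (the function name and the example 5 ms drive latency are my own assumptions, not measured figures): RAID-6 costs extra backend IOs per small write, but because the reads overlap and the writes overlap, the latency seen for the write is still just two serial drive-IO phases either way.

```python
# Illustrative arithmetic only: backend IOs for a small random write,
# and why RAID-6's extra parity IOs mostly overlap in time.

def small_write_ios(parity_drives: int) -> dict:
    """Backend IOs for one small write; RAID-5 has 1 parity drive, RAID-6 has 2."""
    return {
        "data_reads": 1,
        "parity_reads": parity_drives,
        "data_writes": 1,
        "parity_writes": parity_drives,
    }

raid5 = small_write_ios(1)   # 4 backend IOs: the classic "write penalty of 4"
raid6 = small_write_ios(2)   # 6 backend IOs

# Assume ~5 ms per drive IO. All reads run in parallel, then all
# writes run in parallel, so latency is two serial phases either way:
DRIVE_IO_MS = 5
latency_raid5 = 2 * DRIVE_IO_MS   # read phase + write phase = 10 ms
latency_raid6 = 2 * DRIVE_IO_MS   # same two phases = 10 ms

extra_ios = sum(raid6.values()) - sum(raid5.values())
print(extra_ios)                       # → 2 extra backend IOs per small write
print(latency_raid5 == latency_raid6)  # → True
```

The cost of RAID-6 therefore shows up as extra drive IOPS consumed, not as extra latency per write, which is why it only matters on systems that are already drive limited.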

 

With the addition of distributed RAID you get much better distribution of IO to drives, even if you are hammering a single extent on a single vdisk. Even when using traditional RAID, most customers don't have systems that are so drive limited that the extra IO required for the parity is going to have a meaningful impact.

 

So for those customers with RAID-5, I'd recommend going up to RAID-6 (preferably distributed RAID-6); it gives you much better protection. If you are using drives larger than 1TB with RAID-5, it's a data loss disaster waiting to happen.

 

Cache Disabled disks can be faster - True

Cache in a controller (or virtualization engine) has a couple of different objectives. Firstly, it tries to collate IO, either to avoid writing data that is about to be immediately overwritten, or to group adjacent writes and issue a single write down. Secondly, it tries to hide latency from RAID arrays, either by caching a write, giving sub-millisecond response to a host, and destaging it later, or by prefetching read data so it's ready for when the host wants it.
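The first objective, collating IO, can be shown with a toy sketch. This is my own simplification of the idea, not how any real cache is implemented: later writes to the same LBA replace earlier ones still sitting in cache, and adjacent dirty regions are destaged as one larger backend write.

```python
# Toy model of write-cache collation (my own simplification):
# drop overwritten data, then merge adjacent writes into one destage.

def coalesce(writes):
    """writes: list of (lba, length) pending in cache.
    Returns the minimal list of backend writes to destage."""
    # A later write to the same LBA replaces the earlier one in cache,
    # so keep only the last entry per LBA (avoids a wasted write).
    pending = {lba: length for lba, length in writes}
    merged = []
    for lba in sorted(pending):
        length = pending[lba]
        if merged and merged[-1][0] + merged[-1][1] == lba:
            # Adjacent to the previous region: extend it instead of
            # issuing a separate backend write.
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((lba, length))
    return merged

# Five host writes (two of them overwrites of earlier data) collapse
# into a single 24-block backend write starting at LBA 100:
ios = [(100, 8), (108, 8), (100, 8), (116, 8), (108, 8)]
print(coalesce(ios))   # → [(100, 24)]
```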

 

So imagine you have the best sub-millisecond all-flash product available behind your SVC (you could put a fancy plastic front panel on it and call it a Storwize V9000). You now have a backend subsystem that doesn't care about gluing adjacent writes together, it has state-of-the-art technology to maximize the life of the flash so that repeatedly overwriting data isn't really an issue, and it gives such low latency that exposing that latency to a host isn't a bad thing. Putting a cache in front of that doesn't get you anything; in fact SVC has to replicate the data between the nodes via Fibre Channel (or in the future iSCSI), where it might have to compete for bandwidth with host IO. So by adding a cache that you don't need, you can actually see higher latency and lower bandwidth.

 

This doesn't mean all volumes on a Storwize V9000 should be cache disabled; cache still plays an important part in hiding the latency introduced by the many copy services you may be using. But for a very boring vanilla volume that isn't replicating, isn't compressed and that you don't want snapshots of, you might want to think about disabling cache. Of course, if all of your volumes are like that then you are missing out on some of the major advantages the Storwize V9000 brings over just buying a FlashSystem 900.

 


UID

ibm16164223