SVC - Clustered Storage from the ground up
orbist
As we approach SVC's 6th birthday it's interesting to see the rest of the industry starting to catch up and realise not only that a modular commodity storage controller is the way forward, but also that clustering said modular controllers has many benefits.
We all await today's announcement from HDS, where various people have speculated they will be enabling users to cluster new USP-V monoliths. I guess a similar story to the way in which EMC tried to make out that
I thought it was worth going back and looking over SVC and its clustering technology, not to mention some of the implications of clustering storage controllers to give a single system image of the storage they manage. This really does have implications at a very low and fundamental level.
SVC was designed from the ground up to be a cluster of modular storage nodes. This section from the 'Compass Architecture' document outlines the primary reasoning behind the original Compass research project:
This document was originally written in June 2000, some 9 years ago. It would seem everyone else is simply catching up.
SVC clustering is based around other HA solutions developed by IBM over many years for the RS/6000 and SP/2 systems. The code running on each node is divided into two distinct areas: the cluster state and the agent state. The cluster state is replicated across all nodes in the cluster and is essentially a set of control state machines. These are all guaranteed to make the same decisions based on the same input set of data. The members of the cluster form the voting set. This is all controlled by a single component that maintains the voting set and handles nodes that expire out of the cluster, or join the cluster. Quorum must be maintained at all times, and in the event of an even split (half of the previous voting set remain in contact with each other) an external quorum tie-break is used. This ensures that both halves of the cluster don't continue in the event of a split.
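To make the quorum rule concrete, here is a minimal Python sketch. It is not SVC code: the function name, the node names and the `holds_tie_break` flag are all hypothetical, and the rule that a strict majority of the previous voting set may continue is my assumption about what "maintaining quorum" means. Only the even-split tie-break behaviour comes from the description above.

```python
def may_continue(previous_voting_set, reachable, holds_tie_break):
    """Decide whether this partition may carry on as the cluster.

    previous_voting_set: nodes that were members before the split
    reachable: nodes this partition can still talk to (including itself)
    holds_tie_break: True if this partition won the external tie-break
    """
    prev = set(previous_voting_set)
    alive = set(reachable) & prev      # ignore nodes that were never members

    if 2 * len(alive) > len(prev):     # assumed rule: strict majority continues
        return True
    if 2 * len(alive) == len(prev):    # even split: the tie-break decides
        return holds_tie_break
    return False                       # minority partition must stop


nodes = {"n1", "n2", "n3", "n4"}
# Even split: only the half holding the tie-break keeps running.
print(may_continue(nodes, {"n1", "n2"}, holds_tie_break=True))
print(may_continue(nodes, {"n3", "n4"}, holds_tie_break=False))
```

Because exactly one half can hold the tie-break, the two halves of an even split can never both decide to continue.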
The division of cluster and agent state means that software upgrade processing can be divided, with the nodes essentially running two different sets of code (the agent being new and the cluster state being old) until such time as all nodes agree to commit to the upgrade. At that point they all roll forward in lock step - performing any cluster state modifications required by the new code. Should any one node decide it isn't happy to commit, the nodes can all roll back to the previous level.
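The commit-or-roll-back behaviour can be sketched in a few lines of Python. This is purely illustrative and assumes a much simpler model than the real product: the `Node` class, its fields and the `happy` flag (standing in for whatever checks a node performs before agreeing to commit) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    agent_level: str       # code level of the per-node agent
    cluster_level: str     # code level of the replicated cluster state
    happy: bool = True     # hypothetical: will this node agree to commit?


def upgrade(nodes, new_level):
    """Move every agent to new_level, then commit or roll back together."""
    old_levels = {n.name: n.agent_level for n in nodes}

    # Phase 1: agents run the new code while the cluster state stays old.
    for n in nodes:
        n.agent_level = new_level

    # Phase 2: commit only if every node agrees.
    if all(n.happy for n in nodes):
        for n in nodes:
            n.cluster_level = new_level    # roll forward in lock step
        return True

    # Any single objection sends all agents back to their previous level.
    for n in nodes:
        n.agent_level = old_levels[n.name]
    return False


cluster = [Node("n1", "4.3", "4.3"), Node("n2", "4.3", "4.3", happy=False)]
print(upgrade(cluster, "5.1"), [(n.agent_level, n.cluster_level) for n in cluster])
```

The point of the split is visible in phase 1: the cluster state is untouched until the commit, so rolling back only has to restore the agents.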
It's the overall architecture and up-front thinking that allows SVC to be so successful, and more reliable than many 'enterprise' controllers - especially when it comes to complex cluster upgrade scenarios. There are 5000+ clusters running in production, with more than 12 major code streams and many more minor upgrades over the last 6 years, not to mention 5 hardware ramps in the same period. All of this goes to show how such a solution benefits from being designed from the ground up to be clustered, modular, built from commodity parts and able to allow both concurrent software and hardware upgrades.
Some vendors try to say that SVC is not a cluster...? Because we use a 2-way caching model, that is, a virtual disk is managed by a node pair, they say this is just a 2-way controller, not a cluster. But the concept of SVC was to provide a single management point, with generic access to a single set of storage pools (tiers). It does this perfectly, whereby all nodes in the cluster can access and pool all the storage that is visible to the cluster. An n-way caching model puts much more overhead on the node-to-node links to ensure that consistency is maintained across all nodes. In the end, at least 2 nodes have to 'own' any cache track - you could make this n-way, but then your cache capacity is reduced by a factor of n, and the overhead on a write is increased by n. No, it's better to give ownership of a given track to just a pair of nodes, and thus only reduce your write caching space by a factor of 2 and add the overhead of cache mirroring (i.e. also a factor of 2). The other way to achieve this is to have all your nodes connected by some expensive backbone, and then you are back to slow and costly hardware development. A global cache may be of value inside a monolith, but for truly modular clusters it's a hindrance.
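The arithmetic behind that trade-off is easy to check. The short Python sketch below simply applies the two factors described above (capacity divided by the number of copies, write overhead multiplied by it). The function name and the cluster size of 8 nodes with 8 GB of cache each are made-up numbers for illustration, not a statement about any real configuration.

```python
def mirrored_write_cache(total_cache_gb, copies):
    """Return (usable write cache in GB, cache copies written per host write)
    when every cached track is held on `copies` nodes."""
    if copies < 2:
        raise ValueError("a track needs at least 2 owners to survive a node loss")
    usable_gb = total_cache_gb / copies    # each track occupies `copies` slots
    copies_per_write = copies              # each write lands on every owner
    return usable_gb, copies_per_write


total = 8 * 8    # hypothetical: 8 nodes with 8 GB of cache each
print(mirrored_write_cache(total, 2))    # node-pair ownership
print(mirrored_write_cache(total, 8))    # every node owns every track
```

With these example numbers, pair ownership leaves 32 GB of usable write cache at 2 copies per write, while full 8-way ownership leaves 8 GB at 8 copies per write.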
Again, some vendors may claim that SVC creates islands of virtualization due to scalability. We've supported 8 nodes for some time now - the architecture is (as you can guess) designed to scale way beyond that. However, as each node hardware model roughly doubles in performance with each generation, so far we've not needed to scale SVC beyond 8 nodes from a performance perspective. I'd question anyone who refers to SVC as an island and ask just what their solution does today. USP-V (even with a potential new cluster of USP-Vs) will be an island, and of course DMX-5 is still an island. The difference is that they are both very expensive islands where you can put cheap storage into the expensive hardware. USP-V can of course virtualize other storage, but it's always been an afterthought, and then you are stuck with the port limits they impose.
If you are being entertained by the sales teams of one of these 'new kids on the block' that claim to have superior clustering techniques, greater scalability and a real cluster - ask yourself if you'd rather have a product that was designed to be a cluster, provides true heterogeneous open storage management and can be upgraded with ease... If you haven't looked at SVC before, or even for a while, maybe now is the time to investigate further, as it seems the competition are finally confirming what we at IBM have known for almost 10 years...