Shared Storage Pools (SSP3) and Disaster Recovery in 30 seconds
nagger
Shared Storage Pools - just got even more interesting!
SSP Clusters with disk pools and super fast disk allocation
VIOS Shared Storage Pools (phase 3) allows up to 16 VIOS on different machines to operate as a VIOS cluster with a set of SAN LUNs in the pool - think in terms of a number of TBs of disk space in the pool. The VIOS systems administrator can then allocate disk space to a new or existing Virtual Machine (LPAR) in around a second, with thin or thick provisioning regardless of the underlying disks. This drastically reduces the time to implement a new Virtual Machine. If we operate dual VIO Servers per machine (which is normal), this still gives us an 8-machine cluster, and with MPIO we can use the dual VIO Servers for redundancy.
LPM ready by default
Assuming we use virtual networks too (which is normal these days) then we are 100% ready for Live Partition Mobility (LPM) as our Shared Storage Pool based Virtual Machine disks are available across the SSP3 cluster with no mucking about with LUNs and SAN Zones as the disks are already online on every VIOS. This does assume the same VLANs are available on the target VIOS but that is fairly normal too.
But what about disaster recovery with SSP3?
For example, one of the machines in the SSP3 cluster fails - like a sudden power down. Do we have the technology to rebuild the Virtual Machine?
We need to know a few things
Items 1, 2 and 3 are pretty easy to work out ... if you are regularly saving your configs off the machine or if the HMC is still running which might not be the case for a complete site failure.
I have been experimenting with two really cool tools that I think every Power Systems Administrator should have
IMHO: If you have not got these marvelous freely downloadable tools then you need to ensure you have other tools that can match the functionality and automated documentation generation.
The Shared Storage Pool disks (or LUs, as they are called) are not normally saved by the above tools, as they are not stored on the HMC: the LU details are held on the VIOS, and the boot disk is held in the boot records of the service processor (FSP).
But wait, item 4 is known to the surviving VIO Servers of the same SSP3 cluster
Go to a possible target VIOS to resurrect the failed Virtual Machine and run the lssp command. Here is an example from my SSP3 VIOS:
$ lssp -clustername galaxy -sp atlantic -bd
My failed Machine is called gold and the Virtual Machine I badly need to recover is called gold6.
The names of the LUs are completely determined by the user - I use a simple vdis
In light of these experiments, I think my naming convention could be improved (i.e. it sucks):
So obviously the SSP LUs I need to recover are vdisk_gold6a and vdisk_gold6b. As the second disk (b) is 0% used, it is clear that the first disk (a) is the boot disk.
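If the pool holds a lot of LUs, picking out the right ones by eye gets error prone. Here is a minimal sketch of doing the selection by name. It assumes the LU names have already been copied out of the lssp listing into a plain list, and that each Virtual Machine's disks are named vdisk_ plus the Virtual Machine name plus one lowercase letter, as with vdisk_gold6a and vdisk_gold6b. The function name lus_for_vm and the other LU names in the sample list are made up for illustration.

```python
import re


def lus_for_vm(lu_names, vm_name, prefix="vdisk_"):
    """Return the LU names of the form <prefix><vm_name><one lowercase letter>."""
    # fullmatch stops "gold6" from also matching a VM called "gold60".
    pattern = re.compile(re.escape(prefix + vm_name) + r"[a-z]")
    return sorted(name for name in lu_names if pattern.fullmatch(name))


# vdisk_gold6a and vdisk_gold6b come from the text above;
# vdisk_gold5a and vdisk_gold60a are hypothetical extra LUs in the pool.
pool = ["vdisk_gold6b", "vdisk_gold5a", "vdisk_gold6a", "vdisk_gold60a"]

print(lus_for_vm(pool, "gold6"))
```

The exact match on the single trailing letter is deliberate: a plain prefix search for vdisk_gold6 would also pick up the disks of a Virtual Machine called gold60, which is not what you want in the middle of a disaster recovery.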
So what do I need to do to recover my gold6 Virtual Machine?
Here are the steps:
This should take about five minutes!!
Not bad for recovering an important service. Of course, once AIX starts it will have to replay any JFS2 logs (usually in seconds) and then you need to start the application or RDBMS and it may need to recover incomplete transactions, so the service may take a little longer to be fully ready.
If that is not fast enough for you then you can have a recovery Virtual Machine set up in advance on a target machine. In that case, you run steps 1 to 3 in advance.
Then you just have to start the Virtual Machine (LPAR) which takes about 30 to 60 seconds.
One word of warning, don't have the original AND the recovery Virtual Machines (LPARs) running at the same time.
I think that would corrupt the file systems within a couple of seconds.
I did check this concept/idea with the SSP developers and they said "Of course Nigel, that will work fine."