Shared Storage Pools - just got even more interesting!
SSP Clusters with disk pools and super fast disk allocation
VIOS Shared Storage Pools (phase 3) allow up to 16 VIOS nodes on different machines to operate as a VIOS cluster with a set of SAN LUNs in the pool - think in terms of a number of TBs of disk space in the pool. The VIOS systems administrator can then allocate disk space to a new or existing Virtual Machine (LPAR) in around a second, thin or thick provisioned, regardless of the underlying disks. This drastically reduces the time to implement a new Virtual Machine. If we operate dual VIO Servers (the normal setup) this still gives us an 8 machine cluster, and with MPIO we can use the dual VIO Servers for redundancy.
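For example, carving a new 16 GB thin-provisioned LU out of the pool and mapping it to a client's vSCSI adapter is a single VIOS command. A minimal sketch, using the cluster and pool names from later in this article (galaxy and atlantic) with a made-up LU name and vhost adapter - add -thick at the end if you want thick provisioning:
$ mkbdsp -clustername galaxy -sp atlantic 16G -bd vdisk_newlpar_a -vadapter vhost10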
LPM ready by default
Assuming we use virtual networks too (which is normal these days) then we are 100% ready for Live Partition Mobility (LPM), because our Shared Storage Pool based Virtual Machine disks are available across the SSP3 cluster - no mucking about with LUNs and SAN zones, as the disks are already online on every VIOS. This does assume the same VLANs are available on the target VIOS, but that is fairly normal too.
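A quick sanity check that every VIOS node can see the pool (and so could host the mobile LPAR) is the SSP cluster command - again using my cluster name galaxy:
$ cluster -list
$ cluster -status -clustername galaxy
cluster -status lists each VIOS node in the cluster and its state, confirming the LUs are online everywhere.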
But what about disaster recovery with SSP3?
For example, one of the machines in the SSP3 cluster fails - say, a sudden power down. Do we have the technology to rebuild the Virtual Machine?
We need to know a few things:
- LPAR CPU: dedicated CPU count or shared: Entitlement, Virtual Processors, weight
- LPAR memory: size in GB
- LPAR network(s)
- LPAR SSP3 disks - actually the names of the SSP3 LU resources
- LPAR boot disk
Items 1, 2 and 3 are pretty easy to work out ... if you are regularly saving your configuration off the machine, or if the HMC is still running - which might not be the case for a complete site failure.
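If the HMC is reachable you can pull items 1 to 3 straight from its command line and keep the output somewhere safe off-site. A minimal sketch - the machine (gold) and LPAR (gold6) names are the ones used later in this article; adjust the filters to suit:
$ lssyscfg -r prof -m gold --filter lpar_names=gold6
$ lshwres -r proc -m gold --level lpar --filter lpar_names=gold6
$ lshwres -r mem -m gold --level lpar --filter lpar_names=gold6
$ lshwres -r virtualio --rsubtype eth -m gold --level lpar --filter lpar_names=gold6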
I have been experimenting with two really cool tools that I think every Power Systems Administrator should have
- HMCscanner - http://tinyurl.com/HMCscanner which generates an Excel spreadsheet of HMC data and runs on AIX or your workstation
- LPAR2RRD - http://lpar2rrd.com for all the information and a demo; this too needs only ssh access as a read-only user to document your configuration and graph your LPAR performance
These two tools, or a saved HMC System Plan, cover the LPAR CPU, memory and network details (items 1, 2 and 3).
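The System Plan route is also a one-liner on the HMC command line - a sketch with my machine name (gold) and a made-up file name:
$ mksysplan -m gold -f gold_before_disaster.sysplan
The .sysplan file can then be copied off the HMC and kept somewhere safe, ready to be read (or redeployed) for the CPU, memory and virtual adapter details.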
IMHO: If you have not got these marvelous freely downloadable tools then you need to ensure you have other tools that can match the functionality and automated documentation generation.
The Shared Storage Pool disks (or LUs as they are called) are not normally captured by the above tools, as they are not stored on the HMC but on the VIOS, and the boot disk choice is held in the boot records on the service processor (FSP).
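You can, however, record them in advance yourself: on each VIOS, lsmap shows which LU backs which vhost adapter, and viosbr can back up the VIOS (and SSP cluster) configuration to a file. A sketch with my cluster name and a made-up backup file name:
$ lsmap -all
$ viosbr -backup -clustername galaxy -file galaxy_ssp_backup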
But wait - item 4 is known by the surviving VIO Servers of the same SSP3 cluster.
Go to a possible target VIOS to resurrect the failed Virtual Machine and run the lssp command. Here is an example from my SSP3 VIOS:
$ lssp -clustername galaxy -sp atlantic -bd
Lu Name          Size(mb)  ProvisionType  %Used  Unused(mb)  Lu Udid
vdisk_diamond3a  16384     THIN           51%    7971        0d0f6526326f906077b3b2c9c6c42343
vdisk_diamond4a  16384     THIN           40%    13309       8ff7f4e74244ced56d6353247c3f8ca1
Snapshot
diamond4a_SP11.snap
diamond4a_with_wp22.snap
diamond4a_ISD_WPAR_ready.snap
vdisk_diamond6a  131072    THIN           17%    108400      5a47d7a731bf85ef59fbbe6c19e43768
vdisk_diamond7a  16384     THIN           18%    13292       1b93e1c46e0cfdec310087b4180fc3d2
vdisk_diamond7b  131072    THIN           31%    89410       a082bbbb69b069ae3931f171341342f6
vdisk_diamond8a  16384     THIN           25%    12196       dbb870c0ed55791fa75ea2352c237966
vdisk_diamond9a  16384     THIN           20%    13105       378030fb9f0f6b3f15b6aab74fe617da
vdisk_gold2a     16384     THIN           18%    13383       3335cb9729e6f139301b871ab5d2ae72
vdisk_gold3a     16384     THIN           19%    13193       b3da2b4256f897a3a2c048504bd3d80f
vdisk_gold4a     16384     THIN           19%    13218       921839bd8566b55da1744c32d347c43e
vdisk_gold5a     16384     THIN           18%    13342       f9e542e36b5cff11c854f461fbf61361
vdisk_gold6a     16384     THIN           18%    13364       2e1b6657e85846f06716a1bf4eaf6057
vdisk_gold6b     16384     THIN           0%     16385       01ab722c4b41b4f61f88dff3cba96779
vdisk_red2       16384     THIN           18%    13336       e826fbe6b0b97e905ef4a8ba10bf1cda
vdisk_red3       16384     THIN           30%    11454       f389b8b8c42ce02dc06353e622645b84
vdisk_red4       16384     THIN           17%    13533       9605f8227d6846b46fc315d992131de6
. . .
My failed machine is called gold and the Virtual Machine I badly need to recover is called gold6.
The names of the LUs are completely determined by the user - I use a simple vdisk_<LPAR-name><letter> convention, where the letter is used when we have multiple LUs.
In light of these experiments, I think my naming convention could be improved (i.e. it sucks):
- "vdisk" is largely pointless
- Our Virtual Machine (LPAR) names include the name of the machine (like gold), which in an LPM environment is quite dumb - the machine can change every day!
- It is not clear which is the current boot disk - although the first disk is a pretty sure bet :-) - I could highlight the boot disk in the name, or we need to record this fact somewhere.
So obviously the SSP LUs I need to recover are vdisk_gold6a and vdisk_gold6b. As the second disk (b) is 0% used, it is clear that the first disk (a) is the boot disk.
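A quick way to pick them out of a long listing - the padmin shell lets you pipe to grep:
$ lssp -clustername galaxy -sp atlantic -bd | grep gold6
vdisk_gold6a     16384     THIN           18%    13364       2e1b6657e85846f06716a1bf4eaf6057
vdisk_gold6b     16384     THIN           0%     16385       01ab722c4b41b4f61f88dff3cba96779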
So what do I need to do to recover my gold6 Virtual Machine?
Here are the steps:
- Select a target machine and go to its HMC and create a new AIX LPAR with similar CPU and memory (see the HMC command-line sketch after these steps) ~2 minutes
- With the same virtual network(s)
- With a virtual SCSI connection to the VIOS (may need to add this to the VIOS too)
- Go to the VIOS; assuming the new Virtual Machine's vSCSI adapter on the VIOS end is vhost42, my cluster name is galaxy and the SSP is atlantic, run these two commands: ~2 minutes
- mkbdsp -clustername galaxy -sp atlantic -bd vdisk_gold6a -vadapter vhost42
- mkbdsp -clustername galaxy -sp atlantic -bd vdisk_gold6b -vadapter vhost42
- Use the HMC to boot the new Virtual Machine into SMS and select the first disk as the boot disk ~1 minute
- Use the HMC to start the Virtual Machine
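If you prefer the HMC command line to the GUI for the first and last steps, something like the following mksyscfg and chsysstate commands would do it. This is only a minimal sketch: the managed system name (silver), profile name and the CPU/memory values are made-up examples you would replace with the details recorded for the failed LPAR, and the virtual Ethernet and vSCSI adapters (steps 2 and 3) are omitted for brevity:
$ mksyscfg -r lpar -m silver -i "name=gold6,profile_name=default,lpar_env=aixlinux,min_mem=2048,desired_mem=8192,max_mem=16384,proc_mode=shared,min_proc_units=0.2,desired_proc_units=1.0,max_proc_units=4.0,min_procs=1,desired_procs=4,max_procs=8,sharing_mode=uncap,uncap_weight=128"
$ chsysstate -r lpar -m silver -o on -f default -n gold6 -b sms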
This should take about five minutes!!
Not bad for recovering an important service. Of course, once AIX starts it will have to replay any JFS2 logs (usually a matter of seconds) and then you need to start the application or RDBMS, which may need to recover incomplete transactions, so the service may take a little longer to be fully ready.
If that is not fast enough for you, then you can have a recovery Virtual Machine set up in advance on a target machine. In which case, you run steps 1 to 3 in advance.
Then you just have to start the Virtual Machine (LPAR) which takes about 30 to 60 seconds.
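Starting the pre-created recovery LPAR is then a single HMC action - for example, from the HMC command line, using the same made-up machine and profile names as in the sketch above:
$ chsysstate -r lpar -m silver -o on -f default -n gold6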
One word of warning: don't have the original AND the recovery Virtual Machines (LPARs) running at the same time.
I think that would corrupt the file systems within a couple of seconds.
I did check this concept/idea with the SSP developers and they said "Of course Nigel, that will work fine."
Tags: disaster, power7, pools, storage, crash, power, vios, ssp3, shared, recovery, aix