REDUNDANCY? YOU CAN SAY THAT AGAIN.
In a dual-path VIO server configuration, your SAN LUNs are (hopefully) protected from disk failures by being part of a RAID set of some kind. Ideally you've also got redundancy of adapters, switches, power and ... well, a lot can go wrong, can't it? That's why you protect yourself from any component being a Single Point of Failure (except yourself, of course).
One VIOS down, all LUNs up
But what about the VIO server itself? If you have an outage of one VIO server, you still want your VIO clients to see all of their LUNs through the second VIO server. (A dual VIO server configuration is only available if you have an HMC. IVM is, by its nature, a single VIO server system.) With two VIO servers, you can present each LUN to both VIO servers via MPIO (Multi-Path I/O). Then the VIO client (for example, the AIX LPAR) can see the same LUN through two paths. If a VIO server gets rebooted, usually for an upgrade, the disk at the VIO client simply loses one path to the LUN and can still access the LUN through the other path.
"I've lost my path"
Suppose one VIO server goes down. On the AIX LPAR, the lspath command would show that one path has failed, such as the one on virtual SCSI adapter vscsi0. In the following example, vscsi0 is the virtual SCSI adapter which goes through VIO server 1:
Failed hdisk0 vscsi0
Enabled hdisk0 vscsi1
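If you have more than a handful of disks, eyeballing that output gets old quickly. Here is a minimal sketch that reads lspath output and lists every disk with fewer than two Enabled paths. It assumes the default three-column output shown above (status, disk, adapter); the function name single_path_disks is made up for this example.

```shell
# single_path_disks: read lspath output on stdin and print each disk that
# has fewer than two Enabled paths, followed by its Enabled path count.
# Assumes the default lspath columns: status, disk name, parent adapter.
single_path_disks() {
    awk '
        { seen[$2] = 1 }                 # remember every disk we meet
        $1 == "Enabled" { up[$2]++ }     # count its healthy paths
        END { for (d in seen) if (up[d] < 2) print d, up[d] + 0 }
    ' | sort
}

# On the AIX LPAR you would feed it the real thing:
#   lspath | single_path_disks
```

No output means every disk still has both of its paths.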
Once the VIO server is back up again, and its access to the LUNs is reinstated, you could log on to the LPAR and set the status of the path to "enable" again using the chpath command:
chpath -l hdisk0 -p vscsi0 -s enable
and do the same for each LUN on the other 30 LPARs until your fingers drop off.
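If you do go the manual route, you can at least spare your fingers on each LPAR. The sketch below turns lspath output into one chpath command per Failed path. It only prints the commands, so you can read them before running anything. The function name enable_cmds is invented for this example, and it assumes the same three-column lspath output as above.

```shell
# enable_cmds: read lspath output on stdin and print one chpath command
# for every path whose status is Failed. Nothing is changed on the system;
# the commands are only written to stdout.
enable_cmds() {
    awk '$1 == "Failed" { printf "chpath -l %s -p %s -s enable\n", $2, $3 }'
}

# Review first:   lspath | enable_cmds
# Then apply:     lspath | enable_cmds | sh
```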
Or you could wake up that little man inside.
"What little man?"
You studied computer science and they didn't tell you? There's a little man inside your computer system who will wake up every 60 seconds and look out the window to see if that second path has turned up again. I say make him sing for his supper. You don't buy a dog and bark yourself, do you?
To get him to do this, you just have to ask him nicely by setting the health check interval on the disk. You do this on the VIO client (the LPAR) using chdev (not chpath):
chdev -l hdisk0 -a hcheck_interval=60
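An LPAR rarely has just one disk, so here is a small sketch that writes out the same chdev command for a whole list of disks. It reads disk names on stdin and only prints the commands. The function name hcheck_cmds is made up for this example, and the default of 60 seconds is simply the value used in this article, not a recommendation for every setup.

```shell
# hcheck_cmds: read disk names on stdin (one per line) and print the chdev
# command that sets the health check interval on each of them.
# Optional argument: the interval in seconds (defaults to 60).
hcheck_cmds() {
    interval=${1:-60}
    while read -r disk; do
        if [ -n "$disk" ]; then          # skip blank lines
            echo "chdev -l $disk -a hcheck_interval=$interval"
        fi
    done
}

# On the AIX LPAR, something like:
#   lsdev -Cc disk -F name | hcheck_cmds 60
# and to check one disk afterwards:
#   lsattr -El hdisk0 -a hcheck_interval
```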
If you get an error on that command saying that the device is busy, it's probably because the disk is in a volume group that is varied on. Here are a few ways around it:
Option 1: set it for the next reboot
You can set the change in the ODM using the -P flag. This sort of means permanent change, but really it means postpone the change until the next reboot. This is a bit like buying something by mail order: you can pay for it now, but you won't get it until the LPAR gets bounced.
chdev -l hdisk0 -a hcheck_interval=60 -P
Option 2: just do it
With this option you bring down the volume group (varyoffvg), change the setting and activate the volume group again (varyonvg). Here are the steps:
- unmount file systems after stopping any processes using them
- deactivate any paging spaces in that volume group (swapoff or just use SMIT)
- deactivate the volume group using varyoffvg
- run the chdev command without the -P flag:
chdev -l hdisk0 -a hcheck_interval=60
- activate the volume group using varyonvg
- mount the file systems, reactivate paging space if applicable.
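Those steps can be sketched as a small function. The names datavg, hdisk2 and /data in the usage line are invented examples, as is the function name change_hcheck_offline. It leaves out paging spaces and stopping processes, which you would handle yourself. By default it only prints the steps; it runs them only when RUN=1 is set.

```shell
# change_hcheck_offline: print (or run) the Option 2 steps for one disk.
# Arguments: volume group, disk, mount point of a file system in that VG.
# With RUN unset the steps are only printed; with RUN=1 they are executed
# in order and the function stops at the first command that fails.
change_hcheck_offline() {
    vg=$1 disk=$2 fs=$3
    for step in \
        "umount $fs" \
        "varyoffvg $vg" \
        "chdev -l $disk -a hcheck_interval=60" \
        "varyonvg $vg" \
        "mount $fs"
    do
        echo "$step"
        if [ "${RUN:-0}" = "1" ]; then
            $step || return 1
        fi
    done
}

# Dry run:   change_hcheck_offline datavg hdisk2 /data
# For real:  RUN=1 change_hcheck_offline datavg hdisk2 /data
```

Printing by default is deliberate: varying off the wrong volume group is the kind of mistake you only want to make on paper.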
As you can't vary off a rootvg volume group while it's in use, this option only works for non-rootvg volume groups, and of those, only the ones that you can actually vary off.
Option 3: prevention is better than cure
If you think ahead of time, you can change the health check interval before you add the disk to a volume group.
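To find the disks that are still up for grabs, here is a minimal sketch that picks them out of lspv output. It assumes the usual lspv columns (disk name, PVID, volume group, state), where a disk that belongs to no volume group shows None in the third column. The function name free_disks is made up for this example.

```shell
# free_disks: read lspv output on stdin and print the name of every disk
# that is not yet in a volume group (third column is "None").
free_disks() {
    awk '$3 == "None" { print $1 }'
}

# On the AIX LPAR:
#   lspv | free_disks
# Each disk it lists can take the chdev change before you add it to a
# volume group.
```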
IBM has a very good step-by-step explanation of implementing MPIO. It mentions the need for setting the reserve_policy to no_reserve, and gives some other helpful hints. Although it does say you have to do a reboot of the LPAR, that's probably because it assumes you're setting this up for a rootvg disk.
Update: Watch out for that little man
If you think I'm joking about a little man inside your server, here's a little experiment. Next time you stick an insensitive "LEGACY" label on a computer and announce to the world that it's getting retired after 15 years of outstanding service, see if that little man doesn't protest by breaking the computer just before it really is no longer needed, and just enough to remind you who's the boss.