A customer did a migration from AIX 5.3 to 6.1 and then called me to report a strange set of symptoms. Some file systems didn't mount following a reboot. When the file systems in the volume group (let's call it datavg) went to mount, they returned the error that there was no such device. If the customer ran an exportvg and an importvg, all the datavg file systems became available. But then another reboot was done and the datavg file systems didn't mount.
Update: I have an idea of what might have happened. The customer (as I should have mentioned) did the migration to AIX 6.1 after restoring from an AIX 5.3 mksysb. I suspect the bosinst.data file had the option to Import user volume groups (such as datavg) set to No. The system had been migrated from 5.3 to 7.1, then when they ran into difficulties, they chose to restore the 5.3 mksysb and then upgrade just to 6.1. Perhaps I should have mentioned that background in the original post. Pretty poor omission if this was a detective story that you had paid hard cold cash for. But it isn't. And you didn't. So read on.
As I only had support over the phone, I had to do some detective work without being able to run any commands or look at screens for myself. (Technology is still pretty primitive down here in the antipodes).
Nested File System - not the culprite
I immediately thought it was a case of nested file systems that were attempting to mount before their parent file systems. This typically happens when the sub-file system (e.g. /tsm/log) appears in the file /etc/filesystems before the parent file system (/tsm). As /tsm hasn't been mounted, there is no mount point directory called /tsm/log, and so the /tsm/log file system fails to mount. This is usually as a result of someone manually editing /etc/filesystems.
As the customer pointed out to me, the missing device was the logical volume /dev/fslv00, not the file system's mount point directory. So I buried my nested file systems theory to be pulled out for next time a customer rings me with missing file systems.
Just, the facts, Sir
With my nested file systems theory soundly demolished, it was time to take a good, hard look at the facts so far:
- After a reboot, exportvg worked without any errors
- importvg then worked and made the missing file systems available
- when rebooting, the problem reappeared
- the VM had been rebooted recently, prior to the AIX migration, and there were no problems.
The fact (fact #1) that the exportvg worked without any errors indicated that the volume group's file systems were not mounted before running the exportvg. If they had been mounted, we would have got some warnings that the file systems were in use. So a successful exporting actually hinted at a problem.
The fact that the importvg worked indicated that the volume group's disks were available to the OS. So there was no issue with SAN connectivity or some volume group corruption.
Since the volume group didn't come up after the reboot, perhaps the problem was with the volume group itself, rather than any one of the file systems or logical volumes belonging to it.
If the volume group had a problem with it, this seems to have occurred during the migration.
Patterns of Expertise
We looked for a pattern, and noticed that all the missing file systems belonged to the one volume group (datavg). The volume group could be imported / varied on from the command line, but didn't come up after a reboot. Then it twigged: the volume group had somehow been set not to varyon automatically following a reboot.
Sure enough, when we ran lsvg datavg we saw that datavg was set not to auto varyon. Nothing wrong with the volume group. It's just that somewhere along the way, someone, or perhaps a bug, had set it to varyon=no. Easily fixed, as the chvg command explains:
chvg -a y vg03
Which is exactly what we did for datavg.
Getting to the root cause
We rebooted and found the problem was certainly fixed. What caused the volume group not to varyon automatically as part of a reboot? A bit hard to know. Perhaps someone ran the importvg with the -n option, which Causes the volume not to be varied at the completion of the volume group import into the system.
Maybe there was a bug in AIX, where the migration didn't auto-varyon volume groups.
We may never get to the root cause. It's not a big concern, as we've set the varyon correctly now, and everything comes up fine after a reboot.