Question & Answer
Question
How to free up a failing disk that is part of a PowerVM Virtual I/O Server (VIOS) mirrored rootvg, in preparation for replacing the disk. This applies to VIOS 3.1.
Cause
Mirrored VIOS rootvg disk is failing.
Note: To determine whether the disk needs to be replaced, contact your local Hardware Support Representative.
Answer
In the following example, hdisk0 is the failing disk that needs to be removed from the mirrored rootvg in order to physically replace it, and hdisk15 is the "good" disk.
1. Verify PVs in rootvg
$ lsvg -pv rootvg
rootvg:
PV_NAME PV STATE TOTAL PPs FREE PPs FREE DISTRIBUTION
hdisk15 active 63 16 01..00..00..02..13
hdisk0 active 558 510 111..96..80..111..112
2. Verify rootvg is mirrored and ensure there are NO stale PPs on the good disk.
If an LV is mirrored, the number of PPs will be double the number of LPs, except for the dump device, as shown below:
$ lsvg -lv rootvg
rootvg:
LV NAME TYPE LPs PPs PVs LV STATE MOUNT POINT
hd5 boot 1 2 2 closed/syncd N/A
hd6 paging 1 2 2 open/syncd N/A
paging00 paging 1 2 2 open/syncd N/A
hd8 jfs2log 1 2 2 open/syncd N/A
hd4 jfs2 1 2 2 open/syncd /
hd2 jfs2 6 12 2 open/syncd /usr
hd9var jfs2 1 2 2 open/syncd /var
hd3 jfs2 5 10 2 open/syncd /tmp
hd1 jfs2 12 24 2 open/syncd /home
hd10opt jfs2 3 6 2 open/staled /opt
hd11admin jfs2 1 2 2 open/syncd /admin
livedump jfs2 1 2 2 open/syncd /var/adm/ras/livedump
lg_dumplv sysdump 1 1 1 open/syncd N/A
If LV state shows "stale", determine which PV (hdisk#) has the stale PPs.
$ lspv hdisk0 ->STALE PARTITIONS: 0
$ lspv hdisk15 ->STALE PARTITIONS: 3
If there are stale PPs on the failing disk but not on the good disk, proceed to step 3 to unmirror the volume group from the failing disk.
If the stale PPs are on the known good disk (hdisk15 in this example) but none are on the failing disk, attempt to synchronize them using the padmin syncvg command.
Note 1: All stale PPs on the known good disk must be synchronized before unmirroring the volume group from the failing disk (hdisk0, in this case). In order to synchronize stale PPs on the good disk, the corresponding PPs on the failing, mirrored disk must be accessible. Depending on the status of the failing disk, syncvg may complete successfully, or it may fail. If the command fails, a restore from backup may be the way to go.
To synchronize a logical volume
$ syncvg -lv LVname
To synchronize the entire volume group
$ syncvg -vg rootvg
Verify stale partitions were synchronized (check LV state)
$ lsvg -lv rootvg
or
$ lspv hdisk15 ->STALE PARTITIONS: 0
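For instance, hd10opt shows open/staled in the sample lsvg output above, so a targeted resync of just that LV, using this example's names (illustrative), would be:
$ syncvg -lv hd10opt
$ lspv hdisk15 ->STALE PARTITIONS: 0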
3. Unmirror the volume group from the failing disk.
$ unmirrorios <failing_hdisk#>
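With the disk names used in this example, the invocation (illustrative) would be:
$ unmirrorios hdisk0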
Note 2:
In some cases, unmirrorios and subsequent commands may fail with errors similar to the following:
ksh: The file system has read permission only.
0516-070 : LVM system call found an unaccountable
internal error.
or
ksh: The file system has read permission only.
ksh: <command>: 0403-006 Execute permission denied.
The VIOS errlog may also show LVM errors in addition to the disk operation errors, for example:
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
80D3764C 1101093817 U H LVDD PV NO LONGER RELOCATING NEW BAD BLOCKS
E86653C3 1101093817 P H LVDD I/O ERROR DETECTED BY LVM
DCB47997 1101093817 T H hdisk0 DISK OPERATION ERROR
DCB47997 1101093417 T H hdisk0 DISK OPERATION ERROR
DCB47997 1101093217 T H hdisk0 DISK OPERATION ERROR
If commands fail, making it impossible to clean up prior to the disk replacement, the only alternative at that point may be to reboot the VIOS and go from there. Depending on the state of rootvg, the VIOS may or may not come back. If this is the only VIOS in the managed system, a maintenance window will need to be scheduled to bring the clients down before rebooting the VIOS. If the environment involves dual VIO Servers, and the clients' storage and network are fully redundant via a second VIOS, then rebooting the VIOS in question should have no impact on the clients.
If the VIOS comes back, the failing disk may be put in a Defined state during boot up, and you can try to break the mirror again and remove that failing disk from rootvg at that point (provided there are no stale partitions on the known "good" disk).
If the VIOS does NOT come back, pay close attention to whether the boot process hangs at a particular reference code, as maintenance mode might be helpful for further diagnostics. For example, if the disk failure has caused filesystem corruption, you may attempt to clean that up and break the mirror from maintenance mode and go from there. Please note that if the failing disk caused loss of quorum, rootvg may not be accessible even from maintenance mode. In such a case, the bad disk may need to be replaced first. Then, you can try booting to SMS to select the mirror copy as the first boot device and attempt to boot from it. The VIOS may need to be restored from a backup image as a very last option.
4. If you are able to successfully break the mirror from the failing disk, check to see if the dump device may have been left on the failing disk. By default, the dump device is not mirrored. Therefore, it may be left on the failing disk if that was the disk the VIOS was initially installed to, as is the case in this example:
$ lsvg -pv rootvg ->TOTAL PPs is not equal to FREE PPs
$ lspv -map hdisk0 ->lg_dumplv
a. Verify lg_dumplv is set as primary dump device. If it is, temporarily set the primary dump device to /dev/sysdumpnull in order to try migrating lg_dumplv to the good disk.
To check primary dump device
$ oem_setup_env
# sysdumpdev -l
primary /dev/lg_dumplv
secondary /dev/sysdumpnull
...
To temporarily set primary dump device /dev/sysdumpnull
# sysdumpdev -Pp /dev/sysdumpnull
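Optionally, rerun sysdumpdev -l before exiting back to padmin to confirm the primary dump device change took effect (illustrative output):
# sysdumpdev -l
primary /dev/sysdumpnull
secondary /dev/sysdumpnull
...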
# exit (back to padmin)
b. Attempt to move lg_dumplv to the good disk using the migratepv command.
$ migratepv -lv lg_dumplv <failing_hdisk> <good_hdisk>
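For the disks in this example, the invocation (illustrative) would be:
$ migratepv -lv lg_dumplv hdisk0 hdisk15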
This command may or may not complete, depending on whether the PPs on the failing PV can still be read.
If the command completes successfully, verify the failing disk is empty (TOTAL PPs equals FREE PPs):
$ lsvg -pv rootvg
rootvg:
PV_NAME PV STATE TOTAL PPs FREE PPs FREE DISTRIBUTION
hdisk15 active 63 16 01..00..00..02..13
hdisk0 active 558 558 111..96..80..111..112
If the command fails, remove lg_dumplv from the failing disk and recreate it on the good disk using the same number of LPs shown in the 'lsvg -lv rootvg' output:
$ rmlv lg_dumplv
$ mklv -lv lg_dumplv -type sysdump rootvg 1 <good_hdisk>
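With this example's disk names, and with lg_dumplv showing 1 LP in the lsvg -lv rootvg output above, the sequence (illustrative) would be:
$ rmlv lg_dumplv
$ mklv -lv lg_dumplv -type sysdump rootvg 1 hdisk15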
c. Once the dump device is on the good disk, set it back as primary:
$ oem_setup_env
# sysdumpdev -Pp /dev/lg_dumplv
5. Verify the failing disk is empty
$ lspv -map <failing_hdisk>
6. Remove the failing disk from the volume group
$ reducevg rootvg <failing_hdisk>
7. At this point, you can remove the disk definition and have the disk physically replaced:
$ rmdev -dev <failing_hdisk>
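Putting steps 5 through 7 together with this example's failing disk (illustrative):
$ lspv -map hdisk0 (should list no logical volumes)
$ reducevg rootvg hdisk0
$ rmdev -dev hdisk0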
Document Information
Modified date:
14 December 2020
UID
isg3T1025981