IBM Support

How to Replace Failing Disk in VIOS, Mirrored Rootvg

Question & Answer


Question

How to free up a failing disk that is part of a PowerVM Virtual I/O Server (VIOS) mirrored rootvg, in preparation for replacing the disk. This applies to VIOS 3.1.

Cause

Mirrored VIOS rootvg disk is failing.
Note: To determine whether the disk may need to be replaced, contact your local Hardware Support Representative.

Answer

In the following example, hdisk0 is the failing disk that needs to be removed from the mirrored rootvg in order to physically replace it, and hdisk15 is the "good" disk.
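
As an optional first check, the VIOS error log can help confirm which hdisk is reporting problems; a failing disk typically logs disk operation errors similar to the errlog excerpt shown under Note 2 in step 3:
$ errlog
For the detailed, long-format entries:
$ errlog -ls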

1. Verify PVs in rootvg

$ lsvg -pv rootvg
rootvg:
PV_NAME   PV STATE  TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk15   active    63          16          01..00..00..02..13
hdisk0    active    558         510         111..96..80..111..112

2. Verify rootvg is mirrored and ensure there are NO stale PPs on the good disk.

If an LV is mirrored, its number of PPs will be double its number of LPs, except for the dump device, as shown below:
$ lsvg -lv rootvg
rootvg:
LV NAME     TYPE     LPs   PPs   PVs  LV STATE      MOUNT POINT
hd5         boot     1     2     2    closed/syncd  N/A
hd6         paging   1     2     2    open/syncd    N/A
paging00    paging   1     2     2    open/syncd    N/A
hd8         jfs2log  1     2     2    open/syncd    N/A
hd4         jfs2     1     2     2    open/syncd    /
hd2         jfs2     6     12    2    open/syncd    /usr
hd9var      jfs2     1     2     2    open/syncd    /var
hd3         jfs2     5     10    2    open/syncd    /tmp
hd1         jfs2     12    24    2    open/syncd    /home
hd10opt     jfs2     3     6     2    open/staled   /opt
hd11admin   jfs2     1     2     2    open/syncd    /admin
livedump    jfs2     1     2     2    open/syncd    /var/adm/ras/livedump
lg_dumplv   sysdump  1     1     1    open/syncd    N/A

If LV state shows "stale", determine which PV (hdisk#) has the stale PPs.

$ lspv hdisk0  ->STALE PARTITIONS:   0
$ lspv hdisk15 ->STALE PARTITIONS:   3

If there are stale PPs on the failing disk but not on the good disk, proceed to step 3 to unmirror the volume group from the failing disk.

If the stale PPs are on the known good disk (hdisk15 in this example) but none are on the failing disk, attempt to synchronize them using the padmin syncvg command. Note 1: All stale PPs on the known good disk must be synchronized before unmirroring the volume group from the failing disk (hdisk0, in this case). In order to synchronize stale PPs on the good disk, the corresponding PPs on the failing, mirrored disk must be readable. Depending on the status of the failing disk, syncvg may complete successfully, or it may fail. If the command fails, a restore from backup may be the way to go.

To synchronize a logical volume
$ syncvg -lv LVname

To synchronize the entire volume group
$ syncvg -vg rootvg

Verify stale partitions were synchronized (check LV state)
$ lsvg -lv rootvg
or
$ lspv hdisk15 ->STALE PARTITIONS:   0

3. Unmirror the volume group from the failing disk.
$ unmirrorios <failing_hdisk#>
  • Note 2:
    In some cases, unmirrorios and subsequent commands may fail with errors similar to the following:
    • ksh: The file system has read permission only.
      0516-070 : LVM system call found an unaccountable
      internal error.
      or
      ksh: The file system has read permission only.
      ksh: <command>: 0403-006 Execute permission denied.
    These error details are commonly seen as a result of disk/adapter failure.
    The VIOS errlog may also show LVM errors in addition to the disk operation errors, for example:
    IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
    80D3764C   1101093817 U H LVDD           PV NO LONGER RELOCATING NEW BAD BLOCKS
    E86653C3   1101093817 P H LVDD           I/O ERROR DETECTED BY LVM
    DCB47997   1101093817 T H hdisk0         DISK OPERATION ERROR
    DCB47997   1101093417 T H hdisk0         DISK OPERATION ERROR
    DCB47997   1101093217 T H hdisk0         DISK OPERATION ERROR

    If commands fail, making it impossible to clean up prior to the disk replacement, the only alternative at that point may be to reboot the VIOS and go from there. Depending on the state of rootvg, the VIOS may or may not come back. If this is the only VIOS in the managed system, a maintenance window needs to be scheduled to bring the clients down before rebooting the VIOS. If the environment involves dual Virtual I/O Servers, and the clients' storage and network are fully redundant through a second VIOS, then rebooting the VIOS in question should have no impact on the clients.
    If the VIOS comes back, the failing disk may be placed in a Defined state during boot, and you can try to break the mirror again and remove that failing disk from rootvg at that point (provided there are no stale partitions on the known "good" disk).
    If the VIOS does NOT come back, pay close attention to whether the boot process hangs at a particular reference code, as booting into maintenance mode might be helpful for further diagnostics. For example, if the disk failure has caused file system corruption, you may attempt to clean that up and break the mirror from maintenance mode and go from there. Note that if the failing disk caused a loss of quorum, rootvg may not be accessible even from maintenance mode. In that case, the bad disk may need to be replaced first. Then, you can try booting to SMS to select the mirror copy as the first boot device and attempt to boot from it. As a very last resort, the VIOS may need to be restored from a backup image.
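
If unmirrorios completes successfully, you can optionally rerun the commands from step 2 to confirm the mirror was removed; each logical volume should now show a single copy (PPs equal to LPs) and a syncd LV state:
$ lsvg -lv rootvg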

4. If you are able to successfully break the mirror from the failing disk, check to see whether the dump device was left on the failing disk. By default, the dump device is not mirrored; therefore, it can be expected to remain on the failing disk if that was the disk the VIOS was originally installed to, as is the case in this example:
$ lsvg -pv rootvg   -> TOTAL PPs does not equal FREE PPs
$ lspv -map hdisk0  -> lg_dumplv

a. Verify lg_dumplv is set as primary dump device. If it is, temporarily set the primary dump device to /dev/sysdumpnull in order to try migrating lg_dumplv to the good disk.
To check primary dump device
$ oem_setup_env
# sysdumpdev -l                    
primary              /dev/lg_dumplv 
secondary            /dev/sysdumpnull
...
To temporarily set the primary dump device to /dev/sysdumpnull
# sysdumpdev -Pp /dev/sysdumpnull
# exit (back to padmin)

b. Attempt to move lg_dumplv to the good disk using the migratepv command.
$ migratepv -lv lg_dumplv <failing_hdisk> <good_hdisk>
This command may or may not complete, depending on whether the PPs on the failing PV can still be read.
If the command completes successfully, verify the failing disk is empty (TOTAL PPs equals FREE PPs):
$ lsvg -pv rootvg
rootvg:
PV_NAME   PV STATE  TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk15   active    63          16          01..00..00..02..13
hdisk0    active    558         558         111..96..80..111..112

If the command fails, remove lg_dumplv from the failing disk and recreate it on the good disk using the same number of LPs shown in the 'lsvg -lv rootvg' output:
$ rmlv lg_dumplv
$ mklv -lv lg_dumplv -type sysdump rootvg 1 <good_hdisk>
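Either way, you can confirm that lg_dumplv now resides on the good disk (and no longer on the failing one) with the same lspv -map command used above:
$ lspv -map <good_hdisk>
$ lspv -map <failing_hdisk>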

c. Once the dump device is on the good disk, set it back as primary:
$ oem_setup_env
# sysdumpdev -Pp /dev/lg_dumplv
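You can verify the setting with the same sysdumpdev -l command used in step 4a (primary should again show /dev/lg_dumplv), then return to the padmin shell:
# sysdumpdev -l
# exit (back to padmin)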

5. Verify failing disk is empty
$ lspv -map <failing_hdisk>

6. Remove the failing disk from the volume group
$ reducevg rootvg <failing_hdisk>
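To double-check, the padmin lspv listing should no longer show the failing disk as a member of rootvg:
$ lspv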

7. At this point, you can remove the disk definition and have the disk physically replaced:
$ rmdev -dev <failing_hdisk>
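
Once the disk has been physically replaced and assigned back to the VIOS, the replacement is typically reintegrated by configuring the new disk, adding it to rootvg, and re-mirroring. The outline below is a general sketch only; the new hdisk name will differ in your environment, and mirrorios behavior (including whether a VIOS restart is required) can vary by VIOS level, so review the command documentation for your release first:
$ cfgdev
$ lspv                          (identify the new hdisk name)
$ extendvg rootvg <new_hdisk>
$ mirrorios <new_hdisk>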

[{"Line of Business":{"code":"LOB08","label":"Cognitive Systems"},"Business Unit":{"code":"BU054","label":"Systems w/TPS"},"Product":{"code":"SSAVQG","label":"PowerVM VIOS Enterprise Edition"},"ARM Category":[{"code":"a8m50000000L0L6AAK","label":"PowerVM VIOS->ROOTVG"}],"ARM Case Number":"TS004672052","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"3.1.0;3.1.1;3.1.2"}]

Document Information

Modified date:
14 December 2020

UID

isg3T1025981