IBM Support

How to Replace Failing Disk in VIOS, Mirrored Rootvg

Question & Answer


Question

How to free up a failing disk that is part of a PowerVM Virtual I/O Server (VIOS) mirrored rootvg, in preparation for replacing the disk. This applies to VIOS 3.1.

Cause

Mirrored VIOS rootvg disk is failing.
Note: To determine whether the disk may need to be replaced, contact your local Hardware Support Representative.

Answer

In the following example, hdisk0 is the failing disk that needs to be removed from the mirrored rootvg in order to physically replace it, and hdisk15 is the "good" disk.
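
As an optional first check, the VIOS error log can help confirm which hdisk is reporting problems; a failing disk typically logs disk operation errors similar to the errlog excerpt shown under Note 2 in step 3:
$ errlog
For the detailed, long-format entries:
$ errlog -ls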

1. Verify PVs in rootvg

$ lsvg -pv rootvg
rootvg:
PV_NAME   PV STATE  TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk15   active    63          16          01..00..00..02..13
hdisk0    active    558         510         111..96..80..111..112

2. Verify rootvg is mirrored and ensure there are NO stale PPs on the good disk.

If an LV is mirrored, its number of PPs will be double its number of LPs, except for the dump device, as shown below:
$ lsvg -lv rootvg
rootvg:
LV NAME     TYPE     LPs   PPs   PVs  LV STATE      MOUNT POINT
hd5         boot     1     2     2    closed/syncd  N/A
hd6         paging   1     2     2    open/syncd    N/A
paging00    paging   1     2     2    open/syncd    N/A
hd8         jfs2log  1     2     2    open/syncd    N/A
hd4         jfs2     1     2     2    open/syncd    /
hd2         jfs2     6     12    2    open/syncd    /usr
hd9var      jfs2     1     2     2    open/syncd    /var
hd3         jfs2     5     10    2    open/syncd    /tmp
hd1         jfs2     12    24    2    open/syncd    /home
hd10opt     jfs2     3     6     2    open/staled   /opt
hd11admin   jfs2     1     2     2    open/syncd    /admin
livedump    jfs2     1     2     2    open/syncd    /var/adm/ras/livedump
lg_dumplv   sysdump  1     1     1    open/syncd    N/A

If LV state shows "stale", determine which PV (hdisk#) has the stale PPs.

$ lspv hdisk0  ->STALE PARTITIONS:   0
$ lspv hdisk15 ->STALE PARTITIONS:   3

If there are stale PPs on the failing disk but not on the good disk, proceed to step 3 to unmirror the volume group from the failing disk.

If the stale PPs are on the known good disk (hdisk15 in this example) but none are on the failing disk, attempt to synchronize them using the padmin syncvg command. Note 1: All stale PPs on the known good disk must be synchronized before unmirroring the volume group from the failing disk (hdisk0, in this case). In order to synchronize stale PPs on the good disk, the corresponding PPs on the failing, mirrored disk must be readable. Depending on the status of the failing disk, syncvg may complete successfully, or it may fail. If the command fails, a restore from backup may be the way to go.

To synchronize a logical volume
$ syncvg -lv LVname

To synchronize the entire volume group
$ syncvg -vg rootvg

Verify stale partitions were synchronized (check LV state)
$ lsvg -lv rootvg
or
$ lspv hdisk15 ->STALE PARTITIONS:   0

3. Unmirror the volume group from the failing disk.
$ unmirrorios <failing_hdisk#>
  • Note 2:
    In some cases, unmirrorios and subsequent commands may fail with errors similar to the following:
    • ksh: The file system has read permission only.
      0516-070 : LVM system call found an unaccountable
      internal error.
      or
      ksh: The file system has read permission only.
      ksh: <command>: 0403-006 Execute permission denied.
    These error details are commonly seen as a result of disk/adapter failure.
    The VIOS errlog may also show LVM errors in addition to the disk operation errors, for example:
    IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
    80D3764C   1101093817 U H LVDD           PV NO LONGER RELOCATING NEW BAD BLOCKS
    E86653C3   1101093817 P H LVDD           I/O ERROR DETECTED BY LVM
    DCB47997   1101093817 T H hdisk0         DISK OPERATION ERROR
    DCB47997   1101093417 T H hdisk0         DISK OPERATION ERROR
    DCB47997   1101093217 T H hdisk0         DISK OPERATION ERROR

    If commands fail, making it impossible to clean up prior to the disk replacement, the only alternative at that point may be to reboot the VIOS and go from there. Depending on the state of rootvg, the VIOS may or may not come back. If this is the only VIOS in the managed system, a maintenance window needs to be scheduled to bring the clients down before rebooting the VIOS. If the environment involves dual Virtual I/O Servers, and the clients' storage and network are fully redundant through a second VIOS, then rebooting the VIOS in question should have no impact on the clients.
    If the VIOS comes back, the failing disk may be placed in a Defined state during boot, and you can try to break the mirror again and remove that failing disk from rootvg at that point (provided there are no stale partitions on the known "good" disk).
    If the VIOS does NOT come back, pay close attention to whether the boot process hangs at a particular reference code, as booting into maintenance mode might be helpful for further diagnostics. For example, if the disk failure has caused file system corruption, you may attempt to clean that up and break the mirror from maintenance mode and go from there. Note that if the failing disk caused a loss of quorum, rootvg may not be accessible even from maintenance mode. In that case, the bad disk may need to be replaced first. Then, you can try booting to SMS to select the mirror copy as the first boot device and attempt to boot from it. As a very last resort, the VIOS may need to be restored from a backup image.
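
If unmirrorios completes successfully, you can optionally rerun the commands from step 2 to confirm the mirror was removed; each logical volume should now show a single copy (PPs equal to LPs) and a syncd LV state:
$ lsvg -lv rootvg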

4. If you are able to successfully break the mirror from the failing disk, check to see whether the dump device was left on the failing disk. By default, the dump device is not mirrored; therefore, it can be expected to remain on the failing disk if that was the disk the VIOS was originally installed to, as is the case in this example:
$ lsvg -pv rootvg   -> TOTAL PPs does not equal FREE PPs
$ lspv -map hdisk0  -> lg_dumplv

a. Verify lg_dumplv is set as primary dump device. If it is, temporarily set the primary dump device to /dev/sysdumpnull in order to try migrating lg_dumplv to the good disk.
To check primary dump device
$ oem_setup_env
# sysdumpdev -l                    
primary              /dev/lg_dumplv 
secondary            /dev/sysdumpnull
...
To temporarily set the primary dump device to /dev/sysdumpnull
# sysdumpdev -Pp /dev/sysdumpnull
# exit (back to padmin)

b. Attempt to move lg_dumplv to the good disk using the migratepv command.
$ migratepv -lv lg_dumplv <failing_hdisk> <good_hdisk>
This command may or may not complete, depending on whether the PPs on the failing PV can still be read.
If the command completes successfully, verify the failing disk is empty (TOTAL PPs equals FREE PPs):
$ lsvg -pv rootvg
rootvg:
PV_NAME   PV STATE  TOTAL PPs   FREE PPs    FREE DISTRIBUTION
hdisk15   active    63          16          01..00..00..02..13
hdisk0    active    558         558         111..96..80..111..112

If the command fails, remove lg_dumplv from the failing disk and recreate it on the good disk using the same number of LPs shown in the 'lsvg -lv rootvg' output:
$ rmlv lg_dumplv
$ mklv -lv lg_dumplv -type sysdump rootvg 1 <good_hdisk>
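Either way, you can confirm that lg_dumplv now resides on the good disk (and no longer on the failing one) with the same lspv -map command used above:
$ lspv -map <good_hdisk>
$ lspv -map <failing_hdisk>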

c. Once the dump device is on the good disk, set it back as primary:
$ oem_setup_env
# sysdumpdev -Pp /dev/lg_dumplv
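You can verify the setting with the same sysdumpdev -l command used in step 4a (primary should again show /dev/lg_dumplv), then return to the padmin shell:
# sysdumpdev -l
# exit (back to padmin)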

5. Verify failing disk is empty
$ lspv -map <failing_hdisk>

6. Remove the failing disk from the volume group
$ reducevg rootvg <failing_hdisk>
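To double-check, the padmin lspv listing should no longer show the failing disk as a member of rootvg:
$ lspv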

7. At this point, you can remove the disk definition and have the disk physically replaced:
$ rmdev -dev <failing_hdisk>
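
Once the disk has been physically replaced and assigned back to the VIOS, the replacement is typically reintegrated by configuring the new disk, adding it to rootvg, and re-mirroring. The outline below is a general sketch only; the new hdisk name will differ in your environment, and mirrorios behavior (including whether a VIOS restart is required) can vary by VIOS level, so review the command documentation for your release first:
$ cfgdev
$ lspv                          (identify the new hdisk name)
$ extendvg rootvg <new_hdisk>
$ mirrorios <new_hdisk>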

[{"Line of Business":{"code":"LOB08","label":"Cognitive Systems"},"Business Unit":{"code":"BU054","label":"Systems w/TPS"},"Product":{"code":"SSAVQG","label":"PowerVM VIOS Enterprise Edition"},"ARM Category":[{"code":"a8m50000000L0L6AAK","label":"PowerVM VIOS->ROOTVG"}],"ARM Case Number":"TS004672052","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"3.1.0;3.1.1;3.1.2"}]

Document Information

Modified date:
14 December 2020

UID

isg3T1025981