Question & Answer
Question
What are the best settings for my vscsi disk to optimize storage resiliency?
Note: This technote is subject to change whenever we discover a modification is needed. We recommend that you check it from time to time and confirm the values you are using are still accurate.
Answer
In a dual VIOS configuration, it’s always good to build a redundant (better resiliency) and load-balanced (better performance) environment. This techdoc provides some insights to guide users through vscsi adapter and disk tuning. It covers the configuration and settings on the client LPAR side only; all the points mentioned hereunder relate to vscsi client adapters and disks.
In this techdoc we consider the following configuration:
- There are 2 VIOS, and they both have access to the same SAN target LUNs. Those LUNs are configured as hdisks on the VIOS, and all hdisks have reserve_policy = no_reserve.
- The client LPAR has 2 vscsi adapters. Each vscsi adapter is mapped to one of the 2 VIOS previously mentioned, as follows:
  - vscsi0 is attached to one vhost of VIOA
  - vscsi1 is attached to one vhost of VIOB
- The same SAN disks are mapped through VIOA and VIOB to the client LPAR.
With this setup, each SAN disk shared with the client LPAR is configured as a single hdisk with 2 paths, one through vscsi0 (VIOA) and the other through vscsi1 (VIOB).
The driver of a disk shared through a vhost-vscsi attachment does not depend on the end target device type (for instance: file-backed device, logical volume, SSP LU, SAS disk, SAN disk…); in all cases it is configured as a « Virtual SCSI Disk Drive »:
[(0)root@earth]: $ lsdev -Cc disk
...
hdisk10 Available Virtual SCSI Disk Drive
hdisk11 Available Virtual SCSI Disk Drive
The Path Control Module (PCM) (here « PCM/friend/vscsi ») is reported in the disk attributes:
[(0)root@earth]: $ lsattr -El hdisk10
PCM PCM/friend/vscsi Path Control Module False
PR_key_value none N/A True
algorithm fail_over Algorithm True
hcheck_cmd test_unit_rdy Health Check Command True+
hcheck_interval 0 Health Check Interval True+
hcheck_mode nonactive Health Check Mode True+
max_transfer 0x40000 Maximum TRANSFER Size True
pvid 00f610e1d3f4abd10000000000000000 Physical volume identifier False
queue_depth 8 Queue DEPTH True+
reserve_policy no_reserve Reserve Policy True
This PCM provides multipathing capabilities through the MPIO (Multipath I/O) device driver, so there are 2 different paths to reach the disk:
[(0)root@earth]: $ lsmpio -l hdisk10
name path_id status path_status parent connection
hdisk10 0 Enabled Sel vscsi0 820000000000
hdisk10 1 Enabled vscsi1 810000000000
MPIO provides redundancy and failure detection capabilities.
Both parent adapters (vscsi0 & vscsi1) also provide failure detection and recovery mechanisms, which are tunable through vscsi adapter attributes:
[(0)root@earth]: $ lsattr -El vscsi0
rw_timeout 0 Virtual SCSI Read/Write Command Timeout True
vscsi_err_recov delayed_fail N/A True
vscsi_path_to 0 Virtual SCSI Path Timeout True
We cover some of those attributes in this techdoc so as to achieve the highest availability of the disk and to recover from as many failures as possible.
Note: Starting with AIX 7.2 TL05, the rw_timeout value is no longer related to the vscsi adapter. This value has been moved to an attribute of the storage device (disk, …).
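On AIX 7.2 TL05 or later, the current value can therefore be displayed directly on the disk; a minimal illustration below, using hdisk10, the example disk of this document (output omitted):
[(0)root@earth]: $ lsattr -El hdisk10 -a rw_timeout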
Multipathing considerations
As previously indicated, MPIO offers the possibility to access a device through multiple paths, which means MPIO needs an algorithm to determine which path must be used to process an IO request.
For VSCSI disks, there is only one MPIO algorithm available: « fail_over »
With this algorithm, all the IOs are sent through one single path at a time; it does not provide load-balancing capability. The PCM keeps track of all available paths in an ordered list and sends IO requests through the first available path. If this path fails or is disabled, IOs are sent through the next one.
By default, the order of the list follows the parent adapter device number. In our previous example, hdisk10 is configured through vscsi0 and vscsi1, and as you can see in the lsmpio output, vscsi0 is path 0 and vscsi1 is path 1. In most cases vscsi0 is the adapter with the smallest slot number and is usually attached to the 1st VIOS (here VIOA), while vscsi1 has a higher slot number and is usually attached to VIOB.
If there are, let's say, 100 LPARs with the same configuration, each one with 10 disks, then during normal operation VIOA handles the IO requests of 1000 disks, while VIOB has no IO requests to deal with.
This can easily be modified just by changing the « path priority » of the disks on the client LPAR. By default, all paths for a VSCSI disk have the same priority « 1 ». The lspath command allows the user to check various information about a path, including its priority; this value can be changed with the chpath command.
To check current path priority for hdisk10 on path_id 0:
[(0)root@earth]: $ lspath -E -l hdisk10 -i 0 -a priority
priority 1 Priority True
To change the path priority:
[(0)root@earth]: $ chpath -l hdisk10 -p vscsi0 -a priority=2
path Changed
The higher the value, the less this path is prioritized.
If we check the lspath output, we can see the priority is now 2:
[(0)root@earth]: $ lspath -E -l hdisk10 -i 0
priority 2 Priority True
Note: When there are only 2 paths for a disk, there is no need to set the priority to a higher number; the algorithm always chooses the available path with the smallest priority value.
This change is effective immediately, there’s no need to reboot or reconfigure the disk.
By using the path priority we can easily spread the VSCSI IO workload across the 2 VIOS. If the LPAR ID numbering is consistent and about half of the partitions have even numbers while the others are odd, you just need to run the previous command for all disks on the odd-numbered partitions, as shown in the sketch below. All even-numbered LPAR IO requests then go through VIOA by default, and all odd-numbered LPAR IO requests go through VIOB.
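A minimal ksh sketch of this approach, to be run on an odd-numbered LPAR; it assumes the layout described above, where every Virtual SCSI Disk Drive has one path through vscsi0 (VIOA) and one through vscsi1 (VIOB):
# Lower the preference of the vscsi0 path on every VSCSI disk,
# so that the vscsi1 (VIOB) path becomes the preferred one.
for disk in $(lsdev -Cc disk | grep "Virtual SCSI Disk Drive" | awk '{print $1}')
do
    chpath -l $disk -p vscsi0 -a priority=2
done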
Note: Both attributes « algorithm » and « path priority » are fully independent of the disk configuration on the VIOS. While the algorithm on the client LPAR is « fail_over », on the VIOS the physical disk can have a different algorithm (round_robin, shortest_queue… anything else depending on the device driver of this disk).
Disk settings for better resiliency
In the previous topic we saw that the « fail_over » algorithm sends all IOs through a single adapter and automatically redirects them through the next available path when it fails. But if we have 2 paths for the disks and the second one also fails soon after the first one, there will be no more paths available to process IO requests.
This is where the « health check » feature comes into play. There are 3 attributes for health check:
hcheck_cmd test_unit_rdy Health Check Command True+
hcheck_interval 0 Health Check Interval True+
hcheck_mode nonactive Health Check Mode True+
The « hcheck_cmd » is the command type sent to the disk for health check purposes; by default this command is a « Test Unit Ready (TUR) ». We usually do not recommend changing this value.
The « hcheck_mode » determines to which paths the health check command is sent. By default it is sent to « nonactive » paths, which are all paths with no active IO, including paths in failed state.
Note: The health checker never probes paths in « Missing » or « Disabled » state; those paths require manual intervention through the chpath command. Also note that the health checker probes opened disks only (for instance, disks belonging to a varied-on volume group).
The « hcheck_interval » is the delay between 2 health check commands sent for a path. The default value « 0 » indicates that no health checking is done for this disk. As per "IBM AIX MPIO Best practices and considerations", we recommend changing this setting to a value greater than the current rw_timeout value. So assuming the rw_timeout value is set to 120, the hcheck_interval should be set to 150.
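For example, to enable health checking with that recommended value on the example disk of this document (hdisk10; the -U flag, also used in the summary below, applies the change dynamically on a running device):
[(0)root@earth]: $ chdev -l hdisk10 -a hcheck_interval=150 -U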
Why is it not recommended to set the health check interval to a lower value?
The reason for not setting this value too low is to avoid overloading the whole VIOS/SAN infrastructure with health check commands alone. Keep in mind that every 150 seconds a command is sent through all paths of all opened disks, and those commands require some bandwidth from every component between the client LPAR adapter and the end target device. Moreover, it is not always a good idea to recover too rapidly, for instance when intermittent and repeated errors occur. If we recover the path too quickly, it increases the risk that this path fails again soon; a longer health check interval reduces the use of this path.
VSCSI settings to detect and recover IO error
From the client LPAR perspective, there are various issues that might lead to IO errors: the VIOS can fail or be rebooted, or a problem on the SAN could occur, causing IO errors or timeouts.
There are multiple attributes on the vscsi adapter to address those different issues:
- AIX before 7.2 TL05
rw_timeout 0 Virtual SCSI Read/Write Command Timeout True
vscsi_err_recov delayed_fail N/A True
vscsi_path_to 0 Virtual SCSI Path Timeout True
- AIX 7.2 TL05 and later
vscsi_err_recov delayed_fail N/A True
vscsi_path_to 0 Virtual SCSI Path Timeout True
The « vscsi_err_recov » attribute on the vscsi adapter is similar to « fc_err_recov » on the fcs adapter. This setting allows the vscsi adapter to detect link errors between the vscsi adapter and the vhost or the end target device. For a multipath disk, it is usually better to enable fast failure of IOs so that the failover to another adapter happens as soon as possible; it is recommended to set this attribute to « fast_fail ». (In a single-path configuration, the default value delayed_fail is recommended.)
Starting with AIX 7.2 TL05, the new default value for vscsi_err_recov is fast_fail. While it is still recommended to set this value to delayed_fail in a single-VIOS configuration, the default value fast_fail does not introduce any new challenge.
The « vscsi_path_to » is the amount of time the driver waits for an IO to be serviced before that IO is ended. When the IO duration reaches this timeout value, one more attempt to contact the vhost adapter is made, waiting 60 seconds for a response. If there is still no answer from the server adapter, all IOs through this adapter are declared failed. The PCM then attempts to issue those failed IOs through another path or, if there is no other path available, returns the IO as failed to the application. The recommended value for « vscsi_path_to » is 30.
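For example, both adapter settings can be applied in one command on the example adapter vscsi0 (the -P flag, as used in the summary below, records the change in the ODM and makes it effective at the next reboot):
[(0)root@earth]: $ chdev -l vscsi0 -a vscsi_err_recov=fast_fail -a vscsi_path_to=30 -P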
The Read Write timeout
The « rw_timeout » value was introduced in AIX 6.1 TL9, and it was first designed as an attribute of the vscsi adapter. But starting with AIX 7.2 TL5, rw_timeout is a disk attribute. This change was introduced because the recommended value can differ depending on the backing device type, and a single vscsi adapter might address different backing devices.
The goal of the rw_timeout feature is to time each individual IO command at the VSCSI layer and help detect possible hung IO conditions. It does not rely on VIOS availability: in some cases the VIOS can be healthy and correctly answer the ping from the vscsi adapter, yet an IO request might take too much time to complete (SAN issue, performance problem…). When enabled, all IOs are timed, and if any takes longer than the « rw_timeout » value, the VSCSI adapter times out the IO. The VSCSI adapter is then closed, and an attempt to reopen it is performed. As the IO failed through this path, the MPIO PCM then retries the IO through another path.
When it was first released, the default value for « rw_timeout » was 0, which means the feature is disabled and the duration of IO requests is not monitored. The valid range to enable « rw_timeout » was 120 - 3600 (in seconds). While a user might set it to a value lower than 120, the VSCSI device driver automatically raises any lower value to 120.
With AIX 7.1 TL05, AIX 7.2 TL02 or higher, the lowest value for rw_timeout was reduced from 120 seconds down to 45 seconds. 45 became the new default value until SSP LUs were introduced as backing devices, at which point the new default became 300.
There is no rw_timeout value that suits all configurations and avoids all possible issues. While we could think that the lower the value, the better, that is actually not completely true: the recovery process for a timed-out IO is more disruptive than an IO failure. (As stated above, it implies a full reconfiguration of the adapter and thus impacts all devices using this adapter.) The rw_timeout value should be set high enough to make sure that IOs survive any potential failure in the SAN fabric or on the storage device (for instance, rw_timeout should be longer than the delay a SAN switch needs to perform a reboot, or the time it takes for a storage port to restart…). In that case it is usually better to set the value somewhere around 120.
If the application on the LPAR has its own failover mechanism for failed IOs, or it has a lower timeout, you may still need to set it to 45. But keep in mind that in that case any IO timeout implies a more disruptive recovery operation than an IO failure.
Important note about rw_timeout:
It was recently discovered that setting too low a value for rw_timeout on a client LPAR vscsi adapter might cause critical issues for all the client LPARs attached to the same VIOS/physical adapter. Indeed, when the client LPAR adapter disconnects from the VIOS in an attempt to recover the connection, the VIOS detects this disconnect and might close its connection to the storage as well, which leads to IO issues for all other client LPARs served by this VIOS or physical adapter. So if the client LPAR triggers a rw_timeout before the VIOS detects the IO error, you might face this issue.
To avoid this issue, we recommend setting the rw_timeout value on the client LPAR vscsi adapter to a value significantly higher than the rw_timeout on the VIOS, so that the VIOS detects the IO error before the client LPAR does.
A good starting point for setting the rw_timeout on the vscsi adapter is to choose a value 4 times the longest rw_timeout value of the physical disks attached to the VIOS.
For instance, if the disks on VIOS have rw_timeout set to 30, a reasonable value for the vscsi adapter is rw_timeout=120.
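As a minimal sketch of that calculation, the rw_timeout of a backing disk can be checked on the VIOS and 4 times that value then applied on the client adapter. The device names below are assumptions for illustration only (hdisk0 as a backing physical disk on the VIOS, vscsi0 as the client adapter on AIX before 7.2 TL05):
$ lsdev -dev hdisk0 -attr rw_timeout          (on the VIOS, as padmin)
[(0)root@earth]: $ chdev -l vscsi0 -a rw_timeout=120 -P          (on the client LPAR)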
Here are the recommended values:
- For AIX older than 7.2 TL05:
  - If all the configured disks come from « standard / legacy » backing devices (physical volume, logical volume), the vscsi adapter should have rw_timeout set to 4 times the rw_timeout of the physical device (for instance, set it to 120 if the physical disk is set to 30).
  - If any of the disks on the vscsi adapter is backed by an SSP LU, it is recommended that the vscsi adapter has rw_timeout set to 180.
  - Other devices, such as cdrom, can be set to 45.
  - If there are different backing device types, use the greatest of all the values calculated.
- For AIX 7.2 TL05 and later:
  - The default recommended values remain the same as before, but they are now set individually on the device itself and no longer on the vscsi adapter. So if a single vscsi adapter has disks coming from different types of backing devices, each can have its own recommended value.
  - For any disk backed by a legacy device (physical volume, logical volume), set the rw_timeout value to 120 (or 45 in case you are running an application with a short timeout or with its own failover mechanism).
  - For all disks backed by an SSP LU, the default and recommended value is 180.
  - Virtual tape devices have uniquetype "tape/vscsi/ost" or "tape/vscsi/scsd". For virtual tape devices of type "tape/vscsi/ost", a timeout is already defined (the attribute name defined in the ODM is "rwtimeout") and its default value is 144 seconds. For scsd tape devices, the timeout is defined in the scsd VPD page and is not user configurable.
  - Other devices have a default value of 45.
  - Those default values for rw_timeout are based on the assumption that the vscsi devices provisioned on the client come from dual VIOS. If only one VIOS is serving the client, it is recommended to set rw_timeout to a value longer than the default.
Default recommended values summary:
The recommended values in a multipath setup are:
* Before AIX 7.2 TL05:
For VSCSI disks:
- Leave the MPIO PCM algorithm unchanged: « fail_over »
- Change the path priority on some of the LPARs with the « chpath » command
- Leave « hcheck_cmd » unchanged: « test_unit_rdy »
- Leave « hcheck_mode » unchanged: « nonactive »
- Enable health check by setting « hcheck_interval » to 150 (or 210 in case rw_timeout is 180 for SSP disks):
  chdev -l hdiskXX -a hcheck_interval=150 -U
For VSCSI adapters:
- Enable fast failure on error detection:
  chdev -l vscsiYY -a vscsi_err_recov=fast_fail -P
- Enable vhost adapter polling with « vscsi_path_to »:
  chdev -l vscsiYY -a vscsi_path_to=30 -P
- Enable the IO timing mechanism with « rw_timeout »:
  - If all backing devices are physical volumes or legacy backing devices (assuming the backing device is set with rw_timeout 30):
    chdev -l vscsiYY -a rw_timeout=120 -P
  - If some of the backing devices are SSP LUs:
    chdev -l vscsiYY -a rw_timeout=180 -P
  - or, for AIX before 7.1 TL04 / 7.2 TL02:
    chdev -l vscsiYY -a rw_timeout=120 -P
* AIX 7.2 TL05 and later:
For VSCSI disks:
- Leave the MPIO PCM algorithm unchanged: « fail_over »
- Change the path priority on some of the LPARs with the « chpath » command
- Leave « hcheck_cmd » unchanged: « test_unit_rdy »
- Leave « hcheck_mode » unchanged: « nonactive »
- Enable the IO timing mechanism with « rw_timeout »:
  - If the backing device is a physical volume or logical volume:
    - For applications with a low timeout and/or providing their own failover mechanism (the disk rw_timeout on the VIOS should be 10, if supported by the storage vendor):
      chdev -l hdiskX -a rw_timeout=45 -P
    - For all other cases (disk rw_timeout is 30 on the VIOS):
      chdev -l hdiskX -a rw_timeout=120 -P
  - If the backing device is an SSP LU:
    chdev -l hdiskX -a rw_timeout=180 -P
- Enable health check by setting « hcheck_interval » to 150 (or 210 in case rw_timeout is 180 for SSP disks):
  chdev -l hdiskXX -a hcheck_interval=150 -U
For VSCSI adapters:
- Enable fast failure on error detection:
  chdev -l vscsiYY -a vscsi_err_recov=fast_fail -P
- Enable vhost adapter polling with « vscsi_path_to »:
  chdev -l vscsiYY -a vscsi_path_to=30 -P
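Once the changes are applied (and the LPAR rebooted where the -P flag was used), the resulting values can be verified with the commands already shown in this document, for instance on the example devices hdisk10 and vscsi0 (the rw_timeout attribute only appears on the disk on AIX 7.2 TL05 and later):
[(0)root@earth]: $ lsattr -El hdisk10 -a hcheck_interval -a rw_timeout
[(0)root@earth]: $ lsattr -El vscsi0 -a vscsi_err_recov -a vscsi_path_to
[(0)root@earth]: $ lsmpio -l hdisk10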
Document Information
Modified date:
11 October 2023
UID
isg3T1025729