IBM Support

How to analyze Automated LPM snap collection to troubleshoot LPM performance issue

Question & Answer


Question

How can I check LPM performance for a dedicated stream ID?

Answer


Starting with VIOS 2.2.2.0, a newer LPM RAS (Reliability, Availability, and Serviceability) infrastructure has been implemented.

When a migration completes, the VIOS code creates a "mini-snap" before ending the migration.
A mini-snap is essentially a snapshot of all the RAS data for a single migration.
Mini-snaps are taken for both successful and failed migrations.

All the data is collected into a compressed tar file in either the /home/ios/logs/LPM/minisnap_success or /home/ios/logs/LPM/minisnap_fail directory.
Mini-snap files are named <date>_<time>.<streamid>.<src/dst>.tar.gz, which makes it easy to identify when the migration completed and whether the file is from the source or destination MSP.

NOTE:

• Taking mini-snaps for successful migrations is enabled by default and is controlled by the lpm_msnsap_succ ODM attribute (see the check sketched after this note).

• If a migration fails before it can be started (before src_start or dst_start are at least partly executed), no mini-snap is taken and a stream-specific log is stored in /home/ios/logs/LPM/.
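If you want to verify how mini-snap collection is tuned on a given VIOS, a minimal check is sketched below; it assumes the LPM tunables are exposed through the vioslpm0 pseudo-device, which may vary by VIOS level:

$ oem_setup_env
# lsattr -El vioslpm0 | grep -i msnap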


To prevent mini-snaps and other RAS files from eventually filling up /home, a generic file cleanup method was created: a cron job reads a file cleanup configuration file and removes any files that meet the criteria specified there. The configuration file, cleaning.conf, can be found in the /usr/ios/fcleaning directory.

The cleaning criteria are the following:

Stream-specific migmover logs from /home/ios/logs/LPM are removed:
• After 30 days
• Any number over 50
• Any number over 17 if they are older than 7 days and there is less than 999 MB of free space in the file system

Mini-snaps for successful migrations from /home/ios/logs/LPM/minisnap_success are removed:
• After 10 days
• Any number over 17
• As a safeguard, mini-snap tar files for successful migrations that have been uncompressed are removed from /home/ios/logs/LPM/minisnap_success after 7 days

Mini-snaps for failed migrations from /home/ios/logs/LPM/minisnap_fail are removed:
• After 30 days
• Any number over 33
• As a safeguard, mini-snap tar files from failed migrations are removed from /home/ios/logs/LPM/minisnap_fail after 14 days
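To see the exact criteria active on a particular VIOS, you can simply display the cleanup configuration file mentioned above from the root shell:

$ oem_setup_env
# cat /usr/ios/fcleaning/cleaning.conf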

In the following sequence, a mini-snap collection is reviewed to address a performance issue.

The first step is to locate the source MSP by checking the error log of all the VIOS involved and retrieving the entries with "LABEL: MVR_MIG_COMPLETED".
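For example, from the root shell on each VIOS, the completed-migration entries can be listed by filtering the error log on that label:

$ oem_setup_env
# errpt -a -J MVR_MIG_COMPLETED | more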

The Virtual I/O Server errlog will show something similar to:

_______________________________________________
LABEL: MVR_MIG_COMPLETED
IDENTIFIER: 3EB09F5A
Date/Time: Wed Dec 7 07:19:53 2016
Resource Name: Migration
STREAM ID
D591 66C0 A5FC 8502
SERVICES (Source or Target)
Source MSP
_______________________________________________

From "/home/ios/logs/LPM/minisnap_success" directory of Virtual I/O Server which was identified as MSP Source, locate the Stream ID directory to analyze:

/home/ios/logs/LPM/minisnap_success $ ls
20160901_082944.e7a737e8048f0dba.80.dst.tar.gz
20160901_084112.c6d60c1a76109543.77.dst.tar.gz
20160901_090013.8255467d65ab8789.84.dst.tar.gz
20160901_083257.ea17435448d2e6ae.78.dst.tar.gz
20160901_085326.d026da467cda68b3.82.dst.tar.gz
20160902_083332.d59166c0a5fc8502.76.src.tar.gz


Then gunzip and untar the file to be reviewed under /home/ios/logs/LPM/minisnap_success.
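For example, for the source mini-snap identified above:

$ oem_setup_env
# cd /home/ios/logs/LPM/minisnap_success
# gunzip -c 20160902_083332.d59166c0a5fc8502.76.src.tar.gz | tar -xvf -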

Now proceed by reviewing the data under 20160902_083332.d59166c0a5fc8502.76.src.

1. Review the final migration statistics

The "Final migration statistics" are displayed at the end of the global LPM alog file, lpm.log_alog.
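A quick way to jump to that block (a simple sketch, assuming the copy of the log in the mini-snap is already plain text):

# grep -n -i "final migration statistics" lpm.log_alog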



Above "Effective network throughput" confirms that data exchanged between the SOURCE & DESTINATION MSP are not good enough ( 42,67 Mbytes/s )

LPM performance issues are usually caused by VIOS sizing or by network throughput.

2. To eliminate a possible CPU sizing constraint, check "Number of donate cycles" in lpm_<STREAMID>_<ID>.log_alog:

[0 12714008 12/07/16-07:11:11 migmover_cmd.c 1.103 1244] Number of donate cycles 13491.
[0 19005538 12/07/16-07:11:26 migmover_cmd.c 1.103 1244] Number of donate cycles 15053.
[0 13893782 12/07/16-07:11:41 migmover_cmd.c 1.103 1244] Number of donate cycles 16592.
[0 10944580 12/07/16-07:11:56 migmover_cmd.c 1.103 1244] Number of donate cycles 18127.

This is the number of times the VIOS donated cycles to PHYP.
Conclusion:
If this number is not going up, preferably by at least a couple of thousand every 30 seconds, then the problem is likely that the VIOS is not donating enough cycles to PHYP. Make sure the VIOS has enough CPU assigned.
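A minimal sketch to watch the rate of increase (it assumes the log is plain text and that the cycle count is the last field on each line, as in the excerpt above; the +0 strips the trailing dot):

# grep "Number of donate cycles" lpm_<STREAMID>_<ID>.log_alog | \
  awk '{n=$NF+0; if (prev != "") print "increase since previous sample:", n-prev; prev=n}'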


3. If the number of donated cycles is going up as expected, the next step is to check whether there is a network performance issue.

In the past, this could often be confirmed by running a basic bandwidth test using FTP.

The FTP test is more meaningful if it is run in both directions (source to target and target to source), as the results are not always the same.

SRC MSP> ftp <DEST MSP IP>
ftp> bin
ftp> put "|dd if=/dev/zero bs=64k count=50000" /dev/null
ftp> quit

This will give you the number of bytes sent and the amount of time it took.
For a 1 Gbit/s network adapter, expect about 100 MB/s if the link is configured properly and not degraded by other traffic.
The number will be roughly 10x for a 10 Gbit/s network.
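For instance (hypothetical figures), if the put above reports 3276800000 bytes sent in 65.5 seconds, that is about 3276800000 / 65.5 / 1048576 ≈ 47.7 MB/s, well below what a healthy 1 Gbit/s link should deliver.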
However, on newer VIOS levels (3.1.4.x and 4.1), the ftp service is disabled, so the FTP method cannot be used.
The alternative method to test the network performance is to use scp.
The first operation is to create a sizeable file (2 GB in this example) on each MSP (source and destination):
# dd if=/dev/zero of=/home/padmin/file2G count=2000 bs=1m
The best approach is to create a file that the AIX file system caching feature can keep entirely in memory.
Check the "fre" column of vmstat 1 to see the size of the free list: the unit is 4 KB pages, so caching 2 GB requires roughly 500000 free pages.
This gives you an estimate of the free memory, so you can adjust the size of the file created with the dd command.
kthr    memory              page              faults              cpu
----- ----------- ------------------------ ------------ -----------------------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa    pc    ec
 0  0 1057025 802043   0   0   0   0    0   0  32  314 249  0  0 99  0  0.02   1.7
 0  0 1057025 802043   0   0   0   0    0   0  77 3262 434  2  5 92  0  0.14  13.8
 0  0 1057025 802043   0   0   0   0    0   0  32   31 249  0  0 99  0  0.01   1.4
fre = 802043 -> ~3.1 GB
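A rough one-liner to convert the current free list into gigabytes (the column position assumes the vmstat layout shown above, where fre is the fourth field):

# vmstat 1 2 | tail -1 | awk '{printf("free list: %d pages, approx %.1f GB\n", $4, $4*4/1048576)}'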
If you have any doubt about the file size, create a smaller one: looking at the current free list does not account for file cache that is already in use by other workloads.
Once the scp test is done, remove the file; it will then be released from the memory cache.
# scp /home/padmin/file2G padmin@<REMOTE MSP>:/dev/null
We recommend running the scp test in both directions (each MSP acting as sender and receiver).
You can also use a dedicated network performance tool (for example, iperf).
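As a rough sketch of how to turn the scp run into a throughput figure (using the 2 GB example file created above; scp also prints its own transfer rate when run interactively):

# time scp /home/padmin/file2G padmin@<REMOTE MSP>:/dev/null

Divide the file size (2048 MB) by the elapsed ("real") time in seconds; for example, 2048 MB transferred in 21 seconds is roughly 97 MB/s.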

4. If you still experience an LPM performance issue, open a Service Request with IBM Support for further analysis and provide the following data collection from both the source and destination MSPs:

$ oem_setup_env
# echo "y" |snap -r
# mkdir -p /tmp/ibmsupt/testcase
# ifconfig -a > /tmp/ibmsupt/testcase/ifconfig.a.before
# netstat -v > /tmp/ibmsupt/testcase/netstat.v.before
# netstat -s > /tmp/ibmsupt/testcase/netstat.s.before
# netstat -ni > /tmp/ibmsupt/testcase/netstat.i.before
# netstat -nr > /tmp/ibmsupt/testcase/netstat.r.before
# arp -an > /tmp/ibmsupt/testcase/arp.an.before

# startsrc -s iptrace -a "/tmp/ibmsupt/testcase/iptrc.bin"

Perform the FTP (or scp) test between the two MSPs experiencing the network performance issue, then stop iptrace after about 30 seconds.

# stopsrc -s iptrace

# ifconfig -a > /tmp/ibmsupt/testcase/ifconfig.a.after
# netstat -v > /tmp/ibmsupt/testcase/netstat.v.after
# netstat -s > /tmp/ibmsupt/testcase/netstat.s.after
# netstat -ni > /tmp/ibmsupt/testcase/netstat.i.after
# netstat -nr > /tmp/ibmsupt/testcase/netstat.r.after
# arp -an > /tmp/ibmsupt/testcase/arp.an.after
# snap -ac

NOTE: "snap-ac" will automatically collect testcase directory and the LPM minisnap requires to proceed further analysis

[{"Product":{"code":"SSPHKW","label":"PowerVM Virtual I\/O Server"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Component":"Not Applicable","Platform":[{"code":"PF002","label":"AIX"}],"Version":"2.2.4;2.2.3;2.2.2;2.2.1;2.2.0","Edition":"","Line of Business":{"code":"LOB57","label":"Power"}}]

Document Information

Modified date:
12 July 2024

UID

isg3T1024652