Troubleshooting
Problem
This document goes through the most common causes of a hanging mksysb operation.
It looks at what happens internally during a NIM backup and shows how to troubleshoot and fix these problems.
Symptom
- The mksysb displays 100% complete on the NIM master, but the prompt is never returned.
- The mksysb hangs at a certain percentage of the backup.
Cause
1 The cause of a hang is usually a problem with the network. At the beginning of a NIM mksysb, the NIMSH daemon running on the client LPAR opens two TCP sessions: one between client port 3901 and a NIM port in the range 1023-513, and one between client port 3902 and a NIM port in the same range. The second session is referred to as the auxiliary (AUX) session and is used to relay the success or failure return code of the mksysb command when the backup completes.
If this session is dropped or interrupted, the NIM master keeps waiting for that return code even after the backup has completed successfully.
2 During a mksysb backup, the “backbyname” command is used to back up the data. If the command is unable to access or read a specific file or directory, the process may hang. Normally, this is caused by a hung NFS mount point, or by one that the root user has no read permission for.
Additionally, this may be caused by a corrupt file system.
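For the first cause, one way to see whether the two NIMSH sessions are present on the client LPAR is to filter the output of "netstat -an" for ports 3901 and 3902. The following is a minimal sketch, not an official procedure. The function name nimsh_sessions is hypothetical, and it assumes the AIX address format of ip.port in the local and foreign address columns.

```shell
# Filter "netstat -an" style output (read from stdin) for TCP sessions
# that involve the NIMSH ports 3901 and 3902.
# Assumes AIX-style addresses written as ip.port in columns 4 and 5.
nimsh_sessions() {
    awk '$1 ~ /^tcp/ && ($4 ~ /\.390[12]$/ || $5 ~ /\.390[12]$/)'
}

# Example use on the client LPAR (uncomment to run):
# netstat -an | nimsh_sessions
```

If the session on port 3902 disappears while the backup is still running, that would be consistent with the AUX session having been dropped.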
Diagnosing The Problem
Mksysb hanging after 100% complete.
To diagnose this problem, take an iptrace or tcpdump from both the NIM LPAR and the client LPAR during the hanging operation.
*Before starting, ensure you have at least 500 MB free in the /tmp file system.
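The free-space requirement can be checked with df before the trace is started. This is a small sketch; the function names free_mb and has_free_mb are hypothetical, and it assumes df accepts the POSIX -P and -k flags.

```shell
# Print the free space, in whole MB, of the file system holding the given path.
# "df -Pk" prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
free_mb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Succeed only if at least the required number of MB is free.
has_free_mb() {
    [ "$(free_mb "$1")" -ge "$2" ]
}

if has_free_mb /tmp 500; then
    echo "/tmp has at least 500 MB free"
else
    echo "/tmp has less than 500 MB free"
fi
```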
You can capture the traces by following the steps below:
- Start iptrace on the client LPAR:
# startsrc -s iptrace -a "-s <NIM IP> -p 3901,3902 -b -L 1000000000 /tmp/<hostname>.iptrace"
- Start iptrace on the NIM LPAR:
# startsrc -s iptrace -a "-s <Client IP> -p 3901,3902 -b -L 1000000000 /tmp/<hostname>.iptrace"
- Start the mksysb operation:
# nim -o define -t mksysb -a server=master -a source=<Client LPAR> -a mk_image=yes -a location=<where to save our mksysb file> <NIM resource name>
- Wait for the process to hang. You can verify that it is hanging by checking whether the mksysb process has finished on the client LPAR, for example using “ps -ef”:
# ps -ef | grep backbyname
root 19529868 18350146 120 06:23:24 - 0:10 backbyname -i -q -v -Z -p -U -f /tmp/20512972.mnt0/kronos.mksysb2
If this process is gone, the backup operation has either failed or completed.
*Note: this process may start a minute or two after initiating the mksysb command on NIM.
- Stop the iptrace on both LPARs and analyze the data:
# stopsrc -s iptrace
You can use a tool like Wireshark to open the trace files and analyze the data. Look for “Retransmission” packets on port 3902 in the client-side trace. They indicate that the client LPAR is trying to send something to NIM but is not getting a reply. You can verify that NIM is not receiving those packets by looking for the same packet ID in the NIM-side trace.
If the packets are missing from the NIM side, they were dropped between the two LPARs, most likely by a firewall.
In this case, the Wireshark analysis shows retransmissions of the “FIN” packet on the NIM AUX port 3902.
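The same search can be done from the command line with tshark, the terminal version of Wireshark, on a workstation where it is installed. The sketch below assumes the trace file has been copied off the LPAR and that your Wireshark build can read it. The TRACE path is a hypothetical placeholder.

```shell
# Display filter: TCP retransmissions on the AUX session port 3902.
AUX_PORT=3902
FILTER="tcp.analysis.retransmission && tcp.port == ${AUX_PORT}"

# TRACE is a hypothetical path; point it at your own copied trace file.
TRACE="${TRACE:-/tmp/client.iptrace}"

# Run tshark only if it is installed and the trace is readable;
# otherwise print the filter so it can be pasted into Wireshark.
if command -v tshark >/dev/null 2>&1 && [ -r "$TRACE" ]; then
    tshark -r "$TRACE" -Y "$FILTER"
else
    echo "filter: $FILTER"
fi
```

The same filter string can be typed into the Wireshark display filter bar.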
The Fix
This is generally caused by the TCP timeout setting on your firewall. Because the AUX session may remain idle for a long time, depending on how long the mksysb takes to complete, some firewalls consider the session inactive and drop it.
To fix this, increase your firewall's TCP timeout window to match the longest time an mksysb operation may take to complete.
As a workaround, the TCP keepalive settings on AIX may be tuned. The most common one is tcp_keepidle, which controls how long a session must be idle before AIX starts sending keepalive packets on it. The default is 14400 half-seconds, which is 2 hours.
IBM does not offer a recommended value, but it should not be less than 15 minutes, or 1800 half-seconds.
To change the value on your AIX system use:
# no -p -o tcp_keepidle=1800
To verify the value is set, use:
# no -a | grep keep
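Because tcp_keepidle is expressed in half-seconds, it is easy to be off by a factor of two when choosing a value. The helper functions below are a hypothetical convenience for converting between minutes and half-seconds. They are not part of AIX.

```shell
# Convert minutes to half-seconds (the unit used by tcp_keepidle).
minutes_to_half_seconds() {
    echo $(( $1 * 60 * 2 ))
}

# Convert half-seconds back to whole minutes.
half_seconds_to_minutes() {
    echo $(( $1 / 2 / 60 ))
}

minutes_to_half_seconds 15      # prints 1800
half_seconds_to_minutes 14400   # prints 120 (2 hours, the default)
```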
Mksysb hanging at a certain percent.
You can confirm that the mksysb is hanging when the mksysb file size is not growing and “backbyname” is still running on the client LPAR:
# ps -ef | grep backby
root 19267770 20709568 101 07:24:20 - 0:00 backbyname -i -q -v -Z -U -f /tmp/20512946.mnt0/kronos.mksysb2
# ls -l /tmp/20512946.mnt0/kronos.mksysb2
-rw-r--r-- 1 root system 541900800 Apr 1 07:24 /tmp/20512946.mnt0/kronos.mksysb2
Wait 5 minutes and check again:
# ls -l /tmp/20512946.mnt0/kronos.mksysb2
-rw-r--r-- 1 root system 541900800 Apr 1 07:24 /tmp/20512946.mnt0/kronos.mksysb2
If the size has not changed, you can stop the mksysb process, and kill “backbyname” on the client LPAR if it is still running after mksysb is stopped.
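The size comparison above can also be scripted. The following sketch reads the size column from "ls -l" twice, with a wait in between, and reports whether the file grew. The function names file_size and is_stalled are hypothetical.

```shell
# Print the size of a file in bytes, taken from column 5 of "ls -l".
file_size() {
    ls -l "$1" | awk '{ print $5 }'
}

# Succeed if the file size is unchanged after waiting the given number of seconds.
is_stalled() {
    before=$(file_size "$1")
    sleep "$2"
    after=$(file_size "$1")
    [ "$before" = "$after" ]
}

# Example (path taken from the ps output above; 300 seconds = 5 minutes):
# is_stalled /tmp/20512946.mnt0/kronos.mksysb2 300 && echo "no growth in 5 minutes"
```

An unchanged size alone does not prove a hang. Check that “backbyname” is still running as well.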
Then run the mksysb again with the following flags to determine the hang point:
# nim -o define -t mksysb -a server=master -a mksysb_flags=piv -a source=kronos -a mk_image=yes -a location=/export/mksysb/kronos.mksysb2 kronos_mksysb2
This displays the files being archived on the console, like this:
a 572 ./usr/share/lib/me/local.me
a 331 ./usr/share/lib/me/null.me
a 2477 ./usr/share/lib/me/refer.me
a 1859 ./usr/share/lib/me/sh.me
a 1208 ./usr/share/lib/me/tbl.me
a 650 ./usr/share/lib/me/thesis.me <<<<
./usr/share/lib/me/local.me
./usr/share/lib/me/null.me
./usr/share/lib/me/refer.me
./usr/share/lib/me/sh.me
./usr/share/lib/me/tbl.me
./usr/share/lib/me/thesis.me
./usr/share/lib/ms <<<<<<<< HANG POINT
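If the console output is saved to a file, the last path printed can be extracted and then inspected. This is a sketch under assumptions: the log path /tmp/mksysb_verbose.log and the function name last_archived_path are hypothetical, and the output is assumed to look like the listing above.

```shell
# Print the path on the last non-empty line of a saved verbose log,
# dropping the leading "a <size>" columns when they are present.
last_archived_path() {
    awk 'NF { line = $0 } END { print line }' "$1" |
        sed -e 's/^a[[:space:]][[:space:]]*[0-9][0-9]*[[:space:]][[:space:]]*//'
}

# Example use on the client LPAR (hypothetical log path; uncomment to run):
# p=$(last_archived_path /tmp/mksysb_verbose.log)
# ls -ld "/$p"     # can root read it?
# df "/$p"         # is it on an NFS mount?
```

If the ls or df command itself hangs on that path, that would point to the hung mount point or file system problem described in the Cause section.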
Document Location
Worldwide
Document Information
Modified date:
16 December 2021
UID
ibm16149163