IBM Support

Data Collection for Debugging an Application Hang

How To


Summary

Determining the cause of an application hang usually requires a fair amount of detective work.
In most cases, however, the root cause is one of the following bugs:

• A thread is waiting for an event, and no other thread ever wakes it up
• Two threads each hold a lock and try to acquire the other's lock
• A thread tries to take a lock recursively
• A thread does not handle locks in the specified order
• A resource crunch (shortage of a system resource)

Objective

Data collection steps: collecting the right data at the right time is important because it reduces investigation time.
 

Environment

AIX 

Steps

Follow the steps below to collect data as soon as you see the application hang.
1>    Collect PERFPMR (follow the instructions at https://www.ibm.com/support/pages/collecting-perfpmr-aix-71)
2>    Collect data related to the hung processes

a)	proctree <pid of the hung process>
b)	Get the procstack output for all the PIDs shown in the proctree output:
c)	procstack <pid of the hung process>
d)	procstack <pid of each process related to the hung process>
e)	echo "tpid -d <pid of the hung process or any related process>" | kdb | grep "pvthread"
Example:
# echo "tpid -d 4457122" | kdb | grep "pvthread"
pvthread+11E400 4580 srcmstr SLEEP 1E402C1 03C 1536   

f)	Use pvthread+11E400 in the following command:
echo "f pvthread+11E400" | kdb
g)	Repeat steps e and f for all the processes (the hung process and its related processes).
If any procstack output shows tid# lines like the ones below, follow
steps e and f for every thread to display its stack.

Example :
procstack 15794898
15794898: /opt/rsct/bin/rmcd -a IBM.LPCommands -r -S 1500
---------- tid# 43385433 (pthread ID:   1) ----------
0xd027a3ec __fd_select(??, ??, ??, ??, ??) + 0xcc
0x100007f4 select(0x1c, 0x2ff228b8, 0x0, 0x0, 0x0) + 0x34
0x10000f70 ctrl_loop() + 0x730
0x10003560 main(0x6, 0x2ff22d40) + 0x15c0
0x100001b8 __start() + 0x68
---------- tid# 91816417 (pthread ID: 2314) ----------
0xd058cb34 _event_sleep(??, ??, ??, ??, ??, ??) + 0x4f4
0xd058d81c _event_wait(??, ??) + 0x35c
0xd059c
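
Steps a) through g) can be scripted so that no process or thread is missed when the process tree is large. The following is a minimal sketch, not an IBM-supplied tool. The function names (pids_from_proctree, slots_from_kdb, collect_hang_stacks) and the default output directory are hypothetical. It assumes that proctree prints one process per line with the PID as the first field, and that each thread line in the kdb "tpid -d" output starts with the pvthread slot, as in the example above.

```shell
#!/bin/sh
# Sketch only: helper names and output file layout are illustrative assumptions.

# Print the PID (first field) of every process line in proctree output read from stdin.
pids_from_proctree() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

# Print the pvthread slot (first field) of every thread line in
# kdb "tpid -d" output read from stdin.
slots_from_kdb() {
    awk '$1 ~ /^pvthread\+[0-9A-Fa-f]+$/ { print $1 }'
}

# Collect proctree, procstack, and kernel stacks for a hung PID
# and for every process shown in its process tree.
collect_hang_stacks() {
    hung_pid=$1
    outdir=${2:-/tmp/hang_data.$hung_pid}    # hypothetical default location
    mkdir -p "$outdir" || return 1

    # step a: process tree of the hung process
    proctree "$hung_pid" > "$outdir/proctree.out" 2>&1

    for pid in $(pids_from_proctree < "$outdir/proctree.out"); do
        # steps b-d: user-space stacks of each process in the tree
        procstack "$pid" > "$outdir/procstack.$pid.out" 2>&1

        # step e: list the kernel threads of the process
        echo "tpid -d $pid" | kdb > "$outdir/kdb_tpid.$pid.out" 2>&1

        # steps f-g: kernel stack of every thread of the process
        for slot in $(slots_from_kdb < "$outdir/kdb_tpid.$pid.out"); do
            echo "f $slot" | kdb > "$outdir/kdb_stack.$pid.$slot.out" 2>&1
        done
    done
}
```

A possible invocation, run as root, would be "collect_hang_stacks 4457122", after which the files under /tmp/hang_data.4457122 can be sent along with the other data.
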
3>    If the hang is easily reproducible, collect a trace

a)    trace -an -C all -T10M -L200M -o /tmp/ibmsupt/testcase/trace.raw
b)    Re-create the hang
c)    trcstop
d)    Collect the snap using "snap -c"
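
The trace sequence in step 3 can be wrapped in a small function so that the trace is always stopped and the snap always follows. This is only a sketch under stated assumptions: the function name collect_hang_trace and the interactive prompt are hypothetical, and the mkdir is added on the assumption that the output directory may not exist yet. The trace, trcstop, and snap invocations are the ones listed in the steps above.

```shell
#!/bin/sh
# Sketch only: the function name and the prompt are illustrative assumptions.

collect_hang_trace() {
    trace_file=${1:-/tmp/ibmsupt/testcase/trace.raw}

    # Assumption: make sure the output directory exists before tracing starts.
    mkdir -p "$(dirname "$trace_file")" || return 1

    # step a: start the trace with the options given in the document
    trace -an -C all -T10M -L200M -o "$trace_file" || return 1

    # step b: wait while the operator re-creates the hang
    printf 'Re-create the hang now, then press Enter to stop the trace: '
    read _answer

    # step c: stop the trace
    trcstop

    # step d: collect the snap
    snap -c
}
```

Calling "collect_hang_trace" without arguments uses the trace file path from step a.
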
4>    If you cannot re-create the hang and need a faster resolution, take a system dump
Note: this reboots the system.
a)    # sysdumpstart -p (from the hung system)
or
b)    # chsysstate -m <Server name> -o dumprestart -r lpar -n <lpar name> (from the HMC)
c)    # savecore -fd /var/adm/ras
           481-183    Saving 432887296 bytes of system dump in /var/adm/ras/vmcore.0.BZ
         (0)
d)    Collect the snap using "snap -c"
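
Steps c) and d) can be chained so that the snap is collected only after savecore has actually written a dump file. The sketch below is an illustration, not an IBM-supplied procedure: the function names dump_path_from_savecore and save_dump_and_snap are hypothetical, and the parsing assumes that savecore reports the file in a "Saving ... bytes of system dump in <path>" message like the one shown in step c).

```shell
#!/bin/sh
# Sketch only: function names are illustrative; the message format is
# assumed to match the savecore example shown in step c).

# Print the dump file path found in savecore output read from stdin.
dump_path_from_savecore() {
    sed -n 's/.*Saving [0-9][0-9]* bytes of system dump in \([^ ]*\).*/\1/p'
}

# Run savecore, then run snap only if the reported dump file exists
# and is not empty.
save_dump_and_snap() {
    dump_dir=${1:-/var/adm/ras}

    # step c: copy the dump from the dump device into dump_dir
    savecore_out=$(savecore -fd "$dump_dir" 2>&1)
    echo "$savecore_out"

    dump_file=$(echo "$savecore_out" | dump_path_from_savecore)
    if [ -n "$dump_file" ] && [ -s "$dump_file" ]; then
        # step d: collect the snap
        snap -c
    else
        echo "No dump file reported by savecore; snap not collected" >&2
        return 1
    fi
}
```

If the function returns a non-zero status, check the savecore output before running snap manually.
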



Additional Information

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
Author: Chetan Gaonkar
Operating System: AIX and VIOS
Hardware: Power
Feedback: aix_feedback@wwpdl.vnet.ibm.com, cgaonkar@in.ibm.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Document Location

Worldwide


Document Information

Modified date:
26 June 2021

UID

ibm16467457