How To
Summary
Determining the cause of an application hang usually requires a fair amount of detective work.
The root cause is most often one of the following bugs or conditions:
• A thread waits for an event and no other thread ever wakes it up
• Two threads each hold a lock and try to acquire the other's lock
• A thread tries to take a lock recursively
• A thread does not acquire locks in the specified order
• Resource crunch
Objective
Data collection steps: collecting the right data at the right time is important because it reduces investigation time.
Environment
AIX
Steps
Follow the steps below to collect data as soon as the application hang is seen.
1> Collect PERFPMR (follow the instructions at https://www.ibm.com/support/pages/collecting-perfpmr-aix-71)
2> Collect data related to the hung processes
a) proctree <pid of the hung process>
b) Get the procstack for all the PIDs shown in the proctree output.
c) procstack <pid of the hung process>
d) procstack <pid of each process related to the hung process>
e) echo "tpid -d <pid of the hung process or any related process>" | kdb | grep "pvthread"
Example:
# echo "tpid -d 4457122" | kdb | grep "pvthread"
pvthread+11E400 4580 srcmstr SLEEP 1E402C1 03C 1536
f) Use the pvthread slot from that output (pvthread+11E400 here) in the following command:
echo "f pvthread+11E400" | kdb
g) Repeat e and f for all the processes (the hung process and its related processes).
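Steps a through d can be scripted so that no process in the tree is missed. The following is a minimal sketch, not an IBM-supplied tool. The function names and output file names are hypothetical, and it assumes that each line of proctree output starts with a PID followed by the command.

```shell
#!/bin/sh
# Sketch only: function and file names are illustrative assumptions.

# Print every PID found in proctree output read on stdin.
# Assumes each line is "<pid> <command>", possibly indented.
tree_pids() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }'
}

# Save the process tree of the hung PID, then one procstack per PID in it.
# $1 = PID of the hung process, $2 = output directory (optional).
collect_procstacks() {
    hung_pid=$1
    outdir=${2:-/tmp/ibmsupt/testcase}
    mkdir -p "$outdir" || return 1
    proctree "$hung_pid" > "$outdir/proctree.$hung_pid.out" || return 1
    tree_pids < "$outdir/proctree.$hung_pid.out" | while read -r pid; do
        procstack "$pid" > "$outdir/procstack.$pid.out" 2>&1
    done
}
```

A call such as `collect_procstacks 4457122` would leave one procstack file per process. The default output directory is the same one the trace step writes to, which is only a convenience choice for this sketch.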
If any procstack command shows tid# lines like the ones below, follow
steps e and f for every thread to display its stack.
Example:
procstack 15794898
15794898: /opt/rsct/bin/rmcd -a IBM.LPCommands -r -S 1500
---------- tid# 43385433 (pthread ID: 1) ----------
0xd027a3ec __fd_select(??, ??, ??, ??, ??) + 0xcc
0x100007f4 select(0x1c, 0x2ff228b8, 0x0, 0x0, 0x0) + 0x34
0x10000f70 ctrl_loop() + 0x730
0x10003560 main(0x6, 0x2ff22d40) + 0x15c0
0x100001b8 __start() + 0x68
---------- tid# 91816417 (pthread ID: 2314) ----------
0xd058cb34 _event_sleep(??, ??, ??, ??, ??, ??) + 0x4f4
0xd058d81c _event_wait(??, ??) + 0x35c
0xd059c
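Steps e through g can be sketched as a loop over the pvthread lines that kdb returns, which should also cover the multi-threaded case shown above because tpid -d is expected to list one pvthread entry per thread. The function names are hypothetical, the sketch assumes the pvthread slot is the first column as in the example output, and kdb has to be run as root.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions.

# Print the pvthread slot (first column) of each matching line of
# "tpid -d" output read on stdin.
pvthread_slots() {
    awk '$1 ~ /^pvthread\+/ { print $1 }'
}

# Print the kernel stack of every thread of each PID given as an argument.
dump_kernel_stacks() {
    for pid in "$@"; do
        echo "=== pid $pid ==="
        echo "tpid -d $pid" | kdb | pvthread_slots | while read -r slot; do
            echo "--- $slot ---"
            echo "f $slot" | kdb
        done
    done
}
```

For example, `dump_kernel_stacks 4457122 15794898 > kdb_stacks.out` would gather the stacks for the hung process and one related process in a single file (the file name is arbitrary).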
3> If the hang is easy to re-create, collect a trace:
a) trace -an -C all -T10M -L200M -o /tmp/ibmsupt/testcase/trace.raw
b) Re-create the hang
c) trcstop
d) Collect the snap using "snap -c"
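The trace sequence can be wrapped in two small helpers so that the start and stop commands are always issued the same way. This is a minimal sketch: the helper names, the TRACE_FILE and DRY_RUN variables, and the mkdir call are assumptions added for illustration, while the trace, trcstop, and snap invocations are the ones listed in the steps.

```shell
#!/bin/sh
# Sketch only: helper names, TRACE_FILE, and DRY_RUN are illustrative assumptions.

TRACE_FILE=${TRACE_FILE:-/tmp/ibmsupt/testcase/trace.raw}

# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step a: make sure the output directory exists, then start tracing.
trace_start() {
    run mkdir -p "$(dirname "$TRACE_FILE")" &&
    run trace -an -C all -T10M -L200M -o "$TRACE_FILE"
}

# Steps c and d: stop tracing, then package the collected data.
trace_stop() {
    run trcstop &&
    run snap -c
}
```

Typical use would be to call `trace_start`, re-create the hang, and then call `trace_stop`. Setting `DRY_RUN=1` first prints the commands without running them, which is a safe way to review them on a production system.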
4> If you cannot re-create the hang and need a faster resolution, take a system dump.
Note: this will reboot the system.
a) # sysdumpstart -p (from the victim LPAR)
or
b) # chsysstate -m <Server name> -o dumprestart -r lpar -n <lpar name> (from the HMC)
c) # savecore -fd /var/adm/ras
481-183 Saving 432887296 bytes of system dump in /var/adm/ras/vmcore.0.BZ
(0)
d) Collect the snap using "snap -c"
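Because step c copies the dump into /var/adm/ras, it can help to confirm beforehand that the directory has enough free space. The following is a minimal sketch under the assumption that the expected size in bytes is already known, for example from the estimate that `sysdumpdev -e` prints. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: the function name is an illustrative assumption.

# Succeed if directory $2 has at least $1 bytes free.
# df -P selects the POSIX output format, where column 4 is the
# available space; -k reports it in 1024-byte units.
has_room_for_dump() {
    need_bytes=$1
    dir=${2:-/var/adm/ras}
    free_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$free_kb" ] || return 1
    awk -v need="$need_bytes" -v free="$free_kb" \
        'BEGIN { exit !(free * 1024 >= need) }'
}
```

For instance, `has_room_for_dump 432887296 /var/adm/ras || echo "not enough space in /var/adm/ras"` checks against the dump size reported in the savecore message above.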
Additional Information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Author: Chetan Gaonkar
Operating System: AIX and VIOS
Hardware: Power
Feedback: aix_feedback@wwpdl.vnet.ibm.com, cgaonkar@in.ibm.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Document Location
Worldwide
Document Information
Modified date:
26 June 2021
UID
ibm16467457