IBM Support

Diagnosing high CPU usage issue for the MQ agent

Technical Blog Post


Abstract

Diagnosing high CPU usage issue for the MQ agent

Body

It is difficult to determine the root cause when high CPU issue happens in a customer's environment. To narrow down the problematic area to thread level, this document will introduce some useful utilities and commands for various distributed platforms. 

1    Windows
Windbg and Process Explorer utilities are very good utilities to debug Windows programs. You can download them from the following links.
http://msdn.microsoft.com/en-us/windows/hardware/gg463009.aspx
and install Debugging Tools for Windows 32-bit Version then download the Process Explorer from                                                                                        http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx                                                                 
and install the v11.33.
When you start the Process Explorer by clicking the procexp.exe that you installed, then you will see a window like the following in which all Windows processes are listed.  
Select “Configure Symbols...” in the Options menu:
 
Select the dbghelp.dll in the folder where you installed the Windbg tool and select the local symbol path where you put the program symbols there (eg for MQ agent these symbol files are .pdb files which should be shipped when product releases). Click OK. Note that you should select the dbghelp.dll in the directory where you installed the Windbg tool instead of the one in %systemroot%\system32 because this latter copy of dbghelp.dll does not include the functionality required. Wait until you see the high CPU happens again and locate to the kmqlocal.exe process and click “Properties...”.
 
Change to tab Threads and locate to the thread (TID) which is consuming much CPU and click “Stack” button to show its call stack.
 
You will see something like:
 
Select all lines in the “Stack for thread” window and click the Copy button to save the call stack information and send it to IBM support for review.


2    Linux
1. When high CPU happens, run "top -H > top.out", top.out will show the process CPU utilization.

2. Run "ps H -e -o pid,tid,pcpu,cmd --sort=pcpu | grep kmqlocal-PID > ps.out"
 
Note that kmqlocal-PID is the process id of the high CPU process for MQ agent, the tid column show the thread id of the process. Write down the tid number which uses the most CPU usage.
 
3. Use gdb to identify the high CPU thread and take screenshots for the output and send them to IBM Support for review.

1) Start gdb
2) Attach to the kmqlocal process by
 
(gdb) attach PID
 
3) Run "info threads" to get all the thread info. Remember the thread number at the beginning of every line.

For example
 
(gdb) info threads
  25 Thread 4135922592 (LWP 31507)  0xffffe410 in __kernel_vsyscall ()
  24 Thread 4127529888 (LWP 31508)  0xffffe410 in __kernel_vsyscall ()
  23 Thread 4118920096 (LWP 31509)  0xffffe410 in __kernel_vsyscall ()
...
 
4) Switch to the thread and check its back trace (BT):
 
For example:
 
(gdb) thread 25
[Switching to thread 2 (Thread 4060576672 (LWP 3361))]#0  0xffffe410 in __kernel_vsyscall ()
(gdb) BT
#0  0xffffe410 in __kernel_vsyscall ()
#1  0xf79f47e6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#2  0xf76d9206 in service_thread () from /opt/IBM/ITML3WLJ/tmaitm6/lx8263/lib/libkde.so
#3  0xf79f134b in start_thread () from /lib/libpthread.so.0
#4  0xf785c65e in clone () from /lib/libc.so.6


3    AIX
Use tprof to analyze the CPU usage of each application thread.

1.    Determine high CPU thread
The tprof command, an AIX operating system provided utility, uses sampling at the processor level to determine which process, thread and routine is using CPU time. tprof periodically samples which thread and instructions are being run on each processor and maps this back to the process and routine. The information is not 100% accurate, but is representative and gathered with very little performance overhead. The output from the "tprof -skeux sleep 60" command produces a file called sleep.prof. The tprof command uses sampling at the processor level to determine which process, thread and routine is using the processors.
-s Enables shared library profiling.
-k Enables kernel profiling.
-e Turns on kernel extension profiling.
-u Enables user mode profiling.
-x program Specifies the program to be executed by tprof.
Example
Run the trpof and find the function which consumes maximum CPU.                                                                   
  Profile: ./kmqagent                                                                                                         
  Total % For ./kmqagent[856302] thread 2166785 (USER) = 7.04                                                            
Subroutine                                            %   Source      
==========                                         ====== ======      
<0x1005B5D4>                                         3.74             
<0x1005B5E4>                                         1.30    

In most case this command output will show the subroutines’ name then you are able to check the code in those subroutines. However in my case when this article is written, tprof does not work correctly with my application as you can see that it shows HEX values. If you run into same symptom, you may need take additional steps to map it to actual subroutine name by dbx before this issue gets fixed in future.

-bash-3.2# dbx -a 856302  (pid of kmqagent)                                     
Waiting to attach to process 856302 ...                                          
Successfully attached to kmqlocal.                                               
Type 'help' for help.                                                            
reading symbolic information ...warning: no source compiled with -g                                                          
stopped in _event_sleep at 0xd04a1c54 ($t5)                                                                                                                 
(dbx) listi 0x1005B5D4
0x1005b5d4 (at__10RWIsvSlistCFUl+0x168) 8061009c   lwz   r3,0x9c(r1)
0x1005b5d8 (at__10RWIsvSlistCFUl+0x16c) 3863ffff   addi  r3,-1(r3)
0x1005b5dc (at__10RWIsvSlistCFUl+0x170) 9061009c   stw   r3,0x9c(r1)
0x1005b5e0 (at__10RWIsvSlistCFUl+0x174) 80610074   lwz   r3,0x74(r1)
0x1005b5e4 (at__10RWIsvSlistCFUl+0x178) 80630000   lwz   r3,0x0(r3)
0x1005b5e8 (at__10RWIsvSlistCFUl+0x17c) 90610074   stw   r3,0x74(r1)
0x1005b5ec (at__10RWIsvSlistCFUl+0x180) 8061009c   lwz   r3,0x9c(r1)                                                                    0x1005b5f0 (at__10RWIsvSlistCFUl+0x184) 28030000   cmpli cr0,0x0,r3,0x0
0x1005b5f4 (at__10RWIsvSlistCFUl+0x188) 4082ffe0   bne
0x1005b5d4 (at__10RWIsvSlistCFUl+0x168)
0x1005b5f8 (at__10RWIsvSlistCFUl+0x18c) 8061009c   lwz   r3,0x9c(r1)                                                                                                                                              
(dbx) map at__10RWIsvSlistCFUl                                                   
Entry 1:                                                                         
   Object name: kmqlocal                                                         
   Text origin:     0x10000000                                                   
   Text length:     0x73d63d                                                     
   Data origin:     0x20000c4c                                                   
   Data length:     0xd5fc4                                                      
   File descriptor: 0x3
listi: Lists instructions from the application program.                          
map: Displays information about load characteristics of the application   
               
2.    Use procstack <kmqlocal-PID> to Show Call Stacks

 

4    Solaris
1.  prstat -s cpu
prstat command shows the performance data for processes, sort by CPU usage.
PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP   
10488 cccusr   4067M 4042M cpu13    0   10 255:04:16 3.9% kmqagent/23    
11240 cccusr   4066M 4008M cpu24    0   10 295:42:33 3.9% kmqagent/24    
10852 cccusr    132M  114M cpu0    10   10 187:37:52 2.4% kmqagent/24    
26586 dtx01pb4 1517M 1424M cpu8    22   10  23:49:51 1.7% DataFlowEngine/56                                                       
23045 dtx01pb1 1328M 1121M sleep   59    0  10:51:25 0.7% DataFlowEngine/102                                                        19313 root     5848K 5312K sleep   59    0 195:31:09 0.6% itaagtd/1      
18951 dtx01pb2  849M  638M cpu2    49    0   4:11:31 0.4% DataFlowEngine/82                                                         
22711 dtx02pb4 1631M 1415M sleep   59    0  20:06:31 0.3% DataFlowEngine/88                                                         
2523 patrol     32M   28M sleep   29   10 30:59:42  0.3% PatrolAgent/1

2. "pstack process_id_for_kmqagent" for call stack information at thread level
10852:    /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTX
-----------------  lwp# 1 / thread# 1  --------------------
 fe546dec lwp_park (0, 0, 0)
 fe540e00 cond_wait_queue (51c1b8, 51c190, 0, 0, 1c00, 0) + 4c
 fe541348 cond_wait (51c1b8, 51c190, 0, ffffff80, ffffff80, ffffff80) + 10
 fe541384 pthread_cond_wait (51c1b8, 51c190, 0, 0, 51c190, fec1f964) + 8
 fec1f458 CallManager (0, ffbfedc0, 0, ffffff80, ffffff80, ffffff80) + 610
 fec1f964 KDCS_AddCallTask (1, feca0e20, 0, ffffff80, ffffff80, 0) + 11c
 fec24ab8 rpc__listen (a, ffbfedc0, 0, ffffff80, ffffff80, ffffff80) + 2d0
 ff31125c __1cGListen6FipnNNIDL_tag_1e9f__v_ (a, ffbfedc0, 0, ffffff80, 0, 0) + 164
 ff310208 kglpsrvr (2, ffbff3ec, ffffff80, 0, 0, ffffff80) + 4b0
 ff30f740 kramain  (2, ffbff3ec, ffffff80, 0, 0, 0) + 360
 00120c10 main     (2, ffbff3ec, ffbff3f8, 336800, fd7406c0, 0) + fe0
 0005b870 _start   (0, 0, 0, 0, 0, 0) + 108
-----------------  lwp# 2 / thread# 2  --------------------
 fe546dec lwp_park (0, fd67b890, 0)
 fe540e00 cond_wait_queue (3c2528, 3c2538, fd67b890, 0, 3c2538, 0) + 4c
 fe541244 cond_wait_common (3c2528, 3c2538, fd67b890, 0, 0, 0) + 294
 fe5413d8 _cond_timedwait (3c2528, 3c2538, fd67b9c8, 0, 3c2538, 0) + 34
 fe5414cc cond_timedwait (3c2528, 3c2538, fd67b9c8, 0, 0, ccd77e43) + 14
 fe54150c pthread_cond_timedwait (3c2528, 3c2538, fd67b9c8, 0, 3c2538, 127c9c) + c
 0007fcec __1cHCMEventPwaitWithTimeout6Ml_i_ (3c2528, 3e8, 0, 0, 0, ccd77e43) + 22c
 00127c9c __1cUIRAMQdDManagerResourceKCycleTimer6M_v_ (3c1fe0, 1243a8, 14c38, 0, 0, 381980) + 28c
 00124804 __1cKStatDaemon6Fpv_i_ (381940, fd67c000, 0, 0, 0, 1) + 2ec
 000847c0 __1cICMThreadMcallThreadEP6M_v_ (3c76a8, 0, ffffff80, ffffff80, 0, 0) + 108
 00084a90 __1cQThreadDispatcher6Fpv_0_ (3c76a8, fd67c000, ffffff80, ffffff80, 0, 1) + a8
 fe546d4c _lwp_start (0, 0, 0, 0, 0, 0)
                                       
3. 2-5 minutes of output from "truss -fl -p process_id_for_kmqagent" for system call information.

10852/26:    lwp_unpark(7)                    = 0
10852/7:    lwp_park(0x00000000, 0)                = 0
10852/7:    times(0xFCBCF810)                = 220206617
10852/7:    send(269, "1703\0\090AACE W F8888D3".., 149, 0)    = 149
10852/7:    lwp_unpark(9)                    = 0
10852/9:    lwp_park(0x00000000, 0)                = 0
10852/9:    send(261, "\0", 1, 0)                = 1
10852/10:    pollsys(0x004F5608, 15, 0xFCADFCB0, 0x00000000)    = 1
10852/10:    recv(262, "\0", 4, 0)                = 1
10852/10:    recv(262, 0xFCADFD58, 4, 0)            Err#11 EAGAIN
10852/10:    lwp_unpark(9)                    = 0
10852/9:    lwp_park(0x00000000, 0)                = 0
10852/9:    lwp_unpark(8)                    = 0
10852/8:    lwp_park(0x00000000, 0)                = 0
10852/9:    lwp_unpark(25)                    = 0
10852/8:    lwp_unpark(7)                    = 0
10852/7:    lwp_park(0x00000000, 0)                = 0
10852/25:    lwp_park(0x00000000, 0)                = 0
10852/7:    shutdown(269, SHUT_RDWR, SOV_DEFAULT)        = 0
10852/25:    pollsys(0x004F5608, 14, 0xFC84FCB0, 0x00000000)    = 0

4. 1-2 minutes of output from "truss -t \!all -fl -u a.out -p process_id_for_kmqagent" for detailed call information

10852/2@2:              -> __1cGRWTimeLcurrentTime6F_L_(0x3c1fe0, 0x1243a8, 0x14c38, 0x0)
10852/2@2:              <- __1cGRWTimeLcurrentTime6F_L_() = 0xccd78073
10852/2@2:            <- __1cGRWTimeDnow6F_0_() = 0xfd67bbd0
10852/2@2:            -> __1cUIRAMQdDManagerResourceMIsMQdDShutdown6M_i_(0x3c1fe0, 0x3e8, 0x0, 0x0)
10852/2@2:            <- __1cUIRAMQdDManagerResourceMIsMQdDShutdown6M_i_() = 0
10852/2@2:            -> __1cUIRAMQdDManagerResourceMIsMQdDShutdown6M_i_(0x3c1fe0, 0x3e8, 0x0, 0x0)
10852/2@2:            <- __1cUIRAMQdDManagerResourceMIsMQdDShutdown6M_i_() = 0
10852/2@2:            -> __1cUIRAMQdDManagerResourceMIsMQdDShutdown6M_i_(0x3c1fe0, 0x3e8, 0x0, 0x0)
10852/2@2:            <- __1cUIRAMQdDManagerResourceMIsMQdDShutdown6M_i_() = 0
10852/2@2:            -> __1cUIRAMQdDManagerResourceMIsMQdDShutdown6M_i_(0x3c1fe0, 0x3e8, 0x0, 0x0)
10852/2@2:            <- __1cUIRAMQdDManagerResourceMIsMQdDShutdown6M_i_() = 0
10852/2@2:            -> __1cGRWTimeLcurrentTime6F_L_(0x0, 0x3e8, 0x0, 0x0)
10852/2@2:            <- __1cGRWTimeLcurrentTime6F_L_() = 0xccd78073
10852/2@2:            -> __1cHCMEventPwaitWithTimeout6Ml_i_(0x3c2528, 0x3e8, 0x0, 0x0)
10852/3@3:          -> __1cPErrLogCollectorEread6M_v_(0x427d80, 0xffbff32b, 0x0, 0x3032a2)
10852/3@3:            -> __1cPErrLogCollectorMscanMessages6MpnG__FILE__v_(0x427d80, 0xfe5c3464, 0x0, 0xfe5c2fd8)
10852/3@3:            <- __1cPErrLogCollectorMscanMessages6MpnG__FILE__v_() = 0x427d80
10852/3@3:          <- __1cPErrLogCollectorEread6M_v_() = 0x427d80
…                                     
5. Output from "ps -aLef | grep kmqagent"

  cccusr 11240     1     1    24   0   Nov 01 ?           0:00 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1     2    24   0   Nov 01 ?           8:07 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1     3    24   0   Nov 01 ?           2:06 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1     4    24   0   Nov 01 ?          19:55 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1     5    24   0   Nov 01 ?        6150:26 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1     6    24   0   Nov 01 ?           0:00 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1     7    24   0   Nov 01 ?         147:29 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1     8    24   0   Nov 01 ?         106:41 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    28    24   0   Nov 12 ?           0:00 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    10    24   0   Nov 01 ?         180:20 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    11    24   0   Nov 01 ?           0:00 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    12    24   0   Nov 01 ?           0:00 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    13    24   0   Nov 01 ?           0:03 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    14    24   0   Nov 01 ?           0:02 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    20    24   0   Nov 01 ?           0:00 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    16    24   1   Nov 01 ?        10447:14 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    17    24   0   Nov 01 ?           0:00 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    19    24   0   Nov 01 ?           0:01 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    22    24   0   Nov 01 ?           0:00 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    23    24   0   Nov 01 ?           0:00 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    24    24   0   Nov 06 ?          19:54 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    25    24   0   Nov 06 ?          19:53 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
  cccusr 11240     1    26    24   0   Nov 06 ?          19:57 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg
cccusr 11240     1    27    24   0   Nov 06 ?         180:36 /sysu/ccc/sol283/mq/bin/kmqagent /sysu/ccc/config/HKP02SN0006_mq_PSDTXHKP6.cfg   
      
One can see that thread 16 is the one with the longest execution time.

The PS, PSTACK, TRUSS commands combined will help to determine which of the many threads running under the KMQAGENT process is causing the problem.  The PS with -aLef will indicate which thread is using the CPU. The PSTACK will provide the call stack. The TRUSS will provide call activity. Knowing the problem thread from the PS command will allow you to look at TRUSS/PSTACK output related to that thread.


5    IBM i
You can use the WRKACTJOB command. It will list all the jobs with CPU usage so you can select the job instance (e.g. KMQ00001) with option 5 (Work with) and select 20 (Work with threads), it comes as following screenshots.

 
 

Select those threads which consume much CPU. In this case, thread 0000000D is the thread which consumes high CPU. Select 10 to display its call stacks.

 

It tells that MQ agent is busy to process MQ error logs.

 

 

[{"Business Unit":{"code":"BU004","label":"Hybrid Cloud"},"Product":{"code":"","label":""},"Component":"","Platform":[{"code":"","label":""}],"Version":"","Edition":""}]

UID

ibm11085349