IBM Support

Identifying High CPU or Hang condition in WebSphere on AIX for non-thread safe HashMap usage

Technical Blog Post


Abstract

Identifying High CPU or Hang condition in WebSphere on AIX for non-thread safe HashMap usage

Body

 

I wanted to list some steps below to help Identify a High CPU or possible Hang condition in WebSphere Application Server on AIX resulting from non-thread safe HashMap usage. The Hashmap APIs contributing to High CPU or Hung Threads at the top of the java stack, can include removeEntry(), getEntry() or findNonNullKeyEntry() function calls.

*Note that this issue can prevail on any OS platform and that the naming convention for the files shown below does vary.

 

1)  Collect the Performance Hang High CPU MustGather data with WebSphere Application Server on AIX to assist with diagnosing the problem.

Run the included aixperf.sh [PID] script as root to generate performance cpu data along with three javacores and also the log files should be reviewed during analysis.

See MustGather: Performance, hang, or high CPU issues with WebSphere Application Server on AIX

 

2)  Review VMSTAT output first to confirm and verify if a High CPU condition exists.

Mon Apr 27 18:59:37 EDT 2015

System configuration: lcpu=8 mem=14336MB ent=0.20

kthr      memory              page                faults              cpu          
----- --------------  ------------------------  ------------ -----------------------

 r  b   avm   fre     re  pi  po  fr   sr  cy  in    sy   cs us sy id wa    pc    ec
 8  0 2021603 552075   0   0   0   0    0   0  78 31760 7965 90  2  8  0  1.98 988.6
 8  0 2020966 552712   0   0   0   0    0   0   7  4481 3759 93  0  6  0  2.00 998.4
 7  0 2020985 552693   0   0   0   0    0   0  10  6483 4028 92  0  8  0  2.00 998.5
 6  0 2021003 552675   0   0   0   0    0   0  31  3403 3805 92  0  8  0  2.00 998.5
 6  0 2021169 552509   0   0   0   0    0   0  34  9532 3886 92  0  7  0  1.99 996.8
 7  0 2021170 552508   0   0   0   0    0   0  41  4530 3953 92  0  7  0  2.00 999.0
 7  0 2021175 552503   0   0   0   0    0   0  15  4877 3838 91  0  9  0  2.00 999.4
 6  0 2021175 552502   0   0   0   0    0   0   8  4226 3955 91  0  8  0  2.00 998.9
 7  0 2021171 552505   0   0   0   0    0   0  14  4832 4072 92  0  8  0  2.00 999.4
 6  0 2021174 552502   0   0   0   0    0   0  13  4990 4155 92  0  8  0  2.00 999.1
 7  1 2021240 552436   0   0   0   0    0   0  35  5160 3905 91  0  8  0  2.00 998.9
 7  0 2021244 552432   0   0   0   0    0   0  13  6246 3735 92  0  8  0  2.00 998.8
 7  1 2017732 555050   0   0   0   0    0   0  75  7679 4121 93  0  5  2  2.00 998.6
 6  1 2007371 564512   0   0   0   0    0   0 109  4241 4050 92  0  7  1  2.00 998.8
 8  0 2007364 564519   0   0   0   0    0   0  10  4678 4022 92  0  8  0  2.00 998.4
 8  0 2007361 564523   0   0   0   0    0   0  23  3409 3914 92  0  8  0  2.00 998.8
 9  0 2007408 564486   0   0   0   0    0   0  44  5541 3930 94  0  6  0  2.00 999.0
 7  1 2008045 563849   0   0   0   0    0   0  72  4801 3657 93  0  6  0  2.00 999.5
 7  0 2008045 563849   0   0   0   0    0   0  13  3150 3911 93  0  7  0  2.00 998.9
 9  0 2007407 564486   0   0   0   0    0   0   8  4466 3923 91  0  9  0  2.00 999.3

kthr      memory              page                faults              cpu          
----- --------------  ------------------------  ------------ -----------------------

 r  b   avm   fre     re  pi  po  fr   sr  cy  in    sy   cs us sy id wa    pc    ec
 7  0 2007415 561719   0   0   0   0    0   0   9  4100 3987 92  0  0  2.00 999.6
 7  0 2007415 561718   0   0   0   0    0   0  66  5133 4146 91  0  9  0  2.00 999.2
 7  0 2007415 561718   0   0   0   0    0   0  16  3144 4029 92  0  7  0  2.00 998.5
 7  0 2007415 561718   0   0   0   0    0   0   7  4101 3988 92  0  8  0  2.00 998.8
 8  0 2007413 561717   0   0   0   0    0   0  14  4868 3816 91  0  8  0  2.00 999.2
 8  0 2007413 561717   0   0   0   0    0   0  13  3178 3781 93  0  7  0  2.00 998.5
 7  0 2007413 561716   0   0   0   0    0   0   7  4331 3915 92  0  8  0  2.00 998.9
 6  1 2007413 561716   0   0   0   0    0   0  70  6305 4071 92  0  8  0  2.00 998.7
 8  0 2007655 558631   0   0   0   0    0   0 115 54792 8930 88  5  6  0  1.94 971.8

-->>  Notice above that User CPU is > 90% and Idle CPU is < 10%, which means that a High CPU condition did occur.

 

3)  Review SLEEP.PROF output from the TPROF command to find out what java process (PID) and threads (TIDs) are consuming the largest portion of CPU.

Process                                Freq  Total Kernel   User Shared  Other
=======                                ====  ===== ======   ==== ======  =====
java                                     79  91.38   0.22   0.00   0.08  91.08
wait                                      8   8.40   8.40   0.00   0.00   0.00

Process                   PID      TID  Total Kernel   User Shared  Other
=======                ===      ===  ===== ======   ==== ======  =====
java                 11993262 59047939  15.33   0.01   0.00   0.00  15.32
java                 11993262 79822849  15.30   0.00   0.00   0.00  15.29
java                 11993262 56623133  15.18   0.00   0.00   0.00  15.18
java                 11993262 84869223  15.12   0.00   0.00   0.00  15.11
java                 11993262 52363371  15.08   0.00   0.00   0.00  15.08
java                 11993262 77791471  15.04   0.01   0.00   0.00  15.03

wait                  1114146  1638451   1.55   1.55   0.00   0.00   0.00

 

Convert TID decimal values above to hexadecimal format below:

59047939 = 0x3850003

79822849 = 0x4C20001

56623133 = 0x360001D

84869223 = 0x50F0067

52363371 = 0x31F006B

77791471 = 0x4A300EF

 

4)  Review a Javacore for High CPU thread stacks, to find the Method Calls and associated Code Stacks for the identified High CPU TIDs above, which is referenced under the corresponding (native thread ID).


Dump Event "user" (00004000) received
Date:                 2015/04/27 at 18:59:37

 

"WebContainer : 5"


(native thread ID:0x3850003)


Java callstack:
at java/util/HashMap.getEntry(HashMap.java:490(Compiled Code))
at java/util/HashMap.containsKey(HashMap.java:475(Compiled Code))

at com/djl/webapp/getPolicyAuth(PolicyBean.java:7878(Compiled Code ))

 

"WebContainer : 29"


(native thread ID:0x4C20001)


Java callstack:
at java/util/HashMap.getEntry(HashMap.java:490(Compiled Code))
at java/util/HashMap.containsKey(HashMap.java:475(Compiled Code))

at com/djl/webapp/getPolicyAuth(PolicyBean.java:7878(Compiled Code ))

 

"WebContainer : 24"


(native thread ID:0x360001D)


Java callstack:
at java/util/HashMap.getEntry(HashMap.java:490(Compiled Code))
at java/util/HashMap.containsKey(HashMap.java:475(Compiled Code))

at com/djl/webapp/getPolicyAuth(PolicyBean.java:7878(Compiled Code ))

 

"WebContainer : 21"


(native thread ID:0x50F0067)


Java callstack:
at java/util/HashMap.getEntry(HashMap.java:490(Compiled Code))
at java/util/HashMap.containsKey(HashMap.java:475(Compiled Code))

at com/djl/webapp/getPolicyAuth(PolicyBean.java:7878(Compiled Code ))

 

"WebContainer : 2"


(native thread ID:0x31F006B)


Java callstack:
at java/util/HashMap.getEntry(HashMap.java:491(Compiled Code))
at java/util/HashMap.containsKey(HashMap.java:475(Compiled Code))

at com/djl/webapp/getPolicyAuth(PolicyBean.java:7878(Compiled Code ))

 

"WebContainer : 28"


(native thread ID:0x4A300EF)


Java callstack:
at java/util/HashMap.getEntry(HashMap.java:490(Compiled Code))
at java/util/HashMap.containsKey(HashMap.java:475(Compiled Code))

at com/djl/webapp/getPolicyAuth(PolicyBean.java:7878(Compiled Code ))

 

5)  Review additional Javacores for same PID (11993262) to identify possible Hung Threads.  Sampling only the WebContainer : 21 thread across multiple dumps (spaced two minutes apart), notice that the stack for this particular thread (0x50F0067) has not moved throughout these (3) javacores, indicating a slow moving or possibly hung thread.

*Note that the Hung thread scenario does not always follow suit with a High CPU condition, as it did in this particular example.  There can often be one without the other, but good measure to check for both conditions if have appropriate data available.

 

Dump Event "user" (00004000) received
Date:                 2015/04/27 at 19:01:37

 

"WebContainer : 21"


(native thread ID:0x50F0067)

Java callstack:
at java/util/HashMap.getEntry(HashMap.java:490(Compiled Code))
at java/util/HashMap.containsKey(HashMap.java:475(Compiled Code))

at com/djl/webapp/getPolicyAuth(PolicyBean.java:7878(Compiled Code ))

 

Dump Event "user" (00004000) received
Date:                 2015/04/27 at 19:03:37

 

"WebContainer : 21"


(native thread ID:0x50F0067)


Java callstack:
at java/util/HashMap.getEntry(HashMap.java:491(Compiled Code))
at java/util/HashMap.containsKey(HashMap.java:475(Compiled Code))

at com/djl/webapp/getPolicyAuth(PolicyBean.java:7878(Compiled Code ))

 

6)  Review the SystemOut*.log files for WSRV0605W hung warning messages (hung thread detection policy configuration) and cross-reference the code stacks with the javacores, and in this case there is a corresponding entry or match for the WebContainer : 21 thread.  This thread has been running for 23.51 minutes or 1410650 milliseconds.  You can also search all of the SystemOut*.logs for an associated Completion message, indicating total number of milliseconds or time the thread was running, before it actually completed.  There was no such completion message for this particular example.

e. g.

  [4/27/15 18:59:13:086 EDT] 0000009d ThreadMonitor W   WSVR0605W: Thread "WebContainer : 21" (00001683) has been active for 1410650 milliseconds and may be hung.  There is/are 4 thread(s) in total in the server that may be hung.  	at java.util.HashMap.getEntry(HashMap.java:490)  	at java.util.HashMap.containsKey(HashMap.java:475)          at com/djl/webapp/getPolicyAuth(PolicyBean.java:7878(Compiled Code))

                 

Summary:  Reviewing relevant VMSTAT output, tprof SLEEP.PROF output, Javacore*.txt and SystemOut*.log data, established that both High CPU and Hung Thread conditions exist for this particular example.  Based upon the review and analysis results, the application owner for the associated code in com/djl/webapp/getPolicyAuth, should review the implementation along with the corresponding Java Specification, to help ensure that usage of the HashMap API is done in a thread safe Synchronized manner.  

 

Additional Reference:

High CPU / Hang on java.util.HashMap.findNonNullKeyEntry() due to non-thread-safe usage of HashMap

 

 

[{"Business Unit":{"code":"BU004","label":"Hybrid Cloud"},"Product":{"code":"","label":""},"Component":"","Platform":[{"code":"","label":""}],"Version":"","Edition":""}]

UID

ibm11081371