My Dear Debuggers,
In my last posting (BPM performance problems - how to start with analysis) I explained how to start a performance analysis. Remember, performance problems are always related to these two events
(1) Out of Memory
(2) high CPU usage
Whereas an Out of Memory situation requires to have a look at native_stderr, javacore and heapdump logs, the high CPU problem investigation is based on analysis of output files generated by performance tool.
In our sample today , the performance problem happened on a Linux system. The user reported a high CPU usage even if there is no activity indicated. Following the Performance MustGather for Linux OS I asked customer to run the perftool (linperf.sh) and provide the captured stuff
This script captures metrics during the event and write the information to several files
This file gives an overview of virtual memory statistics. Here you may get indication of memory related issues (e.g. swapped memory, that may leads to high CPU)
All records > 0 in swap section indicate a memory problem. In our sample case we haven't such a situation.
This file provides statistics of all running processes
We see the most activity is owned by PID:31984
This give an overview of the top running processes (most CPU-intensive tasks on the system)
Here again, process with PID: 31984 owns the most %CPU time of all running processes (likely, for BPM this is always a process like /opt/IBM/WebSphere or /opt/IBM/BPM)
For this top PID the collector creates an additional statistic
We can see another process PID:41584 (WebContainer thread) consume nearly complete CPU time !
So what we know so far:
Customer run linperf tool to collect performance statistic. And of course, this tool also appears in the list of running processes - it is the process with PID:31984.
But more interesting for our analysis is PID:41584 logged by linperf tool as the one that eat the complete CPU time and running quite long ! Don't be confused by the output referring to PID, it is actually the Thread ID).
Within the javacores we can review the details of the thread callstack.
The javacore files show the Thread ID in HEX format. To be able to search for the Thread ID in question we have to convert threadID from DEC to HEX
Now we can search the javacore for Thread ID A270 (search for xA270) to get more details
Using the "java/lang/Thread getID:0x1E9AF" from the xA270 Thread , we can check the Systemout.log files (0001e9af) for error details.
In our sample, it seems to be a SSL problem - during SSLHandshake a hunging situation happened and it never completed correctly.
There is an existing defect for this situation - System property to enable SSL Channel timeoutValueInSSLClosingHandshake property
APAR PI70332 (WAS FP 126.96.36.199) introduced a proper "timeoutValueInSSLClosingHandshake" to terminate this kind of loop. It is set to -1 per default and should be adjusted. Later on, APAR PI78738 was created, setting the default value to 30 seconds.
As a result, those operations will be terminated after configured time (e.g. 30 seconds).
In order to permit performance analysis, it needs some experiences and training. But as we say many times: "No master fell just from heaven! ". This posting should encourage you to retrace every step, maybe you can find the one or other cause of the problem by yourself.
And if this does not help, take two of these and call me in the morning.
Your Dr. Debug