More and more frequently we get help requests for BPM performance problems from our customers. The investigation of such problems isn't easy at all, but the entry point to the analysis is almost always the same.
This posting should give you some instructions on what you can do to collect the basic information that is useful for further problem analysis.
In most cases our users open problem tickets with information like "BPM system has high CPU usage" or "BPM system crashed with out of memory".
Whereas the second statement "BPM system crashed with out of memory" already sets the direction of the investigation, for the first one "BPM system has high CPU usage" the direction is not clear. Maybe the system runs with high CPU, but why? The root cause could also be a lack of memory: a JVM with a nearly exhausted heap spends most of its CPU time in garbage collection.
The first step of our analysis is always to get an overview of the situation.
If the problem looks like a performance problem, collect the following input:
MustGather: Performance, hang, or high CPU issues with WebSphere Application Server on AIX
MustGather: Performance, hang, or high CPU issues with WebSphere Application Server on Linux
MustGather: Performance, hang, or high CPU issues with WebSphere Application Server on Solaris
MustGather: Performance, hang, or high CPU issues with WebSphere Application Server on Windows
If the problem is more related to an out-of-memory situation, collect the following input:
MustGather: Out of Memory errors with WebSphere Application Server on AIX, Linux, or Windows
What can we do with the collected information now?
Of course you can send it to the support person for further analysis; however, having a look at the output yourself does not hurt!
A few days ago a customer reported a problem with their BPM environment: for some days the Process Center (PC) AppCluster had been showing very high CPU utilization. They were afraid this situation could lead to the applications in production becoming unresponsive.
Without any additional information I requested the performance MustGather.
The attached script(s) from the MustGather document generated a set of output files (e.g. 3 heapdumps taken at intervals of 1 minute, native_stderr.log, native_stdout.log, SystemOut.log, ...). The reason for collecting these files is to better understand the activity within the JVM.
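Just for illustration: the MustGather scripts trigger these dumps from the operating system level, but on an IBM J9 JVM the same idea can be expressed programmatically with the com.ibm.jvm.Dump API. A minimal sketch, assuming an IBM JVM; the three-dumps-one-minute-apart pattern mirrors the script, everything else is a made-up example:

    // Sketch only: take three heapdump/javacore pairs one minute apart,
    // similar to what the MustGather script collects.
    // Requires an IBM J9 JVM (com.ibm.jvm.Dump is not available elsewhere).
    public class DumpCollector {
        public static void main(String[] args) throws InterruptedException {
            for (int i = 1; i <= 3; i++) {
                com.ibm.jvm.Dump.JavaDump();   // writes a javacore*.txt
                com.ibm.jvm.Dump.HeapDump();   // writes a heapdump*.phd
                if (i < 3) {
                    Thread.sleep(60_000);      // one-minute interval between snapshots
                }
            }
        }
    }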
From the javacore we pick up information such as
* memory settings + whether verbose GC is activated
A maximum heap of 14 GB should be sufficient at first glance.
* memory usage
1STHEAPTOTAL Total memory: 15032385536 (0x0000000380000000)
1STHEAPINUSE Total memory in use: 14701247408 (0x000000036C433BB0)
1STHEAPFREE Total memory free: 331138128 (0x0000000013BCC450)
Only 2.2% of the 14 GB heap is still free, which is not much. Note that these values represent only a snapshot in time; usually the heap shrinks and grows.
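With the values above that is 100 * 331138128 / 15032385536 ≈ 2.2%. The same numbers can also be read from inside a running JVM with the standard Runtime API; a minimal sketch, where the mapping to the 1STHEAP* tags is my own annotation:

    // Sketch: compute the same total / in use / free numbers the javacore
    // reports, from inside the running JVM.
    public class HeapSnapshot {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            long total = rt.totalMemory();   // corresponds to 1STHEAPTOTAL
            long free  = rt.freeMemory();    // corresponds to 1STHEAPFREE
            long inUse = total - free;       // corresponds to 1STHEAPINUSE
            System.out.printf("total=%d inUse=%d free=%d (%.1f%% free)%n",
                    total, inUse, free, 100.0 * free / total);
        }
    }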
* who created the javacore
1TISIGINFO Dump Event "user" (00004000) received
If the Dump Event was generated by an out-of-memory situation, we always check the Current Thread. The current thread is the thread that was running when the signal that triggered the javacore was raised.
* hung threads
3XMTHREADINFO "HungThreadDetectorForDeferrable Alarm" J9VMThread:0x0000000031169000, j9thread_t:0x0000010025451510, java/lang/Thread:0x0000000101310E50, state:CW, prio=5
3XMJAVALTHREAD (java/lang/Thread getId:0x30, isDaemon:true)
3XMTHREADINFO1 (native thread ID:0x64F018B, native priority:0x5, native policy:UNKNOWN, vmstate:CW, vm thread flags:0x00000401)
3XMCPUTIME CPU usage total: 0.006508000 secs, user: 0.004455000 secs, system: 0.002053000 secs, current category="Application"
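The 3XMCPUTIME lines become really useful when you compare two javacores taken a minute apart: the threads whose CPU time grew the most are the ones burning the CPU. The same per-thread counters can be read live with the standard ThreadMXBean API; a minimal sketch, not part of the MustGather:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Sketch: list per-thread CPU time at runtime, the live counterpart
    // of the 3XMCPUTIME lines in a javacore. CPU time measurement must be
    // supported by the JVM, otherwise getThreadCpuTime() returns -1 or throws.
    public class ThreadCpu {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            for (long id : mx.getAllThreadIds()) {
                ThreadInfo info = mx.getThreadInfo(id);
                long cpuNanos = mx.getThreadCpuTime(id);
                if (info != null && cpuNanos >= 0) {
                    System.out.printf("%-50s state=%-13s cpu=%.6f s%n",
                            info.getThreadName(), info.getThreadState(), cpuNanos / 1e9);
                }
            }
        }
    }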
Another file we also check for possible problems is 'native_stderr.log'.
This log file records information about errors that occur during the processing of the JVM. The log can be configured to also record information about garbage collection (-verbose:gc).
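For an IBM J9 JVM the relevant options look like this; without -Xverbosegclog the verbose GC output goes to native_stderr.log, and the separate log file name is just an example:

    -verbose:gc
    -Xverbosegclog:verbosegc.log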
We use the 'IBM Pattern Modeling and Analysis Tool for Java Garbage Collector' (PMAT) to visualize the content in an easily readable format.
It gives you lots of information about garbage collection and the Java heap. For example:
At the beginning garbage collection needs only a little time, but later it needs more and more time at ever shorter intervals.
It also shows general information about when the problem occurred.
And here I saw that the high CPU load is related to a Java out-of-memory situation.
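That connection makes sense: when the heap is nearly exhausted, the JVM spends most of its CPU time in back-to-back garbage collection cycles. A minimal sketch of how you could watch the GC overhead from inside the JVM with the standard GarbageCollectorMXBean API; the 60-second window is an arbitrary choice:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Sketch: estimate how much wall-clock time is spent in GC. A steadily
    // rising percentage is the "more and more time at ever shorter intervals"
    // pattern from the PMAT chart, and it shows up as high CPU on the system.
    public class GcOverheadWatch {
        public static void main(String[] args) throws InterruptedException {
            long lastGcMillis = 0;  // note: the first sample covers the time since JVM start
            while (true) {
                long gcMillis = 0;
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    gcMillis += Math.max(0, gc.getCollectionTime());
                }
                double pct = (gcMillis - lastGcMillis) / 600.0;  // percent of a 60 s window
                System.out.printf("GC time in the last minute: %.1f%%%n", pct);
                lastGcMillis = gcMillis;
                Thread.sleep(60_000);
            }
        }
    }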
Usually, when an out-of-memory situation happens, a javacore and a heapdump (with the same timestamp!) are generated.
For the OOM situation here, the related heapdump.phd file was logged and could be requested from our customer.
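Assuming the default IBM dump file naming (javacore.<date>.<time>.<pid>.<seq>.txt and heapdump.<date>.<time>.<pid>.<seq>.phd), the matching pair is easy to spot; here is a small sketch that groups the dumps in a directory by their shared stamp:

    import java.io.File;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: group javacore and heapdump files by their shared
    // date.time.pid stamp, assuming the default IBM file naming,
    // e.g. javacore.20240101.120000.12345.0001.txt (a made-up example).
    public class DumpPairs {
        private static final Pattern STAMP =
                Pattern.compile("(?:javacore|heapdump)\\.(\\d{8}\\.\\d{6}\\.\\d+)\\..*");

        public static void main(String[] args) {
            Map<String, StringBuilder> byStamp = new TreeMap<>();
            File[] files = new File(args.length > 0 ? args[0] : ".").listFiles();
            if (files == null) return;
            for (File f : files) {
                Matcher m = STAMP.matcher(f.getName());
                if (m.matches()) {
                    byStamp.computeIfAbsent(m.group(1), k -> new StringBuilder())
                           .append(' ').append(f.getName());
                }
            }
            byStamp.forEach((stamp, names) -> System.out.println(stamp + ":" + names));
        }
    }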
For the heapdump analysis we use another tool, 'Memory Analyzer', which shows statistics about:
* Leak Suspects
* Top Components
* Heap Consumers
Clicking on Details shows the Java class names that accumulate memory.
In our sample, com.ibm.bpm.search.artifact is a well-known BPM Java package that represents the Process Center Indexer. Here the indexer consumes more than 80% of the available heap, which finally ends in the OOM.
The following developerWorks posting discusses this situation and the final solution:
Solving performance and out-of-memory situations isn't a simple task and can take some time. We cannot always provide a solution as quickly as in the sample above. Sometimes adding more memory to the system may solve an OOM problem, but it is NOT always the final solution. In such cases a deeper investigation is needed to figure out why the system ran into this unstable state.
But what you can do is provide the MustGather input listed at the beginning of this posting together with a clear problem description; this ensures a fast investigation for (mostly) critical situations.
And if this does not help, take two of these and call me in the morning.
Your Dr. Debug