
Using "IBM Pattern Modeling and Analysis Tool for Java Garbage Collector" to Analyze IBM Business Process Manager JVM Memory Usage and Garbage Collections

Technical Blog Post


Abstract

Using "IBM Pattern Modeling and Analysis Tool for Java Garbage Collector" to Analyze IBM Business Process Manager JVM Memory Usage and Garbage Collections

Body

 

As you know, IBM Business Process Manager is built on Java technology, whose performance is heavily influenced by garbage collection. IBM Business Process Manager usually recommends the generational concurrent garbage collection policy (-Xgcpolicy:gencon), which is well suited to applications that create many short-lived objects. You can read this section of the product documentation to better understand this policy: Generational Concurrent Garbage Collector.
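For reference, here is a minimal sketch of the generic JVM argument that selects this policy (on recent IBM Java levels gencon is already the default, so this line mainly documents the intent):

-Xgcpolicy:gencon

In WebSphere Application Server you would typically add this under the server's Java virtual machine settings as a generic JVM argument.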
 
Analyzing the textual verbose GC information recorded in native_stderr.log is the typical way to understand garbage collection behavior. Verbose GC is not enabled by default; here is a guide on how to enable it: Enabling verbose garbage collection (verboseGC) in WebSphere Application Server. Besides reading the detailed verbose garbage collection information directly, the IBM Pattern Modeling and Analysis Tool for Java Garbage Collector is a powerful tool you can leverage. The previous link is an article that provides some scenarios and examples of how to use this tool more effectively.
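As a quick reference, a minimal sketch of the generic JVM arguments that turn verbose GC on (the -Xverbosegclog file name and rollover values here are illustrative assumptions; if you omit -Xverbosegclog, the output goes to native_stderr.log, which is the file analyzed throughout this article):

-verbose:gc
-Xverbosegclog:verbosegc.%Y%m%d.%H%M%S.%pid.txt,5,10000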
 
#1. Leverage the Summary Page to understand the garbage collection behavior
The Summary page provides a lot of useful information, such as the number of Java heap exhaustions, the maximum GC overhead, and the number of allocation failures. These are all important indicators of possibly unhealthy heap usage and garbage collection behavior.
Diagram 1: The Summary page reports the number of Java heap exhaustions, the maximum GC overhead, the number of allocation failures, and other key indicators of heap usage and garbage collection health.
 
Over time there may have been many changes to the settings. The Summary page shows the details of each configuration, together with a general analysis and recommendations, which is a good first step toward understanding garbage collection behavior.
Diagram 2: For example, in this configuration you can see that the heap size setting is 1536m - 2560m and that there was no heap exhaustion during the period.
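The corresponding generic JVM arguments for that configuration would look like the sketch below (the values are taken from the diagram; treat them as an example, not a sizing recommendation):

-Xms1536m
-Xmx2560m

-Xms sets the initial heap size and -Xmx the maximum heap size.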
 
In particular, for each configuration you can open the "Graph View Part" view to understand the garbage collection behavior of that specific JVM configuration. This is often used to compare the performance improvement between two consecutive configurations.
Diagram 3: Click the highlighted button to review a specific garbage collection period.
 
#2. Leverage the Global/Scavenge Collection Analysis view to know whether there are too many global collections
 
The Global/Scavenge Collection Analysis view gives you a general idea of how many of the collections are global versus scavenge, and how long each type takes. A healthy GC configuration usually has more scavenge collections (relatively lightweight) than global collections (relatively heavy).
Diagram 4: This sample clearly shows too many global collections compared to scavenge collections. A healthy garbage collection mechanism has mostly scavenge collections (triggered when the nursery area fills up) rather than global collections (triggered by insufficient tenured space).
 
Diagram 5: This sample shows that the majority of the collections are scavenge collections. However, even though the ratio looks good, you still need to check the Graph View to understand what is actually happening.
 
#3. Leverage "Overhead" in the Graph View and CPU usage from the nmon report to find out whether GC is the root cause of high CPU usage
 
Garbage collection is a common cause of high CPU usage, and the related indicator is "Overhead". A healthy GC typically stays below 10% overhead; if the overhead is continuously above 10%, it usually means extra tuning of the system is required.
Diagram 6: Here is a sample of CPU usage generated by the nmon Analyser during a specific period. During this high-CPU period you can see a clear match between the CPU usage pattern and the GC overhead pattern, so you can conclude that GC overhead is probably the major cause of the high CPU. Further detail is then needed to understand the high GC overhead.
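If you want to approximate the GC overhead curve yourself, the sketch below groups the pause durations reported by <exclusive-end ... durationms="..."> records (shown later in this article) into one-minute buckets and divides by the bucket length; the one-minute bucket size is an arbitrary choice.

# gc_overhead.py - approximate GC pause overhead per minute from <exclusive-end> records (sketch)
import re
import sys
from collections import defaultdict

pause_ms = defaultdict(float)
pattern = re.compile(r'<exclusive-end[^>]*timestamp="([^"]+)"[^>]*durationms="([^"]+)"')

with open(sys.argv[1], errors="replace") as log:   # e.g. native_stderr.log
    for line in log:
        match = pattern.search(line)
        if match:
            minute = match.group(1)[:16]           # "2016-10-13T13:30:07.669" -> "2016-10-13T13:30"
            pause_ms[minute] += float(match.group(2))

for minute in sorted(pause_ms):
    overhead = 100.0 * pause_ms[minute] / 60000.0  # pause time as a share of that minute
    print(f"{minute}  pause={pause_ms[minute]:9.1f} ms  overhead~{overhead:5.1f}%")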
 
#4. Leverage "Used Tenured (before)" and "Used Nursery (before)" items to find out the cause of GC overhead
 
There are many items you can observe in the Graph View, such as Used Memory before and after, Total Memory before and after, Free Tenured before and after, Used Tenured before and after, Total Tenured before and after, Free Nursery before and after, Used Nursery before and after, Total Nursery before and after, and so on. Among them, "Used Tenured (before)" and "Used Nursery (before)" are used most often, since a garbage collection is usually triggered by either a full nursery area or insufficient tenured space.
Diagram 7: Let's dig deeper into the previous scenario, where high CPU was caused by high GC overhead. Based on the diagram above, the high GC overhead (blue line) was mainly caused by insufficient remaining tenured space (purple line) rather than a nursery shortage (red line), so enlarging the overall heap size is the recommendation. You should also check the heap dump files for a memory leak: if the server hits OOM (out of memory) after running for a period of time, and restarting the server helps at first but OOM eventually recurs, that indicates a possible memory leak and deserves further heap dump analysis.
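If you want the raw numbers behind the "Used Tenured (before)" and "Used Nursery (before)" curves, the sketch below pulls them from the log; it assumes the IBM J9 layout in which each <gc-start> element contains <mem type="nursery" .../> and <mem type="tenure" .../> children with free and total attributes (used = total - free). The script name is a placeholder and the attribute layout is an assumption, so adjust the patterns to your own log.

# heap_usage.py - print "used before" sizes per collection (sketch; assumes J9 <gc-start>/<mem-info>/<mem> layout)
import re
import sys

gc_start = re.compile(r'<gc-start[^>]*\btype="([^"]+)"[^>]*timestamp="([^"]+)"')
mem_line = re.compile(r'<mem type="(nursery|tenure)" free="(\d+)" total="(\d+)"')

current = None
with open(sys.argv[1], errors="replace") as log:   # e.g. native_stderr.log
    for line in log:
        start = gc_start.search(line)
        if start:
            current = (start.group(2), start.group(1))   # (timestamp, type)
            continue
        if "</gc-start>" in line or "<gc-end" in line:
            current = None                               # stop before the "after" values are reported
            continue
        if current:
            mem = mem_line.search(line)
            if mem:
                used_mb = (int(mem.group(3)) - int(mem.group(2))) / (1024.0 * 1024.0)
                print(f"{current[0]} {current[1]:8s} used {mem.group(1):7s} (before) = {used_mb:8.1f} MB")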
 
Diagram 8: This is another example, showing that the cause of the garbage collection overhead (blue line) is a nursery area (red line) that is too small, which leads to too many scavenge collections happening too quickly. Short-lived objects that should stay in, and be collected from, the nursery area are instead promoted to the tenured area (purple line), causing the tenured area to keep growing until global collections happen. Enlarging the nursery area (we usually recommend 25% - 40% of the total heap size) is the recommended action.
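As a reminder of what that tuning looks like, here is a sketch of the generic JVM arguments for a 2560m heap with a nursery of roughly 30% (the sizes are illustrative, not a recommendation for your workload):

-Xmx2560m
-Xmn768m

-Xmn sets both the initial and maximum nursery (new area) size; -Xmns and -Xmnx can be used instead to set them separately.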
 
#5. Observe other items in Graph View
 
Other items are also important for further understanding garbage collection behavior. For example, "Free Nursery (after)" is usually close to "Total Nursery (after)", because surviving objects in the nursery area are normally moved to the tenured area, releasing nursery space for new requests. If that is not the case, it usually indicates unhealthy GC.
Diagram 9: In this sample, "Free Nursery (after)" (pink line) keeps decreasing, which is abnormal. The reason is that the tenured area is already full, so objects can only stay in the nursery area (light blue line).
 
#6. Leverage "Global Duration" and "Interval" as indicators of unhealthy garbage collection
 
"Overhead" is the first indicator you should check regarding garbage collection health. A healthy garbage collection policy will have long interval but short duration, or in other words short interval or long duration would impact system performance. To further understand collection interval and duration, you can leverage these two indicators "Global Duration" and "Interval" in the Graph View.
 
"Global Duration" is the indicator you can leverage to know how long the garbage collections would take. If it is too long for example >10s, you better adjust the heap settings to avoid a performance impact.
Diagram 10: In this sample you can see that between 13:31 and 13:56 there are many long-duration (>10,000 ms) garbage collections (red line with arrow mark), and both the tenured area (purple line) and the nursery area (red line without arrow mark) are used up. You need to enlarge the heap size and check the heap dump file for memory leak suspects.
 
Another indicator is "Interval (Since)", which reflects how frequently garbage collections occur. A common case is that the nursery area is set too small, scavenge collections become too frequent, and system performance suffers as a result.
Diagram 11: In this example the nursery area (red line) is set too small and scavenge collections (pink line) are too frequent, which impacts system performance. For example, as the arrow mark shows, there are only 366 ms between two collections.
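To pull these two indicators out of the raw log, the sketch below flags unusually long pauses and unusually short intervals; it relies on the <exclusive-start intervalms="..."> and <exclusive-end durationms="..."> records shown in the next section, and the 10-second and 1-second thresholds are illustrative.

# gc_outliers.py - flag long pauses and short collection intervals (sketch; thresholds are illustrative)
import re
import sys

LONG_PAUSE_MS = 10000.0      # pauses longer than 10 s
SHORT_INTERVAL_MS = 1000.0   # collections less than 1 s apart

start_re = re.compile(r'<exclusive-start[^>]*timestamp="([^"]+)"[^>]*intervalms="([^"]+)"')
end_re = re.compile(r'<exclusive-end[^>]*timestamp="([^"]+)"[^>]*durationms="([^"]+)"')

with open(sys.argv[1], errors="replace") as log:   # e.g. native_stderr.log
    for line in log:
        start = start_re.search(line)
        if start and float(start.group(2)) < SHORT_INTERVAL_MS:
            print(f"{start.group(1)}  short interval: {start.group(2)} ms since previous collection")
        end = end_re.search(line)
        if end and float(end.group(2)) > LONG_PAUSE_MS:
            print(f"{end.group(1)}  long pause: {end.group(2)} ms")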
 
#7. Search keywords in native_stderr.log files directly
 
Experienced analysts often search for keywords directly in the native_stderr.log files. First, you can search for the keyword exclusive-start to see the intervals between garbage collections. For example, the records below indicate collections that are too frequent, which will certainly cause CPU overhead and impact system performance:
<exclusive-start id="2314" timestamp="2016-10-13T13:17:38.903" intervalms="604.634">
<exclusive-start id="2326" timestamp="2016-10-13T13:17:39.435" intervalms="534.695">
<exclusive-start id="2338" timestamp="2016-10-13T13:17:39.794" intervalms="367.203">
<exclusive-start id="2350" timestamp="2016-10-13T13:17:40.185" intervalms="386.276">
<exclusive-start id="2362" timestamp="2016-10-13T13:17:40.653" intervalms="474.111">
<exclusive-start id="2374" timestamp="2016-10-13T13:17:41.263" intervalms="602.427">
<exclusive-start id="2386" timestamp="2016-10-13T13:17:41.607" intervalms="344.765">
<exclusive-start id="2398" timestamp="2016-10-13T13:17:42.153" intervalms="542.125">
<exclusive-start id="2410" timestamp="2016-10-13T13:17:42.966" intervalms="824.648">
<exclusive-start id="2422" timestamp="2016-10-13T13:17:43.716" intervalms="745.673">
<exclusive-start id="2434" timestamp="2016-10-13T13:17:44.435" intervalms="710.354">
 
You can also search for the keyword exclusive-end to see the durations of garbage collections. For example, the records below show collections that are very long and heavily impact system performance; CPU starvation would very likely be reported.
<exclusive-end id="2717" timestamp="2016-10-13T13:30:07.669" durationms="17606.325" />
<exclusive-end id="2748" timestamp="2016-10-13T13:30:35.701" durationms="19027.872" />
<exclusive-end id="2779" timestamp="2016-10-13T13:31:00.498" durationms="18544.570" />
<exclusive-end id="2813" timestamp="2016-10-13T13:31:22.718" durationms="16879.805" />
<exclusive-end id="2836" timestamp="2016-10-13T13:31:45.749" durationms="18004.892" />
<exclusive-end id="2855" timestamp="2016-10-13T13:32:02.422" durationms="11673.137" />
<exclusive-end id="2874" timestamp="2016-10-13T13:32:21.047" durationms="14088.285" />
<exclusive-end id="2893" timestamp="2016-10-13T13:32:39.251" durationms="13719.640" />
 
You can also search for the keyword percolate-collect in the native_stderr.log file to understand when and why a percolate happens. (A scavenge that is converted into a global collection is called a percolate. It is usually an "aggressive" GC, because a previous GC was unable to reclaim sufficient resources; the GC will then try as much as it can, including compaction, class unloading, soft reference clearing, and so on.) For example:
<percolate-collect id="2471" from="nursery" to="global" reason="failed tenure threshold reached" timestamp="2016-06-07T05:34:05.836"/>
<percolate-collect id="2721" from="nursery" to="global" reason="failed tenure threshold reached" timestamp="2016-06-07T06:09:17.265"/>
<percolate-collect id="4413" from="nursery" to="global" reason="failed tenure threshold reached" timestamp="2016-06-07T12:11:02.621"/>
<percolate-collect id="1064" from="nursery" to="global" reason="insufficient remaining tenure space" timestamp="2016-06-07T16:12:52.788"/>
<percolate-collect id="2585" from="nursery" to="global" reason="failed tenure threshold reached" timestamp="2016-06-08T07:10:17.516"/>
<percolate-collect id="3617" from="nursery" to="global" reason="failed tenure threshold reached" timestamp="2016-06-08T13:02:05.910"/>

 

You can also search for the keyword totalBytesRequested in the native_stderr.log file to see whether there are large memory requests, for example:

<af-start id="37541" totalBytesRequested="67108872" timestamp="2016-11-15T13:31:27.101" intervalms="489.437" />
<af-start id="37553" totalBytesRequested="24" timestamp="2016-11-15T13:31:27.570" intervalms="464.174" />
<af-start id="37565" totalBytesRequested="33554440" timestamp="2016-11-15T13:31:27.820" intervalms="252.289" />
<af-start id="37577" totalBytesRequested="67108872" timestamp="2016-11-15T13:31:28.226" intervalms="400.208" />
<af-start id="37589" totalBytesRequested="67108872" timestamp="2016-11-15T13:31:28.491" intervalms="263.373" />
<af-start id="37601" totalBytesRequested="34574776" timestamp="2016-11-15T13:31:28.820" intervalms="336.934" />
 

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"","label":""},"Component":"","Platform":[{"code":"","label":""}],"Version":"","Edition":"","Line of Business":{"code":"","label":""}}]

UID

ibm11080705