Be careful when diagnosing Java memory leaks
kgibm
I was recently with a customer who believed they had a Java memory leak. They compared heapdumps and couldn't find anything. They had experienced production OutOfMemoryErrors (OOMs) before (for a different reason), and they were so worried about what they were seeing that they increased the maximum heap size to 4GB so that the JVM could handle a day's worth of work, and then put in a process to restart the JVMs every night.
They were using -Xgcpolicy:gencon (the IBM JVM's generational garbage collection policy), so it's important to look at the used heap after global collection, highlighted above with the "GC type" plot. They had a sawtooth pattern, which is typical of a generational garbage collector. With gencon, the general sign of a Java memory leak is a positive slope in the used heap after global collections (i.e. the line you get if you connect the big troughs in the sawtooth). This is more easily visualized by selecting the "Used heap (after global collection)" plot:
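The slope of those troughs can also be estimated numerically. Here is a minimal sketch, assuming you have already extracted (time, used heap after global collection) samples from the verbose GC output. The class name HeapTrend, the method slope, and the sample values are all made up for illustration and are not the customer's data. It fits an ordinary least-squares line through the points:

```java
import java.util.Locale;

/** Sketch: estimate the trend of used heap after global collections. */
final class HeapTrend {

    /**
     * Returns the least-squares slope, in MB per hour, of the given samples.
     * hours[i] is the time of the i-th global collection and usedMb[i] is
     * the heap in use immediately after it.
     */
    static double slope(double[] hours, double[] usedMb) {
        if (hours.length != usedMb.length || hours.length < 2) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        int n = hours.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += hours[i];
            meanY += usedMb[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            double dx = hours[i] - meanX;
            sxy += dx * (usedMb[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0) {
            throw new IllegalArgumentException("timestamps must not all be equal");
        }
        return sxy / sxx;
    }

    public static void main(String[] args) {
        // Hypothetical samples: heap used after each global GC, one per hour.
        double[] hours = {0, 1, 2, 3, 4};
        double[] usedMb = {400, 520, 610, 730, 800};
        System.out.printf(Locale.ROOT, "slope = %.1f MB/hour%n", slope(hours, usedMb));
    }
}
```

A positive slope is only a hint that something deserves a closer look. As this story shows, it is not proof of a leak in and of itself.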
The interesting thing was that pause times (and GC frequency) were not getting worse, and the proportion of time spent in GC was roughly constant at around 1%, which is healthy:
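The figure above came from analyzing verbose GC output. As a rough cross-check from inside a running JVM, the standard java.lang.management API reports accumulated collection time per collector. The sketch below is a different technique from the verbose GC analysis, and the class name GcOverhead is an assumption. Note also that getCollectionTime() is an approximate elapsed time, and for concurrent collectors it is not necessarily the same as pause time:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Locale;

/** Sketch: approximate share of JVM uptime spent in garbage collection. */
final class GcOverhead {

    /** Returns accumulated GC time divided by JVM uptime (0.01 means 1%). */
    static double fraction() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                gcMillis += t;
            }
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis > 0 ? (double) gcMillis / uptimeMillis : 0.0;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "GC overhead: %.2f%%%n", fraction() * 100);
    }
}
```

Sampling this value periodically and watching whether it climbs over the day gives a view similar to the plot: a steady, low proportion suggests the collector is keeping up.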
The mistake the customer made was to assume that rising Java heap usage, in and of itself, was proof of a Java leak. When they analyzed the heapdump and couldn't find anything, they chalked it up to lack of skill in using heap analysis tools and continued with their assumption.
We grabbed a system dump towards the end of that graph and loaded it into the Memory Analyzer Tool (MAT). After some investigation, we suspected that a cache of SoftReferences was taking up a huge portion of the Java heap. We ran Java Basics > References > Soft reference statistics, and sure enough, the "Only Softly Retained" histogram showed 1.2GB of Java heap. I predicted that at some point the JVM would decide to clean these up, and there would be no issue. We agreed to let the JVM continue overnight without an automatic restart, and sure enough, there was a huge cleanup a few hours later:
This is expected behavior. A SoftReference is normally used to build a cache, and its referent is designed to be cleared under two conditions: 1) significant memory pressure, and 2) after a certain number of GC cycles. The Javadoc for java.lang.ref.SoftReference says:
All soft references to softly-reachable objects are guaranteed to have been cleared before the virtual machine throws an OutOfMemoryError. Otherwise no constraints are placed upon the time at which a soft reference will be cleared or the order in which a set of such references to different objects will be cleared. Virtual machine implementations are, however, encouraged to bias against clearing recently-created or recently-used soft references.
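To make the caching pattern concrete, here is a minimal sketch of a SoftReference-backed cache. It is not the customer's code: the class name SoftCache and its methods are made up for illustration. Values are held only through soft references, so the JVM may clear them whenever it decides to, and callers must treat a null result as a cache miss:

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: a cache whose values the garbage collector is allowed to reclaim. */
final class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();

    /** Stores the value behind a soft reference. */
    void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    /** Returns the cached value, or null if it was never cached or has been cleared. */
    V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            // The GC cleared the referent; drop the stale entry.
            map.remove(key, ref);
        }
        return value;
    }

    /** Number of entries, including any whose referents were cleared but not yet removed. */
    int size() {
        return map.size();
    }

    public static void main(String[] args) {
        SoftCache<String, byte[]> cache = new SoftCache<>();
        byte[] payload = new byte[1024];
        cache.put("report", payload);
        // Still strongly reachable through 'payload', so it cannot have been cleared.
        System.out.println(cache.get("report") == payload);
    }
}
```

In a heapdump, values held by a cache like this show up as softly retained. They count towards used heap until the JVM chooses to clear them, which is why heap usage can climb for hours without any leak.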
Even though the graph looked ominous, what happened (on the IBM JVM) was that the default softrefthreshold of 32 meant that SoftReference collection was "falling behind" its creation. To avoid the ominous graph, the customer could try to balance this using the IBM JVM's command line parameter (only available on or after Java 5): -Xsoftrefthreshold (htt
I think there are a few lessons here: