IBM Support

The 4 general reasons for OutOfMemoryError errors and how not to get fooled

Technical Blog Post


Abstract

The 4 general reasons for OutOfMemoryError errors and how not to get fooled

Body

 

The Java™ JRE will throw an OOM error for more than just a Java heap with no free space.  So the single heapdump you have from an outage might not be useful at all and could lead you down the wrong debugging path.  It is very important to get a complete set of documentation from an OutOfMemory (OOM) error.  A single heapdump is never enough to figure out the general type of OOM you’re dealing with.
(Note that this information is correct for a WebSphere Java process running with an IBM Java JRE, and while all of the concepts are applicable to the Oracle Java JRE, some of the required debugging output will be in different places and with different formats.)


Terminology:

OutOfMemoryError – a Java error, like an exception, but worse.  Normally this error indicates a shortage of Java heap, but can also be thrown for native memory and other operating system resource shortages.

Heapdump – a binary-format debugging file containing a representation of the Java heap.

Javacore – a text format debugging file that comes from IBM JREs.

Thread dump – can refer to a javacore file, but more precisely is the Oracle-format text dump of the Java threads in a process.

Core file – a typically large binary file produced by the operating system, containing an image of the process memory.

Crash – the running process unexpectedly terminates.


With just a heapdump file:
1.    I can’t tell if it was generated from an OOM error.
2.    I can’t tell if it was manually generated, or generated from some other signal.
3.    I can’t tell if it was generated because of a Java heap or native memory/resource shortage.
4.    I can’t tell the Java settings like gc policy and region sizes.
5.    Most of the time, I can’t tell the exact Java build.
6.    I can’t tell what the WebSphere process was or even if it was a WebSphere process.
7.    If the heapdump was generated from an OOM error, I can’t tell if it was from a single large allocation request failure.


Here are the four types of OOM errors:

The classic Java heap leak.  
This is where something is “leaking” in the Java heap.  Objects are created but never released, so they can’t be removed from the Java heap, and if the condition continues long enough, the Java heap will fill up.  Once you know this is the type of problem you have, review a heapdump and find the leak suspects.  
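Below is a minimal sketch of this pattern (the class and field names are hypothetical, and the 1 MB allocation size is arbitrary): entries keep being added to a collection that is never cleared, so the garbage collector can never reclaim them and the Java heap eventually fills up with an OutOfMemoryError.

    import java.util.ArrayList;
    import java.util.List;

    public class HeapLeakSketch {
        // The static reference keeps every entry reachable for the life of the JVM.
        private static final List<byte[]> CACHE = new ArrayList<>();

        public static void main(String[] args) {
            while (true) {
                // Each iteration adds 1 MB that is never removed, so the Java heap
                // slowly fills until an OutOfMemoryError is thrown.
                CACHE.add(new byte[1024 * 1024]);
            }
        }
    }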

OOM from a native issue.  
This type of OOM error has nothing directly to do with the Java heap.  Looking at a heapdump for a native issue is a rookie mistake.  Newer javacore files will report that the OOM was from a native issue.  Older JREs might just report an OOM error, and not say that it’s really some type of native problem.  On 32-bit JREs, you can easily get a native OOM error by increasing the Java heap size too much; 1526M (or sometimes lower) is the point at which native OOM errors will start to happen.  On UNIX® OSes, the ulimits can cause native OOM errors.  The javacore you have from the outage will report on the expanded and free sizes of the Java heap.  If the Java heap wasn’t fully expanded or had a large amount of free space, you may be able to rule out a Java heap problem for this outage and conclude that you really have a native problem.  At this point, you should then review the GC History and the verboseGC output to see if there was a very large allocation request.
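As one illustration of how native memory can be exhausted while the Java heap itself stays almost empty, here is a minimal sketch (hypothetical class name; the 10 MB buffer size is arbitrary) that keeps allocating direct ByteBuffers, whose backing storage lives in native memory outside the Java heap:

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    public class NativeMemorySketch {
        private static final List<ByteBuffer> HELD = new ArrayList<>();

        public static void main(String[] args) {
            while (true) {
                // Each direct buffer reserves 10 MB of native memory outside the Java heap;
                // the Java heap holds only the tiny ByteBuffer objects themselves, so a
                // heapdump would show plenty of free heap when the OOM error is thrown.
                HELD.add(ByteBuffer.allocateDirect(10 * 1024 * 1024));
            }
        }
    }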

OOM from a very large allocation request.  
The application will be running without a problem, then all of a sudden, some piece of code needs more Java heap than could ever be provided given the size of the Java heap.  This would be like the application trying to get a 2G String object from a 1024M Java heap.  This will never fit.  You can see these allocations from the verboseGC output, and sometimes from the GC History in a javacore file.
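A minimal sketch of such a request (hypothetical class name; assume the JVM was started with something like -Xmx1024m): the single allocation below asks for roughly 2 GB of contiguous char data, which can never fit no matter how much garbage collection is done.

    public class LargeAllocationSketch {
        public static void main(String[] args) {
            // About 2 GB of char data (1G chars at 2 bytes each) requested in one allocation;
            // on a 1024M Java heap this throws an OutOfMemoryError immediately.
            char[] huge = new char[1024 * 1024 * 1024];
            System.out.println(huge.length);
        }
    }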

OOM from a too small heap.  
Many times we see 256M and 512M Java heaps.  This is fine if your application can handle it, is a small application, and doesn’t need lots of Java heap, but most applications under a load will need more Java heap.  The standard way to debug these possibilities is to increase the Java heap size to an amount probably larger than you need, let the application run under the load, and then review the verboseGC output.  Then, once you’ve ruled out a classic leak problem, set the Java heap size according to the usage seen in the verboseGC output.
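As a sketch of the kind of generic JVM argument change involved in that experiment (the values below are examples only, not recommendations; with WebSphere these are normally set through the administrative console):

    -Xms512m -Xmx2048m -verbose:gc

Here -Xmx is raised to a value larger than you expect to need and -verbose:gc enables the verboseGC output that you then review once the application has run under load.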
 
General Notes
By default, an IBM Java JRE will automatically generate a heapdump and a javacore file from an OOM error.  If you’re not getting these two files on an OOM error, then someone has probably added an -Xdump option or an IBM* environment variable.  Fix this before going too far with debugging.  Get a manually generated javacore file and review it for environment entries and -Xdump options.  Another place to review here is the server.xml file; with Oracle JREs, use the server.xml file for this review.
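On an IBM JRE you can request that manual javacore with kill -3 <pid> (or through the wsadmin facilities), or programmatically, assuming the IBM-specific com.ibm.jvm.Dump API is available at your Java level. A minimal sketch (hypothetical class name):

    public class ManualJavacoreSketch {
        public static void main(String[] args) {
            // Asks the running IBM JRE to write a javacore file; the resulting file
            // lists the environment entries and -Xdump settings in effect.
            com.ibm.jvm.Dump.JavaDump();
        }
    }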

For any OOM problem on an IBM JRE, we need at least a heapdump, javacore, and the verboseGC output.  

The running process that causes an OOM error to be thrown will probably keep throwing OOM errors, and by default, every OOM error will generate a javacore and heapdump.  This means multiple javacore files and heapdumps will be created.  The most important heapdump and javacore files to see are from the first OOM error.  You can get fooled with heapdump and javacore files that aren’t from the first OOM error.  Heapdump and javacore files are named like this:  javacore.[date].[time].[PID].[index].txt and heapdump.[date].[time].[PID].[index].phd.  Get the files with the lowest index numbers.

By default with WebSphere, the verbose Garbage Collection output, which is needed all of the time, is written to the native_stderr.log file, but there’s an option, verbosegclog, that can redirect the verboseGC output to another file.  Normally this option makes debugging harder.  If this option is seen in a javacore file, then be sure to get the native_stderr.log file and the separate verboseGC output file.
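For reference, on an IBM JRE that redirection is done with the -Xverbosegclog generic JVM argument (the path below is just an example); if you see something like this in the javacore’s command-line section, collect that file in addition to the native_stderr.log file:

    -Xverbosegclog:/path/to/verbosegc.log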


Documentation needed at a minimum with WebSphere:

The native_stderr.log file – This is the most important file to review.  It will have all of the error messages that come from the JRE itself.  It should also have the verboseGC output.  Ideally this file will cover at least the entire life of the Java process and not be truncated.
The verboseGC output – If the verbosegclog option is in use, then this log file must be provided.
The first heapdump generated.  Review the native_stderr.log file for which file was generated first.
The first javacore file generated.

If you have this documentation, you can then figure out the type of OOM problem you have.
 
How to determine which type of OOM you have:    
For a classic leak, review the javacore file and see if it reports a heap shortage (and not a native problem).  Look to see whether the Java heap was fully expanded and how much free space it had.  Review the GC History for excessive GC warnings that point to the heap being filled up.  Review the verboseGC output just before the first OOM error was thrown.  Look for ‘excessive GC’ messages.  Look at the last GC cycle before the first OOM error was thrown and find the reason code for the OOM there.  If, at this point, it looks like a classic heap leak, then review the first heapdump file generated to see what filled up the Java heap.

For a native OOM problem, review the top of the javacore file for the reason for the OOM error.  Newer Java JREs are better about reporting that this is a native issue.  At this point, attempt to rule out a Java heap problem by looking at the Java heap size, how large the Java heap was at the time of the OOM, and the amount of free space in this heap.  If the Java heap was fully expanded with no free space, it’s probably a Java heap problem, but if there was space available in the Java heap and yet an OOM error was thrown, then it might be a native problem.
On Linux, the default ulimits can cause native OOM errors. See "Insufficient ulimit -u (NPROC) Value Contributes to Native OutOfMemory"

Use the verboseGC output and/or the GC History in the javacore file to rule out a large object allocation failure.
Look at the native_stderr.log file to get the correct order of errors and of heapdump and javacore creation.
Continue debugging using this TechNote: "Troubleshooting native memory issues"


Rule out large allocation failures by looking at the GC History in the javacore file and the verboseGC output.  
In the verboseGC output, look at the GC cycle just before the OOM error.  
If a large allocation failure caused the OOM error, use the allocation threshold option to report on the code causing this failure.  
The ‘Current Thread’ reported in the javacore file might have the code that caused the OOM error, but the allocation threshold option is the only way to really know for sure.  There’s no real need to look at a heapdump file for this type of problem unless you want to see whether other large objects were successfully allocated; the allocation threshold option will tell you which code requested this large piece of the heap.
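As best I recall, on an IBM JRE the allocation threshold is configured through an -Xdump stack agent along these lines (the 10m threshold is an arbitrary example; verify the exact syntax against the -Xdump documentation for your Java level):

    -Xdump:stack:events=allocation,filter=#10m

When an allocation at or above the threshold occurs, the JRE reports the stack of the code making the request, which is exactly what you need for this type of OOM.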


Too small Java heap used – This one can be hard to debug simply because of the customer’s universal reluctance to increase the Java heap size, but for this type of OOM problem, we want to see the Java heap usage with a larger Java heap.  Then we can track the usage of the heap from the verboseGC output and see if the heap usage levels out or keeps going up.  If it levels out, then the Java heap was too small, and a larger heap should be used.  If the Java heap usage keeps increasing, then this OOM problem isn’t from having a heap too small.  It’s a real classic heap leak and you’ll need to review a heapdump.  (And classic leaks are easier to spot when the Java heap is larger.)


Additional Info and Reminders:

Get the right javacore and heapdump:
The running process that causes an OOM error to be thrown will probably keep throwing OOM errors, and by default, every OOM error will generate a javacore and heapdump.  
This means multiple javacore files and heapdumps will be created.  
The most important heapdump and javacore files to see are from the first OOM error.  
You can get fooled with heapdump and javacore files that aren’t from the first OOM error.  
Heapdump and javacore files are named like this:  javacore.[date].[time].[PID].[index].txt and heapdump.[date].[time].[PID].[index].phd.  
Get the files with the lowest index numbers.
If the option, verbosegclog, is in use, get this separate log file also.

Minimum set of files needed for WebSphere OOM problem:
The native_stderr.log file – This is the most important file to review.  It will have all of the error messages that come from the JRE itself.  It should also have the verboseGC output.  Ideally this file will cover at least the entire life of the Java process and not be truncated.
The verboseGC output – If the verbosegclog option is in use, then this log file must be provided.
The first heapdump generated.  Review the native_stderr.log file for which file was generated first.
The first javacore file generated.
If you have this documentation, you can then figure out the type of OOM problem you have.

What’s different with an Oracle JRE:
1.    Verbosegc is in a different format and in a different file, the native_stdout.log file.
2.    Verbosegc output is enabled with different options when debugging OOM problems on WebSphere (see the sketch after this list).
3.    Heapdumps will probably be named *.hprof, and aren’t generated automatically on OOM errors like they are on IBM JREs.
4.    There are no javacore files, but you will get a thread dump on the same signals that would generate a javacore file.  These thread dumps will be written to the native_stdout.log file.
5.    Messages from the JRE itself are written to the native_stdout.log file and not the native_stderr.log file.
6.    HP-UX verbosegc output is in a different format.
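For reference, a sketch of the HotSpot options typically used to get the equivalent data from an Oracle JRE (option names as of Java 7/8; the path is just an example):

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp

Roughly, the first line produces the verbosegc output (written to the native_stdout.log file), and the second makes the JRE write an .hprof heapdump automatically on the first OOM error, which the Oracle JRE does not do by default.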

 

 

title image (modified) credit: (cc) Some rights reserved by Nemo

 

[{"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Product":{"code":"","label":""},"Component":"","Platform":[{"code":"","label":""}],"Version":"","Edition":"","Line of Business":{"code":"","label":""}}]
