Linux glibc >= 2.10 (RHEL 6) malloc may show excessive virtual memory usage
kgibm
A recent customer on Red Hat Enterprise Linux 6 (RHEL 6) was running WAS 8, 64-bit. We noticed that the virtual size of the process was over 14GB (ps -p $PID -o vsz,rss). The maximum heap (-Xmx) was 5.5GB, so we were concerned that there was a native memory leak. In IBM Java 626 (which ships with WAS 8), javacores have a NATIVEMEMINFO section that tracks most JVM native allocations, including the Java heap (-Xmx) itself, classes, classloaders, JIT, shared classes, and even some SDK native allocations like DirectByteBuffers. These are the most common things to leak (whether due to a bug in the JVM itself, leaking classloaders in an application, etc.); however, when we looked at the javacore, it was only showing about 6.25GB:
1MEMUSER JRE: 6,401,632,920 bytes / 9912 allocations
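To compare the JVM-tracked total against the process virtual size, the summary line above can be pulled out of a javacore programmatically. The following is a minimal sketch that only handles the single-line format shown above. The function names parse_memuser and untracked_bytes are hypothetical, and the sketch assumes that the vsz value reported by ps is in KiB.

```python
import re

# Matches the top-level NATIVEMEMINFO summary line in the form shown above, e.g.
#   1MEMUSER JRE: 6,401,632,920 bytes / 9912 allocations
LINE_RE = re.compile(
    r"^1MEMUSER\s+(?P<name>[^:]+):\s+(?P<bytes>[\d,]+) bytes"
    r" / (?P<allocs>[\d,]+) allocations"
)


def parse_memuser(line):
    """Return (name, bytes, allocations), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    name = match.group("name").strip()
    nbytes = int(match.group("bytes").replace(",", ""))
    allocs = int(match.group("allocs").replace(",", ""))
    return name, nbytes, allocs


def untracked_bytes(vsz_kib, tracked_bytes):
    """Virtual size (assumed to be KiB, as printed by ps) minus JVM-tracked bytes."""
    return vsz_kib * 1024 - tracked_bytes


if __name__ == "__main__":
    print(parse_memuser("1MEMUSER JRE: 6,401,632,920 bytes / 9912 allocations"))
```

A large gap between the two numbers does not prove a leak by itself; it only shows how much virtual address space the JVM's own accounting does not explain.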
The next most common cause of a native memory leak is a leak in a third-party library (such as a type 2 database driver). We ran the linux-asmemory.sh script included in MustGather: Native Memory Issues on Linux. The output of /proc/$PID/maps did not show any third-party libraries outside of Java, WAS, and the OS itself. We could also see in the maps output that there were about 125 private memory regions, each about 64MB in size. Here are 4 examples:
That lined up with what we knew -- "well known" data structures in the JVM were consuming about 6GB and then 125*64MB ~= 8GB. We were about to start using a complex procedure to investigate who was allocating these regions when the customer found a hint that this could be "benign." The RHEL 6 release notes say the following:
Red Hat Enterprise Linux 6 features version 2.11 of glibc, providing many features and enhancements, including... An enhanced dynamic memory allocation (malloc) behaviour enabling higher scalability across many sockets and cores. This is achieved by assigning threads their own memory pools and by avoiding locking in some situations. The amount of additional memory used for the memory pools (if any) can be controlled using the environment variables MALLOC_ARENA_TEST and MALLOC_ARENA_MAX. MALLOC_ARENA_TEST specifies that a test for the number of cores is performed once the number of memory pools reaches this value. MALLOC_ARENA_MAX sets the maximum number of memory pools used, regardless of the number of cores.
After some investigation, we found this in the glibc 2.10 "NEWS" release notes:
* The malloc implementation can be compiled to be less memory efficient but higher performing in multi-threaded programs. Implemented by Ulrich Drepper.
The developer, Ulrich Drepper, has a much deeper explanation on his blog:
Before, malloc tried to emulate a per-core memory pool. Every time when contention for all existing memory pools was detected a new pool is created. Threads stay with the last used pool if possible... This never worked 100% because a thread can be descheduled while executing a malloc call. When some other thread tries to use the memory pool used in the call it would detect contention. A second problem is that if multiple threads on multiple core/sockets happily use malloc without contention memory from the same pool is used by different cores/on different sockets. This can lead to false sharing and definitely additional cross traffic because of the meta information updates. There are more potential problems not worth going into here in detail.
Here is some additional information that I gathered by reading the code (malloc.c and arena.c):
These memory pools are called arenas and the implementation is in arena.c. The first important macro is HEAP_MAX_SIZE which is the maximum size of an arena and it is basically 1MB on 32-bit and 64MB on 64-bit:
HEAP_MAX_SIZE = (2 * DEFA
Next, the reused_arena function is called when the free list is empty to get the arena to use. After a certain number of arenas have already been created (2 on 32-bit and 8 on 64-bit, or the value explicitly set through the environment variable MALLOC_ARENA_TEST), the function will set the maximum number of arenas to ((NU
This perfectly explained our situation. The customer had 16 cores, so the limit was 16*8=128 arenas. Each arena has a maximum size of 64MB; 128*64MB=8GB, which matches the roughly 125 regions of about 64MB that we saw in the maps output.
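The arithmetic can be written out as a small sketch. It covers only the 64-bit case described in this post (8 arenas per core, 64MB maximum per arena). The function and constant names are hypothetical and are not taken from glibc.

```python
# Values for 64-bit systems as described in this post; the names are illustrative.
ARENAS_PER_CORE_64BIT = 8
ARENA_MAX_MB_64BIT = 64


def default_arena_limit(cores):
    """Upper bound on the number of arenas for a 64-bit process."""
    return cores * ARENAS_PER_CORE_64BIT


def max_arena_reservation_mb(cores):
    """Worst-case virtual memory (in MB) if every arena reaches its maximum size."""
    return default_arena_limit(cores) * ARENA_MAX_MB_64BIT


if __name__ == "__main__":
    cores = 16
    print(default_arena_limit(cores))              # 128
    print(max_arena_reservation_mb(cores))         # 8192
    print(max_arena_reservation_mb(cores) / 1024)  # 8.0
```

This is an upper bound on virtual address space, not on resident memory: an arena only becomes resident as its pages are actually touched.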
Some preliminary testing on a particular workload showed significant performance improvements (an average CPU utilization decrease of 10%) after essentially reverting this behavior with the environment variable MALLOC_ARENA_MAX=1. However, this is not a blanket recommendation to use this parameter as a default (see Update #2 below).
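For experiments, the variable can be applied to a single process rather than system-wide by setting it in that process's environment at launch. The following is a minimal sketch using Python's subprocess module. The helper name run_with_arena_max is hypothetical, and how the variable is set for a WAS server depends on how that server is launched, which is not covered here.

```python
import os
import subprocess
import sys


def run_with_arena_max(cmd, arena_max=1):
    """Run cmd with MALLOC_ARENA_MAX set only for that child process."""
    env = dict(os.environ)  # copy, so the parent's environment is not modified
    env["MALLOC_ARENA_MAX"] = str(arena_max)
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # Demonstration only: the child prints the value it sees in its environment.
    child = [sys.executable, "-c", "import os; print(os.environ['MALLOC_ARENA_MAX'])"]
    print(run_with_arena_max(child).stdout.strip())
```

Scoping the setting to one process makes it easier to compare behavior with and without it before deciding whether it helps a given workload.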
A relevant bug report from Apache Hadoop:
New versions of glibc present in RHEL6 include a new arena allocator design. In several clusters we've seen this new allocator cause huge amounts of virtual memory to be used, since when multiple threads perform allocations, they each get their own memory arena. On a 64-bit system, these arenas are 64M mappings, and the maximum number of arenas is 8 times the number of cores. We've observed a DN process using 14GB of vmem for only 300M of resident set. This causes all kinds of nasty issues for obvious reasons.
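One rough way to check whether a process shows the same pattern (many private regions of about 64MB) is to scan its /proc/$PID/maps output. The sketch below is a heuristic only: it counts anonymous private mappings whose size is within a tolerance of 64MB. The function names and the 10% tolerance are assumptions, and a single arena may appear as more than one mapping, so the count is approximate.

```python
MB = 1024 * 1024


def parse_maps_line(line):
    """Return (size_in_bytes, perms, pathname) for one /proc/PID/maps line."""
    fields = line.split(None, 5)
    if len(fields) < 5:
        return None
    start_hex, end_hex = fields[0].split("-")
    size = int(end_hex, 16) - int(start_hex, 16)
    pathname = fields[5].strip() if len(fields) > 5 else ""
    return size, fields[1], pathname


def count_arena_sized_regions(maps_text, target=64 * MB, tolerance=0.1):
    """Count anonymous private mappings whose size is close to target bytes."""
    count = 0
    for line in maps_text.splitlines():
        parsed = parse_maps_line(line)
        if parsed is None:
            continue
        size, perms, pathname = parsed
        is_private = len(perms) == 4 and perms[3] == "p"
        is_anonymous = pathname == ""
        if is_private and is_anonymous and abs(size - target) <= tolerance * target:
            count += 1
    return count
```

On Linux, the input could come from something like open("/proc/%d/maps" % pid).read(). A count in the neighborhood of 8 times the number of cores would be consistent with the arena behavior described here, but it is not proof of it.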
Update (March 12, 2012): Some people have asked me if they should apply MALLOC_ARENA_MAX=1 to all processes. First, you should start small and see if applying to Java/WAS processes helps or not (i.e. your mileage may vary). In theory, the libc change may improve performance of native processes that have highly concurrent malloc invocations, which could be many Linux processes. Java is somewhat unique in that it tends to make a few, large malloc calls (both for the Java heap and "segments" for JIT, classes, etc.) and so may be uniquely negatively affected by this change.
Update #2 (November 29, 2012): I wrote this post simply to document my findings at a particular customer situation, but it has become my most popular post, so I thought I'd add some more details. First, I've slightly modified the original blog post to de-emphasize the 10% CPU decrease that we observed. We never understood why the decrease occurred and we only tried changing max arenas on a hunch and it appeared to be beneficial. Unfortunately, the customer did not have time to work with us to fully understand what caused the difference. Afterwards, we tried to reproduce the problem in an IBM test environment, and we were not able to. There are clearly some variables that are not fully understood, and I certainly would not recommend MALLOC_ARENA_MAX=1 as a default tuning parameter until it can be better understood under what conditions it helps and if there are any conditions where it hurts (because clearly the original intent of arenas is to improve multi-core performance and there is supposedly evidence that it does do this on some programs). If you are a customer that can reproduce this issue, please contact me or ask your local IBM representative to contact me (as your email may be caught in IBM's spam filters). We also have internal plans to try to reproduce this error although the first time was unsuccessful. If I learn more, I will post a new blog post.