A recent customer on Red Hat Enterprise Linux 6 (RHEL6) was running WAS 8, 64-bit. We noticed that the virtual size of the process was over 14GB (ps -p $PID -o vsz,rss). The maximum heap (-Xmx) was 5.5GB, so we were concerned there was a native memory leak. In IBM Java 626 (which ships with WAS 8), javacores have a NATIVEMEMINFO section which tracks most JVM native allocations, including the Java heap (-Xmx) itself, classes, classloaders, JIT, shared classes, and even some SDK native allocations like DirectByteBuffers. These are the most common things to leak (whether due to a bug in the JVM itself or leaking classloaders in an application, etc.); however, when we looked at the javacore, it was only showing about 6.25GB:
1MEMUSER JRE: 6,401,632,920 bytes / 9912 allocations
The next most common cause of a native memory leak is a third-party library leak (such as a type 2 database driver). We ran the linux-asmemory.sh script included in MustGather: Native Memory Issues on Linux. The output of /proc/$PID/maps did not really show any third party libraries outside of Java, WAS, and the OS itself. We could also see in the maps output that there were about 125 private memory regions, each about 64MB in size. Here are 4 examples:
7f379403e000-7f3798000000 ---p 00000000 00:00 0
7f3798000000-7f379bff7000 rw-p 00000000 00:00 0
7f379c021000-7f37a0000000 ---p 00000000 00:00 0
7f37a0028000-7f37a4000000 ---p 00000000 00:00 0
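A count like this can be scripted rather than eyeballed. The following is a minimal Python sketch and is not part of the MustGather script. The function names and the 56MB lower bound are assumptions chosen for illustration. It treats any mapping with no backing file whose size falls just under 64MB as a candidate, which is only a heuristic.

```python
MB = 1024 * 1024

def region_size(line):
    """Return (size_in_bytes, is_anonymous) for one /proc/<pid>/maps line."""
    fields = line.split()
    start, end = (int(x, 16) for x in fields[0].split("-"))
    # A sixth field, when present, is the backing path (or a tag like [heap]).
    return end - start, len(fields) < 6

def count_arena_sized_regions(maps_text, low=56 * MB, high=64 * MB):
    """Count anonymous mappings whose size lies in [low, high] bytes."""
    count = 0
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        size, anonymous = region_size(line)
        if anonymous and low <= size <= high:
            count += 1
    return count

# The four example lines from the maps output above.
sample = """\
7f379403e000-7f3798000000 ---p 00000000 00:00 0
7f3798000000-7f379bff7000 rw-p 00000000 00:00 0
7f379c021000-7f37a0000000 ---p 00000000 00:00 0
7f37a0028000-7f37a4000000 ---p 00000000 00:00 0
"""

print(count_arena_sized_regions(sample))  # prints 4
```

Against a live process, the same function could be given the contents of /proc/$PID/maps instead of the sample string.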
That lined up with what we knew -- "well known" data structures in the JVM were consuming about 6GB and then 125*64MB ~= 8GB. We were about to start using a complex procedure to investigate who was allocating these regions when the customer found a hint that this could be "benign." The RHEL 6 release notes say the following:
Red Hat Enterprise Linux 6 features version 2.11 of glibc, providing many features and enhancements, including... An enhanced dynamic memory allocation (malloc) behaviour enabling higher scalability across many sockets and cores. This is achieved by assigning threads their own memory pools and by avoiding locking in some situations. The amount of additional memory used for the memory pools (if any) can be controlled using the environment variables MALLOC_ARENA_TEST and MALLOC_ARENA_MAX. MALLOC_ARENA_TEST specifies that a test for the number of cores is performed once the number of memory pools reaches this value. MALLOC_ARENA_MAX sets the maximum number of memory pools used, regardless of the number of cores.
After some investigation, the glibc 2.10 "NEWS" release notes have this:
* The malloc implementation can be compiled to be less memory efficient but higher performing in multi-threaded programs. Implemented by Ulrich Drepper.
[glibc-2.10.1]$ grep CPPFLAGS malloc/Makefile
CPPFLAGS-malloc.c += -DPER_THREAD -DATOMIC_FASTBINS
The developer, Ulrich Drepper, has a much deeper explanation on his blog:
Before, malloc tried to emulate a per-core memory pool. Every time when contention for all existing memory pools was detected a new pool is created. Threads stay with the last used pool if possible... This never worked 100% because a thread can be descheduled while executing a malloc call. When some other thread tries to use the memory pool used in the call it would detect contention. A second problem is that if multiple threads on multiple core/sockets happily use malloc without contention memory from the same pool is used by different cores/on different sockets. This can lead to false sharing and definitely additional cross traffic because of the meta information updates. There are more potential problems not worth going into here in detail.
The changes which are in glibc now create per-thread memory pools. This can eliminate false sharing in most cases. The meta data is usually accessed only in one thread (which hopefully doesn’t get migrated off its assigned core). To prevent the memory handling from blowing up the address space use too much the number of memory pools is capped. By default we create up to two memory pools per core on 32-bit machines and up to eight memory per core on 64-bit machines. The code delays testing for the number of cores (which is not cheap, we have to read /proc/stat) until there are already two or eight memory pools allocated, respectively.
While these changes might increase the number of memory pools which are created (and thus increase the address space they use) the number can be controlled. Because using the old mechanism there could be a new pool being created whenever there are collisions the total number could in theory be higher. Unlikely but true, so the new mechanism is more predictable.
... Memory use is not that much of a premium anymore and most of the memory pool doesn’t actually require memory until it is used, only address space... We have done internally some measurements of the effects of the new implementation and they can be quite dramatic.
Here is some additional information I interpreted by looking in the code (malloc.c and arena.c):
These memory pools are called arenas and the implementation is in arena.c. The first important macro is HEAP_MAX_SIZE which is the maximum size of an arena and it is basically 1MB on 32-bit and 64MB on 64-bit:
HEAP_MAX_SIZE = (2 * DEFAULT_MMAP_THRESHOLD_MAX)
32-bit [DEFAULT_MMAP_THRESHOLD_MAX = (512 * 1024)] = 1,048,576 (1MB)
64-bit [DEFAULT_MMAP_THRESHOLD_MAX = (4 * 1024 * 1024 * sizeof(long))] = 67,108,864 (64MB)
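To double-check that arithmetic, here is a small Python sketch that mirrors the macros as I read them. The function name heap_max_size and its sizeof_long parameter are my own and do not exist in glibc; sizeof(long) is assumed to be 4 on 32-bit and 8 on 64-bit Linux.

```python
def heap_max_size(sizeof_long):
    """Mirror HEAP_MAX_SIZE = 2 * DEFAULT_MMAP_THRESHOLD_MAX for a given sizeof(long)."""
    if sizeof_long == 4:
        # 32-bit: DEFAULT_MMAP_THRESHOLD_MAX = 512 * 1024
        default_mmap_threshold_max = 512 * 1024
    else:
        # 64-bit: DEFAULT_MMAP_THRESHOLD_MAX = 4 * 1024 * 1024 * sizeof(long)
        default_mmap_threshold_max = 4 * 1024 * 1024 * sizeof_long
    return 2 * default_mmap_threshold_max

print(heap_max_size(4))  # prints 1048576 (1MB)
print(heap_max_size(8))  # prints 67108864 (64MB)
```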
Next, the reused_arena function is called when the free list is empty to get the arena to use. After a certain number of arenas have already been created (2 on 32-bit and 8 on 64-bit, or the value explicitly set through the environment variable MALLOC_ARENA_TEST), the function will set the maximum number of arenas to ((NUMBER_OF_CPU_CORES) * (sizeof(long) == 4 ? 2 : 8)).
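This perfectly explained our situation. The customer had 16 cores, so the arena cap was 16*8=128. Each arena has a maximum size of 64MB, and 128*64MB=8GB, which is in line with the roughly 125 regions of about 64MB each that we saw in the maps output.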
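The same calculation can be written down for any core count. This is only a sketch of the default formula described above; the function names are mine, and it assumes neither MALLOC_ARENA_TEST nor MALLOC_ARENA_MAX is set and that each arena reserves at most 64MB of address space.

```python
def default_arena_cap(cores, sizeof_long=8):
    """Default arena limit: 2 per core on 32-bit, 8 per core on 64-bit."""
    return cores * (2 if sizeof_long == 4 else 8)

def max_arena_address_space(cores, arena_bytes=64 * 1024 * 1024, sizeof_long=8):
    """Upper bound on address space reserved by arenas, in bytes."""
    return default_arena_cap(cores, sizeof_long) * arena_bytes

cap = default_arena_cap(16)
total = max_arena_address_space(16)
print(cap, total / 1024**3)  # prints 128 8.0
```

Note that this is address space, not resident memory; as the quoted blog says, most of a pool does not require memory until it is used.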
Some preliminary testing on a particular workload showed significant performance improvements (an average CPU utilization decrease of 10%) by essentially reverting this behavior with the environment variable MALLOC_ARENA_MAX=1. However, this is not a blanket recommendation to use this parameter as a default (see Update #2 below).
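For anyone who wants to experiment on a single process rather than system-wide, the variable only needs to be present in that process's environment at startup. Below is a minimal Python sketch of that idea. The helper name env_with_arena_max is hypothetical, and the child command is a stand-in that just echoes the variable; in practice it would be whatever command starts the process under test.

```python
import os
import subprocess
import sys

def env_with_arena_max(arena_max, base_env=None):
    """Return a copy of the environment with MALLOC_ARENA_MAX set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MALLOC_ARENA_MAX"] = str(arena_max)
    return env

if __name__ == "__main__":
    # Stand-in child that prints the value it inherited.
    child = [sys.executable, "-c",
             "import os; print(os.environ['MALLOC_ARENA_MAX'])"]
    subprocess.run(child, env=env_with_arena_max(1), check=True)
```

Because the helper copies the environment, the parent process and any other children keep the default allocator behavior.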
A relevant bug report from Apache Hadoop:
New versions of glibc present in RHEL6 include a new arena allocator design. In several clusters we've seen this new allocator cause huge amounts of virtual memory to be used, since when multiple threads perform allocations, they each get their own memory arena. On a 64-bit system, these arenas are 64M mappings, and the maximum number of arenas is 8 times the number of cores. We've observed a DN process using 14GB of vmem for only 300M of resident set. This causes all kinds of nasty issues for obvious reasons.
Setting MALLOC_ARENA_MAX to a low number will restrict the number of memory arenas and bound the virtual memory, with no noticeable downside in performance - we've been recommending MALLOC_ARENA_MAX=4. We should set this in hadoop-env.sh to avoid this issue as RHEL6 becomes more and more common.
Update (March 12, 2012): Some people have asked me if they should apply MALLOC_ARENA_MAX=1 to all processes. First, you should start small and see if applying to Java/WAS processes helps or not (i.e. your mileage may vary). In theory, the libc change may improve performance of native processes that have highly concurrent malloc invocations, which could be many Linux processes. Java is somewhat unique in that it tends to make a few, large malloc calls (both for the Java heap and "segments" for JIT, classes, etc.) and so may be uniquely negatively affected by this change.
Update #2 (November 29, 2012): I wrote this post simply to document my findings at a particular customer situation, but it has become my most popular post, so I thought I'd add some more details. First, I've slightly modified the original blog post to de-emphasize the 10% CPU decrease that we observed. We never understood why the decrease occurred and we only tried changing max arenas on a hunch and it appeared to be beneficial. Unfortunately, the customer did not have time to work with us to fully understand what caused the difference. Afterwards, we tried to reproduce the problem in an IBM test environment, and we were not able to. There are clearly some variables that are not fully understood, and I certainly would not recommend MALLOC_ARENA_MAX=1 as a default tuning parameter until it can be better understood under what conditions it helps and if there are any conditions where it hurts (because clearly the original intent of arenas is to improve multi-core performance and there is supposedly evidence that it does do this on some programs). If you are a customer that can reproduce this issue, please contact me or ask your local IBM representative to contact me (as your email may be caught in IBM's spam filters). We also have internal plans to try to reproduce this error although the first time was unsuccessful. If I learn more, I will post a new blog post.