At a recent customer engagement, we improved throughput by 50% simply by restarting the process with the AIX environment variable MALLOCOPTIONS=multiheap. This only helps when there is heavy, concurrent malloc usage, which is often not the case for WAS/Java workloads.
The multiheap option does have costs, particularly increased virtual and physical memory usage. The primary reason is that each heap's free tree is independent, so fragmentation is more likely. There is also some additional metadata overhead.
malloc is often a bottleneck for application performance, especially under AIX... By default, the [AIX] malloc subsystem uses a single heap, which causes lock contention for the internal locks that malloc uses in multi-threaded applications. By enabling [the multiheap] option, you can configure the number of parallel heaps to be used by allocators. You can set multiheap by exporting MALLOCOPTIONS=multiheap[:n], where n can vary between 1 and 32, with 32 as the default if n is not specified. Use this option for multi-threaded applications, as it can improve performance.
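As a minimal sketch of the above, the option can be exported in the environment before the process is started. The heap count of 16 and the start script name below are illustrative assumptions, not values from the article:

```shell
# Hedged example: enable multiheap before starting the JVM.
# The heap count of 16 and the startServer.sh launcher are
# illustrative; tune the count to the application's thread count.
export MALLOCOPTIONS=multiheap:16
echo "MALLOCOPTIONS=$MALLOCOPTIONS"
# ./startServer.sh server1    # hypothetical WebSphere start script
```

Because MALLOCOPTIONS is read at process start, the setting only takes effect after the process is restarted.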
How do you know if this is affecting you? It's not easy:
A concentration of execution time in the pthreads library... or in kernel locking... routines... is associated with a locking issue. This locking might ultimately arise at the system level (as seen with malloc locking issues on AIX), or at the application level in Java code (associated with synchronized blocks or methods in Java code). The source of locking issues is not always immediately apparent from a profile. For example, with AIX malloc locking issues, the time that is spent in the malloc and free routines might be quite low, with almost all of the impact appearing in kernel locking routines.
Nevertheless, here is an example tprof report that shows this problem, gathered with: tprof -ujeskzl -A -I -X -E -r report -x sleep 60
Process                                FREQ  Total  Kernel  User  Shared  Other  Java
=======                                ====  =====  ======  ====  ======  =====  ====
/usr/java5/jre/bin/java                 174  22557   11850     0    7473     86  3148

Shared Object                          Ticks     %          Address   Bytes
=============                          =====  ======        =======   =====
/usr/lib/libc.a[shr_64.o]               3037    9.93  900000000000d00  331774
/usr/lib/libpthread.a[shr_xpg5_64.o]    1894    6.19  9000000007fe200   319a8

Total Ticks For All Processes (KERNEL) = 15045

Subroutine     Ticks     %   Source  Address  Bytes
==========     =====  =====  ======  =======  =====
._check_lock    2103   6.88  low.s      3420     40

Total Ticks For All Processes (/usr/lib/libc.a[shr_64.o]) = 3037

Subroutine  Ticks     %   Source                                                     Address  Bytes
==========  =====  =====  ======                                                     =======  =====
.malloc_y     856   2.80  ../../../../../../../src/bos/usr/ccs/lib/libc/malloc_y.c    41420    840
.free_y       669   2.19  ../../../../../../../src/bos/usr/ccs/lib/libc/malloc_y.c    3f980    9a0

Total Ticks For All Processes (/usr/lib/libpthread.a[shr_xpg5_64.o]) = 1894

Subroutine              Ticks     %   Source                                                                   Address  Bytes
==========              =====  =====  ======                                                                   =======  =====
.global_unlock_ppc_mp     634   2.07  pth_locks_ppc_mp.s                                                         2d714     6c
._global_lock_common      552   1.81  ../../../../../../../../src/bos/usr/ccs/lib/libpthreads/pth_spinlock.c     2180    5e0
.global_lock_ppc_mp_eh    321   1.05  pth_locks_ppc_mp_eh.s                                                      2d694     6c
The key things to notice are:
- In the first "Process" section, the "Kernel" time is high (about half of "Total"). This will also show up in topas/vmstat/ps as high "system" CPU time.
- In the "Shared Object" list, libc and libpthread are high.
- In the "KERNEL" section, ._check_lock is high.
- In the "libc.a" section, .malloc_y and .free_y are high.
- In the "libpthread.a" section, .global_unlock_ppc_mp and other similarly named functions are high.
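As a rough aid, the report can be scanned for these symptoms. This is a sketch: the report.prof file name is an assumption about what tprof -r report produces on your system, and the symbol list simply mirrors the functions called out above:

```shell
# Hypothetical helper: scan a tprof report for symbols that suggest
# malloc lock contention. Adjust REPORT to the actual report file name.
REPORT=${1:-report.prof}
if [ -f "$REPORT" ]; then
  # Symbols from the analysis above: kernel lock checks, malloc/free
  # internals, and libpthread spinlock routines.
  grep -E '_check_lock|malloc_y|free_y|global_(un)?lock_ppc' "$REPORT"
else
  echo "no tprof report found at $REPORT"
fi
```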
AIX also offers other allocators and allocator options that may be useful:
Pool malloc: The pool front end to the malloc subsystem optimizes the allocation of memory blocks of 512 bytes or less. It is common for applications to allocate many small blocks, and pools are particularly space- and time-efficient for that allocation pattern. Thread-specific pools are used for multi-threaded applications. The pool malloc is a good choice for both single-threaded and multi-threaded applications.
Using the pool front end and multiheap malloc in combination is a good alternative for multi-threaded applications. Small memory block allocations, typically the most common, are handled with high efficiency by the pool front end. Larger allocations are handled with good scalability by the multiheap malloc. A simple example of specifying the pool and multiheap combination is the environment variable setting MALLOCOPTIONS=pool,multiheap.
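As a sketch, the combined setting can be exported before launching the process (the echo is only there to confirm the value):

```shell
# Combine the pool front end (fast small allocations, <= 512 bytes)
# with multiheap (scalable handling of larger allocations).
export MALLOCOPTIONS=pool,multiheap
echo "MALLOCOPTIONS=$MALLOCOPTIONS"
```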
Buckets: This suboption is similar to the built-in bucket allocator of the Watson allocator. However, with this option, you have fine-grained control over the number of buckets, the number of blocks per bucket, and the size of each bucket. This option also provides a way to view the usage statistics of each bucket, which can be used to refine the bucket settings. If the application makes many requests of the same size, the bucket allocator can be configured to preallocate blocks of the required size by specifying the bucket options accordingly. Unlike the Watson allocator's built-in buckets or the malloc pool option, the block size can exceed 512 bytes.
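A sketch of a buckets configuration follows. The MALLOCBUCKETS variable and its suboptions are the AIX mechanism for tuning bucket geometry, but the specific numbers below are illustrative assumptions to be refined against the bucket_statistics output:

```shell
# Enable the buckets suboption and sketch a bucket geometry.
export MALLOCOPTIONS=buckets
# number_of_buckets: how many size classes exist; bucket_sizing_factor:
# how bucket sizes grow; blocks_per_bucket: preallocation per bucket;
# bucket_statistics: where per-bucket usage statistics are written.
export MALLOCBUCKETS=number_of_buckets:16,bucket_sizing_factor:64,blocks_per_bucket:1024,bucket_statistics:stdout
echo "$MALLOCOPTIONS $MALLOCBUCKETS"
```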
1. For a 32-bit single-threaded application, use the default allocator.
2. For a 64-bit application, use the Watson allocator.
3. For multi-threaded applications, use the multiheap option. Set the number of heaps in proportion to the number of threads in the application.
4. For single-threaded or multi-threaded applications that frequently allocate and deallocate memory blocks of 512 bytes or less, use the malloc pool option.
5. If the application's memory usage pattern shows heavy use of blocks of the same size (or of sizes that round up to a common bucket size), including sizes greater than 512 bytes, configure and use the malloc buckets option.
6. For older applications that require high performance and do not have memory fragmentation issues, use malloc 3.1.
7. Ideally, the Watson allocator, along with the multiheap and malloc pool options, is good for most multi-threaded applications; the pool front end is fast and is scalable for small allocations, while multiheap ensures scalability for larger and less frequent allocations.
8. If you notice high memory usage in the application process even after memory is freed, the disclaim option can help return freed pages to the system.
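Putting recommendation 7 together, a hedged example environment for a multi-threaded application might look like the following; the heap count of 8 is an illustrative assumption:

```shell
# Watson allocator plus multiheap and the pool front end.
export MALLOCTYPE=watson
export MALLOCOPTIONS=multiheap:8,pool
echo "MALLOCTYPE=$MALLOCTYPE MALLOCOPTIONS=$MALLOCOPTIONS"
```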