IBM Support

Memory Leak Diagnosis on AIX

How To


Summary

Provide instructions for determining the location of a memory leak on AIX.

Objective

This document is intended to give the user instructions to pinpoint where a memory leak may be occurring in a process that is experiencing continuous unexpected memory growth.

Steps


A memory leak is a condition where a process allocates memory but does not free it when it is done with it. This becomes a problem for a long-running process, such as a daemon, because the excess memory usage builds up over time. It can result in core dumps or unexpected behavior of the process, or it can cause a performance issue if the excessive memory usage forces the system to page memory out to paging space.

Older documentation may suggest setting MALLOCTYPE=debug during the debug process; however, it is not needed and should not be used when performing memory leak diagnosis. MALLOCTYPE=debug is deprecated and is equivalent to the setting MALLOCDEBUG=catch_overflow, which is only needed when diagnosing buffer overwrites or overreads and does not apply to memory leak detection. Using it unnecessarily increases memory usage because of the overhead needed to check for overflows.


The first step in diagnosing a memory leak is to recognize that you are experiencing a leak. Monitor process memory usage with 'svmon -P -O summary=basic,unit=MB' to display memory usage of all processes, and make note of any processes that might be using an unexpected amount of memory. If you already suspect a particular process has a leak, monitor it specifically with 'svmon -P <PID> -O summary=basic,unit=MB'.
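
If the growth is gradual, it helps to capture svmon output at regular intervals so the trend is visible over time. For example, a minimal monitoring loop (ksh/sh syntax; the PID, interval, and output file shown here are placeholders to adjust for your environment):

while true
do
        date
        svmon -P <PID> -O summary=basic,unit=MB
        sleep 600
done >> /tmp/svmon.out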

Another method to use, if a core dump has occurred, is to look at the malloc statistics in dbx. In this example, I have a program called 'leaktest' which is designed to leak memory. The program simply allocates 1MB repeatedly without freeing it:

#include <stdlib.h>
#include <string.h>

void func1();
void func2();

void main() {
        while(1) func1();
}

void func1() {
        func2();
}

void func2() {
        char *str;
        /* allocate 1MB and never free it - this is the leak */
        str=(char *) malloc(1024*1024);
        strcpy(str,"testing");
}
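
For reference, the test program should be compiled with debugging symbols so that dbx has full symbolic information for the stack tracebacks shown later. A sketch of the compile step, assuming the source is saved as leaktest.c and the IBM XL C compiler (cc) is installed:

cc -g -o leaktest leaktest.c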

Running that, it cores rather quickly, and then I load the core in dbx:

# dbx ./leaktest ./core
Type 'help' for help.
[using memory image in ./core]
reading symbolic information ...

Segmentation fault in . at 0xfc04
0x0000fc04 warning: Unable to access address 0xfc04 from core

Inside dbx, run the 'malloc' subcommand to check memory usage statistics:

(dbx) malloc
The following options are enabled:

        Implementation Algorithm........ Default Allocator (Yorktown)

Statistical Report on the Malloc Subsystem:
        Heap 0
                heap lock held by................ UNLOCKED
                bytes acquired from sbrk().......    267456272
                bytes in the freespace tree......        65312
                bytes held by the user...........    267390960
                allocations currently active.....          255
                allocations since process start..          255


The Process Heap
                Initial process brk value........ 0x20000680
                current process brk value........ 0x2ff11590
                sbrk()s called by malloc.........        241


That tells me that the process currently has ~256MB of memory allocated on the heap - looking at 'bytes held' - and there is not much left on the free space tree that can be used for new allocations. Also, since the start of this process, 255 memory allocations have been made, and they are all currently active - none of them have been freed. To determine if this amount of memory usage is excessive, you'll have to be aware of what the expected memory usage is for the process that you are examining; some processes may need to allocate and hold a large amount of memory. You'll also need to be aware of process limits and the process MAXDATA value in order to determine if the core dump was likely due to running out of memory on the heap.

The process limits can be checked inside of dbx with 'proc rlimit':

(dbx) proc rlimit
rlimit name:          rlimit_cur               rlimit_max       (units)
 RLIMIT_CPU:         (unlimited)             (unlimited)        sec
 RLIMIT_FSIZE:       (unlimited)             (unlimited)        bytes
 RLIMIT_DATA:        (unlimited)             (unlimited)        bytes
 RLIMIT_STACK:        4294967296              4294967296        bytes
...etc...

In this case, the data limit is set to unlimited, so it does not constrain the usable heap memory.

The MAXDATA value can be checked outside of dbx, with the 'dump' command:

# dump -Xany -ov ./leaktest

./leaktest:

                        ***Object Module Header***
# Sections      Symbol Ptr      # Symbols       Opt Hdr Len     Flags
         5      0x00000b2c            149                72     0x1002
Flags=( EXEC DYNLOAD DEP_SYSTEM )
Timestamp = "Aug 21 12:03:48 2018"
Magic = 0x1df  (32-bit XCOFF)

                        ***Optional Header***
Tsize       Dsize       Bsize       Tstart      Dstart
0x0000041c  0x00000110  0x00000004  0x10000150  0x2000056c

SNloader    SNentry     SNtext      SNtoc       SNdata
0x0004      0x0002      0x0001      0x0002      0x0002

TXTalign    DATAalign   TOC         vstamp      entry
0x0007      0x0003      0x20000634  0x0001      0x20000620

maxSTACK    maxDATA     SNbss       magic       modtype
0x00000000  0x00000000  0x0003      0x010b        1L

In this case, we see that 'leaktest' is a 32-bit process with a MAXDATA value of 0x00000000.
MAXDATA of 0x00000000 is the default setting, which means for a 32-bit process, the heap and stack share a single 256MB memory segment (the 0x2 segment). The heap starts at the lower end of the 0x2 segment, at around 0x20000000, and works its way up. The stack starts at the upper end of the segment, around 0x2fffffff, and works its way down. They are in danger of colliding if either grows too large.
MAXDATA of 0x10000000 means the heap and stack are separated into different segments (stack in 0x2 and heap in 0x3), and that the heap is allowed a full 256MB segment to itself.
MAXDATA of 0x20000000 means that the heap and stack are separated, and that the heap gets two 256MB segments to itself (0x3 and 0x4) - and so on.
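
If you build the 32-bit program yourself, the MAXDATA value is normally set at link time with the -bmaxdata linker option. A sketch, assuming the source file is named leaktest.c (the exact driver invocation may differ for your compiler):

cc -g -bmaxdata:0x20000000 -o leaktest leaktest.c

This would give the resulting executable two 256MB data segments for its heap, as described above.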

For this example, I can see that with the MAXDATA value for 'leaktest' at the default, that the process has used up the entire 256MB available in segment 0x2 while allocating memory, and either 1) the heap has crashed into the stack and cored because of that, or 2) it failed to allocate more memory once it reached the limit, and cored when trying to work on the resulting null memory pointer. As long as this process is not expected to use this much memory, I can be confident that a leak has occurred.

64-bit processes do not face the same constraints and do not need to set MAXDATA, since the address space is much larger. The process data limit would be the limiting factor in memory usage.


If I want to view the memory allocation tables for this process, which shows me all of the active (not yet freed) allocations that the process has made, I can do so in dbx with the 'malloc allocation' subcommand. For example:

(dbx) malloc allocation
Allocations Held by the Process:

   ADDRESS         SIZE HEAP    ALLOCATOR
0x20000688      1048584    0     YORKTOWN
0x20100698      1048584    0     YORKTOWN
0x202006a8      1048584    0     YORKTOWN
0x203006b8      1048584    0     YORKTOWN
0x204006c8      1048584    0     YORKTOWN
0x205006d8      1048584    0     YORKTOWN
0x206006e8      1048584    0     YORKTOWN
...etc...

That tells me that a memory allocation exists at address 0x20000688, it is 1MB in size, it is on Heap 0, and was allocated using the default Yorktown memory allocator. Another allocation exists at 0x20100698, and so on.
This information is not particularly useful in our diagnosis - all it does is let me know that there were many 1MB allocations active at the time of the core dump. It does not tell me what made these allocations.

The more useful information can be found by turning on MALLOCDEBUG memory allocation logging.

The settings I typically use are:

export MALLOCDEBUG=log:extended,stack_depth:20
<start process>
unset MALLOCDEBUG

The 'stack_depth' can be adjusted as needed - a larger value requires more memory overhead, but can record more stack frames, which helps if the process has a deep call stack.
The 'extended' option is not necessarily needed, but provides extra information such as the time the allocation was made and the ID of the thread that made the allocation, which could be useful in a multithreaded application.
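
If the process you are debugging is a daemon controlled by the System Resource Controller rather than one started from a shell, the environment can usually be supplied on the startsrc command line instead of with export. A sketch, using a placeholder subsystem name:

stopsrc -s <subsystem>
startsrc -s <subsystem> -e "MALLOCDEBUG=log:extended,stack_depth:20"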

With my MALLOCDEBUG settings shown above, the 'malloc allocation' output gives much more useful information. Now I see:

(dbx) malloc allocation
Allocations Held by the Process:

   ADDRESS         SIZE HEAP      PID  PTHREAD_T  CLOCKTIME      SEQ STACK TRACEBACK
0x2001f688      1048576    0  6619390 0x00000000 1534872877        0 0xd01b1854 malloc_common_debugging
                                                                     0xd01111ec init_malloc
                                                                     0xd0112f4c malloc
                                                                     0x1000039c func2
                                                                     0x10000410 func1
                                                                     0x10000450 main
                                                                     0x100001bc __start
0x2011f698      1048576    0  6619390 0x00000000 1534872877        1 0xd01b1854 malloc_common_debugging
                                                                     0x1000039c func2
                                                                     0x10000410 func1
                                                                     0x10000450 main
                                                                     0x100001bc __start
0x2021f6a8      1048576    0  6619390 0x00000000 1534872877        2 0xd01b1854 malloc_common_debugging
                                                                     0x1000039c func2
                                                                     0x10000410 func1
                                                                     0x10000450 main
                                                                     0x100001bc __start
0x2031f6b8      1048576    0  6619390 0x00000000 1534872877        3 0xd01b1854 malloc_common_debugging
                                                                     0x1000039c func2
                                                                     0x10000410 func1
                                                                     0x10000450 main
                                                                     0x100001bc __start

...etc...

This tells me that each of these 1MB allocations was made when the main() function called func1(), which then called func2(), which then performed the memory allocation. Knowing this, we can now decide: should func2() be freeing the memory that it has allocated? Or does it pass this memory address back to one of the previous functions in the stack, which is then responsible for freeing the memory once it is done with it?

This is a very simple example, since there are allocations happening in only one spot, and we can identify the culprit with a quick glance. In reality, more effort will be needed in order to identify the offending stack trace or traces responsible for a leak.

To make my example a little more life-like, here is the new test program I am using:

#include <stdlib.h>
#include <string.h>

void func1();
void func2();

void main() {
        char *a,*b,*c,*d;
        /* a few small 'non-leaked' allocations */
        a=(char *) malloc (128);
        b=(char *) malloc (64);
        c=(char *) malloc (128);
        d=(char *) malloc (256);
        while(1) func1();
        /* never reached because of the infinite loop above */
        free(a);free(b);free(c);free(d);
}

void func1() {
        func2();
}

void func2() {
        char *str;
        /* allocate 1MB and never free it - this is the leak */
        str=(char *) malloc(1024*1024);
        strcpy(str,"testing");
}

This is the same as above, but I've added a few 'non-leaked' allocations (a, b, c, and d), which would be freed if we ever got past the 'while' loop. This is still a very simple example, but it will work to get my point across.

When this program cores, I load it in dbx - but for analysis purposes, I'll redirect malloc allocation data to a file called 'allocs.out':

# dbx ./leaktest ./core
Type 'help' for help.
[using memory image in ./core]
reading symbolic information ...

Segmentation fault in . at 0xfc04
0x0000fc04 warning: Unable to access address 0xfc04 from core
(dbx) malloc allocation > allocs.out
(dbx) quit
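
If you prefer to capture this output non-interactively - for example, when scripting the collection - the same subcommands can be placed in a file and passed to dbx with the -c flag. A sketch, assuming a command file named /tmp/dbx.cmds:

# cat /tmp/dbx.cmds
malloc allocation > allocs.out
quit

# dbx -c /tmp/dbx.cmds ./leaktest ./core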

Then I parse the allocation data with the awk script that I use to analyze malloc logs, shown below:

#!/bin/awk -f

# Insertion sort of ARRAY[1..ELEMENTS] into ascending order
function sort(ARRAY, ELEMENTS,   temp, i, j) {
        for (i = 2; i <= ELEMENTS; ++i) {
                for (j = i; j > 1 && ARRAY[j-1] > ARRAY[j]; --j) {
                        temp = ARRAY[j]
                        ARRAY[j] = ARRAY[j-1]
                        ARRAY[j-1] = temp
                }
        }
        return
}
BEGIN{
        sum=0           # running total of bytes in all active allocations
        i=0             # number of distinct allocation sizes seen
}
{
   # Allocation entries begin with an address field: 0x plus 8 hex digits
   # (length 10) for 32-bit processes, or 0x plus 16 hex digits (length 18)
   # for 64-bit processes
   if((match($1,"^0x[2-9a]") && length($1)==10) || (match($1,"^0x0") && length($1)==18)){
        sum+=$2
        if(check[$2,0] == 1){
            check[$2,1]++               # size already seen - bump its frequency
        }
        else {
             arr[++i]=$2                # first time this size has been seen
             check[$2,0] = 1
             check[$2,1] = 1
        }
   }
}
END{
    print "Final Result"

    sort(arr,i)                         # report sizes in ascending order
    for(j=1;j<=i;j++){
        ind=arr[j]
        print "Total: " ind*check[ind,1] "      Size of allocation: " ind "     Frequency of allocation: " check[ind,1]
    }
    print sum
}

Although there are likely more in-depth malloc log analysis tools out there, this one has done quite well for me over the years.

I run it against my 'allocs.out' file like this, and receive the following output:

# ./alloc.awk allocs.out
Final Result
Total: 64              Size of allocation: 64        Frequency of allocation: 1
Total: 256             Size of allocation: 128       Frequency of allocation: 2
Total: 256             Size of allocation: 256       Frequency of allocation: 1
Total: 267386880       Size of allocation: 1048576   Frequency of allocation: 255
267387456


That tells me that there are two active 128-byte allocations, consuming a total of 256 bytes; one active allocation of 64 bytes; one of 256 bytes; and 255 allocations of 1MB each. In this case I can clearly see that the 1MB allocations are the culprit. At this point, I can look back at the 'allocs.out' file, searching for the 1MB allocations and determining whether they all come from a common function call. If so, that is the source of the leak.
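
One way to pull out only the entries of interest is to filter 'allocs.out' with awk. A sketch, matched to the extended 'malloc allocation' layout shown earlier (adjust the size to the one you are chasing):

awk '/^0x/ && NF >= 8 { show = ($2 == 1048576) } show { print }' allocs.out | more

This prints each 1MB allocation entry together with its stack traceback lines, making it easy to see whether they share a common call path.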

Again, in a real-life scenario, you'll have more allocations to consider than shown above. But by looking for the allocations with the highest frequency, it shouldn't be too difficult to determine which allocations and stack traces are responsible for the leak using this technique.

In my examples, all of the functions shown in the stack were part of the main process. More likely, you will encounter external library functions when performing analysis. One dbx subcommand that comes in handy for determining exactly where these functions come from is 'map', as shown below.

Let's say that we are looking at this stack trace:

0x000000011010c290      1114112    0 17236260 0x00000102 1533826269     1270 0x00900000000053ec0 malloc_common_debugging
                                              0x0090000002172f890 sqloGetPrivateMemoryFromOs__FPPvCUlCUiCP12SMemLogEvent
                                              0x009000000217330ac allocateChunkGroup__7SMemSetF9SqloChunkT1UiCP12SMemResource
                                              0x00900000021645d10 getChunksFromTree__7SMemSetF9SqloChunkT1CPUiPP9SChunkGrpP9SqloC
                                              0x009000000216447c8 getContiguousChunks__7SMemSetF9SqloChunkCPUiPP9SChunkGrpP9SqloC
                                              0x0090000002178be7c getNewChunkSubgroup__13SQLO_MEM_POOLFCUlUiT1CP12SMemLogEventPP1
                                              0x00900000021790878 allocateMemoryBlock__13SQLO_MEM_POOLFCUlT1UiT1PP17SqloChunkSubg
                                              0x00900000021726824 sqloGetMemoryBlockExtended
                                              0x00900000021724168 sqlo_init_generic_data_temp__FP15sql_static_dataCUlT2CbP19SqloR
                                              0x009000000219cc170 sqlo_create_init_EDU_data_temp__FUiP19SqloResourceBinding
                                              0x00000000100001688 DB2main
                                              0x00900000021706194 sqloEDUMainEntry__FPcUi
                                              0x009000000217058b0 sqloEDUEntry
                                              0x00900000000598fec _pthread_body


That shows us that the 'sqloGetPrivateMemoryFromOs__FPPvCUlCUiCP12SMemLogEvent' function is the allocator of this particular bit of memory. However, we may not know where this function is located.
The stack shows us that this function's address is '0x0090000002172f890'. The 'map' command then shows us that this function belongs to libdb2e.a:

(dbx) map 0x0090000002172f890
Entry 61:
   Object name: ./db2/db2sox/DB2_11.1_FP2/lib64/libdb2e.a
   Member name: shr_64.o
   Text origin:     0x90000002158f000
   Text length:     0x10e1bded
   Data origin:     0x9001000a540b1d8
   Data length:     0x2d8e818
   File descriptor: 0x7d

If you believe that this function or library has a leak and you do not maintain it, you will need to contact the appropriate support group for assistance.

Additionally, you may see that the malloc allocation report is unable to tell you a function name, like this:

0x2006e138           12    0  8716372 0x00000000 1535042884      270 0xd01b1854 malloc_common_debugging
                                                                     0xd193a4a8 ??
                                                                     0xd193af9c ??
                                                                     0xd193dea0 ??
                                                                     0xd19550fc ??
                                                                     0xd196cabc ??
                                                                     0xd1a3b340 ??
                                                                     0xd1a3b180 ??
                                                                     0x1007c6e8 ssh_OpenSSL_add_all_algorithms
                                                                     0x10004604 main
                                                                     0x10000214 __start

To find what the function at address 0xd193a4a8 is, try using the 'listi' subcommand:

(dbx) listi 0xd193a4a8
0xd193a4a8 (default_malloc_ex+0x18) 80410014            lwz   r2,0x14(r1)
0xd193a4ac (default_malloc_ex+0x1c) 81810048            lwz   r12,0x48(r1)
0xd193a4b0 (default_malloc_ex+0x20) 38210040           addi   r1,0x40(r1)
0xd193a4b4 (default_malloc_ex+0x24) 7d8803a6           mtlr   r12
0xd193a4b8 (default_malloc_ex+0x28) 4e800020            blr


In this case, it tells us the function is called default_malloc_ex(), and as shown above, the 'map' subcommand can be used to determine which library this function belongs to.

Further notes:

Make sure that fullcore is turned on (chdev -l sys0 -a fullcore=true) before generating any core dumps. Likewise, ensure that the limits defined in /etc/security/limits do not restrict core file size, and that the directory where core dumps are generated (either the current working directory of the process, or the directory specified by core_path) has ample free space. If the core dump is truncated for any reason, the malloc logs may not be readable.
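
Before reproducing the problem, it is worth confirming those settings. For example (the core directory path is a placeholder):

lsattr -El sys0 -a fullcore        # should show fullcore set to true
ulimit -c                          # core file size limit for the current shell
df -g /<core directory>            # free space where the core will be written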

When checking for/diagnosing a memory leak using the method shown above, you do not need to wait for the process to core dump on its own. You may use the gencore command to generate a core dump, while allowing the process to continue execution. For example:

gencore <PID of process> /<directory>/core

You may then use dbx to examine the core file that you've generated. If needed, after running gencore once, you can wait hours or days until memory usage has grown further and then run gencore again; comparing the two sets of data will help show where the growth is occurring.
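
For example, a comparison over time might look like this (the PID and core file names are placeholders):

gencore <PID> /tmp/core.1
...wait until memory usage has grown further...
gencore <PID> /tmp/core.2

Capture the 'malloc allocation' output from each core in dbx, as shown earlier, and compare the alloc.awk summaries; allocation sizes whose frequency keeps climbing between snapshots point to the leaking code path.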

If you are engaging with IBM support and want them to view the core file, or if you need to view the data on a system other than where the core was generated, you must use the snapcore command to collect all necessary data - run:

snapcore <full path to core file> <full path to process>

Further information regarding collecting cores and sending to IBM can be found here: https://www-01.ibm.com/support/docview.wss?uid=ibm10716719

A live process can also be attached to with dbx, using the 'dbx -a <PID>' command. Be careful using this method - for one thing, attaching with dbx stops execution of the process until you detach or issue the 'cont' subcommand. The other issue is that exiting dbx with 'quit' kills the process; you must use 'detach' if you want the process to continue execution.
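
A minimal attach session might look like this (the PID is a placeholder):

# dbx -a <PID>
(dbx) malloc
(dbx) detach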


The MALLOCDEBUG allocation logging adds memory overhead. For example, a 32-bit process with a stack_depth of 20 requires about 125 bytes of memory per active allocation to store the necessary data, and increasing stack_depth by 10 adds about 40 bytes per active allocation. A 64-bit process with a stack_depth of 20 requires about 240 bytes per allocation, with an increase of about 80 bytes for each additional 10 frames of stack_depth.

If the usage of MALLOCDEBUG causes memory to run out too quickly, you may temporarily adjust a process's MAXDATA value by using the LDR_CNTRL variable when starting the process up. For example:

export LDR_CNTRL=MAXDATA=0x80000000
export MALLOCDEBUG=log:extended,stack_depth:20
<start process>
unset MALLOCDEBUG
unset LDR_CNTRL

This will allow the process to use eight 256MB segments (2GB total) of heap memory; if the process had a default MAXDATA value of 0x0, this gives it significantly more memory to work with.
64-bit processes are allowed more memory to work with, so this is unnecessary if debugging a 64-bit process.

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SWG10","label":"AIX"},"Component":"","Platform":[{"code":"PF002","label":"AIX"}],"Version":"All Versions","Edition":"","Line of Business":{"code":"LOB08","label":"Cognitive Systems"}}]

Document Information

Modified date:
15 September 2022

UID

ibm10728873