
A Framework for the User Defined Malloc Replacement Feature

Level: Introductory

Gary Hook (ghook@us.ibm.com), Senior Technical Consultant, IBM PartnerWorld for Developers, eServer pSeries

01 Feb 2002

Learn how to take advantage of a facility in AIX that lets you replace the memory subsystem with one of your own design. The author explains strategies you can use and gives you some sample code.

Introduction

When porting or developing applications on AIX, one of the most common issues that arises is that of customized memory allocation tools. Developers often wish to use an implementation of malloc() and the related system library routines that allows for detailed debugging, monitoring, and analysis of an application's dynamic memory utilization. In the case of cross-platform development, an organization may have a single implementation of these common memory management functions that runs on all of its systems.

In AIX, since the release of AIX 4.3.3, a feature has been available which allows for the use of a user-supplied memory allocation subsystem. This feature, known as the User Defined Malloc Replacement facility, provides for the dynamic loading of a "plug-in" which supplies a well-defined set of symbol definitions that will be invoked by all components of the running application.

This article explores the details relevant to creating a malloc replacement mechanism, including an explanation of the code necessary for taking advantage of this feature, and strategies to incorporate support for threaded processes.





History

The shared library structure on AIX does not provide for the runtime look-up of symbol definitions in precisely the same fashion as other systems. In particular, a shared library, or module, is considered to be a "black box"; the interface to this black box is defined by a set of imported and exported symbols. Within the module, however, references to symbols defined within the same module are nominally considered to be tightly bound. This tight binding is constructed when the module is built, and even external references (imports into the module) are defined as supplied by a specific module. The net result of this architecture is improved performance for exec() and dynamic loading, at the expense of the ability to easily and flexibly perform symbol preemption.

Given this approach to shared modules on AIX, consider the C system library (libc.a). Most system library routines are implemented within a single shared module that resides within this archive. The memory allocation subsystem (malloc() and friends) is implemented within this module, shr.o, which means that references to malloc() from within the module cannot be overridden or preempted. On a stock installation, without using runtime linking and the rtl_enable command, the technique usually used on other platforms for the incorporation of custom system interfaces cannot be used on AIX. That is, AIX does not provide runtime symbol lookup, so simply providing an alternative definition of a symbol "first" in a set of modules is not sufficient to override references to the default instance.





Solution

Recognizing the need to support user-supplied malloc subsystem implementations, IBM added the malloc plug-in feature to AIX 4.3.3. This feature allows a developer to create a dynamically loadable module that implements the complete allocation subsystem interface and use this module in any existing application, without recompilation or relinking. The architecture also avoids the necessity of statically linking applications in order to override symbol references within system libraries.

There are some complications that must be considered when designing this plug-in:

  1. There is a "chicken-and-egg" problem, in that the malloc subsystem must be able to function before it has been "initialized."
  2. The subsystem should support both threaded and non-threaded applications. This is due to the fact that processes on AIX are able to become threaded at any point during execution if the pthreads library is dynamically loaded.

The following sections outline the construction of a simple plug-in, illustrating the issues that need to be addressed in order to assure correct behavior in the situations described above.





The Basics

Search a current version of the AIX pubs for the phrase "User Defined Malloc Replacement". This document defines the basic requirements for a plug-in. There the developer will find the complete set of function declarations which the plug-in must define, the method for indicating that an application should use the plug-in, and a number of restrictions on the naming and locating of the plug-in. This article assumes that the reader has read that document and is aware of its contents.

Let's start by reviewing the set of required functions. This set can be broken up into 3 groups:

Group                    Functions
Primary API              __malloc__, __free__, __realloc__, __calloc__
Control & Information    __malloc_init__, __mallopt__, __mallinfo__
fork() handlers          __malloc_prefork_lock__, __malloc_postfork_unlock__

The first group is the basic programming interface for the memory allocation subsystem. The second group contains an initialization function, called only once during program execution, an optimization function for controlling the behavior of the subsystem, and an information function which can be used to return data about the behavior and performance of the subsystem to the application. The functions in the final group are used to serialize access to the subsystem across the fork() system call. More on this later. Finally, note that throughout the remainder of this article these functions are referenced without the leading or trailing underscores for the sake of visual clarity.

The first step in creating a plug-in is to implement the primary API as listed above. This article focuses on the structure of the plug-in, rather than upon the implementation of an allocation subsystem.


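Figure 1: Simple implementation
typedef struct _mblock mblock;
struct _mblock
{
    size_t  size;
    size_t  pad;
    char    buf[8];
};
static  int nmallocs = 0, nfrees = 0;

void *__malloc__( size_t nbytes )
{
    mblock *mb;
    int bufsize = (nbytes  + sizeof(mblock) - 1) & ~0x7;

    mb = (mblock *) sbrk( bufsize );
    if ( mb == (mblock *) -1 )      /* sbrk() failed */
        return( NULL );
    nmallocs++;

    mb->size = nbytes;              /* original size of the request */
    mb->pad = 0xA5A5A5A5;           /* signature */

    return( (void *) mb->buf );
}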



Refer to Figure 1 for our version of malloc(). Each allocated block will start with a header containing a size field, plus padding to ensure alignment requirements are met. Every request size will be rounded up to the next multiple of 8 bytes, the original size of the request will be stored in the header, and the sbrk() system call will be used to acquire the memory. Since the padding has no other purpose at this time, a signature will be written to it (which may be useful if we add debugging code later on) and a pointer to the actual buffer is returned to the caller. Finally, there are two static variables which will be used to collect simple statistics.
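The rounding arithmetic is easier to follow in isolation. The following is a minimal sketch, not part of the plug-in itself: block_size() is a hypothetical helper that repeats the bufsize expression from Figure 1, so that the result can be compared against the header size plus the rounded-up request.

```c
#include <stddef.h>

/* Same header layout as Figure 1. */
typedef struct _mblock mblock;
struct _mblock
{
    size_t  size;       /* original size of the request */
    size_t  pad;        /* reserved; holds a signature */
    char    buf[8];     /* first 8 bytes of the caller's buffer */
};

/* Hypothetical helper (not in the article's code): the number of
 * bytes requested from the OS for an nbytes allocation.  Because
 * sizeof(mblock) already includes 8 bytes of buffer, adding
 * sizeof(mblock) - 1 and masking the low 3 bits yields the header
 * plus the request rounded up to a multiple of 8.
 */
static size_t block_size( size_t nbytes )
{
    return (nbytes + sizeof(mblock) - 1) & ~(size_t) 0x7;
}
```

For requests of 1 through 8 bytes the result is the header plus 8 bytes; a 9-byte request grows the block by another 8 bytes.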

The free() function, in figure 2, will easily meet our needs by essentially doing nothing other than incrementing a counter. The realloc() function allocates a buffer and moves the memory contents to the new location. calloc() is built on top of malloc(), with the added function of initializing the memory block with null bytes.


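Figure 2: free, realloc and calloc
void __free__( void *ptr )
{
  /* No-op: the block is never reclaimed */
  nfrees++;
}

void * __realloc__( void *ptr, size_t nbytes )
{
  void        *newptr;
  if ( (newptr = __malloc__( nbytes )) != NULL )
  {
    if ( ptr != NULL )    /* realloc(NULL, n) behaves like malloc(n) */
    {
      mblock  *mb = (mblock *) &((size_t *) ptr)[-2];
      /* destination first: copy the old contents into the new block */
      memcpy( newptr, ptr, MIN(mb->size,nbytes) );
    }
    return( newptr );
  }
  else
    return( NULL );
}

void * __calloc__( size_t count, size_t nbytes )
{
  void        *ptr;
  /* Grab a block and fill it with null bytes */
  if ( (ptr = __malloc__( count * nbytes )) != NULL )
    memset( ptr, '\0', count * nbytes );
  return( ptr );
}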
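The realloc() code finds the header of a block by stepping back two size_t words from the pointer the application holds. As an illustrative alternative, and not the article's own code, the same lookup can be written with offsetof(), which does not depend on counting the fields in front of buf. The names header_of() and copy_len() are hypothetical.

```c
#include <stddef.h>

/* Same header layout as Figure 1. */
typedef struct _mblock mblock;
struct _mblock
{
    size_t  size;
    size_t  pad;
    char    buf[8];
};

/* Hypothetical helper: map the pointer handed to the application
 * back to the header that precedes it.
 */
static mblock *header_of( void *ptr )
{
    return (mblock *) ((char *) ptr - offsetof(mblock, buf));
}

/* Hypothetical helper: the number of bytes realloc() should copy,
 * which is the smaller of the old and new request sizes.
 */
static size_t copy_len( void *oldptr, size_t new_size )
{
    size_t old_size = header_of( oldptr )->size;

    return old_size < new_size ? old_size : new_size;
}
```

Shrinking a 5-byte block to 3 bytes copies 3 bytes, while growing it to 100 bytes copies only the 5 bytes that were originally requested.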


Because __malloc_init__() is not called before the first call to __malloc__(), the conclusion here is that malloc() must be self-initializing. The function may use the common technique of checking and setting a global variable to indicate whether it has been called previously. As our example code does not require any initialization, this technique is not required. As we develop our example, adding support for threads, an alternative technique will be used.
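For reference, the check-and-set technique can be sketched as follows. This is an illustration only, under stated assumptions: demo_malloc(), first_call_setup(), and the fixed-size arena are hypothetical stand-ins (the article's code uses sbrk() and needs no setup), and the plain flag is safe only while the process is single-threaded.

```c
#include <stddef.h>

#define ARENA_SIZE 4096             /* hypothetical fixed pool size */

static int    initialized = 0;      /* checked and set on the first call */
static int    init_count = 0;       /* illustration only: counts setups */
static size_t next_free;            /* offset of the next unused byte */
static union { char bytes[ARENA_SIZE]; long double align; } arena;

/* Hypothetical one-time setup for the subsystem. */
static void first_call_setup( void )
{
    next_free = 0;
    init_count++;
}

/* Stand-in for __malloc__(): initializes itself on first use and
 * then hands out 8-byte-rounded pieces of the arena.
 */
void *demo_malloc( size_t nbytes )
{
    size_t rounded;
    void  *p;

    if ( !initialized )
    {
        initialized = 1;
        first_call_setup();
    }

    if ( nbytes > ARENA_SIZE )
        return( NULL );
    rounded = (nbytes + 7) & ~(size_t) 0x7;
    if ( rounded > ARENA_SIZE - next_free )
        return( NULL );

    p = arena.bytes + next_free;
    next_free += rounded;
    return( p );
}
```

However many times demo_malloc() is called, the setup routine runs exactly once, before the first block is handed out.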

The mallopt() and mallinfo() functions will do very little in our example implementation. They are secondary to the creation of our framework, although some applications do take advantage of their utility. The code at the end of this article shows a __mallopt__() function that simply returns 0, and a __mallinfo__() function that only returns an empty mallinfo structure.





What About Threads?

As stated earlier, a general subsystem is going to need to support both threaded and non-threaded applications. This turns out to be not as complex as it sounds, due to the availability of indicators that can be checked by our code. By starting with a simple implementation, such as the one shown here, we can add code as required to handle issues relevant to threads, and do so in a manner that allows the code to function correctly in both single- and multi-threaded processes.

A critical issue regarding threads is that of fork()ing. When a threaded process on AIX forks, only the thread calling fork() exists in the new child process. This thread becomes the main thread of execution, and all other threads cease to exist. Therefore it is critical that no locks or mutexes are held by any thread other than the one calling fork(). To facilitate this requirement, the fork() function calls a series of registered routines before and after the creation of the child process; our plug-in must provide two of these routines, listed in the table above, to serialize access to our data structures and ensure that no locks or mutexes are held by any threads that will not exist in the child process.

So that our plug-in can support simultaneous calls from multiple threads, a mutex will be used to serialize access to the memory allocation routines. Since this article is not about writing clever, threaded memory allocation code, it is adequate to demonstrate the required structure of the plug-in by letting only one of malloc(), calloc(), realloc(), and free() execute at any given moment.

The code shown above in figures 1 and 2 must be enhanced to respect this mutex, which will be called mallocmutex. Consider figure 3, where 2 statements have been added. The global variable __n_pthreads is checked for a positive value, and if this check is true lock/unlock functions are called for our mutex. The value of __n_pthreads can always be used as an indicator of threads activity in the process, as well as indicating the current number of threads. If the value is -1, the process is single-threaded and no pthreads-related calls need be made. If the value is greater than 0, the threads environment has been initialized, and new threads may be created at any time. Therefore, acquisition of the mutex before any work occurs is crucial, and the mutex must be released before the function returns. Note here that in the interest of granularity, the mutex is released immediately after the memory block is acquired; the initialization of the block may occur even while some other thread is potentially calling one of the memory allocation routines.


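Figure 3 : thread-aware __malloc__()
pthread_mutex_t mallocmutex;

void *__malloc__( size_t nbytes )
{
    mblock *mb;
    int bufsize = (nbytes  + sizeof(mblock) - 1) & ~0x7;

    if ( __n_pthreads > 0 )
        pthread_mutex_lock( &mallocmutex );

    mb = (mblock *) sbrk( bufsize );
    if ( mb != (mblock *) -1 )
        nmallocs++;

    if ( __n_pthreads > 0 )
        pthread_mutex_unlock( &mallocmutex );

    if ( mb == (mblock *) -1 )      /* sbrk() failed */
        return( NULL );

    /* The block is ours; another thread may enter while we
     * finish initializing the header.
     */
    mb->size = nbytes;
    mb->pad = 0xA5A5A5A5;

    return( (void *) mb->buf );
}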


Similar modifications must be made to free(), but as realloc() and calloc() are built on top of malloc() they do not require additional code. Note that a more complex implementation of these routines would likely require the same additions as malloc(). The final code is included in the source at the end of this article.

With our functions modified to handle multiple threads, let us consider the pre-fork and post-fork functions. As was mentioned above, the pre-fork function must acquire any locks in use by the subsystem; in our example that would be the mallocmutex mutex. Also, we have an opportunity for an optimization. By the time the pre-fork function is invoked, all other threads have been quiesced. It is safe, then, to ascertain whether more than a single thread is running, since no threads will be created between the time our pre-fork function is called and the process fork actually occurs. Also, since our plug-in supports both threaded and non-threaded processes, there will be no need to acquire mallocmutex in a single-threaded application; actually, attempting to do so would be an error, since the thread environment, and therefore our mutex, will not have been initialized. As can be seen in figure 4, then, if the number of threads is 1 (for a threaded process) or -1 (for a single-threaded process) a flag is set and the function immediately returns. Otherwise, mallocmutex is acquired and control returned to fork().


Figure 4 : the pre-fork function
static int single_threaded = 0;

void __malloc_prefork_lock__( void )
{
    if ( __n_pthreads < 2 )
    {
        single_threaded = 1;
        return;
    }

    pthread_mutex_lock( &mallocmutex );
}


In figure 5 is the post-fork function. Here is the converse of __malloc_prefork_lock__(), wherein a check of the single-threaded flag takes place, and then if necessary mallocmutex is released. Note that at the time this function is called fork() has not yet returned in the child process, and this function is executed in the context of the main thread of execution since no other threads exist.


Figure 5 : the post-fork function
void __malloc_postfork_unlock__( void )
{
    if ( single_threaded )
    {
        single_threaded = 0;
        return;
    }

    pthread_mutex_unlock( &mallocmutex );
}  	
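The plug-in's fork handlers are invoked by the malloc replacement facility itself. To experiment with the same lock-before-fork, unlock-after-fork pattern outside a plug-in, the standard POSIX pthread_atfork() call can register equivalent handlers. The sketch below is a portable analogue under that assumption, not the AIX mechanism described in this article, and every demo_ name is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <pthread.h>

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static int prepare_calls = 0, parent_calls = 0;

/* Runs in the forking thread before the child is created. */
static void demo_prepare( void )
{
    pthread_mutex_lock( &demo_mutex );
    prepare_calls++;
}

/* Runs in the parent after the fork. */
static void demo_parent( void )
{
    parent_calls++;
    pthread_mutex_unlock( &demo_mutex );
}

/* Runs in the child, whose only thread is a copy of the forking thread. */
static void demo_child( void )
{
    pthread_mutex_unlock( &demo_mutex );
}

int demo_register_fork_handlers( void )
{
    return pthread_atfork( demo_prepare, demo_parent, demo_child );
}

/* Forks once; returns 0 if the child found the mutex unlocked. */
int demo_fork_and_check( void )
{
    int   status = 0;
    pid_t pid = fork();

    if ( pid < 0 )
        return( -1 );
    if ( pid == 0 )
        _exit( pthread_mutex_trylock( &demo_mutex ) == 0 ? 0 : 1 );

    if ( waitpid( pid, &status, 0 ) < 0 )
        return( -1 );
    return ( WIFEXITED(status) && WEXITSTATUS(status) == 0 ) ? 0 : 1;
}
```

Because the forking thread itself holds the mutex across the fork, no other thread can be in the middle of an update to the protected data when the child's copy of memory is made.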


Finally, let's return to the initialization function. Now that we've added a mutex to control access to the functions, and have provided support for forking our process, there remains the task of ensuring that the subsystem properly initializes itself in both single- and multi-threaded processes. Recall from the prior section that __malloc_init__() is not called first; __malloc__() is, and __malloc_init__() gets invoked when the process becomes multi-threaded. Our implementation will consider the tasks required to enable the subsystem to handle multiple threads, and make no assumptions about the point in time at which the process becomes multi-threaded.

Consider the code shown in figure 6. The subsystem's mutex is initialized with the "recursive" attribute; this ensures that our code can, if desired, take advantage of recursion, and thus multiple locks of the same mutex by the same thread. While the example code presented in this article does not utilize that type of structure, this function provides an excellent example of the type of task that should be accomplished at initialization time. Our __malloc_init__() initializes an attribute structure, sets the mutex type, and uses that attribute to initialize the plug-in's mutex. Once this initialization function completes execution (along with any initialization required by other subsystems within the process) the __n_pthreads variable will contain a positive value, indicating that the process is multi-threaded. Now malloc() and free() will access the mutex during their execution; review figure 3 for the code.


Figure 6: malloc subsystem initialization
void __malloc_init__( void )
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init( &attr );
    pthread_mutexattr_settype( &attr, PTHREAD_MUTEX_RECURSIVE );

    pthread_mutex_init( &mallocmutex, &attr );
}
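To see what the "recursive" attribute buys, the following sketch initializes a mutex the same way and then locks it twice from the same thread. It is a standalone illustration: demo_mutex, demo_init_recursive(), and demo_nested_lock() are hypothetical names, and return codes are checked here although the article's example does not check them.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <pthread.h>

static pthread_mutex_t demo_mutex;

/* Initialize demo_mutex as a recursive mutex; returns 0 on success. */
int demo_init_recursive( void )
{
    pthread_mutexattr_t attr;
    int rc;

    if ( (rc = pthread_mutexattr_init( &attr )) != 0 )
        return( rc );

    rc = pthread_mutexattr_settype( &attr, PTHREAD_MUTEX_RECURSIVE );
    if ( rc == 0 )
        rc = pthread_mutex_init( &demo_mutex, &attr );

    pthread_mutexattr_destroy( &attr );
    return( rc );
}

/* Lock the mutex twice from the calling thread and report how many
 * locks were granted.  The second lock succeeds only because the
 * mutex is recursive.
 */
int demo_nested_lock( void )
{
    int depth = 0, i;

    if ( pthread_mutex_lock( &demo_mutex ) == 0 )
        depth++;
    if ( pthread_mutex_lock( &demo_mutex ) == 0 )
        depth++;

    /* Each successful lock needs a matching unlock. */
    for ( i = 0; i < depth; i++ )
        pthread_mutex_unlock( &demo_mutex );

    return( depth );
}
```

A subsystem in which realloc() holds the lock while calling malloc(), which locks again, would rely on exactly this behavior.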





Building the Plug-in

A simple makefile is shown in figure 7. The xlc_r compiler is used to provide proper support for threads, and the compiler is used to build the loadable module. The User Defined Malloc Replacement feature takes advantage of the AIX facility of storing loadable modules within archives, or "libraries"; one advantage of this is to provide both a 32-bit and a 64-bit implementation of the plug-in packaged within a single archive file. The 32-bit module itself must be named mem32.o; only the archive name is arbitrary. Our example builds the mem32.o module and places it in the file libmalloc.a.


Figure 7: the makefile
CC =         xlc_r
CFLAGS =     -g

EXPORT =     malloc.exp
EXP_OPTION = -bE:$(EXPORT)


libmalloc.a:  mem32.o
        $(AR) -rv $@ $?

mem32.o:      malloc.o $(EXPORT)
        $(CC) -o $@ malloc.o $(EXP_OPTION) -bnoentry -bM:SRE


Note the options used to build the loadable module. An export list specifies the interface, or API, for the module, and the malloc.exp file is shown in figure 8.


Figure 8: the list of exported symbols
__malloc__
__free__
__realloc__
__calloc__
__mallopt__
__mallinfo__
__malloc_init__
__malloc_prefork_lock__
__malloc_postfork_unlock__


The symbols listed comprise the set of functions that must be provided by the plug-in. The "-bnoentry" option indicates that the module being built does not have a default entry point (in contrast to executable files, which have a well-defined entry symbol). The "-bM:SRE" option indicates to the linker that the module being built should be marked "shared", i.e. a shared library or shared module. Under the covers, the compiler also adds references to system libraries such as libpthreads.a and libc.a. The resulting mem32.o module contains our subsystem code with all the required function definitions and depends upon system libraries as required.





Using the Plug-in

Now that our module is built and packaged, any existing binary can be directed to use it via an environment variable, and no modifications to any application are necessary. The variable MALLOCTYPE is set to the keyword "user" followed by a colon and the name of the archive containing the modules. For our implementation, then, the ksh syntax would be:

$ export MALLOCTYPE=user:libmalloc.a

Any subsequently run application or command will utilize the plug-in. To disable the use of the plug-in, unset (ksh) or unsetenv (csh) the MALLOCTYPE variable.
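A quick way to confirm that a plug-in behaves is to run a small driver with MALLOCTYPE set. The sketch below is one possible driver, not part of the original article; exercise_allocator() is a hypothetical name, and it uses only the standard malloc(), realloc(), calloc(), and free() calls, so it runs against whichever allocator is active.

```c
#include <stdlib.h>
#include <string.h>

/* Allocate, grow, and zero-allocate blocks of increasing size and
 * check the contents each time.  Returns the number of failures.
 */
int exercise_allocator( int rounds )
{
    int i, failures = 0;

    for ( i = 1; i <= rounds; i++ )
    {
        size_t n = (size_t) i * 3;
        size_t k;
        char  *p, *q;

        if ( (p = malloc( n )) == NULL )
        {
            failures++;
            continue;
        }
        memset( p, 'x', n );

        /* realloc() must preserve the first n bytes. */
        if ( (q = realloc( p, n * 2 )) == NULL )
        {
            free( p );
            failures++;
            continue;
        }
        for ( k = 0; k < n; k++ )
            if ( q[k] != 'x' )
            {
                failures++;
                break;
            }
        free( q );

        /* calloc() must return null bytes. */
        if ( (q = calloc( n, 1 )) == NULL )
        {
            failures++;
            continue;
        }
        for ( k = 0; k < n; k++ )
            if ( q[k] != '\0' )
            {
                failures++;
                break;
            }
        free( q );
    }

    return( failures );
}
```

Calling exercise_allocator() from main() and using its return value as the exit status gives a simple pass/fail check that can be run with and without MALLOCTYPE set.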





That's It?

Yup, that's it: the basic structure for a malloc replacement on AIX. Creating a plug-in becomes a matter of incorporating your memory allocation subsystem code with this framework. And adding basic support for threads is a choice that requires only a few additional lines of code.

This article has drawn upon existing documentation in the AIX General Programming Concepts: Writing and Debugging Programs manual, with the goal of elaborating on the malloc replacement feature using working code, as well as touching upon other facets of AIX such as creating and using shared modules. While there are other details that can be examined (for example, the concept of a process going threaded during execution), we've touched upon the history and issues necessary to get up and running using this AIX feature. We've also seen that incorporating a new subsystem into an existing application is transparent, allowing memory allocation techniques to be tailored to the application. So have fun plugging in!





Appendix A: The Code

/*----------------------------------------------------------------------
        malloc.c

    This file combines malloc plug-in framework code with a very
    simple memory allocation subsystem.  Framework code is commented
    as such.  The memory allocation code simply calls the OS every
    time a block is requested.

----------------------------------------------------------------------*/
#include <sys/types.h>
#include <sys/param.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>     /* memcpy(), memset() */
#include <malloc.h>
#include <pthread.h>


/* API declaration */
extern void *__malloc__(size_t);
extern void __free__(void *);
extern void *__realloc__(void *, size_t);
extern void *__calloc__(size_t, size_t);
extern int __mallopt__(int, int);
extern struct mallinfo __mallinfo__();
extern void __malloc_init__(void);
extern void __malloc_prefork_lock__(void);
extern void __malloc_postfork_unlock__(void);


/* Framework */
pthread_mutex_t mallocmutex;

/* Primitive statistics */
static int nmallocs = 0;
static int nfrees = 0;


/* Every allocation has a 2-word header containing the
 * original size of the allocation and some reserved
 * space.  The block returned to the application follows
 * immediately.
 */
typedef struct _mblock mblock;
struct _mblock
{
    size_t size;
    size_t pad;
    char buf[8];
};


void *
__malloc__( size_t nbytes )
{
    mblock *mb;
    int bufsize = (nbytes  + sizeof(mblock) - 1) & ~0x7;

    /* Framework */
    if ( __n_pthreads > 0 )
        pthread_mutex_lock( &mallocmutex );

    mb = (mblock *) sbrk( bufsize );
    if ( mb != (mblock *) -1 )
        nmallocs++;

    /* Framework */
    if ( __n_pthreads > 0 )
        pthread_mutex_unlock( &mallocmutex );

    if ( mb == (mblock *) -1 )      /* sbrk() failed */
        return( NULL );

    /* N.B. we have the block, so another thread can get in
     * here while we finish initializing the header.
     */
    mb->size = nbytes;
    mb->pad = 0xA5A5A5A5;


    return( (void *) mb->buf );
}


void
__free__( void *ptr )
{
    /* No-op: don't do anything */
    if ( __n_pthreads > 0 )
        pthread_mutex_lock( &mallocmutex );

    nfrees++;

    if ( __n_pthreads > 0 )
        pthread_mutex_unlock( &mallocmutex );
}


void *
__realloc__( void *ptr, size_t nbytes )
{
    void *newptr;

    /* Since our simple implementation is built on top
     * of malloc and doesn't really try to re-allocate,
     * just grab a new block of memory and copy the
     * contents.  The old block is lost.  Note that
     * there is no real framework code here.
     */
    if ( (newptr = __malloc__( nbytes )) != NULL )
    {
        if ( ptr != NULL )  /* realloc(NULL, n) behaves like malloc(n) */
        {
            mblock *mb = (mblock *) &((size_t *) ptr)[-2];

            /* destination first: copy the old contents into the new block */
            memcpy( newptr, ptr, MIN(mb->size,nbytes) );
        }
        return( newptr );
    }
    else
        return( NULL );
}


void *
__calloc__( size_t count, size_t nbytes )
{
    void *ptr;

    /* Again, simple code.  Grab a block and initialize it */

    if ( ptr = __malloc__( count * nbytes ) )
        memset( ptr, '\0', count * nbytes );

    return( ptr );
}


int
__mallopt__( int cmd, int val )
{
    /* No-op */
    return( 0 );
}


struct mallinfo
__mallinfo__() 
{
    static struct mallinfo mi;

    /* The mallinfo structure could be used to return
     * the number of allocations stored in the nmallocs
     * and nfrees counters.
     */

    return( mi );
}


void
__malloc_init__( void )
{
    pthread_mutexattr_t attr;

    /* This function is called when the pthreads subsystem
     * gets initialized.  Since we link directly to libpthreads
     * any calls to the pthreads routines will force the
     * process to become multi-threaded.  A more flexible
     * plug-in would require a dynamic lookup of the symbols
     * and the use of function pointers.  But for brute-force
     * illustration, this does the job.
     */

    pthread_mutexattr_init( &attr );
    pthread_mutexattr_settype( &attr, PTHREAD_MUTEX_RECURSIVE );

    pthread_mutex_init( &mallocmutex, &attr );
}

/* Framework
 *
 * Everything here-on is part of the framework.  If required,
 * more work could be done at fork time.
 */

static int single_threaded = 0;

void
__malloc_prefork_lock__( void )
{
    if ( __n_pthreads < 2 )
    {
        single_threaded = 1;
        return;
    }

    pthread_mutex_lock( &mallocmutex );
}


void
__malloc_postfork_unlock__( void )
{
    if ( single_threaded )
    {
        single_threaded = 0;
        return;
    }
    pthread_mutex_unlock( &mallocmutex );
}

#!
*-----------------------------------------------------------------------
* malloc.exp
*
* Exported symbols for the plug-in
*
__malloc__ 
__free__ 
__realloc__ 
__calloc__ 
__mallinfo__ 
__mallopt__ 
__malloc_init__ 
__malloc_prefork_lock__ 
__malloc_postfork_unlock__

#-----------------------------------------------------------------------
#       Makefile
#
#   Builds the malloc plug-in, using the required module name,
#   and stores it into an archive.
#-----------------------------------------------------------------------

CC = xlc_r
CFLAGS = -g

EXPORT = malloc.exp
EXP_OPTION = -bE:$(EXPORT)


libmalloc.a: mem32.o
        $(AR) -rv $@ $?

mem32.o: malloc.o $(EXPORT)
        $(CC) -o $@ malloc.o $(EXP_OPTION) -bnoentry -bM:SRE








About the author

Gary R. Hook is a Senior Technical Consultant at IBM, providing application development, porting, and technical assistance to independent software vendors. Mr. Hook's professional experience focuses on Unix-based application development. Upon joining IBM in 1990, he worked with the AIX Technical Support center in Southlake, Texas, providing consulting and technical support services to customers, with an emphasis upon AIX application architecture. Now residing in Austin, Mr. Hook was a member of the AIX Kernel Development team from 1995 through 2000, specializing in the AIX linker, loader, and general application development tools. You can contact him at ghook@us.ibm.com.



