The goal of memory management is to provide a method by which memory can be dynamically shared amongst a variety of users for a variety of purposes. The memory management method should do both of the following:
- Minimize the amount of time required to manage the memory
- Maximize the available memory for general usage (minimize management overhead)
Memory management is ultimately a zero-sum game of tradeoffs. You can develop an algorithm that uses little memory for management but takes more time to manage the available memory. You can also develop an algorithm that efficiently manages memory but uses a bit more memory. In the end, the requirements for the particular application drive the balance of the tradeoffs.
Early memory managers used a heap-based allocation strategy. In this method, a large block of memory (called the heap) is used to provide memory for user-defined purposes. When users need a block of memory, they make a request for a given size. The heap manager looks at the available memory (using a particular algorithm) and returns the block. Some of the algorithms used in this search are the first-fit (the first block encountered in the heap that satisfies the request), and the best-fit (the best block in the heap that satisfies the request). When the users are finished with the memory, they return it to the heap.
The fundamental problem with this heap-based allocation strategy is fragmentation. As blocks of memory are allocated, they are returned in different orders and at different times. This tends to leave holes in the heap requiring more time to efficiently manage the free memory. This algorithm tends to be memory efficient (allocating what's necessary) but requires more time to manage the heap.
Another approach, called buddy memory allocation, is a faster memory technique that divides memory into power-of-2 partitions and attempts to allocate memory requests using a best-fit approach. When memory is freed by the user, the buddy block is checked to see if any of its contiguous neighbors have also been freed. If so, the blocks are combined to minimize fragmentation. This algorithm tends to be a bit more time efficient but can waste memory due to the best-fit approach.
This article focuses on Linux kernel memory management and, in particular, the mechanisms provided through slab allocation.
The slab allocator used in Linux is based on an algorithm first introduced by
Jeff Bonwick for the SunOS operating system. Jeff's allocator revolves around
object caching. Within a kernel, a considerable amount of memory is allocated for
a finite set of objects such as file descriptors and other common structures. Jeff
found that the amount of time required to initialize a regular object in the
kernel exceeded the amount of time required to allocate and deallocate it. His
conclusion was that instead of freeing the memory back to a global pool, he would
have the memory remain initialized for its intended purpose. For example, if
memory is being allocated for a mutex, the mutex initialization function
(mutex_init) need only be performed once when the
memory is first allocated for the mutex. Subsequent allocations of the memory need
not perform the initialization because it's already in the desired state from the
previous deallocation and call to the deconstructor.
The Linux slab allocator uses these ideas and others to build a memory allocator that is efficient in both space and time.
Figure 1 illustrates the high-level organization of the slab structures. At the
highest level is the cache_chain, which is a linked
list of the slab caches. This is useful for best-fit algorithms that look for a
cache that most closely fits the size of the desired allocation (iterating the
list). Each element of the cache_chain is a
kmem_cache structure reference (called a cache).
This defines a pool of objects of a given size to manage.
Figure 1. The major structures of the slab allocator

Each cache contains a list of slabs, which are contiguous blocks of memory (typically pages). Three slabs exist:
slabs_full- Slabs that are fully allocated.
slabs_partial- Slabs that are partially allocated.
slabs_empty- Slabs that are empty, or no objects allocated.
Note that the slabs on the slabs_empty list are prime
candidates for reaping. This is the process by which the memory used by
slabs is provided back to the operating system for other uses.
Each slab in the slab list is a contiguous block of memory (one or more contiguous pages) that is divided into objects. These objects are the fundamental elements that are allocated and deallocated from the particular cache. Note that the slab is the smallest allocation to the slab allocator, so if it needs to grow, this is the minimum by which it will grow. Typically, numerous objects are allocated per slab.
As objects are allocated and deallocated from a slab, the individual slab can
move between the slab lists. For example, when all objects are consumed in a slab,
it moves from the slabs_partial list to the
slabs_full list. When a slab is full and an object is
deallocated, it moves from the slabs_full list to the
slabs_partial list. When all objects are deallocated,
it moves from the slabs_partial list to the
slabs_empty list.
The slab cache allocator provides a number of benefits over traditional memory management schemes. First, kernels commonly rely on the allocation of small objects that are allocated numerous times over the lifetime of the system. The slab cache allocator provides this through the caching of similarly sized objects, thus avoiding the fragmentation problems that commonly occur. The slab allocator also supports the initialization of common objects, thus avoiding the need to repeatedly initialize an object for the same purpose. Finally, the slab allocator supports hardware cache alignment and coloring, which allows objects in different caches to occupy the same cache lines for increased cache utilization and better performance.
Now on to the application program interface (API) for creating new slab caches, adding memory to the caches, destroying the caches, as well as the functions to allocate and deallocate objects from them.
The first step is to create your slab cache structure, which you can create statically as:
static struct kmem_cache *my_cachep; |
This reference is then used by the other slab cache functions for creation,
deletion, allocation, and so on. The kmem_cache
structure contains the per-central processing unit (CPU) data, a set of tunables
(accessible through the proc file system), statistics, and necessary elements to
manage the slab cache.
Kernel function kmem_cache_create is used to create a
new cache. This is commonly performed at kernel init time or when a kernel module
is first loaded. Its prototype is defined as:
struct kmem_cache *
kmem_cache_create( const char *name, size_t size, size_t align,
unsigned long flags;
void (*ctor)(void*, struct kmem_cache *, unsigned long),
void (*dtor)(void*, struct kmem_cache *, unsigned long));
|
The name argument defines the name of the cache, which
is used by the proc file system (in /proc/slabinfo) to identify this cache. The
size argument specifies the size of the objects that
should be created for this cache, with the align
argument defining the required alignment for each object. The
flags argument specifies options to enable for the
cache. These flags are shown in Table 1.
Table 1. Partial list of options for kmem_cache_create (as specified in flags)
| Option | Description |
|---|---|
| SLAB_RED_ZONE | Insert markers at header and trailer of the object to support checking of buffer overruns. |
| SLAB_POISON | Fill a slab with a known pattern to allow monitoring of objects in the cache (objects owned by the cache, but modified externally). |
| SLAB_HWCACHE_ALIGN | Specify that the objects in this cache must be aligned to the hardware cachline. |
The ctor and dtor arguments
define an optional object constructor and deconstructor. The constructor and
deconstructor are callback functions provided by the user. When a new object is
allocated from the cache, it can be initialized through the constructor.
After the cache is created, its reference is returned by the
kmem_cache_create function. Note that this function
allocates no memory to the cache. Instead, when attempts are made to allocate
objects from the cache (when it's initially empty), memory is allocated to it
through a refill operation. This same operation is used to add memory to
the cache when all its objects are consumed.
Kernel function kmem_cache_destroy is used to destroy
a cache. This call is performed by kernel modules when they are unloaded. The
cache must be empty before this function is called.
void kmem_cache_destroy( struct kmem_cache *cachep ); |
To allocate an object from a named cache, the
kmem_cache_alloc function is used. The caller provides
the cache from which to allocate an object and a set of flags:
void* kmem_cache_alloc( struct kmem_cache *cachep, gfp_t flags ); |
This function returns an object from the cache. Note that if the cache is
currently empty, this function may invoke
cache_alloc_refill to add memory to the cache. The flag
options for kmem_cache_alloc are the same as those for
kmalloc. Table 2 provides a partial list of the flag
options.
Table 2. Flag options for the kmem_cache_alloc and kmalloc kernel functions
| Flag | Description |
|---|---|
| GFP_USER | Allocate memory for a User (this call may sleep). |
| GFP_KERNEL | Allocate memory from kernel RAM (this call may sleep). |
| GFP_ATOMIC | Force no-sleep on this call (useful for interrupt handlers). |
| GFP_HIGHUSER | Allocate from high memory. |
The kernel function kmem_cache_zalloc is similar to
kmem_cache_alloc, except that it performs a
memset of the object to clear it out prior to returning
the object to the caller.
To free an object back to the slab, the
kmem_cache_free is used. The caller provides the cache
reference and the object to be freed.
void kmem_cache_free( struct kmem_cache *cachep, void *objp ); |
The most common memory management functions in the kernel are the
kmalloc and kfree functions.
The prototypes for these functions are defined as:
void *kmalloc( size_t size, int flags ); void kfree( const void *objp ); |
Note that in kmalloc the only arguments are the
requested size of the object to allocate and a set of flags (see the partial list
in Table 2). But kmalloc and
kfree use the slab cache just like the
previously-defined functions. Instead of naming a specific slab cache from which
to allocate an object, the kmalloc function iterates
through the available caches looking for one that can satisfy its size constraint.
When found, an object is allocated (using
__kmem_cache_alloc). To free an object with
kfree, the cache from which the object was allocated is
determined with a call to virt_to_cache. This function
returns the cache reference that is then used in a call to
__cache_free to release the object.
The slab cache API provides a number of other useful functions. The
kmem_cache_size function returns the size of the
objects that are managed by this cache. You can also retrieve the name of a given
cache (as defined at cache creation time) through a call to
kmem_cache_name. A cache can be shrunk by releasing
free slabs in the cache. This can be accomplished with a call to
kmem_cache_shrink. Note that this action (called
reaping) is performed automatically on a periodic basis by the kernel (through
kswapd).
unsigned int kmem_cache_size( struct kmem_cache *cachep ); const char *kmem_cache_name( struct kmem_cache *cachep ); int kmem_cache_shrink( struct kmem_cache *cachep ); |
The following code snippets demonstrate the creation of a new slab cache,
allocating and deallocating objects from the cache, and then destroying the cache.
To begin, a kmem_cache object must be defined and then
initialized (see Listing 1). This particular cache contains 32-byte objects and is
hardware-cache aligned (as defined by the flags argument
SLAB_HWCACHE_ALIGN).
Listing 1. Creating a new slab cache
static struct kmem_cache *my_cachep;
static void init_my_cache( void )
{
my_cachep = kmem_cache_create(
"my_cache", /* Name */
32, /* Object Size */
0, /* Alignment */
SLAB_HWCACHE_ALIGN, /* Flags */
NULL, NULL ); /* Constructor/Deconstructor */
return;
}
|
With your slab cache allocated, you can now allocate an object from it. Listing 2 provides an example of allocating and deallocating an object from the cache. It also demonstrates two of the miscellaneous functions.
Listing 2. Allocating and deallocating objects
int slab_test( void )
{
void *object;
printk( "Cache name is %s\n", kmem_cache_name( my_cachep ) );
printk( "Cache object size is %d\n", kmem_cache_size( my_cachep ) );
object = kmem_cache_alloc( my_cachep, GFP_KERNEL );
if (object) {
kmem_cache_free( my_cachep, object );
}
return 0;
}
|
Finally, Listing 3 is an example of destroying a slab cache. The caller must ensure that no attempts to allocate objects from the cache are performed during the destroy operation.
Listing 3. Destroying a slab cache
static void remove_my_cache( void )
{
if (my_cachep) kmem_cache_destroy( my_cachep );
return;
}
|
The proc file system provides a simple way to monitor all slab caches that are active in the system. This file, called /proc/slabinfo, provides detailed information about all slab caches in addition to providing a few tunables that are accessible from user space. The current version of slabinfo provides a header so that the output is a bit more readable. For each slab cache in the system, the number of objects, number of active objects, and the object size is provided (in addition to the objects and pages per slab). A set of tunables and slab data are also provided.
To tune a particular slab cache, simply echo the slab cache name and the three tunable parameters as a string to the /proc/slabinfo file. The following example illustrates how to increase the limit and batchcount while leaving the shared factor as is (format is "cache name limit batchcount shared factor"):
# echo "my_cache 128 64 8" > /proc/slabinfo |
The limit field indicates the maximum number of
objects that will be cached for each CPU. The
batchcount field is the maximum number of global cache
objects that will be transferred to the per-CPU cache when it becomes empty. The
shared parameter indicates the sharing behavior for
Symmetric MultiProcessing (SMP) systems.
Note that you must have superuser privileges to tune parameters in the proc file system for a slab cache.
For small embedded systems, a slab emulation layer exists called the SLOB. This
slab replacement is advantageous in small embedded Linux systems, but while it
conserves up to 512KB of memory, it suffers from fragmentation and poor scaling.
When CONFIG_SLAB is disabled, the kernel falls back to
this SLOB allocator. See the Resources section for more
details.
The source code for the slab cache allocator is actually one of the more readable aspects of the Linux kernel. Outside of the indirection that exists in the function calls, the source is intuitive and, in general, well commented. If you'd like to know more about the slab cache allocator, I suggest that you start there as it's the most up-to-date documentation on the mechanism. The Resources section below offers a number of sources that describe the slab cache allocator, but unfortunately all of them are out of date given the current 2.6 implementation.
Learn
-
"The Slab Allocator:
An Object-Caching Kernel Memory Allocator (1994)"
is Jeff Bonwick's original paper, which describes the first slab allocator that
appeared in the SunOS 5.4 kernel memory allocator.
-
"The Linux Slab
Allocator (200)" describes the Linux version of the slab allocator.
This paper describes the 2.4 kernel version, which has since been updated.
-
The
SLOB allocator is an implementation
of the SLAB cache for memory-constrained systems. This can be enabled through
kernel configuration.
-
The online book Understanding the Linux Virtual Memory
Manager (PDF format), by Mel Gorman, gives a detailed look at memory management within
Linux. You can download it from Prentice Hall.
- In the
developerWorks Linux zone,
find more resources for Linux developers, including
our readers' favorite Linux articles and
tutorials
over the last month.
- Stay current with
developerWorks technical events and Webcasts.
Get products and technologies
- With
IBM trial software,
available for download directly from developerWorks, build your next development
project on Linux.
Discuss
- Get involved in the
developerWorks community
through our developer blogs, forums, podcasts, and community
topics in our new
developerWorks spaces.

M. Tim Jones is an embedded software architect and the author of GNU/Linux Application Programming, AI Application Programming, and BSD Sockets Programming from a Multilanguage Perspective. His engineering background ranges from the development of kernels for geosynchronous spacecraft to embedded systems architecture and networking protocols development. Tim is a Consultant Engineer for Emulex Corp. in Longmont, Colorado.



