Skip to main content

skip to main content

developerWorks  >  Web development  >

Self-manage data buffer memory

Deliver code efficiency, simplicity, portability, and security

developerWorks
Document options

Document options requiring JavaScript are not displayed


Rate this page

Help us improve this content


Level: Intermediate

Xiaoming Zhang (zhang@uk.ibm.com), Staff Software Engineer, IBM UK Lab

06 Jan 2004

The C programming language defines two standard memory management functions: malloc() and free(). C programmers frequently use those functions to allocate buffers at run time to pass data between functions. In many situations, however, you cannot predetermine the actual sizes required for the buffers, which may cause several fundamental problems for constructing complex C programs. In this article, Xiaoming Zhang advocates a self-managing, abstract data buffer. He outlines a pseudo-C implementation of the abstract buffer and details the advantages of adopting this mechanism.

Software's scale and complexity grows all the time, fundamentally affecting how the architecture of applications. In many circumstances it seems impractical to code all functionality into a single piece of software. The importance for independent software pieces to interact, say in the form of a plug-in, becomes increasingly apparent. To achieve such interaction with relative ease, even among pieces written by different vendors, requires a well-defined interface. Writing software that meets such demands in a legacy programming language such as C can be a challenge.

Considering that challenge, this article examines the data buffer interface in the C programming language, with an eye toward improving upon current practices. Although memory management may appear trivial, a properly designed interface can produce code efficiency, simplicity, and portability -- none of which may be achievable otherwise. As such, the next section outlines the various problems programmers currently face when they adopt a conventional data buffer management scheme. That's followed by an abstract data buffer scheme, illustrated by a pseudo implementation, that solves many of those problems, followed by code snippets demonstrating the solution's benefits.

Conventional practices and the problems they present

C programmers frequently use dynamically allocated buffers (by calling malloc()/free() functions) to pass data between functions. Oftentimes, however, you cannot determine the size of such data until run time. Although that approach offers flexibility, it does have some impacts. First, it requires extra management work (allocation and freeing of memory block) wherever a buffer block is required. If the allocation and freeing cannot be done at the same code location, it is important to ensure a piece of memory block is freed once and only once when the block is no longer needed; otherwise, memory leaks or code crashes may result. Second, the buffer's size must be predetermined before a block can be allocated. However, you may not always find it easy to identify the data size. Frequently, developers adopt a conservative estimation of the maximum size, which may result in a significant waste of memory resources.

To avoid possible memory leaks and code crashes due to multiple freeing, good practice dictates you clearly predefine the program's section responsible for allocating buffer memory and the section responsible for freeing. However, defining responsibility causes other difficulties in practice. Because you must specify the size when a buffer is created under the conventional allocation scheme, a data provider, which presumably knows the size of the data it provides, is the best party to perform the buffer allocation operation. On the other hand, the best party for freeing is presumably the data consumer as it knows when the data is no longer needed. Frequently, data providers and data consumers are not the same.

When data providers and data consumers come from different software providers, the interacting parties may adopt or assume different underlying memory management mechanisms. For example, some software providers may choose a self-managed heap space, while others rely on the underlying operating system (OS) for such functionality. Moreover, different OSs may implement memory management differently. For example, the PalmOS provides two distinctive memory resources: heap-based and database-based. Generally speaking, different memory management mechanisms have their advantages and disadvantages, so you may not wish to assume a particular mechanism in advance. Different preferences may even lead to conflicting coding practices.

Three ways to resolve this problem are:

  • One of the interacting parties defines the underlying memory allocation mechanism for data exchange. The other party always uses the published interface to allocate or free buffers to avoid possible inconsistency. This model requires both parties to stick to a programming convention that may not be relevant to the software's basic functionality and, in general, can make the code less reusable.

  • The party driving the computation takes responsibility for management operations -- a relatively problem-free scheme when that party acts as the data provider. However, when that party acts as the data consumer, things become tricky. To avoid discovering data size, the consumer can allocate a buffer of an arbitrary size. Multiple calls to the data provider must be made if the provided data buffer is insufficiently large. This approach, therefore, requires coding an extra loop around the interaction call just in case multiple calls are required.

  • With the third option, the consumer becomes responsible for management operations. Here, however, the consumer must make an advanced call to discover the buffer size if the other party is the data provider -- placing an extra burden on the other party to code the logic for providing the size information, which may require the execution of a time-consuming algorithm. Also, this solution may introduce a serious efficiency problem: Suppose that function a() acquires data from function b(), which in turn, during its execution, acquires data from function c(). Assume that discovering the buffer sizes and providing the actual data require the execution of the same algorithm.

    To obtain a piece of data from b(), a() must make two calls: one for size discovery and one for obtaining the actual data. For each call that a() makes, b() must make two calls to c(). Therefore, by the end of this operation, the algorithm coded in c() may have been executed four times. In principle, the execution should occur only once.

Clearly, all three solutions present limitations and the conventional buffer memory management approach is not a good mechanism for coding large-scale interacting software.

Besides the aforementioned difficulties, security proves an issue here: The conventional buffer management scheme cannot easily prevent a malicious user from deliberately overwriting a data buffer, causing program misbehavior. Considering all that, a properly designed data buffer interface is required!



Back to top


What is the solution?

In the last section, you saw how the conventional buffer management scheme produces several problems. In contrast, when you make an abstract data buffer, the solution becomes simple.

Conceptually, two operations create a data buffer under the conventional scheme: the creation of a data buffer entity and the allocation of actual memory. In fact, however, you needn't allocate actual memory until the actual data becomes available -- you can separate the two operations.

You can initially create an abstract buffer with an empty linked list of memory block. The abstract data buffer allocates the memory blocks when and only when the actual data becomes available. Freeing the memory becomes the responsibility of the abstract data buffer as well. Considering all that, centralizing memory management and data copying operations produces the following advantages:

  • Either party can construct and/or destroy an abstract data buffer by calling predefined API functions.
  • Memory usage is kept near optimal as buffer memory becomes allocated only when necessary and is freed as soon as possible, thus minimizing memory leaks.
  • No party needs to know the underlying memory management scheme, making the software highly portable with guaranteed compatibility between interacting parties.
  • Because no party needs to manage memory, buffer-size identification becomes unnecessary (hence there's no possibility of the multi-execution problem identified earlier).
  • Buffer overflow proves impossible as data copies only when room for extra data exists.

A simple implementation

To represent an abstract data buffer, declare two structured data types:


Listing 1. Declare two structured data types to represent an abstract data buffer
typedef struct BufferBlockHeader_st BufferBlockHeader;

struct BufferBlockHeader_st {
   BufferBlockHeader * pNextBlock;
};

struct Buffer_st {
   long  int             totalLength;
   BufferBlockHeader *   pFirstBlock;
   short int             startPoint;
   BufferBlockHeader *   pLastBlock;
   short int             endPoint;
};

typedef struct Buffer_st Buffer;

The Buffer holds the information about a created abstract buffer, and it manages a linked list of memory blocks:

  • The totalLoength records the number of bytes currently stored in the buffer.
  • The pFirstBlock points to the first memory block of the linked list.
  • The startPoint records the first byte's offset location into the first memory block.
  • The pLostBlock points to the linked list's last memory block.
  • The endPoint records the first free byte's offset location into the last memory block.

You can introduce an additional parameter into Buffer to specify each memory block's size, and you can set it to a preferred value during the initialization of an abstract buffer. Here, a default block size is assumed.

The pNextBlock in the BufferBlockHeader structure always points to the next memory block in the linked list, if allocated. Each memory block, when allocated, consists of a BufferBlockHeader header followed by a buffer block for storage of actual data.

Figure 1 illustrates an abstract buffer when stored with some data.


Figure 1. Data structure of an abstract buffer
Graphical representation of relations

Let M denote the Buffer's size (which is normally 20 bytes) and B denote the chosen size of a memory block. The overhead averages about (M+B) bytes (ignoring the pointers at the beginning of each memory block). The B in (M+B) results because, on average, only half of the first and half of the last memory block are used. This overhead is almost constant.

You must explicitly create the abstract buffer by calling the following newBuffer() function before data can be buffered:


Listing 2 Create an abstract buffer with the newBuffer() function
Buffer * newBuffer() {
   allocate a Buffer structure;
   initialize the structure;
}

In Listing 2, the function allocates a piece of memory containing a Buffer and initializes its entries to indicate an empty abstract buffer.

Correspondingly, you must destruct an abstract buffer after use by calling the following freeBuffer() function:


Listing 3. Destruct a buffer with the freeBuffer() function
void freeBuffer(Buffer * pBuffer    /* pointer to the buffer to be freed */
	       ) {
   while (there is more memory block in the linked list) {
      free the next memory block;
   }
   free the Buffer structure;
}

The function in Listing 3 frees any memory blocks in the linked list, then frees the Buffer allocated by newBuffer().

To incrementally append data segments to an abstract buffer, use the following function:


Listing 4. Incrementally append data segments to an abstract buffer
long int appendData(Buffer * pBuffer,    /* pointer to the abstract buffer */
             byte *   pInput,    /* pointer to the data source */
             long int offset,    /* offset of the input data */
             long int dataLength    /* number of bytes of the input data */
               ) {
   while (there is more input data) {
      fill the current memory block;
      if (there is more input data) {
         allocate a new memory block and add it into the linked list;
      }
   }      
}

The function in Listing 4 copies bytes stored in pInput[offset..offset+dataLength] into the abstract buffer pointed by pBuffer, inserting new memory blocks into the linked list if necessary, and returns the number of bytes successfully copied into the abstract buffer.

In a similar fashion, you can read data segments off segment by segment from an abstract buffer with the following function:


Listing 5. Read data by segments from an abstract buffer
long int readData(Buffer * pBuffer,    /* pointer to the abstract buffer */
           byte *   pOutput,    /* pointer to the output byte array */
           long int offset,    /* offset of the output byte array */
           long int arrayLength    /* size of available output byte array */
             ) {
   while (there is something more to read and there is room for output) {
      read from the first memory block;
      if (the first memory block is empty) {
         delete the first memory block from the linked list and free its memory;
      }
   }
}

In Listing 5, the function destructively reads off up to arrayLength leading bytes from the abstract buffer pointed by pBuffer, deleting memory blocks from the linked list when they become empty, and returns the number of bytes successfully read.

If required, you can implement a function similar to readData() to allow a nondestructive read.

You may benefit by implementing a function that returns the number of bytes currently stored in an abstract buffer:


Listing 6. Return the number of bytes in an abstract buffer
long int bytesAvailable(Buffer * pBuffer  /* pointer to the abstract buffer */
		       ) {
   return totalLength;
}



Back to top


Advantages of using the abstract buffer

You saw, earlier, several difficult areas associated with the conventional buffer scheme. As an alternative, by centralizing the memory management and data copying operations, the proposed abstract buffer immediately eliminates the possibilities of inconsistent memory management and buffer overflow. It also leads to simpler coding and avoids the possible multi-execution problem identified earlier. To better understand the solution, let's see an example using pseudo-code, first with conventional approaches, followed by the centralized solution.

Simpler coding than the fixed-size buffer approach

Suppose function a() gets input data from function b() but does not know function b()'s size. You can avoid querying the input size by letting a() allocate a fixed size buffer and iteratively call b() until b() indicates the end of the input data:


Listing 7. Allocate a fixed size buffer and call input data
int b(byte *buf, int bufSize) {
   fill buf;
   return size of output;
}

void a() {

   byte * buf = malloc(BUFFER_SIZE);
   int size;

   if (NULL != buf) {
      while (there is more data from b()) {
         size = b(buf, BUFFER_SIZE);
         process data in buf;
      }
      free(buf);
   }

}

By using the abstract buffer, simplify the code to:


Listing 8. Call input data for an an abstract buffer
void b(Buffer *buf) {
   fill buf;
}

void a() {

   Buffer * buf = newBuffer();
   if (NULL != buf) {
      b(buf);
      process data in buf;
      freeBuffer(buf);
   }

}

No multi-execution compared with the size discovery approach

Again, suppose function a() gets input data from function b() but does not know b()'s size. To allocate a sufficiently large buffer for b(), a() must make an advanced discovery call to b() (presumably, only b() knows about the size). a() would look something like:


Listing 9. Make advanced discovery calls
int b(byte *buf, int bufSize) {
   if (NULL != buf) {
      fill buf;
   }
   return size of output;
}

void a() {

   int size = b(NULL, 0);
   byte * buf = malloc(size);
   if (NULL != buf) {
      b(buf, size);
      process data in buf;
      free(buf);
   }
}

Notice that a() calls b() twice.

By using the abstract buffer, you can write the code as:


Listing 10. Make a single discovery call for an abstract buffer
void b(Buffer *buf) {
   fill buf;
}

void a() {

   Buffer * buf = newBuffer();
   if (NULL != buf) {
      b(buf);
      process data in buf;
      freeBuffer(buf);
   }

}

Which only calls b() once.



Back to top


In conclusion

This article examined problems produced when two C functions interact using a conventional data buffer management scheme. Such problems may become major issues when coding large-scale interacting software. As an alternative, a self-managing abstract data buffer can combat those problems. Implementing this proposed abstract data buffer should be a relatively trivial task for an average C programmer.

To benefit from the solution, you must clearly define a lasting concrete abstract data buffer interface. Adopting such an interface will simplify future code development. However, treat existing code migration to the interface with caution, using a case-by-case analysis weighing the costs/benefits.



Resources

  • Check out the classic C programming manual: The C Programming Language by Brian W. Kernighan and Dennis M. Ritchie (Prentice Hall PTR, 1988).

  • Find an extensive coverage of C memory management techniques in C Memory Management Techniques by Len Dorfman and Marc J. Neuberger (McGraw-Hill Osborne Media, 1992).

  • Get an explanation of Palm OS memory management in Palm Programming by Glenn Bachmann (Sams Publishing, 1999).

  • developerWorks Web Architecture zone: Expand your site development skills with articles and tutorials that specialize in Web technologies.


About the author

Staff Software Engineer Xiaoming Zhang, having spent nine years as an academic carrying out research, gave up his university lectureship to join IBM in 1998. Since then, he spends most of his time designing and developing messaging middleware targeted for small devices. He is also particularly interested in the data security aspects of messaging. Xiaoming, who holds a PhD in computer-speech signal processing from the University of Wales Swansea, has published nearly thirty conference and journal papers covering speech signal processing, parallel numerical computation, application of functional programming, and image processing.




Rate this page


Please take a moment to complete this form to help us better serve you.



 


 


Not
useful
Extremely
useful
 


Share this....

digg Digg this story del.icio.us del.icio.us Slashdot Slashdot it!



Back to top