Skip to main content

Initializing memory efficiently on Power Architecture platforms

Clearing cache lines considered beneficial

Carlos Cavanna (ccavanna@ca.ibm.com), Software Developer, IBM Software Group, Application and Integration Middleware
Carlos Cavanna is a Software Developer in IBM Compilers Group, specializing in Java JIT (Java Just-In-Time) compilation for the PowerPC Architecture.

Summary:  Learn to efficiently initialize memory on Power Architecture™ systems. Software Developer Carlos Cavanna compares simple loops clearing one bit at a time to more elaborate algorithms, including the dcbz instruction to zero whole cache lines at a time. The article concludes with some rough performance numbers to help you tune your own applications.

Date:  23 May 2006
Level:  Introductory
Activity:  2765 views

Application developers often face the common problem of initializing memory contents to a zero value. Sometimes a simple algorithm suffices for the purpose of the application. However, many times, the memory initialization routines are invoked frequently over large memory areas. In such cases, those libraries can become a bottleneck.

Power Architecture family processors offer various alternatives to approach this problem. This article explores some of them and presents performance information that illustrates the cost of each alternative depending on the size of the memory area that needs to be initialized.

A simple approach

A first approach that developers usually take toward this problem is to write a simple loop that iterates over all the memory locations to be initialized.


Listing 1. Initializing memory the easy way

void Trivial1(void* ptr, unsigned long length)
{
   for (unsigned long i=0; i < length; i += sizeof(char))
      *(char*)(ptr+i) = 0;
}

When the program requires memory initialization frequently, you want to find ways to optimize this loop. One alternative is to iterate over integer values (four bytes long) instead of char values. This is viable depending on the length of the data area and how it is aligned in memory, but the simple version presented here can run slightly past the end of a buffer.


Listing 2. Initializing memory slightly faster

#define UINT  unsigned int

void Trivial4(void* ptr, unsigned long length)
{
   unsigned char *p = ptr;
   for (unsigned long i=0; i < length; i += sizeof(UINT))
      *((UINT *)(p+i)) = 0;
}

Other alternatives are to use long long (eight bytes) or even floating point types. In the case of storing floating point values, using a 0.0f value will not suffice. You have no guarantee that 0.0f will be represented by zeroes in memory. In this case, you can use an assembly instruction to move a real zero value to a floating point register, and then store this value in memory.

Memset

C libraries provide their own functions to initialize memory contents. The memset function (a replacement for the now-deprecated bzero) does the same thing as the loops above, in a more elegant manner.


Listing 3. Having someone else initialize memory

void ZeroMem(void *ptr, size_t length)
{
        memset(ptr, 0, length);
}

Initializing cache lines

While memset is very easy to use, you might find a more efficient alternative, depending on how memset is implemented by the host platform.

Instead of transferring single bytes or words, it is more efficient for the cache and main memory to exchange a larger amount of data. This larger unit is called a cache line or cache block, and it is defined as the smallest unit that is transferred between main memory and the cache. Note that a cache line is read and cached as a whole. Caches take advantage of the principle of locality of reference -- meaning, if one memory location is read, it is very likely that following memory locations are going to be accessed afterwards so it won't be specific to the technique but to how caches work.

On Power Architecture processors, the dcbz (data cache block zero) instruction initializes a data cache line. The instruction syntax is as follows:


Listing 4. Clearing a single cache line

dcbz r1, r2

If r1 is not 0, dcbz operates on the effective address computed as r1+r2; if r1 is 0, it operates on the address residing on r2.

The values in the whole data cache line that contains the address are set to zero. This means that all bytes preceding the computed address up to the beginning of the data cache line, and all bytes following the address up to the end of the cache line, will be initialized to 0.

For example, suppose the cache line is 64 bytes. If r1 is 0 and r2 contains the address 0x8000005E (shown in bold below), calling dcbz r1,r2 transforms the memory block:


Listing 5. Memory about to be cleared

0x80000030:  2ff22a7b 2ff22a89 2ff22a9e 2ff22ab3
0x80000040:  2ff22ae1 2ff22aff 2ff22cf7 2ff22d10
0x80000050:  2ff22d1f 2ff22d4d 2ff22d6c 2ff22d9f
0x80000060:  2ff22dba 2ff22ddd 2ff22de4 2ff22def
0x80000070:  2ff22e18 2ff22e39 2ff22e41 2ff22e55
0x80000080:  2ff22e18 2ff22e39 2ff22e41 2ff22e55
0x80000090:  2ff22e18 2ff22e39 2ff22e41 2ff22e55

into this one:


Listing 6. The whole cache line has been zeroed

0x80000030:  2ff22a7b 2ff22a89 2ff22a9e 2ff22ab3
0x80000040:  00000000 00000000 00000000 00000000
0x80000050:  00000000 00000000 00000000 00000000
0x80000060:  00000000 00000000 00000000 00000000
0x80000070:  00000000 00000000 00000000 00000000
0x80000080:  2ff22e18 2ff22e39 2ff22e41 2ff22e55
0x80000090:  2ff22e18 2ff22e39 2ff22e41 2ff22e55

where the entire cache line was set to the value zero.

One interesting attribute of dcbz is that if the memory address to initialize is not present in the data cache, a block is established and set to zero without the need to access memory. Also, note that dcbz deals with virtual addresses, and when the cache is flushed, main memory will reflect that those memory areas were changed.

You can use the dcbz instruction in conjunction with dcbst (data cache block store), which copies the contents of a modified data cache block from the cache to main memory. This is useful for ensuring that the initialized lines are visible to main memory immediately. In cases such as DMA usage, the use of sync (synchronize) or eieio (enforce in-order execution of I/O) might also be required, because these ensure that the store will complete before continuing with the execution of the program. For more information about the Power Architecture storage model, see Resources.

An example using dcbz

Cache line sizes differ between Power Architecture implementations. Therefore, when issuing dcbz instructions to initialize memory contents, the program needs to obtain the size of the cache line first, so it can calculate how much memory it will be initializing for each dcbz instruction.

One way to do that on AIX® is:


Listing 7. Identifying the cache line size

void dcbz(char *);
#pragma mc_func dcbz  {"7c001fec"}  /* dcbz, 0, r3 */
#pragma reg_killed_by dcbz

static UINT getCacheLineSize()
{
        char buf[1024];
        UINT i, ppcCacheLineSize;

        memset(buf, 255, 1024);
        dcbz(&buf[512]);
        for (i = 0, ppcCacheLineSize = 0; i < 1024; i++)
                if (buf[i] == 0) ppcCacheLineSize++;
        return ppcCacheLineSize;
}

While the above code sample only works on AIX, it's easy enough to do the same thing on other operating systems. For instance, on Linux®, the call to the dcbz "function" can be replaced with this line:

__asm__ __volatile__("dcbz 0,%0" : /* no outputs */ : "r" (&buf[512]));

Because you have no guarantee what alignment the buffer will have, the example uses a sufficiently large array for it. First, all the buffer bits are set to 1. Then, dcbz is called pointing to the middle of the buffer. Finally, getCachLineSize counts the number of slots that contain a 0.

Notice that cache line size verification is made only once, usually during program initialization. The result can be stored in a static variable, so there is no need to repeat the process every time the program needs to know the cache line size.

The dcbz instruction is non-privileged. With respect to protection, dcbz is treated as a store to the addressed cache. However, it has high latency and cycle count to next instruction. Because the dcbz instruction takes a long time to execute (compared to simpler instructions), you should ensure that the benefit of targeting an entire cache line will outweigh the cost of issuing such instruction. The experimental evaluation shows that from memory regions larger than 256 bytes (two cache lines), dcbz is very efficient.

Following is a first approach to using dcbz:


Listing 8. Clearing whole blocks of memory

#ifdef PPC32
#define PTR_T_SIZE UINT
#elif defined PPC64
#define PTR_T_SIZE ULONG
#else
#error PPC32 or PPC64 must be defined
#endif


void DCBZ1(void *ptr, unsigned long length)
{
   static PTR_T_SIZE cacheLineSize = 0;
   char *addr = ptr;
   char *limit;

   /* one-time-only calculation of cache line size */
   if (!cacheLineSize)
     cacheLineSize = getCacheLineSize();

   /* zero any initial portion to first cache line boundary */
   if ((PTR_T_SIZE)addr & (cacheLineSize-1)) {
     limit = (char*) (length < cacheLineSize ?
                ((PTR_T_SIZE)addr+length) :
                (((PTR_T_SIZE)addr + cacheLineSize) & ~(cacheLineSize-1)));
      /* code assumes ptr will have UINT alignment! */
      /* xlc -O3 will unroll this loop */
      for (; addr < limit; addr += sizeof(UINT))
         *((UINT *)addr) = 0;
   }

   /* dcbz full cache lines */
   /* dcbz forms a group on POWER4, so there is no reason to unroll */
   limit = (char *) (((PTR_T_SIZE)ptr + length) & ~(cacheLineSize-1));
   for (; addr < limit; addr += cacheLineSize)
      dcbz(addr);

   /* zero final portion smaller than a cache line */
   limit = (char *) ((PTR_T_SIZE)ptr + length);
   if (addr < limit) {
      /* code assumes length will be a multiple of sizeof(UINT)! */
      /* xlc -O3 will unroll this loop */
#pragma execution_frequency(very_high)
      for (; addr < limit; addr += sizeof(UINT))
         *((UINT *)addr) = 0;

   }
}



First, this algorithm initializes any memory portion up to the cache line boundary. Then, dcbz is invoked to continue the initialization in cache-size segments. Finally, it initializes any remaining portion smaller than the cache line size.

Note that dcbz is used for initializing memory to zero only. If the application requires an initial value different than 0, such as 0xDEADBEEF (which is often used to denote uninitialized memory regions), then the available options are limited to use of memset or a simple loop.

It is possible to issue two, four, or more consecutive dcbz instructions (each targeting a different cache line). For some memory sizes, this proves to be an advantage, as the results from the experimental evaluation will show below.

Malloc disclaim

Malloc disclaim is an optional extension to malloc. It informs the operating system that part of the pages in the virtual address space will no longer be needed, so the system can stop paging such memory area and the pages can be freed. If that virtual memory location is accessed again, the operating system has to create a zeroed page of memory and map it to the virtual address space, which is returned for the use of the application. This is useful in instances where a process has a high paging-space usage, but is not actually using the memory.


Listing 9. Using the disclaim function

#include <sys/shm.h>
#include <unistd.h>

void Disclaim(void *ptr, unsigned long length)
{
   char *limit = ptr+length;
   char *addr = ptr;
   int ps = getpagesize();

   /* zero any initial portion to first page boundary */
   if ((PTR_T_SIZE)addr & (ps-1)) {
      limit = (char*) (length < ps ? ((PTR_T_SIZE)addr+length) :
      (((PTR_T_SIZE)addr + ps) & ~(ps-1)));
      /* code assumes ptr will have UINT alignment! */
      /* xlc -O3 will unroll this loop */
      for (; addr < limit; addr += sizeof(UINT))
         *((UINT *)addr) = 0;
   }

   /* disclaim all remaining pages */
   limit = (char *) (((PTR_T_SIZE)ptr + length) & ~(ps-1));
   if (addr < limit) {
      unsigned long number_pages = (limit-addr)/ps;
      disclaim(addr, number_pages * ps, ZERO_MEM);
      addr += number_pages * ps;
   }

   /* zero final portion smaller than a page size */
   limit = (char*) ((PTR_T_SIZE)ptr + length);
   if (addr < limit) {
      /* code assumes length will be a multiple of sizeof(UINT)! */
      /* xlc -O3 will unroll this loop */
#pragma execution_frequency(very_high)
      for (; addr < limit; addr += sizeof(UINT))
         *((UINT *)addr) = 0;

   }
}

Malloc disclaim is not enabled by default. You can find more information about this routine on AIX operating system and extensions documentation (see Resources).


Experimental evaluation

This section presents the performance results from the use of the algorithms described above (and some variations of them).

All tests were compiled with optimization level of -O3, using the IBM XL C compiler, and executed on AIX 5.3 running on a p630 POWER4+™ microprocessor, with a 128-byte cache line.

The results are presented as relative to the execution of the Trivial algorithm with a size of four bytes (int data type), over 1000 iterations. Results for size 1 (char) are not presented since they show a very bad performance on all cases due to an unnatural way of initializing the memory contents. In all cases, the results for a 32-bit application are similar to those for a 64-bit application.


Figure 1. Performance for memory regions smaller than 2K

The experiments show that using a simple algorithm for small memory regions is beneficial. Using dcbz begins to pay off around memory areas 256 bytes long or larger (two cache lines).


Figure 2. Performance for memory regions larger than 2K

As you can observe, issuing four dcbz instructions is the optimal for memory sizes ranging from 2K to about 1M. At that point, disclaim becomes the most efficient mechanism.

Conclusions

This article presented several algorithms for initializing memory contents to 0, together with the performance impact of using each of them.

Both dcbz and disclaim are very useful resources for initializing memory contents to 0. However, note that if the program requires that memory be initialized to a value other than 0, then the best that can be done is the simple approach, like the algorithms presented in the beginning, and perhaps with some minor variations for the sake of performance.

Acknowledgements

The author would like to express his gratitude to Brian Hall of IBM for all the interesting discussions about these issues, in particular about the use of disclaim.


Resources

Learn

Get products and technologies

  • Meet the IBM family of C and C++ Compilers: they boast advanced compiler and optimization technologies and are built on a common code base allowing for easier porting of your applications between platforms.

  • Meet AIX, the IBM UNIX platform: rumors of its demise and of its having originated at the hands of space aliens are greatly exaggerated (rumors of its innovative virtualization, micro-partitioning, and advanced capabilities are all true).

  • See all Power Architecture-related downloads on one page.

Discuss

About the author

Carlos Cavanna is a Software Developer in IBM Compilers Group, specializing in Java JIT (Java Just-In-Time) compilation for the PowerPC Architecture.

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Multicore acceleration
ArticleID=123092
ArticleTitle=Initializing memory efficiently on Power Architecture platforms
publish-date=05232006
author1-email=ccavanna@ca.ibm.com
author1-email-cc=dwpower@us.ibm.com

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Special offers