Application developers often face the common problem of initializing memory contents to a zero value. Sometimes a simple algorithm suffices for the purpose of the application. However, many times, the memory initialization routines are invoked frequently over large memory areas. In such cases, those libraries can become a bottleneck.
Power Architecture family processors offer various alternatives to approach this problem. This article explores some of them and presents performance information that illustrates the cost of each alternative depending on the size of the memory area that needs to be initialized.
A first approach that developers usually take toward this problem is to write a simple loop that iterates over all the memory locations to be initialized.
Listing 1. Initializing memory the easy way
void Trivial1(void* ptr, unsigned long length)
{
for (unsigned long i=0; i < length; i += sizeof(char))
*(char*)(ptr+i) = 0;
}
|
When the program requires memory initialization frequently, you want to find ways to optimize this loop. One alternative is to iterate over integer values (four bytes long) instead of char values. This is viable depending on the length of the data area and how it is aligned in memory, but the simple version presented here can run slightly past the end of a buffer.
Listing 2. Initializing memory slightly faster
#define UINT unsigned int
void Trivial4(void* ptr, unsigned long length)
{
unsigned char *p = ptr;
for (unsigned long i=0; i < length; i += sizeof(UINT))
*((UINT *)(p+i)) = 0;
}
|
Other alternatives are to use long long (eight bytes) or even floating point types. In the case of storing floating point values, using a 0.0f value will not suffice. You have no guarantee that 0.0f will be represented by zeroes in memory. In this case, you can use an assembly instruction to move a real zero value to a floating point register, and then store this value in memory.
C libraries provide their own functions to initialize memory contents.
The memset function (a replacement for the
now-deprecated bzero) does the same thing as
the loops above, in a more elegant manner.
Listing 3. Having someone else initialize memory
void ZeroMem(void *ptr, size_t length)
{
memset(ptr, 0, length);
}
|
While memset is very easy to use, you might find a more efficient
alternative, depending on how memset is implemented by the host
platform.
Instead of transferring single bytes or words, it is more efficient for the cache and main memory to exchange a larger amount of data. This larger unit is called a cache line or cache block, and it is defined as the smallest unit that is transferred between main memory and the cache. Note that a cache line is read and cached as a whole. Caches take advantage of the principle of locality of reference -- meaning, if one memory location is read, it is very likely that following memory locations are going to be accessed afterwards so it won't be specific to the technique but to how caches work.
On Power Architecture processors, the dcbz (data cache block zero)
instruction initializes a data cache line. The instruction
syntax is as follows:
Listing 4. Clearing a single cache line
dcbz r1, r2 |
If r1 is not 0, dcbz operates on the effective address computed as
r1+r2; if r1 is 0, it operates on the address residing on r2.
The values in the whole data cache line that contains the address are set to zero. This means that all bytes preceding the computed address up to the beginning of the data cache line, and all bytes following the address up to the end of the cache line, will be initialized to 0.
For example, suppose the cache line is 64 bytes. If r1 is 0 and r2
contains the address 0x8000005E (shown in bold below), calling dcbz r1,r2 transforms the memory block:
Listing 5. Memory about to be cleared
0x80000030: 2ff22a7b 2ff22a89 2ff22a9e 2ff22ab3 0x80000040: 2ff22ae1 2ff22aff 2ff22cf7 2ff22d10 0x80000050: 2ff22d1f 2ff22d4d 2ff22d6c 2ff22d9f 0x80000060: 2ff22dba 2ff22ddd 2ff22de4 2ff22def 0x80000070: 2ff22e18 2ff22e39 2ff22e41 2ff22e55 0x80000080: 2ff22e18 2ff22e39 2ff22e41 2ff22e55 0x80000090: 2ff22e18 2ff22e39 2ff22e41 2ff22e55 |
into this one:
Listing 6. The whole cache line has been zeroed
0x80000030: 2ff22a7b 2ff22a89 2ff22a9e 2ff22ab3 0x80000040: 00000000 00000000 00000000 00000000 0x80000050: 00000000 00000000 00000000 00000000 0x80000060: 00000000 00000000 00000000 00000000 0x80000070: 00000000 00000000 00000000 00000000 0x80000080: 2ff22e18 2ff22e39 2ff22e41 2ff22e55 0x80000090: 2ff22e18 2ff22e39 2ff22e41 2ff22e55 |
where the entire cache line was set to the value zero.
One interesting attribute of dcbz is that if the memory address to
initialize is not present in the data cache, a block is established
and set to zero without the need to access memory. Also, note that dcbz deals with virtual addresses, and when the cache
is flushed, main memory will reflect that those memory areas were
changed.
You can use the dcbz instruction in conjunction
with dcbst (data cache block store),
which copies the contents of a modified data cache block from
the cache to main memory. This is useful for ensuring that the
initialized lines are visible to main memory immediately. In cases
such as DMA usage, the use of sync (synchronize)
or eieio (enforce in-order execution of I/O) might also be required, because these
ensure that the store will complete before continuing with the
execution of the program. For more information about the Power Architecture
storage model, see Resources.
Cache line sizes differ between Power Architecture implementations. Therefore,
when issuing dcbz instructions to initialize memory contents, the
program needs to obtain the size of the cache line first, so it
can calculate how much memory it will be initializing for each dcbz
instruction.
One way to do that on AIX® is:
Listing 7. Identifying the cache line size
void dcbz(char *);
#pragma mc_func dcbz {"7c001fec"} /* dcbz, 0, r3 */
#pragma reg_killed_by dcbz
static UINT getCacheLineSize()
{
char buf[1024];
UINT i, ppcCacheLineSize;
memset(buf, 255, 1024);
dcbz(&buf[512]);
for (i = 0, ppcCacheLineSize = 0; i < 1024; i++)
if (buf[i] == 0) ppcCacheLineSize++;
return ppcCacheLineSize;
}
|
While the above code sample only works on AIX, it's easy enough to do the same thing on other operating systems. For instance, on Linux®, the call to the dcbz "function" can be replaced with this line:
__asm__ __volatile__("dcbz 0,%0" : /* no outputs */ : "r"
(&buf[512]));
Because you have no guarantee what alignment the buffer will have,
the example uses a sufficiently large array for it. First, all the
buffer bits are set to 1. Then, dcbz is called pointing to the
middle of the buffer. Finally, getCachLineSize counts the number
of slots that contain a 0.
Notice that cache line size verification is made only once, usually during program initialization. The result can be stored in a static variable, so there is no need to repeat the process every time the program needs to know the cache line size.
The dcbz instruction is non-privileged. With respect to protection,
dcbz is treated as a store to the addressed cache. However, it has
high latency and cycle count to next instruction. Because the dcbz instruction
takes a long time to execute (compared to simpler instructions), you should ensure that the benefit of targeting an
entire cache line will outweigh the cost of issuing such instruction.
The experimental evaluation shows that from memory regions larger
than 256 bytes (two cache lines), dcbz is very efficient.
Following is a first approach to using dcbz:
Listing 8. Clearing whole blocks of memory
#ifdef PPC32
#define PTR_T_SIZE UINT
#elif defined PPC64
#define PTR_T_SIZE ULONG
#else
#error PPC32 or PPC64 must be defined
#endif
void DCBZ1(void *ptr, unsigned long length)
{
static PTR_T_SIZE cacheLineSize = 0;
char *addr = ptr;
char *limit;
/* one-time-only calculation of cache line size */
if (!cacheLineSize)
cacheLineSize = getCacheLineSize();
/* zero any initial portion to first cache line boundary */
if ((PTR_T_SIZE)addr & (cacheLineSize-1)) {
limit = (char*) (length < cacheLineSize ?
((PTR_T_SIZE)addr+length) :
(((PTR_T_SIZE)addr + cacheLineSize) & ~(cacheLineSize-1)));
/* code assumes ptr will have UINT alignment! */
/* xlc -O3 will unroll this loop */
for (; addr < limit; addr += sizeof(UINT))
*((UINT *)addr) = 0;
}
/* dcbz full cache lines */
/* dcbz forms a group on POWER4, so there is no reason to unroll */
limit = (char *) (((PTR_T_SIZE)ptr + length) & ~(cacheLineSize-1));
for (; addr < limit; addr += cacheLineSize)
dcbz(addr);
/* zero final portion smaller than a cache line */
limit = (char *) ((PTR_T_SIZE)ptr + length);
if (addr < limit) {
/* code assumes length will be a multiple of sizeof(UINT)! */
/* xlc -O3 will unroll this loop */
#pragma execution_frequency(very_high)
for (; addr < limit; addr += sizeof(UINT))
*((UINT *)addr) = 0;
}
}
|
First, this algorithm initializes any memory portion up to the
cache line boundary. Then, dcbz is invoked to continue the
initialization in cache-size segments. Finally, it initializes
any remaining portion smaller than the cache line size.
Note that dcbz is used for initializing memory to zero
only. If the application requires an initial value different than
0, such as 0xDEADBEEF (which is often used to denote uninitialized memory
regions), then the available options are limited to use of memset
or a simple loop.
It is possible to issue two, four, or more consecutive dcbz instructions
(each targeting a different cache line). For some memory sizes, this
proves to be an advantage, as the results from the experimental
evaluation will show below.
Malloc disclaim is an optional extension to malloc. It informs the operating system that part of the pages in the virtual address space will no longer be needed, so the system can stop paging such memory area and the pages can be freed. If that virtual memory location is accessed again, the operating system has to create a zeroed page of memory and map it to the virtual address space, which is returned for the use of the application. This is useful in instances where a process has a high paging-space usage, but is not actually using the memory.
Listing 9. Using the disclaim function
#include <sys/shm.h>
#include <unistd.h>
void Disclaim(void *ptr, unsigned long length)
{
char *limit = ptr+length;
char *addr = ptr;
int ps = getpagesize();
/* zero any initial portion to first page boundary */
if ((PTR_T_SIZE)addr & (ps-1)) {
limit = (char*) (length < ps ? ((PTR_T_SIZE)addr+length) :
(((PTR_T_SIZE)addr + ps) & ~(ps-1)));
/* code assumes ptr will have UINT alignment! */
/* xlc -O3 will unroll this loop */
for (; addr < limit; addr += sizeof(UINT))
*((UINT *)addr) = 0;
}
/* disclaim all remaining pages */
limit = (char *) (((PTR_T_SIZE)ptr + length) & ~(ps-1));
if (addr < limit) {
unsigned long number_pages = (limit-addr)/ps;
disclaim(addr, number_pages * ps, ZERO_MEM);
addr += number_pages * ps;
}
/* zero final portion smaller than a page size */
limit = (char*) ((PTR_T_SIZE)ptr + length);
if (addr < limit) {
/* code assumes length will be a multiple of sizeof(UINT)! */
/* xlc -O3 will unroll this loop */
#pragma execution_frequency(very_high)
for (; addr < limit; addr += sizeof(UINT))
*((UINT *)addr) = 0;
}
}
|
Malloc disclaim is not enabled by default. You can find more information about this routine on AIX operating system and extensions documentation (see Resources).
This section presents the performance results from the use of the algorithms described above (and some variations of them).
All tests were compiled with optimization level of -O3, using the IBM XL C compiler, and executed on AIX 5.3 running on a p630 POWER4+™ microprocessor, with a 128-byte cache line.
The results are presented as relative to the execution of the Trivial algorithm with a size of four bytes (int data type), over 1000 iterations. Results for size 1 (char) are not presented since they show a very bad performance on all cases due to an unnatural way of initializing the memory contents. In all cases, the results for a 32-bit application are similar to those for a 64-bit application.
Figure 1. Performance for memory regions smaller than 2K
The experiments show that using a simple algorithm
for small memory regions is beneficial. Using dcbz begins to pay off around memory
areas 256 bytes long or larger (two cache lines).
Figure 2. Performance for memory regions larger than 2K
As you can observe, issuing four dcbz instructions is the optimal
for memory sizes ranging from 2K to about 1M. At that point, disclaim
becomes the most efficient mechanism.
This article presented several algorithms for initializing memory contents to 0, together with the performance impact of using each of them.
Both dcbz and disclaim are very useful
resources for initializing
memory contents to 0. However, note that if the program requires
that memory be initialized to a value other than 0, then the best that
can be done is the simple approach, like the algorithms presented in the
beginning, and perhaps with some minor variations for the sake of
performance.
The author would like to express his gratitude to Brian Hall of IBM for all the interesting discussions about these issues, in particular about the use of disclaim.
Learn
-
Find a
good starting reference for the Power Architecture storage model on developerWorks.
-
The disclaim feature of the AIX malloc implementation is documented in the
AIX online documentation.
-
The
IBM Semiconductor solutions technical library
Semiconductor solutions technical library
hosts a wealth of information -- from
specifications and user manuals, to product briefs and errata, and much more.
-
Keep abreast of all the Power Architecture-related news: Subscribe to the Power
Architecture Community Newsletter.
Get products and technologies
-
Meet the IBM
family of C
and C++ Compilers: they boast advanced compiler and optimization technologies and are built on a
common code base allowing for easier porting of your applications between
platforms.
-
Meet AIX, the IBM UNIX
platform: rumors of its demise and of its having originated at
the hands of space aliens are greatly exaggerated (rumors of its
innovative virtualization, micro-partitioning, and advanced capabilities
are all true).
-
See all
Power Architecture-related
downloads on one page.
Discuss
- Participate in the discussion forum.
-
Take part in the IBM developerWorks Power Architecture technology forums.
-
Send a letter to the editor.
Comments (Undergoing maintenance)




