SystemAdmin
196 Posts

Pinned topic Using the zArchitecture prefetch bif with queues (z10 and newer)

2011-06-20T23:35:57Z
In a previous blog/forum entry I provided an example of using a prefetch bif with an array to improve performance on a z10 or z196 machine.
Both entries build on an earlier blog entry, which discusses how to use such bifs in binaries that must run on any machine, including machines that are not z10s or z196s.

The basic principle of using prefetch instructions is to issue them early enough for the requested memory to actually be brought into cache, and to do so in places where data cache misses are actually a problem. If d-cache misses are not occurring, then adding the bif increases the overhead of that piece of code for no benefit. If d-cache misses are a problem but the prefetch is issued too close to the actual reference, the data will not arrive in time and there is again no net benefit.

However, the cost of a d-cache miss is so high that even if only a small fraction of prefetch calls are successful it can be a net win, so it is worth investigating their use.

This entry deals with using prefetches with a queue. Previously, an array was presented. An array is an attractive target for prefetching because we know a contiguous piece of memory is being used, and valid prefetch addresses can be calculated from the current reference. If the array is being accessed in a predictable manner, then calculating an effective prefetch address can be straightforward. An effective prefetch address is one that will not be requested until the prefetch instruction has had enough time to bring it into cache.

Other data structures such as trees, linked lists and queues that are implemented with pointers can be more challenging to use with prefetch instructions. There is no guarantee that the 'next' pointer references memory that is already in the cache, or even in the same page. More importantly, there is generally no way to predict which specific memory locations will be needed in the future, so it can be difficult to issue the prefetch instruction far enough in advance to minimize d-cache misses. If, for example, a linked list is being traversed and we are simply checking the value of a single field, issuing a prefetch will probably not help because there is not enough time for the prefetch to bring the needed data into cache before it is actually needed. For example:



while(ptr != NULL)
{
   __dcbt(ptr->next);   /* prefetch the next element */
   if(ptr->value == 100) { printf("hi\n"); }
   ptr = ptr->next;
}

The prefetch will not help much unless there are a lot of entries with the value 100.
However, if even a little work is being done between memory dereferences, then a prefetch can be advantageous.

The example below is compiled with


xlc -oprefetch -qarch=8 prefetchExample2.c

or

xlc -onop -DNO_PREFETCH -qarch=8 prefetchExample2.c


I ran a prefetch-enabled version (prefetch) and a non-enabled version (nop) with varying amounts of work on a z10.
The larger the amount of work, the more time the prefetch instruction had to work its magic; the smaller the amount
of work, the less effective it was.

As the results show, even with a small amount of work prefetch instructions can help, although their benefit does taper off.

Prefetch with a small amount of work

time prefetch 1299999 99 1
Allocating an queue of size 1299999 (332.259766 Mb), will process work 1 times, repeated 99 times.
Initializing the queue....
mallocing a total of 332 Mbytes
element size: 268
Starting work....
Stopped work.
Elapsed time: 18 seconds
Checking array....
All done.

real    0m22.21s
user    0m16.41s
sys     0m 5.47s

No prefetch with a small amount of work:

time nop 1299999 99 1
Allocating an queue of size 1299999 (332.259766 Mb), will process work 1 times, repeated 99 times.
Initializing the queue....
mallocing a total of 332 Mbytes
element size: 268
Starting work....
Stopped work.
Elapsed time: 19 seconds
Checking array....
All done.

real    0m23.74s
user    0m17.55s
sys     0m 5.85s

Prefetch with a large amount of work:

time prefetch 1299999 99 90
Allocating an queue of size 1299999 (332.259766 Mb), will process work 90 times, repeated 99 times.
Initializing the queue....
mallocing a total of 332 Mbytes
element size: 268
Starting work....
Stopped work.
Elapsed time: 77 seconds
Checking array....
All done.

real    1m21.53s
user    1m 0.68s
sys     0m20.22s


No prefetch with a large amount of work:

time nop 1299999 99 90
Allocating an queue of size 1299999 (332.259766 Mb), will process work 90 times, repeated 99 times.
Initializing the queue....
mallocing a total of 332 Mbytes
element size: 268
Starting work....
Stopped work.
Elapsed time: 95 seconds
Checking array....
All done.

real    1m38.79s
user    1m13.57s
sys     0m24.52s



prefetchExample2.c

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <builtins.h>

#define MBSIZE (1024*1024)
#define Z10_CACHELINESIZE 256

typedef struct q
{
   int       index;
   struct q *next;   /* note: this is in same cache line as index */
   int       work;
   /* filler to ensure each queue element can trigger a cache miss */
   char      filler[Z10_CACHELINESIZE];
   /* if 'next' were here, just reading it could cause a cache miss */
} Queue;

Queue *newElement(int id, Queue *next)
{
   Queue *ptr = (Queue *) calloc(1,sizeof(Queue));
   if(NULL == ptr)
   {
      perror("Memory failure\n");
      exit(9);
   }
   ptr->index = id;
   ptr->work  = 1;
   ptr->next  = next;
   return ptr;
}

Queue *initQueue(int size)
{
   printf("mallocing a total of %d Mbytes\n",(int)((sizeof(Queue) * size)/MBSIZE));
   printf("element size: %d\n",(int)sizeof(Queue));
   Queue *head = newElement(0,NULL);
   for(int i=1; i < size; ++i)
   {
      head = newElement(i,head);
   }
   return head;
}

void doWork(int *work, int size)
{
   for(int i=0; i < size; ++i)
   {
      *work = *work/size;
   }
}

#ifndef NO_PREFETCH
#define PREFETCH_ON 1   /* set to 0 to disable */
#else
#define PREFETCH_ON 0
#endif

void processQueue(Queue *head, int workSize)
{
   Queue *ptr = head;
   while(NULL != ptr)
   {
      if(PREFETCH_ON) __dcbtst(ptr->next);
      doWork(&(ptr->work),workSize);
      ptr = ptr->next;
   }
}

void checkQueue(Queue *head, int workSize)
{
   Queue *ptr = head;
   while(NULL != ptr)
   {
      /* some bogus check */
      if(ptr->work == -3)
         printf("Found a strange value %d index %d\n",ptr->work,ptr->index);
      ptr = ptr->next;
   }
}

int main(int argc, char *argv[])
{
   unsigned long long startTime,stopTime;

   if(argc != 4)
   {
      fprintf(stderr,"Usage:%s <numQueueElements> <numIterations> <sizeWork>\n",argv[0]);
      return 99;
   }

   int size       = atoi(argv[1]);
   int iterations = atoi(argv[2]);
   int workSize   = atoi(argv[3]);
   printf("Allocating an queue of size %d (%f Mb), will process work %d times, repeated %d times.\n",
          size,(float)(size*sizeof(Queue))/MBSIZE,workSize,iterations);

   printf("Initializing the queue....\n");
   Queue *head = initQueue(size);

   printf("Starting work....\n");
   startTime = time(NULL);
   for(int i=0; i < iterations; ++i)
      processQueue(head,workSize);
   stopTime = time(NULL);
   printf("Stopped work.\n");
   printf("Elapsed time: %lld seconds\n",stopTime-startTime);

   printf("Checking queue....\n");
   checkQueue(head,size);   /* need this to ensure processQueue not removed */
   printf("All done.\n");

   return 0;
}
Updated on 2011-06-23T01:08:57Z by SystemAdmin
  • DaveyC
    55 Posts

    Re: Using the zArchitecture prefetch bif with queues (z10 and newer)

    2011-06-22T06:05:23Z
    Great stuff Chris! It's invaluable information and I'm sure I speak for all of us when I say a big thank you.

    It would be interesting to run your prefetch example using multiple threads with a shared queue to see just how effective
    the shared cache architecture on modern hardware is. FWIW, I don't have access to a machine with prefetch!

    David Crayford
  • SystemAdmin
    196 Posts

    Re: Using the zArchitecture prefetch bif with queues (z10 and newer)

    2011-06-23T01:08:57Z
    • DaveyC
    • 2011-06-22T06:05:23Z
    Great stuff Chris! It's invaluable information and I'm sure I speak for all of us when I say a big thank you.

    It would be interesting to run your prefetch example using multiple threads with a shared queue to see just how effective
    the shared cache architecture on modern hardware is. FWIW, I don't have access to a machine with prefetch!

    David Crayford
    That's an interesting idea. I'll see what I can cook up.