-qprefetch

Pragma equivalent

None.

Purpose

Inserts prefetch instructions automatically where there are opportunities to improve code performance.

When -qprefetch is in effect, the compiler may insert prefetch instructions in compiled code. When -qnoprefetch is in effect, prefetch instructions are not inserted in compiled code.

Syntax


                        .-:-------------------------------.   
        .-prefetch---.  V                                 |   
>>- -q--+-noprefetch-+----+-----------------------------+-+----><
                          +-=--assistthread--=--+-SMT-+-+     
                          |                     '-CMP-' |     
                          +-=--noassistthread-----------+     
                          |    .-noaggressive-.         |     
                          '-=--+-aggressive---+---------'

Defaults

-qprefetch
-qprefetch=noassistthread
-qprefetch=noassistthread:noaggressive

Parameters

assistthread | noassistthread

For applications that generate a high cache-miss rate, you can use -qprefetch=assistthread to exploit assist threads for data prefetching. This suboption guides the compiler to exploit assist threads at optimization level -O3 -qhot or higher. If you do not specify -qprefetch=assistthread, -qprefetch=noassistthread is implied.

aggressive | noaggressive

This suboption guides the compiler to generate aggressive data prefetching at optimization level -O3 -qhot or higher. If you do not specify aggressive, -qprefetch=noaggressive is implied.

The assistthread suboption accepts the following values:

CMP: For systems based on the chip multi-processor (CMP) architecture, use -qprefetch=assistthread=cmp.
SMT: For systems based on the simultaneous multi-threading (SMT) architecture, use -qprefetch=assistthread=smt.

Note: If you specify neither CMP nor SMT, the compiler chooses the default setting based on your system architecture.

Usage

The -qnoprefetch option does not prevent built-in functions such as __prefetch_by_stream from generating prefetch instructions.
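As an illustration of explicit prefetching that stays in effect regardless of -qnoprefetch, the following minimal sketch issues a prefetch request by hand inside a loop. PREFETCH_READ, PREFETCH_DISTANCE, and sum_with_prefetch are hypothetical names introduced only for this sketch. It assumes that the __IBMC__ and __IBMCPP__ macros identify the compiler, that __dcbt (the call that also appears in the assist thread listing in the Examples section) is available as a built-in there, and that a distance of 16 elements is a reasonable starting point; none of this is a tuned or recommended setting.

```c
#include <stddef.h>

/* PREFETCH_READ is a hypothetical helper macro, not part of the compiler. */
#if defined(__IBMC__) || defined(__IBMCPP__)
  /* Assumption: the __dcbt built-in is available without further setup. */
  #define PREFETCH_READ(addr) __dcbt((void *)(addr))
#elif defined(__GNUC__)
  /* GCC and Clang: request a prefetch for reading, high temporal locality. */
  #define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
  /* Unknown compiler: compile the request away. */
  #define PREFETCH_READ(addr) ((void)0)
#endif

/* Assumed prefetch distance in elements; tune it by measurement. */
#define PREFETCH_DISTANCE 16

/* Sums n elements while requesting data a fixed distance ahead. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Only prefetch addresses that lie inside the array. */
        if (i + PREFETCH_DISTANCE < n) {
            PREFETCH_READ(&data[i + PREFETCH_DISTANCE]);
        }
        total += data[i];
    }
    return total;
}
```

Because the prefetch request is written in the source, it is a property of the program rather than of the -qprefetch setting. The result of the function is the same whether or not the prefetch has any effect.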

When you specify -qprefetch=assistthread, the compiler analyzes delinquent load information (information about loads that frequently miss in the cache) and generates prefetching assist threads. You can provide this information through the built-in function __mem_delay(const void *delinquent_load_address, const unsigned int delay_cycles), or the compiler can gather it from dynamic profiling with -qpdf1=level=2.
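The following sketch shows one way a delinquent load might be marked in an indirect (gather) access, where the address of each load depends on an index array and is therefore hard to predict. MARK_DELINQUENT_LOAD and gather_sum are hypothetical names used only here. The sketch assumes that the __IBMC__ and __IBMCPP__ macros identify a compiler that provides __mem_delay, and the delay of 10 cycles is simply the value used in the example in this section, not a measured figure.

```c
#include <stddef.h>

/* MARK_DELINQUENT_LOAD is a hypothetical wrapper, not a compiler API. */
#if defined(__IBMC__) || defined(__IBMCPP__)
  /* Assumption: __mem_delay is available as a built-in on this compiler. */
  #define MARK_DELINQUENT_LOAD(addr, cycles) __mem_delay((addr), (cycles))
#else
  /* Other compilers: evaluate the arguments and do nothing else. */
  #define MARK_DELINQUENT_LOAD(addr, cycles) ((void)(addr), (void)(cycles))
#endif

/* Sums table[index[k]] for k in [0, n); the table load is the hinted one. */
long gather_sum(const int *table, const unsigned *index, size_t n)
{
    long total = 0;
    size_t k;

    for (k = 0; k < n; k++) {
        /* Tell the compiler which load is expected to miss, and for how long. */
        MARK_DELINQUENT_LOAD(&table[index[k]], 10);
        total += table[index[k]];
    }
    return total;
}
```

The wrapper keeps the source buildable with compilers that lack the built-in. The hint itself does not change the computed result.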

When you use -qpdf together with -qprefetch=assistthread, you must use the traditional two-step PDF invocation:

Compile with -qpdf1=level=2, then run the resulting program to collect profile data.
Recompile with -qpdf2 -qprefetch=assistthread.

Examples

The following example shows how to use __mem_delay to generate code that uses assist threads:

Initial code:

int y[64], x[1089], w[1024];

void foo(void) {
  int i, j;
  for (i = 0; i < 64; i++) {
    for (j = 0; j < 1024; j++) {
      /* what to prefetch? y[i]; inserted by the user */
      __mem_delay(&y[i], 10);
      y[i] = y[i] + x[i + j] * w[j];
      x[i + j + 1] = y[i] * 2;
    }
  }
}

Assist thread generated code:

void foo@clone(unsigned thread_id, unsigned version)

{ if (!1) goto lab_1;

/* version control to synchronize assist and main thread */
if (version == @2version0) goto lab_5; 

goto lab_1;

lab_5:

@CIV1 = 0;

do { /* id=1 guarded */ /* ~2 */

if (!1) goto lab_3;

@CIV0 = 0;

do { /* id=2 guarded */ /* ~4 */

/* region = 0 */

/* __dcbt call generated to prefetch y[i] access */
__dcbt(((char *)&y + (4)*(@CIV1)));
@CIV0 = @CIV0 + 1; 
} while ((unsigned) @CIV0 < 1024u); /* ~4 */  

lab_3:
@CIV1 = @CIV1 + 1;
} while ((unsigned) @CIV1 < 64u); /* ~2 */  

lab_1:

return; 
}