-qprefetch
Category
Pragma equivalent
None.
Purpose
Inserts prefetch instructions automatically where there are opportunities to improve code performance.
When -qprefetch is in effect, the compiler may insert prefetch instructions in compiled code. When -qnoprefetch is in effect, prefetch instructions are not inserted in compiled code.
Syntax
                      .-:---------------------------------.
                      V                                   |
        .-prefetch----+---------------------------------+-+-.
        |             |    .-noassistthread-----------. |   |
        |             +-=--+-assistthread--=--+-SMT-+-+-+   |
        |             |                       '-CMP-'   |   |
        |             |    .-noaggressive-.             |   |
        |             +-=--+-aggressive---+-------------+   |
        |             '-=--dscr--=--value---------------'   |
>>- -q--+-noprefetch----------------------------------------+--><
Defaults
-qprefetch=noassistthread:noaggressive:dscr=0
Parameters
- assistthread | noassistthread
- When you work with applications that generate a high cache-miss
rate, you can use -qprefetch=assistthread to exploit
assist threads for data prefetching. This suboption guides the compiler
to exploit assist threads at optimization level -O3 -qhot or
higher. If you do not specify -qprefetch=assistthread, -qprefetch=noassistthread is
implied.
- CMP
- For systems based on the chip multi-processor architecture (CMP), you can use -qprefetch=assistthread=cmp.
- SMT
- For systems based on the simultaneous multi-threading architecture
(SMT), you can use -qprefetch=assistthread=smt.
Note: If you do not specify either CMP or SMT, the compiler uses the default setting based on your system architecture.
- aggressive | noaggressive
- This suboption guides the compiler to generate aggressive data prefetching at optimization level -O3 or higher. If you do not specify aggressive, -qprefetch=noaggressive is implied.
- dscr
- You can specify a value for the dscr suboption
to improve the runtime performance of your applications. The compiler
sets the Data Stream Control Register (DSCR) to the specified dscr value
to control the hardware prefetch engine. The value is valid when -mcpu=pwr8 is in effect and the optimization
level is -O2 or greater. The default value of dscr is
0.
- value
- The value that you specify for dscr must be 0 or greater, and representable as a 64-bit unsigned integer. Otherwise, the compiler issues a warning message and sets dscr to 0. The compiler accepts both decimal and hexadecimal numbers, and a hexadecimal number requires the prefix of 0x. The value range depends on your system architecture. See the product information about the POWER® Architecture for details. If you specify multiple dscr values, the last one takes effect.
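The acceptance rule for the dscr value can be sketched in C. This is only an illustration of the rule as described above, not the compiler's implementation: the function name parse_dscr_value and the ok flag are hypothetical, and the architecture-dependent value range is not checked.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the compiler): returns the parsed value,
   or 0 with *ok == 0 when the text is not a decimal or 0x-prefixed
   hexadecimal number that fits in a 64-bit unsigned integer. */
uint64_t parse_dscr_value(const char *text, int *ok)
{
    int base = 10;
    const char *digits = text;
    const char *p;
    unsigned long long parsed;

    *ok = 0;
    if (text == NULL)
        return 0;

    /* A hexadecimal number requires the 0x prefix. */
    if (text[0] == '0' && (text[1] == 'x' || text[1] == 'X')) {
        base = 16;
        digits = text + 2;
    }

    /* Every remaining character must be a digit of the chosen base.
       This rejects empty strings, signs such as "-1", and whitespace. */
    if (digits[0] == '\0')
        return 0;
    for (p = digits; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (base == 16 ? !isxdigit(c) : !isdigit(c))
            return 0;
    }

    errno = 0;
    parsed = strtoull(digits, NULL, base);
    /* ERANGE signals overflow of unsigned long long; the second test
       covers platforms where that type is wider than 64 bits. */
    if (errno == ERANGE || parsed > UINT64_MAX)
        return 0;

    *ok = 1;
    return (uint64_t)parsed;
}
```

Under this sketch, "16" and "0x10" both yield 16, while "-1" or a number larger than 64 bits yields 0 with ok cleared, which mirrors the documented fallback of setting dscr to 0.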
Usage
The -qnoprefetch option does not prevent built-in functions such as __prefetch_by_stream from generating prefetch instructions.
When you specify -qprefetch=assistthread, the compiler uses delinquent load information to perform analysis and generate prefetching assist threads. You can provide this information either by calling the built-in function __mem_delay(const void *delinquent_load_address, const unsigned int delay_cycles), or by gathering it from dynamic profiling with -qpdf1=level=2. When you use profiling, compile in two steps:
- Compile with -qpdf1=level=2, then run the program to collect profile data.
- Recompile with -qpdf2 -qprefetch=assistthread.
Examples
Here is how you generate code using assist threads with __mem_delay:
int y[64], x[1089], w[1024];
void foo(void){
    int i, j;
    for (i = 0; i < 64; i++) {
        for (j = 0; j < 1024; j++) {
            /* what to prefetch? y[i]; inserted by the user */
            __mem_delay(&y[i], 10);
            y[i] = y[i] + x[i + j] * w[j];
            x[i + j + 1] = y[i] * 2;
        }
    }
}
The compiler generates assist-thread code similar to the following:
void foo@clone(unsigned thread_id, unsigned version)
{
    if (!1) goto lab_1;
    /* version control to synchronize assist and main thread */
    if (version == @2version0) goto lab_5;
    goto lab_1;
lab_5:
    @CIV1 = 0;
    do { /* id=1 guarded */ /* ~2 */
        if (!1) goto lab_3;
        @CIV0 = 0;
        do { /* id=2 guarded */ /* ~4 */
            /* region = 0 */
            /* __dcbt call generated to prefetch y[i] access */
            __dcbt(((char *)&y + (4)*(@CIV1)));
            @CIV0 = @CIV0 + 1;
        } while ((unsigned) @CIV0 < 1024u); /* ~4 */
lab_3:
        @CIV1 = @CIV1 + 1;
    } while ((unsigned) @CIV1 < 64u); /* ~2 */
lab_1:
    return;
}
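The foo@clone listing is compiler-internal notation rather than valid C. To make its loop structure and address arithmetic easier to follow, here is a rough portable rendering. It is a sketch under assumptions: touch_for_prefetch, prefetch_target, and assist_loop are hypothetical names, the GCC/Clang __builtin_prefetch builtin stands in for __dcbt, and the constant 4 in the listing is taken to be sizeof(int).

```c
#include <stddef.h>

int y[64];

/* Hypothetical stand-in for __dcbt: uses the GCC/Clang prefetch builtin
   when it is available and otherwise does nothing. */
static void touch_for_prefetch(const void *addr)
{
#if defined(__GNUC__)
    __builtin_prefetch(addr);
#else
    (void)addr;
#endif
}

/* Address that the clone prefetches for outer-loop index i:
   (char *)&y + sizeof(int) * i, which is the address of y[i]. */
const char *prefetch_target(unsigned i)
{
    return (const char *)&y + sizeof(int) * i;
}

/* Same loop nest as the clone: only the prefetch remains, and the
   arithmetic on x and w from foo is gone. Returns the number of
   prefetch requests issued. */
unsigned long assist_loop(void)
{
    unsigned long issued = 0;
    unsigned civ1, civ0;

    for (civ1 = 0; civ1 < 64u; civ1++) {          /* mirrors @CIV1 */
        for (civ0 = 0; civ0 < 1024u; civ0++) {    /* mirrors @CIV0 */
            touch_for_prefetch(prefetch_target(civ1));
            issued++;
        }
    }
    return issued;
}
```

The point of the rendering is that the assist thread keeps the iteration space of the original loops but drops the computation, so it can run ahead of the main thread and pull y[i] into the cache.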