An example with several types of compute kernel that work well on Cell

2009-11-19T04:14:09Z
To get an OpenCL program to perform well on Cell for the Accelerator device, there are several steps one can take:
1) The kernel's computational code should use native 128-bit vector data types or larger (e.g., float4,
int4). AOS (array of structures) vector style is preferred. The Cell SPEs are vector
engines; if the computation is performed using scalar data types, only 1/4 of the
Cell's computational power can be reached.

2) Cell uses a DMA engine to transfer data between __global and __local memory. The
explicit way to access this engine is OpenCL's async_work_group_copy builtin function
(its prototypes are sketched just after this list). Using async_work_group_copy will
result in better performance than loads and stores from and to __global memory, since
the latter are handled via a software data cache on Cell.

3) A larger local work group size will also result in better performance, since the
implementation treats a large work group as an implicit unrolling of the kernel. Furthermore,
a larger work group means more data is transferred per async_work_group_copy call, which
is more efficient.
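
For reference, these builtins have the following prototypes in the OpenCL 1.0
specification, where gentype stands for any supported scalar or vector element type
(float4 in the kernels below):

event_t async_work_group_copy (__local gentype *dst, const __global gentype *src, size_t num_gentypes, event_t event);
event_t async_work_group_copy (__global gentype *dst, const __local gentype *src, size_t num_gentypes, event_t event);
void    wait_group_events (int num_events, event_t *event_list);

The returned event can be passed back in to batch several copies behind a single
wait_group_events call, which is what the kernels below do.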

The following kernels demonstrate the steps outlined above. In this
example, we compute the addition of two matrices A and B of NUM_ROW x
NUM_COL floats and store the result in matrix C.

In the first kernel, matrix_add_scalar_global, each work item performs
the matrix addition for one row, so the global work size is just the number of
rows in the matrices. All computation is done using scalar single-precision
floating-point numbers. The parameter 'size' is the number of floating-point
values per row. Loads and stores from and to __global memory are used to access the
matrices.

__kernel __attribute__ ((reqd_work_group_size (WORK_GROUP_SIZE, 1, 1)))
void matrix_add_scalar_global (__global float *A,
                               __global float *B,
                               __global float *C,
                               unsigned int size)
{
    unsigned int i;
    unsigned int gid = get_global_id(0);
    unsigned int offset = size * gid;

    for (i = 0; i < size; i++) {
        C[offset + i] = A[offset + i] + B[offset + i];
    }
}

The second kernel, matrix_add_vector_global, is very similar to the first one
except that it uses the float4 type instead of float. With OpenCL, vectorizing a
computational core, as shown here, is very straightforward.

__kernel __attribute__ ((reqd_work_group_size (WORK_GROUP_SIZE, 1, 1)))
void matrix_add_vector_global (__global float4 *A,
                               __global float4 *B,
                               __global float4 *C,
                               unsigned int size)
{
    unsigned int i;
    unsigned int offset = (size / 4) * get_global_id(0);

    for (i = 0; i < (size / 4); i++) {
        C[offset + i] = A[offset + i] + B[offset + i];
    }
}


The third kernel, matrix_add_vector_async_copy, uses the builtin function
async_work_group_copy to transfer the matrix data into and out of
__local memory. The builtin async_work_group_copy function enables direct
access to the Cell DMA engines, and thus faster transfers of data between
__global and __local memory. Note that the local work group size is specified
as 1 in this case. (A host-side sketch for sizing the __local buffer arguments
follows the kernel.)

__kernel __attribute__ ((reqd_work_group_size (1, 1, 1)))
void matrix_add_vector_async_copy (__global float4 *A,
                                   __global float4 *B,
                                   __global float4 *C,
                                   __local float4 *lA,
                                   __local float4 *lB,
                                   __local float4 *lC,
                                   unsigned int size)
{
    unsigned int i;
    unsigned int cnt = size / 4;
    unsigned int offset = cnt * get_global_id(0);
    event_t event = (event_t)0;

    /* get the matrices into local memory */
    event = async_work_group_copy (lA, (__global const float4 *)&A[offset], cnt, event);
    event = async_work_group_copy (lB, (__global const float4 *)&B[offset], cnt, event);
    wait_group_events (1, &event);

    for (i = 0; i < cnt; i++) {
        lC[i] = lA[i] + lB[i];
    }

    event = async_work_group_copy (&C[offset], (__local const float4 *)lC, cnt, event);
    wait_group_events (1, &event);
}
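
Because the kernel takes its __local buffers as parameters, the host must reserve
space for them when setting the kernel arguments: clSetKernelArg is called with the
size in bytes and a NULL pointer. The following fragment is not from the original
post; it is a minimal sketch assuming the kernel object and the cl_mem buffers bufA,
bufB, and bufC were created earlier:

/* Each __local buffer holds one row: (size/4) float4 elements. */
size_t local_bytes = (size / 4) * sizeof(cl_float4);

clSetKernelArg (kernel, 0, sizeof(cl_mem), &bufA);
clSetKernelArg (kernel, 1, sizeof(cl_mem), &bufB);
clSetKernelArg (kernel, 2, sizeof(cl_mem), &bufC);
clSetKernelArg (kernel, 3, local_bytes, NULL);   /* lA */
clSetKernelArg (kernel, 4, local_bytes, NULL);   /* lB */
clSetKernelArg (kernel, 5, local_bytes, NULL);   /* lC */
clSetKernelArg (kernel, 6, sizeof(cl_uint), &size);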


The fourth kernel is an improvement on the previous one. Instead of the local work
group size being set to 1, it is set to WORK_GROUP_SIZE. The OpenCL
implementation essentially unrolls the computation by the number of work
items per work group. Applications can gain substantial performance improvements
by experimenting with larger work group sizes; a larger work group means more data
is transferred per async_work_group_copy call, which can be more efficient.

In this kernel, size/4 * WORK_GROUP_SIZE * sizeof(float4) bytes are transferred per
async_work_group_copy call for each work group, so each __local buffer must be sized
accordingly. A call to the barrier() function is used after the computation step
to ensure that all computations in all the work items have been completed before
the data is transferred to __global memory.


__kernel __attribute__ ((reqd_work_group_size (WORK_GROUP_SIZE, 1, 1)))
void matrix_add_vector_large_wg (__global float4 *A,
                                 __global float4 *B,
                                 __global float4 *C,
                                 __local float4 *lA,
                                 __local float4 *lB,
                                 __local float4 *lC,
                                 unsigned int size)
{
    unsigned int i;
    unsigned int workgroup_cnt = (size / 4) * WORK_GROUP_SIZE;
    /* offset of this work group's block of rows in the matrices */
    unsigned int offset = workgroup_cnt * get_group_id(0);
    unsigned int wi_offset = (size / 4) * get_local_id(0);
    event_t event = (event_t)0;

    /* DMA the data needed by the whole work group into local memory */
    event = async_work_group_copy (lA, (__global const float4 *)&A[offset], workgroup_cnt, event);
    event = async_work_group_copy (lB, (__global const float4 *)&B[offset], workgroup_cnt, event);
    wait_group_events (1, &event);

    for (i = 0; i < (size / 4); i++) {
        lC[wi_offset + i] = lA[wi_offset + i] + lB[wi_offset + i];
    }

    barrier (CLK_LOCAL_MEM_FENCE);

    event = async_work_group_copy (&C[offset], (__local const float4 *)lC, workgroup_cnt, event);
    wait_group_events (1, &event);
}
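
To launch this kernel, the host must compile WORK_GROUP_SIZE into the program and
pass a matching local work size to clEnqueueNDRangeKernel. This fragment is not from
the original post; it is a minimal sketch assuming the program, device, kernel, and
command queue objects from the earlier fragment, and a hypothetical WORK_GROUP_SIZE
of 8:

/* Bake the work group size into the kernel at build time. */
clBuildProgram (program, 1, &device, "-DWORK_GROUP_SIZE=8", NULL, NULL);

/* The __local buffers now hold one row per work item in the group. */
size_t local_bytes = (size / 4) * WORK_GROUP_SIZE * sizeof(cl_float4);
clSetKernelArg (kernel, 3, local_bytes, NULL);   /* lA; likewise for lB and lC */

/* One work item per row; the local work size must divide the global
   work size and match the kernel's reqd_work_group_size attribute. */
size_t global_work_size = NUM_ROW;
size_t local_work_size  = WORK_GROUP_SIZE;
clEnqueueNDRangeKernel (cmd_queue, kernel, 1, NULL,
                        &global_work_size, &local_work_size,
                        0, NULL, NULL);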

More information on how to optimize OpenCL programs for the Power and Cell
architectures can be found in the Installation and User's Guide for the OpenCL
Development Kit for Linux on Power. This guide is part of the OpenCL Development Kit download.