Level: Introductory
Kane Scarlett (kane@us.ibm.com), developerWorks Editor, IBM
18 Mar 2008
In this Cell Broadband Engine™ (Cell/B.E.) series, learn how to use the
Accelerated Library Framework (ALF) in the IBM SDK for Multicore Acceleration 3.0 to
add two large matrices together. There is one example for host data
partitioning and one for accelerator data partitioning. The "ALF for Cell/B.E.
Programmer's Guide and API Reference, Version 3.0" (see Resources) is the source for the content.
Introduction
For this article, two large matrices are added together using ALF. The
problem can be expressed simply as:
A[m,n] + B[m,n] = C[m,n]
where m and n are the dimensions of the matrices.
This simple example, crafted for both host and accelerator data partitioning,
demonstrates how to:
- Start the ALF runtime environment.
- Use a task descriptor.
- Start a task on the accelerators.
- Create and add a work block to a task.
- Exit the ALF runtime environment correctly.
You can use this sample as a template to build more complicated applications.
More fun with ALF
Look for more in the Fun with ALF series:
- In "Table lookup," learn how the task context buffer is used as a large
lookup table to convert the 16-bit input data to 8-bit output data.
- In "Min-max finder," discover how you can use the task context to keep the
partial computing results for each task instance and then combine these
partial results into the final result.
- In "Multiple vector dot products," uncover how to use the bundled work block
distribution with the task context to handle situations where the
work block cannot hold the partitioned data because of a local memory size
limit.
- In "Overlapped I/O buffer," take a look at using overlapped I/O buffers to
do matrix addition.
- In "Task dependency," check out a simple simulation in which you can use task
dependency in a two-stage pipeline application.
And watch for similar "Fun with" series addressing DaCS, BLAS, and other
technologies to make your Cell/B.E. programming easier.
Exploring the host data partitioning example
In this example, the host application:
- Initializes the ALF runtime environment.
- Creates a task descriptor.
- Creates a task based on that task descriptor.
- Creates work blocks with the appropriate data transfer lists that start
invocations of the computational kernel on the accelerator.
- Waits for the computational kernel to finish and exits.
The accelerator application includes a simple computational kernel that computes
the addition of the two matrices. The scalar code to add two matrices for a
uni-processor computer is:
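/* NUM_ROW and NUM_COL are compile-time constants defined elsewhere */
float mat_a[NUM_ROW][NUM_COL];
float mat_b[NUM_ROW][NUM_COL];
float mat_c[NUM_ROW][NUM_COL];

int main(void)
{
    int i, j;

    /* add the matrices one element at a time */
    for (i = 0; i < NUM_ROW; i++)
        for (j = 0; j < NUM_COL; j++)
            mat_c[i][j] = mat_a[i][j] + mat_b[i][j];
    return 0;
}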
Exploring the host data partitioning source code
The following code listings show only the relevant sections of the code. For a
complete listing, refer to the ALF samples directory
matrix_add/STEP1a_partition_scheme_A/common/host_partition.
An ALF host program can be logically divided into these sections:
- Initialization
- Task setup
- Work block setup
- Task wait and exit
Initialization
The following code segment shows how ALF is initialized and how accelerators are
allocated for a specific ALF runtime.
alf_handle_t alf_handle;
unsigned int nodes;
int rc;

/* initialize the runtime environment for ALF */
alf_init(&config_parms, &alf_handle);

/* get the number of SPE accelerators available to the host */
rc = alf_query_system_info(alf_handle, ALF_QUERY_NUM_ACCEL, ALF_ACCEL_TYPE_SPE, &nodes);

/* set the total number of accelerator instances (in this case, SPEs) */
/* that the ALF runtime will have during its lifetime */
rc = alf_num_instances_set(alf_handle, nodes);
Task setup
This section of an ALF host program describes a task and creates the task
runtime. The alf_task_desc_create function creates a task
descriptor. This descriptor can be used multiple times to create different
executable tasks. The function alf_task_create creates
a task to run an SPE program with the name spe_add_program.
/* variable declarations */
alf_task_desc_handle_t task_desc_handle;
alf_task_handle_t task_handle;
const char* spe_image_name;
const char* library_path_name;
const char* comp_kernel_name;

/* describe a task that is executable on the SPE */
alf_task_desc_create(alf_handle, ALF_ACCEL_TYPE_SPE, &task_desc_handle);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_TSK_CTX_SIZE, 0);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE,
    sizeof(add_parms_t));
/* the input buffer holds one H * V block from each of the two input matrices */
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_IN_BUF_SIZE,
    H * V * 2 * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_OUT_BUF_SIZE,
    H * V * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_NUM_DTL_ENTRIES, 8);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_MAX_STACK_SIZE, 4096);

/* provide the SPE executable name */
alf_task_desc_set_int64(task_desc_handle,
    ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long) spe_image_name);
alf_task_desc_set_int64(task_desc_handle,
    ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long) library_path_name);
alf_task_desc_set_int64(task_desc_handle,
    ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long) comp_kernel_name);
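The buffer sizes passed to the task descriptor follow directly from the block dimensions, so it can help to check them before choosing H and V. The following is a minimal sketch in plain C, not part of the ALF API: the helper names wb_in_bytes, wb_out_bytes, and wb_fits are hypothetical, and the "copies" argument is only an assumed stand-in for however many buffer sets the runtime keeps in flight. The one hard fact used is that an SPE local store is 256 KB, shared by code, stack, and data.

```c
#include <assert.h>
#include <stddef.h>

/* SPE local store size: 256 KB, shared by code, stack, and data. */
#define SPE_LOCAL_STORE_BYTES (256u * 1024u)

/* Input buffer of one work block: an H*V block of A plus an H*V block of B. */
size_t wb_in_bytes(size_t h, size_t v)
{
    return h * v * 2 * sizeof(float);
}

/* Output buffer of one work block: an H*V block of C. */
size_t wb_out_bytes(size_t h, size_t v)
{
    return h * v * sizeof(float);
}

/*
 * Hypothetical check: returns 1 if `copies` sets of input and output
 * buffers fit within `budget_bytes`, and 0 otherwise. `copies` models
 * multi-buffering (2 for double-buffering); the real buffering depth is
 * chosen by the runtime and is only assumed here.
 */
int wb_fits(size_t h, size_t v, size_t copies, size_t budget_bytes)
{
    assert(copies >= 1);
    return copies * (wb_in_bytes(h, v) + wb_out_bytes(h, v)) <= budget_bytes;
}
```

For example, a 32 x 32 block needs 8192 bytes of input and 4096 bytes of output, while two sets of 128 x 128 buffers would already exceed the whole local store.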
Work block setup
This section shows how work blocks are created. After the program has created
the work block, it describes the input and output associated with each work block.
Each work block contains the input description for blocks in the input matrices of
size H * V starting at location
matrix[row][0] with H and
V representing the horizontal and vertical dimensions
of the block.
In this example, assume that the accelerator memory can contain the two input
buffers of size H * V elements, and assume that the
output buffer is of size H * V. The program calls
alf_wb_enqueue() to add the work block to the queue to
be processed. ALF employs an immediate runtime mode. As soon as the first work
block is added to the queue, the task starts processing the work block. The
function alf_task_finalize closes the work block queue.
alf_wb_handle_t wb_handle;
add_parms_t parm __attribute__((aligned(128)));
int i;

parm.h = H; /* horizontal size of the block */
parm.v = V; /* vertical size of the block */

/* create the work blocks and add the parameter and I/O buffers */
for (i = 0; i < NUM_ROW; i += H) {
    alf_wb_create(task_handle, ALF_WB_SINGLE, 0, &wb_handle);

    /* begin a new data transfer list for INPUT */
    alf_wb_dtl_set_begin(wb_handle, ALF_BUF_IN, 0);
    /* add H*V elements of matrix_a as input */
    alf_wb_dtl_set_entry_add(wb_handle, &matrix_a[i][0], H * V, ALF_DATA_FLOAT);
    /* add H*V elements of matrix_b as input */
    alf_wb_dtl_set_entry_add(wb_handle, &matrix_b[i][0], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_set_end(wb_handle);

    /* begin a new data transfer list for OUTPUT */
    alf_wb_dtl_set_begin(wb_handle, ALF_BUF_OUT, 0);
    /* add H*V elements of matrix_c as output */
    alf_wb_dtl_set_entry_add(wb_handle, &matrix_c[i][0], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_set_end(wb_handle);

    /* pass parameters H and V to the SPU */
    alf_wb_parm_add(wb_handle, (void *) (&parm), sizeof(parm), ALF_DATA_BYTE, 0);

    /* enqueue the work block */
    alf_wb_enqueue(wb_handle);
}

/* no more work blocks will be added to this task */
alf_task_finalize(task_handle);
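To see what each work block carries, the partitioning loop can be imitated on the host without ALF. The sketch below is an illustration only: add_block and add_by_blocks are hypothetical names, the matrix sizes are made up, memcpy stands in for the data transfers, and V is assumed to equal the row length so that H * V contiguous elements starting at matrix[i][0] cover whole rows. It shows the input buffer layout the kernel relies on, with the block of A placed first and the block of B directly after it.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sizes for illustration; V equals the row length. */
enum { NUM_ROW = 8, NUM_COL = 4, H = 2, V = NUM_COL };

/* Stand-in for the accelerator kernel: `in` holds the A block, then the B block. */
void add_block(const float *in, float *out, int count)
{
    const float *a = in;
    const float *b = in + count;
    for (int k = 0; k < count; k++)
        out[k] = a[k] + b[k];
}

/*
 * Walks the matrices in blocks of H rows, one "work block" per iteration.
 * The memcpy calls mimic the input and output data transfer lists.
 * Returns the number of blocks processed.
 */
int add_by_blocks(float a[NUM_ROW][NUM_COL],
                  float b[NUM_ROW][NUM_COL],
                  float c[NUM_ROW][NUM_COL])
{
    float in[2 * H * V];
    float out[H * V];
    int blocks = 0;

    assert(NUM_ROW % H == 0); /* blocks must tile the rows exactly */
    for (int i = 0; i < NUM_ROW; i += H) {
        memcpy(in, &a[i][0], sizeof(float) * H * V);          /* input entry 1 */
        memcpy(in + H * V, &b[i][0], sizeof(float) * H * V);  /* input entry 2 */
        add_block(in, out, H * V);
        memcpy(&c[i][0], out, sizeof(float) * H * V);         /* output entry */
        blocks++;
    }
    return blocks;
}
```

With these sizes the loop produces four blocks of eight elements each, and the result matches the scalar element-by-element addition.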
Task wait and exit
After all the work blocks are on the processing queue, the program waits for the
accelerator to finish processing the work blocks. Then
alf_exit() is called to cleanly exit the ALF runtime
environment.
/* waiting for all work blocks to be done*/
alf_task_wait(task_handle, -1);
/* exit ALF runtime */
alf_exit(alf_handle, ALF_EXIT_WAIT, -1);
Preparing the accelerator side
On the accelerator side, you need to provide the actual computational kernel
that computes the addition of the two blocks of matrices. The ALF runtime on the
accelerator is responsible for getting the input buffer to the accelerator memory
before it runs the user-provided alf_accel_comp_kernel
function. After alf_accel_comp_kernel returns, the ALF
runtime is responsible for getting the output data back to host memory space.
Double-buffering or triple-buffering is employed as appropriate to ensure that the
latency for the input buffer to get into accelerator memory and the output buffer
to get to host memory space is well covered with computation.
int alf_accel_comp_kernel(void *p_task_context,
                          void *p_parm_context,
                          void *p_input_buffer,
                          void *p_output_buffer,
                          void *p_inout_buffer,
                          unsigned int current_count,
                          unsigned int total_count)
{
    unsigned int i, cnt;
    vector float *sa, *sb, *sc;
    add_parms_t *p_parm = (add_parms_t *)p_parm_context;

    /* number of 4-float vectors in one H*V block */
    cnt = p_parm->h * p_parm->v / 4;

    /* the input buffer holds the block of A followed by the block of B */
    sa = (vector float *) p_input_buffer;
    sb = sa + cnt;
    sc = (vector float *) p_output_buffer;

    /* loop unrolled by four; assumes cnt is a multiple of 4 */
    for (i = 0; i < cnt; i += 4) {
        sc[i] = spu_add(sa[i], sb[i]);
        sc[i + 1] = spu_add(sa[i + 1], sb[i + 1]);
        sc[i + 2] = spu_add(sa[i + 2], sb[i + 2]);
        sc[i + 3] = spu_add(sa[i + 3], sb[i + 3]);
    }
    return 0;
}
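The kernel's indexing can be tried on any host by modeling the SPU vector type in portable C. This is a sketch under stated assumptions, not SPU code: vec4f is a hypothetical stand-in for vector float, vec4f_add stands in for spu_add, and add_kernel_model is an illustrative name. Because the loop handles four vectors per iteration, the model rejects blocks whose element count is not a multiple of 16, a precondition the kernel above appears to rely on without checking.

```c
#include <assert.h>

/* Portable stand-in for the SPU `vector float` type: four packed floats. */
typedef struct { float f[4]; } vec4f;

/* Stand-in for spu_add: element-wise addition of two 4-float vectors. */
vec4f vec4f_add(vec4f x, vec4f y)
{
    vec4f r;
    for (int k = 0; k < 4; k++)
        r.f[k] = x.f[k] + y.f[k];
    return r;
}

/*
 * Mirrors the kernel loop. `in` holds h*v/4 vectors of A followed by
 * h*v/4 vectors of B; `out` receives h*v/4 vectors of C.
 * Returns 0 on success, or -1 if h*v is not a multiple of 16, because the
 * loop unrolled by four would otherwise run past the end of the buffers.
 */
int add_kernel_model(const vec4f *in, vec4f *out, unsigned int h, unsigned int v)
{
    unsigned int n = h * v;
    if (n % 16 != 0)
        return -1;

    unsigned int cnt = n / 4;     /* vectors per block */
    const vec4f *sa = in;         /* block of A */
    const vec4f *sb = in + cnt;   /* block of B follows directly */

    for (unsigned int i = 0; i < cnt; i += 4) {
        out[i]     = vec4f_add(sa[i],     sb[i]);
        out[i + 1] = vec4f_add(sa[i + 1], sb[i + 1]);
        out[i + 2] = vec4f_add(sa[i + 2], sb[i + 2]);
        out[i + 3] = vec4f_add(sa[i + 3], sb[i + 3]);
    }
    return 0;
}
```

A 4 x 4 block (16 floats, 4 vectors) runs the loop body exactly once, while a 2 x 4 block is rejected.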
Exploring the accelerator data partitioning example
It's a trick. You're really finished. The code remains the same on the host,
except for the work block creation. Oh, also, the code needs to specify that it
uses accelerator data partitioning in the task descriptor. An implementation for
the alf_accel_input_dtl_prepare and
alf_accel_output_dtl_prepare functions is required.
For a complete listing of this sample, refer to the ALF samples directory
matrix_add/common/accel_partitioning.
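The idea behind accelerator data partitioning can be sketched without the ALF API: the host sends only a small parameter block, and the accelerator side works out which host addresses to fetch and store. Everything in the following plain C sketch is hypothetical, including the dtl_entry_t and part_parms_t types and the prepare_input and prepare_output functions. It does not reproduce the signatures of alf_accel_input_dtl_prepare or alf_accel_output_dtl_prepare; it only illustrates the address arithmetic such functions would need, assuming row-major float matrices and blocks of whole rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical transfer-list entry: a host address and a float count. */
typedef struct {
    uint64_t host_addr;
    size_t   num_floats;
} dtl_entry_t;

/* Hypothetical per-work-block parameters supplied by the host. */
typedef struct {
    uint64_t a_base, b_base, c_base; /* host addresses of the matrices */
    unsigned int row;                /* first row of this block */
    unsigned int h;                  /* number of rows in the block */
    unsigned int num_col;            /* floats per row */
} part_parms_t;

/* Host address of the first element of this block within one matrix. */
uint64_t block_addr(uint64_t base, const part_parms_t *p)
{
    return base + (uint64_t)p->row * p->num_col * sizeof(float);
}

/* Input side: two entries, the block of A and then the block of B. */
size_t prepare_input(const part_parms_t *p, dtl_entry_t *entries, size_t max_entries)
{
    size_t n = (size_t)p->h * p->num_col;
    assert(max_entries >= 2);
    entries[0].host_addr = block_addr(p->a_base, p);
    entries[0].num_floats = n;
    entries[1].host_addr = block_addr(p->b_base, p);
    entries[1].num_floats = n;
    return 2;
}

/* Output side: one entry, the block of C. */
size_t prepare_output(const part_parms_t *p, dtl_entry_t *entries, size_t max_entries)
{
    assert(max_entries >= 1);
    entries[0].host_addr = block_addr(p->c_base, p);
    entries[0].num_floats = (size_t)p->h * p->num_col;
    return 1;
}
```

Under this scheme the host loop shrinks to filling in the parameters for each work block, which is why only the work block creation changes on the host side.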
Resources
About the author
Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.