For this article, two large matrices are added together using ALF. The problem can be expressed simply as:
A[m,n] + B[m,n] = C[m,n]
where m and n are the dimensions of the matrices.
This simple example, crafted for both host and accelerator data partitioning, demonstrates how to:
- Start the ALF runtime environment.
- Use a task descriptor.
- Start a task on the accelerators.
- Create and add a work block to a task.
- Exit the ALF runtime environment correctly.
You can use this sample as a template to build more complicated applications.
Exploring the host data partitioning example
In this example, the host application:
- Initializes the ALF runtime environment.
- Creates a task descriptor.
- Creates a task based on that task descriptor.
- Creates work blocks with the appropriate data transfer lists that start invocations of the computational kernel on the accelerator.
- Waits for the computational kernel to finish and exits.
The accelerator application includes a simple computational kernel that computes the addition of the two matrices. The scalar code to add two matrices for a uni-processor computer is:
float mat_a[NUM_ROW][NUM_COL];
float mat_b[NUM_ROW][NUM_COL];
float mat_c[NUM_ROW][NUM_COL];

int main(void)
{
    int i, j;

    for (i = 0; i < NUM_ROW; i++)
        for (j = 0; j < NUM_COL; j++)
            mat_c[i][j] = mat_a[i][j] + mat_b[i][j];
    return 0;
}
Exploring the host data partitioning source code
The following code listings show only the relevant sections of the code. For a
complete listing, refer to the ALF samples directory
matrix_add/STEP1a_partition_scheme_A/common/host_partition.
An ALF host program can be logically divided into these sections:
- Initialization
- Task setup
- Work block setup
- Task wait and exit
The following code segment shows how ALF is initialized and how accelerators are allocated for a specific ALF runtime.
alf_handle_t alf_handle;
unsigned int nodes;

/* initializes the runtime environment for ALF */
alf_init(&config_parms, &alf_handle);

/* get the number of SPE accelerators available from the Opteron */
rc = alf_query_system_info(alf_handle, ALF_QUERY_NUM_ACCEL, ALF_ACCEL_TYPE_SPE, &nodes);

/* set the total number of accelerator instances (in this case, SPE) */
/* the ALF runtime will have during its lifetime */
rc = alf_num_instances_set(alf_handle, nodes);
This section of an ALF host program contains information about the description
of a task and the creation of the task runtime. The
alf_task_desc_create function creates a task
descriptor. This descriptor can be used multiple times to create different
executable tasks. The function alf_task_create creates
a task to run an SPE program with the name spe_add_program.
/* variable declarations */
alf_task_desc_handle_t task_desc_handle;
alf_task_handle_t task_handle;
const char* spe_image_name;
const char* library_path_name;
const char* comp_kernel_name;

/* describing a task that's executable on the SPE */
alf_task_desc_create(alf_handle, ALF_ACCEL_TYPE_SPE, &task_desc_handle);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_TSK_CTX_SIZE, 0);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE, sizeof(add_parms_t));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_IN_BUF_SIZE, H * V * 2 * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_WB_OUT_BUF_SIZE, H * V * sizeof(float));
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_NUM_DTL_ENTRIES, 8);
alf_task_desc_set_int32(task_desc_handle, ALF_TASK_DESC_MAX_STACK_SIZE, 4096);

/* providing the SPE executable name */
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long) spe_image_name);
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long) library_path_name);
alf_task_desc_set_int64(task_desc_handle, ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long) comp_kernel_name);
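The buffer sizes passed to the task descriptor follow directly from the block dimensions: the input buffer holds one H * V block from each input matrix, and the output buffer holds one H * V block of the result. The sketch below works through that arithmetic in plain C and checks it against the 256 KB SPE local store. The function names, the `copies` parameter (2 for double buffering), and the `reserved` allowance for code and stack are illustrative assumptions, not part of the ALF API.

```c
#include <assert.h>
#include <stddef.h>

/* An SPE has 256 KB of local store, shared by code, stack, and data buffers. */
#define LOCAL_STORE_BYTES (256u * 1024u)

/* Input buffer: one H x V block of A followed by one H x V block of B. */
size_t wb_in_bytes(size_t h, size_t v)
{
    return h * v * 2 * sizeof(float);
}

/* Output buffer: one H x V block of C. */
size_t wb_out_bytes(size_t h, size_t v)
{
    return h * v * sizeof(float);
}

/* Buffer footprint when `copies` sets of buffers are in flight
 * (hypothetical parameter: 1 for single, 2 for double buffering). */
size_t wb_footprint_bytes(size_t h, size_t v, size_t copies)
{
    return copies * (wb_in_bytes(h, v) + wb_out_bytes(h, v));
}

/* Returns 1 if the buffers fit after setting aside `reserved` bytes
 * for code and stack (an assumed allowance), 0 otherwise. */
int wb_fits(size_t h, size_t v, size_t copies, size_t reserved)
{
    assert(reserved <= LOCAL_STORE_BYTES);
    return wb_footprint_bytes(h, v, copies) <= LOCAL_STORE_BYTES - reserved;
}
```

For example, with hypothetical values H = 64 and V = 16 and 4-byte floats, the input buffer is 8 KB and the output buffer is 4 KB, so double buffering needs 24 KB of local store.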
This section shows how work blocks are created. After the program has created
the work block, it describes the input and output associated with each work block.
Each work block contains the input description for blocks in the input matrices of
size H * V starting at location
matrix[row][0] with H and
V representing the horizontal and vertical dimensions
of the block.
In this example, assume that the accelerator memory can contain the two input
buffers of size H * V elements, and assume that the
output buffer is of size H * V. The program calls
alf_wb_enqueue() to add the work block to the queue to
be processed. ALF employs an immediate runtime mode. As soon as the first work
block is added to the queue, the task starts processing the work block. The
function alf_task_finalize closes the work block queue.
alf_wb_handle_t wb_handle;
add_parms_t parm __attribute__((aligned(128)));

parm.h = H; /* horizontal size of the block */
parm.v = V; /* vertical size of the block */

/* creating work blocks and adding parameter & io buffer */
for (i = 0; i < NUM_ROW; i += H) {
    alf_wb_create(task_handle, ALF_WB_SINGLE, 0, &wb_handle);

    /* begins a new Data Transfer List for INPUT */
    alf_wb_dtl_set_begin(wb_handle, ALF_BUF_IN, 0);
    /* Add H*V elements of matrix_a as Input */
    alf_wb_dtl_set_entry_add(wb_handle, &matrix_a[i][0], H * V, ALF_DATA_FLOAT);
    /* Add H*V elements of matrix_b as Input */
    alf_wb_dtl_set_entry_add(wb_handle, &matrix_b[i][0], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_set_end(wb_handle);

    /* begins a new Data Transfer List for OUTPUT */
    alf_wb_dtl_set_begin(wb_handle, ALF_BUF_OUT, 0);
    /* Add H*V elements of matrix_c as Output */
    alf_wb_dtl_set_entry_add(wb_handle, &matrix_c[i][0], H * V, ALF_DATA_FLOAT);
    alf_wb_dtl_set_end(wb_handle);

    /* pass parameters H and V to the SPU */
    alf_wb_parm_add(wb_handle, (void *) (&parm), sizeof(parm), ALF_DATA_BYTE, 0);

    /* enqueuing work block */
    alf_wb_enqueue(wb_handle);
}
alf_task_finalize(task_handle);
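To make the role of the data transfer lists more concrete, the following sketch simulates them in plain C with no ALF dependency. It assumes that an input list is gathered, entry by entry, into one contiguous accelerator-side buffer, and that an output list is scattered back to host memory the same way. The type `dtl_entry_t` and the functions `dtl_gather`, `dtl_scatter`, and `simulate_work_blocks` are hypothetical stand-ins; each matrix is treated as a flat array, and `block` plays the role of H * V.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one data transfer list entry. */
typedef struct {
    float *addr;   /* location in host memory */
    size_t count;  /* number of floats to move */
} dtl_entry_t;

/* Gather every entry, in order, into one contiguous buffer.
 * Returns the number of floats copied. */
size_t dtl_gather(const dtl_entry_t *dtl, size_t n, float *buf)
{
    size_t off = 0;
    for (size_t k = 0; k < n; k++) {
        memcpy(buf + off, dtl[k].addr, dtl[k].count * sizeof(float));
        off += dtl[k].count;
    }
    return off;
}

/* Scatter a contiguous buffer back to the host addresses in the list.
 * Returns the number of floats copied. */
size_t dtl_scatter(const dtl_entry_t *dtl, size_t n, const float *buf)
{
    size_t off = 0;
    for (size_t k = 0; k < n; k++) {
        memcpy(dtl[k].addr, buf + off, dtl[k].count * sizeof(float));
        off += dtl[k].count;
    }
    return off;
}

/* Simulate host data partitioning of c = a + b.
 * in_buf must hold 2 * block floats, out_buf must hold block floats. */
void simulate_work_blocks(float *a, float *b, float *c, size_t total,
                          size_t block, float *in_buf, float *out_buf)
{
    assert(block > 0 && total % block == 0);
    for (size_t start = 0; start < total; start += block) {
        /* One "work block": two input entries and one output entry. */
        dtl_entry_t in_dtl[2] = { { a + start, block }, { b + start, block } };
        dtl_entry_t out_dtl[1] = { { c + start, block } };

        dtl_gather(in_dtl, 2, in_buf);

        /* Stand-in for the computational kernel: the A block sits at the
         * start of the input buffer and the B block right after it. */
        for (size_t e = 0; e < block; e++)
            out_buf[e] = in_buf[e] + in_buf[block + e];

        dtl_scatter(out_dtl, 1, out_buf);
    }
}
```

The simulation runs the blocks one after another on the host. It is only meant to show how the two input entries end up back to back in a single input buffer.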
After all the work blocks are on the processing queue, the program waits for the
accelerator to finish processing the work blocks. Then
alf_exit() is called to cleanly exit the ALF runtime
environment.
/* waiting for all work blocks to be done */
alf_task_wait(task_handle, -1);

/* exit ALF runtime */
alf_exit(alf_handle, ALF_EXIT_WAIT, -1);
Preparing the accelerator side
On the accelerator side, you need to provide the actual computational kernel
that computes the addition of the two blocks of matrices. The ALF runtime on the
accelerator is responsible for getting the input buffer to the accelerator memory
before it runs the user-provided alf_accel_comp_kernel
function. After alf_accel_comp_kernel returns, the ALF
runtime is responsible for getting the output data back to host memory space.
Double-buffering or triple-buffering is employed as appropriate to ensure that the
latency for the input buffer to get into accelerator memory and the output buffer
to get to host memory space is well covered with computation.
int alf_accel_comp_kernel(void *p_task_context,
                          void *p_parm_context,
                          void *p_input_buffer,
                          void *p_output_buffer,
                          void *p_inout_buffer,
                          unsigned int current_count,
                          unsigned int total_count)
{
    unsigned int i, cnt;
    vector float *sa, *sb, *sc;
    add_parms_t *p_parm = (add_parms_t *)p_parm_context;

    cnt = p_parm->h * p_parm->v / 4;
    sa = (vector float *) p_input_buffer;
    sb = sa + cnt;
    sc = (vector float *) p_output_buffer;

    for (i = 0; i < cnt; i += 4) {
        sc[i] = spu_add(sa[i], sb[i]);
        sc[i + 1] = spu_add(sa[i + 1], sb[i + 1]);
        sc[i + 2] = spu_add(sa[i + 2], sb[i + 2]);
        sc[i + 3] = spu_add(sa[i + 3], sb[i + 3]);
    }
    return 0;
}
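The kernel above counts in four-float vectors (`cnt` is h * v / 4) and unrolls its loop four times, so as written it appears to expect h * v to be a multiple of 16. When you test on the host, a scalar reference with the same buffer layout can help you check results. The version below is a minimal sketch: `ref_parms_t` is a hypothetical mirror of `add_parms_t` (the real definition lives in the sample sources), and `ref_add_kernel` is an illustrative name, not an ALF function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the sample's add_parms_t. */
typedef struct {
    unsigned int h;  /* horizontal size of the block */
    unsigned int v;  /* vertical size of the block */
} ref_parms_t;

/* Scalar reference for the SPE kernel. The input buffer holds the A block
 * followed by the B block; the output buffer receives the C block.
 * Returns 0 on success, or -1 if h * v is not a multiple of 16, which is
 * the case the unrolled vector loop does not cover. */
int ref_add_kernel(const ref_parms_t *parm, const float *in, float *out)
{
    assert(parm != NULL && in != NULL && out != NULL);

    unsigned int n = parm->h * parm->v;  /* floats per block */
    if (n % 16 != 0)
        return -1;

    const float *a = in;      /* first n floats: block of A */
    const float *b = in + n;  /* next n floats: block of B */
    for (unsigned int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
    return 0;
}
```

Comparing the output of this function with the data returned by the accelerator is one simple way to catch layout mistakes, such as swapped input entries in the data transfer list.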
Exploring the accelerator data partitioning example
It's a trick. You're really finished. The code remains the same on the host,
except for the work block creation. Oh, also, the code needs to specify that it
uses accelerator data partitioning in the task descriptor. An implementation for
the alf_accel_input_dtl_prepare and
alf_accel_output_dtl_prepare functions is required.
For a complete listing of this sample, refer to the ALF samples directory
matrix_add/common/accel_partitioning.
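The idea behind accelerator data partitioning can be sketched without the ALF headers. In the plain C model below, the host sends only a small parameter block per work block, and the accelerator side turns those parameters into transfer entries. Everything in it is a labeled assumption: `part_parms_t`, `part_entry_t`, `prepare_input_list`, and `prepare_output_list` are hypothetical names that stand in for the role of the `alf_accel_input_dtl_prepare` and `alf_accel_output_dtl_prepare` functions, and they do not reproduce their actual signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-work-block parameters sent by the host. */
typedef struct {
    float *a;      /* base address of matrix A in host memory */
    float *b;      /* base address of matrix B in host memory */
    float *c;      /* base address of matrix C in host memory */
    size_t start;  /* first element of this block */
    size_t count;  /* elements in this block (H * V) */
} part_parms_t;

/* Hypothetical transfer entry: where to move data and how much. */
typedef struct {
    float *addr;
    size_t count;
} part_entry_t;

/* Accelerator-side stand-in: build the input list from the parameters.
 * Returns the number of entries written. */
size_t prepare_input_list(const part_parms_t *p, part_entry_t *list, size_t cap)
{
    assert(cap >= 2);
    list[0] = (part_entry_t){ p->a + p->start, p->count };  /* block of A */
    list[1] = (part_entry_t){ p->b + p->start, p->count };  /* block of B */
    return 2;
}

/* Accelerator-side stand-in: build the output list from the parameters.
 * Returns the number of entries written. */
size_t prepare_output_list(const part_parms_t *p, part_entry_t *list, size_t cap)
{
    assert(cap >= 1);
    list[0] = (part_entry_t){ p->c + p->start, p->count };  /* block of C */
    return 1;
}
```

In this model the host loop no longer builds transfer lists. It only fills in `start` and `count` for each work block, which moves the partitioning work onto the accelerators.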
Learn
- Refer to Accelerated Library Framework for Cell Broadband Engine Programmer's Guide and API Reference for the source material from which this article was extracted.
Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.




