The overlapped input and output buffer (overlapped I/O buffer) is a work block buffer that contains both input and output data. The input and output sections are dynamically designated for each work block. This buffer is especially useful when you want to maximize the use of accelerator memory and when the input buffer can be overwritten by the output data.
The data for the input buffer can come from distinct sections of a large data set in host memory. These distinct data segments are gathered into the input data buffer on the accelerators. The ALF framework minimizes performance overhead by not duplicating input data unnecessarily.
The output data buffer is a single, contiguous buffer in the memory of the
accelerator. Output data can be transferred to distinct memory segments within a
large output buffer in host memory. After the compute kernel returns from
processing one work block, the data in this buffer is moved to the host memory
locations specified by the alf_wb_dtl_entry_add routine
when the work block is constructed.
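The gather and scatter behavior described above can be pictured with a small host-only sketch. It does not use the ALF API: `segment_t`, `gather`, and `scatter` are hypothetical names introduced purely for illustration, and plain `memcpy` stands in for the data transfers that the framework would perform between host and accelerator memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one host memory segment (not an ALF type). */
typedef struct {
    int   *host_addr;  /* start of the segment in host memory */
    size_t count;      /* number of int elements in the segment */
} segment_t;

/* Gather: copy each host segment, in order, into one contiguous buffer.
 * Returns the number of elements written to buf. */
size_t gather(int *buf, size_t buf_cap, const segment_t *segs, size_t nsegs)
{
    size_t used = 0;
    for (size_t s = 0; s < nsegs; s++) {
        assert(used + segs[s].count <= buf_cap);
        memcpy(buf + used, segs[s].host_addr, segs[s].count * sizeof(int));
        used += segs[s].count;
    }
    return used;
}

/* Scatter: copy consecutive pieces of the contiguous buffer back out to
 * distinct host segments. Returns the number of elements read from buf. */
size_t scatter(const int *buf, size_t buf_len, const segment_t *segs, size_t nsegs)
{
    size_t used = 0;
    for (size_t s = 0; s < nsegs; s++) {
        assert(used + segs[s].count <= buf_len);
        memcpy(segs[s].host_addr, buf + used, segs[s].count * sizeof(int));
        used += segs[s].count;
    }
    return used;
}
```

In this sketch the order of the segment list determines where each piece lands in the contiguous buffer, which is the role the sequence of alf_wb_dtl_entry_add calls plays when a work block is constructed.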
The following two simple examples show the usage of overlapped I/O buffers. Both examples perform matrix addition.
- Implementation 1 implements C = A + B, in which A,
B, and C are different matrices. Three separate matrices, a, b, and c,
are allocated on the host (Figure 1).
Figure 1. Overlapped I/O buffer implementing C = A + B
- Implementation 2 implements A = A + B, in which matrix A
is overwritten by the result. Storage is reserved on the host for matrix
a and matrix b only. The result of a + b is stored in matrix
a (Figure 2).
Figure 2. Overlapped I/O buffer implementing A = A + B
The code is similar to the matrix_add example in
"Fun with ALF, Part 1: Adding large matrices together."
Listing 1 shows only the relevant code.
Listing 1. Setting up the matrix
/* ---------------------------------------------- */
/* matrix declaration for the two cases */
/* ---------------------------------------------- */
#ifdef C_A_B // C = A + B
alf_data_int32_t mat_a[ROW_SIZE][COL_SIZE]; // the matrix a
alf_data_int32_t mat_b[ROW_SIZE][COL_SIZE]; // the matrix b
alf_data_int32_t mat_c[ROW_SIZE][COL_SIZE]; // the matrix c
#else // A = A + B
alf_data_int32_t mat_a[ROW_SIZE][COL_SIZE]; // the matrix a
alf_data_int32_t mat_b[ROW_SIZE][COL_SIZE]; // the matrix b
#endif
Listing 2 shows the work block creation process for the two cases.
Listing 2. Creating the work block
for (i = 0; i < ROW_SIZE; i += PART_SIZE) {
    if (i + PART_SIZE <= ROW_SIZE)
        wb_parm.num_data = PART_SIZE;
    else
        wb_parm.num_data = ROW_SIZE - i;
    alf_wb_create(task_handle, ALF_WB_SINGLE, 0, &wb_handle);
#ifdef C_A_B // C = A + B
    // the input data A and B
    alf_wb_dtl_begin(wb_handle, ALF_BUF_OVL_IN, 0); // offset at 0
    alf_wb_dtl_entry_add(wb_handle, &mat_a[i][0], wb_parm.num_data*COL_SIZE,
                         ALF_DATA_INT32); // A
    alf_wb_dtl_entry_add(wb_handle, &mat_b[i][0], wb_parm.num_data*COL_SIZE,
                         ALF_DATA_INT32); // B
    alf_wb_dtl_end(wb_handle);
    // the output data C is overlapped with input data A
    // offset at 0, this is overlapped with A
    alf_wb_dtl_begin(wb_handle, ALF_BUF_OVL_OUT, 0);
    alf_wb_dtl_entry_add(wb_handle, &mat_c[i][0], wb_parm.num_data*COL_SIZE,
                         ALF_DATA_INT32); // C
    alf_wb_dtl_end(wb_handle);
#else // A = A + B
    // the input and output data A
    alf_wb_dtl_begin(wb_handle, ALF_BUF_OVL_INOUT, 0); // offset 0
    alf_wb_dtl_entry_add(wb_handle, &mat_a[i][0], wb_parm.num_data*COL_SIZE,
                         ALF_DATA_INT32); // A
    alf_wb_dtl_end(wb_handle);
    // the input data B is placed after A
    alf_wb_dtl_begin(wb_handle, ALF_BUF_OVL_IN,
                     wb_parm.num_data*COL_SIZE*sizeof(alf_data_int32_t));
    alf_wb_dtl_entry_add(wb_handle, &mat_b[i][0], wb_parm.num_data*COL_SIZE,
                         ALF_DATA_INT32); // B
    alf_wb_dtl_end(wb_handle);
#endif
    alf_wb_parm_add(wb_handle, (void *)&wb_parm, sizeof(wb_parm)/sizeof(unsigned int),
                    ALF_DATA_INT32, 0);
    alf_wb_enqueue(wb_handle);
}
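The arithmetic in the work block loop can be checked on its own, without an ALF installation. The following is a minimal sketch under stated assumptions: `rows_in_block`, `num_blocks`, and `offset_of_b` are hypothetical helper names that do not appear in the ALF API, and `int32_t` stands in for alf_data_int32_t on the assumption that the latter is a 32-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rows assigned to the work block that starts at row i. Mirrors the
 * if/else at the top of the loop: every block gets part_size rows
 * except possibly the last, which gets the remainder. */
unsigned int rows_in_block(unsigned int i, unsigned int row_size,
                           unsigned int part_size)
{
    assert(part_size > 0 && i < row_size);
    return (i + part_size <= row_size) ? part_size : row_size - i;
}

/* Number of work blocks the loop creates (ceiling division). */
unsigned int num_blocks(unsigned int row_size, unsigned int part_size)
{
    assert(part_size > 0);
    return (row_size + part_size - 1) / part_size;
}

/* Byte offset at which B starts in the overlapped buffer for the
 * A = A + B case: B is placed directly after the rows of A. */
size_t offset_of_b(unsigned int num_rows, unsigned int col_size)
{
    return (size_t)num_rows * col_size * sizeof(int32_t);
}
```

For instance, with 10 rows and a partition size of 4, this sketch yields three work blocks of 4, 4, and 2 rows. Note that the offset is expressed in bytes, whereas the sizes passed to alf_wb_dtl_entry_add in the listing are element counts.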
Setting up the accelerator code
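Listing 3 shows the accelerator code. In both cases, the output
pointer sc is set to the same location in accelerator memory as sa,
so each result overwrites the corresponding element of A in the
overlapped buffer. The data for B follows A, at sb.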
Listing 3. The accelerator code
/* ---------------------------------------------- */
/* the accelerator side code */
/* ---------------------------------------------- */
/* the computation kernel function */
int comp_kernel(void *p_task_context, void *p_parm_ctx_buffer,
                void *p_input_buffer, void *p_output_buffer,
                void *p_inout_buffer, unsigned int current_count,
                unsigned int total_count)
{
    unsigned int i, cnt;
    int *sa, *sb, *sc;
    /* the work block parameters arrive in p_parm_ctx_buffer */
    my_wb_parms_t *p_parm = (my_wb_parms_t *) p_parm_ctx_buffer;
    cnt = p_parm->num_data * COL_SIZE;
    sa = (int *) p_inout_buffer; /* A starts at offset 0 */
    sb = sa + cnt;               /* B follows A */
    sc = sa;                     /* the output overlaps A */
    for (i = 0; i < cnt; i++)
        sc[i] = sa[i] + sb[i];
    return 0;
}
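To see why writing the result over A is safe, the kernel's loop can be run against an ordinary array on the host. This is a sketch only: `add_in_place` is a hypothetical name, and the array stands in for the overlapped I/O buffer, with cnt elements of A followed by cnt elements of B.

```c
#include <assert.h>
#include <stddef.h>

/* Same arithmetic as the kernel loop, applied to a plain array.
 * buf holds cnt elements of A followed by cnt elements of B.
 * The sum overwrites the A region; the B region is left untouched. */
void add_in_place(int *buf, size_t cnt)
{
    assert(buf != NULL);
    int *sa = buf;        /* A starts at offset 0 */
    int *sb = buf + cnt;  /* B follows A */
    int *sc = sa;         /* the output overlaps A */
    for (size_t i = 0; i < cnt; i++)
        sc[i] = sa[i] + sb[i];
}
```

Each element sa[i] is read in the same iteration that writes sc[i], and no later iteration reads it again, so overwriting the input loses nothing the computation still needs. A kernel whose output element depended on input elements at other indices would not have this property and could not reuse the buffer so directly.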
This article described two implementations you can use to do matrix addition with overlapped I/O buffers.
Learn
- Refer to Accelerated Library Framework for Cell Broadband Engine Programmer's Guide and API Reference for the source material from which this article was extracted.
- Check out other articles in this Fun
with ALF series.
- Take a look at these other ALF-related
quick-read guides:
- "Introducing ALF."
- "10 major ALF concepts."
- "Programming with ALF: Basic ALF application structure."
- "Programming with ALF: Double buffering."
- "Programming with ALF: Handling ALF constraints."
- "Programming with ALF: Optimizing ALF applications."
- "ALF and hybrid x86."
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- "Programming high-performance applications on the Cell/B.E. processor"
- "PS3 fab-to-lab"
- "The little broadband engine that could"
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.

Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.



