Level: Introductory
Kane Scarlett (kane@us.ibm.com), Editor, Multicore acceleration, IBM
09 May 2008
In this Cell Broadband Engine™ (Cell/B.E.) series, learn how to use the Accelerated Library Framework (ALF) bundled work block distribution and the task context to manage situations in which the work block cannot hold the partitioned data because of a local memory size limit. The "ALF for Cell/B.E. Programmer's Guide and API Reference, Version 3.0" (see Resources) is the source for the content.
Introduction
In this article, you'll learn how to use the ALF bundled work block
distribution and the task context to handle situations where the work
block cannot hold the partitioned data because of a local memory size limit.
More fun with ALF
Look for more in the Fun with ALF series:
- In "Fun with ALF, Part 1: Adding large matrices together" (developerWorks, March 2008), see how to use ALF to add two large matrices together (with an example for host data partitioning and for accelerator data partitioning).
- In "Fun with ALF, Part 2: Converting I/O" (developerWorks, March 2008), learn how the task context buffer is used as a large lookup table to convert the 16-bit input data to 8-bit output data.
- In "Fun with ALF, Part 3: Finding minimum and maximum values" (developerWorks, April 2008), discover how you can use the task context to keep the partial computing results for each task instance and then combine these partial results into the final result.
- In "Overlapped I/O buffer," take a look at using overlapped I/O buffers to do matrix addition.
- In "Task dependency," check out a simple simulation in which you can use task dependency in a two-stage pipeline application.
And watch for similar "Fun with" series addressing DaCS, BLAS, and other technologies to make your Cell/B.E. programming easier.
The example calculates the dot product of two lists of large vectors as:
Figure 1. The dot product of two lists of large vectors
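The figure itself is not reproduced in this copy of the article. As a reminder of the standard definition it most likely depicts, the dot product of each vector pair can be written as below. The symbols m (number of vector pairs), n (vector dimension), and c_i (the result for pair i) are assumptions introduced here for illustration, not names taken from the original figure.

```latex
c_i = A_i \cdot B_i = \sum_{j=0}^{n-1} A_i[j] \, B_i[j], \qquad i = 0, 1, \ldots, m-1
```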
The dot product requires accumulating the element-wise products of the two vectors. When a single work block can hold all the data for vectors Ai and Bi, the calculation is straightforward. However, when the vectors are too big to fit into a single work block, the straightforward approach does not work.
For example, with the Cell/B.E. processor, there is only 256KB of local memory on each SPE. It is impossible to store two double-precision vectors when the dimension exceeds 16384, because at that size the two vectors alone (16384 * 8 * 2 bytes) already fill the entire 256KB. And, if you consider the extra memory needed for double buffering, code storage, and so on, you are able to handle only two vectors of 7500 double-precision floating-point elements each (7500 * 8 [size of double] * 2 [two vectors] * 2 [double buffer] = 240,000 bytes, or about 234KB of local storage). In this case, large vectors must be partitioned into multiple work blocks, and each work block can return only a partial result of the complete dot product.
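The arithmetic above can be checked with a small helper. This is only a sketch: the 256KB figure comes from the text, while the function names `buffer_bytes` and `max_elems` are hypothetical and are not part of ALF. It also assumes an 8-byte `double`, as the text does.

```c
#include <stddef.h>

/* SPE local store size quoted in the text: 256KB. */
#define LOCAL_STORE_BYTES ((size_t)256 * 1024)

/* Bytes needed to hold `vectors` double-precision vectors of `elems`
 * elements each, kept in `buffers` copies (2 when double buffering). */
size_t buffer_bytes(size_t elems, size_t vectors, size_t buffers)
{
    return elems * sizeof(double) * vectors * buffers;
}

/* Largest per-vector element count whose buffers fit in `budget` bytes. */
size_t max_elems(size_t budget, size_t vectors, size_t buffers)
{
    return budget / (sizeof(double) * vectors * buffers);
}
```

For instance, `buffer_bytes(7500, 2, 2)` reproduces the 240,000-byte figure from the text, and `buffer_bytes(16384, 2, 1)` equals `LOCAL_STORE_BYTES`, which is why nothing else fits at that dimension.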
You could accumulate the partial results of these work blocks on the host to get the final result. But this is not an elegant solution, and it also costs performance. A better solution is to do these accumulations on the accelerators and to do them in parallel.
ALF provides the following two implementations for this problem:
- Task context combined with bundled work block distribution.
- Multi-use work blocks combined with either the task context or the work block parameter/context buffers.
The source code for the two implementations (so you can compare them) comes with the samples in the following directories:
- Task context with bundled work block distribution: the task_context/dot_prod directory.
- Multi-use work blocks with the task context or the work block parameter/context buffers: the task_context/dot_prod_multi directory.
Making use of task context and bundled work block distribution
For this implementation, all the work blocks of a single vector are put into a
bundle. All the work blocks in a single bundle are assigned to one task instance
in the order of enqueuing. This means it is possible to use the task context to
accumulate the intermediate results and write out the final result when the last
work block is processed.
The accumulator in the task context is reset to zero each time a new work block bundle starts.
When the last work block in the bundle is processed, the accumulated value in
the task context is copied to the output buffer and then written back to the
result area.
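The accumulate-then-flush pattern described above can be sketched in plain C. This is a host-only simulation under stated assumptions, not the ALF API: `task_context_t`, `process_work_block`, and `dot_product_bundled` are hypothetical names, and the loop in `dot_product_bundled` merely stands in for the runtime handing the work blocks of one bundle to one task instance in enqueue order. The actual sample is in the task_context/dot_prod directory.

```c
#include <stddef.h>

/* Hypothetical task context: the running sum for the current bundle. */
typedef struct {
    double accumulator;
} task_context_t;

/* Hypothetical per-work-block kernel. `index` is this work block's
 * position inside its bundle and `count` is the bundle size.
 * Returns 1 when the final result was written to *out, else 0. */
int process_work_block(task_context_t *ctx,
                       const double *a, const double *b, size_t n,
                       size_t index, size_t count, double *out)
{
    size_t i;

    if (index == 0)
        ctx->accumulator = 0.0;          /* a new bundle starts: reset */

    for (i = 0; i < n; i++)
        ctx->accumulator += a[i] * b[i]; /* partial dot product */

    if (index + 1 == count) {            /* last work block in bundle */
        *out = ctx->accumulator;
        return 1;
    }
    return 0;
}

/* Stand-in for the runtime: splits one vector pair into work blocks of
 * at most `block_elems` elements and processes them in order. */
double dot_product_bundled(const double *a, const double *b,
                           size_t n, size_t block_elems)
{
    task_context_t ctx = { 0.0 };
    double result = 0.0;
    size_t count, k;

    if (n == 0 || block_elems == 0)
        return 0.0;

    count = (n + block_elems - 1) / block_elems;
    for (k = 0; k < count; k++) {
        size_t start = k * block_elems;
        size_t len = (n - start < block_elems) ? n - start : block_elems;
        process_work_block(&ctx, a + start, b + start, len,
                           k, count, &result);
    }
    return result;
}
```

With a = {1, 2, 3, 4, 5}, b = {5, 4, 3, 2, 1}, and a block size of 2, the three work blocks contribute 13, 17, and 5, and only the last one writes the total of 35 to the output. The ordering guarantee within a bundle is what makes this safe; without it, the reset and the final write could not be tied to the first and last work block.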
Figure 2 shows a schematic of this implementation.
Figure 2. Making use of task context and bundled work block distribution
Making use of multi-use work blocks together with task context or work block parameter and context buffers
The second implementation is based on multi-use work blocks and work block
parameter and context buffers. A multi-use work block is similar to an iteration
operation. The accelerator-side runtime repeatedly processes the work block until
it reaches the provided number of iterations. By using accelerator-side data
partitioning, it is possible to access different input data during each iteration
of the work block.
This means that the application can handle data larger than a single work block can hold (due to local storage limitations). Also, the parameter and context buffer of the multi-use work block is retained across the iterations, so you can choose to keep the accumulator in this buffer instead of in the task context buffer.
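The iteration scheme can be sketched the same way. Again, this is a host-only simulation and not the ALF API: `wb_parm_ctx_t`, `select_slice`, `process_iteration`, and `dot_product_multi_use` are hypothetical names. `select_slice` stands in for accelerator-side data partitioning (choosing which part of the vectors to fetch for a given iteration), and the accumulator is kept in the work block's own parameter and context buffer. The actual sample is in the task_context/dot_prod_multi directory.

```c
#include <stddef.h>

/* Hypothetical parameter/context buffer of one multi-use work block.
 * It is retained across iterations, so it can carry the accumulator. */
typedef struct {
    const double *a;        /* stand-in for vector A in main memory */
    const double *b;        /* stand-in for vector B in main memory */
    size_t total_elems;     /* full vector dimension */
    size_t elems_per_iter;  /* elements that fit in one iteration */
    double accumulator;     /* running sum across iterations */
} wb_parm_ctx_t;

/* Stand-in for accelerator-side data partitioning: pick the slice of
 * the vectors that iteration `iter` should process. */
void select_slice(const wb_parm_ctx_t *p, size_t iter,
                  size_t *start, size_t *len)
{
    size_t rest;

    *start = iter * p->elems_per_iter;
    if (*start >= p->total_elems) {
        *start = p->total_elems;
        *len = 0;
        return;
    }
    rest = p->total_elems - *start;
    *len = (rest < p->elems_per_iter) ? rest : p->elems_per_iter;
}

/* One iteration of the multi-use work block.
 * Returns 1 when the final result was written to *out, else 0. */
int process_iteration(wb_parm_ctx_t *p, size_t iter, size_t total_iters,
                      double *out)
{
    size_t start, len, i;

    if (iter == 0)
        p->accumulator = 0.0;            /* first iteration: reset */

    select_slice(p, iter, &start, &len);
    for (i = 0; i < len; i++)
        p->accumulator += p->a[start + i] * p->b[start + i];

    if (iter + 1 == total_iters) {       /* last iteration: flush */
        *out = p->accumulator;
        return 1;
    }
    return 0;
}

/* Stand-in for the runtime repeating the work block until it reaches
 * the requested number of iterations. */
double dot_product_multi_use(const double *a, const double *b,
                             size_t n, size_t elems_per_iter)
{
    wb_parm_ctx_t p;
    double result = 0.0;
    size_t total_iters, iter;

    if (n == 0 || elems_per_iter == 0)
        return 0.0;

    p.a = a;
    p.b = b;
    p.total_elems = n;
    p.elems_per_iter = elems_per_iter;
    p.accumulator = 0.0;

    total_iters = (n + elems_per_iter - 1) / elems_per_iter;
    for (iter = 0; iter < total_iters; iter++)
        process_iteration(&p, iter, total_iters, &result);
    return result;
}
```

With a = {1, 2, 3, 4, 5, 6, 7}, every element of b equal to 2, and three elements per iteration, the iterations contribute 12, 30, and 14, for a total of 56. Compared with the bundled approach, one work block per vector pair is enough here, because the iteration count rather than the number of enqueued work blocks covers the full vector.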
Figure 3 shows a schematic of this implementation.
Figure 3. Making use of multi-use work blocks together with task context or work block parameter and context buffers
Conclusion
Both implementations, using the task context and using multi-use work blocks, are equally valid.
Resources
About the author
Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.