In this article, you'll learn how to use the ALF bundled work block distribution and the task context to handle situations where the work block cannot hold the partitioned data because of a local memory size limit.
The example calculates the dot product of two lists of large vectors as:
Figure 1. The dot product of two lists of large vectors
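Figure 1's equation did not survive extraction; for vectors Ai and Bi of dimension m, the computation it presumably shows is:

```latex
r_i = A_i \cdot B_i = \sum_{j=1}^{m} a_{i,j}\, b_{i,j}, \qquad i = 1, \dots, n
```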
The dot product requires the element-wise products of the vectors to be accumulated. When a single work block can hold all the data for vectors Ai and Bi, the calculation is straightforward. However, when the vectors are too big to fit into a single work block, the straightforward approach does not work.
For example, with the Cell/B.E. processor, there is only 256KB of local memory on each SPE. It is impossible to store two double-precision vectors when the dimension exceeds 16384. And, if you consider the extra memory needed for double buffering, code storage, and so on, you are able to handle only two vectors of 7500 double-precision floating-point elements each (7500 elements × 8 bytes per double × 2 vectors × 2 for double buffering is approximately 240KB of local storage). In this case, large vectors must be partitioned into multiple work blocks, and each work block can return only a partial result of the complete dot product.
You can choose to accumulate the partial results of these work blocks on the host to get the final result. But this is not an elegant solution, and it also hurts performance. A better solution is to perform these accumulations on the accelerators, in parallel.
ALF provides two implementations for this problem:
- Task context with bundled work block distribution.
- Multi-use work blocks with the task context or the work block parameter/context buffer.
The source code for both implementations ships with the samples, so you can compare them:
- Task context with bundled work block distribution: the task_context/dot_prod directory.
- Multi-use work blocks with the task context or work block parameter/context buffer: the task_context/dot_prod_multi directory.
In the first implementation, all the work blocks of a single vector are put into one bundle. All the work blocks in a bundle are assigned to the same task instance in the order they were enqueued, which makes it possible to use the task context to accumulate the intermediate results and write out the final result when the last work block is processed.
The accumulator in the task context is initialized to zero each time a new work block bundle starts.
When the last work block in the bundle is processed, the accumulated value in the task context is copied to the output buffer and then written back to the result area.
Figure 2 shows a schematic of this implementation.
Figure 2. Making use of task context and bundled work block distribution
The second implementation is based on multi-use work blocks and the work block parameter and context buffer. A multi-use work block is similar to an iteration operation: the accelerator-side runtime repeatedly processes the work block until the specified number of iterations is reached. With accelerator-side data partitioning, each iteration of the work block can access different input data.
This means the application can handle data larger than a single work block allows (due to local storage limitations). Also, the parameter and context buffer of a multi-use work block is retained across iterations, so you can keep the accumulator in this buffer instead of in the task context buffer.
Figure 3 shows a schematic of this implementation.
Figure 3. Making use of multi-use work blocks together with task context or work block parameter and context buffers
Both implementations—using the task context and using multi-use work blocks—are equally valid.
- Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Refer to Accelerated Library Framework for Cell Broadband Engine Programmer’s Guide and API Reference for the source material from which this article was extracted.
- Check out other articles in this Fun with ALF series.
- Take a look at these other ALF-related articles:
- "Introducing ALF."
- "10 major ALF concepts."
- "Programming with ALF: Basic ALF application structure."
- "Programming with ALF: Double buffering."
- "Programming with ALF: Handling ALF constraints."
- "Programming with ALF: Optimizing ALF applications."
- "ALF and hybrid x86."
- To learn more about Cell/B.E. programming, try these articles:
- "Programming high-performance applications on the Cell/B.E. processor"
- "PS3 fab-to-lab"
- "The little broadband engine that could"
- Refer to the Cell Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals, specifications, and more.
- Sign up for the developerWorks newsletter and get the latest developer news and Cell/B.E. happenings delivered to your inbox each week. Check Power Architecture® when you sign up to receive Cell/B.E. news in your newsletter.
Get products and technologies
- Find all Cell/B.E.-related articles, discussion forums, downloads, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell/B.E.
- Contact IBM about custom Cell/B.E.-based or custom-processor-based solutions.
- Get your copy of the IBM SDK for Multicore Acceleration 3.0 or browse through the exhaustive library of Cell/B.E. documentation.
- Participate in the discussion forum.
- Check out the Cell Broadband Engine Architecture forum to get your technical questions about the processor answered. Juicy problems and answers from the forums are rounded up periodically and highlighted in the "Forum watch" blog series.
- Go to the Power Architecture blog for news, downloads, instructional resources, and event notifications for Cell/B.E. and other Power Architecture-related technologies. You can find the popular "Forum watch" blog series (Q&A roundup), the "FixIt" technology updates, and the Infobomb quick-read
Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.