Fun with ALF, Part 4: Determining the dot product of large vectors

Examples show how to use the ALF bundled work block distribution with the task context to overcome memory limits

Level: Introductory

Kane Scarlett (kane@us.ibm.com), Editor, Multicore acceleration, IBM 

09 May 2008

In this Cell Broadband Engine™ (Cell/B.E.) series, learn how to use the Accelerated Library Framework (ALF) bundled work block distribution and the task context to manage situations in which the work block cannot hold the partitioned data because of a local memory size limit. The "ALF for Cell/B.E. Programmer's Guide and API Reference, Version 3.0" (see Resources) is the source for the content.

Introduction

In this article, you'll learn how to use the ALF bundled work block distribution and the task context to handle situations where the work block cannot hold the partitioned data because of a local memory size limit.

More fun with ALF

Look for more in the Fun with ALF series:

  • In "Fun with ALF, Part 1: Adding large matrices together" (developerWorks, March 2008), see how to use ALF to add two large matrices together (with an example for host data partitioning and for accelerator data partitioning).
  • In "Fun with ALF, Part 2: Converting I/O" (developerWorks, March 2008), learn how the task context buffer is used as a large lookup table to convert the 16-bit input data to 8-bit output data.
  • In "Fun with ALF, Part 3: Finding minimum and maximum values" (developerWorks, April 2008), discover how you can use the task context to keep the partial computing results for each task instance and then combine these partial results into the final result.
  • In "Overlapped I/O buffer," take a look at using overlapped I/O buffers to do matrix addition.
  • In "Task dependency," check out a simple simulation in which you can use task dependency in a two-stage pipeline application.
Also watch for similar "Fun with" series on DaCS, BLAS, and other technologies that make Cell/B.E. programming easier.

The example calculates the dot product of two lists of large vectors as:


Figure 1. The dot product of two lists of large vectors

The dot product requires the element-wise products of the two vectors to be accumulated. When a single work block can hold all the data for vectors Ai and Bi, the calculation is straightforward. However, when the vectors are too large to fit into a single work block, the straightforward approach does not work.
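
As a point of reference, the computation itself can be sketched in a few lines of Python. This is a plain host-side model, not ALF code. The function name `dot_products` and the reading of the formula as "result i is the sum over j of Ai[j] * Bi[j]" are assumptions made for illustration, because the figure is not reproduced here.

```python
def dot_products(list_a, list_b):
    """Return one dot product per (A_i, B_i) pair.

    Each dot product is the accumulated sum of the element-wise products.
    """
    if len(list_a) != len(list_b):
        raise ValueError("both lists must hold the same number of vectors")
    results = []
    for vec_a, vec_b in zip(list_a, list_b):
        if len(vec_a) != len(vec_b):
            raise ValueError("paired vectors must have the same dimension")
        # Multiply element by element and accumulate the products.
        results.append(sum(x * y for x, y in zip(vec_a, vec_b)))
    return results
```

This version assumes each whole vector pair is available at once, which is exactly the assumption that breaks down when the vectors outgrow a work block.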

For example, on the Cell/B.E. processor, each SPE has only 256KB of local memory, so it is impossible to store two double-precision vectors when the dimension exceeds 16384 (16384 * 8 * 2 already fills all 256KB). Once you account for the extra memory needed for double buffering, code storage, and so on, you can handle only two vectors of about 7500 double-precision floating-point elements each (7500 * 8[size of double] * 2[two vectors] * 2[double buffer] is 240,000 bytes of local storage). In this case, large vectors must be partitioned into multiple work blocks, and each work block can return only a partial result of the complete dot product.
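
The arithmetic above can be checked with a short sketch. The helper names are hypothetical, and the 100,000-element vector mentioned in the usage note is only an illustrative size, not a figure from the samples.

```python
LOCAL_STORE_BYTES = 256 * 1024  # SPE local memory, as stated in the text
SIZEOF_DOUBLE = 8               # bytes per double-precision element


def max_elements(budget_bytes, vectors=2, buffers=1):
    """Largest dimension such that `vectors` double-precision vectors,
    each held in `buffers` copies, still fit in `budget_bytes`."""
    return budget_bytes // (SIZEOF_DOUBLE * vectors * buffers)


def footprint_bytes(elements, vectors=2, buffers=2):
    """Bytes needed for `vectors` vectors of `elements` doubles,
    with `buffers` copies of each (2 models double buffering)."""
    return elements * SIZEOF_DOUBLE * vectors * buffers


def work_blocks_needed(dimension, elements_per_block):
    """Number of work blocks needed to cover one vector (ceiling division)."""
    return -(-dimension // elements_per_block)
```

Under these assumptions, `max_elements(LOCAL_STORE_BYTES)` gives 16384 and `footprint_bytes(7500)` gives 240000, which leaves a little over 22,000 bytes of the local memory for code and everything else. A hypothetical vector of 100,000 elements would need `work_blocks_needed(100000, 7500)`, that is 14, work blocks.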

You can accumulate the partial results of these work blocks on the host to get the final result, but this is not an elegant solution, and it also hurts performance. A better solution is to do the accumulation on the accelerators, in parallel.
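
To make the partitioning concrete, the following sketch models the host-side approach: each chunk stands in for one work block that returns a partial result, and the host adds the partial results together afterward. The function names and the chunking scheme are assumptions for illustration, not part of the ALF API.

```python
def partial_dot(vec_a, vec_b, start, count):
    """Partial result of one work block covering elements [start, start + count)."""
    return sum(x * y for x, y in zip(vec_a[start:start + count],
                                     vec_b[start:start + count]))


def dot_with_host_accumulation(vec_a, vec_b, block_elems):
    """Split one vector pair into work blocks, then sum the partials on the host.

    Returns the dot product and the number of work blocks that were used.
    """
    partials = [partial_dot(vec_a, vec_b, start, block_elems)
                for start in range(0, len(vec_a), block_elems)]
    # This final reduction is the step that runs on the host in this model.
    return sum(partials), len(partials)
```

In this model, every partial result has to travel back to the host before the final sum can be formed, which is the extra work that accumulating on the accelerators avoids.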

ALF provides the following two implementations for this problem:

  • One that uses the task context together with bundled work block distribution.
  • One that uses multi-use work blocks together with the task context or the work block parameter and context buffers.

The source code for both implementations comes with the samples, so you can compare them. Look in the following directories:

  • task_context/dot_prod: the implementation that uses the task context and bundled work block distribution.
  • task_context/dot_prod_multi: the implementation that uses multi-use work blocks together with the task context or the work block parameter and context buffers.

Making use of task context and bundled work block distribution

For this implementation, all the work blocks of a single vector are put into a bundle. All the work blocks in a single bundle are assigned to one task instance in the order of enqueuing. This means it is possible to use the task context to accumulate the intermediate results and write out the final result when the last work block is processed.

The accumulator in the task context is initialized to zero each time a new work block bundle starts.

When the last work block in the bundle is processed, the accumulated value in the task context is copied to the output buffer and then written back to the result area.
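
The steps described above can be modeled in plain Python. This is a simulation of the idea only and makes no ALF calls: the `WorkBlock` fields, the `first` and `last` flags, the dictionary used as a task context, and the round-robin assignment of bundles to task instances are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class WorkBlock:
    vector_index: int  # which (A_i, B_i) pair this block belongs to
    start: int         # first element covered by this block
    count: int         # number of elements covered
    first: bool        # True for the first work block of its bundle
    last: bool         # True for the last work block of its bundle


def make_bundle(vector_index, dimension, block_elems):
    """All work blocks of one vector, in enqueue order."""
    starts = list(range(0, dimension, block_elems))
    return [WorkBlock(vector_index, s, min(block_elems, dimension - s),
                      s == starts[0], s == starts[-1]) for s in starts]


def compute_kernel(task_context, wb, list_a, list_b, results):
    """Process one work block, accumulating in the task context."""
    if wb.first:
        task_context["acc"] = 0.0  # a new bundle starts: reset the accumulator
    a = list_a[wb.vector_index][wb.start:wb.start + wb.count]
    b = list_b[wb.vector_index][wb.start:wb.start + wb.count]
    task_context["acc"] += sum(x * y for x, y in zip(a, b))
    if wb.last:
        # Last block of the bundle: write the final result back.
        results[wb.vector_index] = task_context["acc"]


def run_bundled(list_a, list_b, block_elems, num_instances=2):
    """Feed each bundle, in order, to a single task instance."""
    results = [None] * len(list_a)
    contexts = [{"acc": 0.0} for _ in range(num_instances)]
    for i, vec in enumerate(list_a):
        context = contexts[i % num_instances]  # assumed round-robin assignment
        for wb in make_bundle(i, len(vec), block_elems):
            compute_kernel(context, wb, list_a, list_b, results)
    return results
```

The sketch depends on the same guarantee the text describes: because all blocks of a bundle reach one task instance in enqueue order, a single accumulator per task instance is enough, and no partial results need to leave the accelerator.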

Figure 2 shows a schematic of this implementation.


Figure 2. Making use of task context and bundled work block distribution


Making use of multi-use work blocks together with task context or work block parameter and context buffers

The second implementation is based on multi-use work blocks and work block parameter and context buffers. A multi-use work block is similar to an iteration operation. The accelerator-side runtime repeatedly processes the work block until it reaches the provided number of iterations. By using accelerator-side data partitioning, it is possible to access different input data during each iteration of the work block.

This means that the application can handle data larger than a single work block can hold (because of local storage limitations). Also, the parameter and context buffer of the multi-use work block is retained across iterations, so you can keep the accumulator in this buffer instead of in the task context buffer.
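
A comparable model of the multi-use approach follows. Again, this is a simulation under stated assumptions rather than ALF code: the `params` dictionary stands in for the work block parameter and context buffer, and `partition_input` is a hypothetical stand-in for accelerator-side data partitioning.

```python
def partition_input(params, iteration):
    """Accelerator-side data partitioning: choose the slice for this iteration."""
    start = iteration * params["block_elems"]
    count = min(params["block_elems"], params["dimension"] - start)
    return start, count


def run_multi_use_work_block(vec_a, vec_b, block_elems):
    """Process one vector pair as a single work block that is iterated.

    Returns the dot product and the number of iterations performed.
    """
    total_iterations = -(-len(vec_a) // block_elems)  # ceiling division
    # The parameter and context buffer persists across iterations,
    # so it can hold the accumulator.
    params = {"dimension": len(vec_a), "block_elems": block_elems, "acc": 0.0}
    for iteration in range(total_iterations):
        start, count = partition_input(params, iteration)
        a = vec_a[start:start + count]
        b = vec_b[start:start + count]
        params["acc"] += sum(x * y for x, y in zip(a, b))
    return params["acc"], total_iterations
```

Compared with the bundled model, the host here describes a single work block per vector pair plus an iteration count, and the slicing decision moves to the accelerator side.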

Figure 3 shows a schematic of this implementation.


Figure 3. Making use of multi-use work blocks together with task context or work block parameter and context buffers


Conclusion

Both implementations—using the task context and using multi-use work blocks—are equally valid.



Resources

Learn

Get products and technologies

Discuss


About the author

Kane Scarlett

Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.




IBM and Power Architecture are trademarks of IBM Corporation in the United States, other countries, or both. Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc. Other company, product, or service names may be trademarks or service marks of others.

