Fun with ALF, Part 4: Determining the dot product of large vectors

Examples show how to use the ALF bundled work block distribution with the task context to overcome memory limits

Kane Scarlett, Editor, Multicore acceleration, IBM
Kane Scarlett
Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.

Summary:  In this Cell Broadband Engine™ (Cell/B.E.) series, learn how to use the Accelerated Library Framework (ALF) bundled work block distribution and the task context to manage situations in which the work block cannot hold the partitioned data because of a local memory size limit. The "ALF for Cell/B.E. Programmer's Guide and API Reference, Version 3.0" (see Resources) is the source for the content.


Date:  09 May 2008
Level:  Introductory


Introduction

In this article, you'll learn how to use the ALF bundled work block distribution and the task context to handle situations where the work block cannot hold the partitioned data because of a local memory size limit.

The example calculates the dot product of two lists of large vectors as:


Figure 1. The dot product of two lists of large vectors


The dot product requires the element-wise products of the two vectors to be accumulated into a single sum. When a single work block can hold all the data for vectors Ai and Bi, the calculation is straightforward. However, when the vectors are too big to fit into a single work block, the straightforward approach does not work.
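Written out in standard notation, the computation looks roughly like the following. The symbols (N vector pairs of length M, elements a_ij and b_ij, and K work blocks covering the index sets J_1 through J_K) are assumptions introduced for illustration and may differ from those used in Figure 1.

```latex
c_i = A_i \cdot B_i = \sum_{j=1}^{M} a_{ij}\, b_{ij}, \qquad i = 1, \dots, N

% Splitting the index range 1..M into K disjoint chunks J_1, ..., J_K
% (one chunk per work block) yields partial sums that add up to the same value:
c_i = \sum_{k=1}^{K} \Bigl( \sum_{j \in J_k} a_{ij}\, b_{ij} \Bigr)
```

The second form is what makes partitioning possible: each work block computes one inner sum, and something must then add the K partial sums together.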

More fun with ALF

Look for more in the Fun with ALF series:

  • In "Fun with ALF, Part 1: Adding large matrices together" (developerWorks, March 2008), see how to use ALF to add two large matrices together (with an example for host data partitioning and for accelerator data partitioning).
  • In "Fun with ALF, Part 2: Converting I/O" (developerWorks, March 2008), learn how the task context buffer is used as a large lookup table to convert the 16-bit input data to 8-bit output data.
  • In "Fun with ALF, Part 3: Finding minimum and maximum values" (developerWorks, April 2008), discover how you can use the task context to keep the partial computing results for each task instance and then combine these partial results into the final result.
  • In "Overlapped I/O buffer," take a look at using overlapped I/O buffers to do matrix addition.
  • In "Task dependency," check out a simple simulation in which you can use task dependency in a two-stage pipeline application.
And watch for similar "Fun with" series addressing DaCS, BLAS, and other technologies to make your Cell/B.E. programming easier.

For example, on the Cell/B.E. processor, each SPE has only 256KB of local memory. It is impossible to store two double-precision vectors when the dimension exceeds 16384, because 16384 * 8[size of double] * 2[two vectors] already fills all 256KB. And, if you consider the extra memory needed for double buffering, code storage, and so on, you are able to handle only two vectors of about 7500 double-precision floating-point elements each (7500 * 8[size of double] * 2[two vectors] * 2[double buffer] equals 240,000 bytes, roughly 234KB of the 256KB local store). In this case, large vectors must be partitioned into multiple work blocks, and each work block can return only a partial result of the complete dot product.
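The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant and function names are made up for illustration, and the figures (256KB, 8-byte doubles, two vectors, double buffering, 7500 elements) are the ones quoted in the text.

```python
# Back-of-the-envelope local-store budget for one SPE.
# All names here are illustrative; only the figures come from the text.
LOCAL_STORE_BYTES = 256 * 1024   # 256KB of SPE local memory
SIZEOF_DOUBLE = 8                # bytes per double-precision element
NUM_VECTORS = 2                  # Ai and Bi

# Largest dimension at which the two vectors alone fill the local store.
max_dim_no_overhead = LOCAL_STORE_BYTES // (SIZEOF_DOUBLE * NUM_VECTORS)

def buffer_bytes(dim, double_buffered=True):
    """Bytes needed to hold both vectors of length `dim`."""
    copies = 2 if double_buffered else 1
    return dim * SIZEOF_DOUBLE * NUM_VECTORS * copies

practical = buffer_bytes(7500)
remaining = LOCAL_STORE_BYTES - practical   # left for code, stack, and so on

print(max_dim_no_overhead)   # 16384
print(practical)             # 240000
print(remaining)             # 22144
```

With double buffering, only about 22KB remains for code and other data, which is why vectors much larger than 7500 elements have to be split across work blocks.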

You can choose to accumulate the partial results of these work blocks on the host to get the final result. But this is not an elegant solution, and performance suffers because the accumulation is serialized on the host. A better solution is to do these accumulations on the accelerators and to do them in parallel.

ALF provides the following two implementations for this problem:

  • An implementation that uses the task context and bundled work block distribution.
  • An implementation that uses multi-use work blocks together with the task context or work block parameter/context buffers.

The source code for both implementations comes with the samples, so you can compare them, in the following directories:

  • The task context and bundled work block distribution implementation is in the task_context/dot_prod directory.
  • The multi-use work block implementation (with task context or work block parameter/context buffers) is in the task_context/dot_prod_multi directory.

Making use of task context and bundled work block distribution

For this implementation, all the work blocks of a single vector are put into a bundle. All the work blocks in a single bundle are assigned to one task instance in the order of enqueuing. This means it is possible to use the task context to accumulate the intermediate results and write out the final result when the last work block is processed.

The accumulator in the task context is reset to zero each time a new work block bundle starts.

When the last work block in the bundle is processed, the accumulated value in the task context is copied to the output buffer and then written back to the result area.
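The control flow described above can be sketched in plain Python. This is a conceptual simulation, not the ALF API and not the code in task_context/dot_prod: the names TaskContext, make_bundle, run_bundle, and dot_products are hypothetical, and the round-robin assignment of bundles to task instances is an assumption made only to keep the example small.

```python
# Conceptual simulation only: plain Python, not the ALF API.
# TaskContext, make_bundle, run_bundle, and dot_products are hypothetical names.

class TaskContext:
    """Stands in for the per-task-instance context buffer."""
    def __init__(self):
        self.accumulator = 0.0

def make_bundle(a, b, block_len):
    """Split one vector pair into work blocks, kept in enqueue order."""
    starts = range(0, len(a), block_len)
    return [(a[s:s + block_len], b[s:s + block_len]) for s in starts]

def run_bundle(ctx, bundle):
    """Process every work block of one bundle on a single task instance."""
    result = None
    for index, (a_part, b_part) in enumerate(bundle):
        if index == 0:
            ctx.accumulator = 0.0          # new bundle: reset the accumulator
        ctx.accumulator += sum(x * y for x, y in zip(a_part, b_part))
        if index == len(bundle) - 1:
            result = ctx.accumulator       # last block: write out the result
    return result

def dot_products(list_a, list_b, block_len, num_instances=2):
    """One bundle per vector pair; bundles are spread over task instances."""
    contexts = [TaskContext() for _ in range(num_instances)]
    results = []
    for i, (a, b) in enumerate(zip(list_a, list_b)):
        ctx = contexts[i % num_instances]  # a whole bundle goes to one instance
        results.append(run_bundle(ctx, make_bundle(a, b, block_len)))
    return results

list_a = [[1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
list_b = [[5.0, 4.0, 3.0, 2.0, 1.0], [2.0, 2.0, 2.0, 2.0, 2.0]]
print(dot_products(list_a, list_b, block_len=2))   # [35.0, 10.0]
```

The point the sketch illustrates is that correctness depends on every block of a bundle reaching the same context in order; if blocks of one vector were scattered across instances, no single accumulator would see all the partial sums.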

Figure 2 shows a schematic of this implementation.


Figure 2. Making use of task context and bundled work block distribution

Making use of multi-use work blocks together with task context or work block parameter and context buffers

The second implementation is based on multi-use work blocks and work block parameter and context buffers. A multi-use work block behaves like a loop: the accelerator-side runtime repeatedly processes the work block until it reaches the provided number of iterations. By using accelerator-side data partitioning, it is possible to access different input data during each iteration of the work block.

This means that the application can handle data larger than a single work block can hold (due to local storage limitations). Also, the parameter and context buffer of the multi-use work block is retained through the iterations, so you can choose to keep the accumulator in this buffer instead of using the task context buffer.
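The same idea can be sketched in plain Python. Again, this is a conceptual simulation rather than the ALF API or the code in task_context/dot_prod_multi: MultiUseWorkBlock, partition, and run_work_block are hypothetical names, and the dictionary param_ctx merely stands in for the work block parameter and context buffer.

```python
# Conceptual simulation only: plain Python, not the ALF API.
# MultiUseWorkBlock, partition, and run_work_block are hypothetical names.

class MultiUseWorkBlock:
    """One work block per vector pair, processed `iterations` times."""
    def __init__(self, a, b, chunk_len):
        self.a, self.b = a, b
        self.chunk_len = chunk_len
        self.iterations = -(-len(a) // chunk_len)   # ceiling division
        self.param_ctx = {"accumulator": 0.0}       # retained across iterations

def partition(wb, iteration):
    """Accelerator-side data partitioning: pick this iteration's slice."""
    start = iteration * wb.chunk_len
    stop = start + wb.chunk_len
    return wb.a[start:stop], wb.b[start:stop]

def run_work_block(wb):
    """Repeat the compute step until the iteration count is reached."""
    for iteration in range(wb.iterations):
        a_part, b_part = partition(wb, iteration)
        wb.param_ctx["accumulator"] += sum(x * y for x, y in zip(a_part, b_part))
    return wb.param_ctx["accumulator"]

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [5.0, 4.0, 3.0, 2.0, 1.0]
wb = MultiUseWorkBlock(a, b, chunk_len=2)
print(wb.iterations)        # 3
print(run_work_block(wb))   # 35.0
```

Compared with the bundled approach, the host here describes only one work block per vector pair, and the slicing moves to the accelerator side.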

Figure 3 shows a schematic of this implementation.


Figure 3. Making use of multi-use work blocks together with task context or work block parameter and context buffers

Conclusion

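Both implementations, one using the task context with bundled work block distribution and the other using multi-use work blocks, are equally valid: each accumulates the partial results on the accelerators rather than on the host.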


Resources

Learn

Get products and technologies

Discuss

