Fun with ALF, Part 4: Determining the dot product of large vectors

Examples show how to use the ALF bundled work block distribution with the task context to overcome memory limits

Kane Scarlett, Editor, Multicore acceleration, IBM
Kane Scarlett is a technology journalist/analyst with 20 years in the business, working for such publishers as National Geographic, Population Reference Bureau, Miller Freeman, and IDG, and managing, editing, and writing for such august journals as JavaWorld, LinuxWorld, and of course, developerWorks.

Summary:  In this Cell Broadband Engine™ (Cell/B.E.) series, learn how to use the Accelerated Library Framework (ALF) bundled work block distribution and the task context to manage situations in which the work block cannot hold the partitioned data because of a local memory size limit. The "ALF for Cell/B.E. Programmer's Guide and API Reference, Version 3.0" is the source for the content.

Date:  09 May 2008
Level:  Introductory

Introduction

In this article, you'll learn how to use the ALF bundled work block distribution and the task context to handle situations where the work block cannot hold the partitioned data because of a local memory size limit.

The example calculates the dot product of two lists of large vectors as:


Figure 1. The dot product of two lists of large vectors


The dot product requires the element-wise products of the two vectors to be accumulated. When a single work block can hold all the data for vectors Ai and Bi, the calculation is straightforward. However, when the vectors are too big to fit into a single work block, the straightforward approach does not work.
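As a point of reference, the straightforward case can be sketched in plain C as follows. This is a minimal illustration that does not use the ALF API; the function names and the row-by-row storage layout (vector i starting at offset i * len) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of one pair of vectors: the sum of a[j] * b[j]. */
double dot_product(const double *a, const double *b, size_t len)
{
    double acc = 0.0;
    for (size_t j = 0; j < len; j++)
        acc += a[j] * b[j];
    return acc;
}

/* Dot products for two lists of vectors stored row by row
   (assumed layout): result[i] = Ai . Bi for i in [0, count). */
void dot_product_list(const double *a, const double *b, double *result,
                      size_t count, size_t len)
{
    assert(a != NULL && b != NULL && result != NULL);
    for (size_t i = 0; i < count; i++)
        result[i] = dot_product(a + i * len, b + i * len, len);
}
```

Here each call to dot_product sees a whole vector pair at once, which is exactly what stops working when a pair no longer fits in one work block.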

More fun with ALF

Look for more in the Fun with ALF series:

  • In "Fun with ALF, Part 1: Adding large matrices together" (developerWorks, March 2008), see how to use ALF to add two large matrices together (with an example for host data partitioning and for accelerator data partitioning).
  • In "Fun with ALF, Part 2: Converting I/O" (developerWorks, March 2008), learn how the task context buffer is used as a large lookup table to convert the 16-bit input data to 8-bit output data.
  • In "Fun with ALF, Part 3: Finding minimum and maximum values" (developerWorks, April 2008), discover how you can use the task context to keep the partial computing results for each task instance and then combine these partial results into the final result.
  • In "Overlapped I/O buffer," take a look at using overlapped I/O buffers to do matrix addition.
  • In "Task dependency," check out a simple simulation in which you can use task dependency in a two-stage pipeline application.
And watch for similar "Fun with" series addressing DaCS, BLAS, and other technologies to make your Cell/B.E. programming easier.

For example, with the Cell/B.E. processor, there is only 256KB of local memory on each SPE. It is impossible to store two double-precision vectors when the dimension exceeds 16384, because 16384 * 8[size of double] * 2[two vectors] already equals the entire 256KB. And, if you consider the extra memory needed for double buffering, code storage, and so on, you are able to handle only two vectors of about 7500 double-precision floating-point elements each (7500 * 8[size of double] * 2[two vectors] * 2[double buffer] equals 240,000 bytes, or about 234KB of local storage). In this case, large vectors must be partitioned into multiple work blocks, and each work block can return only a partial result of the complete dot product.
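The arithmetic above can be checked with a short C sketch. The helper names are hypothetical, the constant simply encodes the 256KB figure from the text, and the code assumes an 8-byte double.

```c
#include <assert.h>
#include <stddef.h>

/* Local store size taken from the text: 256KB per SPE. */
enum { LOCAL_STORE_BYTES = 256 * 1024 };

/* Bytes needed to hold two vectors of `len` doubles, with `buffers`
   copies of each (1 = single buffering, 2 = double buffering). */
size_t work_block_bytes(size_t len, size_t buffers)
{
    return len * sizeof(double) * 2 * buffers;
}

/* Number of work blocks needed to cover a vector of `len` elements
   when each work block carries at most `block_len` elements. */
size_t work_blocks_needed(size_t len, size_t block_len)
{
    assert(block_len > 0);
    return (len + block_len - 1) / block_len;  /* round up */
}
```

With these helpers, work_block_bytes(16384, 1) uses the whole local store, work_block_bytes(7500, 2) comes to 240,000 bytes, and a vector of 100,000 elements split into blocks of 7500 needs 14 work blocks.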

You can choose to accumulate the partial results of these work blocks on the host to get the final result. But this is not an elegant solution, and it also costs performance. A better solution is to do these accumulations on the accelerators and to do them in parallel.

ALF provides the following two implementations for this problem:

  • An implementation that uses the task context and bundled work block distribution.
  • An implementation that uses multi-use work blocks together with the task context or work block parameter/context buffers.

The source code for both implementations comes with the samples, so you can compare them, in the following directories:

  • task_context/dot_prod: the implementation that uses the task context and bundled work block distribution.
  • task_context/dot_prod_multi: the implementation that uses multi-use work blocks together with the task context or work block parameter/context buffers.

Making use of task context and bundled work block distribution

For this implementation, all the work blocks of a single vector are put into a bundle. All the work blocks in a single bundle are assigned to one task instance in the order of enqueuing. This means it is possible to use the task context to accumulate the intermediate results and write out the final result when the last work block is processed.

The accumulator in the task context is initialized to zero each time a new work block bundle starts.

When the last work block in the bundle is processed, the accumulated value in the task context is copied to the output buffer and then written back to the result area.
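The reset, accumulate, and write-out steps described above can be sketched as a plain C simulation. It does not call the ALF API: the task_context struct, the function names, and the way the block index and bundle size are passed in are all hypothetical stand-ins for what the runtime would supply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the task context: state that survives
   from one work block to the next within a task instance. */
typedef struct {
    double acc;
} task_context;

/* Process one work block of a bundle. Returns 1 and stores the final
   dot product in *out once the last block of the bundle is processed. */
int process_work_block(task_context *ctx, const double *a, const double *b,
                       size_t len, size_t index, size_t bundle_size,
                       double *out)
{
    assert(index < bundle_size);
    if (index == 0)
        ctx->acc = 0.0;               /* new bundle: reset the accumulator */
    for (size_t j = 0; j < len; j++)
        ctx->acc += a[j] * b[j];      /* partial dot product of this block */
    if (index + 1 == bundle_size) {
        *out = ctx->acc;              /* last block: write the result out */
        return 1;
    }
    return 0;
}

/* Simulated bundle: split one vector pair into blocks of at most
   block_len elements and feed them, in order, to the same context. */
double bundled_dot_product(const double *a, const double *b,
                           size_t len, size_t block_len)
{
    assert(len > 0 && block_len > 0);
    task_context ctx = { 0.0 };
    double result = 0.0;
    size_t bundle_size = (len + block_len - 1) / block_len;
    for (size_t i = 0; i < bundle_size; i++) {
        size_t start = i * block_len;
        size_t n = (len - start < block_len) ? len - start : block_len;
        process_work_block(&ctx, a + start, b + start, n, i, bundle_size,
                           &result);
    }
    return result;
}
```

The sketch relies on the same guarantee the text describes: blocks of one bundle reach the same context in enqueue order, so a single accumulator is enough.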

Figure 2 shows a schematic of this implementation.


Figure 2. Making use of task context and bundled work block distribution

Making use of multi-use work blocks together with task context or work block parameter and context buffers

The second implementation is based on multi-use work blocks and work block parameter and context buffers. A multi-use work block works like a loop: the accelerator-side runtime processes the same work block repeatedly until it reaches the specified number of iterations. By using accelerator-side data partitioning, it is possible to access different input data during each iteration of the work block.

This means that the application can handle data larger than a single work block can hold (due to local storage limitations). Also, the parameter and context buffer of the multi-use work block is retained through the iterations, so you can choose to keep the accumulator in this buffer instead of in the task context buffer.
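A plain C simulation of this idea might look like the following. It is a sketch under assumptions rather than ALF code: the wb_param struct, the function names, and the slice calculation that stands in for accelerator-side data partitioning are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter and context buffer of a multi-use work block.
   It is retained across iterations, so it can hold the accumulator. */
typedef struct {
    const double *a;      /* full input vectors */
    const double *b;
    size_t len;           /* total vector length */
    size_t block_len;     /* elements handled per iteration */
    double acc;           /* accumulator kept across iterations */
} wb_param;

/* Stand-in for accelerator-side data partitioning: pick the slice of
   the input that this iteration should work on. */
void select_slice(const wb_param *p, size_t iter, size_t *start, size_t *n)
{
    *start = iter * p->block_len;
    assert(*start < p->len);
    size_t left = p->len - *start;
    *n = (left < p->block_len) ? left : p->block_len;
}

/* One iteration of the multi-use work block. Returns 1 and stores the
   final dot product in *out on the last iteration. */
int run_iteration(wb_param *p, size_t iter, size_t total_iters, double *out)
{
    size_t start, n;
    select_slice(p, iter, &start, &n);
    if (iter == 0)
        p->acc = 0.0;                 /* first iteration: reset */
    for (size_t j = 0; j < n; j++)
        p->acc += p->a[start + j] * p->b[start + j];
    if (iter + 1 == total_iters) {
        *out = p->acc;                /* last iteration: write out */
        return 1;
    }
    return 0;
}

/* Simulated runtime loop: repeat the same work block total_iters times. */
double multi_use_dot_product(const double *a, const double *b,
                             size_t len, size_t block_len)
{
    assert(len > 0 && block_len > 0);
    wb_param p = { a, b, len, block_len, 0.0 };
    size_t total_iters = (len + block_len - 1) / block_len;
    double result = 0.0;
    for (size_t it = 0; it < total_iters; it++)
        run_iteration(&p, it, total_iters, &result);
    return result;
}
```

Compared with the bundled approach, the host describes one work block per vector pair, and the slicing moves to the accelerator side.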

Figure 3 shows a schematic of this implementation.


Figure 3. Making use of multi-use work blocks together with task context or work block parameter and context buffers

Conclusion

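Both implementations are equally valid. One uses the task context with bundled work block distribution and the other uses multi-use work blocks, and both accumulate the partial results on the accelerators instead of on the host.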


