 | Level: Intermediate Jonathan Bartlett (johnnyb@eskimo.com), Director of Technology, New Medio
16 Oct 2007 The first article in the series describes how to do a basic port to the Cell Broadband Engine process. This
second article goes further in hammering out the details, including removing limitations
based on DMA-transfer size, partitioning the program across multiple SPEs, and
improving the program's speed even more.
In the first article in the series, you learned the basic methodology for enhancing an
application for the Cell/B.E. processor by offloading
computation-intensive functions to the SPE. You developed a non-intrusive
framework for doing this, which enables you to add SPE-specific code to
multiplatform projects without adversely impacting the other platforms or
the build system. However, the system that resulted from the article was
suboptimal in several respects:
- It only accepted value lists that were multiples of 4.
- It accepted a maximum of 4,096 values.
- It did not vectorize on the SPE.
- It only used one SPE.
This article makes the code that you developed in the first article both faster and more robust.
Using basic macros
Before I begin, I'm going to define some basic macros that you will
need to
use often in the code for alignments, comparisons, dereferencing, and other
operations. Some of these were introduced in the first article.
Listing 1. Additional macros for speport.h
/* Alignment macros */
#define SPE_ALIGNMENT 16
#define SPE_ALIGNMENT_FULL 128
#define SPE_ALIGN __attribute__((aligned(16)))
#define SPE_ALIGN_FULL __attribute__((aligned(128)))
/* Rounding macros */
/* Go _up_ to the next address based on an alignment */
#define ROUND_UP_ALIGN(value, alignment) (((value) + \
((alignment) - 1))&(~((alignment)-1)))
/* Go _down_ to the next address based on an alignment */
#define ROUND_DOWN_ALIGN(value, alignment) ((value)&(~((alignment) - 1)))
/* Divide and round up */
#define DIVIDE_ROUND_UP(value, divisor) (((value) + (divisor) - 1) / (divisor))
/* In case these aren't already defined */
#ifndef MIN
#define MIN(X,Y) (((X) < (Y)) ? (X) : (Y))
#endif
#ifndef MAX
#define MAX(X,Y) (((X) > (Y)) ? (X) : (Y))
#endif
/* Simplify Dereferencing on the SPE (to deref other types, change the typecase at the */
/* beginning) */
#define SPE_DEREF_UINT32(base, offset) *((unsigned int *)(((char *)(base)) + (offset)))
/* Give a 16-byte structure for "status" so that it can be used in an array without */
/* causing alignment errors */
typedef struct {
int status SPE_ALIGN;
} spe_remote_function_status_t;
|
The importance of these functions will make sense as you proceed through the
examples.
Getting rid of the
limits
The first and most obvious improvement that you could do to the code is to get
rid of the limitations. Currently it is limited to value lists whose size
are a multiple of 4 values (4 * 4 byte floats = 16 bytes) due to the
alignment constraints of SPE DMA transfers. The size limitations of SPE DMA
transfers limit the program to a maximum of 4,096 values. Neither of these
is good.
The easiest limitation to fix is the size limitation. You can fix the size limitation
by simply putting the SPE code into a while loop
that processes a piece of the list at a time. This loop looks like this
(should be in spe_std_dev.spuc toward the
beginning of the function spe_calculate_standard_deviation):
Listing 2. Adding a loop to the
program
float spe_calculate_standard_deviation(int num_values, float *values_ea) {
/* Previous Data Definitions Go Here */
/* ..... */
/* Calculate the number of iterations we are going to need */
int num_values_remaining = num_values;
float *current_values_ea = values_ea;
while(num_values_remaining > 0) {
int values_this_cycle = MIN(num_values_remaining, MAX_VALUES);
/* Previous code goes here */
/* Replace values_ea with current_values_ea */
/* Replace num_values with values_this_cycle */
/* ..... */
num_values_remaining -= values_this_cycle;
current_values_ea += values_this_cycle;
}
/* Rest of function from original */
/* Finalize Computation */
avg = sum / (float)num_values;
variance = (sum_squares - (sum * avg)) / (float)num_values;
std_dev = sqrt(variance);
return std_dev;
}
|
By adding that loop, you removed the maximum size limit to the number of
values that you can process. Now you need to remove the condition in main to check for the number of values to process
(the condition begins with the comment /* Check
boundaries */; the else branch
should be preserved).
However, you still have a problem with the number of values: it still must be
a multiple of 4. To get around this, you have to DMA up to 3 values
individually. The reason for this is that the DMA system can either transfer
multiples of 16 bytes or transfer a naturally aligned value. That means that
transferring a single four-byte value can be done as long as it is aligned to
four bytes. The alignment is easy to guarantee. You already replaced the
malloc with one that allocates all values
on a sixteen-byte boundary.
For the transfer size, there are two ways to get the transfer integrated into
the program. The first way is to modify the loop itself so that it slows
down to one value at a time when it hits the unaligned portion of the list.
The other way is to tack on an additional loop to handle the special cases.
Also, it has to be able to handle alignment issues on the front side as well
as on the back side (this will be more important when you partition lists).
To accomplish this, rewrite your loop to slow down on edge cases:
Listing 3. Modifying the loop to
handle non-DMA-aligned values
while(num_values_remaining > 0) {
int values_this_cycle = MIN(num_values_remaining, MAX_VALUES);
/* Check to see if we the leading values are unaligned. If so, go one */
/* value at a time until they are aligned. */
if(((int)current_values_ea % DMA_TRANSFER_UNIT) != 0) {
values_this_cycle = 1;
} else {
/* Check to see if this is a size that is transferable in one block */
if((values_this_cycle % DMA_TRANSFER_UNIT_FLOAT) != 0) {
/* Check to see if we can transfer a larger sub-block in a */
/* single transfer */
int tmp_num_values = values_this_cycle - (values_this_cycle %
DMA_TRANSFER_UNIT_FLOAT);
if(tmp_num_values > 0) {
/* If so, use this value */
values_this_cycle = tmp_num_values;
} else {
/* Otherwise, just go one value at a time */
values_this_cycle = 1;
}
}
}
/* Load values in from main memory pointer */
/* If we are doing one at a time, then we need to check for */
/* alignment issues */
if(values_this_cycle == 1) {
/* Unaligned addresses require similarly unaligned destinations - */
/* calculate the offset */
int natural_offset = (((int)current_values_ea / sizeof(float)) %
DMA_TRANSFER_UNIT_FLOAT);
/* Perform the Transfer */
spu_mfcdma32(&ls_values[natural_offset],
(unsigned int)current_values_ea, values_this_cycle * sizeof(float),
DEFAULT_DMA_TAG, MFC_GET_CMD);
spu_mfcstat(MFC_TAG_UPDATE_ALL);
/* Move it back into position for processing */
ls_values[0] = ls_values[natural_offset];
} else {
/* Regular properly-aligned, properly-sized transfer */
spu_mfcdma32(ls_values, (unsigned int)current_values_ea,
values_this_cycle * sizeof(float), DEFAULT_DMA_TAG, MFC_GET_CMD);
spu_mfcstat(MFC_TAG_UPDATE_ALL);
}
for(i = 0; i < values_this_cycle; i++) {
sum += ls_values[i];
sum_squares += ls_values[i]*ls_values[i];
}
num_values_remaining -= values_this_cycle;
current_values_ea += values_this_cycle;
}
|
This requires a few extra defines to be placed in speport.h:
#define DMA_TRANSFER_UNIT 16
#define DMA_TRANSFER_UNIT_FLOAT (DMA_TRANSFER_UNIT / sizeof(float))
|
As you can see, what this is doing is slowing down and speeding up DMA
transfers, depending on the location and size of the block in memory. While
this still requires natural alignment, even if the values are not properly
DMA aligned, it is no longer a big deal. However, the Cell/B.E. processor
requires that the last four bits match in the source and destination
register, which is why there are some acrobatics for single-byte
transfers. The DMA transfer speeds up when it is properly aligned enough
to do large transfers, and it slows down when it is not. The data still must
be four-byte aligned, but that should be a fairly easy requirement to meet. Also,
the function will be much faster if the data is aligned on a
16-byte-boundary or even more for a 128-byte boundary.
Using multiple SPEs
In the previous section, you removed the operating limits of the function. In
this section, you are going to take advantage of the remaining computing
power of the Cell/B.E. processor by dividing the task across multiple SPEs.
Not all tasks are easily divisible among SPEs, but this one is
especially easy to partition.
There are basically two parts to the program: summation (getting sums
and sums of squares) and finalization (computing the final answer
from the sums). Finalization is trivial, and therefore, it can be just as easily
done on the PPE. Summation, because addition is commutative, can be divided
up any way you like. Often an algorithm, which in its naive form does
not have commutative components, can be rearranged so that its parts are
commutative.
So you are going to remove the finalization from the SPE portion
of the program and make it only add sums and sums of squares. The PPE will
be responsible for spawning the processes, dividing up the work, collecting
the results, and finalizing the solution. This requires changes to your
SPE application (it needs to now return the sums rather than the final
result) as well as changes to your driver program (it needs to spawn
multiple instances and collect and process the answers).
The first thing you need to do is redefine the struct that is passed between the PPE and the SPE in my_math_spe.h.
Listing 4. Struct for passing
partial results
#ifndef MY_MATH_SPE_H
#define MY_MATH_SPE_H
#include "speport.h"
typedef struct {
int num_values SPE_ALIGN;
float *values SPE_ALIGN;
float result_sum SPE_ALIGN;
float result_sum_squares SPE_ALIGN;
} spe_std_dev_params_t;
#endif
|
Now, for the SPE, you need to make a few simple changes to the way the
processing function works. First, you need to redefine (and rename, because it
will not perform the final calculations) the processing function in spe_std_dev.spuc so that it returns two floats
rather than one. The new prototype should be:
void spe_calculate_sums(int num_values, float *values_ea, float *result_sum,
float *result_sum_squares);
|
Similarly the function definition should be:
void spe_calculate_sums(int num_values, float *values_ea, float *result_sum, float
*result_sum_squares) {
|
Because you are returning more than one value, change the return value to a
void and simply pass the addresses to the variables at the end. This needs
to be altered both in the prototype declaration at the beginning of the file
and in the function definition itself. Then, when you call the
function, you need to pass it pointers to the two values in the struct:
/* Perform Task */
spe_calculate_sums(params.num_values, params.values, ¶ms.result_sum,
¶ms.result_sum_squares);
|
Then, at the end of the spe_calculate_sums, you
need to set these return values instead of finalizing the calculation (the PPE will do the finalization):
*result_sum = sum;
*result_sum_squares = sum_squares;
return;
|
The variables avg, variance, and std_dev are no longer
needed, and their declarations can be removed.
Now the hard part: getting the PPE to split the job up appropriately.
Basically what you are going to do is to have an array of SPE contexts and a
matching array of parameter structs:
- On the first run, get all of the contexts initialized.
- On every run, take the problem, split it up into work units, and
then set each of the SPEs off and running with its own set of values
to process.
- Wait for each of them to finish.
- As they finish, add the individual sums of the individual pieces
up until you have all of the values summed together.
At this point, you have the information you need to perform the final
calculation and return the answer.
Here is the new implementation of calculate_standard_deviation on the PPE (in my_math.c):
Listing 5. Adapting the PPE function
to multiple SPEs
/* ... header files ... */
#ifndef USE_SPE
/* Non-SPE Version */
/* ... */
#else
/* SPE-specific includes */
#include "speport.h"
#include "my_math_spe.h"
#define NUM_SPE_THREADS 4
float calculate_standard_deviation(int num_values, float *values) {
/* Keep SPE Thread Contexts Here */
static spe_remote_function_ptr_t std_dev_func[NUM_SPE_THREADS] =
{NULL, NULL, NULL, NULL};
/* Parameters to pass to contexts */
spe_std_dev_params_t params[NUM_SPE_THREADS] SPE_ALIGN;
spe_remote_function_status_t status[NUM_SPE_THREADS] SPE_ALIGN;
/* Working variables */
int thread_idx;
int workunit_size = ROUND_UP_ALIGN(DIVIDE_ROUND_UP(num_values, NUM_SPE_THREADS),
(SPE_ALIGNMENT_FULL/sizeof(float)));
int current_workunit_size;
int remaining_values = num_values;
float *current_value_position = values;
float sum = 0.0, sum_squares = 0.0, avg, variance, std_dev;
/* Start up SPE processes if this is our first run */
if(std_dev_func[0] == NULL) {
for(thread_idx = 0; thread_idx < NUM_SPE_THREADS; thread_idx++) {
std_dev_func[thread_idx] = spe_remote_function_start("./spe_std_dev",
NULL);
if(std_dev_func[thread_idx] == NULL) {
fprintf(stderr, "Error starting thread!\n");
exit(1);
}
}
}
/* Split up the Data and Call the Function */
for(thread_idx = 0; thread_idx < NUM_SPE_THREADS; thread_idx++) {
current_workunit_size = MIN(workunit_size, remaining_values);
/* Make parameters */
params[thread_idx].num_values = current_workunit_size;
params[thread_idx].values = current_value_position;
/* Call the function */
if(spe_remote_call(std_dev_func[thread_idx], ¶ms[thread_idx],
SPE_RUN_NONBLOCK, &status[thread_idx]) < 0) {
fprintf(stderr, "Error running function\n");
exit(1);
}
/* Move counter and pointer for next function call */
remaining_values -= current_workunit_size;
current_value_position += current_workunit_size;
}
/* Gather the results */
for(thread_idx = 0; thread_idx < NUM_SPE_THREADS; thread_idx++) {
/* Wait for thread to complete */
spe_wait_completion(&status[thread_idx], 0);
/* Gather data from parameters */
sum += params[thread_idx].result_sum;
sum_squares += params[thread_idx].result_sum_squares;
}
/* Finalize calculation */
avg = sum / (float)num_values;
variance = (sum_squares - (sum * avg)) / (float)num_values;
std_dev = sqrt(variance);
/* Return result */
return std_dev;
}
#endif
|
As you can see, it is a bit of work, but it makes the function very fast. On
large datasets, this version of the function is 4.25 times the speed of the
PPE-only function. Not only that, if you had another processing job to do
simultaneously on the PPE, you could do that at the same time, because this
requires nothing of the PPE.
You might be wondering about the calculation of workunit size at the beginning of the function. Basically, you are
dividing your work units among the available SPEs, while trying to keep each
work unit as a multiple of 128 so that it operates best with the DMA.
Without ROUND_UP_ALIGN, the present code can go
as slow as half speed when working on a dataset that isn't a multiple of 16
values. It can go a little slower when the dataset isn't a multiple of 128
values.
Performing other optimizations
There are several other optimizations that you can do to speed it up even
more:
- Double-buffer the SPE function.
- Use SPE vector floating-point intrinsics in the SPE portion of the code.
- Unroll the core loop and schedule instruction.
The first way to optimize is to double-buffer the SPE function. Right now, you are wasting a lot of
time waiting around for DMA to finish. If you double-buffered, you could
simultaneously load and process data. For more information about double-buffering, see
Programming high-performance applications on the Cell BE processor, Part 6: Smart Buffer Management with DMA transfers.
A second way to optimize is to use SPE vector floating-point intrinsics in the
SPE portion of the code. Sometimes the compiler is able to detect
parallelizable code and write the appropriate vector operations for you. In
this case, though, it is likely that because the sum and sum_squares variables are being
modified with each iteration, the compiler does not recognize that the
operations are relatively independent of each other and, therefore, serializes
the operations. By explicitly coding the SIMD operations, the code will be
much, much faster. For more about the SPE's vector intrinsics, see Programming high-performance applications on the Cell BE processor, Part 5: Programming the SPU in C/C++.
A third way to optimize is to unroll the core loop and schedule instruction.
By unrolling the loop, you can separate out all of the operations
that depend on instructions that have not yet completed,
and you can organize them
so that other instructions could be executed while waiting for the
dependencies to finish calculating. This procedure is described in the
article Programming high-performance applications on the Cell BE processor, Part 4: Program the SPU for performance.
Conclusion
Now you should be able to apply these procedures to any projects and
parallelize your applications for optimal performance on the Cell B./E.
processor without adversely affecting the build system or the function interface
of the rest of your application. You should now know the basics of how to
program SPE applications that avoid the basic SPE limitations, such as DMA
transfer size, and also how to use the Cell B./E. to its maximum potential by
splitting data sets across multiple SPEs. Each application has
slightly different needs in order to successfully partition the data, but
hopefully this article gave you a good idea of how to do it generally.
Resources Learn
- Use an RSS
feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- See all
the articles in the series, including Minimize recoding impact, Part 1: How to make an SPE and existing code work
together" (developerWorks, September 2007) for how to
integrate Cell/B.E. functionality into your existing projects by covering
optimal SPE workloads, a simple SPE RPC library, and delivering some sample
code to port.
- Refer to the Programming high-performance applications on the Cell BE
processor series for great information about
programming and maximizing the performance of SPEs in C/C++.
- Use POSIX Threads
Programming as your definitive source on Linux™ Pthreads.
- Go to "Changes
in libspe: How libspe2 affects Cell Broadband Engine programming"
(developerWorks, July 2007) for information about the libspe2
concepts and for how to do
basic SPE process management and communication with libspe2.
- Check out "POSIX
threads explained" (developerWorks, July 2000) and "POSIX Threads
Programming" for how to use pthreads in your code.
- Explore the Cell/B.E. memory model (developerWorks, April 2006) with experts
Mark Nutter and Max Aguilar.
- Refer to "Introduction to the Cell Multiprocessor" (IBM Journal of
Research and Development, 2005) for an introductory overview
of the Cell/B.E. multiprocessor's history, the program objectives and
challenges, the design concept, the architecture and programming models, and
the implementation.
- Use the Software Development Kit 2.1 Installation Guide Version 2.1 (PDF)
to walk through installation and configuration and many of the basics
you need to know to get started with development. Two companion pieces, "Cell/B.E. SDK 2.1: Setting up Fedora Core 6" and "Cell/B.E. SDK 2.1: Understanding the terminology"
(developerWorks, April 2007), can help get the requisite FC6 up and running
and provide a quick reference to Cell/B.E. terminology.
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.
- Sign up for the developerWorks newsletter
and get the latest developer news and Cell/B.E. happenings delivered to your inbox each week.
Check Power Architecture when you sign up to receive Cell/B.E. news in your newsletter.
Get products and technologies
Discuss
About the author  | |  | Jonathan Bartlett is the author of the book
Programming from the Ground Up
which is an introduction to programming using Linux assembly language. He is the lead developer at New Medio, developing Web, video, kiosk, and desktop applications for clients. |
Rate this page
|  |