 | Level: Intermediate Peter Seebach (developerworks@seebs.net), Freelance writer, Wind River Systems
22 Apr 2008 In an
earlier article in this series, the author introduced a fractal-generation
program built around the IDL interface that showcased the strength of IDL's
straightforward API. Executing the program was almost like calling a function and
getting results. In this article (and using the same basic program), the author
demonstrates the Data Communication and Synchronization library's (DaCS) greater
flexibility and the tradeoff: additional complexity. With DaCS, it's possible to pass the fractal pattern in as an initial argument,
then use buffers to pass data back and forth as they are processed. While this requires
more design work, it might actually be more efficient. This article also shows that DaCS allows
for much more carefully tuned inputs and outputs.
Introduction
In an earlier article in this series, The
little broadband engine that could: Use multiple SPEs for a single task, the author introduced a fractal-generation
program built around the IDL interface. The key strength of IDL for this was the
fairly straightforward API, which was quite close in scope to simply calling a
function and getting results.
DaCS is different. It offers greater flexibility, but at the cost of
additional complexity. However, it offers a noticeable advantage in that it allows
for much more carefully tuned inputs and outputs.
In the IDL version of the program, each call had to provide both the fractal
pattern (a set of coordinates to use for transforms) and the fractal data (a set
of points to transform). There were some quirks as to allowable data set sizes.
With DaCS, it's possible to pass the fractal pattern in as an initial argument,
then use buffers to pass data back and forth as they are processed. This requires
more design work, but it might actually be more efficient. You're about to find out!
Understanding the
program
The program in question is an iterative fractal generator. It is best known for
the famous snowflake fractals. You start with a series of line segments,
then you recursively replace each line segment with the same pattern of segments that you
started with. This consumes exponential space, and it is very easily segmented
out. If you have a thousand segments to process, you can process them simultaneously in
batches, although you can't start on a new iteration until at least part of the
preceding iteration is done. Because the pattern stays the same for each iteration, it doesn't need to be
passed in multiple times.
In the DaCS model, each SPE process, once started,
already has the pattern ready. Each one takes in sets of line segments (points
actually) and emits sets of new points. This, in turn, implies a set of buffers
that the SPEs can read from and a set that they can write to. In fact, it's
practical to simply declare large buffers and to have the SPEs write at predictable
offsets.
Once the SPEs are prepared, mailbox communications should be sufficient to
initiate transfers. All that each SPE needs to know is:
- The address of the input buffer
- The address of the output buffer
- The points of the pattern
- Which segment of each buffer to work on
The first three items from the list can be previously allocated at program startup or
during the initialization code. The only thing that changes during runtime is
which segment to work on, which is easily small enough to be sent in a mailbox. Thus, as
blocks of data are prepared for sending, each can be assigned to the next SPE
(in a round-robin cycle) and then retrieved later when it's finished. The
operations in question are predictable in time taken, so there's no need for a
more sophisticated algorithm.
First, you need to pass the actual data in to the program. The DaCS
documentation describes the argv parameter as an array
of pointers to characters, ending with a null pointer. That's a very
likely sounding description, but in fact what gets passed is simply the address.
And the address of an array of pointers to characters is pretty much useless
to the SPE. One solution is to simply define a structure to hold the data that
needs to be passed in and pass the address of this structure. The structure, for
this program, looks like this:
Listing 1. Defining a fractal
typedef struct { float x, y; } point;
struct frac_args {
point frac[MAX_FRAC];
unsigned int points;
unsigned char padding[16];
};
|
You can use the code from the example in the previous
article to copy this in. Note that the
method used requires the size to be a multiple of 16 bytes, which implies some
unused padding on the end. A simple way to handle this is to add a full 16 bytes
of padding, but to copy only as much as is needed to make the size line up:
Listing 2. Copying in arguments
struct frac_args f;
size_t argsize;
/* fill in f, then... */
argsize = sizeof(struct frac_args);
argsize = argsize - (argsize % 16);
spu_writech(MFC_WrTagMask, -1);
spu_mfcdma32(&f, (unsigned int) argp, argsize, 0, MFC_GET_CMD);
(void)spu_mfcstat(2);
|
Strictly speaking, in the case where the structure ends up being an even
multiple of 16 bytes with padding, this copies an extra 16 bytes of padding.
Because this copy happens only once per SPE, it is not worth trying to fix
this and make the code more complicated.
Once the child processes are started, the essential routine is a loop. Running
on the PPE, the main
program sets up batches of numbers to crunch and then has the
SPEs crunch them. The PPE sends a chunk number to the SPE, which then grabs the
numbers, processes them, sends the results back, and sends the chunk number back
to the PPE for processing.
This is a pretty simplistic algorithm, but there are a number of potential
weaknesses. Most obviously, while there is a great deal of parallelism between the
SPEs, there is no parallelism between DMA operations and calculations on a given
SPE. Each SPE
initiates a DMA, waits for the DMA to complete, computes, initiates DMA again, and
waits for the DMA to complete again. Nonetheless, it's a reasonably simple
algorithm to implement.
Understanding the PPE's algorithm
Emulating the default behavior of the previous design, stick with a system
where the PPE works with x/y pairs and then converts them into arrays of points
for sending to the SPEs. Two buffers are needed for communication: one for
outgoing points and one for incoming points. The buffers are subdivided into two
chunks for each SPE. Note that in the initial version, only one buffer at a time is
really going to get used, but this can be addressed later.
The basic algorithm in pseudo code looks like this:
Listing 3. Sending chunks to the SPEs
chunk = 0;
while (length) {
if (sent[chunk]) {
retrieve(chunk);
}
send(chunk);
chunk = (chunk + 1) % CHUNKS;
}
for (i = 0; i < chunks; ++i) {
if (sent(chunk)) {
retrieve(chunk);
}
chunk = (chunk + 1) % CHUNKS;
}
|
For example, on a 6-SPE system that has 12 chunks, this algorithm sends
out up to 12 chunks before it starts retrieving any. In theory, this should give
the SPEs plenty of time to start writing the earliest chunks back before they are
retrieved.
There is one other subtlety hidden in the pseudo code. When the list is broken
up, the last point in each chunk must be the first point of the next chunk as
well. This is handled by some sneaky code to introduce what looks like an
off-by-one error when sending data, and then the code removes that error when retrieving data.
Retrieval is slightly complicated. The inputs to the SPE algorithms are an array
of N points (the fractal pattern) and a large block of points. In the PPE
algorithm, the N points from the fractal are generated for each point in the large
block. On the SPE however, it is necessary (assuming you care about performance)
to do sets of four points at once. Requiring each fractal pattern to have a
precise multiple of four points would be quite wasteful, so instead, each iteration
works with four points from the current working set. This means that the points
returned by the algorithm are not in the expected order. Unshuffling these turns
out to be the single hardest part of this algorithm to do correctly. The example
shows you how to do it in two parts: first, generate the shuffle pattern that should be used, then use
that to index into arrays when copying data out. The shuffle pattern looks like
this:
Listing 4. Really, shuffling makes sense once you look at it
#define SHUFFLE_SIZE(p) (4 * ((p) - 1))
int
make_shuffle() {
int i;
shuffle_table = malloc(SHUFFLE_SIZE(f.points) * sizeof(int));
for (i = 0; i < SHUFFLE_SIZE(f.points); ++i) {
div_t d = div(i, 4);
shuffle_table[i] = d.rem * (f.points - 1) + d.quot;
}
}
|
Assuming a fractal pattern with N points, each new iteration produces (N-1) new
points from each point in the previous iteration of the fractal. This shuffling
algorithm collates the items back, turning (N-1) groups of four objects into four
groups of (N-1) objects.
Exploring the SPE's algorithm
The computational work is essentially unchanged from the IDL-based version of
this program. The main loop on the SPE, however, is noticeably different. The
essential pattern is simply to accept a chunk number, request the corresponding
points, generate new points, and send them back. Chunk numbers are always
non-negative, so the special value -1 is used to tell an SPE that it is time to
end execution. The loop looks like this:
Listing 5. Doing one thing (loop) at a time
while (dacs_mailbox_read(&msg, DACS_DE_PARENT, DACS_PID_PARENT) ==
DACS_SUCCESS) {
if (msg == -1)
break;
dacs_get(old_points[buffer], old_mem,
CHUNK_XOFFSET(msg) * sizeof(float),
CHUNKSIZE * sizeof(point),
old_wids[buffer],
DACS_ORDER_ATTR_NONE,
DACS_BYTE_SWAP_DISABLE);
dacs_wait(old_wids[buffer]);
fractal(old_points[buffer], new_points[buffer], CHUNKSIZE);
dacs_put(new_mem, CHUNK_NXOFFSET(msg, f.points) * sizeof(float),
new_points[buffer],
CHUNKSIZE * (f.points - 1) * sizeof(point),
new_wids[buffer],
DACS_ORDER_ATTR_NONE,
DACS_BYTE_SWAP_DISABLE);
dacs_wait(new_wids[buffer]);
dacs_mailbox_write(&msg, DACS_DE_PARENT, DACS_PID_PARENT);
buffer = !buffer;
}
|
You might notice a half-implemented double-buffering scheme here, and you'd be
correct. The scheme is half-implemented for now, but tune in
next time to see where that can go. When an incoming message shows up, this code
simply grabs CHUNKSIZE points, iterates over them, and
sends back CHUNKSIZE * N-1 new points. The last set of
N-1 points generated is garbage. The points are generated based on the nonexistent
line between the last point in the set and the nonexistent point after it. That's
okay: the retrieval code on the PPE side knows better.
In the case where there are fewer than CHUNKSIZE
points to process, the SPE code ignores this and just does them all anyway. As
long as the chunk size is significantly smaller than the working set size, the
time one SPE spends rendering at most one extra
CHUNKSIZE-1 point is simply irrelevant.
Fine-tuning for
performance
So, with all of this work done to switch to DaCS, how does performance compare?
It comes out surprisingly close to performance of the IDL-based version developed
in the previous
article. Originally, it appeared it would be quite a bit slower, but two
key fixes can bring performance back in line with expectations.
First, enable compiler optimizations. Second, increase the
CHUNKSIZE preprocessor constant from 8 to 512.
Processing tiny chunks is very convenient for debugging, but performance can
suffer. As an interesting note, in the current incarnation, it appears that work
might be PPE-bound. Running with only one SPE yields the same result as running
with several.
This highlights two possible places to look for productive work. The first is
simplifying the PPE's workload for copying data to and from the SPEs. The second
is enabling double-buffering. A later article looks at both of these.
Resources Learn
- Use an
RSS
feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Check out two flavors of
documentation on how great DaCS is:
"DaCS
for Cell/B.E. Programmer's Guide and API Reference"
and
"DaCS for Hybrid-x86 Programmer's Guide and API Reference"
(IBM Semiconductor solutions library, October 2007).
- Read
"Maximizing
the power of the Cell Broadband Engine® processor: 25 tips to optimal application performance"
(developerWorks, June 2006) to learn some
SPE programming tips.
- Ger information from Jonathan Bartlett's series on "Programming high
performance applications on the Cell/B.E. processor" (developerWorks, January 2007
to present) provides an
intro
to Linux® on the PS3,
programming the PS3's SPE,
an
intro to the SPU,
SPU performance programming,
C/C++ SPU programming,
and
managing smart buffer DMA transfers.
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- Speaking of Cell/B.E. SDK documentation,
there's a new blog series that abstracts important topic sections of some of the
major SDK documentation to give you a quick-read on the topic (in case you don't
need a fuller explanation) -- they're called
Infobombs,
and some topics already covered include
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.
- Sign up for the developerWorks newsletter
and get the latest developer news and Cell/B.E. happenings delivered to your inbox each week.
Check Power Architecture
® when you sign up to receive Cell/B.E. news in your newsletter.
- The
Cell Broadband Engine/Power Architecture notebook
is a blog-based resource that hosts
news,
as well as two instructional features -- the
"Forum watch"
of interesting questions and hot topics from the forum and the
"Infobomb"
series (short, precise, task-specific, quick-read knowledge bombs gleaned from
Cell/B.E. documentation).
Get products and technologies
Discuss
About the author  | 
|  | Peter Seebach has been fascinated with fractal generation since he first saw a
program like this running on an Amiga. This is less colorful, but much faster. |
Rate this page
|  |