In an earlier article in this series, The little broadband engine that could: Use multiple SPEs for a single task, the author introduced a fractal-generation program built around the IDL interface. The key strength of IDL for this was the fairly straightforward API, which was quite close in scope to simply calling a function and getting results.
DaCS is different. It offers greater flexibility, but at the cost of additional complexity. However, it offers a noticeable advantage in that it allows for much more carefully tuned inputs and outputs.
In the IDL version of the program, each call had to provide both the fractal pattern (a set of coordinates to use for transforms) and the fractal data (a set of points to transform). There were some quirks as to allowable data set sizes. With DaCS, it's possible to pass the fractal pattern in as an initial argument, then use buffers to pass data back and forth as they are processed. This requires more design work, but it might actually be more efficient. You're about to find out!
The program in question is an iterative fractal generator. It is best known for the famous snowflake fractals. You start with a series of line segments, then you recursively replace each line segment with the same pattern of segments that you started with. This consumes exponential space, and it is very easily segmented out. If you have a thousand segments to process, you can process them simultaneously in batches, although you can't start on a new iteration until at least part of the preceding iteration is done. Because the pattern stays the same for each iteration, it doesn't need to be passed in multiple times.
In the DaCS model, each SPE process, once started, already has the pattern ready. Each one takes in sets of line segments (points actually) and emits sets of new points. This, in turn, implies a set of buffers that the SPEs can read from and a set that they can write to. In fact, it's practical to simply declare large buffers and to have the SPEs write at predictable offsets.
Once the SPEs are prepared, mailbox communications should be sufficient to initiate transfers. All that each SPE needs to know is:
- The address of the input buffer
- The address of the output buffer
- The points of the pattern
- Which segment of each buffer to work on
The first three items from the list can be previously allocated at program startup or during the initialization code. The only thing that changes during runtime is which segment to work on, which is easily small enough to be sent in a mailbox. Thus, as blocks of data are prepared for sending, each can be assigned to the next SPE (in a round-robin cycle) and then retrieved later when it's finished. The operations in question are predictable in time taken, so there's no need for a more sophisticated algorithm.
First, you need to pass the actual data in to the program. The DaCS
documentation describes the argv parameter as an array
of pointers to characters, ending with a null pointer. That's a very
likely sounding description, but in fact what gets passed is simply the address.
And the address of an array of pointers to characters is pretty much useless
to the SPE. One solution is to simply define a structure to hold the data that
needs to be passed in and pass the address of this structure. The structure, for
this program, looks like this:
Listing 1. Defining a fractal
typedef struct { float x, y; } point;
struct frac_args {
point frac[MAX_FRAC];
unsigned int points;
unsigned char padding[16];
};
|
You can use the code from the example in the previous article to copy this in. Note that the method used requires the size to be a multiple of 16 bytes, which implies some unused padding on the end. A simple way to handle this is to add a full 16 bytes of padding, but to copy only as much as is needed to make the size line up:
Listing 2. Copying in arguments
struct frac_args f;
size_t argsize;
/* fill in f, then... */
argsize = sizeof(struct frac_args);
argsize = argsize - (argsize % 16);
spu_writech(MFC_WrTagMask, -1);
spu_mfcdma32(&f, (unsigned int) argp, argsize, 0, MFC_GET_CMD);
(void)spu_mfcstat(2);
|
Strictly speaking, in the case where the structure ends up being an even multiple of 16 bytes with padding, this copies an extra 16 bytes of padding. Because this copy happens only once per SPE, it is not worth trying to fix this and make the code more complicated.
Once the child processes are started, the essential routine is a loop. Running on the PPE, the main program sets up batches of numbers to crunch and then has the SPEs crunch them. The PPE sends a chunk number to the SPE, which then grabs the numbers, processes them, sends the results back, and sends the chunk number back to the PPE for processing.
This is a pretty simplistic algorithm, but there are a number of potential weaknesses. Most obviously, while there is a great deal of parallelism between the SPEs, there is no parallelism between DMA operations and calculations on a given SPE. Each SPE initiates a DMA, waits for the DMA to complete, computes, initiates DMA again, and waits for the DMA to complete again. Nonetheless, it's a reasonably simple algorithm to implement.
Understanding the PPE's algorithm
Emulating the default behavior of the previous design, stick with a system where the PPE works with x/y pairs and then converts them into arrays of points for sending to the SPEs. Two buffers are needed for communication: one for outgoing points and one for incoming points. The buffers are subdivided into two chunks for each SPE. Note that in the initial version, only one buffer at a time is really going to get used, but this can be addressed later.
The basic algorithm in pseudo code looks like this:
Listing 3. Sending chunks to the SPEs
chunk = 0;
while (length) {
if (sent[chunk]) {
retrieve(chunk);
}
send(chunk);
chunk = (chunk + 1) % CHUNKS;
}
for (i = 0; i < chunks; ++i) {
if (sent(chunk)) {
retrieve(chunk);
}
chunk = (chunk + 1) % CHUNKS;
}
|
For example, on a 6-SPE system that has 12 chunks, this algorithm sends out up to 12 chunks before it starts retrieving any. In theory, this should give the SPEs plenty of time to start writing the earliest chunks back before they are retrieved.
There is one other subtlety hidden in the pseudo code. When the list is broken up, the last point in each chunk must be the first point of the next chunk as well. This is handled by some sneaky code to introduce what looks like an off-by-one error when sending data, and then the code removes that error when retrieving data.
Retrieval is slightly complicated. The inputs to the SPE algorithms are an array of N points (the fractal pattern) and a large block of points. In the PPE algorithm, the N points from the fractal are generated for each point in the large block. On the SPE however, it is necessary (assuming you care about performance) to do sets of four points at once. Requiring each fractal pattern to have a precise multiple of four points would be quite wasteful, so instead, each iteration works with four points from the current working set. This means that the points returned by the algorithm are not in the expected order. Unshuffling these turns out to be the single hardest part of this algorithm to do correctly. The example shows you how to do it in two parts: first, generate the shuffle pattern that should be used, then use that to index into arrays when copying data out. The shuffle pattern looks like this:
Listing 4. Really, shuffling makes sense once you look at it
#define SHUFFLE_SIZE(p) (4 * ((p) - 1))
int
make_shuffle() {
int i;
shuffle_table = malloc(SHUFFLE_SIZE(f.points) * sizeof(int));
for (i = 0; i < SHUFFLE_SIZE(f.points); ++i) {
div_t d = div(i, 4);
shuffle_table[i] = d.rem * (f.points - 1) + d.quot;
}
}
|
Assuming a fractal pattern with N points, each new iteration produces (N-1) new points from each point in the previous iteration of the fractal. This shuffling algorithm collates the items back, turning (N-1) groups of four objects into four groups of (N-1) objects.
The computational work is essentially unchanged from the IDL-based version of this program. The main loop on the SPE, however, is noticeably different. The essential pattern is simply to accept a chunk number, request the corresponding points, generate new points, and send them back. Chunk numbers are always non-negative, so the special value -1 is used to tell an SPE that it is time to end execution. The loop looks like this:
Listing 5. Doing one thing (loop) at a time
while (dacs_mailbox_read(&msg, DACS_DE_PARENT, DACS_PID_PARENT) ==
DACS_SUCCESS) {
if (msg == -1)
break;
dacs_get(old_points[buffer], old_mem,
CHUNK_XOFFSET(msg) * sizeof(float),
CHUNKSIZE * sizeof(point),
old_wids[buffer],
DACS_ORDER_ATTR_NONE,
DACS_BYTE_SWAP_DISABLE);
dacs_wait(old_wids[buffer]);
fractal(old_points[buffer], new_points[buffer], CHUNKSIZE);
dacs_put(new_mem, CHUNK_NXOFFSET(msg, f.points) * sizeof(float),
new_points[buffer],
CHUNKSIZE * (f.points - 1) * sizeof(point),
new_wids[buffer],
DACS_ORDER_ATTR_NONE,
DACS_BYTE_SWAP_DISABLE);
dacs_wait(new_wids[buffer]);
dacs_mailbox_write(&msg, DACS_DE_PARENT, DACS_PID_PARENT);
buffer = !buffer;
}
|
You might notice a half-implemented double-buffering scheme here, and you'd be
correct. The scheme is half-implemented for now, but tune in
next time to see where that can go. When an incoming message shows up, this code
simply grabs CHUNKSIZE points, iterates over them, and
sends back CHUNKSIZE * N-1 new points. The last set of
N-1 points generated is garbage. The points are generated based on the nonexistent
line between the last point in the set and the nonexistent point after it. That's
okay: the retrieval code on the PPE side knows better.
In the case where there are fewer than CHUNKSIZE
points to process, the SPE code ignores this and just does them all anyway. As
long as the chunk size is significantly smaller than the working set size, the
time one SPE spends rendering at most one extra
CHUNKSIZE-1 point is simply irrelevant.
So, with all of this work done to switch to DaCS, how does performance compare? It comes out surprisingly close to performance of the IDL-based version developed in the previous article. Originally, it appeared it would be quite a bit slower, but two key fixes can bring performance back in line with expectations.
First, enable compiler optimizations. Second, increase the
CHUNKSIZE preprocessor constant from 8 to 512.
Processing tiny chunks is very convenient for debugging, but performance can
suffer. As an interesting note, in the current incarnation, it appears that work
might be PPE-bound. Running with only one SPE yields the same result as running
with several.
This highlights two possible places to look for productive work. The first is simplifying the PPE's workload for copying data to and from the SPEs. The second is enabling double-buffering. A later article looks at both of these.
Learn
- Use an
RSS
feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
- Check out two flavors of
documentation on how great DaCS is:
"DaCS
for Cell/B.E. Programmer's Guide and API Reference"
and
"DaCS for Hybrid-x86 Programmer's Guide and API Reference"
(IBM Semiconductor solutions library, October 2007).
- Read
"Maximizing
the power of the Cell Broadband Engine® processor: 25 tips to optimal application performance"
(developerWorks, June 2006) to learn some
SPE programming tips.
- Ger information from Jonathan Bartlett's series on "Programming high
performance applications on the Cell/B.E. processor" (developerWorks, January 2007
to present) provides an
intro
to Linux® on the PS3,
programming the PS3's SPE,
an
intro to the SPU,
SPU performance programming,
C/C++ SPU programming,
and
managing smart buffer DMA transfers.
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- "Programming high-performance applications on the Cell/B.E. processor"
- "PS3 fab-to-lab"
- "The little broadband engine that could"
- Speaking of Cell/B.E. SDK documentation,
there's a new blog series that abstracts important topic sections of some of the
major SDK documentation to give you a quick-read on the topic (in case you don't
need a fuller explanation) -- they're called
Infobombs,
and some topics already covered include
- Getting a successful FC7 install on a PS3.
- The basic structure of an ALF application.
- Double buffering on ALF as an optimization.
- ALF and DaCS for x86 hybrids.
- Configuring and using the Basic Linear Algebra Subprograms.
- Glossaries on ALF and DaCS error codes, trace events, and ALF attributes.
- Refer to the Cell
Broadband Engine documentation section of the IBM Semiconductor Solutions Technical Library for a wealth of downloadable manuals,
specifications, and more.
- Sign up for the developerWorks newsletter
and get the latest developer news and Cell/B.E. happenings delivered to your inbox each week.
Check Power Architecture
® when you sign up to receive Cell/B.E. news in your newsletter.
- The
Cell Broadband Engine/Power Architecture notebook
is a blog-based resource that hosts
news,
as well as two instructional features -- the
"Forum watch"
of interesting questions and hot topics from the forum and the
"Infobomb"
series (short, precise, task-specific, quick-read knowledge bombs gleaned from
Cell/B.E. documentation).
Get products and technologies
- Get Cell:
Contact
IBM about custom Cell-based or custom-processor based solutions.
- Get your copy of the
IBM SDK for Multicore Acceleration 3.0
or browse through the library of Cell/B.E. documentation.
- Find all Cell/B.E.-related articles, discussion forums, downloads,
and more at the IBM developerWorks Cell
Broadband Engine resource center: your definitive resource for all
things Cell/B.E.
- Contact IBM about custom
Cell/B.E.-based or custom-processor based solutions.
Discuss
- Participate in the discussion forum.
- Get fast answers from IBM experts and
real-world practitioners in developerWorks
Cell
Broadband Engine discussion forum
- Check out the Cell Broadband
Engine Architecture forum to get your technical questions about the processor answered.
- Check out the Cell Broadband
Engine Architecture forum to get your technical questions about the processor answered.
Juicy problems and answers from the forums are rounded up periodically and highlighted
in the "Forum watch" blog series.
- Go to the Power Architecture blog for news, downloads,
instructional resources, and event notifications for Cell/B.E. and other Power Architecture-related technologies. You can find
the popular "Forum watch" blog series (Q&A roundup), the "FixIt" technology updates, and the Infobomb quick-read
technology introductions.




