Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

The little broadband engine that could: DaCS--flexible and complex

Using DaCS to rebuild the fractal-generating program designed for IDL

Peter Seebach (developerworks@seebs.net), Freelance writer, Wind River Systems
Author photo
Peter Seebach has been fascinated with fractal generation since he first saw a program like this running on an Amiga. This is less colorful, but much faster.

Summary:  In an earlier article in this series, the author introduced a fractal-generation program built around the IDL interface that showcased the strength of IDL's straightforward API. Executing the program was almost like calling a function and getting results. In this article (and using the same basic program), the author demonstrates the Data Communication and Synchronization library's (DaCS) greater flexibility and the tradeoff: additional complexity. With DaCS, it's possible to pass the fractal pattern in as an initial argument, then use buffers to pass data back and forth as they are processed. While this requires more design work, it might actually be more efficient. This article also shows that DaCS allows for much more carefully tuned inputs and outputs.

View more content in this series

Date:  22 Apr 2008
Level:  Intermediate

Activity:  23524 views
Comments:  

Introduction

In an earlier article in this series, The little broadband engine that could: Use multiple SPEs for a single task, the author introduced a fractal-generation program built around the IDL interface. The key strength of IDL for this was the fairly straightforward API, which was quite close in scope to simply calling a function and getting results.

DaCS is different. It offers greater flexibility, but at the cost of additional complexity. However, it offers a noticeable advantage in that it allows for much more carefully tuned inputs and outputs.

In the IDL version of the program, each call had to provide both the fractal pattern (a set of coordinates to use for transforms) and the fractal data (a set of points to transform). There were some quirks as to allowable data set sizes. With DaCS, it's possible to pass the fractal pattern in as an initial argument, then use buffers to pass data back and forth as they are processed. This requires more design work, but it might actually be more efficient. You're about to find out!

Understanding the program

The program in question is an iterative fractal generator. It is best known for the famous snowflake fractals. You start with a series of line segments, then you recursively replace each line segment with the same pattern of segments that you started with. This consumes exponential space, and it is very easily segmented out. If you have a thousand segments to process, you can process them simultaneously in batches, although you can't start on a new iteration until at least part of the preceding iteration is done. Because the pattern stays the same for each iteration, it doesn't need to be passed in multiple times.

In the DaCS model, each SPE process, once started, already has the pattern ready. Each one takes in sets of line segments (points actually) and emits sets of new points. This, in turn, implies a set of buffers that the SPEs can read from and a set that they can write to. In fact, it's practical to simply declare large buffers and to have the SPEs write at predictable offsets.

Once the SPEs are prepared, mailbox communications should be sufficient to initiate transfers. All that each SPE needs to know is:

  • The address of the input buffer
  • The address of the output buffer
  • The points of the pattern
  • Which segment of each buffer to work on

The first three items from the list can be previously allocated at program startup or during the initialization code. The only thing that changes during runtime is which segment to work on, which is easily small enough to be sent in a mailbox. Thus, as blocks of data are prepared for sending, each can be assigned to the next SPE (in a round-robin cycle) and then retrieved later when it's finished. The operations in question are predictable in time taken, so there's no need for a more sophisticated algorithm.

First, you need to pass the actual data in to the program. The DaCS documentation describes the argv parameter as an array of pointers to characters, ending with a null pointer. That's a very likely sounding description, but in fact what gets passed is simply the address. And the address of an array of pointers to characters is pretty much useless to the SPE. One solution is to simply define a structure to hold the data that needs to be passed in and pass the address of this structure. The structure, for this program, looks like this:


Listing 1. Defining a fractal
                
typedef struct { float x, y; } point;

struct frac_args {
            point frac[MAX_FRAC];
            unsigned int points;
            unsigned char padding[16];
};

You can use the code from the example in the previous article to copy this in. Note that the method used requires the size to be a multiple of 16 bytes, which implies some unused padding on the end. A simple way to handle this is to add a full 16 bytes of padding, but to copy only as much as is needed to make the size line up:


Listing 2. Copying in arguments
                
struct frac_args f;
size_t argsize;

/* fill in f, then... */

argsize = sizeof(struct frac_args);
argsize = argsize - (argsize % 16);

spu_writech(MFC_WrTagMask, -1);
spu_mfcdma32(&f, (unsigned int) argp, argsize, 0, MFC_GET_CMD);
(void)spu_mfcstat(2);

Strictly speaking, in the case where the structure ends up being an even multiple of 16 bytes with padding, this copies an extra 16 bytes of padding. Because this copy happens only once per SPE, it is not worth trying to fix this and make the code more complicated.

Once the child processes are started, the essential routine is a loop. Running on the PPE, the main program sets up batches of numbers to crunch and then has the SPEs crunch them. The PPE sends a chunk number to the SPE, which then grabs the numbers, processes them, sends the results back, and sends the chunk number back to the PPE for processing.

This is a pretty simplistic algorithm, but there are a number of potential weaknesses. Most obviously, while there is a great deal of parallelism between the SPEs, there is no parallelism between DMA operations and calculations on a given SPE. Each SPE initiates a DMA, waits for the DMA to complete, computes, initiates DMA again, and waits for the DMA to complete again. Nonetheless, it's a reasonably simple algorithm to implement.


Understanding the PPE's algorithm

Emulating the default behavior of the previous design, stick with a system where the PPE works with x/y pairs and then converts them into arrays of points for sending to the SPEs. Two buffers are needed for communication: one for outgoing points and one for incoming points. The buffers are subdivided into two chunks for each SPE. Note that in the initial version, only one buffer at a time is really going to get used, but this can be addressed later.

The basic algorithm in pseudo code looks like this:


Listing 3. Sending chunks to the SPEs
                
chunk = 0;
while (length) {
       if (sent[chunk]) {
                   retrieve(chunk);
       }
       send(chunk);
       chunk = (chunk + 1) % CHUNKS;
}
for (i = 0; i < chunks; ++i) {
       if (sent(chunk)) {
                   retrieve(chunk);
       }
       chunk = (chunk + 1) % CHUNKS;
}

For example, on a 6-SPE system that has 12 chunks, this algorithm sends out up to 12 chunks before it starts retrieving any. In theory, this should give the SPEs plenty of time to start writing the earliest chunks back before they are retrieved.

There is one other subtlety hidden in the pseudo code. When the list is broken up, the last point in each chunk must be the first point of the next chunk as well. This is handled by some sneaky code to introduce what looks like an off-by-one error when sending data, and then the code removes that error when retrieving data.

Retrieval is slightly complicated. The inputs to the SPE algorithms are an array of N points (the fractal pattern) and a large block of points. In the PPE algorithm, the N points from the fractal are generated for each point in the large block. On the SPE however, it is necessary (assuming you care about performance) to do sets of four points at once. Requiring each fractal pattern to have a precise multiple of four points would be quite wasteful, so instead, each iteration works with four points from the current working set. This means that the points returned by the algorithm are not in the expected order. Unshuffling these turns out to be the single hardest part of this algorithm to do correctly. The example shows you how to do it in two parts: first, generate the shuffle pattern that should be used, then use that to index into arrays when copying data out. The shuffle pattern looks like this:


Listing 4. Really, shuffling makes sense once you look at it
                
#define SHUFFLE_SIZE(p) (4 * ((p) - 1))

int
make_shuffle() {
            int i;
            shuffle_table = malloc(SHUFFLE_SIZE(f.points) * sizeof(int));
            for (i = 0; i < SHUFFLE_SIZE(f.points); ++i) {
                        div_t d = div(i, 4);
                        shuffle_table[i] = d.rem * (f.points - 1) + d.quot;
            }
}

Assuming a fractal pattern with N points, each new iteration produces (N-1) new points from each point in the previous iteration of the fractal. This shuffling algorithm collates the items back, turning (N-1) groups of four objects into four groups of (N-1) objects.


Exploring the SPE's algorithm

The computational work is essentially unchanged from the IDL-based version of this program. The main loop on the SPE, however, is noticeably different. The essential pattern is simply to accept a chunk number, request the corresponding points, generate new points, and send them back. Chunk numbers are always non-negative, so the special value -1 is used to tell an SPE that it is time to end execution. The loop looks like this:


Listing 5. Doing one thing (loop) at a time
                
while (dacs_mailbox_read(&msg, DACS_DE_PARENT, DACS_PID_PARENT) ==
            DACS_SUCCESS) {
   if (msg == -1)
           break;
   dacs_get(old_points[buffer], old_mem,
           CHUNK_XOFFSET(msg) * sizeof(float),
           CHUNKSIZE * sizeof(point),
           old_wids[buffer],
           DACS_ORDER_ATTR_NONE,
           DACS_BYTE_SWAP_DISABLE);
   dacs_wait(old_wids[buffer]);
   fractal(old_points[buffer], new_points[buffer], CHUNKSIZE);
   dacs_put(new_mem, CHUNK_NXOFFSET(msg, f.points) * sizeof(float),
           new_points[buffer],
           CHUNKSIZE * (f.points - 1) * sizeof(point),
           new_wids[buffer],
           DACS_ORDER_ATTR_NONE,
           DACS_BYTE_SWAP_DISABLE);
   dacs_wait(new_wids[buffer]);
   dacs_mailbox_write(&msg, DACS_DE_PARENT, DACS_PID_PARENT);
   buffer = !buffer;
}

You might notice a half-implemented double-buffering scheme here, and you'd be correct. The scheme is half-implemented for now, but tune in next time to see where that can go. When an incoming message shows up, this code simply grabs CHUNKSIZE points, iterates over them, and sends back CHUNKSIZE * N-1 new points. The last set of N-1 points generated is garbage. The points are generated based on the nonexistent line between the last point in the set and the nonexistent point after it. That's okay: the retrieval code on the PPE side knows better.

In the case where there are fewer than CHUNKSIZE points to process, the SPE code ignores this and just does them all anyway. As long as the chunk size is significantly smaller than the working set size, the time one SPE spends rendering at most one extra CHUNKSIZE-1 point is simply irrelevant.


Fine-tuning for performance

So, with all of this work done to switch to DaCS, how does performance compare? It comes out surprisingly close to performance of the IDL-based version developed in the previous article. Originally, it appeared it would be quite a bit slower, but two key fixes can bring performance back in line with expectations.

First, enable compiler optimizations. Second, increase the CHUNKSIZE preprocessor constant from 8 to 512. Processing tiny chunks is very convenient for debugging, but performance can suffer. As an interesting note, in the current incarnation, it appears that work might be PPE-bound. Running with only one SPE yields the same result as running with several.

This highlights two possible places to look for productive work. The first is simplifying the PPE's workload for copying data to and from the SPEs. The second is enabling double-buffering. A later article looks at both of these.


Resources

Learn

Get products and technologies

Discuss

About the author

Author photo

Peter Seebach has been fascinated with fractal generation since he first saw a program like this running on an Amiga. This is less colorful, but much faster.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Multicore acceleration
ArticleID=302202
ArticleTitle=The little broadband engine that could: DaCS--flexible and complex
publish-date=04222008
author1-email=developerworks@seebs.net
author1-email-cc=