Skip to main content

The little broadband engine that could: Rendering fractals on the SPE

Peter Seebach (developerworks@seebs.net), Freelance writer, Wind River Systems
Author photo
Peter Seebach has been writing about Cell Broadband Engine development for a few years, and programming in general for many more. He sort of wishes modern vector processors would come with a built-in couch.

Summary:  In the previous article in the series, you learned some reasons why there were no appreciable performance gains when you migrated the fractal-rendering program from running on one SPE to running on multiple SPEs. This article is going to illuminate the challenge of rendering fractals on the SPE. The focus is on the SPEs copying their rendering results directly into the target data buffer.

View more content in this series

Date:  10 Jun 2008
Level:  Intermediate
Activity:  3054 views

Introduction

The previous article of this series explored SPE optimization of an iterative fractal pattern. A key bottleneck in this process appeared to be the time spent performing copies, especially copies on the PPE end, because it was difficult to arrange for the SPEs to work directly with the buffers that the PPE would eventually use. One of the advantages of a purely academic exercise, however, is the option of simply changing the problem specification when things get awkward. With that in mind, it's time to have a look at a totally different problem: rendering fractals on the SPE.

(The article "The little broadband engine that could: DaCS--flexible and complex" described how to rebuild the fractal-generating program designed for IDL on DaCS. At first, it seemed the performance only just approached that of IDL until you made two key fixes: enabling compiler optimizations and increasing the CHUNKSIZE preprocessor constant from 8 to 512. The article "The little broadband engine that could: Looking at some DaCS performance fine-tuning issues" showed you some reasons why the migration of the fractal generator from IDL to DaCS might not have produced performance gains, giving you a guide for fine-tuning your porting effort.)

Question: What is it with you and the fractals?

Answer from author Peter Seebach: Fractals have fascinated me for years. In fact, it was in helping a friend optimize a fractal renderer some years back that I had my most compelling firsthand experience of the importance of profiling in optimization. We spent about six hours one day trying to improve the speed of a Mandelbrot rendering program, and we found it extremely difficult to get more than small incremental gains in performance. We experimented with inventing algorithms, we tried elaborate tricks. Finally, we profiled.

Using profiling, we discovered that the majority of the program's CPU time was going toward plotting points. We had incorporated some code handed down from my friend's older brother that defined an image structure for a 1x1-pixel image, changed the color of the single pixel, and plotted it at a given location. When we changed this to use a native API call to simply plot a pixel, the performance difference was stunning. Not surprisingly, the time spent rendering the computed image utterly dwarfs the time spent calculating values.

The previous program's key limitation was a substantial dependency on moving data both to and from the SPEs and on copying data that the SPEs returned from communication buffers into final locations. What would happen if the project was such that the SPEs could copy their rendering results directly into the target data buffer? For this, the goal would be to have a project where very little data needs to be sent to the SPEs while they return a great deal of data.

One obvious candidate is the Mandelbrot set. Other sources go into detail explaining the Mandelbrot set, but essentially it is a set of points on the complex plane that never escape a given radius of the origin when plugged into a simple iterative formula. The key point to this definition is that even on an extremely fast processor, it turns out to be very difficult to perform an infinite number of iterations. Thus people typically don't render the set by performing an infinite number of iterations, but by performing a fairly large number, assuming that any point that has not escaped after some large number of iterations is unlikely to.

When rendering images, you typically record the number of iterations after which a point escaped or you record some magic value to indicate that it did not escape. Points that never escape are often rendered in black; points that escape are rendered in other colors.

Developing a test bed

While optimizing your algorithm, make sure that you have a useful test program that allows you to change various parameters quickly and effectively. There are two categories of parameters to consider:

  • Parameters governing the actual task, such as the area to plot or the number of points to plot
  • Parameters of the implementation, such as whether or not to use the SPEs, how many to use, or whether to display results

For a test-bed, anything that lets you visually confirm your output quickly is workable. The GNU plotutils library is an excellent, if grossly overpowered, candidate. A C program to plot points in arbitrary colors in a window is about twenty lines of simple, quick code. Expanding on the sample program to get something that could render values would take about fifteen minutes. Here's the inner loop for the display:


Listing 1. Plotting points
                
for (i = 0; i < y_size; i += 1) {
            for (j = 0; j < x_size; j += 1) {
                        m = mand_buffer[i * x_size + j];
                        pl_pencolor(red[m], green[m], blue[m]);
                        pl_point(j, i);
            }
}

A great deal about the program's structure can be derived from this. Obviously, calculated values are being stored in a single large buffer in row-major order. The reason for this is simple: allocating hundreds of separate buffers, one for each line, would create some of the same problems as the previous version did, including large overhead of managing shared memory access. There's no obvious reason to prefer row-major or column-major. For the example, iterating through horizontal lines seems more familiar. The red, green, and blue arrays are calculated in advance to hold the red, green, and blue values to be used for a given point. A test bed program provides two algorithms for calculating these values: one uses gray-scale values (almost black for low iteration counts and almost white for high iteration counts) while the other uses all bright colors that rotate through the hues. In both cases, points that reach the maximum number of iterations are rendered as pure black. Many other algorithms are possible.

The core function of the test bed is to set everything up, time the actual calculation, and then optionally display the results. The display is optional because benchmarking doesn't really rely on it.


Using the PPE version

The PPE version of the algorithm simply iterates through the assigned chunk of points calling a fairly naive calculator.


Listing 2. Is it or isn't it?
                
int
mand(double x, double y) {
            double x0 = x;
            double y0 = y;
            double new_x;
            int i = 0;

            while (((x*x + y*y) < csquared) && (i < max_iterations)) {
                        new_x = x*x - y*y + x0;
                        y = 2*x*y + y0;
                        x = new_x;
                        ++i;
            }
            return i;
}

There are two global variables used by this function (admittedly, this is bad style). The csquared variable holds a cached value c*c, where c is the distance out from the origin at which a point is considered to have escaped. The default value for c in the example program is 2. The other global variable, max_iterations, serves a purpose too arcane to explain without expanding this article unreasonably.

With this scaffolding in place, the first step is accomplished: the ability to draw brightly colored things that look somewhat like a Mandelbrot set.

As an interesting side note, the process of rendering to the display takes time generally comparable to the time spent calculating the points. With a large view and a reasonable iteration depth (such as 512), it might take seven seconds to calculate and six to plot. There are many advantages to doing these simultaneously, but it makes profiling more difficult. For a demonstration application, it's fine to separate them.


Targeting the SPE

The next stage is to use the SPEs. For a first-pass algorithm, it is sufficient to have SPEs render one scan-line of the set at a time. For a 1024-pixel-wide display, that is 1024 adjacent pixels with varying x values and a constant y value.

Actually, for the first pass, you can simply ignore the SPE's internal parallelization and perform scalar math. Admittedly this is not an efficient use of the SPE, but it's easy enough to do. The mand() function is simply copied directly from the PPE version.

As you might expect, given that this project was chosen to be easy for DaCS, the DaCS part of the code can be very easy to write. At the beginning of each iteration, a set of program parameters are sent to each SPE using dacs_send(). Then the SPE is sent a series of line numbers using dacs_mailbox_write(). At the end, a special sentinel value is sent.

The rendering buffer is made available to the SPEs using dacs_remote_mem_share(). This leads to a simple way to ensure synchronization at the end of the render without the need for the PPE to constantly accept individual confirmation messages. At the end of the render, each SPE releases its shared memory segment. The PPE uses dacs_remote_mem_destroy() to unshare the memory segment. DaCS ensures that this operation completes only when every client has released the memory segment. Thus when the destroy operation completes, each SPE has released its memory segment, which it does only when it has finished rendering lines.

You can design the SPE application to accept new sets of parameters rather than being given the parameters at startup. In a more advanced application, this would be quite useful for handling dynamic zoom operations or other changes in the display.

The central loop of the SPE looks like this:


Listing 3. Accepting and completing workloads
                
while (!done && (dacs_mailbox_read([...]) == DACS_SUCCESS)) {
   switch (msg) {
   case MAND_EXIT:
           done = 1;
           break;
   case MAND_DONE:
           free(buffer);
           buffer = NULL;
           /* release memory, so parent's destroy will unblock */
           dacs_remote_mem_release(&mem);
           break;
   case MAND_NEW:
           /* new parameters */
           dacs_recv(&params, params_size, [...]);
           dacs_remote_mem_accept([...], &mem);
                        buffer = malloc(params.count_x * sizeof(int)); 
                        [...]
            default:
           do_mand(buffer, msg);
           dacs_put(mem, ...);
                        break;
}

The three symbolic constants are negative numbers used to pass data other than row numbers to the SPE. MAND_EXIT indicates that the process should be terminated. The code assumes unconditionally that any given render will be complete before this happens. In a more complete application, this would almost certainly yield a mysterious lockup on exit.

The MAND_NEW and MAND_DONE commands are paired. MAND_NEW causes the SPE code to attach to a shared memory segment, allocate memory, and retrieve parameters. By contrast, MAND_DONE deallocates the local memory buffer and releases the shared memory segment. As explained, this provides easy notification for the PPE of when the SPEs are done processing. The actual do_mand() function, the calculating core of the application, simply iterates over the buffer calling the mand() function.


Listing 4. Filling in a single line
                
void
do_mand(int *buffer, int row) {
            int i;
            double px, py;
            py = min_y + ((double) row / count_y) * size_y;

            for (i = 0; i < count_x; ++i) {
                        px = min_x + ((double) i / count_x) * size_x;
                        buffer[i] = mand(px, py);
            }
}

In this first pass, no special thought goes into the assignment of workload. A straight round-robin approach is taken. On the Mandelbrot set, it's not at all clear that this is a good approach! While the previous fractal program's iterative math assured that time to process any given data set would be essentially fixed, this is not at all the case with the Mandelbrot set. You can easily imagine some segments taking longer than others. However, the round-robin scheduling reduces the risk. The most noticeable case for wild time differences is when large contiguous blocks are assigned to different processors. Some might end up with a block of points requiring the maximum number of iterations to test, while others end up with a block of points, all of which start out escaped.


Analyzing linear performance

As expected, the single-SPE version of the code is slightly slower than the PPE-only version. In fact, performance for the example does something utterly amazing: it matches expectations perfectly. The time for a single SPE to render a large test image (1600x1024) to a depth of 512 iterations is just barely over 8 seconds. The time for 2 SPEs to render is just barely over 4 seconds. In fact, in each case, multiplying the number of SPEs by the time taken yields a value between 8.039 seconds and 8.068 seconds. The variance between repeated runs is nearly as large as the variance between different numbers of SPEs. Whatever the overhead cost might have been, it can be measured in milliseconds. That's encouraging.


Conclusion

The results show that the predicted performance behaviors are correct, which means that this application is nearly entirely compute bound, not memory bound. The next course of action is to try to take advantage of the native parallelism of the SPE, which is noticeably weak at scalar code.

The next article offers a detailed walkthrough of the process of developing a reasonably simple, vectorized Mandelbrot calculator for the SPE. The walkthrough illustrates a number of the points vaguely asserted in earlier articles in the series. It also offers something new to the series: a straightforward attempt to vectorize, which directly and immediately yields the expected performance increases! Truly, this is an age of wonders.


Resources

Learn

Get products and technologies

Discuss

About the author

Author photo

Peter Seebach has been writing about Cell Broadband Engine development for a few years, and programming in general for many more. He sort of wishes modern vector processors would come with a built-in couch.

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Multicore acceleration
ArticleID=312381
ArticleTitle=The little broadband engine that could: Rendering fractals on the SPE
publish-date=06102008
author1-email=developerworks@seebs.net
author1-email-cc=

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Special offers