The previous article of this series explored SPE optimization of an iterative fractal pattern. A key bottleneck in this process appeared to be the time spent performing copies, especially copies on the PPE end, because it was difficult to arrange for the SPEs to work directly with the buffers that the PPE would eventually use. One of the advantages of a purely academic exercise, however, is the option of simply changing the problem specification when things get awkward. With that in mind, it's time to have a look at a totally different problem: rendering fractals on the SPE.
(The article
"The little
broadband engine that could: DaCS--flexible and complex"
described how to rebuild the fractal-generating program designed for IDL on DaCS. At
first, it seemed the performance only just approached that of IDL until you made two
key fixes: enabling compiler optimizations and increasing the
CHUNKSIZE preprocessor constant from 8 to 512. The
article "The little
broadband engine that could: Looking at some DaCS performance fine-tuning issues"
showed you some reasons why the migration of the fractal generator from IDL to
DaCS might not have produced performance gains, giving you a guide for fine-tuning
your porting effort.)
The previous program's key limitation was a substantial dependency on moving data both to and from the SPEs and on copying data that the SPEs returned from communication buffers into final locations. What would happen if the project was such that the SPEs could copy their rendering results directly into the target data buffer? For this, the goal would be to have a project where very little data needs to be sent to the SPEs while they return a great deal of data.
One obvious candidate is the Mandelbrot set. Other sources go into detail explaining the Mandelbrot set, but essentially it is a set of points on the complex plane that never escape a given radius of the origin when plugged into a simple iterative formula. The key point to this definition is that even on an extremely fast processor, it turns out to be very difficult to perform an infinite number of iterations. Thus people typically don't render the set by performing an infinite number of iterations, but by performing a fairly large number, assuming that any point that has not escaped after some large number of iterations is unlikely to.
When rendering images, you typically record the number of iterations after which a point escaped or you record some magic value to indicate that it did not escape. Points that never escape are often rendered in black; points that escape are rendered in other colors.
While optimizing your algorithm, make sure that you have a useful test program that allows you to change various parameters quickly and effectively. There are two categories of parameters to consider:
- Parameters governing the actual task, such as the area to plot or the number of points to plot
- Parameters of the implementation, such as whether or not to use the SPEs, how many to use, or whether to display results
For a test-bed, anything that lets you visually confirm your output quickly is workable. The GNU plotutils library is an excellent, if grossly overpowered, candidate. A C program to plot points in arbitrary colors in a window is about twenty lines of simple, quick code. Expanding on the sample program to get something that could render values would take about fifteen minutes. Here's the inner loop for the display:
Listing 1. Plotting points
for (i = 0; i < y_size; i += 1) {
for (j = 0; j < x_size; j += 1) {
m = mand_buffer[i * x_size + j];
pl_pencolor(red[m], green[m], blue[m]);
pl_point(j, i);
}
}
|
A great deal about the program's structure can be derived from this. Obviously, calculated values are being stored in a single large buffer in row-major order. The reason for this is simple: allocating hundreds of separate buffers, one for each line, would create some of the same problems as the previous version did, including large overhead of managing shared memory access. There's no obvious reason to prefer row-major or column-major. For the example, iterating through horizontal lines seems more familiar. The red, green, and blue arrays are calculated in advance to hold the red, green, and blue values to be used for a given point. A test bed program provides two algorithms for calculating these values: one uses gray-scale values (almost black for low iteration counts and almost white for high iteration counts) while the other uses all bright colors that rotate through the hues. In both cases, points that reach the maximum number of iterations are rendered as pure black. Many other algorithms are possible.
The core function of the test bed is to set everything up, time the actual calculation, and then optionally display the results. The display is optional because benchmarking doesn't really rely on it.
The PPE version of the algorithm simply iterates through the assigned chunk of points calling a fairly naive calculator.
Listing 2. Is it or isn't it?
int
mand(double x, double y) {
double x0 = x;
double y0 = y;
double new_x;
int i = 0;
while (((x*x + y*y) < csquared) && (i < max_iterations)) {
new_x = x*x - y*y + x0;
y = 2*x*y + y0;
x = new_x;
++i;
}
return i;
}
|
There are two global variables used by this function (admittedly, this is bad
style). The csquared variable holds a cached value
c*c, where c is the distance
out from the origin at which a point is considered to have escaped. The default
value for c in the example program is 2. The other
global variable,
max_iterations, serves a purpose too arcane to explain
without expanding this article unreasonably.
With this scaffolding in place, the first step is accomplished: the ability to draw brightly colored things that look somewhat like a Mandelbrot set.
As an interesting side note, the process of rendering to the display takes time generally comparable to the time spent calculating the points. With a large view and a reasonable iteration depth (such as 512), it might take seven seconds to calculate and six to plot. There are many advantages to doing these simultaneously, but it makes profiling more difficult. For a demonstration application, it's fine to separate them.
The next stage is to use the SPEs. For a first-pass algorithm, it is sufficient to have SPEs render one scan-line of the set at a time. For a 1024-pixel-wide display, that is 1024 adjacent pixels with varying x values and a constant y value.
Actually, for the first pass, you can simply ignore the SPE's internal parallelization
and perform scalar math. Admittedly this is not an efficient use of the SPE, but
it's easy
enough to do. The mand() function is simply copied
directly from the PPE version.
As you might expect, given that this project was chosen to be easy for DaCS, the
DaCS part of the code can be very easy to write. At the beginning of each iteration,
a set of program parameters are sent to each SPE using
dacs_send(). Then the SPE is sent a series of line
numbers using dacs_mailbox_write(). At the end, a
special sentinel value is sent.
The rendering buffer is made available to the SPEs using
dacs_remote_mem_share(). This leads to a simple way to
ensure synchronization at the end of the render without the need for the PPE to
constantly accept individual confirmation messages. At the end of the render, each
SPE releases its shared memory segment. The PPE uses
dacs_remote_mem_destroy() to unshare the memory
segment. DaCS ensures that this operation completes only when every client has
released the memory segment. Thus when the destroy operation completes, each SPE
has released its memory segment, which it does only when it has finished rendering
lines.
You can design the SPE application to accept new sets of parameters rather than being given the parameters at startup. In a more advanced application, this would be quite useful for handling dynamic zoom operations or other changes in the display.
The central loop of the SPE looks like this:
Listing 3. Accepting and completing workloads
while (!done && (dacs_mailbox_read([...]) == DACS_SUCCESS)) {
switch (msg) {
case MAND_EXIT:
done = 1;
break;
case MAND_DONE:
free(buffer);
buffer = NULL;
/* release memory, so parent's destroy will unblock */
dacs_remote_mem_release(&mem);
break;
case MAND_NEW:
/* new parameters */
dacs_recv(¶ms, params_size, [...]);
dacs_remote_mem_accept([...], &mem);
buffer = malloc(params.count_x * sizeof(int));
[...]
default:
do_mand(buffer, msg);
dacs_put(mem, ...);
break;
}
|
The three symbolic constants are negative numbers used to pass data other than
row numbers to the SPE. MAND_EXIT indicates that the
process should be terminated. The code assumes unconditionally that any given
render will be complete before this happens. In a more complete application, this
would almost certainly yield a mysterious lockup on exit.
The
MAND_NEW and MAND_DONE
commands are paired. MAND_NEW causes the SPE code to
attach to a shared memory segment, allocate memory, and retrieve parameters. By
contrast, MAND_DONE deallocates the local memory buffer
and releases the shared memory segment. As explained, this provides easy
notification for the PPE of when the SPEs are done processing. The actual
do_mand() function, the calculating core of the
application, simply iterates over the buffer calling the
mand() function.
Listing 4. Filling in a single line
void
do_mand(int *buffer, int row) {
int i;
double px, py;
py = min_y + ((double) row / count_y) * size_y;
for (i = 0; i < count_x; ++i) {
px = min_x + ((double) i / count_x) * size_x;
buffer[i] = mand(px, py);
}
}
|
In this first pass, no special thought goes into the assignment of workload. A straight round-robin approach is taken. On the Mandelbrot set, it's not at all clear that this is a good approach! While the previous fractal program's iterative math assured that time to process any given data set would be essentially fixed, this is not at all the case with the Mandelbrot set. You can easily imagine some segments taking longer than others. However, the round-robin scheduling reduces the risk. The most noticeable case for wild time differences is when large contiguous blocks are assigned to different processors. Some might end up with a block of points requiring the maximum number of iterations to test, while others end up with a block of points, all of which start out escaped.
As expected, the single-SPE version of the code is slightly slower than the PPE-only version. In fact, performance for the example does something utterly amazing: it matches expectations perfectly. The time for a single SPE to render a large test image (1600x1024) to a depth of 512 iterations is just barely over 8 seconds. The time for 2 SPEs to render is just barely over 4 seconds. In fact, in each case, multiplying the number of SPEs by the time taken yields a value between 8.039 seconds and 8.068 seconds. The variance between repeated runs is nearly as large as the variance between different numbers of SPEs. Whatever the overhead cost might have been, it can be measured in milliseconds. That's encouraging.
The results show that the predicted performance behaviors are correct, which means that this application is nearly entirely compute bound, not memory bound. The next course of action is to try to take advantage of the native parallelism of the SPE, which is noticeably weak at scalar code.
The next article offers a detailed walkthrough of the process of developing a reasonably simple, vectorized Mandelbrot calculator for the SPE. The walkthrough illustrates a number of the points vaguely asserted in earlier articles in the series. It also offers something new to the series: a straightforward attempt to vectorize, which directly and immediately yields the expected performance increases! Truly, this is an age of wonders.
Learn
- Use an
RSS
feed
to request notification for the upcoming articles in this series. (Find out more
about
RSS feeds of developerWorks content.)
- Want to plot pixels, but don't feel like
writing X code?
GNU plotutils can
help.
- Wikipedia's introduction to the
Mandelbrot set is a good
starting point.
- Check out two flavors of documentation on how
great DaCS is:
"DaCS
for Cell/B.E. Programmer's Guide and API Reference"
and
"DaCS for Hybrid-x86 Programmer's Guide and API Reference"
(IBM Semiconductor solutions library, October 2007).
- Read
"Maximizing
the power of the Cell Broadband Engine® processor: 25 tips to optimal application performance"
(developerWorks, June 2006) to learn some
SPE programming tips.
- Get information from Jonathan Bartlett's series
on "Programming high performance applications on the Cell/B.E. processor"
(developerWorks, January 2007 to present) provides an
intro
to Linux® on the PS3,
programming the PS3's SPE,
an
intro to the SPU,
SPU performance programming,
C/C++ SPU programming,
and
managing smart buffer DMA transfers.
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- "Programming high-performance applications on the Cell/B.E. processor"
- "PS3 fab-to-lab"
- "The little broadband engine that could"
- Speaking of Cell/B.E. SDK documentation,
there's a new blog series that abstracts important topic sections of some of the
major SDK documentation to give you a quick-read on the topic (in case you don't
need a fuller explanation) -- they're called
Infobombs,
and some topics already covered include
- Getting a successful FC7 install on a PS3.
- The basic structure of an ALF application.
- Double buffering on ALF as an optimization.
- ALF and DaCS for x86 hybrids.
- Configuring and using the Basic Linear Algebra Subprograms.
- Glossaries on ALF and DaCS error codes, trace events, and ALF attributes.
- Refer to the
Cell
Broadband Engine documentation
section of the IBM Semiconductor Solutions Technical Library for a wealth of
downloadable manuals, specifications, and more.
- Sign up for the
developerWorks newsletter
and get the latest developer news and Cell/B.E. happenings delivered to your inbox
each week. Check Power Architecture
® when you sign up to receive Cell/B.E. news in your newsletter.
- The
Cell Broadband Engine/Power Architecture notebook
is a blog-based resource that hosts
news,
as well as two instructional features -- the
"Forum watch"
of interesting questions and hot topics from the forum and the
"Infobomb"
series (short, precise, task-specific, quick-read knowledge bombs gleaned
from Cell/B.E. documentation).
Get products and technologies
- Get Cell:
Contact
IBM about custom Cell-based or custom-processor based solutions.
- Get your copy of the
IBM SDK for Multicore Acceleration 3.0
or browse through the
library of Cell/B.E. documentation.
- Find all
Cell/B.E.-related articles, discussion forums, downloads, and more at the IBM
developerWorks
Cell
Broadband Engine resource center:
your definitive resource for all things Cell/B.E.
- Contact IBM about
custom
Cell/B.E.-based or custom-processor based solutions.
Discuss
- Participate in the discussion forum.
- Get fast answers from IBM experts and
real-world practitioners in developerWorks
Cell
Broadband Engine discussion forum
- Check out the
Cell Broadband
Engine Architecture forum
to get your technical questions about the processor answered.
- Check out the
Cell Broadband
Engine Architecture forum
to get your technical questions about the processor answered. Juicy problems and
answers from the forums are rounded up periodically and highlighted in the
"Forum watch" blog series.
- Go to the
Power Architecture blog
for news, downloads, instructional resources, and event notifications for
Cell/B.E. and other Power Architecture-related technologies. You can find the
popular "Forum watch" blog series (Q&A roundup), the "FixIt" technology
updates, and the Infobomb quick-read technology introductions.
Comments (Undergoing maintenance)





