In the previous article in the series, you migrated the IDL version of an example fractal-rendering program from earlier in the series to run using the DaCS data library. The performance, while noticeably better than running on the PPE only, showed no noticeable gains when going from running on one SPE to running on multiple SPEs.
There are a number of possible explanations for this. The most likely, however, is simply that the PPE's speed at setting up and tearing down chunks of data for the SPEs is the limiting factor. Therefore, I thought it prudent as a first step to look at reducing the amount of work the PPE does when sending and receiving data as a way to improve performance.
In the current incarnation, the PPE is handling marshalling. Data are sent to the SPE in the most convenient possible form for the SPE and then returned in the form most convenient for the PPE. It is almost certainly faster to have the SPE do the shuffling.
In the first pass design, each chunk of N points was sent to the SPE as a set of N x-values followed by a chunk of N y-values. Marshalling for these is trivial, certainly, as shown in Listing 1.
Listing 1. Converting points to individual values (mashalling)
/* given an array of values, and a pair of pointers... */
for (i = 0; i < count; ++i) {
*ox++ = old[i].x;
*oy++ = old[i].y;
}
|
This is a great deal slower than a plain old memcpy. Similarly, the elaborate and complicated shuffling algorithm used before can be turned into a plain old memcpy. But this implies that now the same kinds of shuffling need to be done on the SPE.
Because the actual calculation loop is just fine, it's simple enough to just add additional buffers (after all, there was plenty of SPE local store left) and write small conversion routines that convert data to and from the expected formats.
As a first pass, just write the code as scalar code on the SPE. Listing 2 offers the code to take an array containing X values (and an immediately following array containing Y values) and to convert it to the same number of points.
Listing 2. Marshalling using scalar code on the SPE
void
outconvert(point *outgoing, float *new_points, int count) {
int i, j;
int l = (f.points - 1);
float *nx = new_points;
float *ny = new_points + (CHUNKSIZE * l);
for (i = 0; i < count; i += 4) {
for (j = 0; j < l; ++j) {
outgoing[j].x = nx[0 + (4 * j)];
outgoing[j].y = ny[0 + (4 * j)];
outgoing[j+l].x = nx[1 + (4 * j)];
outgoing[j+l].y = ny[1 + (4 * j)];
outgoing[j+l*2].x = nx[2 + (4 * j)];
outgoing[j+l*2].y = ny[2 + (4 * j)];
outgoing[j+l*3].x = nx[3 + (4 * j)];
outgoing[j+l*3].y = ny[3 + (4 * j)];
}
outgoing += (4 * l);
nx += (4 * l);
ny += (4 * l);
}
}
|
This code is a little complicated, and it's certainly not very efficient, but it works. What's the impact on performance? With multiple SPEs in use, it produces a noticeable gain in performance. But with a single SPE, performance reduces dramatically!
The reason for this is easy enough to understand. This code is actually very inefficient. It is complicated scalar code with multiple, nested loops—the worst case scenario for the SPE. With only a single SPE in use, the PPE ends up spending all of its time waiting for the SPE. However, in the case where there are two or more SPEs, the PPE is no longer stuck waiting and performance is noticeably better than the previous version. The previous version gained a performance boost from running on an SPE, but no additional boost from a second SPE. This new version is faster on two SPEs, but it seems to be pegged at that point.
The performance shift does highlight a key limitation. Right now, the SPE waits until a DMA operation is complete before doing any processing. Furthermore, it will not accept new incoming points until it is done sending old points. This seems like it might be a little inefficient. And this is a good place to look for the next revision.
The idea of double buffering is simple. While working on one buffer, start transferring another so that you spend less time waiting for transfers. The practice is a bit more complicated than it sounds, and it's pretty common to end up with deadlocks. The current algorithm for the SPE is simple, as expressed in the following pseudocode:
Listing 3. Single buffering
while (available_work()) {
get_points();
process_points();
send_points();
}
|
For double buffering, the algorithm is quite a bit more complicated. Substantial additional code begins accumulating points before starting on the main loop, and substantial additional code after the loop finishes processing and sending any remaining points.
Listing 4. Double buffering
while (available_work()) {
get_points(buffer[1]);
process_points(buffer[1]);
send_points(buffer[1]);
confirm_send(buffer[2]);
}
|
It is easy to introduce a deadlock in this algorithm. As with the previous version of the program, the PPE is sending blocks to the SPEs then waiting for previous blocks to be returned. If the SPEs do not send back confirmation of delivery soon enough though, it's quite possible for deadlock to occur: the SPE waits for a new work assignment from the PPE while the PPE waits for a previously-completed block. The key to a double-buffering algorithm is to ensure that each block is confirmed back to the PPE when the next block is assigned.
In the double-buffering algorithm, the SPE goes all the way through an incoming buffer before confirming delivery of the previous buffer. This means that very little time is spent waiting for points to be sent back to the PPE.
The double-buffering algorithm's performance is essentially the same as that of the first version to do shuffling on the SPE: horrible on only one SPE, but quite zippy on two or more. The algorithm could be modified to send outgoing points simultaneously but not wait for incoming points. Because the outgoing points are a larger amount of data, omitting the delay for outgoing points seems a better choice. That suggests further improvement of this algorithm. After all, there are two separate points where the algorithm might be blocked while briefly waiting on transfers.
Given that, a triple-buffering algorithm seems suitable, as shown in Listing 5.
Listing 5. Triple buffering (proposed)
while (available_work()) {
get_points(buffer[1]);
confirm_send(buffer[2]);
process_points(buffer[3]);
send_points(buffer[3]);
}
|
After each loop, the buffer that was just confirmed back to the PPE becomes the new incoming buffer. When one iteration through gets points into buffer 1, confirms the delivery of buffer 2, and then processes buffer 3 (and queues up the transfer of those points), the next iteration can get points into buffer 2, confirm the delivery of buffer 3, and then process buffer 1 (and queue up the transfer of those points). This offers an interesting change in performance. The best-case performance is the same as for the first one from this article, but it is now achieved with three SPEs rather than two.
At this point, the continued failure when you try to add more SPEs to the mix suggests that you might have reached a more fundamental bottleneck: the available bandwidth. How much bandwidth? A fairly deep fractal, such as a 12-iteration snowflake, can involve many millions of points. The last iteration alone has 16,777,217 points. Summing all the points, there is a total of 22,369,632 points that must be moved to calculate it (and each of those points is moved at least twice).
How does that compare with available bandwidth? In this example, the XDR memory on the PS3 for timing runs has a rated bandwidth of about 25.6GBps. A 12-iteration snowflake could be calculated in a bit over 0.6 seconds. Even casual observation reveals that the 15GB of bandwidth available in that time ought to be able to handle a mere 200MB of data being shuffled around. So, the underlying memory bandwidth is not the issue.
There are at least two other possible bottlenecks. One possible bottleneck is communication overhead or lag. If there are blocking operations occurring, such as with mailbox reads and writes, that could explain the performance issue. The other possible bottleneck is that the PPE is still doing too much work! While it might seem that the PPE is barely doing anything except issuing instructions, there's still the matter of copying data into and out of the data buffers. That's taking a noticeable amount of time. Can that time be eliminated? With the shuffling algorithm now occurring entirely on the SPE, the answer is a very definite maybe.
As an experiment, comment out the chunk of code that copied data back in from the SPE. The net result? A factor of 20 increase in runtime! A quick examination doesn't explain this one, but a guess would be that future iterations were working with garbled data sets, which caused floating point exception handlers to get invoked on the SPE. On the other hand, disabling the code to copy points to the outgoing data buffer trimmed about 10% off the program's runtime. Assuming comparable results, this suggests that the PPE is spending at least 40% of its time copying data to or from those buffers. That's pretty bad.
What if you revise the code to work on the data sets directly rather than
copying them into buffers? The first issue is the requirement for 16-byte
alignment. As long as CHUNKSIZE is an even number, the
data sent to the SPEs will always be aligned on a 16-byte boundary (assuming the
point arrays are aligned). Data retrieved from the SPEs, however, is vulnerable to
an off-by-one error. If 16 points are sent to the SPEs, those 16 points only
define 15 line segments. So, if the fractal pattern being generated has 3 line
segments, the last 3 values the SPE generates are nonsense because they are
filled in instead from the next chunk processed. That means that if the first
chunk was 16-byte aligned, the next chunk will be misaligned by 8 bytes. Of
course, if the fractal has an even number of line segments, all is well.
The other problem is the overhead of creating and deleting tens of thousands of shared memory blocks. This seems likely to be computationally expensive and bug-prone. For now, it's up to you to look into this issue.
It's been a long, hard road since the first experimental IDL-based versions of this program, so what have you realized?
- The DaCS version of the program appears to be somewhat more flexible than the IDL version, and the DaCS version is noticeably faster.
- On a large fractal, the performance out of the IDL was about 1 second of rendering time. While that's quite nice compared to the 5 seconds or so it takes on the PPE alone, the DaCS version's performance is down to 0.62 seconds, of which at least a good-sized fraction of a second appears to be the time spent copying to and from the DaCS remote memory buffers.
- While the performance on three SPEs isn't quite 10 times the performance of the simple PPE version of the code, it is enough faster to be a rewarding use of time.
- Moving the shuffling operations onto the SPE yielded a noticeable improvement in performance, but it required the use of a second SPE. Once the second SPE is involved, though, double-buffering on individual SPEs appears to provide little or no real performance benefit.
As always, the essentials of optimization stay the same from one platform to another. Write clear and understandable code first, and measure its performance before optimizing anything. When optimizing, start by profiling to identify where the bottlenecks are. On the Cell/B.E. processor, this can be a bit challenging. You can't just isolate pieces of code; you also have to take into account code running on multiple cores. Furthermore, time spent copying data can be a significant portion of your total runtime. In the example in this article, it seems likely that time spent copying data is still dominating runtime. On a more computationally-intensive task though, that might not be the case.
If you want to explore possibilities, you might find it interesting to develop a version of this program that uses pairs of SPEs with one SPE handling the data-marshalling and another doing computations. Because SPE-to-SPE transfers bypass the main memory, this might give a performance boost without affecting available system bandwidth.
Learn
- Use an
RSS
feed
to request notification for the upcoming articles in this series. (Find out more
about
RSS feeds of developerWorks content.)
- Check out two flavors of documentation on how
great DaCS is:
"DaCS
for Cell/B.E. Programmer's Guide and API Reference"
and
"DaCS for Hybrid-x86 Programmer's Guide and API Reference"
(IBM Semiconductor solutions library, October 2007).
- Read
"Maximizing
the power of the Cell Broadband Engine® processor: 25 tips to optimal application performance"
(developerWorks, June 2006) to learn some
SPE programming tips.
- Get information from Jonathan Bartlett's series
on "Programming high performance applications on the Cell/B.E. processor"
(developerWorks, January 2007 to present) provides an
intro
to Linux® on the PS3,
programming the PS3's SPE,
an
intro to the SPU,
SPU performance programming,
C/C++ SPU programming,
and
managing smart buffer DMA transfers.
- To learn more on Cell/B.E. programming, try the
developerWorks series:
- "Programming high-performance applications on the Cell/B.E. processor"
- "PS3 fab-to-lab"
- "The little broadband engine that could"
- Speaking of Cell/B.E. SDK documentation,
there's a new blog series that abstracts important topic sections of some of the
major SDK documentation to give you a quick-read on the topic (in case you don't
need a fuller explanation) -- they're called
Infobombs,
and some topics already covered include
- Getting a successful FC7 install on a PS3.
- The basic structure of an ALF application.
- Double buffering on ALF as an optimization.
- ALF and DaCS for x86 hybrids.
- Configuring and using the Basic Linear Algebra Subprograms.
- Glossaries on ALF and DaCS error codes, trace events, and ALF attributes.
- Refer to the
Cell
Broadband Engine documentation
section of the IBM Semiconductor Solutions Technical Library for a wealth of
downloadable manuals, specifications, and more.
- Sign up for the
developerWorks newsletter
and get the latest developer news and Cell/B.E. happenings delivered to your inbox
each week. Check Power Architecture
® when you sign up to receive Cell/B.E. news in your newsletter.
- The
Cell Broadband Engine/Power Architecture notebook
is a blog-based resource that hosts
news,
as well as two instructional features -- the
"Forum watch"
of interesting questions and hot topics from the forum and the
"Infobomb"
series (short, precise, task-specific, quick-read knowledge bombs gleaned
from Cell/B.E. documentation).
Get products and technologies
- Get Cell:
Contact
IBM about custom Cell-based or custom-processor based solutions.
- Get your copy of the
IBM SDK for Multicore Acceleration 3.0
or browse through the
library of Cell/B.E. documentation.
- Find all
Cell/B.E.-related articles, discussion forums, downloads, and more at the IBM
developerWorks
Cell
Broadband Engine resource center:
your definitive resource for all things Cell/B.E.
- Contact IBM about
custom
Cell/B.E.-based or custom-processor based solutions.
Discuss
- Participate in the discussion forum.
- Get fast answers from IBM experts and
real-world practitioners in developerWorks
Cell
Broadband Engine discussion forum
- Check out the
Cell Broadband
Engine Architecture forum
to get your technical questions about the processor answered.
- Check out the
Cell Broadband
Engine Architecture forum
to get your technical questions about the processor answered. Juicy problems and
answers from the forums are rounded up periodically and highlighted in the
"Forum watch" blog series.
- Go to the
Power Architecture blog
for news, downloads, instructional resources, and event notifications for
Cell/B.E. and other Power Architecture-related technologies. You can find the
popular "Forum watch" blog series (Q&A roundup), the "FixIt" technology
updates, and the Infobomb quick-read technology introductions.
Comments (Undergoing maintenance)





