Skip to main content

The little broadband engine that could: Looking at some DaCS performance fine-tuning issues

Peter Seebach (developerworks@seebs.net), Freelance writer, Wind River Systems
Author photo
Peter Seebach has been writing about Cell Broadband Engine development for a few years, and programming in general for many more. He sort of wishes modern vector processors would come with a built-in couch.

Summary:  Last time, you migrated the fractal-rendering program from earlier in the series to run using the DaCS data library with no appreciable performance gains when going from running on one SPE to running on multiple SPEs. This article explores ways to optimize performance.

View more content in this series

Date:  20 May 2008
Level:  Intermediate
Activity:  3036 views

Introduction

In the previous article in the series, you migrated the IDL version of an example fractal-rendering program from earlier in the series to run using the DaCS data library. The performance, while noticeably better than running on the PPE only, showed no noticeable gains when going from running on one SPE to running on multiple SPEs.

There are a number of possible explanations for this. The most likely, however, is simply that the PPE's speed at setting up and tearing down chunks of data for the SPEs is the limiting factor. Therefore, I thought it prudent as a first step to look at reducing the amount of work the PPE does when sending and receiving data as a way to improve performance.

In the current incarnation, the PPE is handling marshalling. Data are sent to the SPE in the most convenient possible form for the SPE and then returned in the form most convenient for the PPE. It is almost certainly faster to have the SPE do the shuffling.

Shuffling data

In the first pass design, each chunk of N points was sent to the SPE as a set of N x-values followed by a chunk of N y-values. Marshalling for these is trivial, certainly, as shown in Listing 1.


Listing 1. Converting points to individual values (mashalling)
                
     /* given an array of values, and a pair of pointers... */
for (i = 0; i < count; ++i) {
        *ox++ = old[i].x;
        *oy++ = old[i].y;
}

This is a great deal slower than a plain old memcpy. Similarly, the elaborate and complicated shuffling algorithm used before can be turned into a plain old memcpy. But this implies that now the same kinds of shuffling need to be done on the SPE.

Because the actual calculation loop is just fine, it's simple enough to just add additional buffers (after all, there was plenty of SPE local store left) and write small conversion routines that convert data to and from the expected formats.

As a first pass, just write the code as scalar code on the SPE. Listing 2 offers the code to take an array containing X values (and an immediately following array containing Y values) and to convert it to the same number of points.


Listing 2. Marshalling using scalar code on the SPE
                
void
outconvert(point *outgoing, float *new_points, int count) {
            int i, j;
            int l = (f.points - 1);
            float *nx = new_points;
            float *ny = new_points + (CHUNKSIZE * l);
            for (i = 0; i < count; i += 4) {
                        for (j = 0; j < l; ++j) {
                                    outgoing[j].x = nx[0 + (4 * j)];
                                    outgoing[j].y = ny[0 + (4 * j)];
                                    outgoing[j+l].x = nx[1 + (4 * j)];
                                    outgoing[j+l].y = ny[1 + (4 * j)];
                                    outgoing[j+l*2].x = nx[2 + (4 * j)];
                                    outgoing[j+l*2].y = ny[2 + (4 * j)];
                                    outgoing[j+l*3].x = nx[3 + (4 * j)];
                                    outgoing[j+l*3].y = ny[3 + (4 * j)];
                        }
                        outgoing += (4 * l);
                        nx += (4 * l);
                        ny += (4 * l);
            }
}

This code is a little complicated, and it's certainly not very efficient, but it works. What's the impact on performance? With multiple SPEs in use, it produces a noticeable gain in performance. But with a single SPE, performance reduces dramatically!

The reason for this is easy enough to understand. This code is actually very inefficient. It is complicated scalar code with multiple, nested loops—the worst case scenario for the SPE. With only a single SPE in use, the PPE ends up spending all of its time waiting for the SPE. However, in the case where there are two or more SPEs, the PPE is no longer stuck waiting and performance is noticeably better than the previous version. The previous version gained a performance boost from running on an SPE, but no additional boost from a second SPE. This new version is faster on two SPEs, but it seems to be pegged at that point.

The performance shift does highlight a key limitation. Right now, the SPE waits until a DMA operation is complete before doing any processing. Furthermore, it will not accept new incoming points until it is done sending old points. This seems like it might be a little inefficient. And this is a good place to look for the next revision.


Exploring double buffering

The idea of double buffering is simple. While working on one buffer, start transferring another so that you spend less time waiting for transfers. The practice is a bit more complicated than it sounds, and it's pretty common to end up with deadlocks. The current algorithm for the SPE is simple, as expressed in the following pseudocode:


Listing 3. Single buffering
                
while (available_work()) {
            get_points();
            process_points();
            send_points();
}

For double buffering, the algorithm is quite a bit more complicated. Substantial additional code begins accumulating points before starting on the main loop, and substantial additional code after the loop finishes processing and sending any remaining points.


Listing 4. Double buffering
                
while (available_work()) {
            get_points(buffer[1]);
            process_points(buffer[1]);
            send_points(buffer[1]);
            confirm_send(buffer[2]);
}

It is easy to introduce a deadlock in this algorithm. As with the previous version of the program, the PPE is sending blocks to the SPEs then waiting for previous blocks to be returned. If the SPEs do not send back confirmation of delivery soon enough though, it's quite possible for deadlock to occur: the SPE waits for a new work assignment from the PPE while the PPE waits for a previously-completed block. The key to a double-buffering algorithm is to ensure that each block is confirmed back to the PPE when the next block is assigned.

In the double-buffering algorithm, the SPE goes all the way through an incoming buffer before confirming delivery of the previous buffer. This means that very little time is spent waiting for points to be sent back to the PPE.

The double-buffering algorithm's performance is essentially the same as that of the first version to do shuffling on the SPE: horrible on only one SPE, but quite zippy on two or more. The algorithm could be modified to send outgoing points simultaneously but not wait for incoming points. Because the outgoing points are a larger amount of data, omitting the delay for outgoing points seems a better choice. That suggests further improvement of this algorithm. After all, there are two separate points where the algorithm might be blocked while briefly waiting on transfers.

Given that, a triple-buffering algorithm seems suitable, as shown in Listing 5.


Listing 5. Triple buffering (proposed)
                
while (available_work()) {
            get_points(buffer[1]);
            confirm_send(buffer[2]);
            process_points(buffer[3]);
            send_points(buffer[3]);
}

After each loop, the buffer that was just confirmed back to the PPE becomes the new incoming buffer. When one iteration through gets points into buffer 1, confirms the delivery of buffer 2, and then processes buffer 3 (and queues up the transfer of those points), the next iteration can get points into buffer 2, confirm the delivery of buffer 3, and then process buffer 1 (and queue up the transfer of those points). This offers an interesting change in performance. The best-case performance is the same as for the first one from this article, but it is now achieved with three SPEs rather than two.


Exploring other options

At this point, the continued failure when you try to add more SPEs to the mix suggests that you might have reached a more fundamental bottleneck: the available bandwidth. How much bandwidth? A fairly deep fractal, such as a 12-iteration snowflake, can involve many millions of points. The last iteration alone has 16,777,217 points. Summing all the points, there is a total of 22,369,632 points that must be moved to calculate it (and each of those points is moved at least twice).

How does that compare with available bandwidth? In this example, the XDR memory on the PS3 for timing runs has a rated bandwidth of about 25.6GBps. A 12-iteration snowflake could be calculated in a bit over 0.6 seconds. Even casual observation reveals that the 15GB of bandwidth available in that time ought to be able to handle a mere 200MB of data being shuffled around. So, the underlying memory bandwidth is not the issue.

There are at least two other possible bottlenecks. One possible bottleneck is communication overhead or lag. If there are blocking operations occurring, such as with mailbox reads and writes, that could explain the performance issue. The other possible bottleneck is that the PPE is still doing too much work! While it might seem that the PPE is barely doing anything except issuing instructions, there's still the matter of copying data into and out of the data buffers. That's taking a noticeable amount of time. Can that time be eliminated? With the shuffling algorithm now occurring entirely on the SPE, the answer is a very definite maybe.

As an experiment, comment out the chunk of code that copied data back in from the SPE. The net result? A factor of 20 increase in runtime! A quick examination doesn't explain this one, but a guess would be that future iterations were working with garbled data sets, which caused floating point exception handlers to get invoked on the SPE. On the other hand, disabling the code to copy points to the outgoing data buffer trimmed about 10% off the program's runtime. Assuming comparable results, this suggests that the PPE is spending at least 40% of its time copying data to or from those buffers. That's pretty bad.

What if you revise the code to work on the data sets directly rather than copying them into buffers? The first issue is the requirement for 16-byte alignment. As long as CHUNKSIZE is an even number, the data sent to the SPEs will always be aligned on a 16-byte boundary (assuming the point arrays are aligned). Data retrieved from the SPEs, however, is vulnerable to an off-by-one error. If 16 points are sent to the SPEs, those 16 points only define 15 line segments. So, if the fractal pattern being generated has 3 line segments, the last 3 values the SPE generates are nonsense because they are filled in instead from the next chunk processed. That means that if the first chunk was 16-byte aligned, the next chunk will be misaligned by 8 bytes. Of course, if the fractal has an even number of line segments, all is well.

The other problem is the overhead of creating and deleting tens of thousands of shared memory blocks. This seems likely to be computationally expensive and bug-prone. For now, it's up to you to look into this issue.


Conclusion

It's been a long, hard road since the first experimental IDL-based versions of this program, so what have you realized?

  • The DaCS version of the program appears to be somewhat more flexible than the IDL version, and the DaCS version is noticeably faster.
  • On a large fractal, the performance out of the IDL was about 1 second of rendering time. While that's quite nice compared to the 5 seconds or so it takes on the PPE alone, the DaCS version's performance is down to 0.62 seconds, of which at least a good-sized fraction of a second appears to be the time spent copying to and from the DaCS remote memory buffers.
  • While the performance on three SPEs isn't quite 10 times the performance of the simple PPE version of the code, it is enough faster to be a rewarding use of time.
  • Moving the shuffling operations onto the SPE yielded a noticeable improvement in performance, but it required the use of a second SPE. Once the second SPE is involved, though, double-buffering on individual SPEs appears to provide little or no real performance benefit.

As always, the essentials of optimization stay the same from one platform to another. Write clear and understandable code first, and measure its performance before optimizing anything. When optimizing, start by profiling to identify where the bottlenecks are. On the Cell/B.E. processor, this can be a bit challenging. You can't just isolate pieces of code; you also have to take into account code running on multiple cores. Furthermore, time spent copying data can be a significant portion of your total runtime. In the example in this article, it seems likely that time spent copying data is still dominating runtime. On a more computationally-intensive task though, that might not be the case.

If you want to explore possibilities, you might find it interesting to develop a version of this program that uses pairs of SPEs with one SPE handling the data-marshalling and another doing computations. Because SPE-to-SPE transfers bypass the main memory, this might give a performance boost without affecting available system bandwidth.


Resources

Learn

Get products and technologies

Discuss

About the author

Author photo

Peter Seebach has been writing about Cell Broadband Engine development for a few years, and programming in general for many more. He sort of wishes modern vector processors would come with a built-in couch.

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Multicore acceleration
ArticleID=309472
ArticleTitle=The little broadband engine that could: Looking at some DaCS performance fine-tuning issues
publish-date=05202008
author1-email=developerworks@seebs.net
author1-email-cc=

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Special offers