
The little broadband engine that could: An introduction to using SPEs for Cell Broadband Engine development

Arise, my minions!

Peter Seebach, Freelance author, Plethora.net
Peter Seebach has always wanted to use the phrase "novel architecture" in a sentence. He has been programming recreationally long enough to remember that the phrase used to refer to the 68000. Besides, offloading work just sounds like a good concept to him.

Summary:  In this first article in a series on Cell Broadband Engine™ (Cell/B.E.) development, Peter Seebach introduces the API used to run programs on SPEs, focusing specifically on loading code on an SPE and sending data to it for processing.


Date:  05 Jun 2007
Level:  Intermediate

The Cell Broadband Engine processor has a novel architecture. Programmers familiar with single-core systems or even with homogeneous multicore systems may find it challenging to make effective use of the Cell/B.E. processor's unusual architecture. In this series, I introduce Cell/B.E. programming from the perspective of an experienced programmer (me) who is new to the Cell/B.E. architecture, and I'll show you how to develop a feel for programming models that work on the Cell/B.E. processor. The series will be much more enlightening if you download the SDK 2.1 and follow along (see Resources).

If you have actual Cell/B.E. hardware (I used a Sony® PLAYSTATION® 3), you can use the SDK's development tools to target it. If you don't have Cell/B.E. hardware, the SDK also provides a simulator that offers a usable emulation of the hardware for testing purposes, although obviously the simulated hardware is slower. Use whichever one you want to play along with in this series. The simulator offers some additional debugging features that end-user hardware lacks, so you might find it useful even if you have hardware.

Beneath the basics

Because other articles discuss the Cell/B.E. architecture in more detail (see Resources), for now a brief overview should suffice. The Cell/B.E. processor offers a dual-threaded PowerPC® processor core (the Power Processor Element or PPE) and eight single-threaded SIMD processor cores (the Synergistic Processor Elements or SPEs). The PPE is a fairly standard PowerPC design, including AltiVec/VMX vector enhancements; for more information about the PPE's vector processing, you might look to the Unrolling AltiVec series from 2005 (see Resources). The PPE, although architecturally similar to other 64-bit PowerPC chips, isn't directly related to a specific product line. The SPEs are where the Cell/B.E. architecture gets interesting.

The SPE is a vector-only processor. Every operation works on multiple data elements in parallel, stored in 128-bit registers. There are no scalar (single-element) registers or operations. Each SPE has 256KB of local storage for both instructions and data. This storage is not a "cache" -- it does not share address space with main memory.

Communication between elements is also somewhat unusual, although it won't have significant effects on our early explorations of Cell/B.E. development. Data does not need to be managed entirely by the PPE. Each SPE can communicate directly with main memory and furthermore, the SPEs can communicate directly with each other.

Communication between the PPE and the SPEs happens through "mailboxes" (dedicated registers which can be written to by one core and retrieved by the other), through DMA, or both. This series begins with an overview of the API used to communicate between the PPE and the SPEs.

What's the API done for me lately?

You don't really need to use the provided API to run code on the SPEs. You could just write your own code using intrinsics, writing to registers, taking interrupts, and so on. It might be tempting to do this; the moment you create an API, you accept some degree of overhead. In any given usage, you might be able to shave a few cycles by omitting some of the steps which don't apply to that particular usage.

However, the Cell/B.E. chip is a fairly complicated processor with a lot to keep track of. One crucial bug-mitigation strategy is to come up with guarantees and invariants which are preserved to protect against bugs. A well-designed API can encapsulate these decisions and policies so that any program using the API correctly gets a free pass from a lot of very detailed analysis and coding that would otherwise need to be carefully vetted at every opportunity.

In short, if you want to write an interesting program in a reasonable amount of time, it's useful to have a handful of conventions, and the API provided with the Cell/B.E. SDK makes an excellent starting point.

Introducing the SPE API

The API for offloading code to the SPEs is distributed as a code library for the PPE, libspe. This series looks at the revised libspe2 implementation which offers a substantial API revision from the implementation released in earlier versions of the Cell/B.E. SDK. You can find complete documentation in the Cell/B.E. SDK (see Resources).

The SPE API is built around the notion of an SPE context; this is a representation of the current state of a single SPE containing the complete data set to be loaded to it (including any executable code). In the previous SDK, the API provided for launching "threads" on the SPEs. In the revised API, the call to run a context is synchronous: The call does not complete until the called program completes. Your code must create multiple threads (or processes, if you really prefer) to run multiple SPEs. On the other hand, you have more direct control over choices related to thread scheduling since you use the thread API directly.

SPE programs are compiled as standalone programs with a main function which simply performs the necessary work. Setup and tear down are handled automatically by the startup code linked in with the SPE program and by the library code running on the PPE; you can generally ignore them. SPE programs often do at least some of their own data transfer, however.

Important data types

The SPE API defines a number of opaque data types which are used as arguments to various API functions. Without going into the exhaustive details of all of them, here are the ones I'll be using in the first few examples:

  • spe_context_ptr_t.
  • spe_program_handle_t.
  • spe_stop_info_t.

Okay, you probably need a little more detail.

The first, spe_context_ptr_t, represents a virtualized SPE state; it includes the register states and the contents of local store. This is the core item which indicates a program on an SPE, whether it's being loaded, or running, or being queried as to why it stopped. This is an opaque handle; do not peer into its internals.

The second, spe_program_handle_t, is a handle that can be used to identify an SPE executable program built for use with libspe2. This type can be created from a file containing an SPE binary or embedded into a PPE program using the ppu-embedspu utility. This is another opaque handle type whose contents are used only by the library.

Finally, the spe_stop_info_t type is used to record the reason for which an SPE program stopped execution. Unlike the other two, this is not an opaque type; its structure is documented. The most important member is stop_reason which indicates the reason for which the SPE program stopped execution. The most common value should be SPE_EXIT (the program was successful and an exit status has been stored in the structure). Find a complete description of this type in the documentation for the SPE Runtime Management Library, found in the "pdfs" directory of the SDK distribution (see Resources).

API calls

To run a program on an SPE, create an SPE context, load a program into it, and run it. If you have an embedded SPE binary of type spe_program_handle_t, this can be reduced to four lines of code:


Listing 1. The SPE API
                
entry = SPE_DEFAULT_ENTRY;
context = spe_context_create(0, 0);
spe_program_load(context, &my_program);
spe_context_run(context, &entry, 0, 0, 0, 0);

Of course, like any trivialized example, this has a number of potential flaws. First, there's no error checking; that's always a bad thing. Second, the program is called without any arguments provided, meaning, it will do exactly the same thing every time; it has no way to acquire any data at runtime because it hasn't been given any information to tell it where the data would be. This is good enough to run "hello, world!," but not much else.

The entry argument allows you to have multiple entry points into a single SPE executable. If you have a few small programs, you could load them all at once, then use different entries when starting a run to run different programs without redoing the whole load process. The special value SPE_DEFAULT_ENTRY causes the default entry point in the SPE program's ELF headers to be used.

The SPE's view of the API

It's instructive to understand what happens on the SPE when the PPE is using the libspe2 API. The libspe2 code provides a runtime in which your program runs. From your perspective, execution begins at a function called main(); some library startup code obtains the arguments passed in from the PPE and then calls your main function with those arguments. This function returns an int (which is stored into the spe_stop_info_t structure on the PPE side) and takes three arguments -- an unsigned long long indicating which SPU the code is running on and a pair of 64-bit arguments which contain the argp and envp arguments to spe_context_run. These arguments are canonically of type unsigned long long, although some programs engage in mild type punning, treating them as 64-bit pointers or other types. In contrast to the UNIX® environment norm of multiple string arguments, there are a fixed number of arguments (two) and each is a pointer into main memory.

That's a pretty big difference because the SPE does not have direct access to main memory. Instead of having multiple directly accessible arguments, you get a pointer into main memory and the SPE has to issue DMA requests to get the pointed-to data from main memory. For instance, the following code would gather a chunk of data pointed to by argp:


Listing 2. Issuing a DMA request
                
spu_writech(MFC_WrTagMask, 1 << 0);
spu_mfcdma32((void *)(&data), (unsigned int)argp, sizeof(data), 0,
MFC_GET_CMD);
(void)spu_mfcstat(2);

This request asks for sizeof(data) bytes to be copied from the address in main memory denoted by argp into the object data. No indication of how much data has been provided is passed to the program; the SPE program has to know what data it will be called with, and it's up to the calling program on the PPE to provide the right data in the right order.

The spu_mfcdma32() function takes a 32-bit value even though the target processor (the PPE) has a native 64-bit address space. The spu_mfcdma64() function takes an additional argument specifying the top 32 bits of an effective address. The effective address is, under the hood, always split into a pair of 32-bit registers; the high-order register is optional and if it's not specified, it is treated as zero. On current systems, it turns out to be safe enough to assume that the address space of the Cell/B.E. processor will always fit in the first 32 bits. (And yes, you probably ought to feel very afraid when someone makes a statement like that.)
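To make the split concrete, here is a minimal sketch in portable C (not SPE code) of how a 64-bit effective address divides into two 32-bit halves. The helper names ea_hi, ea_lo, and ea_fits_in_32_bits are hypothetical, invented for this illustration; on the SPE, the two halves would be handed to spu_mfcdma64() as its high and low address arguments.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the SDK: split a 64-bit
 * effective address into its high and low 32-bit halves. */
static uint32_t ea_hi(uint64_t ea) { return (uint32_t)(ea >> 32); }
static uint32_t ea_lo(uint64_t ea) { return (uint32_t)(ea & 0xFFFFFFFFu); }

/* Nonzero if the address fits in 32 bits, that is, if passing only
 * the low half (as spu_mfcdma32() does) loses no information. */
static int ea_fits_in_32_bits(uint64_t ea) { return ea_hi(ea) == 0; }
```

A check like ea_fits_in_32_bits() on the PPE side is one cheap way to notice when the "everything fits in 32 bits" assumption stops holding.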

SPE programs that need more than one argument will generally build a structure containing named members and pass that over to be used the way arguments normally would be.

In some cases, the data passed in as arguments will have more addresses that are, in turn, DMA-ed over in whole or in part. For instance, the argument list might contain the address of a 16MB chunk of data to process. The SPE can't hold that much data at once in the 256KB local store, but it can DMA over one chunk at a time for processing. Whether it's an argument passed in or an address gotten some other way, the DMA targets must be aligned on 16-byte (quadword) boundaries.
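The chunk-at-a-time pattern can be sketched in plain C that runs anywhere. This is a simulation under stated assumptions, not SPE code: memcpy() stands in for the DMA GET and PUT commands, local_buf stands in for a buffer in local store, and the names copy_in_chunks and CHUNK_BYTES are made up for the illustration. The 16KB chunk size is chosen because that is the largest single DMA transfer the MFC accepts.

```c
#include <stddef.h>
#include <string.h>

/* Assumed chunk size: 16KB, the largest single MFC DMA transfer. */
#define CHUNK_BYTES 16384

/* Stand-in for a buffer in the SPE's local store. */
static unsigned char local_buf[CHUNK_BYTES];

/* Moves `total` bytes from src to dst one chunk at a time, staging
 * each chunk in local_buf. Returns the number of chunks used. */
static size_t copy_in_chunks(unsigned char *dst,
                             const unsigned char *src,
                             size_t total)
{
    size_t done = 0, chunks = 0;

    while (done < total) {
        size_t n = total - done;
        if (n > CHUNK_BYTES)
            n = CHUNK_BYTES;
        memcpy(local_buf, src + done, n);   /* stands in for a DMA GET */
        /* ... real code would process local_buf here ... */
        memcpy(dst + done, local_buf, n);   /* stands in for a DMA PUT */
        done += n;
        chunks++;
    }
    return chunks;
}
```

In real SPE code, each memcpy() would become a DMA request followed by a wait on its tag, and the final partial chunk would also have to satisfy the DMA size and alignment rules.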

Finally, the main function can return a status to the calling environment in a way familiar to most C programmers; the return value of the main() function is passed back to the calling environment. The result is a moderately simplistic, but sufficient, interface for handling a pretty good range of problems.

Firing up the API

You have a lot of flexibility in how you move data to and from an SPE. As you might expect, the simplest models are not the highest performers; however, they're a good starting point for understanding the problem. Furthermore, if simple code is "fast enough," you might as well use it during your prototyping stage. In particular, if you can simplify communication with the SPE, then you can focus your debugging effort on the actual calculations, which is a good thing.

The simplest usage is to have the SPE program take two pointers: one for an input buffer and one for an output buffer. The SPE program reads in the input buffer, processes the data, and then writes it to the output buffer. If the data set is small enough to fit into available local store, you don't need a more elaborate setup. You could even just use a single pointer and write data back into the buffer you got it from, although this might be a nuisance to program for on the other end. Paired buffers are easier to keep track of.

For now, though, I'm going to present a fairly minimalist sample program which uses the SPE to perform a memcpy() operation. Just as a simpler communications protocol lets you focus on your algorithm, a simpler algorithm lets you focus on the communications protocol. Here's the algorithm:


Listing 3. Maybe not the best use ever of the SPE's talents
                
memcpy(output, input, spe_args.count * sizeof(int));

Well, the obvious question here is, what's spe_args? Back to the API: The PPE can pass a single pointer into the program, so multiple arguments have to be bundled up. My solution is this:


Listing 4. A rigorous argument
                
typedef struct {
        int *input;
        int *output;
        int count;
        unsigned char pad[4];
} spe_arg_t;

This uncovers a couple of quirks of the API. In particular, note that the structure is padded out to 16 bytes; this is the size of a quadword, and the DMA engine likes to work in multiples of 16 bytes. Apart from that, we have a pretty straightforward structure definition: two pointers and a number of objects to copy. By the way, this is not an API type; this is a user-defined type invented for this article. You can -- and have to -- make up your own.
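One fragile point is worth spelling out: the structure comes to 16 bytes only when pointers are 4 bytes wide, which holds on the SPE and in a 32-bit PPE build; with 8-byte pointers it would be 24 bytes. The following is a sketch of one way to pin the layout down, not the layout used in this article. The type spe_arg_fixed_t and its field names are hypothetical, and it assumes the buffers live in the low 32 bits of the address space.

```c
#include <stdint.h>

/* Hypothetical fixed-width variant of the argument block: the same
 * layout no matter how wide a pointer is on the compiling side. */
typedef struct {
    uint32_t input_ea;    /* low 32 bits of the input buffer address */
    uint32_t output_ea;   /* low 32 bits of the output buffer address */
    int32_t  count;       /* number of ints to copy */
    uint8_t  pad[4];      /* pad the structure out to 16 bytes */
} spe_arg_fixed_t;

/* C11 compile-time check: fail the build if the size ever drifts. */
_Static_assert(sizeof(spe_arg_fixed_t) == 16,
               "spe_arg_fixed_t must be exactly 16 bytes");
```

With a check like this in the shared header, a change that breaks the 16-byte size is caught when either side is compiled rather than at DMA time.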

Writing the SPE program

The SPE program is fairly short. As formatted originally, it's about 36 lines of code. (I have adapted the formatting a bit for display in the article.) I'll go through it one section at a time. Note that there's only one user-written function, main(), in the whole thing. I start out with a few headers:


Listing 5. Headers
                
#include "dma_1.h"
#include <spu_mfcio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

The "dma_1.h" file contains the spe_arg_t typedef which is shared between the SPE code and the PPE code. The API for interacting with the memory flow controller (MFC) is provided by <spu_mfcio.h>. Error messages (used in debugging) are printed to standard error, using stdio. This is a great idea when debugging, but you probably won't want to use it in production code. Finally, malloc() is declared in <stdlib.h> and memcpy() in <string.h>. Onto the actual code:


Listing 6. I hear your arguments
                
int
main(unsigned long long id, unsigned long long argp)
        {
        spe_arg_t spe_args;
        int *input, *output;

        spu_writech(MFC_WrTagMask, -1);
        spu_mfcdma32(&spe_args,
                (unsigned int) argp,
                sizeof(spe_args),
                0,
                MFC_GET_CMD);
        (void)spu_mfcstat(2);

        if (spe_args.count < 0) {
                fputs("Unable to perform any task a negative number "
                        "of times.  Buy a quantum computer.",
                        stderr);
                return 1;
        }

Getting arguments is pretty straightforward. The spu_writech() call selects which DMA tag groups to watch, and the spu_mfcstat() call then waits until every transfer in the selected groups has completed.

The spu_mfcdma32() function allows DMA from a 32-bit subset of the PPE's theoretical 64-bit address space, which is plenty for our purposes. The arguments are the local address, the remote address, the number of bytes to transmit, a tag ID (used to group transfers when waiting for completion), and the command to perform (in this case, GET).

When the DMA transfer is complete, the spe_arg_t structure passed as an argument by the PPE program has been copied into the SPE program for argument validation.


Listing 7. Getting down to brass tacks
                
input = malloc(spe_args.count * sizeof(int));
output = malloc(spe_args.count * sizeof(int));

spu_writech(MFC_WrTagMask, 1 << 0);
spu_mfcdma32(input,
        (unsigned int) spe_args.input,
        spe_args.count * sizeof(int),
        0,
        MFC_GET_CMD);
(void)spu_mfcstat(2);

memcpy(output, input, spe_args.count * sizeof(int));

Having found out how many items the PPE needs copied, the SPE program can allocate local storage, read the items in (using the same code as before), and then perform its algorithm, reviewed above. No problem here. Now how do you send the data back?


Listing 8. Enough with the arguments, how about some backtalk?
                
        spu_writech(MFC_WrTagMask, 1 << 0);
        spu_mfcdma32(output,
                (unsigned int) spe_args.output,
                spe_args.count * sizeof(int),
                0,
                MFC_PUT_CMD);
        (void)spu_mfcstat(2);

        return 0;
}

Yes, it really is that easy. Note that there are two communications with the PPE in this code fragment, not one! The PUT command sends back the contents of the output buffer, but the return statement at the end of main() also sends back data; the status generated will be passed back to the PPE.

The PPE program

The PPE program is conceptually not much more elaborate, but it exposes some features (or perhaps quirks) of SPE/PPE communications. The core of the program looks roughly like this:


Listing 9. And I say to the SPE, copy, and it copieth
                
extern spe_program_handle_t spu_prog;
spe_context_ptr_t context;
unsigned int entry = SPE_DEFAULT_ENTRY;
spe_arg_t spe_args;
int status;
int *input;
int *output;

/* initialize input and output */
[...]

/* fill in spe_args with two pointers and a count */
spe_args.input = input;
spe_args.output = output;
spe_args.count = 1024;

context = spe_context_create(0, 0);
spe_program_load(context, &spu_prog);
status = spe_context_run(context, &entry, 0, &spe_args, NULL, 0);

The address (in PPE address space) of the spe_args structure is passed in as an argument which will eventually become the argp argument of the SPE program's main. The spu_prog object is created by the ppu-embedspu program and consists of a PPE ELF object containing an SPE executable and some metadata about it. The NULL argument to spe_context_run would become a third argument to the main function on the SPE; there's no need for it in this example.

The spe_context_run function runs the program until it stops. Its own return value, stored here in the variable status, reports whether the run succeeded. The value returned by main() in the SPE program is reported through the spe_stop_info_t structure, so pass a pointer to one as the final argument (instead of 0) if you need the exit status. Previous versions of libspe used a threaded system in which you invoked a thread and then waited for it. In libspe2, the operation is synchronous, so if you want to run a thread in the background, you have to make your own pthread calls.

Alignment requirements

The initialization of the "input" and "output" objects was omitted from the above sample code for a good reason: It's more complicated than the rest of the code.

In fact, the code above won't quite work: It gets a bus error when the SPE tries to retrieve its arguments. The problem is that there are unusually strict alignment requirements on data that will be transferred through DMA. Even the alignment guaranteed by malloc() may not be strict enough. Variables can be declared using gcc's __attribute__ extension to compel alignment:


Listing 10. Lawful neutral
                
spe_arg_t spe_args __attribute__ ((aligned (16)));

Allocated memory is a bit trickier. You can't tell in advance what alignment an allocated region will have, so your best bet is to allocate a bit of extra storage and get an aligned region within it. Here's one way to do that:


Listing 11. Generating an aligned pointer
                
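uintptr_t addrfixup;                    /* declared in <stdint.h> */
int *input, *input_save;
input_save = malloc(count * sizeof(int) + ALIGN);
addrfixup = (uintptr_t) input_save;
addrfixup &= ~(uintptr_t)(ALIGN-1);     /* round down to a multiple of ALIGN */
addrfixup += ALIGN;                     /* step up into the allocated block */
input = (int *) addrfixup;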

This code is a little disconcerting. It has the hidden assumption that ALIGN is a power of two; 16 works just fine for DMA to the SPE. For any power of two, though, it guarantees that no matter what the starting address is, the resulting address is aligned to a multiple of that many bytes and still leaves room for the requested data inside the allocated block. The assumptions made about pointer/integer conversions aren't perfectly portable, but then neither are "DMA operations to the SPE." The original pointer has to be saved so it can be freed later.
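The same idea can be wrapped in a reusable helper. This is a minimal sketch in standard C, not code from the SDK; the function name alloc_aligned and its save parameter are invented for the illustration. It makes the power-of-two assumption explicit with an assertion and hands back the original pointer so the caller can free it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns a pointer aligned to `align` bytes
 * with at least `size` usable bytes behind it, or NULL on failure.
 * *save receives the pointer that must later be passed to free().
 * `align` must be a power of two. */
static void *alloc_aligned(size_t size, size_t align, void **save)
{
    uintptr_t addr;

    assert(align != 0 && (align & (align - 1)) == 0);

    *save = malloc(size + align);
    if (*save == NULL)
        return NULL;

    addr = (uintptr_t) *save;
    addr &= ~(uintptr_t)(align - 1);   /* round down to a boundary */
    addr += align;                     /* next boundary, inside the block */
    return (void *) addr;
}
```

A caller would write something like input = alloc_aligned(count * sizeof(int), 16, &input_save); and later free(input_save), never free(input).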

With all that done, though, this is a viable framework for getting data to and from an SPE. When you want to do unit tests on SPE code or algorithms or get timing information on them, this is probably enough. It's not particularly efficient for a number of reasons, not the least of which is that running a task on only a single SPE and waiting for it to complete probably saves you no time at all. Similarly, the lack of buffering on the SPE end means that all the data has to be copied before the SPE even starts computing, and none of the output is copied back until the computations are finished. But this provides a framework for interacting with the SPEs and shows the basics of how the API works.

As the series continues, I'll update, tweak, and in some cases outright replace this framework, but the basic model of loading code to an SPE will generally serve.

Next up: Mailboxes

While DMA to and from the SPE is adequate for moving blocks of data, there's often a need for the ability to send smaller messages faster. This is done using mailboxes, special purpose registers available to the SPEs and to the PPE. The next article in this series looks at mailboxes and how they can be used to send data back and forth.


Resources

Learn

Get products and technologies

Discuss

