The little broadband engine that could: IDL is dead--long live DaCS!

A great substitute for IDL: The Data Communication and Synchronization library

Peter Seebach (developerworks@seebs.net), Freelance writer, Plethora.net
Peter Seebach usually responds to a new API the way most people respond to a brightly-wrapped present--by shaking it. He keeps wrapping paper and old documentation alike for the sentimental value.

Summary:  In SDK 3.0, the Data Communication and Synchronization library (DaCS) provides a sparkling substitute for IDL. DaCS is a set of services to aid the development of applications and application frameworks in a heterogeneous multi-tiered system. This article takes you on a tour of the DaCS process model and explores general DaCS principles, including communication and memory access.

Date:  04 Mar 2008
Level:  Intermediate

Introduction

Previous articles in the series reviewed the IDL tool included with Cell Broadband Engine SDK versions up through 2.1. In version 3.0, though, the IDL tool is no longer provided. Instead, you get comparable functionality through the Data Communication and Synchronization library (DaCS), a set of services designed to simplify the development of distributed applications on Cell/B.E. (or similar) processors. The version of the library distributed with the SDK is moderately specific to the Cell/B.E. environment, and it is well suited to program execution there.

DaCS provides abstractions supporting development models in which the PPE assigns tasks to SPEs, which then carry out these tasks asynchronously. The DaCS library defines a variety of DaCS elements (DEs), which can be divided into:

  • Host elements (HEs)
  • Accelerator elements (AEs)

For purposes of Cell/B.E. development, you can consider these as abstractions of the PPE (a host element) and the SPEs (accelerator elements). On a typical Cell/B.E. blade server with two processors, it is possible to allocate a single blade with 16 SPE children or to allocate two CBE devices, each with eight children. The latter view might allow better control of processor affinity at some cost in complexity. In the Cell/B.E. SDK running on a PlayStation 3 system, it is possible to allocate only the six available SPE children; the higher-level framework is not available.

Introducing the process model

While DaCS has abstractions allowing for multiple processes on a single DE, the SDK currently supports only a single process on each element. Each process is started in the usual way for SPE programs, and it can then initialize the SPU side of the DaCS library, allowing communication with the library running on the PPE side.

The simplest way to set up an SPE process using DaCS will be eerily familiar to anyone who's used libspe, because the DaCS library uses the same embedded binaries. An SPE program compiled with spu-gcc can be turned into an embedded object using the ppu-embedspu program, and then it can be linked into a program using DaCS. Setup is slightly more complicated than with libspe, although you don't need to manage your own threading. DaCS handles this transparently.


Listing 1. Running a program on an SPE
                
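#include <libspe2.h>   /* spe_program_handle_t */
#include <dacs.h>

/* Embedded SPE binary created with ppu-embedspu. */
extern spe_program_handle_t spu_prog;
de_id_t spes[16];
dacs_process_id_t pids[16];
uint32_t children;
int32_t status;

dacs_runtime_init(NULL, NULL);
/* 1. Query the number of available SPE children. */
dacs_get_num_avail_children(DACS_DE_SPE, &children);
/* 2. Attempt to reserve all of them. */
dacs_reserve_children(DACS_DE_SPE, &children, spes);
/* 3. Start the embedded program on the first SPE (argv and envp are NULL). */
dacs_de_start(spes[0], &spu_prog, NULL, NULL,
                  DACS_PROC_EMBEDDED, &pids[0]);
/* 4. Wait for it to terminate; status receives the return value of its main. */
dacs_de_wait(spes[0], pids[0], &status);
/* 5. Exit the DaCS runtime. */
dacs_runtime_exit();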

This program does the following:

  1. It queries the number of available children (the example assumes that at least one is available).
  2. It attempts to reserve all of those children.
  3. It begins a program on the first one.
  4. It waits for the program to terminate.
  5. It exits the DaCS runtime.

It's not a complicated program. The omission of error-checking is strictly for introductory purposes. In real code, you would want to check all of these operations carefully. The spu_prog handle is the same sort of embedded SPU code that was used in previous articles in the series, although the DaCS library obliges you to change from -m32 to -m64 when creating embedded binaries.

The status value is filled in with the return value of the main function of the SPE program. The two NULL arguments immediately after &spu_prog are passed in as argv and envp, respectively.

Now you are at the point of the simplest case in which a program runs entirely on data provided in its arguments (and so on). However, the real strength of DaCS is in manipulation of data once the child program is up and running.


Understanding the general principles

There are a number of common patterns throughout the entire DaCS API. The most significant patterns are:

  • The way the API handles return values
  • The need to consistently release resources
  • The alignment requirements across all the API calls

DaCS API and return values

The example code in Listing 1 omits error checking for brevity. As a result, it never uses the return values of any DaCS functions. Each function in the DaCS library returns only a success or failure indicator. Any data generated by a function are stored into objects pointed to by its arguments. For example, to get a count of available children, you can call dacs_get_num_avail_children(DACS_DE_SPE, &children). The address of children is passed to the function, which modifies the object through this pointer. The return value is usually the constant DACS_SUCCESS, indicating a successful operation, but it could also be an error code.

This setup avoids the quirk of needing special sentinel values to indicate error returns. This setup also provides greater consistency across the API. On the other hand, it can be a little confusing at times, and there are a number of cases where you need to pass the address of an object to one function but the object itself to another. Keep a close eye on that. Don't just ignore compiler warnings about type mismatches!
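
The calling convention is easy to picture in isolation. The sketch below is plain C with no DaCS dependency: demo_err_t, DEMO_SUCCESS, DEMO_ERR_INVALID_ADDR, and demo_get_num_children are hypothetical stand-ins invented for illustration, not DaCS names. They only mimic the shape of a call that returns a status and delivers its data through a pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; real DaCS code would compare against DACS_SUCCESS. */
typedef enum {
    DEMO_SUCCESS = 0,
    DEMO_ERR_INVALID_ADDR = -1
} demo_err_t;

/* Hypothetical stand-in for a DaCS-style call: the return value is only a
   status, and the result travels through the out-parameter. */
demo_err_t demo_get_num_children(uint32_t *num_children)
{
    if (num_children == NULL)
        return DEMO_ERR_INVALID_ADDR;
    *num_children = 6;   /* pretend six SPEs are available */
    return DEMO_SUCCESS;
}

/* Caller side: check the status before trusting the out-parameter.
   Returns the child count, or -1 if the query failed. */
int32_t demo_children_or_negative(void)
{
    uint32_t children = 0;
    demo_err_t rc = demo_get_num_children(&children);

    if (rc != DEMO_SUCCESS)
        return -1;
    return (int32_t)children;
}
```

Note that the caller passes &children here, while a later call that consumes the count would take children itself; that is the address-versus-object distinction the compiler warnings help you catch.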

Allocating and releasing resources

Each resource allocation API call has a corresponding call to release the resource. These calls are not optional, even if your program is about to exit. In fact, dacs_runtime_exit() can hang if resources have not been freed correctly! So, if you create a remote memory region with dacs_remote_mem_create, you must destroy it with dacs_remote_mem_destroy when you are done with it. If you have accepted access to a memory region with dacs_remote_mem_accept, you must release it with dacs_remote_mem_release when you are done. The dacs_remote_mem_destroy call blocks until all clients have been released.

While this might sound fairly intrusive, it's not bad in practice (although returning an error condition would be preferable to blocking). The pairings tend to be easy to get right, because real code usually has a clearly defined point at which a resource is no longer in use.
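
One way to picture the pairing rule is as a count of outstanding accepts that must reach zero before the owner can destroy a region. The following toy model is a sketch under stated assumptions, not DaCS code: every demo_ name is hypothetical, and where dacs_remote_mem_destroy blocks, this model returns -1 instead so that the behavior can be observed from a single thread.

```c
#include <stdint.h>

/* Toy model of a shared region: tracks how many clients still hold it. */
typedef struct {
    uint32_t accepted;   /* accepts not yet matched by a release */
    int      alive;      /* 1 until the owner destroys the region */
} demo_region_t;

/* Owner side: create the region with no clients attached. */
void demo_region_create(demo_region_t *r)
{
    r->accepted = 0;
    r->alive = 1;
}

/* Client side: each accept must later be matched by one release. */
void demo_region_accept(demo_region_t *r)
{
    r->accepted++;
}

void demo_region_release(demo_region_t *r)
{
    if (r->accepted > 0)
        r->accepted--;
}

/* Owner side: succeeds (returns 0) only once every client has released.
   Returns -1 where, per the article, the real call would block. */
int demo_region_destroy(demo_region_t *r)
{
    if (r->accepted != 0)
        return -1;
    r->alive = 0;
    return 0;
}
```

In this model, a forgotten release leaves the destroy call failing forever, which corresponds to the hang described for the real library.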

Alignment requirements

Nearly everything needs to be aligned suitably for the processor, which generally means aligned to a 16-byte boundary. Misaligned things don't always have the effects you'd expect, and sometimes they aren't detected gracefully. If you start seeing mysterious crashes, look for unaligned accesses, and not necessarily on the DE that's crashing! In one example, a misaligned object on a PPE can cause an SPE to crash. (Worse yet, changing the alignment of other objects on the PPE can influence the alignment of the object in question.)

While gcc's __attribute__ ((aligned (16))) is your friend, it does not necessarily have any effect (or at least, the desired effect) on stack variables. This means that in some cases, you will have to declare objects globally or allocate memory for them. Unfortunately, the alignment requirements don't mesh well with object sizes. For example, you can't simply declare an array of N integers to use as arguments for dacs_send messages, because if the first item in the array is aligned correctly, later ones won't be!

The documentation doesn't really give much detail for the alignment requirements. In practice, just align everything to 16 bytes, and you should be fine.
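
The array problem can be seen, and worked around, with a few lines of C. This sketch assumes a GCC-compatible compiler; the names msg_slot_t, slots, plain_words, and is_aligned16 are made up for the example. Giving the element type itself 16-byte alignment pads each element to 16 bytes, so every element of the array starts on a boundary, at the cost of 12 wasted bytes per 32-bit value.

```c
#include <stdint.h>

/* A 32-bit payload whose type is aligned to 16 bytes. The compiler rounds
   the struct size up to 16, so array elements stay aligned. */
typedef struct {
    uint32_t value;
} __attribute__((aligned(16))) msg_slot_t;

/* File-scope objects, because the attribute is less reliable on the stack. */
msg_slot_t slots[4];                                   /* every element aligned */
uint32_t plain_words[4] __attribute__((aligned(16)));  /* only element 0 aligned */

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

With these definitions, &slots[i] is suitable as a message argument for any i, while &plain_words[1] sits four bytes past a boundary.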


Regarding communication and memory access

The SPEs have local storage, but they have no direct access to the main memory the PPE uses. Most readers are probably aware of this, but it bears emphasis because all non-trivial Cell/B.E. programming ends up involving a fair amount of code to shuffle data. With IDL, data were moved automatically by the library without explicit operations, much as data are sent to a remote system using a remote procedure call (RPC) interface. In DaCS, data management is somewhat more explicit.

DaCS provides functions to send and receive messages, and it also provides a family of functions to share regions of memory between elements. DaCS provides basic tools rather than a complete protocol. Operations are asynchronous, and the overall structure of an exchange is left for the two sides to agree on. Typically, you send a message and then wait for confirmation of its receipt, which means that both sides need to cooperate. If you send a message and wait for the other side to receive it but the other side never receives it, you will wait forever.

Many of the communication functions offer byte-swapping primitives. While these are doubtless of significant importance in a mixed environment, such as an Intel-based or AMD-based server using Cell/B.E. blade servers, the only code you need to know to make effective use of a pure-Cell/B.E. system is DACS_BYTE_SWAP_DISABLE.

Wait identifiers and transfer completion

While wait identifiers in and of themselves don't do much, you can't use any of the communication systems without understanding wait identifiers. The message-passing functions and memory transfer functions defined by DaCS happen asynchronously. Rather than requiring a different test or wait protocol for each, they use a common facility called a wait identifier. A wait identifier, once allocated (or reserved), is passed as an argument to a communication function, and it can then be queried to see whether a given communication has completed. Note that a wait identifier can be used only for a transaction initiated on the local DE. You can't query someone else's wait identifiers.

Wait identifiers standardize the interaction with a variety of functions. Whether you're calling dacs_put or dacs_recv, you pass in a wait identifier to the function, then you call dacs_test to see whether it's done or dacs_wait to block until it completes.
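
The difference between testing and waiting can be modeled without the library. The sketch below is a single-threaded simulation, not DaCS code: demo_wid_t, demo_start_transfer, demo_test, and demo_wait are hypothetical names, and a "transfer" is just a counter that moves one step closer to completion each time it is polled.

```c
/* Hypothetical status values for the simulated wait identifier. */
typedef enum { DEMO_WID_READY = 0, DEMO_WID_BUSY = 1 } demo_status_t;

/* Simulated wait identifier: counts the work left on one transfer. */
typedef struct {
    unsigned steps_left;
} demo_wid_t;

/* Start a simulated asynchronous operation that needs `steps` polls. */
void demo_start_transfer(demo_wid_t *wid, unsigned steps)
{
    wid->steps_left = steps;
}

/* Non-blocking check: advances the simulation by one step and reports
   whether the operation has completed. */
demo_status_t demo_test(demo_wid_t *wid)
{
    if (wid->steps_left > 0)
        wid->steps_left--;
    return wid->steps_left == 0 ? DEMO_WID_READY : DEMO_WID_BUSY;
}

/* Blocking wait built on the test: polls until ready and returns the
   number of polls it took. */
unsigned demo_wait(demo_wid_t *wid)
{
    unsigned polls = 1;

    while (demo_test(wid) == DEMO_WID_BUSY)
        polls++;
    return polls;
}
```

The design point carried over from DaCS is that the caller never needs to know what kind of operation is behind the identifier; the same test and wait work for any of them.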

Remote Direct Memory Access

For large blocks of data, your best choice by far is access to remote memory. The procedure seems a little convoluted at first, but it works well in practice. It is needed because the Cell/B.E. processor does not have a shared memory map across all cores. The SPEs have local store, and they simply can't address main memory. Similarly, there is generally no way to address the local store of an SPE from the PPE. The only way to access remote memory is through DMA transfers.

DaCS abstracts this to the concept of a memory region, which is created on the DE that actually has the memory, and then it's shared out to another DE. For example, the PPE could share out a chunk of system main memory that would then be accepted by one of the SPEs. Once a chunk of memory has been accepted, it can be accessed through the dacs_put and dacs_get functions (or their corresponding list equivalents, which can perform multiple transfers).

A typical use of these might be to allocate a number of such buffers on the PPE and then share them out to the SPEs. The SPEs can then copy data out of these buffers as needed. It is possible for more than one SPE to have access to the same shared memory region. If you don't feel like tracking a large number of these regions, you could, in theory, just make a single large one and do reads and writes to different parts of it. It might be easier to understand, however, if you provide multiple separate buffers.
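
If you do go with a single large region, each SPE needs to know where its slice starts. Here is a minimal sketch of that bookkeeping, assuming the 16-byte alignment rule from earlier in the article applies to each slice; the helper names and DEMO_ALIGN are hypothetical, and nothing here calls DaCS.

```c
#include <stddef.h>

#define DEMO_ALIGN 16u   /* assumed alignment for every slice */

/* Round n up to the next multiple of DEMO_ALIGN. */
size_t demo_round_up(size_t n)
{
    return (n + (DEMO_ALIGN - 1)) & ~(size_t)(DEMO_ALIGN - 1);
}

/* Byte offset of slice `index` when each slice holds slice_bytes of data. */
size_t demo_slice_offset(size_t index, size_t slice_bytes)
{
    return index * demo_round_up(slice_bytes);
}

/* Total bytes to allocate for `slices` slices. */
size_t demo_region_bytes(size_t slices, size_t slice_bytes)
{
    return slices * demo_round_up(slice_bytes);
}
```

For example, six slices of 100 bytes each are padded to 112 bytes apiece, so the region needs 672 bytes and slice 3 starts at offset 336.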

Message passing

For smaller chunks of data, DaCS provides a message passing system that allows you to send and receive messages. Messages are sent asynchronously, and a dacs_wait on the corresponding wait identifier blocks until the remote side has received the message.

A particular caveat: if the receiver's specified buffer isn't large enough to hold the whole message, the operation might fail silently. (Ideally, a future release of the library should yield an error, rather than failing silently.) Nonetheless, the simplest thing to do is to ensure that you never send messages larger than the receive buffers that you plan to use. A standard fixed message size might be the best choice.
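
A fixed message size is simple to enforce on the sending side. The following sketch is one possible approach, not part of DaCS: DEMO_MSG_SIZE, demo_msg_buf_t, and demo_pack are hypothetical, the 64-byte size is arbitrary, and the alignment attribute assumes a GCC-compatible compiler. Every message is packed into the same fixed-size, 16-byte-aligned buffer, and anything that would not fit is rejected before it is ever sent.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_MSG_SIZE 64u   /* arbitrary fixed size shared by both sides */

/* Fixed-size, 16-byte-aligned message buffer. */
typedef struct {
    unsigned char bytes[DEMO_MSG_SIZE];
} __attribute__((aligned(16))) demo_msg_buf_t;

/* Copy len bytes into a zero-padded fixed-size message.
   Returns 0 on success, -1 if the arguments are invalid or the data
   would not fit in the receiver's buffer. */
int demo_pack(demo_msg_buf_t *out, const void *data, size_t len)
{
    if (out == NULL || data == NULL || len > DEMO_MSG_SIZE)
        return -1;
    memset(out->bytes, 0, DEMO_MSG_SIZE);
    memcpy(out->bytes, data, len);
    return 0;
}
```

If both sender and receiver always use sizeof(demo_msg_buf_t) as the length, the silent-failure case cannot arise from a size mismatch.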

Messages can be sent on streams, which are 32-bit values used to identify the kind of message. A later dacs_recv call can check for only messages on a particular stream, or it can specify the magic value DACS_STREAM_ALL to accept any messages that are waiting. (Stream identifiers must be between 0 and the predefined constant DACS_STREAM_UB, inclusive.)
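
The matching rule for streams can be sketched as a filter over pending messages. This is an illustration only: the demo_ names are hypothetical, and the numeric values given to DEMO_STREAM_UB and DEMO_STREAM_ALL are placeholders chosen for the example, not the values of the real DaCS constants.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_STREAM_UB   255u          /* placeholder upper bound */
#define DEMO_STREAM_ALL  0xFFFFFFFFu   /* placeholder "any stream" value */

/* A pending message: its stream tag plus a small payload. */
typedef struct {
    uint32_t stream;
    uint32_t payload;
} demo_msg_t;

/* A stream identifier is valid if it lies in 0..DEMO_STREAM_UB inclusive. */
int demo_stream_valid(uint32_t stream)
{
    return stream <= DEMO_STREAM_UB;
}

/* Index of the first pending message that a receive on `wanted` would
   match, or -1 if none matches. DEMO_STREAM_ALL matches everything. */
int demo_match(const demo_msg_t *pending, size_t count, uint32_t wanted)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (wanted == DEMO_STREAM_ALL || pending[i].stream == wanted)
            return (int)i;
    }
    return -1;
}
```

A protocol might, for instance, reserve one stream for commands and another for results, so that a receiver waiting for results is not handed a command by accident.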

Mailboxes, too

The DaCS library provides an abstraction of the mailbox facility on the Cell/B.E. system. If you are sending simple 32-bit values, these might offer substantial improvements in efficiency. Furthermore, the alignment requirements are somewhat looser, so you can send individual members out of an array of 32-bit values without worrying about alignment. The obvious weakness of mailboxes is that mailbox operations are automatically blocking operations; you can't perform them asynchronously. They also can't perform byte swapping.

What you can do, then, is to use dacs_mailbox_test to see whether a read or write operation will block. Remember though: As with most such operations, this is subject to race conditions. If all six of your child DEs check at the same time whether they can write without blocking, they might end up all thinking they can and all writing at once.
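
The race is easy to reproduce in a deterministic model. The sketch below does not use DaCS; the demo_ names and the mailbox capacity are assumptions made for the example. It plays out the worst-case interleaving, in which every writer tests before any writer writes, and counts how many writers would block even though their test said there was room.

```c
#include <stdint.h>

/* Toy mailbox with a fixed number of entries (capacity is an assumption). */
typedef struct {
    uint32_t capacity;
    uint32_t used;
} demo_mailbox_t;

/* Stand-in for the "would a write block?" query: free entries right now. */
uint32_t demo_mailbox_free(const demo_mailbox_t *mb)
{
    return mb->capacity - mb->used;
}

/* Returns 1 if the write fit, 0 if it would have had to block. */
int demo_mailbox_try_write(demo_mailbox_t *mb)
{
    if (mb->used >= mb->capacity)
        return 0;
    mb->used++;
    return 1;
}

/* Worst case: all writers test first, then every writer that saw room
   writes. Returns how many of them would block despite a positive test. */
uint32_t demo_race_blocked(demo_mailbox_t *mb, uint32_t writers)
{
    uint32_t saw_room = 0, blocked = 0, i;

    for (i = 0; i < writers; i++)
        if (demo_mailbox_free(mb) > 0)
            saw_room++;
    for (i = 0; i < saw_room; i++)
        if (!demo_mailbox_try_write(mb))
            blocked++;
    return blocked;
}
```

With a four-entry mailbox and six writers, all six see room, four writes fit, and two writers end up blocking anyway. The test is a hint, not a reservation.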


Conclusion

DaCS offers a rather breathtaking array of tools. Because they're fairly specialized, you really need to know about most of them before you can start designing a useful protocol. Without the message passing or mailbox functions, you can't make much use of the remote memory access features. Without wait identifiers, you can't do much at all except send data back and forth through mailboxes.

The next article in the series shows you how these various tools can be combined to build a DaCS-based version of the fractal program from "The little broadband engine that could: Use multiple SPEs for a single task."

