The little broadband engine that could: IDL is dead--long live DaCS!

A great substitute for IDL: The Data Communication and Synchronization library

Level: Intermediate

Peter Seebach (developerworks@seebs.net), Freelance writer, Plethora.net

04 Mar 2008

In SDK 3.0, the Data Communication and Synchronization library (DaCS) provides a sparkling substitute for IDL. DaCS is a set of services to aid the development of applications and application frameworks in a heterogeneous multi-tiered system. This article takes you on a tour of the DaCS process model and explores general DaCS principles, including communication and memory access.

Introduction

Previous articles in this series reviewed the IDL tool included with the Cell Broadband Engine SDK in versions up through 2.1. In version 3.0, though, the IDL tool is no longer provided. Instead, comparable functionality comes from the Data Communication and Synchronization library (DaCS), a set of services designed to simplify the development of distributed applications on Cell/B.E. (or similar) processors. The version of the library distributed with the SDK is fairly specific to the Cell/B.E. environment, but it is well suited to running programs there.

DaCS provides abstractions that support development models in which the PPE assigns tasks to SPEs, which then carry out those tasks asynchronously. The DaCS library defines a variety of DaCS elements (DEs), which fall into two groups:

  • Host elements (HEs)
  • Accelerator elements (AEs)

For purposes of Cell/B.E. development, you can consider these as abstractions of the PPE (a host element) and the SPEs (accelerator elements). On a typical Cell/B.E. blade server with two processors, it is possible to allocate a single blade with 16 SPE children or to allocate two CBE devices, each with eight children. The latter view might allow better control of processor affinity at some cost in complexity. In the Cell/B.E. SDK running on a PlayStation 3 system, it is possible to allocate only the six available SPE children; the higher-level framework is not available.

Introducing the process model

While DaCS has abstractions allowing for multiple processes on a single DE, the SDK currently supports only a single process on each element. Each process is started in the usual way for SPE programs, and it can then initialize the SPU side of the DaCS library, allowing communication with the library running on the PPE side.

The simplest way to set up an SPE process using DaCS will be eerily familiar to anyone who's used libspe, because the DaCS library uses the same embedded binaries. An SPE program compiled with spu-gcc can be turned into an embedded object using the ppu-embedspu program, and then it can be linked into a program using DaCS. Setup is slightly more complicated than with libspe, although you don't need to manage your own threading. DaCS handles this transparently.


Listing 1. Running a program on an SPE
                
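#include <stdint.h>
#include <libspe2.h>    /* spe_program_handle_t */
#include <dacs.h>

extern spe_program_handle_t spu_prog;   /* embedded SPU binary */

int main(void)
{
    de_id_t spes[16];
    dacs_process_id_t pids[16];
    uint32_t children;
    int32_t status;

    dacs_runtime_init(NULL, NULL);
    /* Ask how many SPEs are free, then try to reserve them all. */
    dacs_get_num_avail_children(DACS_DE_SPE, &children);
    dacs_reserve_children(DACS_DE_SPE, &children, spes);
    dacs_de_start(spes[0], &spu_prog, NULL, NULL,
                  DACS_PROC_EMBEDDED, &pids[0]);
    /* Block until it exits; status receives the value returned by its main. */
    dacs_de_wait(spes[0], pids[0], &status);
    dacs_runtime_exit();
    return 0;
}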

This program does the following:

  1. It queries the number of available children (the example assumes that at least one is available).
  2. It attempts to reserve all of those children.
  3. It begins a program on the first one.
  4. It waits for the program to terminate.
  5. It exits the DaCS runtime.

It's not a complicated program. The omission of error-checking is strictly for introductory purposes. In real code, you would want to check all of these operations carefully. The spu_prog handle is the same sort of embedded SPU code that was used in previous articles in the series, although the DaCS library obliges you to change from -m32 to -m64 when creating embedded binaries.

The status value is filled in with the return value of the main function of the SPE program. The two NULL arguments immediately after &spu_prog are passed in as argv and envp, respectively.

This covers the simplest case, in which a program runs entirely on data provided in its arguments and environment. However, the real strength of DaCS is in manipulating data once the child program is up and running.



Understanding the general principles

There are a number of common patterns throughout the entire DaCS API. The most significant patterns are:

  • The way the API handles return values
  • The need to consistently release resources
  • The alignment requirements across all the API calls

DaCS API and return values

The example code in Listing 1 omits error checking for brevity. As a result, it never uses the return values of any DaCS functions. Each function in the DaCS library returns only a success or failure indicator. Any data generated by a function are stored into objects pointed to by its arguments. For example, to get a count of available children, you can call dacs_get_num_avail_children(DACS_DE_SPE, &children). The address of children is passed to the function, which modifies the object through this pointer. The return value is usually the constant DACS_SUCCESS, indicating a successful operation, but it could also be an error code.

This setup avoids the quirk of needing special sentinel values to indicate error returns. This setup also provides greater consistency across the API. On the other hand, it can be a little confusing at times, and there are a number of cases where you need to pass the address of an object to one function but the object itself to another. Keep a close eye on that. Don't just ignore compiler warnings about type mismatches!
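The convention is easy to see in miniature. The following is a minimal sketch in plain C that does not use the DaCS headers: status_t, count_children, check_status, and CHECK are hypothetical stand-ins invented for illustration, and the count of six is made up. It shows a function that reports only success or failure through its return value while delivering its result through a pointer, plus a small wrapper that makes a failure harder to ignore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a library status code; not a DaCS type. */
typedef enum { ST_SUCCESS = 0, ST_ERR_INVALID = -1 } status_t;

/* Stand-in for a call such as dacs_get_num_avail_children: the result
 * goes through the pointer, the return value only reports success. */
status_t count_children(uint32_t *out)
{
    if (out == NULL)
        return ST_ERR_INVALID;
    *out = 6;                /* pretend six accelerators are free */
    return ST_SUCCESS;
}

/* Log a failed status with the text and location of the call, then pass
 * the status back so the caller can decide how to recover. */
status_t check_status(status_t st, const char *what,
                      const char *file, int line)
{
    assert(what != NULL && file != NULL);
    if (st != ST_SUCCESS)
        fprintf(stderr, "%s:%d: %s failed (%d)\n", file, line, what, (int)st);
    return st;
}

/* Evaluates the call exactly once and records where it was made. */
#define CHECK(call) check_status((call), #call, __FILE__, __LINE__)

/* Example caller: returns the count, or 0 if the query failed. */
uint32_t children_or_zero(void)
{
    uint32_t n = 0;
    if (CHECK(count_children(&n)) != ST_SUCCESS)
        return 0;
    return n;
}
```

With a wrapper of this kind, passing the object where its address was expected (or the reverse) still shows up as a compiler warning, and a runtime failure is logged at the call site instead of being silently dropped.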

Allocating and releasing resources

Each resource allocation API call has a corresponding call to release the resource. These calls are not optional, even if your program is about to exit. In fact, dacs_runtime_exit() can hang if resources have not been freed correctly! So, if you create a remote memory region with dacs_remote_mem_create, you must destroy it with dacs_remote_mem_destroy when you are done with it. If you have accepted access to a memory region with dacs_remote_mem_accept, you must release it with dacs_remote_mem_release when you are done. The dacs_remote_mem_destroy call blocks until every client that accepted the region has released it.

While this might sound fairly intrusive, it's not bad in practice, although returning an error condition would be preferable to blocking. These calls tend to be easy to get right, because real code usually has a clearly defined point at which a resource is no longer being used.
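One common way to keep acquire and release calls paired in C is to release in reverse order through a chain of cleanup labels. The sketch below is an illustration only: res_acquire, res_release, res_live, and run_job are hypothetical names, and the "resources" are just a counter standing in for paired calls such as dacs_remote_mem_create and dacs_remote_mem_destroy.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a paired allocate/free API. */
static int live_resources = 0;   /* how many are currently held */

bool res_acquire(bool should_fail)
{
    if (should_fail)
        return false;
    live_resources++;
    return true;
}

void res_release(void)
{
    assert(live_resources > 0);  /* releasing something never held is a bug */
    live_resources--;
}

int res_live(void)
{
    return live_resources;
}

/* Acquire three resources, do the work, and release in reverse order.
 * fail_at (1..3) makes that acquisition fail; 0 means none fail. */
bool run_job(int fail_at)
{
    bool ok = false;

    if (!res_acquire(fail_at == 1)) goto out;
    if (!res_acquire(fail_at == 2)) goto undo_first;
    if (!res_acquire(fail_at == 3)) goto undo_second;

    ok = true;                   /* real work would happen here */

    res_release();               /* third resource */
undo_second:
    res_release();               /* second resource */
undo_first:
    res_release();               /* first resource */
out:
    return ok;
}
```

Whichever step fails, control falls through exactly the releases that match what was acquired, so nothing is still held when the function returns.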

Alignment requirements

Nearly everything needs to be aligned suitably for the processor, which generally means aligned to a 16-byte boundary. Misaligned things don't always have the effects you'd expect, and sometimes they aren't detected gracefully. If you start seeing mysterious crashes, look for unaligned accesses, and not necessarily on the DE that's crashing! In one example, a misaligned object on a PPE can cause an SPE to crash. (Worse yet, changing the alignment of other objects on the PPE can influence the alignment of the object in question.)

While gcc's __attribute__ ((aligned (16))) is your friend, it does not necessarily have any effect (or at least, the desired effect) on stack variables. This means that in some cases, you will have to declare objects globally or allocate memory for them. Unfortunately, the alignment requirements don't mesh well with object sizes. For example, you can't simply declare an array of N integers to use as arguments for dacs_send messages, because if the first item in the array is aligned correctly, later ones won't be!

The documentation doesn't really give much detail for the alignment requirements. In practice, just align everything to 16 bytes, and you should be fine.
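The array problem described above can be worked around by padding each element to a full 16 bytes. This is a minimal sketch assuming GCC's aligned attribute and a C11 compiler; msg_slot_t, slots, plain, and the helper functions are names made up for this example. It contrasts a padded slot type, where every element is aligned, with a plain array, where only the first element is.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLOTS 4

/* One value per slot. The aligned attribute on the type pads
 * sizeof(msg_slot_t) up to 16, so every array element starts on a
 * 16-byte boundary, not just the first. */
typedef struct {
    uint32_t value;
} __attribute__((aligned(16))) msg_slot_t;

static_assert(sizeof(msg_slot_t) == 16, "slot must fill one 16-byte unit");

/* Declared at file scope rather than on the stack. */
msg_slot_t slots[NUM_SLOTS];

/* A plain array for comparison: only element 0 is aligned to 16 bytes;
 * the following elements sit 4 bytes apart. */
uint32_t plain[NUM_SLOTS] __attribute__((aligned(16)));

int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Number of elements in slots[] that sit on a 16-byte boundary. */
int count_aligned_slots(void)
{
    int n = 0;
    for (int i = 0; i < NUM_SLOTS; i++)
        n += is_aligned16(&slots[i]);
    return n;
}
```

The cost is 12 wasted bytes per 32-bit value, which is usually an acceptable trade for being able to pass the address of any element to a call that requires alignment.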



Regarding communication and memory access

The SPEs have local storage, but they have no direct access to the main memory the PPE uses. Most readers are probably aware of this, but it bears emphasis because all non-trivial Cell/B.E. programming ends up involving a fair amount of code to shuffle data. With IDL, data were moved automatically by the library without explicit operations, much as data are sent to a remote system using a remote procedure call (RPC) interface. In DaCS, data management is somewhat more explicit.

DaCS provides functions to send and receive messages, and it also provides a family of functions to share regions of memory between elements. In general, DaCS provides basic tools as opposed to a complete protocol. Operations are asynchronous, and overall structure is somewhat subject to negotiation. In general, you can send messages and then wait for confirmation of their receipt. What this means is that both sides need to cooperate. If you send a message and wait for the other side to receive it but the other side never receives it, you will wait forever.

Many of the communication functions offer byte-swapping primitives. While these are doubtless of significant importance in a mixed environment, such as an Intel-based or AMD-based server using Cell/B.E. blade servers, the only value you need to know to make effective use of a pure Cell/B.E. system is DACS_BYTE_SWAP_DISABLE.

Wait identifiers and transfer completion

While wait identifiers in and of themselves don't do much, you can't use any of the communication systems without understanding wait identifiers. The message-passing functions and memory transfer functions defined by DaCS happen asynchronously. Rather than requiring a different test or wait protocol for each, they use a common facility called a wait identifier. A wait identifier, once allocated (or reserved), is passed as an argument to a communication function, and it can then be queried to see whether a given communication has completed. Note that a wait identifier can be used only for a transaction initiated on the local DE. You can't query someone else's wait identifiers.

Wait identifiers standardize the interaction with a variety of functions. Whether you're calling dacs_put or dacs_recv, you pass in a wait identifier to the function, then you call dacs_test to see whether it's done or dacs_wait to block until it completes.
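The life cycle of a wait identifier (reserve it, hand it to an asynchronous call, then test or wait on it) can be modeled without any hardware. What follows is a single-threaded simulation, not DaCS code: every sim_ name is hypothetical, and an operation "completes" after a fixed number of polls purely so the behavior is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_WIDS 4

/* Hypothetical simulation: each wait identifier tracks one in-flight
 * operation, which finishes after a set number of polls. */
typedef struct {
    bool reserved;
    int  polls_left;   /* 0 means nothing pending */
} sim_wid_t;

static sim_wid_t wids[MAX_WIDS];

/* Reserve a free identifier; returns its index, or -1 if none are free. */
int sim_wid_reserve(void)
{
    for (int i = 0; i < MAX_WIDS; i++) {
        if (!wids[i].reserved) {
            wids[i].reserved = true;
            wids[i].polls_left = 0;
            return i;
        }
    }
    return -1;
}

void sim_wid_release(int wid)
{
    assert(wid >= 0 && wid < MAX_WIDS && wids[wid].reserved);
    wids[wid].reserved = false;
}

/* Start an asynchronous operation tagged with wid; returns immediately. */
void sim_start_transfer(int wid, int polls_needed)
{
    assert(wid >= 0 && wid < MAX_WIDS && wids[wid].reserved);
    wids[wid].polls_left = polls_needed;
}

/* Non-blocking check: true once the tagged operation has finished. */
bool sim_test(int wid)
{
    if (wids[wid].polls_left > 0)
        wids[wid].polls_left--;
    return wids[wid].polls_left == 0;
}

/* Blocking wait: poll until done; returns how many polls it took. */
int sim_wait(int wid)
{
    int polls = 1;
    while (!sim_test(wid))
        polls++;
    return polls;
}
```

The point of the pattern is that the caller never needs to know what kind of operation is pending: the same test and wait functions work for any call that was given the identifier.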

Remote Direct Memory Access

For large blocks of data, your best choice by far is access to remote memory. The procedure seems a little convoluted at first, but it works well in practice. In particular, the Cell/B.E. processor does not have a shared memory map across all cores. The SPEs have local store, and they simply can't address main memory. Similarly, there is generally no way to address the local store of an SPE from the PPE. The only way to access remote memory is through DMA transfers.

DaCS abstracts this to the concept of a memory region, which is created on the DE that actually has the memory, and then it's shared out to another DE. For example, the PPE could share out a chunk of system main memory that would then be accepted by one of the SPEs. Once a chunk of memory has been accepted, it can be accessed through the dacs_put and dacs_get functions (or their corresponding list equivalents, which can perform multiple transfers).

A typical use of these might be to allocate a number of such buffers on the PPE and then share them out to the SPEs. The SPEs can then copy data out of these buffers as needed. It is possible for more than one SPE to have access to the same shared memory region. If you don't feel like tracking a large number of these regions, you could, in theory, just make a single large one and do reads and writes to different parts of it. It might be easier to understand, however, if you provide multiple separate buffers.
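If you do go with a single large region, each SPE needs to know which part of it to use. The sketch below shows one way to compute that, assuming the transfer calls accept a byte offset into the region and that 16-byte alignment is sufficient; align_down, slice_size, and slice_offset are hypothetical helpers, not part of DaCS.

```c
#include <assert.h>
#include <stdint.h>

#define ALIGNMENT 16u   /* the 16-byte rule of thumb */

/* Round n down to a multiple of ALIGNMENT. */
uint64_t align_down(uint64_t n)
{
    return n - (n % ALIGNMENT);
}

/* Split a shared region of region_size bytes into num_slices equal,
 * 16-byte-aligned slices. Returns the size of each slice; any leftover
 * bytes at the end of the region are simply unused. */
uint64_t slice_size(uint64_t region_size, uint32_t num_slices)
{
    assert(num_slices > 0);
    return align_down(region_size / num_slices);
}

/* Byte offset of slice `index` from the start of the region. */
uint64_t slice_offset(uint64_t region_size, uint32_t num_slices,
                      uint32_t index)
{
    assert(index < num_slices);
    return (uint64_t)index * slice_size(region_size, num_slices);
}
```

For example, a 4096-byte region shared among eight SPEs gives each one a 512-byte slice, with the fourth SPE (index 3) starting at offset 1536. Because every slice size is a multiple of 16, every offset is too, provided the region itself starts on an aligned address.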

Message passing

For smaller chunks of data, DaCS provides a message passing system that allows you to send and receive messages. Messages are sent asynchronously, and a dacs_wait on the corresponding wait identifier blocks until the remote side has received the message.

A particular caveat: if the receiver's specified buffer isn't large enough to hold the whole message, the operation might fail silently. (Ideally, a future release of the library should yield an error, rather than failing silently.) Nonetheless, the simplest thing to do is to ensure that you never send messages larger than the receive buffers that you plan to use. A standard fixed message size might be the best choice.
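A fixed message size is straightforward to enforce with a single message type shared by sender and receiver. This is a sketch under assumptions: the 64-byte size, the fixed_msg_t layout, and msg_pack are all invented for illustration, and it relies on GCC's aligned attribute and C11's static_assert.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MSG_BYTES   64u                                /* hypothetical size */
#define MSG_PAYLOAD (MSG_BYTES - 2 * sizeof(uint32_t)) /* 56 bytes of data */

/* Every message is exactly MSG_BYTES long and 16-byte aligned, so a
 * receive buffer of this type always fits whatever was sent. */
typedef struct {
    uint32_t kind;                  /* application-defined message type */
    uint32_t length;                /* bytes of payload actually used */
    uint8_t  payload[MSG_PAYLOAD];
} __attribute__((aligned(16))) fixed_msg_t;

static_assert(sizeof(fixed_msg_t) == MSG_BYTES, "message size must be fixed");

/* Fill msg from data. Returns 0 on success, or -1 if data does not fit,
 * in which case msg is left untouched and nothing should be sent. */
int msg_pack(fixed_msg_t *msg, uint32_t kind, const void *data, size_t len)
{
    if (len > MSG_PAYLOAD)
        return -1;
    memset(msg, 0, sizeof *msg);
    msg->kind = kind;
    msg->length = (uint32_t)len;
    if (len > 0)
        memcpy(msg->payload, data, len);
    return 0;
}
```

Both sides then pass sizeof(fixed_msg_t) as the length, and an oversized payload is rejected on the sending side before it can be truncated or dropped on the receiving side.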

Messages are sent on streams, which are 32-bit values used to identify the kind of message. A later dacs_recv call can check only for messages on a particular stream, or it can specify the special value DACS_STREAM_ALL to accept any messages that are waiting. (Stream identifiers must be between 0 and the predefined constant DACS_STREAM_UB, inclusive.)
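The filtering rule for streams can be written down as a small predicate. In this sketch the numeric values of SIM_STREAM_UB and SIM_STREAM_ALL are placeholders chosen for illustration (the real values come from the DaCS headers), and the stream names and both functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values for illustration only; the real limits are the
 * DaCS constants DACS_STREAM_UB and DACS_STREAM_ALL. */
#define SIM_STREAM_UB  0x0000FFFFu
#define SIM_STREAM_ALL 0xFFFFFFFFu

/* Application-defined message kinds, one stream each. */
enum { STREAM_WORK = 1, STREAM_RESULT = 2, STREAM_SHUTDOWN = 3 };

/* A sender may only use identifiers in [0, SIM_STREAM_UB]. */
bool stream_is_valid(uint32_t stream)
{
    return stream <= SIM_STREAM_UB;
}

/* Would a receive filtered on `wanted` pick up a message sent on `sent`? */
bool stream_matches(uint32_t wanted, uint32_t sent)
{
    assert(stream_is_valid(sent));
    return wanted == SIM_STREAM_ALL || wanted == sent;
}
```

Giving each kind of message its own stream lets a receiver wait for, say, a shutdown request without first having to drain unrelated work messages.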

Mailboxes, too

The DaCS library provides an abstraction of the mailbox facility on the Cell/B.E. system. If you are sending simple 32-bit values, these might offer substantial improvements in efficiency. Furthermore, the alignment requirements are somewhat looser, so you can send individual members out of an array of 32-bit values without worrying about alignment. The obvious weakness of mailboxes is that mailbox operations are automatically blocking operations; you can't perform them asynchronously. They also can't perform byte swapping.

What you can do, then, is to use dacs_mailbox_test to see whether a read or write operation will block. Remember though: As with most such operations, this is subject to race conditions. If all six of your child DEs check at the same time whether they can write without blocking, they might end up all thinking they can and all writing at once.
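The race is easier to see when the two steps are pulled apart. The following is a single-threaded simulation with a hypothetical one-entry mailbox; sim_mailbox_t and the sim_ functions are invented for this example and do not model the real mailbox depth. Two writers both check, both are told the write will not block, and only then does each write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical one-entry mailbox, simulated in a single thread. */
typedef struct {
    bool     full;
    uint32_t value;
} sim_mailbox_t;

/* Non-blocking check: could a write go through right now? */
bool sim_can_write(const sim_mailbox_t *mb)
{
    return !mb->full;
}

/* Returns true if the write went through, false where a real mailbox
 * write would have blocked. */
bool sim_try_write(sim_mailbox_t *mb, uint32_t value)
{
    if (mb->full)
        return false;
    mb->value = value;
    mb->full = true;
    return true;
}

/* Two writers check first and write afterwards. Returns how many of the
 * writes would have blocked despite a successful check. */
int race_demo(void)
{
    sim_mailbox_t mb = { false, 0 };
    int blocked = 0;

    bool a_ok = sim_can_write(&mb);   /* writer A checks: free */
    bool b_ok = sim_can_write(&mb);   /* writer B checks: still free */
    assert(a_ok && b_ok);

    if (!sim_try_write(&mb, 1)) blocked++;   /* A fills the only entry */
    if (!sim_try_write(&mb, 2)) blocked++;   /* B's check is now stale */
    return blocked;
}
```

Here race_demo returns 1: the second writer was told it could proceed, yet its write would have blocked. A test result is only a hint about the state at the moment of the check, so code that must never block needs some other form of coordination between writers.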



Conclusion

DaCS offers a rather breathtaking array of tools. Because they're fairly specialized, you really need to know about most of them before you can start designing a useful protocol. Without the message passing or mailbox functions, you can't make much use of the remote memory access features. Without wait identifiers, you can't do much at all except send data back and forth through mailboxes.

The next article in the series shows you how these various tools can be combined to build a DaCS-based version of the fractal program from "The little broadband engine that could: Use multiple SPEs for a single task."

About the author

Peter Seebach usually responds to a new API the way most people respond to a brightly-wrapped present--by shaking it. He keeps wrapping paper and old documentation alike for the sentimental value.




IBM is a trademark of IBM Corporation in the United States, other countries, or both. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.

