
Programming high-performance applications on the Cell BE processor, Part 5: Programming the SPU in C/C++

Use the language extensions to power up your applications

Jonathan Bartlett (johnnyb@eskimo.com), Director of Technology, New Medio
Jonathan Bartlett is the author of the book Programming from the Ground Up , an introduction to programming using Linux assembly language. He is the lead developer at New Media Worx, responsible for developing Web, video, kiosk, and desktop applications for clients.

Summary:  In Part 5 of the Programming high-performance applications on the Cell BE processor series, apply your knowledge of the synergistic processing unit (SPU) to programming the Cell Broadband Engine™ (Cell BE) processor in C/C++. Learn how to use the vector extensions, direct the compiler to do branch prediction, and perform DMA transfers in C/C++.


Date:  20 Mar 2007
Level:  Intermediate

Previous discussions about the SPU have focused on the SPU's assembly language to help you get to know the processor intimately. Now I will switch to C/C++ so that you can see how to let the compiler do a large amount of the work for you. To utilize the SPU C/C++ language extensions, the header file spu_intrinsics.h must be included at the beginning of your code.

Vector basics on the SPU

The primary difference between vector processors and non-vector processors is that vector processors have large registers which allow them to store multiple values (called elements) of the same data type and process them with the same operation at once. On vector processors a register is treated both as a single unit and as multiple units. To represent this concept in C/C++, a vector keyword has been added to the language, which takes a primitive data type and uses it across a whole register. For instance, vector unsigned int myvec; creates a four-integer vector whose elements are loaded, processed, and stored all together, and the variable myvec refers to all four of them simultaneously. The signed/unsigned keyword is required for non-floating point declarations. Vector constants are created by putting the type of vector in parentheses followed by the contents of the vector in curly braces. For instance, you can assign values to a vector named myvec like this:

vector unsigned int myvec = (vector unsigned int){1, 2, 3, 4};

In addition to direct assignment, there are four main primitives that are used to go between scalar and vector data: spu_insert, spu_extract, spu_promote, and spu_splats. spu_insert is used to put a scalar value into a specific element of a vector. spu_insert(5, myvec, 0) returns a copy of the vector myvec with the first element (element 0) of the new vector set to 5. spu_extract pulls out a specific element from a vector and returns it as a scalar. spu_extract(myvec, 0) returns the first element of myvec as a scalar. spu_promote converts a value to a vector, but only defines one element. The type of vector depends on the type of value promoted. spu_promote((unsigned int)5, 1) creates a vector of unsigned ints with 5 in the second element (element 1), and the remaining elements undefined. spu_splats works like spu_promote, except that it copies the value to all elements of the vector. spu_splats((unsigned int)5) creates a vector of unsigned ints with each element having the value 5.

It is tempting to think of vectors as short arrays, but in fact they act differently in several respects. Vectors are treated essentially as scalar values, while arrays are manipulated as references. For instance, spu_insert does not modify the contents of the vector. Instead, it returns a brand new copy of the vector with the inserted element. It is an expression that results in a value, not a modification to the value itself. Just as myvar + 1 gives back a new value instead of modifying myvar, spu_insert(1, myvec, 0) does not modify myvec, but instead returns a new vector value that is identical to myvec except that the first element is set to 1.
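To make the value semantics concrete without needing SPU hardware, here is a minimal sketch in portable C. The emu_vec_u32 type and the emu_ functions are hypothetical stand-ins invented for this illustration, not part of spu_intrinsics.h; they only mimic the behavior described above by passing a small struct by value.

```c
#include <stdint.h>

/* Portable stand-in for "vector unsigned int": four 32-bit lanes.
   This is an illustration only, not the real SPU vector type. */
typedef struct { uint32_t lane[4]; } emu_vec_u32;

/* Mimics spu_splats: copy one scalar into every lane. */
emu_vec_u32 emu_splats(uint32_t value) {
    emu_vec_u32 v = {{ value, value, value, value }};
    return v;
}

/* Mimics spu_insert: v arrives by value, so the caller's vector is
   untouched and the modified copy is returned. The index is wrapped
   to 0..3 here only to keep the sketch memory-safe. */
emu_vec_u32 emu_insert(uint32_t value, emu_vec_u32 v, unsigned index) {
    v.lane[index & 3] = value;
    return v;
}

/* Mimics spu_extract: return one lane as a scalar. */
uint32_t emu_extract(emu_vec_u32 v, unsigned index) {
    return v.lane[index & 3];
}
```

Calling emu_insert(9, a, 3) and assigning the result to b leaves a unchanged, which is the same pattern that b = spu_insert(9, a, 3) follows in Listing 1.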

Here is a short program using these ideas (enter as vec_test.c):


Listing 1. Program to introduce SPU C/C++ language extensions
                
#include <stdio.h> /* for printf */
#include <spu_intrinsics.h>

void print_vector(char *var, vector unsigned int val) {
	printf("Vector %s is: {%u, %u, %u, %u}\n", var, spu_extract(val, 0),
	 spu_extract(val, 1), spu_extract(val, 2), spu_extract(val, 3));
}

int main() {
	/* Create four vectors */
	vector unsigned int a = (vector unsigned int){1, 2, 3, 4};
	vector unsigned int b;
	vector unsigned int c;
	vector unsigned int d;

	/* b is identical to a, but the last element is changed to 9 */
	b = spu_insert(9, a, 3);

	/* c has all four values set to 20 */
	c = spu_splats((unsigned int) 20);

	/* d has the second value set to 5, and the others are garbage */
	/* (in this case they will all be set to 5, but that should not be relied upon) */
	d = spu_promote((unsigned int)5, 1);

	/* Show Results */
	print_vector("a", a);
	print_vector("b", b);
	print_vector("c", c);
	print_vector("d", d);

	return 0;
}

To compile and run the program under elfspe, simply do:

spu-gcc vec_test.c -o vec_test
./vec_test


Vector intrinsics

The C/C++ language extensions include data types and intrinsics that give the programmer nearly full access to the SPU's assembly language instructions. However, many intrinsics are provided which greatly simplify the SPU's assembly language by coalescing many similar instructions into one intrinsic. Instructions that differ only on the type of operand (such as a, ai, ah, ahi, fa, and dfa for addition) are represented by a single C/C++ intrinsic which selects the proper instruction based on the type of the operand. For addition, spu_add, when given two vector unsigned ints as parameters, will generate the a (32-bit add) instruction. However, if given two vector floats as parameters, it will generate the fa (float add) instruction. Note that the intrinsics generally have the same limitations as their corresponding assembly language instructions. However, in cases where an immediate value is too large for the appropriate immediate-mode instruction, the compiler will promote the immediate value to a vector and do the corresponding vector/vector operation. For instance, spu_add(myvec, 2) generates an ai (add immediate) instruction, while spu_add(myvec, 2000) first loads the 2000 into its own vector using il and then performs the a (add) instruction.
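The idea of one name selecting a different operation based on operand type can be sketched in portable C11 with _Generic. This is only an analogy: the emu_ types, functions, and the emu_add macro are hypothetical, and nothing here is meant to describe how spu_intrinsics.h or the compiler actually implements spu_add.

```c
#include <stdint.h>

/* Illustrative stand-ins for two SPU vector types. */
typedef struct { uint32_t lane[4]; } emu_vec_u32;
typedef struct { float    lane[4]; } emu_vec_f32;

/* Plays the role of the integer add instruction (a). */
emu_vec_u32 emu_add_u32(emu_vec_u32 x, emu_vec_u32 y) {
    emu_vec_u32 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = x.lane[i] + y.lane[i];
    return r;
}

/* Plays the role of the float add instruction (fa). */
emu_vec_f32 emu_add_f32(emu_vec_f32 x, emu_vec_f32 y) {
    emu_vec_f32 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = x.lane[i] + y.lane[i];
    return r;
}

/* One name, two implementations: the type of the first operand
   picks the function at compile time (requires C11). */
#define emu_add(x, y) _Generic((x),   \
        emu_vec_u32: emu_add_u32,     \
        emu_vec_f32: emu_add_f32)((x), (y))
```

With this in place, emu_add(a, b) resolves to emu_add_u32 for two emu_vec_u32 operands and to emu_add_f32 for two emu_vec_f32 operands, much as spu_add chooses between a and fa.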

The order of operands in the intrinsics is essentially the same as those of the assembly language instruction except that the first operand (which holds the destination register in assembly language) is not specified in C/C++, but instead is used as the return value for the function. The compiler supplies the appropriate operand in the assembly language code it generates.

Here are some of the more common SPU intrinsics (types are not given as most of them are polymorphic):

  • spu_add(val1, val2)
    Adds each element of val1 to the corresponding element of val2. If val2 is a non-vector value, it adds the value to each element of val1.
  • spu_sub(val1, val2)
    Subtracts each element of val2 from the corresponding element of val1. If val1 is a non-vector value, then val1 is replicated across a vector, and then val2 is subtracted from it.
  • spu_mul(val1, val2)
    Because the multiplication instructions operate so differently, the SPU intrinsics do not coalesce them as much as they do for other operations. spu_mul handles floating point multiplication (single and double precision). The result is a vector where each element is the result of multiplying the corresponding elements of val1 and val2 together.
  • spu_and(val1, val2), spu_or(val1, val2), spu_not(val), spu_xor(val1, val2), spu_nor(val1, val2), spu_nand(val1, val2), spu_eqv(val1, val2)
    Boolean operations operate bit-by-bit, so the type of operands the boolean operations receive is not relevant except for determining the type of value they will return. spu_eqv is a bitwise equivalency operation, not a per-element equivalency operation.
  • spu_rl(val, count), spu_sl(val, count)
    spu_rl rotates each element of val left by the number of bits specified in the corresponding element of count. Bits rotated off the end are rotated back in on the right. If count is a scalar value, then it is used as the count for all elements of val. spu_sl operates the same way, but performs a shift instead of a rotate.
  • spu_rlmask(val, count), spu_rlmaska(val, count), spu_rlmaskqw(val, count), spu_rlmaskqwbyte(val, count)
    These are very confusingly named operations. They are named "rotate left and mask," but they are actually performing right shifts (they are implemented by a combination of left shifts and masks, but the programming interface is for right shifts). spu_rlmask and spu_rlmaska shift each element of val to the right by the number of bits in the corresponding element of count (or the value of count if count is a scalar). spu_rlmaska replicates the sign bit as bits are shifted in. spu_rlmaskqw operates on the whole quadword at a time, but only up to 7 bits (it performs a modulus on count to put it in the proper range). spu_rlmaskqwbyte works similarly, except that count is the number of bytes instead of bits, and count is modulus 16 instead of 8.
  • spu_cmpgt(val1, val2), spu_cmpeq(val1, val2)
    These instructions perform element-by-element comparisons of their two operands. The results are stored as all ones (for true) and all zeros (for false) in the resulting vector in the corresponding element. spu_cmpgt performs a greater-than comparison while spu_cmpeq performs an equality comparison.
  • spu_sel(val1, val2, conditional)
    This corresponds to the selb assembly language instruction. The instruction itself is bit-based, so all types use the same underlying instruction. However, the intrinsic operation returns a value of the same type as the operands. As in assembly language, spu_sel looks at each bit in conditional. If the bit is zero, the corresponding bit in the result is selected from the corresponding bit in val1; otherwise it is selected from the corresponding bit in val2.
  • spu_shuffle(val1, val2, pattern)
    This is an interesting instruction which allows you to rearrange the bytes in val1 and val2 according to a pattern, specified in pattern. The instruction goes through each byte in pattern, and if the byte starts with the bits 0b10, the corresponding byte in the result is set to 0x00; if the byte starts with the bits 0b110, the corresponding byte in the result is set to 0xff; if the byte starts with the bits 0b111, the corresponding byte in the result is set to 0x80; finally (and most importantly), if none of the previous are true, the last five bits of the pattern byte are used to choose which byte from val1 or val2 should be taken as the value for the current byte. The two values are concatenated, and the five-bit value is used as the byte index of the concatenated value. This is used for inserting elements into vectors as well as performing fast table lookups.
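The bit-level rules given above for spu_sel and spu_shuffle are easier to follow when written out byte by byte. Here is a minimal emulation in portable C that follows those descriptions. The emu_vec_u8 type and both function names are hypothetical, and this is a sketch of the semantics as described in this article, not the real intrinsics.

```c
#include <stdint.h>

/* Illustrative stand-in for a 16-byte vector such as vec_uchar16. */
typedef struct { uint8_t byte[16]; } emu_vec_u8;

/* Emulates the spu_sel rule: a 0 bit in mask takes the bit from a,
   a 1 bit in mask takes the bit from b. */
emu_vec_u8 emu_sel(emu_vec_u8 a, emu_vec_u8 b, emu_vec_u8 mask) {
    emu_vec_u8 r;
    for (int i = 0; i < 16; i++)
        r.byte[i] = (uint8_t)((a.byte[i] & ~mask.byte[i]) |
                              (b.byte[i] &  mask.byte[i]));
    return r;
}

/* Emulates the spu_shuffle rule for each pattern byte:
   10xxxxxx -> 0x00, 110xxxxx -> 0xFF, 111xxxxx -> 0x80,
   otherwise the low five bits index into a followed by b. */
emu_vec_u8 emu_shuffle(emu_vec_u8 a, emu_vec_u8 b, emu_vec_u8 pattern) {
    emu_vec_u8 r;
    for (int i = 0; i < 16; i++) {
        uint8_t p = pattern.byte[i];
        if ((p & 0xC0) == 0x80) {
            r.byte[i] = 0x00;
        } else if ((p & 0xE0) == 0xC0) {
            r.byte[i] = 0xFF;
        } else if ((p & 0xE0) == 0xE0) {
            r.byte[i] = 0x80;
        } else {
            unsigned index = p & 0x1F;  /* 0..31 over the 32 concatenated bytes */
            r.byte[i] = (index < 16) ? a.byte[index] : b.byte[index - 16];
        }
    }
    return r;
}
```

A pattern of 15, 14, ..., 0 reverses the bytes of a, while pattern bytes of 16 through 31 pull bytes from b, which is how a single shuffle can merge or look up data from two sources.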

All of the instructions that are prefixed with spu_ will try to find the best instruction match based on the types of operands. However, not all vector types are supported by all instructions -- support depends on whether an assembly language instruction exists to handle the type. In addition, if you want a specific instruction rather than having the compiler choose one, you can perform almost any non-branching instruction with the specific intrinsics. All specific intrinsics take the form si_assemblyinstructionname where assemblyinstructionname is the name of the assembly language instruction as defined in the SPU Assembly Language Specification. So, si_a(a, b) forces the instruction a to be used for addition. All operands to specific intrinsics are cast to a special type called qword, which is essentially an opaque register value type. The return values from specific intrinsics are also qwords, which can then be cast into whatever vector type you wish.


Using the intrinsics

Now let's look at how to do the uppercase conversion function using C/C++ rather than assembly language. The basic steps for converting a single vector are:

  1. Convert all values using the uppercase conversion.
  2. Do a vector comparison of all bytes to see if they are between 'a' and 'z'.
  3. Use the comparison to choose between the converted and unconverted values using the select instruction.
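The three steps above can be traced one byte at a time in portable C before looking at the vector version. This is a minimal sketch with hypothetical emu_ helper names; each helper imitates what a single byte lane of the corresponding SPU operation is described as doing, and the code assumes ASCII text.

```c
#include <stddef.h>
#include <stdint.h>

/* One lane of a greater-than compare: all ones if x > y, else all zeros. */
uint8_t emu_cmpgt_u8(uint8_t x, uint8_t y) {
    return (x > y) ? 0xFF : 0x00;
}

/* One lane of an absolute difference. */
uint8_t emu_absd_u8(uint8_t x, uint8_t y) {
    return (uint8_t)((x > y) ? (x - y) : (y - x));
}

/* Branch-free uppercase conversion of count bytes, in place. */
void emu_convert_to_upper(uint8_t *bytes, size_t count) {
    const uint8_t delta = 'a' - 'A';
    for (size_t i = 0; i < count; i++) {
        uint8_t v = bytes[i];
        /* Step 1: convert every value, whether it needs it or not. */
        uint8_t converted = emu_absd_u8(v, delta);
        /* Step 2: the mask is all ones only when 'a' <= v <= 'z'. */
        uint8_t mask = (uint8_t)(emu_cmpgt_u8(v, 'a' - 1) ^
                                 emu_cmpgt_u8(v, 'z'));
        /* Step 3: select converted bits where mask is 1, original otherwise. */
        bytes[i] = (uint8_t)((v & ~mask) | (converted & mask));
    }
}
```

The XOR in step 2 works because a byte greater than 'z' passes both comparisons and cancels out, leaving ones only for bytes inside the lowercase range.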

In addition, to help better schedule instructions, the assembly language version performed several of these conversions simultaneously. In C/C++, you can call an inline function multiple times, and let the compiler take care of scheduling it appropriately. This doesn't mean that your knowledge of instruction scheduling is useless, but rather because you know how instruction scheduling works, you are able to give the compiler better raw material to work with. If you did not know that instruction scheduling improves your code, and that instruction scheduling can be helped by unrolling your loops, then you would not be able to help the compiler optimize your code.

So here is the C/C++ version of the convert_buffer_to_upper function (enter as convert_buffer_c.c in the same directory as the files from the previous articles -- you will need them to compile the full application):


Listing 2. Uppercase conversion in C/C++
                
#include <spu_intrinsics.h>

unsigned char conversion_value = 'a' - 'A';

inline vec_uchar16 convert_vec_to_upper(vec_uchar16 values) {
	/* Process all characters */
	vec_uchar16 processed_values = spu_absd(values, spu_splats(conversion_value));
	/* Check to see which ones need processing (those between 'a' and 'z')*/
	vec_uchar16 should_be_processed = spu_xor(spu_cmpgt(values, 'a'-1), 
	spu_cmpgt(values, 'z'));
	/* Use should_be_processed to select between the original and processed values */
	return spu_sel(values, processed_values, should_be_processed);
}

void convert_buffer_to_upper(vec_uchar16 *buffer, int buffer_size) {
	/* Find end of buffer (must be casted first because size is bytes) */
	vec_uchar16 *buffer_end = (vec_uchar16 *)((char *)buffer + buffer_size);

	while(__builtin_expect(buffer < buffer_end, 1)) {
		*buffer = convert_vec_to_upper(*buffer);
		buffer++;
		*buffer = convert_vec_to_upper(*buffer);
		buffer++;
		*buffer = convert_vec_to_upper(*buffer);
		buffer++;
		*buffer = convert_vec_to_upper(*buffer);
		buffer++;
	}
}

To compile and run, simply do:

spu-gcc convert_buffer_c.c convert_driver.s dma_utils.s -o spe_convert
embedspu -m64 convert_to_upper_handle spe_convert spe_convert_csf.o
gcc -m64 spe_convert_csf.o ppu_dma_main.c -lspe -o dma_convert
./dma_convert

As you probably noticed, this program uses slightly different notation for vector type names than used previously. The SPU intrinsics documentation (see Resources) defines simplified vector type names starting with vec_. For integer types, the next character is u or s for unsigned or signed types, respectively. After that is the name of the basic type being used (char, int, float, and so on). Finally, at the end is the number of elements of that type which are in the vector. vec_uchar16, for instance, is a 16-element vector of unsigned chars, and vec_float4 is a 4-element vector of floats. This notation greatly simplifies the typing involved.

When computing buffer_end, the program did some casting gymnastics. Because buffer_size is in bytes, I had to convert the pointer to a char * so that when I added the size, it would move by bytes rather than by quadwords. Vector pointers, since the value they point to is 16 bytes long, move forward in increments of 16 bytes, while char pointers move forward in single-byte increments. That is why buffer++ works -- it is incrementing by a single vector length, which is 16 bytes.
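The pointer arithmetic can be checked on any machine with a 16-byte struct standing in for the vector type. This is a minimal sketch; emu_vec_u8 and the two helper functions are hypothetical names used only for this illustration.

```c
#include <stddef.h>

/* Portable stand-in for a 16-byte vector type such as vec_uchar16. */
typedef struct { unsigned char byte[16]; } emu_vec_u8;

/* Computes the end pointer from a size in bytes, using the same cast
   sequence as buffer_end in Listing 2. */
emu_vec_u8 *emu_buffer_end(emu_vec_u8 *buffer, size_t size_in_bytes) {
    return (emu_vec_u8 *)((char *)buffer + size_in_bytes);
}

/* Number of bytes the pointer moves for one increment. */
size_t emu_step_in_bytes(emu_vec_u8 *p) {
    return (size_t)((char *)(p + 1) - (char *)p);
}
```

For a buffer of four such elements, emu_step_in_bytes reports 16 and emu_buffer_end(buffer, 64) equals buffer + 4. Adding 64 directly to the vector pointer without the char * cast would instead skip 64 elements, or 1024 bytes.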

Another interesting feature of the C/C++ version is __builtin_expect which helps the compiler do branch hinting. You cannot do branch hinting directly in C/C++ because you have neither the address of the branch nor the target. Therefore, you instead provide hints to the compiler, which can then generate appropriate branch hints. __builtin_expect(buffer < buffer_end, 1) generates branching code based off of the first argument, buffer < buffer_end, but produces branch hints based off of the second argument, 1. It tells the compiler to generate hints that expect the value of buffer < buffer_end to be 1.
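A common way to use __builtin_expect is to wrap it in a pair of macros. The sketch below is portable C that falls back to the plain condition on compilers without the built-in. The LIKELY and UNLIKELY macro names and the count_nonzero function are hypothetical, chosen only for this example, and whether the hint changes the generated code depends on the compiler and optimization level.

```c
#include <stddef.h>

/* __builtin_expect is available in GCC and Clang; elsewhere the macros
   reduce to the bare condition so the code still compiles. */
#if defined(__GNUC__) || defined(__clang__)
#define LIKELY(cond)   __builtin_expect(!!(cond), 1)
#define UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
#define LIKELY(cond)   (cond)
#define UNLIKELY(cond) (cond)
#endif

/* Counts the nonzero bytes in data. */
size_t count_nonzero(const unsigned char *data, size_t length) {
    size_t count = 0;
    size_t i = 0;
    /* Hint: the loop condition is expected to be true on most evaluations. */
    while (LIKELY(i < length)) {
        /* Hint: zero bytes are assumed to be rare in this made-up workload. */
        if (UNLIKELY(data[i] == 0)) {
            i++;
            continue;
        }
        count++;
        i++;
    }
    return count;
}
```

The hints never change the result of the function. They only tell the compiler which path to favor when it lays out branches.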

Now, there are two compilers currently available for SPU programming, and, as one might expect, they excel in different areas. GCC, for instance, does a fantastic job of interleaving the instructions between invocations of convert_vec_to_upper so that instruction latency is minimized. However, in this particular program, __builtin_expect gives us almost no help at all. The IBM XLC compiler, on the other hand, is the opposite. It does not interleave the instructions between invocations of convert_vec_to_upper at all, but structures the loop so that the branch hint has a maximum effect, and in fact was able to guess the branch hint without it being supplied. Unsurprisingly, neither compiler does nearly as well as the hand-coded assembly language version from the previous article, but for this program XLC outperformed GCC. Code compiled without any optimization flags ran approximately five times slower, so be sure to always compile with -O2 or -O3.


Composite intrinsics and MFC programming

The composite intrinsics are those that compile to multiple instructions. The composite intrinsics encapsulate common usage patterns on the SPE to simplify its programming. The two most important composite intrinsics are spu_mfcdma64 and spu_mfcstat. spu_mfcdma64 is almost exactly like the dma_transfer function I wrote and used in previous articles, except that the high and low parts of the effective address are split between two 32-bit parameters (dma_transfer used one 64-bit parameter for the effective address).

spu_mfcdma64 takes six parameters:

  1. the local store address for the transfer
  2. the high-order 32-bits of the effective address
  3. the low-order 32-bits of the effective address
  4. the size of the transfer
  5. a "tag" to give the transfer
  6. the DMA command to give

Often you will have the effective address as a single 64-bit value. To separate it out into parts, use mfc_ea2h to extract the higher-order bits and mfc_ea2l to extract the lower-order bits. The tag is a number between 0 and 31, chosen by the programmer, that identifies a transfer or a group of transfers for status queries and sequencing operations. The DMA command can take a range of values (see Resources for information on where to find the ones not listed here). DMA transfers are called PUTs if they transfer from the SPU local store to the system memory, and GETs if they go in the other direction. These DMA command names are prefixed with either MFC_PUT or MFC_GET, respectively. Then, MFC commands either operate individually or on a list. If the DMA command is a list command, the DMA command name has an L appended to it (see Resources for more information on DMA list commands). The DMA command can also have certain levels of synchronization applied to it. For barrier synchronization add a B, for fence synchronization add an F, and for no synchronization you do not need to add anything. Finally, all DMA command names have a _CMD suffix. So, the command name for a single transfer from the local store to system memory using fence synchronization would be MFC_PUTF_CMD.
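Splitting a 64-bit effective address into two 32-bit halves is plain shifting and masking. Here is a minimal sketch in portable C. The emu_ea2h, emu_ea2l, and emu_hl2ea functions are hypothetical stand-ins that show the arithmetic involved; they are not the definitions from spu_mfcio.h.

```c
#include <stdint.h>

/* High-order 32 bits of a 64-bit effective address. */
uint32_t emu_ea2h(uint64_t ea) {
    return (uint32_t)(ea >> 32);
}

/* Low-order 32 bits of a 64-bit effective address. */
uint32_t emu_ea2l(uint64_t ea) {
    return (uint32_t)(ea & 0xFFFFFFFFu);
}

/* Rebuilds the 64-bit address from its two halves. */
uint64_t emu_hl2ea(uint32_t high, uint32_t low) {
    return ((uint64_t)high << 32) | low;
}
```

For an address of 0x0000000123456780, the high half is 0x1 and the low half is 0x23456780, and recombining the two gives back the original value.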

By default DMA commands on the SPE's MFC are totally unordered -- the MFC may process them in any order that it wishes. However, tags, fences, and barriers can be used to force ordering constraints on MFC DMA transfers. A fence establishes the constraint that a given DMA transfer only execute after all previous commands using the same tag have completed. A barrier establishes the constraint that a given DMA transfer only execute after all previous commands using the same tag have completed (like a fence), but also that it must execute before all subsequent commands using the same tag.

Here are some examples of spu_mfcdma64:


Listing 3. Using spu_mfcdma64
                
typedef unsigned long long uint64;
typedef unsigned int uint32;
uint64 ea1, ea2, ea3, ea4, ea5; /* assume each of these have sensible values */
void *ls1, *ls2, *ls3, *ls4; /* assume each of these have sensible values */
uint32 sz1, sz2, sz3, sz4; /* assume each of these have sensible values */
int tag = 3; /* Arbitrary value, but needs to be the same for all 
synchronized transfers */

/* Transfer 1: System Storage -> Local Store, no ordering specified */
spu_mfcdma64(ls1, mfc_ea2h(ea1), mfc_ea2l(ea1), sz1, tag, MFC_GET_CMD);

/* Transfer 2: Local Storage -> System Storage, must perform after previous transfers */
spu_mfcdma64(ls2, mfc_ea2h(ea2), mfc_ea2l(ea2), sz2, tag, MFC_PUTF_CMD);

/* Transfer 3: Local Storage -> System Storage, no ordering specified */
spu_mfcdma64(ls3, mfc_ea2h(ea3), mfc_ea2l(ea3), sz3, tag, MFC_PUT_CMD);

/* Transfer 4: Local Storage -> System Storage, must be synchronized */
spu_mfcdma64(ls4, mfc_ea2h(ea4), mfc_ea2l(ea4), sz4, tag, MFC_PUTB_CMD);

/* Transfer 5: System Storage -> Local Storage, no ordering specified */
spu_mfcdma64(ls4, mfc_ea2h(ea5), mfc_ea2l(ea5), sz4, tag, MFC_GET_CMD);

The above example has several possible orderings. All of the following are possibilities:

  • 1, 2, 3, 4, 5
  • 3, 1, 2, 4, 5
  • 1, 3, 2, 4, 5

Because transfer 2 only uses a fence and transfer 3 doesn't specify any ordering at all, transfer 3 is free to float anywhere before the barrier (transfer 4). The only requirement for the first three transfers is that transfer 2 must be performed after transfer 1. Transfer 4, however, requires full synchronization of transfers before and after it.
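The fence and barrier rules can be turned into a small checker that decides whether a proposed execution order is allowed. The sketch below is portable C and deliberately simplified: it assumes all commands share one tag and treats execution as a simple sequence, which is a modeling assumption and not a description of how the MFC works internally. All names (emu_sync, emu_order_is_legal, EMU_MAX_COMMANDS) are hypothetical.

```c
#include <stddef.h>

#define EMU_MAX_COMMANDS 32

typedef enum { SYNC_NONE, SYNC_FENCE, SYNC_BARRIER } emu_sync;

/* sync[i] is the synchronization of the i-th issued command (0-based).
   order[p] is the index of the command that executes at position p.
   Returns 1 if the order respects every fence and barrier, else 0. */
int emu_order_is_legal(const emu_sync *sync, const size_t *order, size_t count) {
    size_t position[EMU_MAX_COMMANDS];
    int seen[EMU_MAX_COMMANDS] = {0};

    if (count > EMU_MAX_COMMANDS)
        return 0;

    /* Record when each command executes; reject anything that is not
       a permutation of 0..count-1. */
    for (size_t p = 0; p < count; p++) {
        if (order[p] >= count || seen[order[p]])
            return 0;
        seen[order[p]] = 1;
        position[order[p]] = p;
    }

    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < count; j++) {
            /* Fence or barrier: every earlier-issued command runs first. */
            if (j < i && sync[i] != SYNC_NONE && position[j] > position[i])
                return 0;
            /* Barrier only: every later-issued command runs afterward. */
            if (j > i && sync[i] == SYNC_BARRIER && position[j] < position[i])
                return 0;
        }
    }
    return 1;
}
```

Describing Listing 3 as {SYNC_NONE, SYNC_FENCE, SYNC_NONE, SYNC_BARRIER, SYNC_NONE}, the checker accepts the three orderings listed above (written 0-based) and rejects an order such as 2, 1, 3, 4, 5, where the fenced transfer runs before transfer 1.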

Take a closer look at transfers 4 and 5. This is a useful idiom to take note of -- save and reload. If you are processing system memory data a piece at a time into local store and storing it back into system memory, you can queue up a save and a load at the same time, using a fence or barrier to order them. This puts all of the transferring logic into the MFC, and leaves your program free to do other computational tasks while the buffer waits for new data. We will make use of this in the next article when we talk about double buffering.

spu_mfcdma64 is quite a handy tool, but it is a little tedious, especially when you have to keep on using mfc_ea2h and mfc_ea2l to convert your addresses. Therefore, the specification also provides a number of utility functions to lessen the amount of redundant typing necessary. The mfc_ class of functions all take the same parameters as the spu_mfcdma64 function, except that the effective address is a single 64-bit parameter, and the DMA command is encoded into the function name. They also take two extra parameters, the transfer class identifier and the replacement class identifier. Both of these can be safely set to zero in non-realtime applications (see Resources for references to further information on these two fields). Therefore, transfer 2 above can be rewritten as:

mfc_putf(ls2, ea2, sz2, tag, 0, 0);

Tags are useful not just for synchronizing data transfers, but also for checking on the status of transfers. On the SPE, there is a tag mask channel which is used to specify which tags are currently used for status checks, a channel which is used to issue status requests, and another channel to read the tag status back. Although these are pretty simple operations anyway, the specification gives special methods for performing these operations as well. mfc_write_tag_mask takes a 32-bit integer and uses it as a tag mask for future status updates. In the mask, set the bit position of each tag that you want to check the status of to 1. So, to check the status of tags 2 and 4, you would use mfc_write_tag_mask(20), or, to make it more readable, you can do mfc_write_tag_mask(1<<2 | 1<<4);. To actually perform the status update, you have to pick a status command, and send it using spu_mfcstat(unsigned int command). The commands are:

  • MFC_TAG_UPDATE_IMMEDIATE
    This command causes the SPE to immediately return with the status of the DMA tags. The bit for each tag that was specified in the tag mask will be set to 1 if there are no remaining commands in the queue with that tag (in other words, all operations that may have been previously active are completed), and set to 0 if there are commands remaining in the queue.
  • MFC_TAG_UPDATE_ANY
    This command causes the SPE to wait until at least one tag specified in the tag mask has no remaining commands before returning, then returns the status of the tags that were specified in the tag mask.
  • MFC_TAG_UPDATE_ALL
    This command causes the SPE to wait until all tags specified in the tag mask have no remaining commands before returning. The returned status has the bit set for every tag specified in the tag mask.

To use these constants, you need to include spu_mfcio.h.

Using spu_mfcstat allows you to both check on the status of DMA requests and wait for them. Using MFC_TAG_UPDATE_ANY allows you to issue multiple DMA requests, let the MFC process them in whatever order it thinks is best, and then your code can respond based on the order that the MFC processes them.
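The tag mask and status words are ordinary bit sets, so their arithmetic can be tried out in portable C. The sketch below models the status word as "masked tags that have no outstanding commands." The emu_ function names and the busy_tags parameter are hypothetical and exist only for this illustration; on real hardware the MFC tracks outstanding commands itself.

```c
#include <stdint.h>

/* Bit for one tag (0..31) in a tag mask or status word. */
uint32_t emu_tag_bit(unsigned tag) {
    return (uint32_t)1 << (tag & 31);
}

/* Models an immediate status read: a masked tag reports 1 when it has
   no outstanding commands. busy_tags has a 1 bit for every tag that
   still has commands queued. */
uint32_t emu_status_immediate(uint32_t tag_mask, uint32_t busy_tags) {
    return tag_mask & ~busy_tags;
}

/* The condition an "any" wait is waiting for. */
int emu_any_done(uint32_t tag_mask, uint32_t busy_tags) {
    return emu_status_immediate(tag_mask, busy_tags) != 0;
}

/* The condition an "all" wait is waiting for. */
int emu_all_done(uint32_t tag_mask, uint32_t busy_tags) {
    return emu_status_immediate(tag_mask, busy_tags) == tag_mask;
}
```

With a mask of emu_tag_bit(2) | emu_tag_bit(4), which is 20, and only tag 4 still busy, the immediate status is 4: the "any" condition is met but the "all" condition is not.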


Example MFC program

Now I'll apply this knowledge of the MFC composite intrinsics to the uppercase conversion program. Earlier in the article I rewrote the main conversion function in C, and now I am going to rewrite the main loop in C. The new code is fairly straightforward (enter as convert_driver_c.c):


Listing 4. Uppercase conversion MFC transfer code
                
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
typedef unsigned long long uint64;

#define CONVERSION_BUFFER_SIZE 16384
#define DMA_TAG 0

void convert_buffer_to_upper(char *conversion_buffer, int current_transfer_size);

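/* DMA targets must be at least 16-byte aligned; 128-byte alignment is optimal */
char conversion_buffer[CONVERSION_BUFFER_SIZE] __attribute__((aligned(128)));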

typedef struct {
	int length __attribute__((aligned(16)));
	uint64 data __attribute__((aligned(16)));
} conversion_structure;

int main(uint64 spe_id, uint64 conversion_info_ea) {
	conversion_structure conversion_info; /* Information about the data from the PPE */

	/* We are only using one tag in this program */
	mfc_write_tag_mask(1<<DMA_TAG);

	/* Grab the conversion information */
	mfc_get(&conversion_info, conversion_info_ea, sizeof(conversion_info), DMA_TAG, 0, 0);
	spu_mfcstat(MFC_TAG_UPDATE_ALL); /* Wait for Completion */

	/* Get the actual data */
	mfc_get(conversion_buffer, conversion_info.data, conversion_info.length, DMA_TAG, 0, 0);
	spu_mfcstat(MFC_TAG_UPDATE_ALL);

	/* Perform the conversion */
	convert_buffer_to_upper(conversion_buffer, conversion_info.length);

	/* Put the data back into system storage */
	mfc_put(conversion_buffer, conversion_info.data, conversion_info.length, DMA_TAG, 0, 0);
	spu_mfcstat(MFC_TAG_UPDATE_ALL); /* Wait for Completion */

	return 0;
}

To compile and run, simply do:

spu-gcc convert_buffer_c.c convert_driver_c.c -o spe_convert
embedspu -m64 convert_to_upper_handle spe_convert spe_convert_csf.o
gcc -m64 spe_convert_csf.o ppu_dma_main.c -lspe -o dma_convert
./dma_convert

This implementation in C follows the same basic structure as the original code, except that it's more readable to human beings, which, incidentally, makes it easier to revise and expand. For instance, one of the problems with the original code is that it is limited to the size of a DMA transfer. What if you wanted to remove that limitation? You could simply wrap the whole thing in a loop, and keep moving data a piece at a time until the whole string has been processed. Here's the revised code to do this:


Listing 5. Looping in the MFC transfer code
                
#include <spu_intrinsics.h>
#include <spu_mfcio.h> /* constant declarations for the MFC */
typedef unsigned long long uint64;
typedef unsigned int uint32;

/* Renamed CONVERSION_BUFFER_SIZE to MAX_TRANSFER_SIZE because it is now 
primarily used to limit the size of DMA transfers */
#define MAX_TRANSFER_SIZE 16384

void convert_buffer_to_upper(char *conversion_buffer, int current_transfer_size);

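/* DMA targets must be at least 16-byte aligned; 128-byte alignment is optimal */
char conversion_buffer[MAX_TRANSFER_SIZE] __attribute__((aligned(128)));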

typedef struct {
	uint32 length __attribute__((aligned(16)));
	uint64 data __attribute__((aligned(16)));
} conversion_structure;

int main(uint64 spe_id, uint64 conversion_info_ea) {
	conversion_structure conversion_info; /* Information about the data from the PPE */

	/* New variables to keep track of where we are in the data */
	uint32 remaining_data; /* How much data is left in the whole string */
	uint64 current_ea_pointer; /* Where we are in system memory */
	uint32 current_transfer_size; /* How big the current transfer is (may be smaller 
	than MAX_TRANSFER_SIZE) */

	/* We are only using one tag in this program */
	mfc_write_tag_mask(1<<0);

	/* Grab the conversion information */
	mfc_get(&conversion_info, conversion_info_ea, sizeof(conversion_info), 0, 0, 0);
	spu_mfcstat(MFC_TAG_UPDATE_ALL); /* Wait for Completion */

	/* Setup the loop */
	remaining_data = conversion_info.length;
	current_ea_pointer = conversion_info.data;

	while(remaining_data > 0) {
		/* Determine how much data is left to transfer */
		if(remaining_data < MAX_TRANSFER_SIZE)
			current_transfer_size = remaining_data;
		else
			current_transfer_size = MAX_TRANSFER_SIZE;

		/* Get the actual data */
		mfc_getb(conversion_buffer, current_ea_pointer, current_transfer_size, 0, 0, 0);
		spu_mfcstat(MFC_TAG_UPDATE_ALL);

		/* Perform the conversion */
		convert_buffer_to_upper(conversion_buffer, current_transfer_size);

		/* Put the data back into system storage */
		mfc_putb(conversion_buffer, current_ea_pointer, current_transfer_size, 0, 0, 0);

		/* Advance to the next segment of data */
		remaining_data -= current_transfer_size;
		current_ea_pointer += current_transfer_size;
	}
	spu_mfcstat(MFC_TAG_UPDATE_ALL); /* Wait for Completion */

	return 0;
}

Compile and run using the same commands as you used in the previous example:

spu-gcc convert_buffer_c.c convert_driver_c.c -o spe_convert
embedspu -m64 convert_to_upper_handle spe_convert spe_convert_csf.o
gcc -m64 spe_convert_csf.o ppu_dma_main.c -lspe -o dma_convert
./dma_convert

So now you have just expanded the size of the data you can process to 4 gigabytes, though you could easily go beyond that by making the data size variables 64-bit instead of 32-bit. Notice that you don't explicitly code to ask the MFC to wait for your PUT to complete before you re-issue the GET. This is because you are using barriers with your transfers, and you are using the same DMA tag for them. This forces the transfers to be serialized by the MFC itself, so it will always wait until the current conversion is finished being PUT into system storage before GETting more data into the buffer. Just remember to wait for the completion at the end (notice the spu_mfcstat outside the loop), or else your last bit of data may not finish transferring before it is used in the program!
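The chunking logic of Listing 5 can be exercised on any machine by letting memcpy stand in for the GET and PUT transfers. This is a minimal simulation in portable C, not SPE code: the emu_ names are hypothetical, the "system memory" is just an ordinary array, and the alignment and size rules of real DMA transfers are ignored here.

```c
#include <stddef.h>
#include <string.h>

#define EMU_MAX_TRANSFER_SIZE 16384

/* Plays the role of the local store buffer. */
static unsigned char emu_local_buffer[EMU_MAX_TRANSFER_SIZE];

/* Uppercases ASCII bytes in place (stand-in for convert_buffer_to_upper). */
static void emu_to_upper(unsigned char *bytes, size_t count) {
    for (size_t i = 0; i < count; i++)
        if (bytes[i] >= 'a' && bytes[i] <= 'z')
            bytes[i] = (unsigned char)(bytes[i] - ('a' - 'A'));
}

/* Processes length bytes of system_memory one chunk at a time and
   returns the number of chunks that were needed. */
size_t emu_convert_in_chunks(unsigned char *system_memory, size_t length) {
    size_t remaining = length;
    size_t offset = 0;
    size_t chunks = 0;

    while (remaining > 0) {
        /* The last chunk may be smaller than the maximum transfer size. */
        size_t chunk = (remaining < EMU_MAX_TRANSFER_SIZE)
                     ? remaining : EMU_MAX_TRANSFER_SIZE;

        memcpy(emu_local_buffer, system_memory + offset, chunk); /* "GET" */
        emu_to_upper(emu_local_buffer, chunk);
        memcpy(system_memory + offset, emu_local_buffer, chunk); /* "PUT" */

        remaining -= chunk;
        offset += chunk;
        chunks++;
    }
    return chunks;
}
```

A 40000-byte input takes three passes through the loop (16384 + 16384 + 7232 bytes). On the SPE, that final 7232-byte chunk also has to satisfy the DMA size rules, which it does because 7232 is a multiple of 16.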

Another thing to be careful of when programming in C is to always make sure that you give function prototypes. It is very easy to accidentally mix up 32-bit and 64-bit values. On the PPE that isn't so bad, as the value is merely truncated or expanded. But in the SPE, if the prototype is wrong, the preferred slot for 32-bit and 64-bit values is offset in such a way that conversion between the two must be handled explicitly.


Helpful tips for C language SPE programming

Here are some tips to keep in mind when building SPE applications in C:

  • Vectors can be cast between vectors of other types, and back-and-forth between the vector types and the special qword type, but none of these casts perform any data conversion. If you need to convert between types, use an appropriate SPU intrinsic.
  • Vector and non-vector pointers can be cast between each other, but when converting from a scalar pointer to a vector pointer it is the programmer's responsibility to be sure that the pointer is quadword-aligned.
  • Declared vectors are always quadword-aligned when allocated.
  • Remember that DMA transfers of 16 bytes or more must be in 16-byte multiples and aligned to 16-byte boundaries on both the SPE and the PPE. Smaller transfers must be a power of two in size (1, 2, 4, or 8 bytes) and be naturally aligned. Optimal transfers are multiples of 128 bytes that are on 128-byte boundaries.
  • If you are not sure about the alignment of data on the PPE, use memalign or posix_memalign to allocate an aligned pointer from the heap, and use memcpy or an equivalent to move the data to the aligned area.
  • Always compile with -Wall and especially pay attention to missing prototype messages. Incorrectly implied prototypes (especially between 32- and 64-bit types) can lead to bizarre error conditions.
  • Always store effective addresses as unsigned long longs, on both the PPE and the SPE. This way they can be treated in a unified fashion on the SPE and on the PPE, whether the PPE code is compiled for 32-bit or 64-bit execution.
  • Avoid integer multiplies (especially 32-bit multiplies) on the SPE. A 32-bit multiply takes five instructions. If you must multiply and the operands fit in 16 bits, cast them to unsigned short before multiplying.
  • In scalar code on the SPE, declaring scalar values as vectors and vector pointers (even if you aren't using them as vectors) can speed up code because it doesn't have to do unaligned loads and stores.
  • Be aware that on the SPE, floats and doubles are implemented differently, and round differently as well. floats in particular deviate from the C99 standard. The next article will cover these further.
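The alignment tips above can be sketched as a few small helpers. This is a minimal sketch in portable C that runs on the PPE or any POSIX host; the names is_quadword_aligned, is_valid_dma, and copy_to_aligned are hypothetical, and the helpers check only the rules stated in the tips (they do not check the maximum transfer size). This sketch also treats an empty transfer as invalid, which is a choice made here for simplicity.

```c
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign under strict -std modes */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Nonzero if ptr sits on a 16-byte (quadword) boundary. */
int is_quadword_aligned(const void *ptr)
{
    return ((uintptr_t)ptr & 15) == 0;
}

/* Nonzero if one end of a transfer (address plus size) follows the DMA
 * rules from the tips. Call it once for the local store address and
 * once for the effective address. */
int is_valid_dma(unsigned long long address, unsigned int size)
{
    if (size == 0)
        return 0;  /* this sketch rejects empty transfers */
    if (size >= 16)
        return (size % 16 == 0) && (address % 16 == 0);
    /* Below 16 bytes: 1, 2, 4, or 8 bytes, naturally aligned. */
    return ((size & (size - 1)) == 0) && (address % size == 0);
}

/* Copy len bytes into a fresh block that starts on a 128-byte boundary
 * and is padded with zeros to a multiple of 128 bytes.
 * Returns NULL on failure; the caller releases the block with free(). */
void *copy_to_aligned(const void *src, size_t len)
{
    void *dst = NULL;
    size_t padded = (len + 127) & ~(size_t)127;

    assert(src != NULL);
    if (padded == 0)
        padded = 128;
    if (posix_memalign(&dst, 128, padded) != 0)
        return NULL;
    memset(dst, 0, padded);
    memcpy(dst, src, len);
    return dst;
}
```

Padding the copy to a 128-byte multiple means the whole block can be moved with transfers of the optimal size, at the cost of a few unused bytes at the end.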

Conclusion

The intrinsics available for C allow programmers to make the best mix of C and assembly language knowledge. The SPU intrinsics allow programs to freely switch among high- and low-level code, but all within the semantic framework of the C language.

The next article applies this knowledge to a real-world numerical application.


