Instead of focusing on the SPU's assembly language to help you get to know the Cell Broadband Engine (Cell/B.E.) processor intimately, this tip, excerpted from the developerWorks article "Programming high-performance applications on the Cell BE processor, Part 5," provides a quick look at C/C++ so you can let the compiler do a large amount of the work for you. To use the SPU C/C++ language extensions, the header file spu_intrinsics.h must be included at the beginning of your code.
The C/C++ language extensions include data types and intrinsics that give the programmer nearly full access to the SPU's assembly language instructions. However, many intrinsics are provided which greatly simplify the SPU's assembly language by coalescing many similar instructions into one intrinsic.
Instructions that differ only in the type of their operands (such as a, ai, ah, ahi, fa, and dfa for addition) are represented by a single C/C++ intrinsic that selects the proper instruction based on the operand types. For addition, spu_add, when given two vector unsigned ints as parameters, will generate the a (32-bit add) instruction. However, if given two vector floats as parameters, it will generate the fa (float add) instruction.
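The type-based selection described above is loosely analogous to C11's _Generic. The sketch below is a plain-C model that runs on any host, not SPU code. The struct types model_u32x4 and model_f32x4, the helpers model_a and model_fa, and the model_add macro are hypothetical names invented for illustration, and this is not a claim about how spu_intrinsics.h implements spu_add.

```c
#include <stdint.h>

/* Hypothetical stand-ins for "vector unsigned int" and "vector float". */
typedef struct { uint32_t e[4]; } model_u32x4;
typedef struct { float    e[4]; } model_f32x4;

/* Models the a (32-bit add) instruction: element-wise wrapping add. */
static model_u32x4 model_a(model_u32x4 x, model_u32x4 y)
{
    model_u32x4 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = x.e[i] + y.e[i];
    return r;
}

/* Models the fa (float add) instruction: element-wise float add. */
static model_f32x4 model_fa(model_f32x4 x, model_f32x4 y)
{
    model_f32x4 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = x.e[i] + y.e[i];
    return r;
}

/* One name, two "instructions": the type of the first operand picks the helper. */
#define model_add(x, y) \
    _Generic((x), model_u32x4: model_a, model_f32x4: model_fa)((x), (y))
```

Calling model_add with two model_u32x4 values reaches model_a, while two model_f32x4 values reach model_fa, mirroring how one spu_add call site maps to different instructions.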
Note that the intrinsics generally have the same limitations as their corresponding assembly language instructions. However, in cases where an immediate value is too large for the appropriate immediate-mode instruction, the compiler will promote the immediate value to a vector and do the corresponding vector/vector operation. For instance, spu_add(myvec, 2) generates an ai (add immediate) instruction while spu_add(myvec, 2000) first loads the 2000 into its own vector using il and then performs the a (add) instruction.
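Why 2 fits but 2000 does not can be sketched as a range check. This is a host-side model, not compiler code: it assumes the 32-bit integer case, where ai carries a 10-bit signed immediate and il a 16-bit signed immediate, as in the SPU instruction set. The enum and function names are hypothetical, and what the compiler emits for constants beyond the il range is not covered by this tip.

```c
#include <stdbool.h>

/* Hypothetical labels for the instruction sequences the text describes. */
typedef enum { PLAN_AI, PLAN_IL_THEN_A, PLAN_OTHER } model_add_plan_t;

/* ai encodes its immediate in a 10-bit signed field: -512..511. */
static bool fits_ai_immediate(long value)
{
    return value >= -512 && value <= 511;
}

/* il encodes a 16-bit signed immediate: -32768..32767. */
static bool fits_il_immediate(long value)
{
    return value >= -32768 && value <= 32767;
}

/* Picks the sequence for spu_add(myvec, value) on 32-bit integer vectors. */
static model_add_plan_t model_add_plan(long value)
{
    if (fits_ai_immediate(value))
        return PLAN_AI;            /* single add-immediate */
    if (fits_il_immediate(value))
        return PLAN_IL_THEN_A;     /* load constant with il, then a */
    return PLAN_OTHER;             /* wider constant load; outside the text */
}
```

Under these assumptions model_add_plan(2) yields PLAN_AI and model_add_plan(2000) yields PLAN_IL_THEN_A, matching the two cases in the paragraph above.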
The order of operands in the intrinsics is essentially the same as that of the assembly language instruction, except that the first operand (which holds the destination register in assembly language) is not specified; instead, it becomes the return value of the function. The compiler supplies the actual destination register in the code it generates.
For more on vector intrinsics, see "Programming on the Cell/B.E. processor, Part 5," the article from which this tip was taken.
This list will supply some of the more common SPU intrinsics; types are not given as most of them are polymorphic.
-
spu_add(val1, val2)
Adds each element of val1 to the corresponding element of val2. If val2 is a non-vector value, it adds the value to each element of val1.
-
spu_sub(val1, val2)
Subtracts each element of val2 from the corresponding element of val1. If val1 is a non-vector value, then val1 is replicated across a vector, and then val2 is subtracted from it.
-
spu_mul(val1, val2)
Because the multiplication instructions operate so differently, the SPU intrinsics do not coalesce them as much as they do for other operations. spu_mul handles floating point multiplication (single and double precision). The result is a vector where each element is the result of multiplying the corresponding elements of val1 and val2 together.
-
spu_and(val1, val2), spu_or(val1, val2), spu_not(val), spu_xor(val1, val2), spu_nor(val1, val2), spu_nand(val1, val2), spu_eqv(val1, val2)
Boolean operations operate bit-by-bit, so the type of operands the boolean operations receive is not relevant except for determining the type of value they will return. spu_eqv is a bitwise equivalency operation, not a per-element equivalency operation.
-
spu_rl(val, count), spu_sl(val, count)
spu_rl rotates each element of val left by the number of bits specified in the corresponding element of count. Bits rotated off the end are rotated back in on the right. If count is a scalar value, then it is used as the count for all elements of val. spu_sl operates the same way, but performs a shift instead of a rotate.
-
spu_rlmask(val, count), spu_rlmaska(val, count), spu_rlmaskqw(val, count), spu_rlmaskqwbyte(val, count)
These are very confusingly named operations. They are named "rotate left and mask," but they actually perform right shifts (they are implemented by a combination of left shifts and masks, but the programming interface is for right shifts). spu_rlmask and spu_rlmaska shift each element of val to the right by the number of bits in the corresponding element of count (or the value of count if count is a scalar). spu_rlmaska replicates the sign bit as bits are shifted in. spu_rlmaskqw operates on the whole quadword at a time, but only up to 7 bits (it performs a modulus on count to put it in the proper range). spu_rlmaskqwbyte works similarly, except that count is the number of bytes instead of bits, and count is taken modulo 16 instead of 8.
-
spu_cmpgt(val1, val2), spu_cmpeq(val1, val2)
These instructions perform element-by-element comparisons of their two operands. The results are stored as all ones (for true) and all zeros (for false) in the corresponding element of the resulting vector. spu_cmpgt performs a greater-than comparison while spu_cmpeq performs an equality comparison.
-
spu_sel(val1, val2, conditional)
This corresponds to the selb assembly language instruction. The instruction itself is bit-based, so all types use the same underlying instruction. However, the intrinsic operation returns a value of the same type as the operands. As in assembly language, spu_sel looks at each bit in conditional. If the bit is zero, the corresponding bit in the result is selected from the corresponding bit in val1; otherwise it is selected from the corresponding bit in val2.
-
spu_shuffle(val1, val2, pattern)
This is an interesting instruction which allows you to rearrange the bytes in val1 and val2 according to a pattern, specified in pattern. The instruction goes through each byte in pattern, and if the byte starts with the bits 0b10, the corresponding byte in the result is set to 0x00; if the byte starts with the bits 0b110, the corresponding byte in the result is set to 0xff; if the byte starts with the bits 0b111, the corresponding byte in the result is set to 0x80; finally (and most importantly), if none of the previous are true, the last five bits of the pattern byte are used to choose which byte from val1 or val2 should be taken as the value for the current byte. The two values are concatenated, and the five-bit value is used as the byte index of the concatenated value. This is used for inserting elements into vectors as well as performing fast table lookups.
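The compare, select, and shuffle rules in the list above can be modeled in plain C to make the per-element and per-byte behavior concrete. This is a host-side sketch that follows the descriptions in this tip; it is not SPU code. The types model_i32x4 and model_bytes and the model_ functions are hypothetical names, and only the signed 32-bit case of the comparison is modeled.

```c
#include <stdint.h>

/* Hypothetical 16-byte register models; these are not SPU types. */
typedef struct { int32_t e[4]; } model_i32x4;
typedef struct { uint8_t b[16]; } model_bytes;

/* spu_cmpgt model: all ones where x > y, all zeros elsewhere. */
static model_i32x4 model_cmpgt(model_i32x4 x, model_i32x4 y)
{
    model_i32x4 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = (x.e[i] > y.e[i]) ? -1 : 0;
    return r;
}

/* spu_sel model: a 0 mask bit takes the bit from x, a 1 bit takes it from y. */
static model_i32x4 model_sel(model_i32x4 x, model_i32x4 y, model_i32x4 mask)
{
    model_i32x4 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = (x.e[i] & ~mask.e[i]) | (y.e[i] & mask.e[i]);
    return r;
}

/* spu_shuffle model: applies the pattern-byte rules from the text. */
static model_bytes model_shuffle(model_bytes x, model_bytes y, model_bytes pattern)
{
    model_bytes r;
    for (int i = 0; i < 16; i++) {
        uint8_t p = pattern.b[i];
        if ((p & 0xC0) == 0x80) {          /* starts with 0b10  */
            r.b[i] = 0x00;
        } else if ((p & 0xE0) == 0xC0) {   /* starts with 0b110 */
            r.b[i] = 0xFF;
        } else if ((p & 0xE0) == 0xE0) {   /* starts with 0b111 */
            r.b[i] = 0x80;
        } else {
            uint8_t idx = p & 0x1F;        /* index into x followed by y */
            r.b[i] = (idx < 16) ? x.b[idx] : y.b[idx - 16];
        }
    }
    return r;
}
```

Combining the first two gives a branch-free element-wise maximum: model_sel(y, x, model_cmpgt(x, y)) takes the element from x wherever the mask is all ones and from y elsewhere.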
All of the intrinsics that are prefixed with spu_ will try to find the best instruction match based on the types of their operands. However, not all vector types are supported by all intrinsics; support depends on whether an assembly language instruction is available to handle that type.
In addition, if you want a specific instruction rather than having the compiler choose one, you can perform almost any non-branching instruction with the specific intrinsics. All specific intrinsics take the form si_assemblyinstructionname, where assemblyinstructionname is the name of the assembly language instruction as defined in the SPU Assembly Language Specification. So, si_a(a, b) forces the instruction a to be used for addition.
All operands to specific intrinsics are cast to a special type called qword, which is essentially an opaque register value type. The return values from specific intrinsics are also qwords, which can then be cast into whatever vector type you wish.
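The cast-in, call, cast-out pattern for specific intrinsics can be sketched with a host-side model. This is not SPU code: model_qword, model_words, to_qword, from_qword, and model_si_a are hypothetical names, and the model only illustrates that the casts reinterpret the same 16 bytes without converting values.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for qword: 16 opaque bytes with no element type. */
typedef struct { unsigned char raw[16]; } model_qword;

/* Hypothetical stand-in for a vector of four 32-bit unsigned ints. */
typedef struct { uint32_t e[4]; } model_words;

/* "Cast" a typed vector to the opaque type: bits are copied, not converted. */
static model_qword to_qword(model_words v)
{
    model_qword q;
    memcpy(q.raw, v.e, sizeof q.raw);
    return q;
}

/* "Cast" the opaque value back to a typed vector. */
static model_words from_qword(model_qword q)
{
    model_words v;
    memcpy(v.e, q.raw, sizeof q.raw);
    return v;
}

/* Models si_a: always a 32-bit word add, whatever the caller's types were. */
static model_qword model_si_a(model_qword a, model_qword b)
{
    model_words x = from_qword(a), y = from_qword(b), r;
    for (int i = 0; i < 4; i++)
        r.e[i] = x.e[i] + y.e[i];
    return to_qword(r);
}
```

A caller would write from_qword(model_si_a(to_qword(a), to_qword(b))), which mirrors the shape of casting operands to qword, calling si_a, and casting the result back to the vector type it needs.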
In "Programming on the Cell/B.E. processor, Part 5," the article from which this tip was taken, you can learn more about how to use the vector extensions and discover how to direct the compiler to do branch prediction and to perform DMA transfers in C/C++.
Learn
-
The full set of intrinsics is documented in the PPU & SPU C/C++ Language Extension Specification.
-
Another (more extended) tutorial resource for Cell/B.E. programming on both the SPE and the PPE is the official Cell/B.E. Programming Tutorial.
-
For a complete list of available DMA commands on the MFC, see chapter 7 of the Cell/B.E. Architecture Specification (1.01) and pages 508-510 of the Cell/B.E. Programming Handbook (1.0).
-
For more information on DMA list commands, see pages 51-62, 124-125, 129-130, and 157-158 of the Cell/B.E. Architecture Specification (1.01) and pages 73, 459-460, 509-510, and 527-530 of the Cell/B.E. Programming Handbook (1.0).
-
The transfer class ID and replacement class ID fields for MFC operations are described on pages 78 and 114 of the Cell/B.E. Architecture Specification (1.01) and pages 155-158, 455-456, and 513-515 of the Cell/B.E. Programming Handbook (1.0).
-
The IBM Semiconductor Solutions Technical Library Cell/B.E. documentation section contains a wealth of downloadable manuals, specifications, and much more.
-
Find all Cell/B.E.-related articles, discussion forums, downloads, and more at the IBM developerWorks Cell Broadband Engine resource center: your definitive resource for all things Cell/B.E.
-
Keep abreast of all the latest in Cell/B.E. news and information: subscribe to the IBM microNews newsletter.
Get products and technologies
-
Get Cell/B.E. solutions: Contact IBM about custom Cell/B.E.- or custom-processor-based solutions.
-
Get the alphaWorks Cell/B.E. downloads, including the IBM Full System Simulator, support libraries, toolchains, and source code for libraries and samples.
Jonathan Bartlett is the author of the book Programming from the Ground Up, an introduction to programming using Linux assembly language. He is the lead developer at New Media Worx, responsible for developing Web, video, kiosk, and desktop applications for clients.




