Tech tips: SPU vector intrinsics at your fingertips

Here's a quick list to keep you on the right side of common Cell/B.E. SPU vector intrinsics

Know these common C/C++ language extension intrinsics and you can greatly simplify the arduous task of programming in the SPU's assembly language.


Jonathan Bartlett, Director of Technology, New Medio

Jonathan Bartlett is the author of the book Programming from the Ground Up, an introduction to programming using Linux assembly language. He is the lead developer at New Media Worx, responsible for developing Web, video, kiosk, and desktop applications for clients.

01 May 2007

Instead of focusing on the SPU's assembly language to help you get to know the Cell Broadband Engine (Cell/B.E.) processor intimately, this tip, excerpted from the developerWorks article "Programming high-performance applications on the Cell BE processor, Part 5," provides a quick look at C/C++ so you can let the compiler do a large amount of the work for you. To use the SPU C/C++ language extensions, the header file spu_intrinsics.h must be included at the beginning of your code.

Vector intrinsics basics

The C/C++ language extensions include data types and intrinsics that give the programmer nearly full access to the SPU's assembly language instructions. In addition, many of the intrinsics greatly simplify working with the SPU's instruction set by coalescing many similar instructions into one intrinsic.

Instructions that differ only in the type of their operands (such as a, ai, ah, ahi, fa, and dfa for addition) are represented by a single C/C++ intrinsic, which selects the proper instruction based on the operand types. For addition, spu_add, when given two vector unsigned ints as parameters, generates the a (32-bit add) instruction. If it is given two vector floats instead, it generates the fa (float add) instruction.
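The type-based selection described above can be imitated on an ordinary host compiler. The following is a minimal sketch, not SPU code: the struct types, the add_u32x4 and add_f32x4 helpers, and the model_spu_add macro are hypothetical stand-ins, and C11 _Generic plays the role that the SPU compiler's built-in overloading plays for spu_add.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side stand-ins for two 128-bit SPU vector types. */
typedef struct { uint32_t e[4]; } vec_uint4_model;
typedef struct { float    e[4]; } vec_float4_model;

/* Models what the `a` instruction computes: four wrapping 32-bit adds. */
vec_uint4_model add_u32x4(vec_uint4_model x, vec_uint4_model y) {
    vec_uint4_model r;
    for (int i = 0; i < 4; i++) r.e[i] = x.e[i] + y.e[i];
    return r;
}

/* Models what the `fa` instruction computes: four single-precision adds. */
vec_float4_model add_f32x4(vec_float4_model x, vec_float4_model y) {
    vec_float4_model r;
    for (int i = 0; i < 4; i++) r.e[i] = x.e[i] + y.e[i];
    return r;
}

/* One name, two implementations, chosen by the type of the first operand. */
#define model_spu_add(x, y) _Generic((x), \
    vec_uint4_model:  add_u32x4,          \
    vec_float4_model: add_f32x4)(x, y)

/* Worked example: the same call syntax resolves to different helpers. */
void model_spu_add_demo(void) {
    vec_uint4_model  a = {{1, 2, 3, 4}};
    vec_uint4_model  b = {{10, 20, 30, 40}};
    vec_float4_model f = {{1.5f, 2.5f, 3.5f, 4.5f}};
    vec_float4_model g = {{0.5f, 0.5f, 0.5f, 0.5f}};

    vec_uint4_model  s = model_spu_add(a, b);  /* integer path */
    vec_float4_model t = model_spu_add(f, g);  /* float path   */

    assert(s.e[0] == 11u && s.e[3] == 44u);
    assert(t.e[0] == 2.0f && t.e[3] == 5.0f);
}
```

In real SPU code none of this scaffolding is needed: once spu_intrinsics.h is included, a single spu_add call covers both cases.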

Note that the intrinsics generally have the same limitations as their corresponding assembly language instructions. However, in cases where an immediate value is too large for the appropriate immediate-mode instruction, the compiler will promote the immediate value to a vector and do the corresponding vector/vector operation. For instance, spu_add(myvec, 2) generates an ai (add immediate) instruction while spu_add(myvec, 2000) first loads the 2000 into its own vector using il and then performs the a (add) instruction.
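To see why 2 and 2000 are treated differently, it helps to check them against the size of the immediate field. The sketch below assumes that ai carries a 10-bit signed immediate (-512 through 511). The function names and the enum are invented for illustration, and the real decision is made inside the compiler.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the two code shapes described in the text. */
typedef enum {
    USE_AI,         /* value fits: a single add-immediate              */
    USE_IL_THEN_A   /* value too large: load into a vector, then add   */
} add_plan;

/* True if `value` is representable as a two's-complement field of `bits` bits. */
bool fits_signed_bits(int value, int bits) {
    int lo = -(1 << (bits - 1));
    int hi = (1 << (bits - 1)) - 1;
    return value >= lo && value <= hi;
}

/* Assumption: the add-immediate form has a 10-bit signed immediate field. */
add_plan plan_scalar_add(int value) {
    return fits_signed_bits(value, 10) ? USE_AI : USE_IL_THEN_A;
}

/* Worked example using the two constants from the text. */
void immediate_plan_demo(void) {
    assert(plan_scalar_add(2) == USE_AI);
    assert(plan_scalar_add(2000) == USE_IL_THEN_A);
}
```

Either way the source code stays the same, which is the point: the intrinsic hides the difference between the immediate and vector/vector forms.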

The order of operands in an intrinsic is essentially the same as that of the corresponding assembly language instruction, except that the first operand (which holds the destination register in assembly language) is not specified. Instead, it becomes the return value of the function, and the compiler supplies the actual destination register in the code it generates.

For more on vector intrinsics, see "Programming on the Cell/B.E. processor, Part 5," the article from which this tip was taken.

Basic SPU intrinsics

This list covers some of the more common SPU intrinsics. Types are not given because most of the intrinsics are polymorphic.

  • spu_add(val1, val2)
    Adds each element of val1 to the corresponding element of val2. If val2 is a non-vector value, it adds the value to each element of val1.
  • spu_sub(val1, val2)
    Subtracts each element of val2 from the corresponding element of val1. If val1 is a non-vector value, then val1 is replicated across a vector, and then val2 is subtracted from it.
  • spu_mul(val1, val2)
    Because the multiplication instructions operate so differently, the SPU intrinsics do not coalesce them as much as they do for other operations. spu_mul handles floating point multiplication (single and double precision). The result is a vector where each element is the result of multiplying the corresponding elements of val1 and val2 together.
  • spu_and(val1, val2), spu_or(val1, val2), spu_not(val), spu_xor(val1, val2), spu_nor(val1, val2), spu_nand(val1, val2), spu_eqv(val1, val2)
    Boolean operations operate bit-by-bit, so the type of operands the boolean operations receive is not relevant except for determining the type of value they will return. spu_eqv is a bitwise equivalency operation, not a per-element equivalency operation.
  • spu_rl(val, count), spu_sl(val, count)
    spu_rl rotates each element of val left by the number of bits specified in the corresponding element of count. Bits rotated off the end are rotated back in on the right. If count is a scalar value, then it is used as the count for all elements of val. spu_sl operates the same way, but performs a shift instead of a rotate.
  • spu_rlmask(val, count), spu_rlmaska(val, count), spu_rlmaskqw(val, count), spu_rlmaskqwbyte(val, count)
    These operations are confusingly named. They are called "rotate left and mask," but they actually perform right shifts (they are implemented by a combination of left shifts and masks, but the programming interface is for right shifts). spu_rlmask and spu_rlmaska shift each element of val to the right by the number of bits in the corresponding element of count (or by the value of count if count is a scalar). spu_rlmask shifts in zeros, while spu_rlmaska replicates the sign bit as bits are shifted in. spu_rlmaskqw operates on the whole quadword at a time, but only up to 7 bits (it performs a modulus on count to put it in the proper range). spu_rlmaskqwbyte works similarly, except that count is the number of bytes instead of bits, and count is modulus 16 instead of 8. Be aware that the underlying instructions interpret count as a negated shift amount, so a right shift by n is normally requested with a count of -n.
  • spu_cmpgt(val1, val2), spu_cmpeq(val1, val2)
    These instructions perform element-by-element comparisons of their two operands. The results are stored as all ones (for true) and all zeros (for false) in the resulting vector in the corresponding element. spu_cmpgt performs a greater-than comparison while spu_cmpeq performs an equality comparison.
  • spu_sel(val1, val2, conditional)
    This corresponds to the selb assembly language instruction. The instruction itself is bit-based, so all types use the same underlying instruction. However, the intrinsic operation returns a value of the same type as the operands. As in assembly language, spu_sel looks at each bit in conditional. If the bit is zero, the corresponding bit in the result is selected from the corresponding bit in val1; otherwise it is selected from the corresponding bit in val2.
  • spu_shuffle(val1, val2, pattern)
    This is an interesting instruction which allows you to rearrange the bytes in val1 and val2 according to a pattern, specified in pattern. The instruction goes through each byte in pattern, and if the byte starts with the bits 0b10, the corresponding byte in the result is set to 0x00; if the byte starts with the bits 0b110, the corresponding byte in the result is set to 0xff; if the byte starts with the bits 0b111, the corresponding byte in the result is set to 0x80; finally (and most importantly), if none of the previous are true, the last five bits of the pattern byte are used to choose which byte from val1 or val2 should be taken as the value for the current byte. The two values are concatenated, and the five-bit value is used as the byte index of the concatenated value. This is used for inserting elements into vectors as well as performing fast table lookups.
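The difference between the rotate, shift, and right-shift entries in the list above is easiest to see on a single 32-bit element. The helpers below are a host-side sketch with invented names. They take the shift distance directly, assume it is less than 32 where noted, and do not model how the real intrinsics encode their count argument.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left: bits leaving the top re-enter at the bottom (one element of spu_rl). */
uint32_t model_rl32(uint32_t x, unsigned count) {
    count &= 31u;                 /* a 32-bit rotate depends only on count mod 32 */
    if (count == 0) return x;     /* avoid an undefined shift by 32 below         */
    return (x << count) | (x >> (32u - count));
}

/* Shift left: vacated low bits are filled with zeros (one element of spu_sl). */
uint32_t model_sl32(uint32_t x, unsigned count) {
    assert(count < 32u);          /* assumption of this sketch */
    return x << count;
}

/* Logical right shift: zeros are shifted in (the spu_rlmask behavior). */
uint32_t model_shr_logical32(uint32_t x, unsigned count) {
    assert(count < 32u);
    return x >> count;
}

/* Arithmetic right shift: the sign bit is replicated (the spu_rlmaska behavior). */
uint32_t model_shr_arith32(uint32_t x, unsigned count) {
    assert(count < 32u);
    uint32_t shifted = x >> count;
    if ((x & 0x80000000u) && count > 0)
        shifted |= ~(0xFFFFFFFFu >> count);  /* fill vacated high bits with ones */
    return shifted;
}

/* Worked example: one input, four different results. */
void model_shift_demo(void) {
    uint32_t x = 0x80000001u;
    assert(model_rl32(x, 4)          == 0x00000018u);
    assert(model_sl32(x, 4)          == 0x00000010u);
    assert(model_shr_logical32(x, 4) == 0x08000000u);
    assert(model_shr_arith32(x, 4)   == 0xF8000000u);
}
```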
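The comparison and select entries in the list above are usually used together: the all-ones or all-zeros result of a comparison becomes the mask for a bitwise select, which replaces a branch. The following host-side sketch models that pairing for unsigned 32-bit elements. The type and function names are hypothetical, and real SPU code would call spu_cmpgt and spu_sel instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for a vector of four unsigned ints. */
typedef struct { uint32_t e[4]; } vec_uint4_model;

/* Models spu_cmpgt: all ones where a > b, all zeros elsewhere. */
vec_uint4_model model_cmpgt_u32(vec_uint4_model a, vec_uint4_model b) {
    vec_uint4_model m;
    for (int i = 0; i < 4; i++) m.e[i] = (a.e[i] > b.e[i]) ? 0xFFFFFFFFu : 0u;
    return m;
}

/* Models spu_sel: each result bit comes from a where the mask bit is 0,
   and from b where the mask bit is 1. */
vec_uint4_model model_sel(vec_uint4_model a, vec_uint4_model b, vec_uint4_model mask) {
    vec_uint4_model r;
    for (int i = 0; i < 4; i++) r.e[i] = (a.e[i] & ~mask.e[i]) | (b.e[i] & mask.e[i]);
    return r;
}

/* Branch-free element-wise maximum: take a wherever a > b, otherwise b. */
vec_uint4_model model_max_u32(vec_uint4_model a, vec_uint4_model b) {
    return model_sel(b, a, model_cmpgt_u32(a, b));
}

/* Worked example, including a mask that is not a comparison result. */
void model_sel_demo(void) {
    vec_uint4_model a = {{1u, 50u, 7u, 0u}};
    vec_uint4_model b = {{9u, 5u, 7u, 3u}};
    vec_uint4_model m = model_max_u32(a, b);
    assert(m.e[0] == 9u && m.e[1] == 50u && m.e[2] == 7u && m.e[3] == 3u);

    /* Because the select is bitwise, a partial mask mixes bits within one element. */
    vec_uint4_model x = {{0x11112222u, 0u, 0u, 0u}};
    vec_uint4_model y = {{0x33334444u, 0u, 0u, 0u}};
    vec_uint4_model k = {{0x0000FFFFu, 0u, 0u, 0u}};
    assert(model_sel(x, y, k).e[0] == 0x11114444u);
}
```

Note the argument order in model_max_u32: since a one bit in the mask picks from the second operand, the value to keep when the comparison is true goes second.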
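The byte-selection rules given for spu_shuffle in the list above can be written out directly. This is a host-side sketch that follows the description in the text byte by byte. The vec_uchar16_model type and the model_shuffle function are hypothetical names, not part of spu_intrinsics.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side stand-in for a 16-byte SPU vector. */
typedef struct { uint8_t b[16]; } vec_uchar16_model;

/* Models spu_shuffle as described in the text. */
vec_uchar16_model model_shuffle(vec_uchar16_model v1, vec_uchar16_model v2,
                                vec_uchar16_model pattern) {
    vec_uchar16_model r;
    for (int i = 0; i < 16; i++) {
        uint8_t p = pattern.b[i];
        if ((p & 0xC0) == 0x80) {
            r.b[i] = 0x00;                    /* pattern byte 10xxxxxx */
        } else if ((p & 0xE0) == 0xC0) {
            r.b[i] = 0xFF;                    /* pattern byte 110xxxxx */
        } else if ((p & 0xE0) == 0xE0) {
            r.b[i] = 0x80;                    /* pattern byte 111xxxxx */
        } else {
            /* Low five bits index the 32-byte concatenation of v1 and v2. */
            uint8_t idx = p & 0x1F;
            r.b[i] = (idx < 16) ? v1.b[idx] : v2.b[idx - 16];
        }
    }
    return r;
}

/* Worked example: reverse v1, then mix in v2 and the three constants. */
void model_shuffle_demo(void) {
    vec_uchar16_model v1, v2, pat, r;
    for (int i = 0; i < 16; i++) {
        v1.b[i]  = (uint8_t)i;            /* 0x00 .. 0x0F               */
        v2.b[i]  = (uint8_t)(0xA0 + i);   /* 0xA0 .. 0xAF               */
        pat.b[i] = (uint8_t)(15 - i);     /* reverse the bytes of v1    */
    }
    r = model_shuffle(v1, v2, pat);
    assert(r.b[0] == 0x0F && r.b[15] == 0x00);

    pat.b[0] = 0x10;  /* index 16: first byte of v2 */
    pat.b[1] = 0x80;  /* constant 0x00              */
    pat.b[2] = 0xC0;  /* constant 0xFF              */
    pat.b[3] = 0xE0;  /* constant 0x80              */
    r = model_shuffle(v1, v2, pat);
    assert(r.b[0] == 0xA0 && r.b[1] == 0x00 && r.b[2] == 0xFF && r.b[3] == 0x80);
}
```

Treating v1 and v2 as a 32-entry byte table and the pattern as a vector of indexes gives the table-lookup use mentioned in the text.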

All of the intrinsics that are prefixed with spu_ try to find the best instruction match based on the types of their operands. However, not every intrinsic supports every vector type. Support depends on whether an assembly language instruction exists to handle that type.

In addition, if you want a specific instruction rather than having the compiler choose one, you can perform almost any non-branching instruction with the specific intrinsics. All specific intrinsics take the form si_assemblyinstructionname, where assemblyinstructionname is the name of the assembly language instruction as defined in the SPU Assembly Language Specification. So, si_a(a, b) forces the instruction a to be used for addition.

All operands to specific intrinsics are cast to a special type called qword, which is essentially an opaque register value type. Return values from specific intrinsics are also qwords, which can then be cast to whatever vector type you wish.
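The cast-in, operate, cast-out pattern for specific intrinsics can be pictured with a host-side model. In this sketch an opaque 16-byte struct stands in for qword, and model_si_a stands in for si_a. All names here are hypothetical, and memcpy is used where real SPU code would use ordinary casts to and from qword.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins: an untyped 128-bit register value and a typed vector. */
typedef struct { uint8_t  bytes[16]; } qword_model;
typedef struct { uint32_t e[4];      } vec_uint4_model;

/* "Cast" a typed vector to the opaque register type (bits are unchanged). */
qword_model to_qword(vec_uint4_model v) {
    qword_model q;
    memcpy(q.bytes, v.e, sizeof q.bytes);
    return q;
}

/* "Cast" the opaque register type back to a typed vector. */
vec_uint4_model from_qword(qword_model q) {
    vec_uint4_model v;
    memcpy(v.e, q.bytes, sizeof q.bytes);
    return v;
}

/* Models si_a: always a 32-bit word add, regardless of what the bits "mean". */
qword_model model_si_a(qword_model a, qword_model b) {
    vec_uint4_model x = from_qword(a), y = from_qword(b), r;
    for (int i = 0; i < 4; i++) r.e[i] = x.e[i] + y.e[i];
    return to_qword(r);
}

/* Worked example: typed values go in as qwords and come back out typed. */
void model_si_a_demo(void) {
    vec_uint4_model a = {{1u, 2u, 3u, 4u}};
    vec_uint4_model b = {{10u, 20u, 30u, 40u}};
    vec_uint4_model sum = from_qword(model_si_a(to_qword(a), to_qword(b)));
    assert(sum.e[0] == 11u && sum.e[1] == 22u && sum.e[2] == 33u && sum.e[3] == 44u);
}
```

The design point is that the specific intrinsic fixes the instruction, so the type information that drives the spu_ intrinsics is deliberately erased on the way in and restored by the caller on the way out.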

But wait! There's more

In "Programming on the Cell/B.E. processor, Part 5," the article from which this tip was taken, you can learn more about how to use the vector extensions and discover how to direct the compiler to do branch prediction and to perform DMA transfers in C/C++.


