A guide to inline assembly for C and C++

Basic, intermediate, and advanced concepts

First, the authors describe basic usage syntax for inline assembly (inline asm) embedded within C and C++ programs. Then they explain intermediate concepts, such as addressing modes, the clobbers list, and branching stanzas. Finally, they discuss more advanced topics, such as memory clobbers, the volatile attribute, and locks, for those who want to use inline asm in multithreaded applications.

Salma Elshatanoufy (elshatan@ca.ibm.com), Software Developer, IBM

Salma Elshatanoufy is a software developer in the IBM XL Compilers group, in Canada. Salma has been with IBM, in the test department, for three years. Besides compilers, her interests include multi-threaded applications.



William O'Farrell (billo@ca.ibm.com), Software Developer, IBM Master Inventor, IBM

Bill O'Farrell is a software developer in the IBM XL Compilers group, in Canada. During his 20 years at IBM, he has worked in several areas. Besides compilers, his interests include debuggers and concurrency. Bill has a PhD in parallel computing from Syracuse University.



01 November 2011


In this article, we discuss several use scenarios for inline assembly, also called inline asm. For beginners, we introduce basic syntax, operand referencing, constraints, and common pitfalls that new users need to be aware of. For intermediate users, we discuss the clobbers list, as well as branching topics that facilitate the use of branch instructions within inline asm stanzas in their C/C++ code. Lastly, we discuss memory clobbers and the volatile attribute for advanced users who use inline asm to optimize their code. We conclude with an example of multithreaded locking with inline asm.

Basic inline asm

In the asm block shown in Listing 1, the addc. instruction is used to add two variables, op1 and op2. In any asm block, the assembly instructions appear first, followed by the outputs and inputs, each introduced by a colon. The assembly instructions can consist of one or more quoted strings. The first colon introduces the output operands; the second colon introduces the input operands. If there are clobbered registers, they are listed after the third colon. If there are no clobbered registers for the asm block, the third colon can be omitted, as Listing 2 shows.

Listing 1. Opcodes, inputs, outputs, and clobbers

int res=0;
int op1=20;
int op2=30;

asm ( " addc.    %0,%1,%2        \n"         
       : "=r"(res)			             
       : "b"(op1), "r"(op2)                    
       : "r0" 					 );

Listing 2. No clobbered registers for the asm block, so third colon omitted

asm ( " addc.    %0,%1,%2        \n"         
       : "=r"(res)			                 
       : "b"(op1), "r"(op2) 		 );

Note:
The clobbers list is discussed later in this article.

Each instruction "expects" inputs and outputs to be passed in a certain format. In the previous example, the addc. instruction expects its operands to be passed through registers, hence op1 and op2 are passed into the asm block with the "b" and "r" constraints. For a complete listing of all legal asm constraints for the IBM XL C and C++ compiler, see the compiler language reference.
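As a minimal sketch of the layout described above (instruction string, then outputs, then inputs), the helper below wraps a single add instruction in a function. The name add_words, the predefined-macro test, and the plain C fallback are assumptions added for illustration; the fallback exists only so that the sketch also builds on machines that are not POWER. It uses add rather than addc. so that no status bits are modified and the clobbers section can be left out.

```c
/* Hypothetical helper: returns op1 + op2 using one add instruction.
   The macro test is an assumption about what the compiler predefines. */
static long add_words(long op1, long op2)
{
    long res;
#if defined(__powerpc__) || defined(__PPC__) || defined(_ARCH_PPC)
    __asm__ ("add %0,%1,%2"
             : "=r"(res)             /* output: any general register  */
             : "r"(op1), "r"(op2));  /* inputs: any general registers */
#else
    res = op1 + op2;                 /* stand-in for the asm elsewhere */
#endif
    return res;
}
```

A call such as add_words(op1, op2) with the values from Listing 1 would be expected to yield 50.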

Register constraints on variable declarations

In some programs, you will want to tie variables to certain hardware registers. This is done at the variable declaration. The following example ties the variable res to GPR0 throughout the life of the program:

int register res asm("r0")=0;

When the variable type is not matched with the type of target hardware register, you will receive a compilation error notice.

After a variable is tied to a specific register, it is not possible to use another register to hold the same variable. For example, the following code will cause a compilation error: the variable res is associated at declaration time with GPR0, but in the asm block, the user attempts to use any register but GPR0 to pass in res.

Listing 3. Compilation error when conflicting constraints are used on a variable

int register res asm("r0")=0;

asm ( " addc.    %0,%1,%2        \n"         
       : "=b"(res)			             
       : "b"(op1), "r"(op2)                    
       : "r0" 					 );

In the example in Listing 4, there is no output operand for the stw instruction, hence the outputs section of the asm is empty. None of the registers is modified, so they are all input operands, and the target address is passed in with the input operands. However, something is modified: the addressed memory location. But that location is not explicitly mentioned in the instruction, so the output of the instruction is implicit rather than explicit.

Listing 4. Instructions with no output operands

int res [] = {0,0};
int a=45;
int *pointer = &res[0];

asm ( " stw    %0,%1(%2)        \n"            
	: 			  			      	                          
	: "r"(a), "i"(sizeof(int)),"r"(pointer));

Listing 5. Instructions with preserved operands

int res [] = {0,0};
int a=45;

asm ( " stw    %0,%1(%2)        \n"      
	: "+r"(res[0]) 
	: "r"(a), "i"(sizeof(int)),"r"(pointer));

In Listing 5, res[0] is a result variable that the asm block does not necessarily modify. To preserve the initial value of such a variable, use the + (plus sign) constraint, which tells the compiler that the operand is read as well as written.
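The effect of the + constraint is easiest to see on an operand that the instruction both reads and writes. The following is a sketch under stated assumptions: the name increment, the predefined-macro test, and the plain C fallback are not from the article and are there only for illustration and portability.

```c
/* Hypothetical helper: returns value + 1 by updating the operand in place. */
static long increment(long value)
{
#if defined(__powerpc__) || defined(__PPC__) || defined(_ARCH_PPC)
    /* "+": the incoming value is loaded before the asm and the same
       register is read back afterward.
       "b" instead of "r": addi treats an RA field of 0 as the literal 0,
       so the operand must not be placed in r0. */
    __asm__ ("addi %0,%0,1" : "+b"(value));
#else
    value += 1;                      /* stand-in for the asm elsewhere */
#endif
    return value;
}
```

With "=b" in place of "+b", the compiler would be free to assume that the incoming value is never read, and the result would be undefined.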

Target memory addresses in inline asm

If an instruction specifies two of its arguments in a form similar to D(RA), where D is a literal value and RA is a general register, then this is taken to mean that D+RA is an effective address. In this case, the appropriate constraints are "m" or "o". Both "m" and "o" refer to memory arguments. Constraint "o" is described as an offsettable memory location. But in the IBM® POWER® architecture, nearly all memory references require an offset, so "m" and "o" are equivalent. In this case, you can use a single constraint to refer to two operands in the instruction. Listing 6 is an example.

Listing 6. A single constraint to refer to two operands in the instruction

int res [] = {0,0};
int a=45;

asm ( " stb %1,%0   	      \n"     	    
     : "=m"(res[1])                    
     : "r"(a));

The form of the instruction stb (from the assembly language reference) is: stb RS,D(RA).

Although the stb instruction technically takes three operands (a source register, an address register, and an immediate displacement), the asm description of it uses only two constraints. The "=m" constraint notifies the compiler that the memory address of res[1] is to be used for the result of the store instruction, and it indicates that the operand is a modified memory location. You do not need to know the address of the target location beforehand, because that task is left to the compiler. This allows the compiler to choose the right register (r1 for an automatic variable, for instance) and apply the right displacement automatically. This is necessary, because it would generally be impossible for an asm programmer to know what address register and what displacement to use. In other instances, you can also override this behavior by manually calculating the target address, as in the following example.

Listing 7. Manually calculating the target address

int res [] = {0,0};
int a=45;

asm ( " stb    %0,%1(%2)        \n"
	:
	: "r"(a), "i"(sizeof(int)),"r"(&res));

In this code, the specification %1(%2) represents an offset and a base address: %2 is the base address (the address of res[0]), and %1 is the offset, sizeof(int). As a result, the store is performed at the effective address of res[1].

Note:
For some instructions, GPR0 cannot be used as a base address. Specifying GPR0 tells the assembler not to use a base register at all. To ensure that the compiler does not choose r0 for an operand, you can use the constraint "b" rather than "r".
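The two ways of naming a store target can be put side by side. This is a sketch, not the authors' code: the function names, the predefined-macro test, and the plain C fallback are assumptions for illustration. The second form adds a "memory" clobber so that the compiler knows the asm wrote to memory it was not told about, and it uses "b" for the base register as the note above suggests.

```c
/* Form 1: one "=m" operand; the compiler supplies displacement and base. */
static void store_second_m(int pair[2], int value)
{
#if defined(__powerpc__) || defined(__PPC__) || defined(_ARCH_PPC)
    __asm__ ("stw %1,%0" : "=m"(pair[1]) : "r"(value));
#else
    pair[1] = value;                 /* stand-in for the asm elsewhere */
#endif
}

/* Form 2: displacement and base register are passed explicitly. */
static void store_second_manual(int pair[2], int value)
{
#if defined(__powerpc__) || defined(__PPC__) || defined(_ARCH_PPC)
    __asm__ volatile ("stw %0,%1(%2)"
                      :                          /* no register outputs  */
                      : "r"(value),              /* %0: value to store   */
                        "i"(sizeof(int)),        /* %1: displacement     */
                        "b"(pair)                /* %2: base, never r0   */
                      : "memory");               /* target not an operand */
#else
    *(int *)((char *)pair + sizeof(int)) = value;
#endif
}
```

Both functions are meant to leave pair[0] untouched and write value to pair[1]; the first form is usually preferable because the compiler can still reason about exactly which location changed.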

Addressing modes for POWER and PowerPC instructions

The IBM POWER architecture type is RISC. Instructions typically operate either with three register arguments (two registers for source arguments, one register to hold a result) or with two registers and an immediate value (one register and one immediate value for the source arguments, and one register to hold the result). There are exceptions to this pattern, but mostly it is true.

Among the instructions that take two registers and an immediate value, there are two special subclasses: load instructions and store instructions. These instructions use the immediate value as an offset to the value in the source register to form an "effective address." The offset value is typically an offset onto the stack (r1 is the stack pointer), or it is an offset to the TOC (Table of Contents -- r2 is the TOC pointer). The TOC is used to promote the construction of position-independent code, which enables efficient dynamic loading of shared libraries on these machines.

When using inline asm, you do not have to use specific registers nor manually construct effective addresses. The argument constraints are used to direct the compiler to choose registers or construct effective addresses appropriate to the requirements of the instructions. Thus, if a general register is required by the instruction, you could use either the "r" or "b" constraint. The "b" constraint is of interest, because many instructions treat the designation of register 0 specially: a designation of register 0 does not mean that r0 is used, but instead a literal value of 0. For these instructions, it is wise to use "b" to denote the input operands to prevent the compiler from choosing r0. If the compiler chooses r0, and the instruction takes that to mean a literal 0, the instruction would produce incorrect results.

Listing 8. r0 and its special meaning in the stbx instruction

char res[8]={'a','b','c','d','e','f','g','h'};
char a='y';
int index=7;

asm ("  stbx %0,%1,%2       \n"     
        :             
        : "r"(a), "r"(index), "r"(res) );

Here, the expected result string is abcdefgy, but if the compiler chose r0 for %1, then the result would incorrectly be ybcdefgh. To prevent this from happening, use "b", as Listing 9 shows.

Listing 9. Using "b" constraint to signify non-zero GPR

char res[8]={'a','b','c','d','e','f','g','h'};
char a='y';
int index=7;

asm ("  stbx %0,%1,%2       \n"     
        :             
        : "r"(a), "b"(index), "r"(res) );
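The same store can be wrapped in a function so that the constraint choice is made once. This is a sketch under assumptions: the name store_indexed, the use of long for the index (so that it fills a whole register), the added "memory" clobber, the predefined-macro test, and the plain C fallback are all illustrative additions rather than part of the listing above.

```c
/* Hypothetical helper: base[index] = value, using stbx. */
static void store_indexed(char *base, long index, char value)
{
#if defined(__powerpc__) || defined(__PPC__) || defined(_ARCH_PPC)
    /* stbx RS,RA,RB stores at (RA|0) + RB. The index sits in the RA
       slot, so "b" keeps it out of r0; the base sits in RB, where r0
       has no special meaning, so "r" is acceptable there. */
    __asm__ volatile ("stbx %0,%1,%2"
                      :
                      : "r"(value), "b"(index), "r"(base)
                      : "memory");   /* the store target is not an operand */
#else
    base[index] = value;             /* stand-in for the asm elsewhere */
#endif
}
```

Called as store_indexed(res, 7, 'y') on the array from Listing 9, it is expected to change only the last character.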

Another example is in the following ASM block. While it appears that the asm block below does res=res+4, that is not the actual functional behavior of the code.

Listing 10. Meaning of r0 in the second operand with addi opcode

int register res asm("r0")=5;
int b=4;

asm ( " addi    %0,%0,%1        \n"             
	 : "+r"(res)                     	                        
	 : "i"(b)		   				 
      : "r0");     	

where:

addi    %0(result operand),%0(input operand res),%1(immediate operand b)

Because res is tied to r0, the translation of the asm code in assembly becomes:
addi 0,0,4

The second operand does not translate to register zero. Instead, it translates to the immediate number zero. In effect, the following is the result of the addi operation:
res=0+4

This case is special to the addi opcode. If, instead, res was tied to r1, then the original intended behavior would have been obtained:
res=res+4
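The rule can be written down as a few lines of ordinary C. The function below is only a toy model of the addi semantics described above, not an emulator; the name addi_model and the regs array standing in for the general registers are hypothetical.

```c
/* Toy model of "addi RT,RA,SI": regs[] stands in for GPR0..GPR31. */
static void addi_model(long regs[32], int rt, int ra, long si)
{
    /* An RA field of 0 selects the literal value 0, not the contents of r0. */
    long base = (ra == 0) ? 0 : regs[ra];
    regs[rt] = base + si;
}
```

With regs[0] holding 5, addi_model(regs, 0, 0, 4) leaves 4 in regs[0], whereas the same call on any nonzero register number holding 5 leaves 9.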


Clobbers list

Basic clobbers list

In cases when registers that are not directly tied to the inputs/outputs are used within the asm block, the user must specify such registers within the clobbers list.

The clobbers list is used to notify the compiler that the registers contained within the list can potentially have their values altered. Hence, the compiler should not use them to hold any data other than what the asm instructions themselves place there.

In the example in Listing 11, registers 8 and 7 are added to the clobbers list because they are used in the instructions but are not explicitly tied to any of the input/output operands. Also, condition register field zero is added to the clobbers list for the same reason. Although it is not present in the input/output operands, the mfocrf instruction reads that field from the condition register and moves the value into register 8.

Listing 11. Clobbers list example

asm (" addc.    %0,%2,%3              \n"
     " mfocrf   8,0x1                 \n"
     " andi.    7,8,0xF               \n"
     " stw      7,%1                  \n"
     : "=r"(res),"=m"(c_bit)
     : "b"(a), "r"(b)
     : "r0","r7","r8","cr0" );        // clobbers list

If, instead, the mfocrf instruction read from condition register field 1 (cr1), then that field would need to be added to the clobbers list instead. Also, the period (full stop) at the end of the addc. and andi. instructions means that their results are compared to zero, and the result of the comparison is stored in condition register field 0.

When clobbered registers are omitted from the clobbers list, the results from the asm operations might not be correct. This is because the compiler might reuse such registers to hold intermediate values for other operations. Unless the compiler is told that those registers are clobbered, that intermediate data can end up being used by the programmer's instructions, with inaccurate results. Also, the user's asm instructions may clobber values used by the compiler.
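A smaller case than Listing 11 shows the same obligation. In the sketch below, r7 is named directly in the instruction text and is not tied to any operand, so it has to appear in the clobbers list. The function name double_sum, the predefined-macro test, and the plain C fallback are assumptions added for illustration.

```c
/* Hypothetical helper: returns 2 * (a + b), using r7 as a scratch register. */
static long double_sum(long a, long b)
{
    long res;
#if defined(__powerpc__) || defined(__PPC__) || defined(_ARCH_PPC)
    __asm__ ("add 7,%1,%2 \n\t"      /* r7 = a + b; r7 is not an operand   */
             "add %0,7,7"            /* res = r7 + r7                      */
             : "=r"(res)
             : "r"(a), "r"(b)
             : "r7");                /* tell the compiler r7 is overwritten */
#else
    res = 2 * (a + b);               /* stand-in for the asm elsewhere */
#endif
    return res;
}
```

If "r7" were left out of the list, the compiler could legitimately keep a, b, or some unrelated live value in r7 across the asm, and the first add would silently destroy it.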

Exceptions to the clobbers list

Nearly all registers can be clobbered, except for those listed in Table 1.

Table 1. Registers that cannot be clobbered
Register   Description
r1         stack pointer
r2         TOC pointer
r11        environment pointer
r13        64-bit mode thread-local data pointer
r30        often used by the compiler as a stack frame pointer, pointer to constant area
r31        often used by the compiler as a stack frame pointer, pointer to constant area

Memory clobbers

Memory clobber implies a fence, and it also impacts how the compiler treats potential data aliases. A memory clobber says that the asm block modifies memory that is not otherwise mentioned in the asm instructions. So, for example, a correct use of memory clobbers would be when using an instruction that clears a cache line. The compiler will assume that virtually any data may be aliased with the memory changed by that instruction. As a result, all required data used after the asm block will be reloaded from memory after the asm completes. This is much more expensive than the simple fence implied by the "volatile" attribute (discussed later).

Remember, because the memory clobber says anything might be aliased, everything that is used needs to be reloaded after the asm, regardless of whether it had anything to do with the asm. A memory clobber can be added to the clobbers list by simply using the "memory" word instead of a register name.
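The reload behavior can be seen in isolation with an asm block that contains no instructions at all. This sketch assumes a compiler that accepts GCC-style extended asm with an empty instruction string; the name reload_demo is hypothetical, and the keyword is spelled __asm__ only so that strict ISO modes accept it.

```c
/* Hypothetical helper: reads *p before and after a memory clobber. */
static int reload_demo(int *p)
{
    int before = *p;

    /* No instructions are emitted, but the "memory" clobber tells the
       compiler that any memory may have changed at this point. */
    __asm__ volatile ("" : : : "memory");

    int after = *p;      /* must be loaded again, not reused from 'before' */
    return before + after;
}
```

Without the clobber, an optimizing compiler would be entitled to load *p once and use the value twice. Note that this affects only what the compiler does; it emits no synchronization instruction for the hardware.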


Branching

Basic branching

Branching can be tricky with inline asm because you need to know, before compile time, the address of the instruction to which to branch. Although this is not possible, you can use labels. Using labels, the branch-to address can be designated with a unique identifier that can be used as a target branch address.

Within a single source file, labels cannot be repeated within an inline asm block, nor within neighboring asm blocks within the same source. In a given program, each label is unique. There is an exception to this rule, however, and this is if you use relative branching (more on this later). With relative branching, more than one label with the same identifier can be found within the same program and within the same asm block.

Note:
Labels cannot be used in asm to define macros because of possible namespace clashes.

In the example in Listing 12, the branch occurs when the LT bit, bit 0, of the condition register is set. If it is not set, then the branch is not taken.

Listing 12. Example of branch taken when LT bit of CR0 is set (0x80000000)

asm ( "  addic. %0,%2,%4                      	\n"  	
      "  bc     0xC,0,here                        	\n"     	
      "  there: add %1,%2,%3                		\n"       
      "  here:  mul %0,%2,%3                		\n"       	
      :  "=r"(res),"=r"(res2)                                    
      :  "r" (a),"r"(b),"r"(c)                                          
      :  "cr0" );

Likewise, a branch would occur if the GT bit (bit 1) of the condition register is set, as in the code in Listing 13.

Listing 13. Example of branch taken when GT bit of CR0 is set (0x40000000)

asm ( "  addic. %0,%2,%4                      	\n"     	
      "  bc     0xC,1,here                        	\n"     	
      "  there: add %1,%2,%3                		\n"       
      "  here:  mul %0,%2,%3                		\n"       	
      :  "=r"(res),"=r"(res2)                                    
      :  "r" (a),"r"(b),"r"(c)                                         
      :  "cr0" );

With inline asm, it is perfectly legal to branch within the same asm block; however, it is not possible to branch between different asm blocks, even if they are contained within the same source.
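A branch that stays inside one asm block can be as small as a compare followed by a conditional skip. The sketch below is an illustration under assumptions: the name max_int, the predefined-macro test, and the plain C fallback are not from the article. It uses a numeric label with the relative-branch suffix covered in the next section ("1f" means the nearest label 1 further down), which keeps the label legal even if the function is expanded inline more than once.

```c
/* Hypothetical helper: returns the larger of a and b. */
static int max_int(int a, int b)
{
    int res = a;
#if defined(__powerpc__) || defined(__PPC__) || defined(_ARCH_PPC)
    __asm__ ("   cmpw  %1,%2   \n"   /* compare a with b; result goes to cr0 */
             "   bge   1f      \n"   /* a >= b: skip the move                */
             "   mr    %0,%2   \n"   /* otherwise res = b                    */
             "1:               \n"
             : "+r"(res)             /* read (holds a) and possibly written  */
             : "r"(a), "r"(b)
             : "cr0");               /* cmpw writes condition field 0        */
#else
    if (a < b)                       /* stand-in for the asm elsewhere */
        res = b;
#endif
    return res;
}
```

Both the branch and its target live in the same asm statement, so the compiler never has to know that control flow changed inside the block.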

Relative branching

As discussed earlier, relative branching allows you to reuse the name of a label more than once within the same program. It is predominantly used, however, to dictate the position of the target address relative to the branch instruction. These are examples of the relative branch codes that can be used:

  • f - forward
  • b - backward

Note:
These suffixes must be appended to numeric labels to be syntactically correct.

In this example (Listing 14), notice that the branch targets are referenced as "20f" and "10b". In each case, we use the label of the target address with a suffix that dictates where this label is located relative to the branch instruction itself. The label "10" is located before the branch instruction, hence the "b" suffix in "10b"; the label "20" is located after it, hence the "f" suffix in "20f".

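Listing 14. Relative branching with numeric labels

asm ( "10: lwarx  %0,0,%2   \n"
      "    cmpwi  %0,0      \n"
      "    bne-   20f       \n"
      "    ori    %0,%0,1   \n"
      "    stwcx. %0,0,%2   \n"
      "    bne-   10b       \n"
      "    sync             \n"
      "    ori    %1,%1,1   \n"
      "20:                  \n"
      : "+r"(lockval), "+r"(returnvalue)   // operands as in Listing 17
      : "r"(lock)
      : "cr0" );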

The condition register

The condition register is used to capture information on results of certain instructions.

For non-floating point instructions with period (.) suffixes that set the CR, the result of the operation is compared to zero.

  • If the result is greater than zero, then bit 1 of the CR field is set (0x4).
  • If it is less than zero, then bit 0 is set (0x8).
  • If the result is equal to zero, then bit 2 is set (0x2).

For all compare instructions, the two values are compared, and any CR field can be set (not just CR0). Table 2 lists the bits and their corresponding meanings (there are eight such sets of 4 bits in the condition register, called "cr0, cr1, cr2 … cr7").

Table 2. Bits of a CR field and the meanings of different settings
Bit   Name   Description
0     LT     RA < 0
1     GT     RA > 0
2     EQ     RA = 0
3     U      Overflow for integer operations; unordered for floating point operations

Note:
For floating point instructions with a period suffix, CR1 is set to the upper 4 bits of the FPSCR.
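The bit values above, and the way a bc instruction tests them, can be modeled in plain C. This is a toy model for illustration only: the function names are hypothetical, and the overflow bit (bit 3) is ignored.

```c
/* Toy model: the 4-bit CR0 field left by a record-form (period suffix)
   integer instruction, with bit 3 left at 0. */
static unsigned cr0_after(long result)
{
    if (result < 0) return 0x8;      /* bit 0, LT */
    if (result > 0) return 0x4;      /* bit 1, GT */
    return 0x2;                      /* bit 2, EQ */
}

/* Toy model of "bc 0xC,BI,target": the branch is taken when bit BI of
   the field is 1. Bits are numbered from the left, so bit 0 is 0x8. */
static int bc_true_taken(unsigned cr_field, int bi)
{
    return (int)((cr_field >> (3 - bi)) & 1u);
}
```

For a negative result, bc_true_taken(cr0_after(result), 0) reports a taken branch, which corresponds to the LT test in Listing 12; passing 1 instead of 0 models the GT test in Listing 13.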


The volatile attribute

Making an inline asm block "volatile", as in this example, ensures that the compiler, as it optimizes, does not move any instructions above or below the block of asm statements.

asm volatile("  addic. %0,%1,%2\n" : "=r"(res) : "r"(a), "r"(b));

This can be particularly important in cases when the code is accessing shared memory. This will be illustrated in the next section on multithreaded locking.


Multithreaded locking

One of the most common uses of inline asm is in writing short segments of instructions to manage multithreaded locks. Because of the loose memory model on the POWER architecture, constructing such locks requires careful use of a pair of instructions:

  • One instruction that loads the lock word and creates a "reservation"
  • Another that updates the lock word if the reservation hasn't been lost in the interim

Note:
If the reservation has been lost, a loop can be used to retry repeatedly.

Listing 15 shows a basic inline function that attempts to acquire a lock (there are several problems with this code, which we discuss after these examples).

Listing 15. Example of Acquire lock function coded in asm

inline bool acquireLock(int *lock){
	bool returnvalue = false;
	int lockval;
	asm (
	"0: lwarx %0,0,%2  \n" //load lock and reserve
	"   cmpwi 0,%0,0   \n" //compare the lock value to 0
	"   bne 1f         \n" //not 0 then exit function
	"   ori %0,%0,1    \n" //set the lock to 1
	"   stwcx. %0,0,%2 \n" //try to acquire the lock
	"   bne 0b         \n" //reservation lost, try again
	"   ori  %1,%1,1   \n" //set the return value to true
	"1:                \n" //didn't get lock, return false
	: "+r" (lockval), "+r" (returnvalue)
	: "r"(lock)            //parameter lock is an address
	: "cr0" );             //cmpwi, stwcx both clobber cr0
   return returnvalue;
}

Listing 16 is an example of how this inline function could be used.

Listing 16. Example of how the acquireLock function can be used

if (acquireLock(lockWord)){
   //begin to use the shared region
   temp = x + 1;
    .  .  .
}

Because the function is inline, the resulting code won't have an actual call in it. Instead, it will precede the use of the shared region x with the instructions to acquire the lock.

The first problem to notice with this code is the lack of a synchronization instruction. One of the key performance enhancements enabled by the loose memory model of the POWER architecture is the ability of the machine to reorder loads and stores to make more efficient use of internal pipelines. However, there are times when the programmer needs to curtail this reordering to some degree to properly access shared storage. In the case of a lock, you would not want a load of data from the shared region ("x" in the case above) to be reordered so that it occurs before the lock on the region is acquired. For this reason, a synchronization instruction should be inserted to tell the machine to limit reordering in this case. The sync instruction is often used for this purpose, but there are others available, as described in the POWER ISA (see Resources). In the code example in Listing 17, we inserted a sync instruction to prevent reordering of loads of "x" (this is called an "import barrier"):

Listing 17. Sync example

inline bool acquireLock(int *lock){
	bool returnvalue = false;
	int lockval;
	asm (
	"0: lwarx %0,0,%2  \n" //load lock and reserve
	"   cmpwi 0,%0,0   \n" //compare the lock value to 0
	"   bne 1f         \n" //not 0 then exit function
	"   ori %0,%0,1    \n" //set the lock to 1
	"   stwcx. %0,0,%2 \n" //try to acquire the lock
	"   bne 0b         \n" //reservation lost, try again
	"   sync          \n" //import barrier
	"   ori  %1,%1,1   \n" //set the return value to true
	"1:                \n" //didn't get lock, return false
	: "+r" (lockval), "+r" (returnvalue)
	: "r"(lock)            //parameter lock is an address
	: "cr0" );             //cmpwi, stwcx both clobber cr0
   return returnvalue;
}

In that asm block, the sync will prevent any subsequent loads from occurring until after it is known which way the preceding branch went. That way the variable x will not be loaded unless the branch was not taken and the acquireLock returns true.

So, are we set now? Unfortunately not. We still have to worry what the compiler might do.

Modern optimizing compilers can be very aggressive in moving code around -- and even removing it completely -- if it appears that the changes might make the program run faster without changing the semantics of the code. However, compilers typically aren't aware of the complexities involved with accessing shared memory. For example, a compiler might move the statement temp = x + 1; to a place higher in the program if it determines that the result would be scheduled more efficiently (and it assumes that the "if" is usually taken). Of course, that would be disastrous from the viewpoint of accessing shared data. To prevent the movement of any loads (or any instructions at all) from below the inline asm to a location above it, you can use the keyword "volatile" (also known as the volatile attribute) to modify the asm block, as Listing 18 shows.

Listing 18. Volatile keyword example

inline bool acquireLock(int *lock){
	bool returnvalue = false;
	int lockval;
	asm volatile (
	"0: lwarx %0,0,%2  \n" //load lock and reserve
	. . .
	"1:                \n" //didn't get lock, return false
	: "+r" (lockval), "+r" (returnvalue)
	: "r"(lock)            //parameter lock is an address
	: "cr0" );             //cmpwi, stwcx both clobber cr0
   return returnvalue;
}

When you do this, an internal fence is placed before and after the asm block that prevents instructions from being moved past it. And remember that this asm block is inlined, so it will prevent the access to x from being moved above the asm-implemented lock.
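Putting the pieces together, a complete try-lock might look like the sketch below. It is an illustration under several assumptions, not the authors' final code: the names acquire_lock and release_lock, the predefined-macro test, and the fallback built on the GCC-style __atomic builtins are additions so that the example can be built and exercised on other machines. It also lists "memory" in the clobbers, which is the cautious choice examined in the next section, and it initializes lockval so that the "+r" operand never reads an indeterminate value.

```c
#include <stdbool.h>

/* Try once to take the lock word at *lock; true means the caller owns it. */
static inline bool acquire_lock(int *lock)
{
#if defined(__powerpc__) || defined(__PPC__) || defined(_ARCH_PPC)
    int got = 0;
    int lockval = 0;
    __asm__ volatile (
        "0: lwarx  %0,0,%2  \n"      /* load lock word and reserve          */
        "   cmpwi  0,%0,0   \n"      /* is the lock already held?           */
        "   bne-   1f       \n"      /* yes: leave got at 0                 */
        "   ori    %0,%0,1  \n"      /* new lock value: 1                   */
        "   stwcx. %0,0,%2  \n"      /* store if reservation still held     */
        "   bne-   0b       \n"      /* reservation lost: try again         */
        "   sync            \n"      /* import barrier                      */
        "   ori    %1,%1,1  \n"      /* report success                      */
        "1:                 \n"
        : "+r"(lockval), "+r"(got)
        : "r"(lock)
        : "cr0", "memory");
    return got != 0;
#else
    /* Fallback (assumption): compiler builtin instead of the asm above. */
    int expected = 0;
    return __atomic_compare_exchange_n(lock, &expected, 1, false,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
#endif
}

/* Hypothetical counterpart: give the lock back with a release store. */
static inline void release_lock(int *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```

A caller would bracket its use of the shared region with acquire_lock(&lockWord) and release_lock(&lockWord), retrying or doing other work when the first call returns false.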


Memory clobbers in multithreaded locking

The discussion of multithreaded locking would not be complete without a mention of memory clobbers. The keyword memory is often added to the clobber list in such situations, although it is not always clear why it would be needed. The use of memory in the clobbers list means that memory is altered unpredictably by the asm block.

However, memory modifications in the locking example given are quite predictable. Although the variable lock is a pointer (that points to a lock location), that isn't any more unpredictable than the expression "*lock" in a C program. In that case, a well-behaved compiler would likely associate the expression "*lock" with all variables of the appropriate type, and so would correctly reload any affected variables after the pointer was used for modifying data. Nonetheless, the use of memory clobbers appears to be a pervasive practice, which is probably driven by an abundance of caution when dealing with shared regions. Programmers should be aware, though, of the performance penalties involved and of alternative approaches.

When an inline asm includes "memory" in the clobbers list, it means that any variable in the program might have been modified by the asm, so it must be reloaded before it is used. This requirement can pretty much put a sledgehammer to optimization efforts by the compiler. A potentially lighter-weight approach would be to make the shared region volatile (in addition to the asm block itself). Making a variable volatile means its value must be reloaded before it is used in any given expression. If the shared region in question is a data structure, such as a list or queue, this will ensure that the updated structure is reloaded after the lock is acquired. However, all of the non-shared data accesses can enjoy the full complement of compiler optimizations.

Tip:

If the shared data structure is accessed by a pointer (say *p), be sure to declare the pointer so that you indicate that it's the object pointed to that is volatile, not the pointer itself. For example, this declares that the list pointed to by p is volatile:

volatile list *p
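The placement of the keyword matters, as the following sketch shows. The list type and the function sum_shared are hypothetical; they stand in for whatever shared structure the lock protects.

```c
/* Hypothetical node type for a shared singly linked list. */
typedef struct list {
    int value;
    struct list *next;
} list;

/* "volatile list *p": the nodes are volatile, so every read of
   p->value and p->next is a fresh load from memory.
   By contrast, "list *volatile q" would make only the pointer q
   itself volatile, and reads through it could still be cached. */
static int sum_shared(volatile list *p)
{
    int total = 0;
    while (p != 0) {
        total += p->value;           /* reloaded on every iteration */
        p = p->next;
    }
    return total;
}
```

Only accesses made through p pay for the extra loads; other data in the same function can still be optimized normally.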

Acknowledgments

Thank you Ian McIntosh, Christopher Lapkowski, Jim McInnes, and Jae Broadhurst. You've each played an important role in publishing this article.

Resources

Learn

Get products and technologies

  • Download a free trial version of Rational software.
  • Evaluate other IBM software in the way that suits you best: Download it for a trial, try it online, use it in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement service-oriented architecture efficiently.
