In this article, we discuss several use scenarios for inline assembly, also called inline asm. For beginners, we introduce basic syntax, operand referencing, constraints, and common pitfalls that new users need to be aware of. For intermediate users, we discuss the clobbers list, as well as branching topics that facilitate the use of branch instructions within inline asm stanzas in their C/C++ code. Lastly, we discuss memory clobbers and the volatile attribute for advanced users who use inline asm to optimize their code. We conclude with an example of multithreaded locking with inline asm.
In the asm block shown in Listing 1, the
addc. instruction is used to add two
variables, op1 and
op2. In any asm block, the assembly
instructions appear first, followed by the outputs, the inputs,
and the clobbers, each section introduced by a colon. The assembly
instructions can consist of one or more quoted strings. The first
colon introduces the output operands; the second colon introduces
the input operands. If there are clobbered registers, they are
listed after the third colon. If there are no clobbered registers
for the asm block, the third colon can be omitted, as Listing 2
shows.
Listing 1. Opcodes, inputs, outputs, and clobbers
int res=0;
int op1=20;
int op2=30;
asm ( " addc. %0,%1,%2 \n"
: "=r"(res)
: "b"(op1), "r"(op2)
: "r0" );
Listing 2. No clobbered registers for the asm block, so third colon omitted
asm ( " addc. %0,%1,%2 \n"
: "=r"(res)
: "b"(op1), "r"(op2) );
Note:
The clobbers list is discussed later in this section.
Each instruction "expects" inputs and outputs to be passed in a
certain format. In the previous example, the
addc. instruction expects its
operands to be passed through registers, hence
op1 and
op2 are passed into the asm block
with the "b" and
"r" constraints. For a complete
listing of all legal asm constraints for the IBM XL C and C++
compiler, see the compiler language reference.
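As a minimal sketch of how the output and input sections pair up with the %0, %1, and %2 placeholders, the helper below wraps a single add instruction. The function name add_words and the preprocessor guard are assumptions made for illustration: the guard relies on macros that POWER compilers commonly predefine, and on any other target the function falls back to plain C so that the example still builds.

```c
/* add_words is a hypothetical helper, not one of the article's listings. */
int add_words(int op1, int op2)
{
    int res;
#if defined(__powerpc__) || defined(__PPC__) || defined(_ARCH_PPC)
    /* %0 is the output (res); %1 and %2 are the inputs, in order. */
    __asm__ ("add %0,%1,%2"
             : "=r"(res)              /* output operands */
             : "r"(op1), "r"(op2));   /* input operands; nothing clobbered */
#else
    /* Assumed fallback for non-POWER hosts so the sketch still compiles. */
    res = op1 + op2;
#endif
    return res;
}
```

Because the add instruction touches no register other than its operands and does not set the condition register, the third colon and the clobbers list are omitted here.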
Register constraints on variable declarations
In some programs, you will want to tie variables to certain
hardware registers. This is done at the variable declaration.
The following example ties the variable
res to
GPR0 throughout the life of the
program:
int register res asm("r0")=0;
When the variable type is not matched with the type of target hardware register, you will receive a compilation error notice.
After a variable is tied to a specific register, it is not
possible to use another register to hold the same variable. For
example, the following code will cause a compilation error: the
variable res is associated at
declaration time with GPR0, but in
the asm block, the "=b" constraint asks for any register except
GPR0 to hold
res.
Listing 3. Compilation error when conflicting constraints are used on a variable
int register res asm("r0")=0;
asm ( " addc. %0,%1,%2 \n"
: "=b"(res)
: "b"(op1), "r"(op2)
: "r0" );
In the example in Listing 4, there is no output operand
for the stw instruction, hence the
outputs section of the asm is empty. None of the registers is
modified, so they are all input operands, and the target address
is passed in with the input operands. However, something is
modified: the addressed memory location. But that location is
not explicitly mentioned in the instruction, so the output of
the instruction is implicit rather than explicit.
Listing 4. Instructions with no output operands
int res [] = {0,0};
int a=45;
int *pointer = &res[0];
asm ( " stw %0,%1(%2) \n"
:
: "r"(a), "i"(sizeof(int)),"r"(pointer));
Listing 5. Instructions with preserved operands
int res [] = {0,0};
int a=45;
int *pointer = &res[0];
asm ( " stw %1,%2(%3) \n"
: "+r"(res[0])
: "r"(a), "i"(sizeof(int)),"r"(pointer));
If you want to preserve the initial value of a
result variable that the asm block does not necessarily
modify, use the + (plus sign) constraint, which marks the
operand as both read and written, as Listing 5 shows
with res[0]. Note that adding the output operand shifts the
numbering of the input operands by one.
Target memory addresses in inline asm
If an instruction specifies two of its arguments in a form
similar to D(RA),
where D is a literal value and
RA is a general register, then this
is taken to mean that D+RA is an
effective address. In this case, the appropriate constraints are
"m" or "o". Both "m" and "o" refer to memory arguments.
Constraint "o" is described as an offsettable memory
location. But in the IBM® POWER® architecture,
nearly all memory references require an offset, so "m" and "o"
are equivalent. In this case, you can use a single constraint to
refer to two operands in the instruction. Listing 6 is an
example.
Listing 6. A single constraint to refer to two operands in the instruction
int res [] = {0,0};
int a=45;
asm ( " stb %1,%0 \n"
: "=m"(res[1])
: "r"(a));
The form of the instruction stb (from
the assembly language reference) is:
stb RS,D(RA).
Although the stb instruction
technically takes three operands (a source register, an address
register, and an immediate displacement), the asm description of
it uses only two constraints. The
"=m" constraint is used to notify the
compiler that the memory address of
res[1] is to be used for the result of
the store instruction. The "=m" indicates that the
operand is a modified memory location. You do not need to know
the address of the target location beforehand, because that task
is left to the compiler. This allows the compiler to choose the
right register (r1 for an automatic
variable, for instance) and apply the right displacement
automatically. This is necessary, because it would generally be
impossible for an asm programmer to know what address register
and what displacement to use. In other instances, you can also
override this behavior by manually calculating the target
address as in the following example.
Listing 7. Manually calculating the target address
int res [] = {0,0};
int a=45;
asm ( " stb %0,%1(%2) \n"
:
: "r"(a), "i"(sizeof(int)),"r"(&res));
In this code, the specification %1(%2)
represents a base address and an offset, where
%2 represents the base address,
res[0], and
%1 represents the offset,
sizeof(int). As a result, the store
is performed at the effective address of
res[1].
Note:
For some instructions, GPR0
cannot be used as a base address. Specifying GPR0 tells the
assembler not to use a base register at all. To ensure that the
compiler does not choose r0 for an
operand, you can use the constraint
"b" rather than
"r".
Addressing modes for POWER and PowerPC instructions
The IBM POWER architecture is a RISC architecture. Instructions typically operate either with three register arguments (two registers for source arguments, one register to hold a result) or with two registers and an immediate value (one register and one immediate value for the source arguments, and one register to hold the result). There are exceptions to this pattern, but mostly it is true.
Among the instructions that take two registers and an immediate
value, there are two special subclasses: load instructions and
store instructions. These instructions use the immediate value
as an offset to the value in the source register to form an
"effective address." The offset value is typically an offset
onto the stack (r1 is the stack
pointer), or it is an offset to the TOC (Table of Contents --
r2 is the TOC pointer). The TOC is
used to promote the construction of position-independent code,
which enables efficient dynamic loading of shared libraries on
these machines.
When using inline asm, you do not have to use specific registers
nor manually construct effective addresses. The argument
constraints are used to direct the compiler to choose registers
or construct effective addresses appropriate to the requirements
of the instructions. Thus, if a general register is required by
the instruction, you could use either the
"r" or "b"
constraint. The "b" constraint is of
interest, because many instructions treat the designation of
register 0 specially: a designation
of register 0 does not mean that
r0 is used, but instead a literal
value of 0. For these instructions,
it is wise to use "b" to denote the
input operands to prevent the compiler from choosing
r0. If the compiler chooses
r0, and the instruction takes that to
mean a literal 0, the instruction
would produce incorrect results.
Listing 8. r0 and its special meaning in the stbx instruction
char res[8]={'a','b','c','d','e','f','g','h'};
char a='y';
int index=7;
asm (" stbx %0,%1,%2 \n"
:
: "r"(a), "r"(index), "r"(res) );
Here, the expected result string is
abcdefgy, but if the compiler
chose r0 for %1, then the result
would incorrectly be ybcdefgh. To
prevent this from happening, use "b",
as Listing 9 shows.
Listing 9. Using "b" constraint to signify non-zero GPR
char res[8]={'a','b','c','d','e','f','g','h'};
char a='y';
int index=7;
asm (" stbx %0,%1,%2 \n"
:
: "r"(a), "b"(index), "r"(res) );
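To see why the two result strings differ, here is a small C model of how stbx forms its effective address. It is a sketch, not a description of the hardware in full: the register file, the flat mem array, the model address 16 for res, and the function names are all assumptions made for illustration. The one rule it encodes is the one described above: when the RA field is 0, the value 0 is used instead of the contents of r0.

```c
#include <string.h>

/* Hypothetical model: 32 general registers and a tiny flat memory. */
static long gpr[32];
static char mem[32];

/* Model of "stbx RS,RA,RB": store the low byte of RS at (RA|0) + RB. */
static void stbx_model(int rs, int ra, int rb)
{
    long base = (ra == 0) ? 0 : gpr[ra];   /* RA field 0 means literal 0 */
    mem[base + gpr[rb]] = (char)gpr[rs];
}

/* Replays the store from Listing 8 with the index held in register ra_reg,
   then copies the resulting 8 characters (plus a terminator) into out. */
void run_store(int ra_reg, char out[9])
{
    memset(gpr, 0, sizeof gpr);
    memcpy(&mem[16], "abcdefgh", 8);   /* res lives at model address 16 */
    gpr[5] = 'y';                      /* a */
    gpr[ra_reg] = 7;                   /* index */
    gpr[6] = 16;                       /* address of res */
    stbx_model(5, ra_reg, 6);
    memcpy(out, &mem[16], 8);
    out[8] = '\0';
}
```

Calling run_store with register 0 drops the index and overwrites the first character, while any other register number stores at offset 7 as intended.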
Another example is the following asm block. Although it appears to compute res=res+4, that is not what the code actually does.
Listing 10. Meaning of r0 in the second operand with addi opcode
int register res asm("r0")=5;
int b=4;
asm ( " addi %0,%0,%1 \n"
: "+r"(res)
: "i"(b)
: "r0");
where:
addi %0(result operand),%0(input operand res),%1(immediate operand b)
Because res is tied to
r0, the asm code translates
to the following assembly:
addi 0,0,4
The second operand does not translate to register zero. Instead,
it translates to the immediate number zero. In effect, the
following is the result of the addi operation:
res=0+4
This case is special to the
addi opcode. If, instead,
res were tied to
r1, then the original intended
behavior would have been
obtained:
res=res+4
In cases when registers that are not directly tied to the inputs/outputs are used within the asm block, the user must specify such registers within the clobbers list.
The clobbers list is used to notify the compiler that the registers contained within the list can potentially have their values altered. Hence, the compiler should not use them to hold any other data across the asm block.
In the example in Listing 11, registers 8 and 7 are added to the
clobbers list because they are used in the instructions but are
not explicitly tied to any of the input/output operands. Also,
condition register field zero is added to the clobbers list for
the same reason. Although it is not present in the input/output
operands, the mfocrf instruction
reads that field from the condition register and moves the value
into register 8.
Listing 11. Clobbers list example
asm (" addc. %0,%2,%3 \n"
" mfocrf 8,0x1 \n"
" andi. 7,8,0xF \n"
" stw 7,%1 \n"
: "=r"(res),"=m"(c_bit)
: "b"(a), "r"(b)
: "r0","r7","r8","cr0" ); /* clobbers list */
If, instead, the mfocrf instruction
read from condition register field 1 (cr1), then that field
would need to be added to the clobbers list instead. Also, the
period [full stop] at the end of the
addc. and
andi. instructions means their
results are compared to zero, and the result of the comparison
is stored in condition register field 0.
When clobbered registers are omitted from the clobbers list, the results from the asm operations might not be correct. This is because such clobbered registers might be reused to hold intermediate values for other operations. Unless the compiler detects that those registers are clobbered, the intermediate data can be used to perform the programmer's instructions, with inaccurate results. Also, the user's asm instructions may clobber values used by the compiler.
Exceptions to the clobbers list
Nearly all registers can be clobbered, except for those listed in Table 1.
Table 1. Registers that cannot be clobbered
| Register | Description |
|---|---|
| r1 | stack pointer |
| r2 | TOC pointer |
| r11 | environment pointer |
| r13 | 64-bit mode thread local data pointer |
| r30 | often used by the compiler as a stack frame pointer, pointer to constant area |
| r31 | often used by the compiler as a stack frame pointer, pointer to constant area |
Memory clobber implies a fence, and it also impacts how the compiler treats potential data aliases. A memory clobber says that the asm block modifies memory that is not otherwise mentioned in the asm instructions. So, for example, a correct use of memory clobbers would be when using an instruction that clears a cache line. The compiler will assume that virtually any data may be aliased with the memory changed by that instruction. As a result, all required data used after the asm block will be reloaded from memory after the asm completes. This is much more expensive than the simple fence implied by the "volatile" attribute (discussed later).
Remember, because the memory clobber says anything might be aliased, everything that is used needs to be reloaded after the asm, regardless of whether it had anything to do with the asm. To add a memory clobber, put the word "memory" in the clobbers list in place of a register name.
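A minimal sketch of the syntax follows. The empty asm statement emits no instructions, but its "memory" clobber still tells the compiler that any memory might have changed. The function name is hypothetical, and the __asm__ spelling is the GNU-style alternate keyword, used here on the assumption that the compiler accepts it even in strict ISO modes.

```c
/* read_twice_sum is a hypothetical function used only for illustration. */
int read_twice_sum(const int *p)
{
    int first = *p;

    /* No instructions are emitted; the "memory" clobber alone makes the
       compiler discard any values it has cached from memory. */
    __asm__ volatile ("" : : : "memory");

    int second = *p;   /* loaded again rather than reusing first */
    return first + second;
}
```

Without the clobber, an optimizing compiler would be free to load *p once and reuse the value; with it, every memory value that is live across the asm is treated as stale, which is the cost described above.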
Branching can be tricky with inline asm, because a branch instruction needs the address of its target, and that address is not known before compile time. Instead of an address, you can use labels: a label designates the branch target with a unique identifier that the branch instruction can reference.
Within a single source file, labels cannot be repeated within an inline asm block, nor within neighboring asm blocks within the same source. In a given program, each label is unique. There is an exception to this rule, however: relative branching (more on this later). With relative branching, more than one label with the same identifier can be found within the same program and within the same asm block.
Note:
Labels cannot be used in asm to define macros because of
possible namespace clashes.
In the example in Listing 12, the branch occurs when the LT bit, bit 0, of the condition register is set. If it is not set, then the branch is not taken.
Listing 12. Example of branch taken when LT bit of CR0 is set (0x80000000)
asm ( " addic. %0,%2,%4 \n"
" bc 0xC,0,here \n"
" there: add %1,%2,%3 \n"
" here: mul %0,%2,%3 \n"
: "=r"(res),"=r"(res2)
: "r" (a),"r"(b),"r"(c)
: "cr0" );
Likewise, a branch would occur if the GT bit (bit 1) of the condition register is set, as in the code in Listing 13.
Listing 13. Example of branch taken when GT bit of CR0 is set (0x40000000)
asm ( " addic. %0,%2,%4 \n"
" bc 0xC,1,here \n"
" there: add %1,%2,%3 \n"
" here: mul %0,%2,%3 \n"
: "=r"(res),"=r"(res2)
: "r" (a),"r"(b),"r"(c)
: "cr0" );
With inline asm, it is perfectly legal to branch within the same asm block; however, it is not possible to branch between different asm blocks, even if they are contained within the same source.
As discussed earlier, relative branching allows you to reuse the name of a label more than once within the same program. It is predominantly used, however, to dictate the position of the target address relative to the branch instruction. These are examples of the relative branch codes that can be used:
- f: forward
- b: backward
Note:
These suffixes must be appended to numeric
labels to be syntactically correct.
In this example (Listing 14), notice that the branch targets are referenced as "20f" and "10b". In each case, we use the label of the target address appended with a suffix that dictates where this label is located relative to the branch instruction itself. The label "10" is located before the branch instruction that refers to it, hence the use of the "b" suffix in "10b"; the label "20" is located after its branch instruction, hence the "f" suffix in "20f".
Listing 14. Relative branching with numeric labels
asm ( " 10: lwarx %0,0,%2 \n"
" cmpwi %0,0 \n"
" bne- 20f \n"
" ori %0,%0,1 \n"
" stwcx. %0,0,%2 \n"
" bne- 10b \n"
" sync \n"
" ori %1,%1,1 \n"
" 20: \n"
: "+r"(lockval), "+r"(returnvalue) /* operands as in Listing 17 */
: "r"(lock)
: "cr0" );
The condition register is used to capture information on results of certain instructions.
For non-floating point instructions with period (.) suffixes that set the CR, the result of the operation is compared to zero.
- If the result is greater than zero, then bit 1 of the CR field is set (0x4).
- If it is less than zero, then bit 0 is set (0x8).
- If the result is equal to zero, then bit 2 is set (0x2).
For all compare instructions, the two values are compared, and
any CR field can be set (not just
CR0). Table 2 lists the bits and
their corresponding meanings (there are eight such sets of 4
bits in the condition register, called "cr0, cr1, cr2 …
cr7").
Table 2. Bits of a CR field and the meanings of different settings
| Bit | Name | Description |
|---|---|---|
| 0 | LT | RA < 0 |
| 1 | GT | RA > 0 |
| 2 | EQ | RA = 0 |
| 3 | U | Overflow for integer operations. Unordered, for floating point operations |
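The three cases listed above for dot-suffixed integer instructions can be modeled in a few lines of C. This is only a sketch: the enum names and the function cr0_for_result are hypothetical, and the fourth bit of the field is not modeled.

```c
/* Bit values within a 4-bit CR field, matching the masks given above. */
enum { CR_LT = 0x8, CR_GT = 0x4, CR_EQ = 0x2 };

/* Hypothetical model: the LT/GT/EQ bits that a record-form (dot-suffixed)
   integer instruction would set in CR0 for the given result. The fourth
   bit (0x1) is left clear because this sketch does not model overflow. */
unsigned cr0_for_result(long result)
{
    if (result < 0) return CR_LT;
    if (result > 0) return CR_GT;
    return CR_EQ;
}
```

For instance, a result of -3 yields 0x8 (LT), which is the bit tested by the branch in Listing 12.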
Note:
For floating point instructions with
a period suffix, CR1 is set to the upper 4 bits of the
FPSCR.
Blocking the Volatile attribute
Making an inline asm block "volatile", as in this example, ensures that, as it optimizes, the compiler does not move any instructions above or below the block of asm statements.
asm volatile(" addic. %0,%1,%2\n" : "=r"(res) : "r"(a), "r"(b));
This can be particularly important in cases when the code is accessing shared memory. This will be illustrated in the next section on multithreaded locking.
One of the most common uses of inline asm is in writing short segments of instructions to manage multithreaded locks. Because of the loose memory model on the POWER architecture, constructing such locks requires careful use of a pair of instructions:
- One instruction that loads the lock word and creates a "reservation"
- Another that updates the lock word if the reservation hasn't
been lost in the interim
Note:
If the reservation has been lost, a
loop can be used to retry repeatedly.
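As a portable model of this reserve-then-update pattern, the sketch below uses standard C11 atomics rather than inline asm. It is a swapped-in technique, not the approach this article develops, and the function names are hypothetical. On POWER, compilers typically implement such a compare-and-swap with a load-reserve and store-conditional loop, which is why it can serve as a reference point.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: try once to change the lock word from 0 (free)
   to 1 (held). Returns true if this caller now owns the lock. */
bool try_acquire_lock(atomic_int *lock)
{
    int expected = 0;
    /* Succeeds only if *lock still holds 0 when the update is attempted;
       acquire ordering keeps later loads from moving ahead of the lock. */
    return atomic_compare_exchange_strong_explicit(
        lock, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

/* Hypothetical helper: mark the lock word as free again. */
void release_lock(atomic_int *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

The strong form of the compare-and-swap hides the retry loop: it fails only when the lock word is actually nonzero, not when a reservation is lost for unrelated reasons.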
Listing 15 shows a basic inline function that attempts to acquire a lock (there are several problems with this code, which we discuss after these examples).
Listing 15. Example of Acquire lock function coded in asm
inline bool acquireLock(int *lock){
bool returnvalue = false;
int lockval;
asm (
"0: lwarx %0,0,%2 \n" //load lock and reserve
" cmpwi 0,%0,0 \n" //compare the lock value to 0
" bne 1f \n" //not 0 then exit function
" ori %0,%0,1 \n" //set the lock to 1
" stwcx. %0,0,%2 \n" //try to acquire the lock
" bne 0b \n" //reservation lost, try again
" ori %1,%1,1 \n" //set the return value to true
"1: \n" //didn't get lock, return false
: "+r" (lockval), "+r" (returnvalue)
: "r"(lock) //parameter lock is an address
: "cr0" ); //cmpwi, stwcx both clobber cr0
return returnvalue;
}
Listing 16 is an example of how this inline function could be used.
Listing 16. Example of how the acquireLock function can be used
if (acquireLock(lockWord)){
//begin to use the shared region
temp = x + 1;
. . .
}
Because the function is inline, the resulting code won't have an actual call in it. Instead, it will precede the use of the shared region x with the instructions to acquire the lock.
The first problem to notice with this code is the lack of a
synchronization instruction. One of the key performance
enhancements enabled by the loose memory model of the POWER
architecture is the ability of the machine to reorder loads and
stores to make more efficient use of internal pipelines.
However, there are times when the programmer needs to curtail
this reordering to some degree to properly access shared
storage. In the case of a lock, you would not want a load of
data from the shared region ("x" in the case above) to be
reordered so that it occurs before the lock on the region is
acquired. For this reason, a synchronization instruction should
be inserted to tell the machine to limit reordering in this
case. The sync instruction is often
used for this purpose, but there are others available, as
described in the POWER ISA (see Resources). In the code example in Listing 17, we
inserted a sync instruction to prevent
reordering of loads of "x" (this is called an "import
barrier"):
Listing 17. Sync example
inline bool acquireLock(int *lock){
bool returnvalue = false;
int lockval;
asm (
"0: lwarx %0,0,%2 \n" //load lock and reserve
" cmpwi 0,%0,0 \n" //compare the lock value to 0
" bne 1f \n" //not 0 then exit function
" ori %0,%0,1 \n" //set the lock to 1
" stwcx. %0,0,%2 \n" //try to acquire the lock
" bne 0b \n" //reservation lost, try again
" sync \n" //import barrier
" ori %1,%1,1 \n" //set the return value to true
"1: \n" //didn't get lock, return false
: "+r" (lockval), "+r" (returnvalue)
: "r"(lock) //parameter lock is an address
: "cr0" ); //cmpwi, stwcx both clobber cr0
return returnvalue;
}
In that asm block, the sync will prevent any subsequent loads
from occurring until after it is known which way the preceding
branch went. That way the variable x will not be loaded unless
the branch was not taken and
acquireLock returns
true.
So, are we set now? Unfortunately not. We still have to worry about what the compiler might do.
Modern optimizing compilers can be very aggressive in moving code
around -- and even removing it completely -- if it appears that
the changes might make the program run faster without changing
the semantics of the code. However, compilers typically aren't
aware of the complexities involved with accessing shared memory.
For example, a compiler might move the statement
temp = x + 1; to a place higher in
the program if it determines that the result would be scheduled
more efficiently (and it assumes that the "if" is usually
taken). Of course, that would be disastrous from the viewpoint
of accessing shared data. To prevent the movement of any loads
(or any instructions at all) from below the inline asm to a
location above it, you can use the keyword "volatile" (also
known as the volatile attribute) to modify the asm block, as
Listing 18 shows.
Listing 18. Volatile keyword example
inline bool acquireLock(int *lock){
bool returnvalue = false;
int lockval;
asm volatile (
"0: lwarx %0,0,%2 \n" //load lock and reserve
. . .
"1: \n" //didn't get lock, return false
: "+r" (lockval), "+r" (returnvalue)
: "r"(lock) //parameter lock is an address
: "cr0" ); //cmpwi, stwcx both clobber cr0
return returnvalue;
}
When you do this, an internal fence is placed before and after the asm block that prevents instructions from being moved past it. And remember that this asm block is inlined, so it will prevent the access to x from being moved above the asm-implemented lock.
Memory clobbers in multithreaded locking
The discussion of multithreaded locking would not be complete without a mention of memory clobbers. The keyword memory is often added to the clobber list in such situations, although it is not always clear why it would be needed. The use of memory in the clobbers list means that memory is altered unpredictably by the asm block.
However, memory modifications in the locking example given are
quite predictable. Although the variable lock is a pointer (that
points to a lock location), that isn't any more unpredictable
than the expression "*lock" in a C
program. In that case, a well-behaved compiler would likely
associate the expression "*lock" with
all variables of the appropriate type, and so would correctly
reload any affected variables after the pointer was used for
modifying data. Nonetheless, the use of memory clobbers appears
to be a pervasive practice, which is probably driven by an
abundance of caution when dealing with shared regions.
Programmers should be aware, though, of the performance
penalties involved and of alternative approaches.
When an inline asm includes "memory" in the clobbers list, it means that any variable in the program might have been modified by the asm, so it must be reloaded before it is used. This requirement can pretty much put a sledgehammer to optimization efforts by the compiler. A potentially lighter-weight approach would be to make the shared region volatile (in addition to the asm block itself). Making a variable volatile means its value must be reloaded before it is used in any given expression. If the shared region in question is a data structure, such as a list or queue, this will ensure that the updated structure is reloaded after the lock is acquired. However, all of the non-shared data accesses can enjoy the full complement of compiler optimizations.
Tip:
If the shared data structure is accessed by a pointer (say
*p), be sure to declare the pointer
so that you indicate that it's the object pointed to that is
volatile, not the pointer itself. For example, this declares
that the list pointed to by p is
volatile:
volatile list *p
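The difference between the two placements of the qualifier can be sketched as follows. The list type and both function names are hypothetical, introduced only to make the declarations concrete.

```c
/* Hypothetical list node type for illustration. */
typedef struct list {
    int value;
    struct list *next;
} list;

/* p points to volatile data: every access through p is a real load. */
int head_value(volatile list *p)
{
    return p->value;
}

/* By contrast, here only the pointer variable q itself is volatile;
   the list it points to is ordinary, non-volatile data. */
int head_value_plain(list *volatile q)
{
    return q->value;
}
```

Both functions return the same value in a single-threaded test; the distinction matters only for what the compiler is allowed to cache, so the first form is the one that suits a shared structure.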
Thank you Ian McIntosh, Christopher Lapkowski, Jim McInnes, and Jae Broadhurst. You've each played an important role in publishing this article.
Learn
- For alternatives to the sync
instruction, see the IBM Power ISA (Instruction Set Architecture) PDF.




