# Inline assembly - start from scratch

For C/C++ programmers, inline assembly is not a new feature; it helps us make full use of the available computing power. However, most programmers seldom get a chance to practice it. In fact, inline assembly only serves specific requirements, especially at the point where a high-level programming language reaches its limits.

This article presents two scenarios on the IBM POWER processor architecture. Using the examples provided in this article, we can find out where inline assembly is applied.

### Scenario 1: A better library

The C/C++ programming language supports bitwise operations. In this example, the user treats the bit as the basic unit and wrote an algorithm that calculates how many bits a 32-bit value occupies, that is, the position of its highest set bit.

Code A: Calculating the number of bits taken

```
inline int bit_taken(int data)
{
    /* Shift as unsigned: right-shifting a negative int may never reach 0. */
    unsigned int bits = (unsigned int)data;
    int taken = 0;
    while (bits) {
        bits = (bits >> 1);
        taken++;
    }
    return taken;
}
```

The code uses a loop together with a shift operation. If the user compiles the code with the highest optimization level (-O3 for gcc, or -O5 for xlc), several optimizations, such as loop unrolling and constant propagation, are applied automatically to generate faster code. The basic idea of the algorithm does not change, though.

Listing A: Description of cntlzw

Purpose

Places the number of leading zeros from a source general-purpose register in a general-purpose register.

The cntlzw instruction counts the leading zeros of a word. Suppose we have the number 15, whose 32-bit binary representation is 0000 0000 0000 0000 0000 0000 0000 1111; cntlzw reports 28 leading zeros. After reconsideration, the user decides to simplify the algorithm as shown in Code B.

Code B: Calculating the number of bits taken by inline assembly

```
#ifdef __ARCH_PPC__
inline int bit_taken(int data)
{
    int taken;
    /* cntlzw: count the leading zeros of data into taken */
    asm("cntlzw %0, %1\n\t"
        : "=b" (taken)
        : "b" (data)
    );
    return sizeof(data) * 8 - taken;
}
#else
...
#endif
```

The macro __ARCH_PPC__ restricts the new code to the PowerPC architecture. Compared with Code A, the new code eliminates all loops and shifts. Customers of the library may be happy to see `bit_taken` run much faster on PowerPC, and applications whose running time is dominated by `bit_taken` benefit even more.
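The relationship that Code B relies on (bits taken equals 32 minus the number of leading zeros) can be checked in portable C. The following is only a sketch: `count_leading_zeros32`, `bits_taken_loop`, and `bits_taken_clz` are hypothetical names that do not appear in the original library, and the loop-based zero count merely stands in for what cntlzw does in a single instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable stand-in for cntlzw: counts the leading zero
   bits of a 32-bit value and returns 32 for an input of 0. */
static int count_leading_zeros32(uint32_t value)
{
    int zeros = 0;
    uint32_t mask = UINT32_C(1) << 31;   /* start at the most significant bit */

    while (mask != 0 && (value & mask) == 0) {
        zeros++;
        mask >>= 1;
    }
    return zeros;
}

/* Same algorithm as Code A, applied to an unsigned value. */
static int bits_taken_loop(uint32_t value)
{
    int taken = 0;
    while (value) {
        value >>= 1;
        taken++;
    }
    return taken;
}

/* Code B's formula, using the stand-in instead of the instruction. */
static int bits_taken_clz(uint32_t value)
{
    int taken = 32 - count_leading_zeros32(value);
    assert(taken == bits_taken_loop(value));   /* both approaches must agree */
    return taken;
}
```

For the value 15 used above, the stand-in reports 28 leading zeros, so both functions report 4 bits taken.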

The story shows not only that the user can improve an algorithm by drawing on the processor's rich instruction set, but also that inline assembly is a useful assistant for performance work. Because the assembly code is embedded into C/C++, it minimizes the code changes the user has to make.

### Scenario 2: Atomic compare-and-swap (CAS)

Recently, as the whole computer industry shifts its focus to multiprocessing and multithreading, more elements, such as synchronization, inevitably enter programming. To build synchronization primitives such as semaphores and mutexes in a multithreaded environment, we often rely on an atomic operation called compare-and-swap (CAS). Listing B shows the pseudocode for CAS.

Listing B: Pseudocode for CAS

```
compare_and_swap (*p, oldval, newval):
    if (*p == oldval)
        *p = newval;
        success;
    else
        fail;
```

In Listing B, the content of a memory location p (*p) is first compared with a known value oldval (which should be the value of *p as last seen by the current thread). Only if they are the same is newval written to *p. The comparison fails when another thread has modified the memory location in the meantime.
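The semantics of Listing B can be written down in plain C as a reference. This is a sketch under a loud assumption: `cas_semantics` is a hypothetical name, and the function is deliberately not atomic. It only shows what a successful and a failed CAS do to *p.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference version of Listing B. It shows the semantics
   only: it is NOT atomic, because another thread could modify *p between
   the comparison and the store. */
static int cas_semantics(int *p, int oldval, int newval)
{
    assert(p != NULL);
    if (*p == oldval) {
        *p = newval;
        return 1;   /* success: *p held oldval and now holds newval */
    }
    return 0;       /* fail: *p no longer holds oldval and is left untouched */
}
```

The gap between the comparison and the store in this version is exactly what a real CAS has to close.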

For this to work, CAS must be atomic. Atomicity is beyond what plain C/C++ code can guarantee, but it can be guaranteed by a short piece of inline assembly code. Code C shows a simple CAS implemented for the PowerPC architecture.

Code C: Simple CAS implementation on PowerPC

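```
inline int compare_and_swap (volatile int * p, int oldval, int newval)
{
    int fail;
    __asm__ __volatile__ (
        "0: lwarx %0, 0, %1\n\t"    /* load *p and set a reservation     */
        "   xor. %0, %3, %0\n\t"    /* fail = oldval ^ *p, sets cr0      */
        "   bne 1f\n\t"             /* values differ: give up            */
        "   stwcx. %2, 0, %1\n\t"   /* store newval if reservation holds */
        "   bne- 0b\n\t"            /* reservation lost: retry           */
        "   isync\n\t"
        "1: "
        : "=&r"(fail)
        : "r"(p), "r"(newval), "r"(oldval)
        : "cr0", "memory");         /* *p is written, so memory changes  */
    return fail;                    /* 0 on success, nonzero on failure  */
}
```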

The code snippet implements the pseudocode in Listing B, but it may seem too complex for now. We will return to it after introducing the basic syntax.

However, to sum up, inline assembly is usually required under two conditions:

• Code optimization

Inline assembly may be helpful when the performance requirement is critical. As we can see from scenario 1, tuning compiler options is not always the best choice. A handy piece of inline assembly code can improve program performance considerably.

• Hardware operations/OS services

The capability of C/C++ is limited in scenario 2. The latest features always need time to be standardized and implemented by compilers. As a result, to use up-to-date hardware instructions, OS services, and so on, we often resort to inline assembly. And most of the time, it is the best choice.

There might still be other reasons for making use of inline assembly. But generally, inline assembly acts as a complement for C/C++, both in functionality and performance.

## Use of inline assembly

The syntax of inline assembly looks completely different from C/C++. A reasonable explanation is that inline assembly is not designed from the view of a C/C++ programmer, but rather from the view of a compiler or assembler. The general form of an inline assembly statement is shown in Listing C.

Listing C: Composition of inline assembly code block

```
__asm__ __volatile__(assembly template
    : output operand list
    : input operand list
    : clobber list
);
```

As shown in Listing C, the inline assembly is always made up of four components logically:

1. Keyword asm or __asm__, with the optional modifier volatile or __volatile__: The keyword asm or __asm__ indicates that what follows is an inline assembly code block. volatile or __volatile__ is optional and can be added after asm to prohibit some compiler optimizations. asm and __asm__ are almost the same, except that asm might cause a warning during compilation when inline assembly is used in a preprocessor macro. The same is true for volatile and __volatile__.

2. Assembly template:
The assembly template is the first portion inside the parentheses. It consists of assembly instruction lines, each enclosed in double quotation marks ("") and ended with a line separator (\n\t or \n). The syntax of inline assembly code is the same as, but much simpler than, that of general assembly code. There are several reasons for this. For example, it is not necessary to define data in an assembly template, as data should always be referred to through C/C++ variables. It is also seldom necessary to create a section (for the executable) inside the assembly template. Generally, apart from assembly instructions, only some local labels are allowed. (We discuss this later.)

Code D: Assembly template of inline assembly

`          __asm__ __volatile__ ("lwarx %0, 0, %1 \n\t" : "=&r"(ret) : "r"(p));`

Code D shows an example for assembly template.

1. The assembly instruction consists of an opcode (lwarx) and operands (%0, 0, %1).
2. If an operand of an instruction is of the register or immediate type, it can be referred to by a percent-prefixed number (%0, %1, ...).
3. The number of a register that refers to a variable follows the order in which the variable appears in the output and input lists. In Code D, ret is the first variable in the lists, so %0 is the register referring to it. Similarly, %1 refers to the variable p.
4. Only some local labels are legal in inline assembly. You might see labels such as 0 and 1 in Code C. They are the branch targets of the instructions bne- 0b and bne 1f. (The f suffix means the nearest such label forward, after the branch instruction; b means the nearest one backward, before it.)

3. Input/Output operand list
The output list and the input list each start with a colon (:), and their entries are separated by commas (,). The lists specify the variables used in the assembly template and their constraints. Consider Code D, for example: lwarx computes the effective address as the value 0 plus the value of register %1, and reads a word from that address into register %0. Here, %0 is an output operand because it receives the result and is written, and %1 is an input. Therefore, ret (referred to by %0) is put in the output list, while p (referred to by %1) is put in the input list.
Each variable listed in the input/output operand list:

• Must have a constraint. For example, the constraint in "=&r"(ret) is r, which means ret may be allocated to any general-purpose register.
• May have optional constraint modifiers. For example, the modifiers in "=&r"(ret) are = and &. = means that the variable is write-only, and & (early clobber) means that the operand is modified before the instructions finish using the input operands, so it cannot share a register with any input operand. For more information, refer to A guide to inline assembly for C and C++.

Constraints differ between platforms. In practice, the compiler's documentation provides the details.

4. Clobber list
The clobber list notifies the compiler that some registers or memory are clobbered by an inline assembly block. A clobber list looks similar to an input/output list (it begins with a colon, and its entries are separated by commas), but it only takes register names (such as r1, f15) or memory as its entries.

In the Code C example, the inline assembly code implicitly clobbers a condition register field (xor. and stwcx. both write cr0). Therefore, the cr0 register field is put into the clobber list. And, if the code may alter memory at a location the compiler cannot determine, memory can also be put in the list. We shall discuss the clobber list again in a later section.

Not all the components shown in Listing C are required. A keyword and an assembly template are sufficient to compose a basic inline assembly statement. All other parts are optional.
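To see the components side by side, the following sketch wraps a one-instruction block in a small function. It is an illustration under stated assumptions: `add_ints` and `add_ints_self_check` are hypothetical names, and the guard uses the GCC predefined macros `__powerpc__` and `__PPC__` rather than the library's own __ARCH_PPC__ macro. On any other target the function falls back to plain C.

```c
#include <assert.h>

/* Hypothetical helper: adds two integers. On PowerPC it uses a one-line
   inline assembly block; elsewhere it falls back to plain C. */
static int add_ints(int a, int b)
{
    int sum;
#if defined(__GNUC__) && (defined(__powerpc__) || defined(__PPC__))
    __asm__ (
        "add %0, %1, %2\n\t"   /* assembly template: %0 = %1 + %2    */
        : "=r" (sum)           /* output list: sum is %0, write-only */
        : "r" (a), "r" (b)     /* input list: a is %1, b is %2       */
        /* clobber list omitted: add changes nothing besides %0      */
    );
#else
    sum = a + b;               /* portable fallback for other targets */
#endif
    return sum;
}

/* Small usage example. */
static void add_ints_self_check(void)
{
    assert(add_ints(2, 3) == 5);
}
```

The operand numbers follow the order of the lists: the single output comes first (%0), followed by the inputs (%1, %2).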

Now, we get back to Code C to explain more about the instructions.

lwarx %0, 0, %1
This instruction reads memory at the effective address 0 + %1 (that is, *p) into register %0. In addition, the instruction makes a reservation that the stwcx. instruction validates later.

xor. %0, %3, %0
bne 1f

The instructions compare the value we just loaded into %0 with the oldval (%3). Branch is taken to label 1 when they are not equal, which means that the CAS operation fails.

stwcx. %2, 0, %1
bne- 0b

stwcx. checks the reservation made by lwarx. If the check succeeds, it writes the content of %2 (newval) to the effective address 0 + %1 (p). If the write fails, the branch to label 0 is taken for a retry.

isync
This instruction prevents the running of instructions following the isync until the instructions preceding the isync have completed.

Table A lists all the entries in the operand lists of Code C and their corresponding register numbers.

Table A: Constraints, modifier, and register reference for Code C

| Entry | Constraint (and modifier) | Variable referred | Register |
| --- | --- | --- | --- |
| "=&r"(fail) | =&r: write-only, early clobber, general register | fail | %0 |
| "r"(p) | r: general register | p | %1 |
| "r"(newval) | r: general register | newval | %2 |
| "r"(oldval) | r: general register | oldval | %3 |

As we can see from the code, a retry step follows the store instruction stwcx. (the branch back to label 0). If another thread has updated the value at address p in the meantime, the retry finds that *p and oldval differ. Control then branches to label 1 and the CAS fails. We can detect this by comparing the variable fail with 0: it is 0 on success and nonzero on failure.
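A caller typically wraps such a CAS in its own retry loop. The sketch below shows the pattern in portable C. It rests on assumptions: it requires a C11 compiler with <stdatomic.h>, `cas_fail` is a hypothetical stand-in for Code C built on `atomic_compare_exchange_strong` (not the inline assembly itself), and `atomic_add_with_cas` is a hypothetical caller.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for Code C built on C11 atomics (an assumption: the article
   itself uses inline assembly). Returns 0 on success and nonzero on
   failure, mirroring the meaning of the fail variable. */
static int cas_fail(atomic_int *p, int oldval, int newval)
{
    /* atomic_compare_exchange_strong overwrites its second argument with
       the current value on failure, so pass a local copy. */
    int expected = oldval;
    assert(p != 0);
    return !atomic_compare_exchange_strong(p, &expected, newval);
}

/* Hypothetical caller: atomically adds delta to *p by retrying the CAS
   until no other thread has changed *p in between. Returns the new value. */
static int atomic_add_with_cas(atomic_int *p, int delta)
{
    int seen;
    do {
        seen = atomic_load(p);                       /* snapshot of *p   */
    } while (cas_fail(p, seen, seen + delta) != 0);  /* retry on failure */
    return seen + delta;
}
```

The loop reads a snapshot, computes the new value from it, and relies on the CAS to reject the update whenever the snapshot has gone stale.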

The lwarx and stwcx. instructions are special in the PowerPC architecture. They are essential for composing atomic primitives. If you are interested, you can find more information in the POWER ISA. [1] For the branching facilities, document [2] gives a good explanation.

## Common mistakes

For beginners who might tend to make mistakes, there are some guidelines that they should always check.

• Do not forget the line separators (\n\t)
• Do not forget the double quotation marks for lines ("")
• Do not mistake () for {}.

And here are some interesting mistakes we have come across:

1. Using a preprocessor macro in the inline assembly template.

Code E: Macro in inline assembly

```
// This is the intention:
__asm__ __volatile__(
    "stswi %0, %1, 4\n\t"
    :: "b" (t), "b" (b)
);
…
// Macro, does not work:
#define F 4
__asm__ __volatile__(
    "stswi %0, %1, F\n\t"
    :: "b" (t), "b" (b)
);
```

For some reason, the user may want to use a C/C++ macro in an inline assembly template. In the above example, the user tries to replace an immediate value. However, this does not work: the template is a string literal, and the preprocessor does not expand macros inside string literals, so the literal text F, rather than 4, reaches the assembler. In general, the user should not rely on the C/C++ preprocessor to act on the assembly template. The interface for passing C/C++ data into inline assembly is the input/output list. Code F shows a way to implement the user's intention.

Code F: Macro referred as immediate

```
#define F 4
__asm__ __volatile__(
    "stswi %0, %1, %2\n\t"
    : : "b" (t), "b" (b), "i"(F)
);
```

Here, we use an immediate constraint for an operand referring to the macro. The user then can change the constant globally by changing the macro definition.
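Why the macro in the failing half of Code E never takes effect can be checked in plain C, without any assembly. In this sketch, `template_text`, `macro_left_unexpanded`, `macro_value`, and `check_macro_expansion` are hypothetical names introduced for illustration only.

```c
#include <assert.h>
#include <string.h>

#define F 4

/* The template exactly as written in the failing half of Code E. */
static const char template_text[] = "stswi %0, %1, F\n\t";

/* Returns 1 if the letter F survived preprocessing inside the literal. */
static int macro_left_unexpanded(void)
{
    return strchr(template_text, 'F') != NULL;
}

/* Outside a string literal, the same macro expands to 4 as usual. */
static int macro_value(void)
{
    return F;
}

/* Small usage example. */
static void check_macro_expansion(void)
{
    assert(macro_left_unexpanded());   /* the string still contains F */
    assert(macro_value() == 4);        /* the macro itself is fine    */
}
```

The string keeps the letter F, which is the text the assembler eventually sees; only the operand list offers a way to hand the value 4 to the template.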

2. Missing the colon of the output operand list.

In Code G, the stswi instruction stores 4 bytes, starting from register %0, to the effective address held in %1. (For longer strings, the bytes are taken from consecutive registers in order: if %0 is assigned register r5, they are read from r5, r6, r7, and so on.)

In this inline assembly code, there is no output operand, because no register receives a result. The input operand list therefore includes both variables (value and base).

Code G: Missing colon

```
// Require input only:
int base[5];
int value = 0x7a;
__asm__ __volatile__(
    "stswi %0,%1,4\n\t"
    : : "b" (value), "b" (base)
);
…
// But mistaken as output :
__asm__ __volatile__(
    "stswi %0,%1,4\n\t"
    : "b" (value), "b" (base)
);
```

In the latter code, the user unfortunately misses a colon, so all the inputs now become outputs. Depending on the compiler, there might not even be a warning, but the user eventually hits an error at run time. Such a mistake might seem minor, yet it breaks everything. Moreover, it is not easy to spot, just as it is hard to spot mistaking if (a==1) for if (a=1) in C/C++ code. Therefore, beginners should pay close attention to colons.

We shall continue discussing deeper reasons for this soon.

## Inline assembly, compiler, and assembler

When writing inline assembly code, people may find that the biggest challenge is not finding the correct instructions in the specification, but making the input/output/clobber lists work properly. There may be complaints and questions, such as: Why do we need these lists? Why are there constraints and modifiers?

Here, we list such questions and their answers. We hope this helps the user understand more from the implementation's point of view. For simplicity, we only focus on instructions with register operands that refer to C/C++ variables.

Q1: Who handles the inline assembly, the compiler or the assembler? Why do I get an assembler error at compile time?

A1: The answer is both (most of the time). Usually, assemblers support the latest instructions before compilers do. Therefore, the compiler has to pass any instruction it does not recognize on to the assembler. But that does not mean the assembler handles everything. The association between variables and registers is done by the compiler. (Refer to Q2 and Q3.) The syntax check of the inline assembly statement in C/C++ is also done by the compiler, but that check does not cover the assembly instructions themselves. As a result, it is the assembler that reports an error when it finds a problem in an assembly instruction.

Q2: How do registers in an assembly template refer to C++ variables?

A2: As the answer to Q1 says, the association is done by the compiler. Internally, each variable is mapped to a register by the register allocation and assignment process. After this process, the assembly template turns into a short piece of real assembly code, which the assembler can then accept and handle for the final binary code generation.

Q3: I know that register allocation and assignment associate a register with a variable. But why is there an input/output list?

A3: For register allocation and assignment, the compiler needs inputs such as constraints and liveness. Without inline assembly, the compiler can derive these by analyzing the code internally. But because the behavior of the instructions inside an inline assembly block is unknown to the compiler, it requires the user to provide this extra information.
A constraint may be related to hardware. For example, an operand to be put in a general-purpose register is given the constraint r, and an operand to be put in a floating-point register is given the constraint f. Furthermore, the hardware sometimes prohibits certain behavior under certain circumstances. The constraint b on PowerPC (which prohibits the use of the r0 register) is of this kind. (Refer to A guide to inline assembly for C and C++ - Basic, intermediate, and advanced concepts for more details.) Conceptually, it is the user's responsibility to tell the compiler things such as data types and instruction restrictions, because the assembly code that the user provides is opaque to it.
Liveness can be affected by many aspects. The most significant one is whether a variable is read, written, or both. The input/output operand lists and some constraint modifiers help construct this information. (For example, the constraint modifier + shows that an operand is read-write, while = shows that it is write-only.)
On the whole, the input/output list is used to provide information to the compiler.
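As an illustration of how a modifier and a constraint carry this information, the sketch below increments a variable in place. The names `increment` and the guard macros `__powerpc__` and `__PPC__` (GCC predefined macros) are assumptions made for this example; on other targets the function falls back to plain C.

```c
#include <assert.h>

/* Hypothetical helper: increments *counter in place. */
static void increment(int *counter)
{
    int value;

    assert(counter != 0);
    value = *counter;
#if defined(__GNUC__) && (defined(__powerpc__) || defined(__PPC__))
    /* "+b": value is both read and written by the instruction (liveness),
       and it must not be placed in r0, which addi would interpret as the
       constant 0 rather than a register (hardware restriction). */
    __asm__ ("addi %0, %0, 1" : "+b" (value));
#else
    value = value + 1;   /* portable fallback for other targets */
#endif
    *counter = value;
}
```

Had the operand been declared with = instead of +, the compiler would be free to assume that the old contents of value are never read, and could leave the register uninitialized.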

Q4: How about the clobber list? Why do we need it?

A4: On many real-world platforms, machine instructions may implicitly change registers. This can be thought of as a kind of hardware constraint. A clobber list entry with a register name makes the compiler aware that a register that does not refer to any operand variable is also altered. And, if an instruction writes to a memory location that the compiler cannot predict, the compiler cannot tell whether the write changes a variable that is currently held in a register. (If it does, that variable must be reloaded from memory.) By adding a memory clobber, we tell the compiler to take the necessary precautions so that the generated code is correct. (For memory clobbers, A guide to inline assembly for C and C++ - Basic, intermediate, and advanced concepts gives a better explanation.)
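The memory clobber can be tried on its own, because an empty template is accepted on any architecture by GCC-compatible compilers. The following is a minimal sketch under that assumption; `shared_flag`, `compiler_barrier`, and `publish_and_read` are hypothetical names. Note that this only constrains the compiler; it does not emit a hardware synchronization instruction.

```c
#include <assert.h>

int shared_flag;   /* hypothetical variable that other code may also access */

/* Empty template plus a "memory" clobber: emits no instruction, but tells
   the compiler that memory may have been read or written here, so it
   should not keep memory values cached in registers across this point. */
static void compiler_barrier(void)
{
#if defined(__GNUC__)
    __asm__ __volatile__ ("" : : : "memory");
#endif
}

/* Writes the flag, then reads it back on the far side of the barrier. */
static int publish_and_read(int value)
{
    int seen;

    shared_flag = value;   /* store issued before the barrier         */
    compiler_barrier();
    seen = shared_flag;    /* value is read from memory again         */
    assert(seen == value); /* holds here: no other writer in this sketch */
    return seen;
}
```

Without the clobber, the compiler would be entitled to reuse the value it already has in a register instead of reading shared_flag again.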

Q5: Why is using assembler directives not recommended?

A5: Sometimes, people think that inline assembly has the full functionality of assembly. However, that is not always true. For example, using an assembler directive may cause severe problems if the user does not realize that the inline assembly code is embedded into the code section of the final executable.

A typical case is that the user defines a new section .mysect inside the assembly template. The compiler produces the right assembly code and passes it to the assembler. But, following the assembler syntax, defining the .mysect section switches the current section to it. Consequently, the code that follows the inline assembly (which is generated by the compiler) is also assembled into the .mysect section rather than the .text (code) section. As a result, the executable is totally broken.

In conclusion, it is not wise to use assembly functionality that does not belong to the compiler's inline assembly specification. Using anything that is not officially supported could pose a risk to your code.

Now let us get back to the missing colon issue. The root cause of the failure is that we gave the compiler incorrect information, such as liveness or constraints. The compiler does not complain because it does not check the correctness of any list (except for C/C++ syntax errors). The assembler is also satisfied, as it only processes instructions that arrive in a reasonable format. But the compiler has worked with bad information, and in the end the code fails. The failure is a warning: it is extremely important to take care with the input/output/clobber lists that you write. Otherwise, it is not surprising to get bad code.

## Conclusion

Although the syntax of inline assembly is not hard to learn, writing correct inline assembly does not simply mean writing correct assembly instructions and embedding them. Because the compiler cannot analyze the contents of an inline assembly block, the user must provide the compiler with more information than ordinary C/C++ code requires, and that is error-prone. The following tips can help.

• Write only a short inline assembly block with a single functionality.
• Check the inline assembly section of the compiler's documentation. Do not use assembly functionality that does not belong to the compiler's inline assembly specification.
• Select instructions carefully. Make every detail clear. Do not miss anything about an instruction, such as its constraints and side effects.
• Double-check the input/output/clobber lists before compiling and running your code. Especially, check for the correct usage of colons.

## Acknowledgement

Thanks to Jiang Jian and Ji Jinsong, my colleagues from the IBM CDL Rational Compiler Team, for their careful review and comments on this article.