Using inline assembly with IBM XL C/C++ compiler for Linux on z Systems, Part 4

Advanced features

IBM XL C/C++ for Linux on z Systems, Version 1.1, released in 2015, supports inline assembly: embedding user-written assembler instructions directly in C/C++ programs. This gives advanced users direct access to instructions at the chip level. With inline assembly, software engineers can handcraft assembler code for the most performance-sensitive parts of a C/C++ program and accelerate those sections beyond what the compiler's optimizations alone deliver.

This article introduces the advanced features of inline assembly supported by the IBM XL compiler for Linux on z Systems. It discusses in detail assembly labels, basic branching, relative branching, symbolic names for input and output operands, matching constraints, and registers on the clobber list. The scope is limited to assembler instructions that involve general registers; vector registers and floating-point registers are a separate topic. The target audience is advanced software engineers who want to go beyond the optimizations provided by the compiler and fine-tune the most performance-sensitive sections of high-performing applications.

Assembly labels

During compilation, the compiler creates an internal name (a symbol) in the object file for each variable and function declared in the user's program. The same name refers to that variable or function in the generated assembly code. The assembly label feature lets the user control this name for selected variables and functions: when the assembly code is generated, the name specified by the assembly label is used for the corresponding variable or function. Thus, the declaration int func( ) asm ("my_function") specifies that the name in the object file for function func is my_function, rather than the name the compiler would otherwise derive from func (for example, _func on systems that prepend an underscore).

A possible usage of this feature is to allow users to define names for the linker that do not start with an underscore, even on a system where an underscore is normally prepended to the name of a function or variable. Note that the assembly label specification can only be applied to the declaration of global variables and function prototypes.

The C programs label_b.c and label_a.c shown in Listing 1 and Listing 2, respectively, are snippets demonstrating the usage of assembly label for a function prototype.

Listing 1. label_b.c defining the function func_asm
int func_asm() {        //func_asm is defined here
    return 55;
}

In file label_b.c, the func_asm( ) function is defined.

Listing 2. label_a.c associating a function name with an assembly label
int func() asm("func_asm");        // func is associated with "func_asm"
int main() {
   return func();                 // func is called
}

In file label_a.c, function func is associated with the name func_asm by means of the assembly statement on line 1. On line 3, function func( ) is called, although there is no definition for it. The expectation is that func is bound to func_asm and the call to function func( ) will become a call to func_asm( ).

Compiling and linking label_a.c and label_b.c together, and then running the resulting executable, should be successful. The program returns 55 because the symbol func is bound to func_asm. Figure 1 shows the assembly code generated for program label_a.c. It confirms that the name func_asm is used in place of func: [ BRASL %r14, func_asm ].

Figure 1. label_a.c calls func_asm instead of func

Branch to a label

There are two ways to branch to a label: basic branching and relative branching. In basic branching, the branch instruction branches to a named label when a given condition holds; the label must be uniquely defined in a given program. In relative branching, the target is a numeric label that is located relative to the position of the branch instruction. If the target label is before the branch instruction, the character b (for backward) is appended to the label number in the branch operand, as in 1b. In the same manner, f (for forward) is appended when the target label is after the branch instruction, as in 1f.

Basic branching

Listing 3 is an example of using basic branching.

Listing 3. Example of basic branching
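int absoluteValue(int a) {
     asm (" CFI %0, 0\n"          // compare a with 0; sets the condition code
          " BRC 0xA, DONE\n"      // branch to DONE if a == 0 or a > 0
          " LCR %0, %0\n"         // a = -a
          " DONE:\n"
          :"+r"(a)
          :                       // no input-only operands
          :"cc"                   // CFI and LCR change the condition code
         );
    return a;
}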

Table 1 shows the relationship between condition code and the mask for the instruction CFI (compare immediate) used on line 2 of Listing 3.

Table 1. Relationship between condition code and mask for instruction CFI
Compare a against 0 | Condition code | Mask bits
a = 0 | 0 (binary 00) | 1000
a < 0 | 1 (binary 01) | 0100
a > 0 | 2 (binary 10) | 0010

On line 2 of Listing 3, the CFI instruction compares variable a (%0) with zero. If a == 0 or a > 0, the condition code is set to 0 or 2, respectively (the rows for a = 0 and a > 0 in Table 1). The combined mask bits for condition codes 0 and 2 are 1010 in binary, which is 0xA in hexadecimal. Accordingly, if a >= 0, the branch instruction on line 3 branches to the label DONE on line 5, and the function returns the value of a without running the LCR instruction on line 4. On the other hand, if a < 0, the branch is not taken. The LCR (Load Complement) instruction on line 4 replaces a with its two's complement, that is, -a, before the function returns. Thus, the function effectively returns the absolute value of a. The basic branching to label DONE in this example is used to skip the execution of LCR on line 4.
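The mask arithmetic can be sketched in plain C. The helper names cc_mask_bit and brc_mask are hypothetical; they only encode the rule that condition code n selects the n-th bit from the left of the 4-bit mask, so condition code 3, which Table 1 does not list, would map to 0001.

```c
/* Mask bit selected by one condition code (0..3): CC0 -> 1000, CC3 -> 0001. */
unsigned cc_mask_bit(unsigned cc) {
    return 0x8u >> cc;
}

/* Mask that makes a branch fire when the condition code is cc_a or cc_b. */
unsigned brc_mask(unsigned cc_a, unsigned cc_b) {
    return cc_mask_bit(cc_a) | cc_mask_bit(cc_b);
}
```

Calling brc_mask(0, 2) combines 1000 and 0010 into 1010, the 0xA mask used by the BRC instruction in Listing 3.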

Relative branching

The example in Listing 4 uses relative branching to loop back.

Listing 4. Pseudo code using relative branching
asm ( "1:          \n"
        "DoSomeWork\n"
        "BRCT  %0, 1b  \n"
        :"+r"(limit)
       );

The BRCT (Branch Relative On Count) instruction subtracts 1 from the value of its first operand, limit (%0), and stores the result back in that operand. When the result is not zero, it branches to the address specified by the second operand, 1b, that is, label 1 searched backward. Relative to the branch instruction, label 1 is backward, on line 1. In this example, each pass runs the loop body and then decrements limit; the loop repeats until the decremented value reaches zero, so a positive limit yields exactly limit iterations. Because the decrement happens before the test, limit must be at least 1 on entry; an initial value of 0 would wrap around instead of ending the loop immediately.
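A concrete version of the loop might look like the following sketch, in which the placeholder work is a single AR (add) instruction. The function name repeat_add is hypothetical, and the __s390x__ guard with its plain C fallback is an addition for illustration so that the fragment also builds on other platforms. The "cc" clobber is included on the assumption that the compiler accepts it, because AR changes the condition code.

```c
/* Adds base to a running total count times; count must be at least 1. */
int repeat_add(int base, int count) {
    int total = 0;
#if defined(__s390x__)
    asm ("1:            \n"
         " AR   %0, %2  \n"    /* total += base */
         " BRCT %1, 1b  \n"    /* --count; loop back while count != 0 */
         : "+r"(total), "+r"(count)
         : "r"(base)
         : "cc");
#else
    /* Portable fallback with the same decrement-then-test behavior. */
    do {
        total += base;
    } while (--count != 0);
#endif
    return total;
}
```

The counter is declared as an input/output operand ("+r") because BRCT both reads and updates it.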

Note that for relative branching, the label name must consist of digits only. This requirement does not apply to labels used in basic branching. Also, the label must be defined within the same assembly statement; jumping to a label in a different assembly statement is not supported.
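The forward form can be sketched by rewriting the absolute-value example from Listing 3 with a numeric label. The function name abs_local_label is hypothetical, and the __s390x__ guard with its plain C fallback is an addition for illustration so that the fragment also builds on other platforms. The "cc" clobber is included on the assumption that the compiler accepts it.

```c
/* Absolute value using a forward relative branch to numeric label 1. */
int abs_local_label(int a) {
#if defined(__s390x__)
    asm (" CFI %0, 0    \n"    /* compare a with 0 */
         " BRC 0xA, 1f  \n"    /* a >= 0: skip forward to label 1 */
         " LCR %0, %0   \n"    /* a = -a */
         "1:            \n"
         : "+r"(a)
         :
         : "cc");
#else
    if (a < 0) {
        a = -a;
    }
#endif
    return a;
}
```

Because label 1 is numeric and resolved relative to the branch, the same label number can be reused in other assembly statements, whereas a named label such as DONE must be unique in the program.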

Symbolic names

The input and output operands can also be specified by symbolic names, which can then be referenced within the assembly code. A symbolic name is specified inside square brackets preceding the constraint string. Inside the assembly code, it is referenced as %[name] instead of a percent sign followed by the operand number. A symbolic name can be any valid C variable name, even one that is already defined in the surrounding C code. Symbolic names, however, must be unique within each inline assembly statement.

The snippet in Listing 5 uses the symbolic names [result], [first], and [second] to represent the 0th, 1st, and 2nd operands, respectively. Instead of referring to %0, %1, and %2, the statement refers to %[result], %[first], and %[second].

Listing 5. Example of using symbolic names
int main(){
   int sum = 0, one = 1, two = 2;
   asm ("AR  %[result], %[first]\n"      // sum += one
        "AR  %[result], %[second]\n"     // sum += two
        :[result] "+r"(sum)
        :[first]  "r"(one), [second] "r"(two)
        :"cc"                            // AR changes the condition code
       );
   return sum == 3 ? 0 : 1;
}
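The following sketch illustrates the point that a symbolic name may coincide with a C variable name: each operand is named after the variable it binds. The function name add3 is hypothetical, and the __s390x__ guard with its plain C fallback is an addition for illustration so that the fragment also builds on other platforms. The "cc" clobber is included on the assumption that the compiler accepts it.

```c
/* Returns x + y + z; each symbolic name matches the C variable it binds. */
int add3(int x, int y, int z) {
    int sum = x;
#if defined(__s390x__)
    asm (" AR %[sum], %[y] \n"    /* sum += y */
         " AR %[sum], %[z] \n"    /* sum += z */
         : [sum] "+r"(sum)
         : [y] "r"(y), [z] "r"(z)
         : "cc");
#else
    sum += y;
    sum += z;
#endif
    return sum;
}
```

The name inside the brackets and the C expression inside the parentheses are independent: [sum] names the operand, while (sum) selects the variable bound to it.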

Matching constraints

The digits 0 through 9 are matching constraints. A matching constraint advises the compiler to allocate the same register for an input operand and for the output operand with that number. As such, matching constraints can be used only with input operands. This is essential when one instruction in the statement uses the result of a previous one as its input. Without a matching constraint, the compiler does not know that the same register must be used for both the output and the input operand.

The C program example07a.c in Listing 6 is an example where the execution might produce incorrect results due to the absence of a matching constraint.

Listing 6. example07a.c with incorrect result without a matching constraint
#include <stdio.h>
int main () {
   int a = 10, b = 200, c = 3000;
   printf ("INITIAL: a = %d, b = %d, c = %d\n", a, b, c );
   asm ("LR %0, %2\n"
             "LR %1, %3\n"
            :"=r"(a),"=r"(b)
            :"r"(c), "r"(a));
   printf ("RESULT : a = %d, b = %d, c = %d\n", a, b, c );
   return 0;
}

The first LR (Load Register) instruction, on line 5, loads a with c. Because c is 3000, a becomes 3000. The second LR instruction, on line 6, loads b with a. If the programmer intends b to receive the updated value of a, which is 3000 after the first LR instruction, example07a.c does not guarantee that result. The output operand a (%0) and the input operand a (%3) are separate operands, and nothing tells the compiler to place them in the same register. When it does not, the previous value of a, 10, is loaded into b. Listing 7 shows that compiling and running example07a.c yields b equal to 10 instead of 3000, which is the outcome in most cases.

Listing 7. Compiling and running example07a.c
xlc -o example07a example07a.c;
./example07a
INITIAL: a = 10, b = 200, c = 3000
RESULT : a = 3000, b = 10, c = 3000      <- b is loaded with a, but b is 10 while a is 3000

Because the intent of the user is to load the updated value of a into b, a matching constraint must be used to tell the compiler that the output of the LR instruction on line 5 is the input of the LR instruction on line 6. With the matching constraint, the compiler selects the same register for variable a in both LR instructions. Listing 8 shows the C program example07b.c, the corrected version that uses a matching constraint.

Listing 8. example07b.c with a matching constraint
#include <stdio.h>
int main () {
   int a = 10, b = 200, c = 3000;
   printf ("INITIAL: a = %d, b = %d, c = %d\n", a, b, c );
   asm ("LR %0, %2\n"
             "LR %1, %3\n"
            :"=r"(a),"=r"(b)
            :"r"(c), "0"(a));
   printf ("RESULT : a = %d, b = %d, c = %d\n", a, b, c );
   return 0;
}

The program example07b.c uses the matching constraint "0"(a) on line 8 to inform the compiler that the input operand a (%3) must use the same register as the 0th output operand a. Because the first LR instruction on line 5 loads c, which is 3000, into a, and the second LR instruction on line 6 uses the same register for the input operand a, the value 3000 is loaded into b as expected.
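The same idea applies to any instruction that both reads and writes its first operand. In the sketch below, the output acc (%0) and the input acc (%1) are tied together by the matching constraint "0", so the AR instruction finds the old value of acc already in the register it updates. The function name add_in_place is hypothetical, and the __s390x__ guard with its plain C fallback is an addition for illustration so that the fragment also builds on other platforms. The "cc" clobber is included on the assumption that the compiler accepts it.

```c
/* Returns acc + inc; the input acc is tied to output operand 0. */
int add_in_place(int acc, int inc) {
#if defined(__s390x__)
    asm (" AR %0, %2 \n"          /* acc = acc + inc */
         : "=r"(acc)              /* %0: result */
         : "0"(acc), "r"(inc)     /* %1 uses the register of %0; %2 is inc */
         : "cc");
#else
    acc += inc;
#endif
    return acc;
}
```

For a single variable used this way, the "+r" constraint seen in Listing 3 expresses the same read-and-write intent with one operand instead of two.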

Figure 2 displays the difference between the two assembly files generated for example07a.c (on the left side of the figure) and example07b.c (on the right side of the figure). When there is no matching constraint (example07a.c), the compiler uses two different registers, r1 and r5, for the output operand a and the input operand a, respectively. When a matching constraint is used in example07b.c, the same register r1 is used for both LR operations.

Figure 2. Code generation depending on the existence of matching constraint

Table 2 explains example07a.c, where a matching constraint is not used. The update that occurred with r1 is independent from the input value on r5. For that reason, the updated value is not used in the second LR instruction.

Table 2. Codes when matching constraint is not used (example07a.s)
Assembly code | Explanation
BRASL %r14,printf | Calls printf INITIAL …
L %r3,168(,%r15) | Loads value of c from r15+168 to register r3: r3 holds 3000
L %r5,176(,%r15) | Loads value of a from r15+176 to register r5: r5 holds 10
#GS00000 | Starts inlining user's assembler instructions
LR %r1, %r3 | Loads r3 (value of c, being 3000) to r1 (a)
LR %r0, %r5 | Loads r5 (value of previous a, being 10) to r0 (b)
#GE00000 | Ends inlining user's assembler instructions

In contrast, the assembly code for example07b.c, where the matching constraint is used, shows that the same register r1 is used for variable a in both instructions. The update made to r1 by the first LR instruction becomes the input value for the second LR. For that reason, b is correctly loaded with the updated value of a.

Table 3. Codes generated when matching constraint is used (example07b.s)
Assembly code | Explanation
BRASL %r14,printf | Calls printf INITIAL …
L %r3,168(,%r15) | Loads value of c from r15+168 to register r3: r3 holds 3000
L %r1,176(,%r15) | Loads value of a from r15+176 to register r1: r1 holds 10
#GS00000 | Starts inlining user's assembler instructions
LR %r1, %r3 | Loads r3 (value of c, being 3000) to r1 (a)
LR %r0, %r1 | Loads r1 (value of updated a, being 3000) to r0 (b)
#GE00000 | Ends inlining user's assembler instructions

Register names on the clobber list

If the assembler instructions use or update registers that are not listed in the output and input operand lists, the user must list all affected registers in the clobber list. The compiler uses this information when it allocates registers around the inline assembly statement, so that it does not keep live values in the clobbered registers.

Listing 9 displays an example where general register r7 is explicitly specified as the operand of the assembler instructions.

Listing 9. example09.c using register not in the input/output operand list
#include <stdio.h>
int main () {
   int a = 15, b = 20;
   printf ("INITIAL: a = %d, b = %d\n", a, b );
   asm ("LR   7, %1\n"
            "MSR %0,  7\n"
           :"+r"(a)
           :"r"(b)
           :"r7"
       );
   printf ("RESULT : a = %d, b = %d\n", a, b );
   return 0;
}

The LR instruction on line 5 specifies register r7 as its output operand. The MSR instruction on line 6 also uses r7 as its input operand. Register r7 is used as an operand of the assembler instructions, but it is not listed on the input and output operand list. For that reason, r7 must be added to the clobber list to inform the compiler that it is used. In general, to ensure the correctness of the program any register affected by the assembler instruction must be listed either in the operand lists or in the clobber list. The compiler relies on the information to adjust register allocation.

Comparing the generated code when a different register is clobbered shows how the choice of clobbered register affects performance. In the example09a.c program in Listing 10, register r1 is used instead of register r7.

Listing 10. example09a.c clobbering a different register
#include <stdio.h>
int main () {
   int a = 15, b = 20;
   printf ("INITIAL: a = %d, b = %d\n", a, b );
   asm ("LR   1, %1\n"
            "MSR %0,  1\n"
           :"+r"(a)
           :"r"(b)
           :"r1"
       );
   printf ("RESULT : a = %d, b = %d\n", a, b );
   return 0;
}

Figure 3 compares the two assembly files generated by the compiler. The file on the left side uses r7 and the file on the right side uses r1.

Figure 3. Comparing the codes when clobbering different registers

The right side of Figure 3 shows that when register r1 is clobbered, the compiler selects register r3 for variable b [ L %r3,168(,%r15) ]. More importantly, the compiler does not save the contents of the clobbered register when r1 is selected. When r7 is selected, the content of register r7 is saved to the location r15+56 [ STG %r7,56(,%r15) ]. This is consistent with the Linux on z Systems calling convention, in which r7 is a call-saved register that a function must preserve, whereas r1 is call-clobbered. Clobbering register r1 instead of r7 therefore eliminates one store instruction. Figure 3 shows that selecting a suitable register to clobber might improve performance.

If explicitly naming a register is not preferred, users can modify the code so that the compiler selects the register. In this particular example, the user can add a temporary operand and use a matching constraint to indicate that the operand serves as both an output and an input. The code for example09b.c is displayed in Listing 11.

Listing 11. Code modification to let the compiler select register
#include <stdio.h>
int main () {
   int a = 15, b = 20, tmp = 1;
   printf ("INITIAL: a = %d, b = %d\n", a, b );
   asm ("LR  %1, %2\n"
             "MSR %0, %3\n"
            :"+r"(a), "=r"(tmp)
            :"r"(b) , "1"(tmp)
           );
   printf ("RESULT : a = %d, b = %d\n", a, b );
   return 0;
}
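A variation on Listing 11 is sketched below. It assumes that the compiler accepts the GCC-style & (early-clobber) constraint modifier, which this article does not otherwise discuss. The modifier marks tmp as an output that is written before the statement has finished reading its other operands, so tmp needs neither an initial value nor a matching input operand. The function name multiply_with_scratch is hypothetical, and the __s390x__ guard with its plain C fallback is an addition for illustration so that the fragment also builds on other platforms.

```c
/* Returns a * b, using a compiler-selected scratch register for b. */
int multiply_with_scratch(int a, int b) {
#if defined(__s390x__)
    int tmp;                     /* scratch operand; no initial value needed */
    asm (" LR  %1, %2 \n"        /* tmp = b */
         " MSR %0, %1 \n"        /* a = a * tmp */
         : "+r"(a), "=&r"(tmp)   /* & : tmp is written before a is read */
         : "r"(b));
#else
    a *= b;
#endif
    return a;
}
```

As in Listing 11, no register appears on the clobber list, so the compiler is free to choose a register that needs no save and restore.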

Conclusion

Inline assembly provides an avenue for users to incorporate assembler instructions directly into C/C++ programs. This feature allows advanced users to further improve application performance by handcrafting the assembler instructions for particular sections of the code. IBM XL compilers already perform highly sophisticated optimizations at each optimization level. For that reason, accelerating performance with inline assembly requires in-depth knowledge of how the target code executes. Careful analysis of the performance effects of the embedded assembler instructions, together with thorough planning and testing, is a prerequisite for achieving a performance gain.

Acknowledgements

I would like to thank Ms. Visda Vokhshoori and Ms. Nha-Vy Tran for their advice during the composition of this article.

ArticleID=1012991
ArticleTitle=Using inline assembly with IBM XL C/C++ compiler for Linux on z Systems, Part 4: Advanced features
publish-date=08122015