Improve the performance of function calls with OpenPOWER ABI

Anatomy of function calls with OpenPOWER ABI supplement

The application binary interface (ABI) is the interface between two program modules, one of which is often a library or operating system, at the level of machine code. It defines the set of conventions that allow programs written in different languages or compiled by different compilers to call each other's functions.

In 2014, a new ABI definition for IBM PowerPC 64-bit little endian and OpenPOWER systems was published to replace the previous one, the 64-bit PowerPC ELF Application Binary Interface Supplement 1.9. This ABI introduces quite a few improvements that benefit overall application performance, an important one being the removal of the function descriptor. Unlike ELF ABI v1, it requires the callee to set up the table of contents (TOC) pointer (TOC base), introduces dual entry points to avoid redundant overhead in intra-module calls, and uses an alternative approach to support calls to lexically nested functions. If you are wondering how this ABI enhances function calls, or if you are a system software engineer, a library creator, or an assembly coder on OpenPOWER, you might be interested in this article.

Function descriptor in previous ABI


Definition of function descriptor

A function descriptor is a three-doubleword data structure that contains the following values:

  • The first doubleword contains the address of the entry point of the function.
  • The second doubleword contains the TOC base address for the function.
  • The third doubleword contains the environment pointer for languages such as Pascal and PL/1.

The above definition of the function descriptor is from the specification 64-bit PowerPC ELF Application Binary Interface Supplement 1.9. The descriptor holds the information about the static and dynamic environment of the function. The first doubleword is the actual function entry, the second doubleword is the TOC base, which refers to the position-independent code (PIC) data area, and the third doubleword provides a lexical nesting environment.

For an externally visible function, the value of the symbol with the same name as the function is the address of the function descriptor. Symbol names with a dot (.) prefix are reserved for holding entry point addresses. The value of a symbol named ".FN", if it exists, is the entry point of the function "FN".

Function descriptor with examples

With the previous ABI, the function descriptor must be used for inter-module calls and for calls through function pointers. Because called functions don't initialize their own TOC pointer and environment pointer, callers are responsible for setting up those values before the invocation. Refer to the following examples.

Listing 1. Inter module calls with function descriptor
$ cat func1.c
	int test();
	int func1(int val)
	{
	   int ret = test() + val;
	   return ret;
	}

$ xlc -q64 -qpic -qmkshrobj func1.c -o func1.so

$ objdump -d func1.so
	...
	0000000000000700 <00000010.plt_call.test+0>:
		// Save the TOC pointer of the func1's module
	700:   f8 41 00 28     std     r2,40(r1)
		// Load the actual function entry of test into r11
	704:   e9 62 80 98     ld      r11,-32616(r2)
		// Move r11 into Count Register (function entry)      
	708:   7d 69 03 a6     mtctr   r11                
		// Load the environment pointer of test into r11
	70c:   e9 62 80 a8     ld      r11,-32600(r2)
		// Load the TOC pointer of test's module into r2
	710:   e8 42 80 a0     ld      r2,-32608(r2)
		// Branch to the address in the Count Register
	714:   4e 80 04 20     bctr
	...

Here, the static linker allocates 24 bytes of storage starting at address $r2-32616 for the function descriptor of test, and the dynamic linker is responsible for filling in the corresponding values later. The related information is stored in the .opd section of the object file, which is an array of function descriptors. You can refer to Deeply understand 64-bit PowerPC ELF ABI - Function Descriptors for more details.

When an inter-module call occurs, as you can see from Listing 1, the procedure linkage table (PLT) stub first saves the caller's TOC pointer, then loads the function entry, the environment pointer, and the TOC pointer from the callee's function descriptor. After that, it jumps to the callee's function entry.

Listing 2. Indirect calls with function descriptor
$ cat func2.c
	int test();
	int func2(int val)
	{
	   int (*p)() = &test;
	   int ret = p() + val;
	   return ret;
	}
	
$ xlc -q64 -qpic func2.c -o func2.o -c
$ objdump -d func2.o
	
	0000000000000000 <.func2>:
	....
		// Load the address of test's function descriptor into r12
	18:   e9 82 00 00     ld      r12,0(r2)
		// Save it into automatic variable p
	1c:   f9 81 00 70     std     r12,112(r1)
		// Load the actual function entry of test into r0
	20:   e8 0c 00 00     ld      r0,0(r12)
		// Move r0 into Count Register (function entry)
	24:   7c 09 03 a6     mtctr   r0
		// Save the TOC pointer of func2's module (current module)
	28:   f8 41 00 28     std     r2,40(r1)
		// Load the environment pointer of test into r11
	2c:   e9 6c 00 10     ld      r11,16(r12)
		// Load the TOC pointer of test's module into r2
	30:   e8 4c 00 08     ld      r2,8(r12)
		// Branch to the address in the Count Register and link
	34:   4e 80 04 21     bctrl
		// Restore the TOC pointer of func2's module                     
	38:   e8 41 00 28     ld      r2,40(r1)          
	....

Here, we can see that an indirect call is more complex than we might expect. The caller must first obtain the callee's function descriptor, then load the key values (function entry, TOC pointer, and environment pointer) from it, and finally branch to the actual entry address. As with the inter-module call in Listing 1, the caller has to set up the invocation environment before making the actual call through the function entry.

Design issues

As the above examples show, the cost for the caller to establish the environment of the called function, plus the load-load latency of computing the function's entry point, is not negligible. Compared with current industry practice, this design based on the function descriptor is out of date. Programming practice has changed in the following ways:

  • Smaller functions get called a lot: With object-oriented programming now very popular, the average function size in object-oriented applications has dropped to tens of instructions, so reducing the fixed cost per function invocation is more important.
  • Less environment setup requirement: With the function descriptor, the caller is responsible for setting up the environment of the called function, but it has no information about the runtime behavior of the called function. So it has to establish the full environment conservatively. It is smarter to set up the environment if and only if it is really necessary.
  • Fewer nested functions: As programming languages have evolved, lexical nesting has become rare in current languages. Therefore, there is no need to pass the environment pointer in most cases.
  • Less global data access: Lots of short functions do not access global variables, and so don't need a TOC pointer set up for global data access.
  • Hardware improvement: The ABI should leave room for future hardware innovations such as PC-relative addressing.

Consequently, the IBM Power Architecture® 64-bit ELF V2 ABI (called the OpenPOWER ABI from here on) pushes environment initialization to the callee, which means that the caller no longer needs to establish the environment through a function descriptor. The result is better performance and simpler calling sequences. Let's take a look at how the OpenPOWER ABI does this.

TOC pointer initialization

First, the OpenPOWER ABI makes the called function initialize its own TOC pointer at function entry. To support this, the ABI dedicates register r12 to hold the address of the function's entry point at the beginning of the function prologue, and it introduces a symbol, .TOC., that stands for the TOC pointer. The initialization code sequence is shown in Listing 3.

Listing 3. TOC pointer initialization code sequence
	l0:   addis r2, r12, (.TOC.-l0)@ha    // Add the adjusted upper 16 bits of the offset to r12
	      addi  r2, r2, (.TOC.-l0)@l      // Add the low 16 bits

Here, the label l0 marks the function entry address. These two instructions compute the TOC pointer and place it in r2.

If you compile the above func1.c on an OpenPOWER platform, notice that the PLT stub makes sure that r12 holds the function entry of the callee.

Listing 4. TOC pointer initialization
$ xlc -qmkshrobj func1.c -o func1.so     # -qpic and -q64 are set implicitly on LE
$ objdump -d func1.so
	...	
	0000000000000650 <00000017.plt_call.test>:
		// Save the TOC pointer of the func1's module
	650:   18 00 41 f8     std     r2,24(r1)
		// Load the function entry of test into r12
		// (the global entry, explained later)
	654:   50 80 82 e9     ld      r12,-32688(r2)    
		// Move r12 into Count Register (function entry)
	658:   a6 03 89 7d     mtctr   r12
		// Branch to the address in the Count Register
	65c:   20 04 80 4e     bctr                      
	...

Compared with Listing 1, this PLT stub code sequence is more compact. The target address of the branch instruction is the actual function entry. When the call jumps to the function test, test is responsible for setting up its own TOC pointer. Refer to Listing 5.

Listing 5. test.c setting up TOC pointer
$ cat test.c
	int t();
	int test(){
	    return t();
	}
		
$ xlc -q64 -c test.c
$ objdump -dr test.o
		
0000000000000000 <test>:
0:   00 00 4c 3c     addis   r2,r12,0
                0: R_PPC64_REL16_HA     .TOC.
4:   00 00 42 38     addi    r2,r2,0
                4: R_PPC64_REL16_LO     .TOC.+0x4
8:   a6 02 08 7c     mflr    r0
...

For indirect calls, the code sequence is also shorter than that of the previous ABI.

Listing 6. Indirect calls
$ xlc func2.c -c
$ objdump -d func2.o

0000000000000000 <func2>:
		// Form the upper 16 bits of the address of the TOC entry
		// that holds the function entry of test
	20:   00 00 82 3d     addis   r12,r2,0
		// Load the function entry of test from that TOC entry into r12
	24:   00 00 8c e9     ld      r12,0(r12)
		// Save it into automatic variable p	
	28:   20 00 81 f9     std     r12,32(r1)
		// Move r12 into Count Register, note that
		// the r12 is holding the function entry too
	2c:   a6 03 89 7d     mtctr   r12
		// Save the TOC pointer of func2's module
	30:   18 00 41 f8     std     r2,24(r1)
		// Branch to the address in the Count Register and link
	34:   21 04 80 4e     bctrl
		// Restore the TOC pointer of func2's module                
	38:   18 00 41 e8     ld      r2,24(r1)

Dual entry

One question might come to mind. As the above examples show, the code sequence that sets up the TOC pointer always runs when the function is invoked. That cost is redundant if the caller is in the same module. So, is it possible to improve this?

Yes. To avoid this redundant TOC pointer initialization, the OpenPOWER ABI introduces "dual entry", which consists of a global entry point and a local entry point. The global entry point is available to any caller and points to the beginning of the prologue. The local entry point exists to avoid the cost of TOC pointer initialization: because functions within the same module share the same TOC base value, they can be entered at the local entry point, bypassing the code sequence that sets up the TOC pointer. The static linker uses the local entry point when it binds calls within a module, and the dynamic loader uses the global entry point when it resolves symbols at run time. A function pointer, in particular, always points to the global entry point, because a call through a pointer might be either an intra-module or an inter-module call.

With the dual entry, the function prologue appears as shown in Listing 7.

Listing 7. Dual entry
0000000000000000 <test>:
	// global entry of test  <-- inter module calls
	//                           OR indirect calls through a function pointer
	0:   00 00 4c 3c     addis   r2,r12,0
	4:   00 00 42 38     addi    r2,r2,0
	// local entry of test   <-- intra module calls
	8:   a6 02 08 7c     mflr    r0
	...

With dual entry, the TOC pointer is initialized only when required. The local entry point is used when the TOC pointer is known to already be valid for the function, while the global entry point is used when the TOC pointer must be set up for the function. The OpenPOWER ABI uses the three most-significant bits of the symbol's st_other field to encode the number of instructions between a function's global entry point and its local entry point. Refer to Listing 8 for an example using readelf.

Listing 8. Symbol table support for dual entry
$ readelf -s test.o

	Symbol table '.symtab' contains 6 entries:
	   Num:    Value          Size Type    Bind   Vis      Ndx Name
	     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
	     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS test.c
	     2: 0000000000000000    68 SECTION LOCAL  DEFAULT    4 .The_Code
	     3: 0000000000000000    68 FUNC    GLOBAL DEFAULT [<localentry>: 8]     4 test
	     4: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND .TOC.
	     5: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND t

The [<localentry>: 8] means that there are 8 bytes (two instructions) between the function's global entry point and its local entry point. If you want to write an assembly file in which a function has both global and local entry points, you can use the .localentry directive as shown in Listing 9.

Listing 9. Assembly sample for dual entry
test.s
	...
	.globl  test
	.type   test,@function
	.localentry test,8
	...
	test:
	0:  addis 2,12,.TOC.-0b@ha
		addi 2,2,.TOC.-0b@l
		...; function definition
		blr

Environment pointer initialization

Now we know how the OpenPOWER ABI supports TOC pointer initialization in a called function. But another question comes up: how does it deal with the environment pointer for lexically nested functions? Although lexical nesting is rarely used in today's programming languages, the ABI has to support it for completeness, and several schemes exist to implement it when needed. The solution in the OpenPOWER ABI is the trampoline. The basic idea is to generate a piece of executable code (the trampoline) at run time when the address of a nested function is taken. The trampoline loads and sets up the actual environment pointer, then loads the actual function entry of the nested function and jumps to it. Refer to Figure 1.

Figure 1. Trampoline

In Figure 1, when the address of a nested function is taken, the routine trampoline_setup allocates some space on the stack and places some instructions, the environment pointer, and the function entry there. The function pointer that refers to the nested function is then assigned the start address of the trampoline. When you call the nested function through this function pointer, execution begins at the start of the trampoline, which sets up the environment pointer and then jumps to the actual function entry.

Conclusion

This article describes how the new OpenPOWER ABI gets rid of the function descriptor. Function calls now run without a function descriptor: the called function takes charge of environment setup as needed, so the caller no longer has to set up an entire environment conservatively. The ABI also optimizes intra-module calls through dual entry points and supports nested functions through trampolines. The enhancements are listed in Table 1.

Table 1. Summary of the enhancements
Task                                 Previous ABI                 OpenPOWER ABI
TOC pointer initialization           Through function descriptor  Dual entry
                                                                  global entry point:
                                                                    addis r2, r12, (.TOC.-l0)@ha
                                                                    addi r2, r2, (.TOC.-l0)@l
                                                                  local entry point:
                                                                    ...
Environment pointer initialization   Through function descriptor  Using trampoline

Note that IBM XL C/C++ V13.1.2 for Linux and XL Fortran V15.1.2 for Linux support the OpenPOWER ABI. You can refer to the Resources section to get the free trial download.

Reference

Resources


publish-date=07102015