Using inline assembly with IBM XL C/C++ compiler for Linux on z Systems, Part 2

Accelerate performance

An analysis based on the computation of Fibonacci sequence


Starting with version 1.1 in 2015, the inline assembly feature is supported by IBM XL C/C++ compiler for Linux on z Systems. The feature provides software engineers with direct access to the set of assembler instructions available at the chip-level. Using inline assembly, software engineers can specify with ease the optimal instructions to accelerate the performance of specific code sections beyond the extent provided by the algorithm or the compiler. This is essential for high performance applications, especially if the improved code sections are on the critical path of execution.

This article compares the performance of the inline assembly approach to computing the Fibonacci numbers with that of the following four algorithms, none of which uses inline assembly:

  • The recursive algorithm with exponential time complexity
  • The dynamic programming approach in linear time complexity
  • The iterative implementation in linear time complexity
  • The optimized matrix power algorithm in logarithmic time complexity

IBM XL compilers perform highly sophisticated optimizations based on multidimensional analysis of the user code. To shield the comparison from the effects of the various optimizations that the compiler performs automatically, none of the compilations in this article specify a compiler optimization level.

In this series

The article assumes that the audience is familiar with inline assembly for Linux on z Systems. Software engineers might find the other articles in this series informative.

Definition of Fibonacci sequence

Named after the Italian mathematician Leonardo Bonacci (also known as Fibonacci), the Fibonacci sequence finds applications in multiple disciplines such as mathematics, architecture, geometry, and computer science, as well as in nature. It is a series of integers starting with the pair 0, 1 (or 1, 1); each subsequent number in the series is the sum of the previous two. In mathematical terms, the Fibonacci sequence (Fn) is defined by the recurrence relation with the first two terms, either F0, F1 or F1, F2, as follows:

F0 = 0
F1 = 1
Fn = F(n-2) + F(n-1)  if n > 1

or

F1 = 1
F2 = 1
Fn = F(n-2) + F(n-1)  if n > 2

Another augmented form of the Fibonacci numbers is the negaFibonacci sequence, where the indices are extended to negative numbers. Among the three definitions of the Fibonacci numbers, this article bases its analysis on the sequence of the more modern form F0, F1, …, Fn. The first few Fibonacci numbers are displayed in Table 1.

Table 1: List of the first few Fibonacci numbers

F0  F1  F2  F3  F4  F5  F6  F7  F8  F9  F10  F11
0   1   1   2   3   5   8   13  21  34  55   89

Recursive algorithm

Based on the recurrence definition of the Fibonacci sequence, the recursive algorithm to compute the nth Fibonacci number is a straightforward implementation.

Listing 1: Pseudo code of recursive algorithm

if n in range [0,1]
	return n
else
	return Fib(n-2) + Fib(n-1)
end if

The recursive algorithm maps naturally to the recurrence definition of the series. The runtime complexity of the recursive steps can be expressed in the following equation:

T(n) = T(n-1) + T(n-2) + Theta(1)
     >= 2 T(n-2)
     = Theta(2^(n/2))

Thus, the complexity of this implementation is in exponential time.

Listing 2: Recursive algorithm to compute the nth Fibonacci number

long long fib_r(int n)
{
   return n <= 1 ? n : fib_r(n-1) + fib_r(n-2);
}

Figure 1. Recursion tree of Fibonacci recursive algorithm

From the recursion tree of the algorithm displayed in Figure 1, it is observable that some expressions are computed multiple times during the process. For example, the expression Fib(n-2) is calculated during the calls made by both Fib(n) and Fib(n-1). The repetitive computation is the main reason why the recursive algorithm is in exponential time. If all the redundant computations can be avoided, the time complexity will be improved. In the case of Fibonacci sequence, the problem can be solved with dynamic programming. Utilizing memory to retain the computed values for later use, the dynamic programming technique speeds up the execution by eliminating the repetitive computations occurring within the recursive algorithm.

Dynamic programming approach

To eliminate the redundant computations, dynamic programming utilizes Memoization to retain the computed values for later use. With a single computation for each term, the time complexity of dynamic programming approaches is linear. This article discusses two well-known dynamic programming approaches to compute the Fibonacci sequence: The Memoization version and the bottom-up version.

The Memoization version of the dynamic programming approach

As expressed in the name of the technique, this approach memorizes the computed values by storing them in some form of data structure. A new computation will only be needed if its result is not already in the memory of the algorithm. Although the Memoization approach still relies on the recursive steps, the memory look-up ensures that each expression is evaluated only once during the process. The memory look-up operation is assumed to have constant time complexity.

Listing 3: Pseudo code of Memoization dynamic programming

Mem = { }                                 //Data structure storing computed values
if n in Mem                               //Return existing value from memory
	return Mem[n]                     //Early exit when handling known values
end if
if n in range [0,1]                       //Base step of the recursion
	fib ← n
else                                      //Recursive steps
	fib ← Fib(n-2) + Fib(n-1)
end if
Mem[n] ← fib                              //Memorize the new value
return fib                                //Return the newly computed value

Because each value of the Fibonacci number is computed only once, the runtime complexity of this approach is in linear time. The recursion tree of the Memoization version of dynamic programming approach, as displayed in Figure 2, is free from repetitive computations of known values.

Figure 2. The recursion tree of dynamic programming approach

Although the Memoization version can improve the runtime complexity, it still relies on recursive function calls. The costs to set up and manage function calls are ignored in the complexity analysis but they are not trivial in reality. The technique called bottom-up traversal, discussed below, can help avoid the extra cost of supporting function calls.
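The Memoization pseudo code above can be sketched in C as follows. This is a minimal illustration rather than part of the original listings: the fixed table size, the use of zero as a "not yet computed" sentinel, and the name fib_memo are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdio.h>

#define MAXN 93                 /* F92 is the largest term that fits in a
                                   signed 64-bit integer */

static long long Mem[MAXN];     /* zero marks "not yet computed" for n > 1 */

long long fib_memo(int n)
{
   if (n <= 1)
      return n;                 /* base cases F0 = 0 and F1 = 1 */
   if (Mem[n] != 0)
      return Mem[n];            /* early exit for known values */
   Mem[n] = fib_memo(n-2) + fib_memo(n-1);   /* compute once, then memorize */
   return Mem[n];
}
```

Each index is computed at most once; every later request for the same index is served by the table look-up, which is the property that makes this version linear.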

The reversed traversal (bottom-up) version of dynamic programming

Instead of starting from the root at the top of the recursion tree, the reversed traversal approach traverses the tree from the bottom up. In the particular case of computing the Fibonacci numbers, the algorithm starts with index 0 and traverses up to index n. It still uses a data structure to memorize the computed values, but it replaces the recursive function calls with look-up operations. By definition, the nth Fibonacci number is computed as Fn = F(n-2) + F(n-1). Because both the earlier terms F(n-2) and F(n-1) can be retrieved from the memory of the algorithm, no recursive function call is required.

Listing 4: Pseudo code of reversed traversal dynamic programming

    Mem = {}                                       //the memory
    for k in range [0,1,…,n]                       //traverse bottom-up
        if k in range [0,1]
            Mem[k] ← k                             //base cases for indices 0 and 1
        else
            Mem[k] ← Mem[k-1] + Mem[k-2]           //retrieve known values
        end if
    end for
    return Mem[n]

With the memory look-up preventing the repetitive computation of known terms, each term of the Fibonacci sequence is computed only once. The algorithm is, thus, in linear time complexity. In addition, because recursive function calls are replaced with memory look-up operations, the extra costs of setting up and managing recursive function calls are eliminated.

Listing 5: Dynamic programming approach to compute the nth Fibonacci number

long long fib_d(int n)
{
   int k;
   long long memory[n+1];                   //table sized to hold indices 0..n
   for ( k = 0 ; k <= n ; ++k ) {
      if ( k <= 1 )
         memory[k] = k;
      else
         memory[k] = memory[k-1] + memory[k-2];
   }
   return memory[n];
}

Iterative algorithm

Similar to the reversed traversal approach, the iterative algorithm iterates through the entire sequence from index 0 up to the last index n. During the process, it computes each term of the Fibonacci sequence as the sum of the previous two. The iterative algorithm, however, does not memorize the computed values. Figure 3 provides a visualization of the iterative algorithm used to compute the Fibonacci numbers.

Figure 3. Visualization of iterative algorithm

Because the requirement set forth for the program is to compute only the nth Fibonacci number, the step that stores each computed value res back into an array holding the Fibonacci sequence can be skipped in the actual implementation. When required, it can be added with ease.

Listing 6: Iterative approach to compute the nth Fibonacci number

long long fib_i(int n)
{
   int k;
   long long e0 = 0, e1 = 1, res = 0;
   for ( k = 0 ; k <= n ; ++k ) {      //iterate from 0 to n
      if ( k <= 1 )
         res = k;                      //base case for F0 and F1
      else {
         res = e0 + e1;                //res is the sum of 2 previous terms
         e0 = e1;                      //update the 2 previous terms
         e1 = res;                     // -------- ditto --------
      }
   }
   return res;                         //return Fn
}

Because the iterative algorithm iterates through the entire Fibonacci sequence only once, its complexity is in linear time. Similar to the reversed traversal algorithm, it makes no recursive function calls.

The optimized matrix power algorithm

This is, most likely, the algorithm with the best performance: it computes the Fibonacci numbers in logarithmic time complexity. The algorithm relies on the following mathematical manipulation.

Figure 4. Mathematical base of matrix power algorithm

The nth Fibonacci number will be the element (0, 0) of the resulting matrix after raising the matrix A to the power of (n-1). Because all matrices in question are 2x2, matrix multiplication can be done in constant time. An example code is displayed in Listing 7.

Listing 7: Multiplication of 2x2 matrices in constant time

void multiply(long long M[2][2], long long N[2][2])
{
  long long a =  M[0][0]*N[0][0] + M[0][1]*N[1][0];
  long long b =  M[0][0]*N[0][1] + M[0][1]*N[1][1];
  long long c =  M[1][0]*N[0][0] + M[1][1]*N[1][0];
  long long d =  M[1][0]*N[0][1] + M[1][1]*N[1][1];

  M[0][0] = a;
  M[0][1] = b;
  M[1][0] = c;
  M[1][1] = d;
}

In order to raise the matrix A to the power of n, a straightforward algorithm runs in linear time because it must call the multiplication (n-1) times.

A^n = A A … A

This can be further optimized by using the following technique:

A^n = A^(n/2) A^(n/2)    (for even n)

Thus, instead of raising the matrix A to the power of n, the algorithm raises it to the power of (n/2) and performs one constant-time multiplication of A^(n/2) with itself. For odd n, one additional multiplication by A completes the result.

Listing 8: Matrix power algorithm in logarithmic time

void power(long long M[2][2], int n)
{
  if ( n == 0 || n == 1 ) return;           //no operation needed for base cases
  long long A[2][2] = {{1,1},{1,0}};

  power(M, n/2);                            //recursively call power on n/2
  multiply(M, M);                           //a constant time multiplication

  if ( n % 2 != 0 )                         //to handle odd n
     multiply(M, A);
}

By recursively calling the power function on n/2, the size of the input is reduced by half each time. Accordingly, the complexity is in logarithmic time. Listing 9 shows an example implementation that uses the optimized matrix power approach.

Listing 9: Optimized matrix power approach to compute the nth Fibonacci number

long long fib_m(int n)
{
  long long A[2][2] = {{1,1},{1,0}};              //matrix A
  if ( n == 0 )
      return 0;                                   //base case F0 = 0
  power(A, n-1);                                  //raise A to power (n-1)
  return A[0][0];                                 //Fn is element (0,0)
}
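For experimentation, Listings 7 through 9 can be combined into one self-contained compilation unit, as sketched below. The value checks in the accompanying test are illustrative additions and are not part of the article's test program.

```c
#include <assert.h>

/* Listing 7: constant-time multiplication of 2x2 matrices, M = M * N */
void multiply(long long M[2][2], long long N[2][2])
{
  long long a = M[0][0]*N[0][0] + M[0][1]*N[1][0];
  long long b = M[0][0]*N[0][1] + M[0][1]*N[1][1];
  long long c = M[1][0]*N[0][0] + M[1][1]*N[1][0];
  long long d = M[1][0]*N[0][1] + M[1][1]*N[1][1];
  M[0][0] = a;
  M[0][1] = b;
  M[1][0] = c;
  M[1][1] = d;
}

/* Listing 8: raise M to the power of n in logarithmic time */
void power(long long M[2][2], int n)
{
  if (n == 0 || n == 1) return;     /* no operation needed for base cases */
  long long A[2][2] = {{1,1},{1,0}};
  power(M, n/2);                    /* recursively call power on n/2 */
  multiply(M, M);                   /* one constant-time multiplication */
  if (n % 2 != 0)                   /* one extra multiplication for odd n */
     multiply(M, A);
}

/* Listing 9: Fn is element (0,0) of A raised to the power of (n-1) */
long long fib_m(int n)
{
  long long A[2][2] = {{1,1},{1,0}};
  if (n == 0)
      return 0;                     /* base case F0 = 0 */
  power(A, n-1);
  return A[0][0];
}
```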

Inline assembly implementation

The inline assembly implementation is based on the iterative algorithm. As seen in the visualization displayed in Figure 3, the iterative algorithm computes the nth Fibonacci number by adding the two terms immediately before it, the (n-2)th and the (n-1)th. After this computation, the (n-2)th term can be discarded because it is no longer used in subsequent computations. This observation is utilized to implement the inline assembly approach.

The algorithm

The inline assembly implementation keeps the two most recent Fibonacci numbers in two separate registers. At each iteration, the algorithm computes the sum of the two most recent terms and stores the computed value in the register holding the smaller term, effectively overwriting it. Because the smaller term is not required for future computations, this overwrite is acceptable. The greater term is thus the result of that iteration. Figure 5 displays a visualization of the inline assembly approach.

Figure 5. Visualization of the inline assembly approach

As visually displayed in Figure 5, the implementation iterates over each term of the Fibonacci sequence once. The result of each iteration is stored in the register colored in red in Figure 5. The pseudo code of the implementation is shown in Listing 10.

Listing 10: Pseudo code of the inline assembly implementation

    load F0 to register 0                              //smaller term in register 0
    load F1 to register 1                              //greater term in register 1
    for k in range [2,3,…,n]
        swap the values in two registers              //register 1 holds smaller term
        sum up the values in two registers            //sum up 2 terms
        store the sum to register 1                   //register 1 holds the sum 
    end for
    return register 1	                              //return the sum

Swapping two values with triple exclusive-or operations

There are various ways to swap the values stored in two registers, one of which is the straightforward technique using a temporary register or variable. There is, however, a way to swap the values without going through temporary storage: the exclusive-or (XOR) swap algorithm, which uses three consecutive XOR operations. The truth table of the triple XOR operation is exhibited in Table 2.

Table 2: The truth table of triple exclusive-or operation

Initial values    R0 = R0 XOR R1    R1 = R0 XOR R1    R0 = R0 XOR R1
R0  R1            R0  R1            R0  R1            R0  R1
0   0             0   0             0   0             0   0
0   1             1   1             1   0             1   0
1   0             1   0             1   1             0   1
1   1             0   1             0   1             1   1

The truth table shows that the values in the two registers, R0 and R1, have been swapped after three consecutive XOR operations.
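The triple-XOR swap from Table 2 can be written directly in C; on z Systems the compiler typically lowers each ^= of 64-bit operands to an XGR instruction, although that mapping is not guaranteed. The helper name xor_swap is illustrative.

```c
#include <assert.h>

/* Swap two 64-bit values with three XOR operations and no temporary.
   Caveat: if a and b point to the same object, the first XOR zeroes it,
   so the technique is only valid for two distinct locations (or two
   distinct registers). */
void xor_swap(long long *a, long long *b)
{
   *a ^= *b;      /* 1st XOR */
   *b ^= *a;      /* 2nd XOR: *b now holds the original *a */
   *a ^= *b;      /* 3rd XOR: *a now holds the original *b */
}
```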

The assembly instructions

There are many assembly instructions that can be used to implement the algorithm. For an exhaustive list of the available instructions, refer to the book titled z/Architecture Principles of Operation, IBM Publication No. SA22-7832-10. Among the long list of possible assembler instructions on IBM z Systems™, this article selects the following instructions to implement the Fibonacci sequence computation.

XGR R0, R1:

  • The EXCLUSIVE OR of the values stored in the first and second registers is placed at the first-operand location (R0).
  • The value stored in the 2nd operand (R1) stays unchanged.
  • The operands and the result are 64-bit in length.

AGR R0, R1:

  • The second operand is added to the first operand, and the sum is placed at the first-operand location.
  • The value stored in the 2nd operand (R1) stays unchanged.
  • The operands and the result are treated as 64-bit signed binary integers.

BRCT R0, branch address:

  • Value 1 is subtracted from the first operand, and the result is placed back at the first-operand location.
  • When the result is zero, the execution proceeds with normal instruction sequence.
  • When the result is not zero, the instruction address in the current program status word (PSW) is replaced by the branch address.

The inline assembly code

The actual inline assembly code to compute the nth Fibonacci number is shown in Listing 11.

Listing 11: Inline assembly approach to compute the nth Fibonacci number

long long fib_asm(int upto) {
    if ( upto < 2 ) return upto;                 //F0 = 0, F1 = 1
    int mycount = upto - 1;
    long long f0 = 0L, f1 = 1L;
    asm ( "0:                 \n"                 //back to this when mycount > 0
          "XGR   %0, %1       \n"                 //1st XOR
          "XGR   %1, %0       \n"                 //2nd XOR
          "XGR   %0, %1       \n"                 //3rd XOR
          "AGR   %1, %0       \n"                 //Add 2 terms, store to f1
          "BRCT  %2, 0b       \n"                 //back to 0 if mycount > 0
          :"+r"(f0), "+r"(f1), "+r"(mycount)
          :
          :"cc" );                                //AGR sets the condition code
    return f1;                                    //return f1
}

The values f0 and f1 are loaded with 0 and 1, respectively, before the loop begins. The variable mycount holds the number of iterations to be performed, which is one less than the index of the required Fibonacci number. At the beginning of each iteration, f0 holds the smaller term and f1 holds the greater term of the Fibonacci sequence. During each iteration, the algorithm swaps the values of f0 and f1 through three XOR operations; as a result of the swap, f0 holds the greater term and f1 holds the smaller one. The AGR operation then sums up the values of f0 and f1 and stores the sum back into f1. At the end of every iteration, f1 therefore holds the greater term and f0 holds the smaller term. The BRCT operation subtracts one from mycount and, while the result is not zero, branches back to label 0. When mycount reaches zero, the assembly block falls through. The program then returns the value stored in f1, which is the nth Fibonacci number.

Because the inline assembly iterates through each term of the Fibonacci sequence once, the complexity is in linear time.
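For readers without access to a z Systems machine, the data flow of Listing 11 can be modeled in portable C. The name fib_model and the do-while structure are illustrative stand-ins for the labeled assembly loop; each statement mirrors one instruction.

```c
#include <assert.h>

/* Portable model of the inline assembly loop: f0 and f1 stand in for
   the two registers. */
long long fib_model(int upto)
{
   if (upto < 2) return upto;              /* F0 = 0, F1 = 1 */
   int mycount = upto - 1;
   long long f0 = 0L, f1 = 1L;
   do {
      long long t = f0;                    /* models the triple XGR swap */
      f0 = f1;
      f1 = t;
      f1 += f0;                            /* models AGR %1, %0 */
   } while (--mycount > 0);                /* models BRCT %2, 0b */
   return f1;
}
```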

Performance expectation

Among the five implementations presented above, the matrix power algorithm, with logarithmic time complexity, ranks first in the theoretical comparison. The dynamic programming approach, the iterative algorithm, and the inline assembly implementation are in the middle with linear time complexity. The recursive algorithm is at the lower end due to its exponential time complexity.

Table 3: Runtime complexity

Approach       Recursive algorithm   Iterative algorithm   Dynamic programming   Inline assembly   Matrix power
Complexity     Exponential           Linear                Linear                Linear            Logarithmic

Based on the runtime analysis, it is expected that the actual performance is in line with the complexity when the sample size is sufficiently large.

Test program

Each implementation can be encapsulated in a function. The test program uses two variables as follows:

volatile clock_t start, end;

These record the times at which each function starts and returns. Five distinct variables are used to keep the elapsed time of each implementation as follows.

volatile float elapsedASM, elapsedI, elapsedR, elapsedM, elapsedD;

The calculation of the elapsed time is displayed in Listing 12.

Listing 12: Calculation of elapsed time

   start = clock();
   while ( myCount-- ) { resASM = fib_asm(limit); }
   end = clock();
   elapsedASM =  ((float) (end - start))*1000 / CLOCKS_PER_SEC ;

With the exception of the recursive algorithm, all implementations perform the computation of the 92nd term 10,000,000 times. The large loop count extends the execution so that the differences in elapsed time can be recorded properly. Also, because the value of any term beyond the 92nd exceeds the range of a signed 64-bit integer, the 92nd term is the practical upper limit for the Fibonacci computation with 64-bit integers. The recursive function is tested with much smaller parameters; it computes the 40th Fibonacci term only twice. Due to the exponential time complexity of the recursive algorithm, the smaller parameters are selected to ensure that the program completes within the range of the other approaches.

Refer to the complete test program for more details.

Actual performance

The actual performance results of the test, carried out on a z Systems machine running Linux at the IBM Toronto Software Laboratory, are shown in Figure 6.

Figure 6. Comparative performance

From Figure 6, it is evident that the inline assembly implementation outperforms all other approaches when computing the largest possible Fibonacci number. It is approximately four times faster than the dynamic programming and iterative approaches. Surprisingly, its performance is also better than that of the optimized matrix power algorithm. The performance advantage comes from the smaller footprint of the inline assembly implementation.

In theory, if the size of the input is sufficiently large, the performance of the optimized matrix power algorithm can surpass that of the inline assembly implementation. In reality, however, when the input to the application is limited by the sizes of the variables, the actual speed of the compact codes generated from the inline assembly implementation can exceed the performance of a faster algorithm. This is an example of using inline assembly to further accelerate the performance beyond the extent that the best algorithms can deliver.


Conclusion

In the particular case of computing the Fibonacci sequence, the inline assembly approach performs better than all the other algorithms, even though it is proven to be slower in theory. This is an example demonstrating that theoretical performance might not materialize in practice. Appropriate usage of inline assembly to fine-tune the most performance-critical code sections is an avenue to surpass the speed that the algorithms alone provide.

It is of vital importance to note that the performance gain from using inline assembly can only be realized through careful planning and extensive testing. Because the IBM XL compiler performs highly sophisticated optimizations based on multidimensional analysis, in most cases, using an appropriate level of compiler optimization might simply be the best option. The decision whether or not to use inline assembly must be based on empirical test data from the specific system. Furthermore, inline assembly should only be used for the most performance-critical code sections. Intimate knowledge of the application, together with thorough comprehension of the set of assembler instructions available on the system, are prerequisites to a successful implementation.


I thank Nha-Vy Tran for her advice and comments during the composition of this article.


