## Contents

- Introduction
- Definition of Fibonacci sequence
- Recursive algorithm
- Dynamic programming approach
- The reversed traversal (bottom-up) version of dynamic programming
- Iterative algorithm
- The optimized matrix power algorithm
- Inline assembly implementation
- Performance expectation
- Test program
- Actual performance
- Conclusion
- Acknowledgements
- Resources
- References
- Downloadable resources

Using inline assembly with IBM XL C/C++ compiler for Linux on z Systems, Part 2

# Accelerate performance

An analysis based on the computation of Fibonacci sequence


Starting with version 1.1 in 2015, the inline assembly feature is supported by IBM XL C/C++ compiler for Linux on z Systems. The feature provides software engineers with direct access to the set of assembler instructions available at the chip-level. Using inline assembly, software engineers can specify with ease the optimal instructions to accelerate the performance of specific code sections beyond the extent provided by the algorithm or the compiler. This is essential for high performance applications, especially if the improved code sections are on the critical path of execution.

This article discusses and compares the performance of the inline assembly approach to computing the Fibonacci numbers with that of the following four algorithms, in which inline assembly is not used.

- The recursive algorithm with exponential time complexity
- The dynamic programming approach in linear time complexity
- The iterative implementation in linear time complexity
- The optimized matrix power algorithm in logarithmic time complexity

IBM XL compilers perform highly sophisticated optimizations based on multidimensional analysis of the user code. To insulate the comparison from the unexpected effects of the various optimizations automatically performed by the compiler, none of the compilations in this article specifies a compiler optimization level.

### In this series

The article assumes that the audience is familiar with inline assembly for Linux on z Systems. Software engineers might find the other articles in this series informative.

- Part 1: Primer: Assembler instructions to be embedded in C/C++ applications on mainframe computers
- Part 3: Basic features
- Part 4: Advanced features

## Definition of Fibonacci sequence

Named after the Italian mathematician Leonardo Bonacci (also known as Fibonacci), the Fibonacci sequence finds applications in multiple disciplines such as mathematics, architecture, geometry, and computer science, as well as in nature. It is a series of integers starting with the pair 0, 1 (or 1, 1); each subsequent number in the series is the sum of the previous two. In mathematical terms, the Fibonacci sequence (Fn) is defined by a recurrence relation with the first two terms being either F0, F1 or F1, F2, as follows:

F0 = 0, F1 = 1, Fn = F(n-2) + F(n-1) for n > 1

or

F1 = 1, F2 = 1, Fn = F(n-2) + F(n-1) for n > 2

Another augmented form of the Fibonacci numbers is the negaFibonacci sequence, where the indices are extended to negative numbers. Among the three definitions of the Fibonacci numbers, this article bases its analysis on the sequence of the more modern form F0, F1, …, Fn. The first few Fibonacci numbers are displayed in Table 1.

**Table 1: List of the first few Fibonacci numbers**

F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 | F11 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |

## Recursive algorithm

Based on the recurrence definition of the Fibonacci sequence, the recursive
algorithm to compute the n^{th} Fibonacci number is a
straightforward implementation.

**Listing 1: Pseudo code of recursive algorithm**

```
Fib(n):
    if n in range [0,1]
        return n
    else
        return Fib(n-2) + Fib(n-1)
    end if
```

The recursive algorithm maps naturally to the recurrence definition of the series. The runtime complexity of the recursive steps can be expressed in the following equation:

T(n) = T(n-1) + T(n-2) + Θ(1) ≥ 2·T(n-2), which gives T(n) = Ω(2^{n/2})

Thus, the complexity of this implementation is in exponential time.

**Listing 2: Recursive algorithm to compute the nth Fibonacci number**

```c
long long fib_r(int n)
{
    return n <= 1 ? n : fib_r(n-1) + fib_r(n-2);
}
```

##### Figure 1. Recursion tree of Fibonacci recursive algorithm

From the recursion tree of the algorithm displayed in Figure 1, it is
observable that some expressions are computed multiple times during the
process. For example, the expression Fib(n-2) is calculated during the
calls made by both Fib(n) and Fib(n-1). The repetitive computation is the
main reason why the recursive algorithm is in exponential time. If all the
redundant computations can be avoided, the time complexity will be
improved. In the case of Fibonacci sequence, the problem can be solved
with dynamic programming. Utilizing *memory* to retain the computed
values for later use, the dynamic programming technique speeds up the
execution by eliminating the repetitive computations occurring within the
recursive algorithm.

## Dynamic programming approach

To eliminate the redundant computations, dynamic programming utilizes
*Memoization* to retain the computed values for later use. With
a single computation for each term, the time complexity of dynamic
programming approaches is linear. This article discusses two well-known
dynamic programming approaches to compute the Fibonacci sequence: The
Memoization version and the bottom-up version.

### The Memoization version of the dynamic programming approach

As expressed in the name of the technique, this approach memorizes the computed values by storing them in some form of data structure. A new computation will only be needed if its result is not already in the memory of the algorithm. Although the Memoization approach still relies on the recursive steps, the memory look-up ensures that each expression is evaluated only once during the process. The memory look-up operation is assumed to have constant time complexity.

**Listing 3: Pseudo code of Memoization dynamic programming**

```
Mem = { }                       //Data structure storing computed values
Fib(n):
    if n in Mem                 //Return existing value from memory
        return Mem[n]           //Early exit when handling known values
    end if
    if n in range [0,1]         //Base step of the recursion
        fib ← n
    else                        //Recursive steps
        fib ← Fib(n-2) + Fib(n-1)
    end if
    Mem[n] ← fib                //Memorize new value
    return fib                  //Return the new computed value
```

Because each value of the Fibonacci number is computed only once, the runtime complexity of this approach is in linear time. The recursion tree of the Memoization version of dynamic programming approach, as displayed in Figure 2, is free from repetitive computations of known values.

##### Figure 2. The recursion tree of dynamic programming approach

Although the Memoization version can improve the runtime complexity, it still relies on recursive function calls. The costs to set up and manage function calls are ignored in the complexity analysis but they are not trivial in reality. The technique called bottom-up traversal, discussed below, can help avoid the extra cost of supporting function calls.
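As a sketch, the Memoization pseudo code in Listing 3 might be written in C as follows. The fixed-size cache, the cap of 92 terms (the largest index whose value fits in a signed 64-bit integer), and the name fib_memo are illustrative assumptions, not part of the article's listings:

```c
/* A minimal C sketch of the Memoization version of dynamic programming.
 * The fixed-size cache and the 92-term cap are illustrative assumptions;
 * the caller must keep n in the range 0..92. */
static long long cache[93];
static int known[93];

long long fib_memo(int n)
{
    if (n <= 1)
        return n;                                /* base step of the recursion */
    if (known[n])
        return cache[n];                         /* early exit for known values */
    cache[n] = fib_memo(n-2) + fib_memo(n-1);    /* recursive steps */
    known[n] = 1;                                /* memorize the new value */
    return cache[n];
}
```

Each index is computed at most once; every later request for the same index is served by the constant-time cache look-up.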

## The reversed traversal (bottom-up) version of dynamic programming

Instead of starting from the root at the top of the recursion tree, the
reversed traversal approach traverses the tree from the bottom up. In the
particular case of computing the Fibonacci numbers, the algorithm starts
at index 0 and traverses up to index n. It still uses a data structure to
memorize the computed values, but it replaces the recursive function
calls with look-up operations. By definition, the n^{th} Fibonacci number
is computed as Fn = F(n-2) + F(n-1). Because both the earlier terms
F(n-2) and F(n-1) can be retrieved from the memory of the algorithm, no
recursive function call is required.

**Listing 4: Pseudo code of reversed traversal dynamic programming**

```
Fib(n):
    Mem = {}                              //the memory
    for k in range [0,1,…,n]              //traverse bottom-up
        if k in range [0,1]               //compute each term
            Mem[k] ← k                    //starting with indices 0 and 1
        else
            Mem[k] ← Mem[k-1] + Mem[k-2]  //retrieve known values
        end if
    end for
    return Mem[n]
```

With the memory look-up preventing the repetitive computation of known terms, each term of the Fibonacci sequence is computed only once. The algorithm is, thus, in linear time complexity. In addition, because recursive function calls are replaced with memory look-up operations, the extra costs of setting up and managing recursive function calls are eliminated.

**Listing 5: Dynamic programming approach to compute the nth Fibonacci number**

```c
long long fib_d(int n)
{
    int k;
    long long memory[n+1];                //sized n+1 to hold indices 0..n
    for ( k = 0 ; k <= n ; ++k ) {
        if ( k <= 1 )
            memory[k] = k;
        else
            memory[k] = memory[k-1] + memory[k-2];
    }
    return memory[n];
}
```

## Iterative algorithm

Similar to the reversed traversal approach, the iterative algorithm iterates through the entire sequence from index 0 up to the last index n. During the process, it updates each term of the Fibonacci sequence with the sum of the previous two. The iterative algorithm, however, does not memorize the computed values. Figure 3 provides a visualization of the iterative algorithm used to compute the Fibonacci numbers.

##### Figure 3. Visualization of iterative algorithm

Because the requirement set forth for the program is to compute only the
n^{th} Fibonacci number, the step that updates the actual terms of the
Fibonacci sequence with the computed value *res* can be skipped in the
actual implementation. When required, it can be implemented with ease.

**Listing 6: Iterative approach to compute the nth Fibonacci number**

```c
long long fib_i(int n)
{
    int k;
    long long e0 = 0, e1 = 1, res = 0;
    for ( k = 0 ; k <= n ; ++k ) {  //iterate from 0 to n
        if ( k <= 1 )
            res = k;                //base case for F0 and F1
        else {
            res = e0 + e1;          //res is the sum of 2 previous terms
            e0 = e1;                //update the 2 previous terms
            e1 = res;               // -------- ditto --------
        }
    }
    return res;                     //return Fn
}
```

Because the iterative algorithm iterates through the entire Fibonacci sequence only once, its complexity is in linear time. Similar to the reversed traversal algorithm, it makes no recursive function calls.

## The optimized matrix power algorithm

This is, most likely, the algorithm with the best performance: it computes the Fibonacci numbers in logarithmic time complexity. The algorithm relies on the following mathematical manipulation.

##### Figure 4. Mathematical base of matrix power algorithm

The n^{th} Fibonacci number will be the element (0, 0) of the resulting
matrix after raising the matrix A to the power of (n-1). Because all
matrices in question are 2x2, matrix multiplication can be done in
constant time. An example code is displayed in Listing 7.
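The manipulation depicted in Figure 4 rests on the standard matrix identity for the Fibonacci recurrence, restated here for reference:

```latex
A = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix},
\qquad
A^{n} = \begin{pmatrix} F_{n+1} & F_{n} \\ F_{n} & F_{n-1} \end{pmatrix}
\quad \text{for } n \geq 1 .
```

Taking the element (0, 0) of A^{(n-1)} therefore yields Fn, as stated above.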

**Listing 7: Multiplication of 2x2 matrices in constant time**

```c
void multiply(long long M[2][2], long long N[2][2])
{
    long long a = M[0][0]*N[0][0] + M[0][1]*N[1][0];
    long long b = M[0][0]*N[0][1] + M[0][1]*N[1][1];
    long long c = M[1][0]*N[0][0] + M[1][1]*N[1][0];
    long long d = M[1][0]*N[0][1] + M[1][1]*N[1][1];
    M[0][0] = a; M[0][1] = b;
    M[1][0] = c; M[1][1] = d;
}
```

In order to raise the matrix A to the power of n, a
*straightforward* algorithm will run in linear time because it
must call the multiplication (n-1) times.

A^{n} = A A … A

This can be further optimized by using the following halving technique
(shown for even n; the implementation handles odd n with one extra
multiplication by A):

A^{n} = A^{(n/2)} · A^{(n/2)}

Thus, instead of raising the matrix A to the power of n, the algorithm
raises it to the power of (n/2) and performs the constant time
multiplication of A^{n/2} with itself.

**Listing 8: Matrix power algorithm in logarithmic time**

```c
void power(long long M[2][2], int n)
{
    if ( n == 0 || n == 1 )
        return;                 //no operation needed for base case
    long long A[2][2] = {{1,1},{1,0}};
    power(M, n/2);              //recursively call power on n/2
    multiply(M, M);             //a constant time multiplication
    if ( n % 2 != 0 )           //to handle odd n
        multiply(M, A);
}
```

By recursively calling the power function on n/2, the size of the input is reduced by half each time. Accordingly, the complexity will be in logarithmic time. Listing 9 shows an example implementation that uses optimized matrix power approach.

**Listing 9: Optimized matrix power approach to compute the nth Fibonacci number**

```c
long long fib_m(int n)
{
    long long A[2][2] = {{1,1},{1,0}};  //matrix A
    if ( n == 0 )
        return 0;                       //base case F0 = 0
    power(A, n-1);                      //raise A to power (n-1)
    return A[0][0];                     //Fn is element (0,0)
}
```
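The logarithmic bound claimed above follows from the runtime recurrence of the power function: each call halves n at the cost of a constant number of 2x2 multiplications.

```latex
T(n) = T\!\left(\lfloor n/2 \rfloor\right) + \Theta(1) = \Theta(\log n)
```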

## Inline assembly implementation

The inline assembly implementation is based on the iterative algorithm.
As seen in the visualization displayed in Figure 3, the iterative
algorithm computes the n^{th} Fibonacci number by adding the two terms,
the (n-2)^{th} and the (n-1)^{th}, standing just before it. After this
computation, the (n-2)^{th} term can be discarded because it is no longer
used in subsequent computations. This observation is utilized to
implement the inline assembly approach.

### The algorithm

The inline assembly implementation keeps the two most recent Fibonacci numbers in two separate registers. At each iteration, the algorithm computes the sum of the two most recent terms and stores the computed value in the register holding the smaller term, effectively overwriting it. Because the smaller term is not required for future computations, this overwrite is acceptable. The greater term is thus the result of that iteration. Figure 5 displays a visualization of the inline assembly approach.

##### Figure 5. Visualization of the inline assembly approach

As visually displayed in Figure 5, the implementation iterates over each term of the Fibonacci sequence once. The result of each iteration is stored in the register colored in red in Figure 5. The pseudo code of the implementation is shown in Listing 10.

**Listing 10: Pseudo code of the inline assembly implementation**

```
Fib(n):
    load F0 to register 0                      //smaller term in register 0
    load F1 to register 1                      //greater term in register 1
    for k in range [2,3,…,n]
        swap the values in the two registers   //register 1 holds smaller term
        sum up the values in the two registers //sum up the 2 terms
        store the sum to register 1            //register 1 holds the sum
    end for
    return register 1                          //return the sum
```

### Swapping two values with triple exclusive-or operations

There are various ways to swap the values stored in two registers, one of which being the straightforward technique using a temporary register or variable. There is, however, a faster way to swap the values without going through a temporary storage: the exclusive-or (XOR) swap algorithm using three consecutive XOR operations. The truth table of the triple XOR operations is exhibited in Table 2.

**Table 2: The truth table of triple exclusive-or operations** (step 1: R0 = R0 XOR R1; step 2: R1 = R0 XOR R1; step 3: R0 = R0 XOR R1)

Initial R0 | Initial R1 | R0 after step 1 | R1 after step 1 | R0 after step 2 | R1 after step 2 | R0 after step 3 | R1 after step 3 |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 |
1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |

The truth table shows that the values in the two registers, R0 and R1, have been swapped after three consecutive XOR operations.
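The triple-XOR swap can be demonstrated in portable C (a minimal sketch; the function name xor_swap is an illustrative assumption). Note that the technique must not be applied to two pointers aliasing the same location, because the first XOR would zero the shared value:

```c
/* Swap two 64-bit values using three XOR operations, without a
 * temporary variable. Illustrative sketch; do not call with a == b
 * (aliased operands would be zeroed by the first XOR). */
void xor_swap(unsigned long long *a, unsigned long long *b)
{
    *a ^= *b;    /* 1st XOR: a now holds a ^ b      */
    *b ^= *a;    /* 2nd XOR: b now holds original a */
    *a ^= *b;    /* 3rd XOR: a now holds original b */
}
```

This mirrors the three consecutive XGR instructions used in the inline assembly implementation below, with C variables standing in for the two registers.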

### The assembly instructions

There are many assembly instructions that can be used to implement the algorithm. For an exhaustive list of available instructions, refer to the book titled z/Architecture Principles of Operation, IBM Publication No. SA22-7832-10. Among the long list of possible assembler instructions on IBM z Systems™, this article selects the following instructions to implement the Fibonacci sequence computation.

**XGR R0, R1:**

- The EXCLUSIVE OR of the values stored in the first and second registers is placed at the first-operand location (R0).
- The value stored in the 2^{nd} operand (R1) stays unchanged.
- The operands and the result are 64-bit in length.

**AGR R0, R1:**

- The second operand is added to the first operand, and the sum is placed at the first-operand location.
- The value stored in the 2^{nd} operand (R1) stays unchanged.
- The operands and the result are treated as 64-bit signed binary integers.

**BRCT R0, branch address:**

- Value 1 is subtracted from the first operand, and the result is placed back at the first-operand location.
- When the result is zero, the execution proceeds with normal instruction sequence.
- When the result is not zero, the instruction address in the current program status word (PSW) is replaced by the branch address.

### The inline assembly code

The actual inline assembly code to compute the n^{th} Fibonacci
number is shown in Listing 11.

**Listing 11: Inline assembly approach to compute the nth Fibonacci number**

```c
long long fib_asm(int upto)
{
    if ( upto < 2 )
        return upto;                   //F0 = 0, F1 = 1
    int mycount = upto - 1;
    long long f0 = 0L, f1 = 1L;
    asm ( "0:            \n"           //back to this when mycount > 0
          "XGR  %0, %1   \n"           //1st XOR
          "XGR  %1, %0   \n"           //2nd XOR
          "XGR  %0, %1   \n"           //3rd XOR
          "AGR  %1, %0   \n"           //Add 2 terms, store to f1
          "BRCT %2, 0b   \n"           //back to 0 if mycount > 0
          : "+r"(f0), "+r"(f1), "+r"(mycount) );
    return f1;                         //return f1
}
```

The values f0 and f1 are loaded with 0 and 1, respectively, at the
beginning of the loop. Variable **mycount** holds the number of
iterations to be performed, which is one less than the index of the
required Fibonacci number. At the beginning of each iteration, f0 holds
the smaller term and f1 holds the greater term of the Fibonacci sequence.
During each iteration, the algorithm swaps the values of f0 and f1 through
three XOR operations. As a result of the swapping, f0 holds the greater
term and f1 holds the smaller one. The AGR operation sums up the values of
f0 and f1, and then stores the sum back to f1. At the end of any
iteration, f1 holds the greater term and f0 holds the smaller term. The
BRCT operation deducts one from **mycount** and, when the
result is not zero, branches back to label 0. When
**mycount** reaches zero, the assembly sequence
terminates. The program then returns the value stored in f1, which is the
n^{th} Fibonacci number.

Because the inline assembly iterates through each term of the Fibonacci sequence once, the complexity is in linear time.
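For comparison, the register-level algorithm can also be expressed in portable C. This is a sketch assuming no inline assembly; the name fib_swap is illustrative, and on real hardware the compiler, not the programmer, chooses the registers:

```c
/* Portable C sketch of the register-level algorithm used by fib_asm:
 * two variables play the roles of the two registers, swapped with the
 * triple-XOR technique before each addition. Illustrative only; the
 * article's version relies on the z Systems XGR/AGR/BRCT instructions. */
long long fib_swap(int upto)
{
    if ( upto < 2 )
        return upto;               /* F0 = 0, F1 = 1 */
    long long f0 = 0L, f1 = 1L;
    int mycount = upto - 1;
    do {
        f0 ^= f1;                  /* 1st XOR */
        f1 ^= f0;                  /* 2nd XOR: f1 now holds old f0 */
        f0 ^= f1;                  /* 3rd XOR: f0 now holds old f1 */
        f1 += f0;                  /* add the 2 terms, store to f1 */
    } while ( --mycount > 0 );     /* mirrors BRCT count-and-branch */
    return f1;
}
```

Each iteration maps (f0, f1) to (f1, f0 + f1), so after (n-1) iterations f1 holds the n^{th} term.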

## Performance expectation

Among the five implementations presented above, the matrix power algorithm with logarithmic time ranks first in terms of performance. Dynamic programming approach, iterative algorithm, and inline assembly implementation are in the middle with linear time complexity. The recursive algorithm will be at the lower end due to its exponential time complexity.

**Table 3: Runtime complexity**

Approach | Recursive algorithm | Iterative algorithm | Dynamic programming | Inline assembly | Matrix power |
---|---|---|---|---|---|
Complexity | Exponential | Linear | Linear | Linear | Logarithmic |

Based on the runtime analysis, it is expected that the actual performance will be in line with the complexity when the sample size is sufficiently large.

## Test program

Each implementation can be encapsulated in a function. The test program uses two variables as follows:

**volatile clock_t start, end;**

These two variables record the times at which each function starts and returns. Five distinct variables are used to keep the elapsed time of each implementation, as follows.

**volatile float elapsedASM, elapsedI, elapsedR, elapsedM, elapsedD;**

The calculation of the elapsed time is displayed in Listing 12.

**Listing 12: Calculation of elapsed time**

```c
start = clock();
while ( myCount-- ) {
    resASM = fib_asm(limit);
}
end = clock();
elapsedASM = ((float) (end - start)) * 1000 / CLOCKS_PER_SEC;
```

With the exception of the recursive algorithm, all implementations will
perform the computation of the 92^{nd} term 10,000,000 times. The large
loop count extends the execution so that the differences in elapsed time
can be recorded properly. Also, because the value of any term beyond the
92^{nd} exceeds the range of a 64-bit signed integer, terms in the low
90s are the practical upper limit for the Fibonacci computation with
64-bit integers. The recursive function will be tested with much smaller
parameters; it will compute the 40^{th} term of the Fibonacci sequence
only twice. Due to the exponential time complexity of the recursive
algorithm, the smaller parameters are selected to ensure that the program
completes in a time frame comparable to the other approaches.

Refer to the complete test program for more details.
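As a sketch of how the measurement in Listing 12 can be wrapped for reuse, the helper below times one implementation through a function pointer. The fib_i body is reproduced from Listing 6; the name time_fib and the wrapper itself are illustrative assumptions, not part of the article's test program:

```c
#include <time.h>

/* Iterative implementation reproduced from Listing 6. */
long long fib_i(int n)
{
    int k;
    long long e0 = 0, e1 = 1, res = 0;
    for ( k = 0 ; k <= n ; ++k ) {
        if ( k <= 1 ) res = k;
        else { res = e0 + e1; e0 = e1; e1 = res; }
    }
    return res;
}

/* Calls fib(limit) for `count` iterations and returns the elapsed
 * time in milliseconds, following the pattern of Listing 12. */
float time_fib(long long (*fib)(int), int limit, long count, long long *res)
{
    volatile clock_t start, end;
    start = clock();
    while ( count-- )
        *res = fib(limit);
    end = clock();
    return ((float)(end - start)) * 1000 / CLOCKS_PER_SEC;
}
```

Each of the five implementations can then be timed through the same helper, which keeps the measurement code identical across approaches.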

## Actual performance

The actual performance results of the test, carried out on a z Systems machine running Linux at the IBM Toronto Software Laboratory, are shown in Figure 6.

##### Figure 6. Comparative performance

From Figure 6, it is evident that inline assembly implementation exceeds all other approaches when computing the largest possible Fibonacci number. It is approximately four times faster than the dynamic programming and iterative approaches. Surprisingly, its performance is also better than that of the optimized matrix power algorithm. The performance advantage comes from the smaller footprint of the inline assembly implementation.

In theory, if the size of the input is sufficiently large, the performance of the optimized matrix power algorithm can surpass that of the inline assembly implementation. In reality, however, when the input to the application is limited by the sizes of the variables, the actual speed of the compact codes generated from the inline assembly implementation can exceed the performance of a faster algorithm. This is an example of using inline assembly to further accelerate the performance beyond the extent that the best algorithms can deliver.

## Conclusion

In the particular case of computing the Fibonacci sequence, the inline assembly approach performs better than all other algorithms, even when it is proven to be slower in theory. This is an example demonstrating that the theoretical performance might not be practical in reality. Appropriate usage of inline assembly to fine tune the most performance-critical code section is an avenue to surpass the default speed provided by the algorithms.

It is of vital importance to note the performance gain from using inline assembly can only be realized by means of careful planning and extensive testing. Because IBM XL compiler performs highly sophisticated optimizations based on multidimensional analysis, in most cases, using an appropriate level of optimization by the compiler might simply be the best option. The decision whether or not to use inline assembly must be based on the empirical test data on the specific system. Furthermore, inline assembly should only be used for the most performance-centric code sections. Intrinsic knowledge of the application together with thorough comprehension about the set of assembler instructions available on the system are prerequisites to a successful implementation.

## Acknowledgements

I thank Nha-Vy Tran for her advice and comments during the composition of this article.

## Resources

- Visit the IBM XL C/C++ for Linux on z Systems product pages for more information.
- Get connected. Join the Rational C/C++ Cafe community.

## References

- z/Architecture: Principles of Operation, IBM Publication No. SA22-7832-10
- IBM z/Architecture: Reference Summary, IBM Publication No. SA22-7871-08
- Inline assembly statements for XL C/C++ for Linux on z Systems, V1.1, 2015
- Anh Tuyen Tran (2015), Primer for assembler instructions to be embedded in C/C++ applications on mainframe computers, IBM developerWorks®
- Anh Tuyen Tran (2015), The basics of inline assembly for Linux on z Systems, IBM developerWorks
- Anh Tuyen Tran (2015), Advanced features of inline assembly for Linux on z Systems, IBM developerWorks

#### Downloadable resources

- Code sample (fibonacci-anhtran.c | 1.42 KB)