Conclusions
The SPE has a number of unusual architectural characteristics. It is a SIMD-only processor, it has no hardware branch prediction, it can dual-issue instructions only under restrictive conditions, and its memory is single-ported, so data accesses and instruction fetch compete for the same port. These characteristics impose a number of duties on the compiler, and the nature of these tasks makes them very hard for programmers to hand-tune.
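To make the "SIMD-only" constraint concrete, here is a minimal sketch that models a memory which, as assumed for the SPE local store, can be read and written only in aligned 16-byte quadwords. A scalar 32-bit load has to fetch the enclosing quadword and extract the word, and a scalar store becomes a read-modify-write of the whole quadword. All names are hypothetical, and the real hardware moves a loaded scalar into a register slot with a rotate rather than a byte copy; the model only shows why even plain scalar code needs compiler help.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: memory is accessed only as aligned 16-byte quadwords. */
typedef struct { uint8_t b[16]; } quad_t;

/* Quadword load: the low 4 address bits are ignored, as with a 16-byte-aligned access. */
static quad_t load_quad(const uint8_t *mem, uint32_t addr) {
    quad_t q;
    memcpy(q.b, mem + (addr & ~15u), 16);
    return q;
}

/* Quadword store: writes all 16 bytes of the enclosing quadword. */
static void store_quad(uint8_t *mem, uint32_t addr, quad_t q) {
    memcpy(mem + (addr & ~15u), q.b, 16);
}

/* Scalar 32-bit load: fetch the enclosing quadword, then extract one word. */
static uint32_t load_word(const uint8_t *mem, uint32_t addr) {
    assert((addr & 3u) == 0); /* this model only handles naturally aligned words */
    quad_t q = load_quad(mem, addr);
    uint32_t v;
    memcpy(&v, q.b + (addr & 15u), 4);
    return v;
}

/* Scalar 32-bit store: read the quadword, patch one word, write the quadword back. */
static void store_word(uint8_t *mem, uint32_t addr, uint32_t v) {
    assert((addr & 3u) == 0);
    quad_t q = load_quad(mem, addr);
    memcpy(q.b + (addr & 15u), &v, 4);
    store_quad(mem, addr, q);
}
```

Because every scalar store rewrites 16 bytes, the neighbouring words in the same quadword must be preserved by the read-modify-write sequence, which is exactly the kind of detail a compiler can hide from the programmer.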
Compiler involvement in high performance
The compiler is heavily involved in accommodating the unusual requirements of the SPE. It handles scalar code automatically, both by hiding the complexity of executing true scalar code on SIMD-only hardware and by autovectorizing some code. Bundling instructions to satisfy the dual-issue restrictions can improve performance dramatically. Using predication to handle simple if-then-else structures, combined with good branch hinting, dramatically improves the performance of algorithms that depend on branching. Finally, managing the memory port shared between data accesses and instruction fetch helps reduce instruction starvation, noticeably improving the performance of some algorithms.
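The predication idea can be sketched in portable C rather than SPU intrinsics: compute both arms of a simple if-then-else, turn the comparison into an all-ones or all-zeros mask, and pick the result with a bitwise select, so no branch (and no misprediction) is involved. This mirrors the compare-then-select-bits pattern the SPU provides in hardware (`selb`); the function names below are hypothetical, and a compiler would apply the transformation to a whole 128-bit vector at once instead of looping over lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Branchy original: if (a > b) r = a - b; else r = b - a;
   Unsigned arithmetic avoids signed-overflow undefined behavior. */
static int32_t absdiff_branch(int32_t a, int32_t b) {
    if (a > b) return (int32_t)((uint32_t)a - (uint32_t)b);
    return (int32_t)((uint32_t)b - (uint32_t)a);
}

/* Bitwise select: take bits from t where mask is 1, from f where mask is 0. */
static uint32_t select_bits(uint32_t f, uint32_t t, uint32_t mask) {
    return (t & mask) | (f & ~mask);
}

/* If-converted version over four 32-bit lanes (one 128-bit register's worth). */
static void absdiff4(const int32_t a[4], const int32_t b[4], int32_t r[4]) {
    for (int i = 0; i < 4; i++) {
        uint32_t then_val = (uint32_t)a[i] - (uint32_t)b[i]; /* "then" arm */
        uint32_t else_val = (uint32_t)b[i] - (uint32_t)a[i]; /* "else" arm */
        /* Comparison yields 0 or 1; negating gives 0x00000000 or 0xFFFFFFFF. */
        uint32_t mask = -(uint32_t)(a[i] > b[i]);
        r[i] = (int32_t)select_bits(else_val, then_val, mask);
    }
}

/* Checks that the predicated version agrees with the branchy one lane by lane. */
static void check_same(const int32_t a[4], const int32_t b[4]) {
    int32_t r[4];
    absdiff4(a, b, r);
    for (int i = 0; i < 4; i++)
        assert(r[i] == absdiff_branch(a[i], b[i]));
}
```

The trade-off is that both arms are always executed, which is why this approach suits only simple if-then-else structures; longer or unbalanced branches are better served by branch hinting.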
As of this writing, the net performance improvement is good across the board, with algorithms showing an average improvement of just over 20%. Some algorithms, such as matrix multiply and saxpy, show dramatic improvements of 30% to 50%. Continued development will likely yield further gains.
The next tutorial in this series will look in detail at the issues faced in extracting parallelism from code to run it efficiently on SIMD processors.
This tutorial series is based on the original presentation Optimizing Compiler for the Cell Processor given at PACT 2005 by Alexandre Eichenberger, Kathryn O'Brien, Kevin O'Brien, Peng Wu, Tong Chen, Peter Oden, Daniel Prener, Janice Shepherd, Byoungro So, Zehra Sura, Amy Wang, Tao Zhang, Peng Zhao, and Michael Gschwind of IBM Research.
This Part 2 is based on the section "SPE Optimizations."
Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.

