Notes and conclusions
As with the section on techniques, the compiler designers wanted to call attention to some specific points:
Managing complexity includes both interactions among different phases in simdization and interactions between the simdization code and other compiler optimizations.
The compiler uses an internal representation built on virtual-length vectors, which capture the effects of the different aggregation phases. Using generic operations in this internal representation makes it easier to support multiple platforms.
Auxiliary analysis and transformations are also important. These include alignment analysis, pointer analysis, and dependence analysis. Redundant conversion elimination has a significant effect. Finally, data layout optimizations can fix alignment problems and provide for stride-one access to data, reducing the overhead imposed by hardware constraints.
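To make the data layout point concrete, here is a minimal sketch in C of one such change: turning an array of structures into a structure of arrays. It is an illustration only, not the IBM compiler's transformation. The type and function names (point_aos, points_soa, aos_to_soa, and so on) and the capacity N are hypothetical, and the 16-byte alignment is an assumption chosen to match a 128-bit vector register.

```c
#include <stdalign.h>
#include <stddef.h>

#define N 1024  /* hypothetical capacity, chosen for illustration */

/* Array-of-structures layout: consecutive x values are three floats
   apart, so a loop over x alone has stride three. */
struct point_aos { float x, y, z; };

/* Structure-of-arrays layout: each field is contiguous (stride one)
   and starts on a 16-byte boundary (assumed vector width). */
struct points_soa {
    alignas(16) float x[N];
    alignas(16) float y[N];
    alignas(16) float z[N];
};

/* Copy n points (n <= N) from the AoS layout into the SoA layout. */
void aos_to_soa(const struct point_aos *in, struct points_soa *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* Strided access: only every third float in memory is touched. */
void scale_x_aos(struct point_aos *p, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        p[i].x *= k;
}

/* Stride-one access over an aligned array: the shape of loop that
   is easiest to simdize. */
void scale_x_soa(struct points_soa *p, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        p->x[i] *= k;
}
```

Both scaling loops compute the same values. The difference is that the second one reads and writes consecutive, aligned elements, so a vector load can fetch several useful values at once instead of one value plus two that the loop ignores.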
On a processor such as the PowerPC®, or the PPE of the Cell BE, it's important to make sure that simdized code is actually more efficient than the code it replaces; the setup costs of a simdized loop might outweigh its benefit. On the SIMD-only SPEs, however, this check matters much less, because scalar code has very high overhead. Thus, while it's important to track the overhead and complexity of the simdized code in general, the SPEs will nearly always end up using it even when it's somewhat expensive.
One hidden cost of simdization is that the resulting code might be much harder to optimize further, which can reduce performance.
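The trade-off between setup cost and benefit can be sketched as a simple break-even check. The following C fragment is a toy model under stated assumptions, not the cost model used by the IBM compiler: the structure loop_costs, its fields, and the function simdization_pays_off are all hypothetical, and costs are in abstract units rather than measured cycles.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cost parameters in abstract units; these are not
   values taken from the IBM compiler. */
struct loop_costs {
    double scalar_per_iter; /* cost of one scalar iteration */
    double simd_per_iter;   /* cost of one vector iteration */
    double simd_setup;      /* one-time cost, e.g. alignment handling */
    size_t lanes;           /* elements per vector iteration (nonzero) */
};

/* Estimated cost of running n iterations as scalar code. */
static double scalar_cost(const struct loop_costs *c, size_t n)
{
    return c->scalar_per_iter * (double)n;
}

/* Estimated cost of the simdized loop: setup plus one vector
   iteration per group of `lanes` elements, rounded up. */
static double simd_cost(const struct loop_costs *c, size_t n)
{
    size_t vec_iters = (n + c->lanes - 1) / c->lanes;
    return c->simd_setup + c->simd_per_iter * (double)vec_iters;
}

/* True when the simdized loop is estimated to be cheaper. */
bool simdization_pays_off(const struct loop_costs *c, size_t n)
{
    return simd_cost(c, n) < scalar_cost(c, n);
}
```

With made-up numbers such as a setup cost of 20, a vector iteration cost of 1.5, four lanes, and a scalar iteration cost of 1, a short loop of 8 iterations is estimated to be cheaper as scalar code, while a loop of 100 iterations favors the simdized version. Raising the scalar iteration cost, as a stand-in for the scalar overhead on an SPE, tips even the short loop toward simdization.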
Figure 15. Speedups obtained through automatic simdization
This chart shows the speedups of optimized code with automatic simdization over scalar code, running on a single SPE. Speedup here is the scalar running time divided by the simdized running time; for instance, if the scalar code ran in four seconds and the simdized code ran in one second, the speedup would be four.
The IBM compiler has a workable integrated and modular approach to simdization. It extracts SIMD parallelism at multiple levels, efficiently handles constraints such as alignment and data conversion, and can target multiple ISAs, such as VMX and the SPU. The next tutorial in this series looks at partitioning and parallelization techniques to allow code to run on multiple processor elements at once, bringing us from optimizations of code for a single element into optimizations for the processor as a whole.
This tutorial series is based on the original presentation "Optimizing Compiler for the Cell Processor," given at PACT 2005 by Alexandre Eichenberger, Kathryn O'Brien, Kevin O'Brien, Peng Wu, Tong Chen, Peter Oden, Daniel Prener, Janice Shepherd, Byoungro So, Zehra Sura, Amy Wang, Tao Zhang, Peng Zhao, and Michael Gschwind of IBM Research.
This Part 3 is based on the section "Automatic Simdization."
Cell Broadband Engine is a trademark of Sony Computer Entertainment Inc.

