Part 1 in this series considered a simple vectorized version of the TCP/IP checksum algorithm and proposed a possible enhancement. This article shows how the enhancements can be better unrolled and expanded to produce dramatically increased performance and how compilers can produce this code with a little friendly hinting.
The comparison between the MOT and IBM schemes is somewhat unfair. MOT reads 32 bytes per iteration, whereas IBM reads only 16. Loop unrolling doesn't help much because the accumulators force serialization between iterations. An enhancement, named IBM2, reads and computes 32 bytes per iteration:
Listing 1: IBM algorithm, revised (IBM2)
while (vector_count-- >0) {
VD1 = *((vector ushort*)data);
VD2 = *((vector ushort*)(data+16));
data += 32;
V_Sum1 = vec_msum(VD1,One,V_Sum1);
V_Sum2 = vec_msum(VD2,One,V_Sum2);
}
|
Figure 1: MOT2 vs. IBM2 vector schemes on a G5: 32 bytes are processed each iteration into two distinct accumulator vectors
This scheme outperforms MOT for every buffer size. However, now IBM2 has an unequal advantage, the data is summed into two distinct accumulators.
This leads to the following improvement in MOT (MOT2):
Listing 2: Motorola's algorithm, revised (MOT2)
while (vector_count-- >0) {
VD1 = *((vector uint*)data);
VD2 = *((vector uint*)(data+16));
data += 32;
V_CC = vec_addc(VD1,V_Sum);
V_Sum = vec_add(VD1,V_Sum);
VCAR = vec_add(V_CC,VCAR);
V_CC2 = vec_addc(VD2,V_Sum2);
V_Sum2 = vec_add(VD2,V_Sum2);
VCAR2 = vec_add(V_CC2,VCAR2);
}
|
Every 16-byte chunk is accumulated into a separate sum and carry pair.
Even so, IBM2 outperforms it on both a G4 and G5 by an average of 25%, as Figure 1 shows. This trend continues even beyond buffer sizes of 32KB
(not shown in graph), so the distinction isn't in memory processing but
rather in the execution pipeline. The IBM2 scheme can pipe two vec_msum instructions without causing any bubbles in
the execution pipe. The MOT scheme tries to pipe six instructions, some of
them dependent, through the execution pipeline. The pipeline "chokes" and
stalls.
The study in the previous section indicates that it might be possible to process even more data per iteration and accelerate performance by using additional independent accumulators. The following two schemes named MOT4 and IBM4 process 64 bytes per iteration:
Listing 3: More revisions (IBM4 and MOT4)
/* MOT4 */
while (vector_count-- >0) {
VD1 = *((vector uint*)data);
VD2 = *((vector uint*)(data+16));
VD3 = *((vector uint*)(data+32));
VD4 = *((vector uint*)(data+48));
data += 64;
V_CC = vec_addc(VD1,V_Sum);
V_Sum = vec_add(VD1,V_Sum);
VCAR = vec_add(V_CC,VCAR);
/* perform 3 more accumulations */
}
/* IBM4 */
while (vector_count-- >0) {
VD1 = *((vector ushort*)data);
VD2 = *((vector ushort*)(data+16));
VD3 = *((vector ushort*)(data+32));
VD4 = *((vector ushort*)(data+48));
data += 64;
V_Sum1 = vec_msum(VD1,One,V_Sum1);
/* perform 3 more accumulations */
}
|
Figure 2: MOT4 vs. IBM2 and IBM4 vector schemes on a G5: 64 bytes are processed each iteration into four distinct accumulator vectors. Results are normalized to MOT2
Figure 2 normalizes the performance of MOT4 and IBM4 to MOT2. Unrolling
the Motorola scheme isn't worthwhile and doesn't improve performance at
all. On the other hand, unrolling the IBM scheme provides over a 100%
performance boost until the L1 cache is saturated, and then the speedups
slowly degenerate until 1.5. The four, independent, vec_msum operations fit neatly into the five stage
Vector Complex Integer Unit (VCIU).
If so, it is possible that more unrolling can unearth even more performance. We manually unrolled the algorithm until we had 256 bytes processed per iteration into 16 distinct accumulators. Figure 3 shows the speedups relative to the basic IBM scheme. The upper graph is in buffer increments of 16 bytes and the lower of 256 bytes. Obviously, for small buffer sizes adding more than four accumulators is detrimental to performance. The IBM4 scheme yields the best performance in ranges until 256 and then matches the higher unrolled schemes until 1KB buffer sizes. At buffer size of over 1KB, IBM4 is comparable to the schemes that use 6, 8, and even 12 accumulators. However, the IBM16 scheme shows that the potential of unrolling the accumulators is as high as the number of vector registers available (32).
Figure 3: Unrolling of the IBM scheme into 2, 4, 6, 8, 12, and 16 accumulators per iteration (results are normalized to the basic IBM scheme)
Another advantage of the IBM vectorization scheme is that it can be applied automatically by a vectorizing compiler. Automatic vectorization of loops consists of analysis and transformation stages. The analysis stage checks that the loop is amenable for vectorization. This primarily requires proving that operations from different iterations can run in parallel without violating the semantics of the program. In the following transformation phase, scalar operations are replaced with their equivalent vector form, and the loop bound is appropriately modified.
For the scalar checksum loops, most of the vectorization properties required are relatively easy to establish:
- The control-flow structure of the loop is simple enough: an inner-most, single-basic-block loop, whose iteration count can be evaluated before the loop starts to execute.
- It is simple to determine the access pattern of the memory references in the loop. Furthermore, these memory accesses are consecutive, and require no special data manipulations.
- Proving that there are no data dependences between memory accesses in the loop is trivial, as there is only a single memory reference in the loop -- the read from memory through the data pointer.
The properties indeed simplify some of the aspects of
autovectorization; however there are other features in the loop that
present some difficulties. The accumulation into the variable sum creates a dependence between the write into sum in one iteration, and the read from sum in the subsequent iteration. Such dependence
cycles inhibit vectorization in general, but can be dealt with in the case
of accumulation using reduction pattern recognition.
Reduction operations compute a scalar result from a set of data
elements; they are generally vectorized by computing partial results
(partial sums in this case) in parallel, and combining them at the end into
a single result. To do this, the compiler needs to recognize the reduction
pattern, create the proper prologue to initialize the partial sums and
epilogue code to reduce them into a single result. Vectorizing the
particular reduction pattern that appears in the checksum loop is further
complicated by the presence of multiple data types. The data elements
operated upon are short, which means that eight elements will be loaded
per iteration (using AltiVec). However, the accumulation is performed on
int data types, only four of which can fit into
an AltiVec vector. Promotion or demotion of data types in vector code
requires packing or unpacking of data elements between different vectors,
which are relatively complex and generally not supported by vectorizing
compilers. However, as demonstrated before, on AltiVec this entire pattern
can be supported by the vec_msum operation --
the intermediate int sums of four pairs of
short data elements are added to four int
values and stored as a second vector of four int. To be able to use this instruction, a compiler
needs to recognize the entire pattern and map it to the vec_msum operation. The MOT scheme, on the other
hand, requires explicit handling of multiple types and 64-bit
accumulators, and is therefore less amenable to automatic
vectorization.
Using the techniques above, a compiler can autovectorize the checksum loop. It remains to be examined whether the compiler can also apply the more optimized IBM2 vectorization scheme, which involves unrolling the loop while introducing multiple vector accumulators. This transformation is in fact a general technique called modulo accumulator expansion or simply variable expansion, and can be applied to loops with an accumulation or induction pattern, whether in scalar or vector form, by the instruction scheduler. The fact that this optimization can be applied by a later compilation pass independent of the autovectorizer makes the IBM2 scheme feasible to incorporate into a compiler, and allows it to generate the most efficient vector code.
We have been developing the vectorization optimization in GCC. The basic infrastructure is in place to support initial vectorization capabilities. These capabilities include some of the features described above -- pattern recognition and pointer handling. Work is under way to extend these capabilities and to introduce more advanced vectorization features, including the handling of reduction and multiple-data types.
This article analyzes the performance potential of vectorizing the
TCP/IP checksum calculation using the IBM AltiVec instruction set.
Straightforward vectorization as exemplified by Motorola's scheme of
accumulating 32-bit values into sum and carry accumulators (due to the
lack of 64-bit computing) does not fully utilize the AltiVec unit. The
large number of vec_add instructions that are
required quickly "stuffs" the Vector Simple Instruction Unit (VSIU)
pipeline. Unrolling the loop to use independent accumulators does nothing
to alleviate the problem.
On the other hand, the scheme of using the vec_msum instruction, neatly fills the VCIU pipeline
and can scale even up to 16 independent accumulators (which uses all 32
vector registers available) for buffer sizes larger than 1KB. The IBM2
scheme (32 bytes per iteration) beats the comparable MOT2 scheme by 25%,
and the IBM4 scheme (64 bytes per iteration) outperforms the MOT4 scheme
by over 100%.
From the compiler's perspective, the IBM scheme is more amenable to autovectorization due to its simpler structure. The MOT scheme requires mapping of each scalar operation into two vector operations (for sum and carry), which complicates the vectorization process.
This study has uncovered several limitations and requirements of the AltiVec unit:
- Support for 64-bit arithmetic (or even just the storing of 64-bit results) would simplify checksum type computations.
- An unsigned and modulo version of the
vsum4shs(Vector Sum Across Partial (1/4) Signed Half Word Saturate) instruction is missing and could possibly enhance the checksum algorithm. However, implementing checksum with thevec_sum4sintrinsic instead of thevec_msumintrinsic, yielded wrong results but no performance improvement. Both instructions are executed by the VCIU and suffer from the same latency. - The current pipeline structure that enables issuing only one arithmetic instruction per cycle is an obstacle to performance gains. Adding a VSIU to the permute pipeline would be beneficial to the MOT family of algorithms.
The bottom line is that successful vectorization must utilize the richness of the AltiVec instruction set and use non-obvious solutions.
- Read Part 1 of this series on scalar optimizations and simple vectorization (developerWorks, October 2004).
- Apple maintains an AltiVec
Instruction Cross-Reference.
- Learn more about automatic
vectorization in GCC.
- While the MPC7441
is end-of-lifed, it was used in drafting this article. It's the earlier
revision of the "G4".
- Further discussion of loop unrolling and vector accumulators may be
found in Advanced
Compiler Design and Implementation, by Seven S. Muchnick. Page
562.
- Autovectorization was the topic of a talk by D. Naishlos at the 2004
GCC Developers' Summit.
- You can learn more about RFC 791, the original IP
protocol specification.
- The PowerPC 970 section of the IBM Microelectronics Technical Library offers background information, documentation, and application notes for the 9xx family of chips.
- This article refers to information taken from TCP/IP Illustrated,
Volume 1: The Protocols. The author's page has additional
information.
- This article refers to information taken from TCP/IP Illustrated,
Volume 2: The Implementation. The author's page has additional
information.
- Learn more about AltiVec from the PowerPC
Microprocessor Family: AltiVec Technology Programming Environments
Manual.
- This article refers to Jacob Pan's Enhanced TCP/IP Performance
technical report, published by Motorola in 2003. It doesn't seem to be
available online (any longer?), but the author gave a presentation
on Accelerating Networking Data Movement Using AltiVec Technology at
the Smart Networks Developer Forum 2003, which is available.
- Have experience you'd be willing to share with Power Architecture zone
readers? Article submissions on all aspects of Power Architecture technology from authors inside and outside
IBM are welcomed. Check out the Power Architecture author
FAQ to learn more.
- Have a question or comment on this story, or
on Power Architecture technology in general?
Post it in the Power Architecture technical forum
or send in a letter to the editors.
-
Get a subscription to the Power Architecture Community Newsletter when
you Join the Power Architecture community.
- All things Power are chronicled in the developerWorks Power
Architecture editors' blog, which is just one of many developerWorks
blogs.
- Find more articles and resources on Power Architecture
technology and all things
related in the developerWorks Power
Architecture technology content area.
- Download a IBM PowerPC 405 Evaluation Kit to demo a SoC in a simulated
environment, or just to explore the fully licensed version of
Power Architecture technology.
Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations.




