Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

From the stacks: TCP/IP checksum vectorization using AltiVec, Part 1

Scalar optimizations and simple vectorization

Ayal Zaks, Researcher, IBM, Software Group
Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations.
Dorit Naishlos, Researcher, IBM, Software Group
Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations.
Daniel Citron, Researcher, IBM, Software Group
Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations.

Summary:  This two-part article demonstrates the kinds of performance gains AltiVec can produce on the TCP/IP checksum, or on code similar to it. It gives special attention both to instructions that help improve performance, and to general unrolling and scheduling techniques. The net result? Performance increased by a factor of four.

Date:  26 Oct 2004
Level:  Introductory

Activity:  7549 views
Comments:  

One of the key components in the TCP/IP protocol stack is the checksum computation, which ensures the integrity of the transferred data. This computation can be greatly accelerated with the use of single instruction, multiple data (SIMD) units prevalent in state-of-the-art processors. In particular, the AltiVec unit of the IBM® PowerPC® architecture is well-suited for this type of computation. This article analyzes a former vectorization effort, shows how it can be improved upon, and then enhances it further. For packets of up to 64KB, an average speedup of around 4.0 is obtained. The second part of this series also examines the performance gains obtained by hand coding the algorithm and discusses the issues that must be resolved for it to be autovectorized by the compiler.

The checksum field is part of the TCP header structure, and when a packet arrives at any final, or intermediate, destination a new checksum is calculated and compared to the checksum field. Any discrepancies indicate that the transferred data has been violated in some way. The checksum field is the 16-bit one's complement of the one's complement sum of all 16-bit words in the header and text. This lends itself to data level parallelization (DLP) that can be supported by the AltiVec unit in the IBM PowerPC processor line.

This work isn't the first attempt at vectorizing the checksum algorithm -- Motorola published a white paper that does exactly this on the AltiVec unit of its MPC7440 processor (see Resources). However, we propose to vectorize the algorithm to take full potential of the AltiVec unit on the PowerPC 970 processor. Based on the knowledge obtained during this study, we will suggest a technique for having the compiler autovectorize the checksum algorithm. The main contributions of the article are:

  • A superior scheme for vectorizing the checksum algorithm
  • A study of the limits of checksum vectorization
  • Autovectorization based on the proposed scheme

Scalar checksum

This section describes the basic scalar TCP/IP checksum algorithm and looks at techniques to enhance it. The final enhancement will be compared to the vectorized versions presented in the following sections.

Much has been said about TCP/IP and the role of the checksum integrity check. Whether the check is sound, and whether it is a computational bottleneck, are issues this article does not resolve. The focus is on delivering the highest performance possible for it.

The basic checksum algorithm as described in TCP/IP Illustrated, Volume 2: The Implementation receives a buffer of unsigned chars, accumulates 16-bit half-words into a 32-bit accumulator, and finally takes the one's complement of the one's complement sum. This implies that the test isn't sound beyond 64KB, which is the theoretical limit of an IP packet. The protocol specifies a practical upper limit of 576 bytes. Nevertheless, in other systems (Microsoft Windows, for instance) there is no such limit, so this article explores all buffer sizes until 64KB. Given that data transfer optimization is one of the key components of the protocol, assume that previous stages have aligned the buffer on 16-byte boundaries and padded it with zeros where necessary.

The basic 16-bit version is:


Listing 1: The TCP/IP checksum algorithm
ushort checksum16(uchar* data, int len)
{
    uint sum = 0;
    if ((len & 1) == 0)
        len = len >> 1;
    else
        len = (len >> 1) + 1;
    while (len > 0) {
        sum += *((ushort*)data);
        data += sizeof(ushort);
        len--;
    }
    sum = (sum >> 16) + (sum & 0xffff);
    sum += (sum >> 16);
    return(~sum);
}

This example assumes that the buffer is padded with a zero if its length isn't even (in all future code snippets, the length calculation will be omitted). This code can be potentially improved upon by accumulating 32-bit words into a 64-bit double word -- reducing the number of memory reads and additions by half (this scheme only works correctly for big-endian machines).


Listing 2: Accumulating 32 bits at a time
ushort checksum32 (uchar *data, int len) {
    unsigned long long sum = 0;
    len = len >> 2;
    while (len > 0) {
        sum += *((uint *)data);
        data += sizeof(uint);
        --len;
    }
    while (sum > 0xffffffffULL) {
        sum = (sum & 0xffffffffULL) + (sum >> 32);
    }
    sum = (sum & 0xffff) + (sum >> 16);
    sum += (sum >> 16);
    return (~sum);
}

The sum variable is defined as a long long data type in order to make sure the GCC compiler recognizes it as a 64-bit value. Figure 1 shows the speedups of the 32-bit version on two different Macintosh computers, a G4 and a G5 (Table 1 summarizes their main characteristics.). On both machines, the -O3 optimization ags were used (-mpowerpc64 was set on the G5), buffer sizes of up to 8KB (256 byte increments) were measured, and the checksums were run 1,000,000 times for each buffer size (which implies warm caches). The performance measured is the user CPU time.


Figure 1: Comparison of 32- to 64-bit checksum computation on Macintosh G4 and G5 computers
Figure 1: Comparison of 32- to 64-bit checksum computation on Macintosh G4 and G5 computers

The graphs show that no advantage is gained by using a 64-bit accumulator on the G4 because the implementation of the long long data type uses two registers and two adds per accumulation. On the G5, which possesses true 64-bit computing, the speedup is immediately apparent and constant at around 1.5.


Table 1. G4 and G5 main characteristics
AttributeG4G5
ProcessorMPC 7441PowerPC 970
MHz7001800
Word Size3264
L1 DCache size32KB32KB
L1 DCache line32B128B
L2 Cache256KB512KB
CompilerGCC 3.1GCC 3.3
OSDarwin 6.8Darwin 7.3

The resulting assembly code shows that loop unrolling isn't performed by the compiler. Adding the -funroll-loops that unrolls four iterations results in enhanced performance on both machines. However, on the G5, both scalar versions are improved by 100%, where on the G4 the 16-bit version is improved by 1.5, and the 32-bit version by 1.1. This means that on a G4 the 16-bit unrolled version is the scalar baseline to consider.


Figure 2: The AltiVec unit on the PowerPC 970
Figure 2: The AltiVec unit on the PowerPC 970

Vector processing on the PowerPC architecture

This article uses the name AltiVec to denote both the instruction set and implementing functional unit. The AltiVec unit contains thirty-two 128-bit registers. Operations can be applied to 8-, 16-, and 32-bit signed and unsigned integer values, single precision (32-bit) floating point (FP) values, and 16- and 32-bit pixel values. Thus, each instruction performs 16, 8, or 4 operations. The instructions are divided into five groups that are executed by different functional units. Figure 2 shows a block-level diagram of the AltiVec unit on the PowerPC 970 processor. Although there are four distinct functional units, only two instructions can be issued per cycle, and one is to the permute unit. All units are fully pipelined. Table 2 shows the units, instruction groups, and latencies. A complete description of the instruction set is available in PowerPC Microprocessor Family: AltiVec Technology Programming Environments Manual (see Resources).


Table 2. Latencies of AltiVec instructions on the G4 and G5.
UnitInstructionsG4 lat.G5 lat.
VSIUALU12
VCIUmul, div, sum, max45
VFPUFP58
VPERMpermute22
LSUload, store34

Load/Store instructions are handled by the processor's Load/Store Unit (LSU).

Hand coding in high-level languages is simple with the use of intrinsics, which are macros that enable direct insertion of AltiVec instructions into C code. All the following code examples will be written in C using intrinsics. For example, the following code snippet adds two vectors containing four signed integers each:


Listing 3: Vector add in C
vector int a,b,c;
c = vec_add(a,b);

Translates into:


Listing 4: Vector add in assembly
vadduwm v2,v0,1

If necessary, a,b will be loaded from memory:


Listing 5: Vector add with load

lvx v0,0,r3 #r3 contains &a
lvx v1,16,r3

Checksum vectorization

A previously published vectorized checksum algorithm was by Motorola on a MPC7455 processor. Eight int objects (32 bytes total) are read each cycle, added in pairs, and then accumulated into sum and carry accumulators. The main loop of the MOT algorithm is (all variables are defined as vector unit):


Listing 6: Motorola's algorithm (MOT)
while (vector_count-- >0) {
    VD1 = *((vector uint*)data);
    VD2 = *((vector uint*)(data+16));
    data += 32;
    V_CC = vec_addc(VD1,VD2);
    V_Temp = vec_add(VD1,VD2);
    V_CS = vec_addc(V_Temp,V_Sum);
    V_Sum = vec_add(V_Temp,V_Sum);
    VCAR = vec_add(V_CC,VCAR);
    VCAR = vec_add(V_CS,VCAR);
}

The absence of 64-bit operations implies that 32-bit addition results must be stored in two registers, resulting in an additional four vector add instructions used to manipulate these carries. Thus, two vector loads and six vector adds are run for every 32 bytes of data. This technique is naive on two counts: it doesn't try to use more powerful AltiVec instructions; and it doesn't use the fact that the TCP/IP buffer limit is 64KB.


Figure 3: Diagrams of the vsum4shs and vsumsuhm vector instructions
Figure 3: Diagrams of the vsum4shs and vsumsuhm vector instructions

The AltiVec instruction set offers two other options that can reduce the number of instructions (Figure 3 displays diagrams of them.):

vsum4shs vD,vA,vBVector Sum Across Partial (1/4) Signed Half Word Saturate
The four pairs of half-words in vA are added together and then added to the four words in vB. The result is stored in vD. This instruction is exactly what you need, eight 16-bit values are accumulated into four 32-bit values. However, there are two catches: the results saturate when reaching 2^31-1, and the additions are signed. The first obstacle can be overcome if you assume that the number of data items is smaller than 32KB, the second obstacle results in an incorrect checksum.

vmsumuhm vD,vA,vB,vCVector Multiply Sum Unsigned Half Word Modulo
Each of the eight half-words in vA and vB are multiplied together, the corresponding eight 32-bit results are added in pairs and then added to the four words in vC. The result is stored in vD. At first glance, this instruction is totally irrelevant to the problem: there is no need in multiplication and the 32-bit adds will overflow. However, if vB is a vector set to contain only 1s, the ensuing products won't overflow when added together, nor will the accumulator register vC if the number of additions is smaller than 64KB (which is the case in TCP/IP checksum computation).

By using the vec_msum intrinsic (that implements the vmsumuhm instruction), every 16 bytes are processed by one vector load and one complex vector instruction. This algorithm will be denoted as the IBM scheme (VD1 and One are of type vector ushort, V Sum is a vector uint):


Listing 7: IBM's algorithm (IBM)
while (vector_count-- >0) {
    VD1 = *((vector ushort*)data);
    data += 16;
    V_Sum = vec_msum(VD1,One,V_Sum);
}

We compared the algorithms above to the best scalar version on a G4. Three levels of granularity are displayed: 16-byte, 256-byte and 1-KB increments. The two vector versions were unrolled as well in order to provide a fair comparison (although the unrolled versions provide a maximal speedup of 10% and a slight slowdown when the buffer sizes are small). You can conclude the following from Figure 4:

  • The IBM scheme matches the scalar scheme for a 32-byte buffer size, the MOT scheme matches only after 64 bytes.
  • The simpler IBM scheme outperforms MOT due to its shorter and simpler loop (can start computation after 16 bytes are read) and epilogue (fewer long long adds). At around 1KB, MOT finally outperforms it due to fewer iterations.
  • Until 32KB, MOT slightly outperforms IBM, but then the trend is reversed. The knee in the graph is probably an artifact of the data cache size, which is 32KB; the buffer is flushed from the cache since the last run. The cache line size is 32 bytes, thus the MOT scheme encounters a cache miss every iteration. The IBM version encounters a cache miss only on the even iterations.

Figure 4: MOT and IBM vector schemes vs.16-bit scalar on a G4 (all versions are unrolled)
Figure 4: MOT and IBM vector schemes vs. 16-bit scalar on a G4 (all versions are unrolled)

A similar comparison was performed on a G5, where the 32-bit scalar scheme was compared to the IBM and MOT schemes (all unrolled). The speedups are displayed in Figure 5 (one graph, 256B increments until 64KB). The main difference is that MOT outperforms IBM faster and even after the 32KB buffer size. This can be attributed to the larger line sizes (128 bytes) of the G5 cache and the hardware prefetching mechanism.


Figure 5: MOT and IBM vector schemes vs. 16-bit scalar on a G5 (all versions are unrolled)
Figure 5: MOT and IBM vector schemes vs. 16-bit scalar on a G5 (all versions are unrolled)

Summary, and what lies ahead

While the Motorola algorithm is perhaps naive, it works reasonably well on sample data. In the next article in this series, you'll see how this code can be better unrolled and vectorized, and how recent versions of GCC can be persuaded to generate the right vectorized code automatically.


Resources

About the authors

Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations.

Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations.

Ayal Zaks, Dorit Naishlos, and Daniel Citron are IBM Haifa Labs researchers specializing in microarchitecture and compiler development and optimizations.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Multicore acceleration
ArticleID=32004
ArticleTitle=From the stacks: TCP/IP checksum vectorization using AltiVec, Part 1
publish-date=10262004
author1-email=
author1-email-cc=
author2-email=
author2-email-cc=
author3-email=citron@il.ibm.com
author3-email-cc=