## The TCP/IP ChecksumThe TCP/IP checksum is used to detect corruption of data over a TCP or IPv4 connection. If a bit is flipped, a byte mangled, or some other badness happens to a packet, then it is highly likely that the receiver of that broken packet will notice the problem due to a checksum mismatch. This provides end-to-end assurance that the data stream is correct. IPv4 uses the checksum to detect corruption of packet headers. i.e. the source, destination, and other meta-data. The TCP protocol includes an extra checksum that protects the packet "payload" as well as the header. This is in addition to the header-checksum of IP. IPv6 does not use a checksum. This is due to the modern assumption that low-level transport protocols typically carry some form of error correction, together with the fact that higher-level protocols (like TCP) do as well. This makes IPv6 packet headers "leaner" than they might otherwise be (provided the size of the flow-label isn't altered). The algorithm for the TCP and IPv4 checksums is identical. The data is processed a word (16 bits, two bytes) at a time. (An odd number of bytes just has a trailing zero byte added to make the total number even.) The words are added using ones-complement arithmetic, with the location holding the checksum set to be zeros. Once this long chain of additions is complete, the result is negated in ones-complement by taking the binary not, and the result is stored in the right spot. If this operation is repeated then the result of the checksum should be the binary all-ones. The algorithm above has many interesting properties. The first is that since it is based on addition, it is both commutative and associative. This means that we can use some other calculation that effectively adds in some other order. By re-arranging the additions we might be able to increase performance. The second property is more subtle: The result is endian independent. A ones-complement addition can be performed on twos-complement machines (most of them) by doing a normal twos-complement addition, followed by adding the potential carry back in the least-significant position. This means that we effectively have a carry from the first byte into the second, and from the second byte into the first. This symmetry means that no matter what endian order we internally place on those two bytes, the calculation will give the same result. A final interesting property (not really used in this article) is that since the checksum is based on addition, we can alter a packet and update the checksum cheaply. We just need to do a ones-complement subtraction of the differences in the words we change. For example, this means that when a router updates the time-to-live (TTL) of a packet, it doesn't have to redo the whole checksum calculation. Evaluating the checksum can be a bottle-neck in network heavy environments. This is why many network interfaces offer the ability for the hardware to do the resulting checksum calculations. However, many network stacks do the checksum in software, which is why optimizing its calculation can be important. This article explores how that might be done. ## TCP/IP Checksum AlgorithmsC assumes two's complement arithmetic with its unsigned integers, and doesn't specify the format for signed integers, so we can't use ones-complement arithmetic directly. (This is for portability. Hardware might use either representation.) It also doesn't have access to the carry flag, or the add-with-carry instruction. However, if we use integer types wider than a 16-bit word, we can accumulate many carries. Due to the commutivity and associativity of addition, we can fold them all in at the end. (Note that we don't have to worry about overflow of the wider type because the size of packets is limited.) Some C code that does that looks like:
We can run the above 2
The above obvious algorithm handles 16-bits at a time. Usually, if you process more data at a time, then performance is better. Since we have a 64-bit machine, we would like to modify the algorithm to handle chunks of that size. The problem there is the carries. Fortunately, it isn't hard to come up with an algorithm to detect an (unsigned) overflow. If a+b < a, then a carry occurred. Thus the main loop is easily modified to add an extra one to the total when needed. Finally, we need to "fold" the result down from 64 to 16 bits at the end. Such an algorithm looks like:
As one might expect, this does a fair bit better:
It isn't quite the four-times faster that using four-times larger chunks might give, but it's close. The problem now is to do even better. However, since C doesn't have access to the carry flag, it becomes increasingly difficult to describe what we want the machine to do. Instead, we will make the jump to assembly language, which will give us the freedom to try any technique we wish. A direct translation of the above C code into assembly language by hand might look like the following:
The main loop is quite simple. It uses the %rax register to accumulate the result, using the add instruction to do so. Carries are handled by the adc instruction afterwards. The rest of the loop just updates the pointer into the buffer, and counts down the number of iterations to do. Handling the tail, and the final fold down to the 16-bit result are also simplified by the ability to use the adc instruction explicitly. Since the above is a faithful translation of the C code, you wouldn't expect too much of a difference in performance. That is the case, with the results being:
There is only a smidgeon of a difference, due to the intrinsic lower overhead of using assembly language. To go faster we need to use more of the flexibility available. One of the most obvious things to do is to optimize the main loop some more. If we take some care, we can let the carry flag be "live" throughout. This means we can get rid of the adc-with-zero, and turn that into an adc with the next thing to sum in the next iteration. However, the current code corrupts the carry flag with the other instructions in the loop. So we need to be creative and use the lea instruction to update the pointer, and the dec instruction to count down the loop. Both of those don't affect the carry.
The above shrinks the hot inner loop enormously, and is obviously going to be much faster. However, reality isn't so kind:
Ouch. The transformation had little to no effect at all. It turns out that on this machine adc is a very fast instruction. The extra adc instruction inside the loop overlapped in execution with the other operations, so removing it didn't speed anything up much at all. However, if that's the case, we can use that knowledge to make some improvements. If we unroll the loop once, and make two additions in it, we can perhaps double the speed:
This time, our suspicions are realized, and we get a significant performance increase:
Obviously, if unrolling once is good, unrolling even more must be better. If we unroll so that there are four additions in the inner loop, we get:
Again, we get more improvement. Not quite as much as last time, but enough to make it worth it:
Unrolling more doesn't seem to help. To increase the speed further we will need to think of something else to change. As we unroll we are increasing the length of the dependent instruction chain in the loop. If we can somehow break this chain into smaller pieces, then the cpu might be able to exploit more inter-instruction parallelism. Unfortunately, there is only a single carry flag. However, we can combine what we have with something using the algorithm in the original C code. Using lea instructions to accumulate into a wide register, we can get extra additions in the loop, without many dependencies. Such code looks like:
So does this save us any cycles? Unfortunately, it doesn't:
If anything, it is slightly worse than the previous case. Sometimes being tricky helps, sometimes it doesn't. This is a case of the latter. We will need to try something different to improve. ## VectorizationOne obvious optimization technique for cpu-bound problems is vectorization. If we can use the power of the SSE instructions then we can do more work in parallel, and hopefully increase performance. However, this is one drawback. The set of SSE instructions is anything but orthogonal. There is a high likelihood that the exact instruction you might want may not be implemented. Instead, we have to make do with what exists, and re-design our algorithm around that. In this case, the limiting factor is that the vectorization instructions have no concept of the carry flag (or of ones-complement arithmetic), so we can't use the best-performing technique above. However, we can still use the method of accumulating results into a double-length register. In this case, we can add 32-bit chunks to the two 64-bit halves of a 128-bit xmm register. In fact, we can go even further, and accumulate into two such registers simultaneously, further increasing parallelism. Such code looks like:
The above carefully overlaps use of the vectorization instructions with normal 64-bit arithmetic. (It is possible to use a purely vectorized algorithm, but that turns out to be slower.) However, the results are not encouraging:
This is much worse than not using vectorization at all. The problem here is that we are using three relatively expensive instructions (vectorized load, extract, add) to do the work of a single 64-bit adc. The fact that the vectorized instructions can execute with more parallelism doesn't help make up for their intrinsic inefficiency. We can unroll more, and do 64-bytes at a time:
This does a little better:
However, we still haven't caught up to the unvectorized code. We need to vectorize in a smarter way. The problem is that we are still using three "expensive" vector instructions to handle 64-bits of data. There is a way however, to use 4 vector instructions to work with 128 bits of data. This is quite an improvement, and might be 50% faster. The trick is to try to emulate the second C code that open-codes the add-with-carry instruction by using a comparison. There are plenty of comparison operations available, but none does exactly what we want. Basically, we would like a vectorized unsigned integer comparison, but all the integer comparison instructions are on signed integers. We can get around this by noticing that if a and b are unsigned integers then the signed comparison (a - msb) < (b - msb) will produce the correct result. (Basically, we are shifting zero to be at INT_MIN.) Since the msb (most significant bit) can be toggled with an addition, subtraction or an xor instruction, we have quite some flexibility here. We choose to have the accumulator in the form a + msb. Then, when we add a partial result, we get (a + b + msb). Thus, if we compare (a + b + msb) to (a + msb), we can detect a carry. Since the SSE comparison instructions set their "true" value to be all-ones, and that is equal to -1 in twos-complement arithmetic, we can subtract that to in effect add the carry as needed. Finally, when done, we can xor out the msb's to get the accumulated ones-complement results. So some lateral thinking gives:
This improved algorithm does better than the previous vectorized code:
Again, it still isn't fast enough. Perhaps the problem is that we now have extra overhead after the main loop ends in order to combine the two xmm register results. Let's just accumulate into one register:
This does a little better than before:
We are catching up on the unvectorized code. However, there is something rather important that we have ignored up until now. The SSE vectorized instructions depend on having their inputs 16-byte aligned. If that is not the case, then they will fail with a hardware exception. We need to make sure that the buffer is properly aligned. Sometimes that is trivial, just use an aligned allocator, or use an aligned attribute for a static or automatic buffer. However, the use-case for the TCP checksum is with networking. The data from packets can be unaligned, so we really need to handle that annoying case. If the buffer is mis-aligned, but still evenly-aligned, then we can use a similar algorithm. Just handle the beginning bytes until we are 16-byte aligned (or run out of bytes to checksum). Initialize the accumulator to that value, and continue on as normal with the vectorized code. However, if we are at an odd-alignment we run into trouble. We will have a partial 16-bit word to handle. Here, the fact that the checksum is endian-neutral saves us. We can accumulate part of the checksum in the "wrong" endian form, and then correct for that later. Thus the weakness that SSE instructions don't handle mis-aligned data severely complicates the code. However, we can still get something to work:
The cost of checking for alignment is overwhelmed by the other work the function needs to do, and the resulting performance doesn't change from the previous (incorrect) method:
Unfortunately, on this machine, it doesn't look like you can do much better with using the vectorized instructions. However, the above does exploit a trick that we haven't used with the non-vectorized code. By enforcing aligned accesses in the inner loop, we can clean up the code that deals with the odd-sized tail. Doing the same with the adc-based loop looks like:
This gives:
Which is faster than before, except for the odd-length case, which is slightly slower. If we could only make those cases more like the ones that are multiples of 64 bytes long (the unrolled loop step size), then we might be able to increase speed a little further. One way to do this is via the asm equivalent of Duff's device. We can jump into the middle of the loop when we start it. Unfortunately, we still need to handle lengths not exact multiples of the 64-bit register size in a special way, but we needed to do that anyway. Normally, you would construct a jump table, and then use it to branch to the correct spot in the inner loop. However, here, all the interesting instructions are the same length: four bytes long. This means we can do some arithmetic and emulate the equivalent of a computed goto and avoid doing a relatively expensive unpredictable memory lookup. The resulting code looks like:
This gives:
The above is very good, the best code so far on this machine. The only thing left to try is shrinking the inner loop. The current code processes 64 bytes at a time, like the vectorized version did. However, the old adc-based algorithm found that 32 bytes at a time was the best. Lets take one last look at what that might do:
The results don't change all that much:
The only major difference is the small 64-byte packet time has gone up due to it requiring two loop iterations instead of one. However, it is still faster than the earlier unvectorized code. On the other hand, we may want to keep with the 64-byte unrolled version due to this slight speed loss. ## SummaryAfter trying many many optimization tricks, we can find a method of calculating the TCP/IP checksum up to an order of magnitude faster than the original C code. Most of the gains were due to using computation on larger chunks of data (64-bits rather than 16 bits at a time). More gains were due to unrolling the inner loop, and controlling use of the carry flag. Small gains could be obtained by controlling alignment, and breaking slow-paths out of the main control flow. Unfortunately vectorization was not a win on this machine for this problem. The non-orthogonality of the SSE instructions can hurt quite a bit when the operation you need isn't supported. However, this isn't the full story. On more recent machines the adc instruction which is key to the unvectorized algorithm is much much slower. In addition, the vectorized instructions are much much faster, plus there are more of them doing new things. The combination of these changes could mean that one of the "slower" functions above could be better. If vectorization does prove to be better on your machine, perhaps you may want to explore adding the Duff's device trick to the best vectorized version. You may also want to explore using instructions added after SSE2, or perhaps even the 256-bit AVX extensions. As hardware changes over time, so does the most optimal version of an algorithm. Most of the time, it isn't worth it to hyper-optimize something. In some cases, where you absolutely need the performance edge, it can be highly profitable though. |

About Us | Returns Policy | Privacy Policy | Send us Feedback |

Company Info |
Product Index |
Category Index |
Help |
Terms of Use
Copyright © Lockless Inc All Rights Reserved. |

## Comments

yolo swag said...captcha is waay too hard

It is generally futile to unroll loops without breaking dependencies.

Sum loop based on adc is essentially one long chain of dependent ops.

I've run couple of brief tests on a Lynnfield box, here's how checksum6 performs:

<pre>

$ perf stat ./a.out 1000

10000000 loops; len=1000; res = ...

6827798474 cycles # 2.960 GHz

5574096465 stalled-cycles-frontend # 81.64% frontend cycles idle

4505422340 stalled-cycles-backend # 65.99% backend cycles idle

2550630533 instructions # 0.37 insns per cycle

# 2.19 stalled cycles per insn

...

More results. checksum1 and checksum2, compiled with gcc-4.5.2 -O2, are

~ 1.05 clocks/byte (IPC 2.86), and 0.4 clocks/byte (IPC 2.25), respectively.

checksum3 is ~ 0.53 clocks/byte (IPC 1.51), checksum6 ~ 0.68 clocks/byte (IPC 0.37)

I wrote a version of checksum with unrolled loop of four uint32 adds into

four different uint64 registers, disassembles like this:

400620: 44 8b 10 mov (%rax),%r10d

400623: 4c 01 d2 add %r10,%rdx

400626: 44 8b 50 04 mov 0x4(%rax),%r10d

40062a: 4d 01 d1 add %r10,%r9

40062d: 44 8b 50 08 mov 0x8(%rax),%r10d

400631: 4d 01 d0 add %r10,%r8

400634: 44 8b 50 0c mov 0xc(%rax),%r10d

400638: 48 83 c0 10 add $0x10,%rax

40063c: 4c 01 d1 add %r10,%rcx

40063f: 4c 39 d8 cmp %r11,%rax

400642: 75 dc jne 400620 <checksum77+0x30>

$ perf stat ./a.out 1000

10000000 loops; len=1000; res = ...

2740821769 cycles # 2.959 GHz

905954412 stalled-cycles-frontend # 33.05% frontend cycles idle

29838920 stalled-cycles-backend # 1.09% backend cycles idle

7565283438 instructions # 2.76 insns per cycle

# 0.12 stalled cycles per insn

...

</pre>

What cpu did you run your tests on?

Finally, I'd like to let you know that I'm a robot, and that I find your captchas way too simple!

I found your article on SSE2 optimization and IP checksum via Google as I have been interested in using Altivec/SSE2 version of IP checksum for many years. I took some of the ideas in article and coded up a slightly different version. This version uses three simple idea:

1) No alignment checking (causes branch mispredictions)

2) Use two independent memory strides by breaking buffer in half

3) Sum each stride separately (for multi-execution CPUs)

I originally tried verifying 64-bit sum version that has overflow checking. VTune showed a large number of branch mispredictions, so the 32-bit sum without overflow checking performs better.

Unfortunately, this code crashes using gcc -O3, but it runs correctly on VS2008. In tracking down the problem, it appears that gcc for 64-bit platforms has -sse optimization enabled by default. Since the code below doesn't care about integer alignment, it should technically work on Intel CPUs (not Sparc or PPC).

What I found interesting is that gcc uses SSE2 instruction when your article in-fact shows them as no-win. I suspect the code is crashing due to an SSE2 alignment issue (or gcc loop unrolling defect).

x00007ffff6b4c7c0 in in_cksum (data=0x68b2cd "\n\352\f8\220\204;\235\263\067P\030\372\360m\201", len=499,

initialValue=<optimized out>) at ../../modules/platform_linux64/platform/../../platform/checksum64.c:59

59 sum += *buf;

0x00007ffff6b4c7b4 <in_cksum+196>: 66 0f ef d2 pxor %xmm2,%xmm2

0x00007ffff6b4c7b8 <in_cksum+200>: 0f 1f 84 00 00 00 00 00 nopl 0x0(%rax,%rax,1)

=> 0x00007ffff6b4c7c0 <in_cksum+208>: 66 43 0f 6f 5c 15 00 movdqa 0x0(%r13,%r10,1),%xmm3

0x00007ffff6b4c7c7 <in_cksum+215>: 41 83 c3 01 add $0x1,%r11d

0x00007ffff6b4c7cb <in_cksum+219>: 66 0f 6f e3 movdqa %xmm3,%xmm4

0x00007ffff6b4c7cf <in_cksum+223>: 66 0f 6a da punpckhdq %xmm2,%xmm3

0x00007ffff6b4c7d3 <in_cksum+227>: 66 0f 62 e2 punpckldq %xmm2,%xmm4

0x00007ffff6b4c7d7 <in_cksum+231>: 66 0f d4 cc paddq %xmm4,%xmm1

0x00007ffff6b4c7db <in_cksum+235>: 66 0f d4 cb paddq %xmm3,%xmm1

uint_fast16 in_cksum(const unsigned char *data, size_t len, uint_fast32 initialValue) {

uint64 sum = initialValue;

uint64 sum2 = 0;

uint16 final;

uint_fast32 stride = ((len / sizeof(uint32)) * sizeof(uint32)) / (sizeof(uint32)*2);

const uint32* buf = (const uint32*)data;

const uint32* buf2 = (const uint32*)&buf[stride];

/* Break the buffer up into two different strides - sum each one independently */

for (; stride--; ++buf, ++buf2) {

sum += *buf;

sum2 += *buf2;

}

sum += sum2;

/* Roll second sum back into equation */

assert(buf2 >= data);

len -= (((const unsigned char*)buf2) - data);

data = (const unsigned char*)buf2;

if (len & 4) {

sum += *((const uint32*)data);

data += sizeof(uint32);

len -= 4;

}

if (len & 2) {

sum += *((const uint16*)data);

data += sizeof(uint16);

len -= 2;

}

if (len) {

sum += *data++;

--len;

}

assert(len == 0);

/* Convert 64-bit value into 16-bit CRC */

{

uint32 i1, i2;

uint16 i4;

i1 = (uint32)sum;

i2 = sum >> 32;

i1 += i2;

if (i1 < i2)

i1++;

final = (uint16)i1;

i4 = i1 >> 16;

final += (uint16)i4;

if (final < i4)

final++;

}

return final;

} /* in_cksum */

instead of

if (size)

{

unsigned char s = *(unsigned char *) buf;

sum += s;

if (sum < s) sum++;

}

shouldn't we have :

if (size & 1)

{

unsigned char s = *(unsigned char *) buf;

sum += s;

if (sum < s) sum++;

}

Eduardo Panisset

The excuse seems to be that "C" cannot access the carry flag. Then when that does not prove a real advantage, you go ahead and proceed with the standard loop unrolling techniques that are much easier to show in C...

I have a strong feeling a C implementation would be as fast, and 100 times more readable. Probably not so much fun (for you!) though ;-)

Of course you can add as much loop unrolling as you want to make this go faster.

Finally, you can also make things go faster if you use a prefetch in your loop to ensure that data is available at all times.

Of course a hand coded assembly solution will beat a C solution, but you should really try to get to the best C solution first.