The TCP/IP Checksum
The TCP/IP checksum is used to detect corruption of data over a TCP or IPv4 connection. If a bit is flipped, a byte mangled, or some other badness happens to a packet, then it is highly likely that the receiver of that broken packet will notice the problem due to a checksum mismatch. This provides end-to-end assurance that the data stream is correct.
IPv4 uses its checksum to detect corruption of the packet header, i.e. the source and destination addresses and other metadata. The TCP protocol includes an extra checksum that protects the packet "payload" as well as the header, in addition to the IP header checksum. IPv6 drops the header checksum entirely. This is due to the assumption that lower-level link protocols typically carry their own error detection, together with the fact that higher-level protocols (like TCP) do as well. This makes IPv6 packet headers "leaner" than they might otherwise be (provided the size of the flow-label isn't altered).
The algorithm for the TCP and IPv4 checksums is identical. The data is processed a word (16 bits, two bytes) at a time. (An odd number of bytes is handled by padding with a trailing zero byte.) The words are added using ones-complement arithmetic, with the location that will hold the checksum treated as zero. Once this long chain of additions is complete, the result is negated in ones-complement by taking the binary NOT, and stored in the checksum field. If the calculation is then repeated over the data including the stored checksum, the ones-complement sum comes out as binary all-ones, which is how a receiver verifies a packet.
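As a minimal illustration of one step of that arithmetic (a sketch, not code from the benchmarks below), here is a single ones-complement addition of two 16-bit words, with the carry wrapped back around:

unsigned short ones_add(unsigned short a, unsigned short b)
{
    unsigned sum = (unsigned) a + b;        /* e.g. 0xFFFF + 0x0002 = 0x10001 */
    return (sum & 0xFFFF) + (sum >> 16);    /* end-around carry: 0x0001 + 1 = 0x0002 */
}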
This checksum algorithm has many interesting properties. The first is that, since it is based on addition, it is both commutative and associative. This means we are free to perform the additions in any order and any grouping we like. By re-arranging the additions we might be able to increase performance.
The second property is more subtle: the result is endian independent. A ones-complement addition can be performed on twos-complement machines (most of them) by doing a normal twos-complement addition, followed by adding any carry back in at the least-significant position. This means we effectively have a carry from the low byte into the high byte, and from the high byte back into the low byte. This symmetry means that no matter which endian order we internally impose on those two bytes, the calculation gives the same result (a quick check of this is sketched below).
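The following is an illustrative test harness (not part of the checksum code itself) that spot-checks this symmetry: byte-swapping both inputs of a ones-complement addition simply byte-swaps the result.

#include <assert.h>
#include <stdlib.h>

static unsigned short swap16(unsigned short x)
{
    return (unsigned short) ((x << 8) | (x >> 8));
}

static unsigned short ones_add16(unsigned short a, unsigned short b)
{
    unsigned sum = (unsigned) a + b;
    return (unsigned short) ((sum & 0xFFFF) + (sum >> 16));
}

int main(void)
{
    int i;
    for (i = 0; i < 1000000; i++)
    {
        unsigned short a = (unsigned short) (rand() ^ (rand() << 8));
        unsigned short b = (unsigned short) (rand() ^ (rand() << 8));
        /* Swapping the bytes of both inputs just swaps the bytes of the sum */
        assert(ones_add16(swap16(a), swap16(b)) == swap16(ones_add16(a, b)));
    }
    return 0;
}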
A final interesting property (not really used in this article) is that since the checksum is based on addition, we can alter a packet and update the checksum cheaply: we just need a ones-complement subtraction of the difference in the words we changed. For example, when a router updates the time-to-live (TTL) of a packet, it doesn't have to redo the whole checksum calculation.
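A sketch of that incremental update, following the RFC 1624 formula HC' = ~(~HC + ~m + m'); the function and parameter names here are illustrative, not from the code below:

unsigned short checksum_adjust(unsigned short check,
                               unsigned short old_word, unsigned short new_word)
{
    unsigned sum = (unsigned short) ~check;     /* recover the ones-complement sum */
    sum += (unsigned short) ~old_word;          /* "subtract" the old 16-bit word */
    sum += new_word;                            /* add the new 16-bit word */
    sum = (sum & 0xFFFF) + (sum >> 16);         /* fold the carries back in */
    sum = (sum & 0xFFFF) + (sum >> 16);
    return ~sum;
}

For the TTL case, old_word and new_word would be the 16-bit header word holding the TTL and protocol fields before and after the decrement.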
Evaluating the checksum can be a bottleneck in network-heavy environments. This is why many network interfaces offer checksum offload, letting the hardware do the calculation. However, many network stacks still do the checksum in software, which is why optimizing its calculation can be important. This article explores how that might be done.
TCP/IP Checksum Algorithms
C defines unsigned arithmetic modulo a power of two (which behaves like two's complement wrap-around), and doesn't specify the representation of signed integers, so we can't use ones-complement arithmetic directly. (This is for portability: hardware might use either representation.) It also gives no access to the carry flag, or to the add-with-carry instruction. However, if we use an integer type wider than the 16-bit word, we can accumulate many carries. Due to the commutativity and associativity of addition, we can fold them all in at the end. (Note that we don't have to worry about overflowing the wider type because the size of packets is limited.) Some C code that does that looks like:
unsigned short checksum1(const char *buf, unsigned size)
{
    unsigned sum = 0;
    int i;

    /* Accumulate checksum */
    for (i = 0; i < size - 1; i += 2)
    {
        unsigned short word16 = *(unsigned short *) &buf[i];
        sum += word16;
    }

    /* Handle odd-sized case */
    if (size & 1)
    {
        unsigned short word16 = (unsigned char) buf[i];
        sum += word16;
    }

    /* Fold to get the ones-complement result */
    while (sum >> 16) sum = (sum & 0xFFFF) + (sum >> 16);

    /* Invert to get the negative in ones-complement arithmetic */
    return ~sum;
}
We can run the above 2^24 (about 16 million) times as a benchmark, and time the result. (A sketch of the kind of timing harness used follows the table below.) Obviously, the times will vary depending on the size of the buffer used. We choose to look at 64-byte packets, 1023-byte packets and 1024-byte packets. The 64-byte case is indicative of the overhead for small packets. 1024 bytes tells us how performance varies in the large-packet limit. 1023 bytes helps diagnose the effect of odd-sized packets on the performance of the different checksum variants. Doing that, the results are:
Size | 64 | 1023 | 1024 |
Time (s) | 0.88 | 10.99 | 11.03 |
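Something along these lines was used; the buffer contents, iteration structure and use of clock() here are assumptions, not the article's exact harness:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

unsigned short checksum1(const char *buf, unsigned size);

int main(void)
{
    enum { SIZE = 1024, LOOPS = 1 << 24 };
    char *buf = malloc(SIZE);
    volatile unsigned short result;
    clock_t start, end;
    int i;

    if (!buf) return 1;
    for (i = 0; i < SIZE; i++) buf[i] = (char) i;   /* arbitrary test data */

    start = clock();
    for (i = 0; i < LOOPS; i++) result = checksum1(buf, SIZE);
    end = clock();

    printf("checksum %04x in %.2f s\n", (unsigned) result,
           (double) (end - start) / CLOCKS_PER_SEC);
    free(buf);
    return 0;
}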
The obvious algorithm above (checksum1) handles 16 bits at a time. Usually, if you process more data at a time, performance is better. Since we have a 64-bit machine, we would like to modify the algorithm to handle chunks of that size. The problem there is the carries. Fortunately, it isn't hard to detect an unsigned overflow: if a+b < a, then a carry occurred. Thus the main loop is easily modified to add an extra one to the total when needed. Finally, we need to "fold" the result down from 64 to 16 bits at the end. Such an algorithm looks like:
unsigned short checksum2(const char *buf, unsigned size)
{
    unsigned long long sum = 0;
    const unsigned long long *b = (unsigned long long *) buf;
    unsigned t1, t2;
    unsigned short t3, t4;

    /* Main loop - 8 bytes at a time */
    while (size >= sizeof(unsigned long long))
    {
        unsigned long long s = *b++;
        sum += s;
        if (sum < s) sum++;
        size -= 8;
    }

    /* Handle tail less than 8-bytes long */
    buf = (const char *) b;
    if (size & 4)
    {
        unsigned s = *(unsigned *)buf;
        sum += s;
        if (sum < s) sum++;
        buf += 4;
    }

    if (size & 2)
    {
        unsigned short s = *(unsigned short *) buf;
        sum += s;
        if (sum < s) sum++;
        buf += 2;
    }

    if (size)
    {
        unsigned char s = *(unsigned char *) buf;
        sum += s;
        if (sum < s) sum++;
    }

    /* Fold down to 16 bits */
    t1 = sum;
    t2 = sum >> 32;
    t1 += t2;
    if (t1 < t2) t1++;
    t3 = t1;
    t4 = t1 >> 16;
    t3 += t4;
    if (t3 < t4) t3++;

    return ~t3;
}
As one might expect, this does a fair bit better:
Size | 64 | 1023 | 1024 |
Time (s) | 0.29 | 2.90 | 2.93 |
It isn't quite the four-times speedup that using four-times larger chunks might suggest, but it's close. The problem now is to do even better. However, since C doesn't give access to the carry flag, it becomes increasingly difficult to describe what we want the machine to do. Instead, we will make the jump to assembly language, which gives us the freedom to try any technique we wish.
A direct translation of the above C code into assembly language by hand might look like the following:
.globl checksum3
.type checksum3,@function
.align 16
checksum3:
xor %eax, %eax
cmp $8, %esi
jl 2f
# The main loop
1: add (%rdi), %rax
adc $0, %rax
add $8, %rdi
sub $8, %esi
cmp $8, %esi
jge 1b
# Handle the tail
2: test $4, %esi
je 3f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
3: test $2, %esi
je 4f
xor %edx, %edx
movw (%rdi), %dx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
4: test $1, %esi
je 5f
xor %edx, %edx
movb (%rdi), %dl
add %rdx, %rax
adc $0, %rax
# Fold down to 16-bits
5: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
# Invert to get the final checksum
not %ax
retq
.size checksum3, .-checksum3
The main loop is quite simple. It uses the %rax register to accumulate the result, using the add instruction to do so. Carries are handled by the adc instruction afterwards. The rest of the loop just updates the pointer into the buffer, and counts down the number of iterations to do. Handling the tail, and the final fold down to the 16-bit result are also simplified by the ability to use the adc instruction explicitly.
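For reference, the routine is an ordinary function as far as C is concerned: under the System V AMD64 calling convention the buffer pointer arrives in %rdi, the 32-bit size in %esi, and the result is returned in %ax. A minimal sketch of hooking it up (the wrapper name is illustrative):

/* Declaration of the assembly routine above */
extern unsigned short checksum3(const char *buf, unsigned size);

unsigned short packet_checksum(const char *packet, unsigned len)
{
    return checksum3(packet, len);  /* %rdi = packet, %esi = len */
}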
Since checksum3 is a faithful translation of the C code, you wouldn't expect too much of a difference in performance. That is the case, with the results being:
Size | 64 | 1023 | 1024 |
Time (s) | 0.28 | 2.88 | 2.91 |
There is only a smidgeon of a difference, due to the intrinsically lower overhead of assembly language. To go faster we need to use more of the flexibility available. One of the most obvious things to do is to optimize the main loop some more. If we take some care, we can keep the carry flag "live" throughout the loop. This means we can get rid of the adc-with-zero, and turn it into an adc of the next value to sum in the next iteration. However, the current code clobbers the carry flag with the other instructions in the loop, so we need to be creative: use the lea instruction to update the pointer, and the dec instruction to count down the loop. Neither of those affects the carry flag.
.globl checksum4
.type checksum4,@function
.align 16
checksum4:
mov %esi, %ecx
xor %eax, %eax
# Divide by 8 to get the total number of iterations
shr $3, %ecx
je 2f
# Clear the carry before starting the loop
clc
# The new smaller main loop
1: adc (%rdi), %rax
lea 8(%rdi), %rdi
dec %ecx
jne 1b
# Fold in the final carry
adc $0, %rax
2: test $4, %esi
je 3f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
3: test $2, %esi
je 4f
xor %edx, %edx
movw (%rdi), %dx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
4: test $1, %esi
je 5f
xor %edx, %edx
movb (%rdi), %dl
add %rdx, %rax
adc $0, %rax
5: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
.size checksum4, .-checksum4
The above shrinks the hot inner loop enormously, and is obviously going to be much faster. However, reality isn't so kind:
Size | 64 | 1023 | 1024 |
Time (s) | 0.27 | 2.87 | 2.90 |
Ouch. The transformation had little to no effect at all. It turns out that on this machine adc is a very fast instruction. The extra adc instruction inside the loop overlapped in execution with the other operations, so removing it didn't speed anything up much at all. However, if that's the case, we can use that knowledge to make some improvements. If we unroll the loop once, and make two additions in it, we can perhaps double the speed:
.globl checksum5
.type checksum5,@function
.align 16
checksum5:
mov %esi, %ecx
xor %eax, %eax
# Now handle 16 bytes at a time
shr $4, %ecx
je 2f
clc
# The main loop now uses two 64-bit additions
1: adc (%rdi), %rax
adc 8(%rdi), %rax
lea 16(%rdi), %rdi
dec %ecx
jne 1b
adc $0, %rax
# We need to handle anything up to 15 tail bytes.
2: test $8, %esi
je 3f
add (%rdi), %rax
adc $0, %rax
add $8, %rdi
3: test $4, %esi
je 4f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
4: test $2, %esi
je 5f
xor %edx, %edx
movw (%rdi), %dx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
5: test $1, %esi
je 6f
xor %edx, %edx
movb (%rdi), %dl
add %rdx, %rax
adc $0, %rax
# Since we accumulate with 64-bits still, this doesn't change.
6: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
.size checksum5, .-checksum5
This time, our suspicions are realized, and we get a significant performance increase:
Size | 64 | 1023 | 1024 |
Time (s) | 0.20 | 1.54 | 1.56 |
Obviously, if unrolling once is good, unrolling even more must be better. If we unroll so that there are four additions in the inner loop, we get:
.globl checksum6
.type checksum6,@function
.align 16
checksum6:
mov %esi, %ecx
xor %eax, %eax
# 32 bytes at a time
shr $5, %ecx
je 2f
clc
# Lets make sure our quite-large loop is now aligned
.align 16
# Four 64-bit adds per iteration
1: adc (%rdi), %rax
adc 8(%rdi), %rax
adc 16(%rdi), %rax
adc 24(%rdi), %rax
lea 32(%rdi), %rdi
dec %ecx
jne 1b
adc $0, %rax
# Handle the 31 bytes or less remaining
2: test $16, %esi
je 3f
add (%rdi), %rax
adc 8(%rdi), %rax
adc $0, %rax
add $16, %rdi
3: test $8, %esi
je 4f
add (%rdi), %rax
adc $0, %rax
add $8, %rdi
4: test $4, %esi
je 5f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
5: test $2, %esi
je 6f
xor %edx, %edx
movw (%rdi), %dx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
6: test $1, %esi
je 7f
xor %edx, %edx
movb (%rdi), %dl
add %rdx, %rax
adc $0, %rax
7: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
.size checksum6, .-checksum6
Again, we get more improvement. Not quite as much as last time, but enough to make it worth it:
Size | 64 | 1023 | 1024 |
Time (s) | 0.18 | 1.10 | 1.08 |
Unrolling more doesn't seem to help. To increase the speed further we will need to think of something else to change. As we unroll, we increase the length of the dependent instruction chain in the loop. If we could somehow break this chain into smaller pieces, the CPU might be able to exploit more instruction-level parallelism. Unfortunately, there is only a single carry flag. However, we can combine what we have with the technique from the original C code: by using lea instructions to accumulate into another wide register, we can get extra additions into the loop without adding many dependencies. Such code looks like:
.globl checksum7
.type checksum7,@function
.align 16
checksum7:
mov %esi, %ecx
xor %eax, %eax
shr $5, %ecx
je 2f
# Use %r8 to accumulate as well
xor %r8, %r8
.align 16
1: movl 24(%rdi), %edx
# Add 32-bits into %r8
lea (%rdx,%r8),%r8
# Three adc's now into %rax instead of four
adc (%rdi), %rax
adc 8(%rdi), %rax
adc 16(%rdi), %rax
movl 28(%rdi), %edx
# Add another 32-bits into %r8
lea (%rdx,%r8),%r8
lea 32(%rdi), %rdi
dec %ecx
jne 1b
# Merge %r8 with other results in %rax
adc %r8, %rax
adc $0, %rax
2: test $16, %esi
je 3f
add (%rdi), %rax
adc 8(%rdi), %rax
adc $0, %rax
add $16, %rdi
3: test $8, %esi
je 4f
add (%rdi), %rax
adc $0, %rax
add $8, %rdi
4: test $4, %esi
je 5f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
5: test $2, %esi
je 6f
xor %edx, %edx
movw (%rdi), %dx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
6: test $1, %esi
je 7f
xor %edx, %edx
movb (%rdi), %dl
add %rdx, %rax
adc $0, %rax
7: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
.size checksum7, .-checksum7
So does this save us any cycles? Unfortunately, it doesn't:
Size | 64 | 1023 | 1024 |
Time (s) | 0.22 | 1.10 | 1.15 |
If anything, it is slightly worse than the previous case. Sometimes being tricky helps, sometimes it doesn't. This is a case of the latter. We will need to try something different to improve.
Vectorization
One obvious optimization technique for CPU-bound problems is vectorization. If we can use the power of the SSE instructions, then we can do more work in parallel and hopefully increase performance. However, there is one drawback: the set of SSE instructions is anything but orthogonal. There is a high likelihood that the exact instruction you want may not be implemented. Instead, we have to make do with what exists, and re-design our algorithm around that.
In this case, the limiting factor is that the vectorization instructions have no concept of the carry flag (or of ones-complement arithmetic), so we can't use the best-performing technique above. However, we can still use the method of accumulating results into a double-length register. In this case, we can add 32-bit chunks to the two 64-bit halves of a 128-bit xmm register. In fact, we can go even further, and accumulate into two such registers simultaneously, further increasing parallelism. Such code looks like:
.globl checksum8
.type checksum8,@function
.align 16
checksum8:
mov %esi, %ecx
xor %eax, %eax
shr $5, %ecx
je 2f
# %xmm7 will hold zeros so that the unpack instructions work nicely
xorpd %xmm7, %xmm7
# The two accumulation registers
xorpd %xmm0, %xmm0
xorpd %xmm2, %xmm2
clc
.align 16
1: movupd (%rdi), %xmm1
movupd 16(%rdi), %xmm3
# Add the non-16 byte aligned cases with adc instructions
adc 8(%rdi), %rax
adc 24(%rdi), %rax
# Distribute the lower 8 bytes into 2 4-byte chunks
punpckldq %xmm7, %xmm1
punpckldq %xmm7, %xmm3
lea 32(%rdi), %rdi
# Accumulate into the two xmm registers
paddq %xmm1, %xmm0
paddq %xmm3, %xmm2
dec %ecx
jne 1b
# Merge the %xmm register results
paddq %xmm2, %xmm0
# Extract the information in %xmm0 by using the red zone.
movapd %xmm0, -24(%rsp)
adc -24(%rsp), %rax
adc -16(%rsp), %rax
adc $0, %rax
# Handle the remaining possible tail of 31 bytes like before.
2: test $16, %esi
je 3f
add (%rdi), %rax
adc 8(%rdi), %rax
adc $0, %rax
add $16, %rdi
3: test $8, %esi
je 4f
add (%rdi), %rax
adc $0, %rax
add $8, %rdi
4: test $4, %esi
je 5f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
5: test $2, %esi
je 6f
xor %edx, %edx
movw (%rdi), %dx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
6: test $1, %esi
je 7f
xor %edx, %edx
movb (%rdi), %dl
add %rdx, %rax
adc $0, %rax
7: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
.size checksum8, .-checksum8
The above carefully overlaps use of the vectorization instructions with normal 64-bit arithmetic. (It is possible to use a purely vectorized algorithm, but that turns out to be slower.) However, the results are not encouraging:
Size | 64 | 1023 | 1024 |
Time (s) | 0.28 | 2.27 | 2.29 |
This is much worse than not using vectorization at all. The problem here is that we are using three relatively expensive instructions (vectorized load, unpack, add) to do the work of a single 64-bit adc. The fact that the vectorized instructions can execute with more parallelism doesn't make up for their intrinsic inefficiency. We can unroll more, and do 64 bytes at a time:
.globl checksum9
.type checksum9,@function
.align 16
checksum9:
mov %esi, %ecx
xor %eax, %eax
shr $6, %ecx
je 2f
# Just accumulate into %xmm0 this time
xorpd %xmm0, %xmm0
xorpd %xmm7, %xmm7
clc
.align 16
1: adc 8(%rdi), %rax
adc 24(%rdi), %rax
# Use three xmm loads though
movupd (%rdi), %xmm1
movupd 16(%rdi), %xmm2
movupd 32(%rdi), %xmm3
adc 40(%rdi), %rax
adc 48(%rdi), %rax
# Three unpacks
punpckldq %xmm7, %xmm1
punpckldq %xmm7, %xmm2
punpckldq %xmm7, %xmm3
adc 56(%rdi), %rax
lea 64(%rdi), %rdi
# Three semi-independent adds
paddq %xmm1, %xmm2
paddq %xmm3, %xmm0
paddq %xmm2, %xmm0
dec %ecx
jne 1b
movapd %xmm0, -24(%rsp)
adc -24(%rsp), %rax
adc -16(%rsp), %rax
adc $0, %rax
2: test $32, %esi
je 3f
add (%rdi), %rax
adc 8(%rdi), %rax
adc 16(%rdi), %rax
adc 24(%rdi), %rax
adc $0, %rax
add $32, %rdi
3: test $16, %esi
je 4f
add (%rdi), %rax
adc 8(%rdi), %rax
adc $0, %rax
add $16, %rdi
4: test $8, %esi
je 5f
add (%rdi), %rax
adc $0, %rax
add $8, %rdi
5: test $4, %esi
je 6f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
6: test $2, %esi
je 7f
xor %edx, %edx
movw (%rdi), %dx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
7: test $1, %esi
je 8f
xor %edx, %edx
movb (%rdi), %dl
add %rdx, %rax
adc $0, %rax
8: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
.size checksum9, .-checksum9
This does a little better:
Size | 64 | 1023 | 1024 |
Time (s) | 0.26 | 1.70 | 1.75 |
However, we still haven't caught up to the unvectorized code. We need to vectorize in a smarter way. The problem is that we are still using three "expensive" vector instructions to handle 64 bits of data. There is, however, a way to use four vector instructions to work with 128 bits of data. This is quite an improvement, and might be 50% faster.
The trick is to emulate the second C version, which open-codes the add-with-carry by using a comparison. There are plenty of vector comparison operations available, but none does exactly what we want. Basically, we would like a vectorized unsigned integer comparison, but all the integer comparison instructions operate on signed integers. We can get around this by noticing that if a and b are unsigned integers, then the signed comparison (a - msb) < (b - msb) produces the correct result. (Basically, we are shifting zero down to INT_MIN.) Since the msb (most significant bit) can be toggled with an addition, a subtraction or an xor instruction, we have quite some flexibility here.
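In plain C the bias trick looks something like the following sketch (the names are illustrative; the cast back to a signed type relies on the usual two's-complement conversion):

#include <stdint.h>

/* Compare a < b as unsigned values using only a signed comparison,
 * by sliding both operands down by the most significant bit. */
static int unsigned_less(uint32_t a, uint32_t b)
{
    const uint32_t msb = 0x80000000u;
    return (int32_t) (a - msb) < (int32_t) (b - msb);
}

In the vectorized loop below, the same idea appears as keeping the accumulators biased by msb and using pcmpgtd to spot when an addition wrapped.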
We choose to have the accumulator in the form a + msb. Then, when we add a partial result, we get (a + b + msb). Thus, if we compare (a + b + msb) to (a + msb), we can detect a carry. Since the SSE comparison instructions set their "true" value to be all-ones, and that is equal to -1 in twos-complement arithmetic, we can subtract that to in effect add the carry as needed. Finally, when done, we can xor out the msb's to get the accumulated ones-complement results. So some lateral thinking gives:
.align 16
# Mask of most significant bits.
mask:
.quad 0x8000000080000000
.quad 0x8000000080000000
.globl checksum10
.type checksum10,@function
.align 16
checksum10:
mov %esi, %ecx
xor %eax, %eax
shr $6, %ecx
je 2f
movapd mask, %xmm0
movapd mask, %xmm2
clc
.align 16
1: adc 32(%rdi), %rax
# Load 32 bytes of data into xmm registers
movapd %xmm0, %xmm1
movapd %xmm2, %xmm3
adc 40(%rdi), %rax
# Accumulate in %xmm0 and %xmm2
paddd (%rdi), %xmm0
paddd 16(%rdi), %xmm2
adc 48(%rdi), %rax
# pcmpgtq is in SSE4.2, not SSE2, which limits us to 32-bit accumulators
pcmpgtd %xmm0, %xmm1
pcmpgtd %xmm2, %xmm3
adc 56(%rdi), %rax
# Add in carry
psubd %xmm1, %xmm0
psubd %xmm3, %xmm2
lea 64(%rdi), %rdi
dec %ecx
jne 1b
# Use the same algorithm to combine %xmm0 and %xmm2 when done
xorpd mask, %xmm2
movapd %xmm0, %xmm1
paddd %xmm2, %xmm0
pcmpgtd %xmm0, %xmm1
psubd %xmm1, %xmm0
xorpd mask, %xmm0
# Finally, extract the results from %xmm0
movapd %xmm0, -24(%rsp)
adc -24(%rsp), %rax
adc -16(%rsp), %rax
adc $0, %rax
2: test $32, %esi
je 3f
add (%rdi), %rax
adc 8(%rdi), %rax
adc 16(%rdi), %rax
adc 24(%rdi), %rax
adc $0, %rax
add $32, %rdi
3: test $16, %esi
je 4f
add (%rdi), %rax
adc 8(%rdi), %rax
adc $0, %rax
add $16, %rdi
4: test $8, %esi
je 5f
add (%rdi), %rax
adc $0, %rax
add $8, %rdi
5: test $4, %esi
je 6f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
6: test $2, %esi
je 7f
movzxw (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
7: test $1, %esi
je 8f
movzxb (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
8: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
.size checksum10, .-checksum10
This improved algorithm does better than the previous vectorized code:
Size | 64 | 1023 | 1024 |
Time (s) | 0.26 | 1.28 | 1.21 |
Again, it still isn't fast enough. Perhaps the problem is that we now have extra overhead after the main loop ends in order to combine the two xmm register results. Let's just accumulate into one register:
.globl checksum11
.type checksum11,@function
.align 16
checksum11:
mov %esi, %ecx
xor %eax, %eax
shr $6, %ecx
je 2f
# Just use %xmm0 to accumulate
movapd mask, %xmm0
clc
.align 16
1: adc 0(%rdi), %rax
adc 8(%rdi), %rax
# A single load, add
movapd %xmm0, %xmm1
paddd 16(%rdi), %xmm0
adc 32(%rdi), %rax
adc 40(%rdi), %rax
# A single compare
pcmpgtd %xmm0, %xmm1
adc 48(%rdi), %rax
adc 56(%rdi), %rax
# Handle the carries
psubd %xmm1, %xmm0
lea 64(%rdi), %rdi
dec %ecx
jne 1b
# Remove the msb's now that we are done
xorpd mask, %xmm0
movapd %xmm0, -24(%rsp)
adc -24(%rsp), %rax
adc -16(%rsp), %rax
adc $0, %rax
2: mov %esi, %ecx
shr $3, %ecx
and $7, %ecx # Also clears the carry flag
je 4f
3: adc (%rdi), %rax
lea 8(%rdi), %rdi
dec %ecx
jne 3b
4: adc $0, %rax
test $7, %edi
jne 7f # Not aligned
mov %esi, %ecx
and $7, %ecx
jne 6f # Some odd bytes to do
# Fold down to 16 bits
5: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
# Add in last bytes
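# The next five instructions build a mask with the low 8*n bits set, where
# n is the number of remaining bytes (1-7): -1 is shifted right by 64-8*n
# bits (shr only uses the low six bits of %cl, so negating 8*n gives that
# count). The full 8-byte load that follows is masked so that only the
# remaining bytes contribute to the sum.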
6: lea (,%ecx,8),%ecx
mov $-1, %rdx
neg %ecx
shr %cl, %rdx
and (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
jmp 5b
7: test $4, %esi
je 8f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
8: test $2, %esi
je 9f
movzxw (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
9: test $1, %esi
je 5b
movzxb (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
jmp 5b
.size checksum11, .-checksum11
This does a little better than before:
Size | 64 | 1023 | 1024 |
Time (s) | 0.170 | 1.19 | 1.11 |
We are catching up to the unvectorized code. However, there is something rather important that we have ignored until now. The aligned SSE instructions we are using (such as paddd with a memory operand) require their inputs to be 16-byte aligned; if that is not the case, they fail with a hardware exception. We need to make sure that the buffer is properly aligned. Sometimes that is trivial: just use an aligned allocator, or an aligned attribute for a static or automatic buffer. However, the use case for the TCP checksum is networking, and packet data can be arbitrarily aligned, so we really need to handle that annoying case.
If the buffer is misaligned but starts at an even address, we can use a similar algorithm: handle the leading bytes until we are 16-byte aligned (or run out of bytes to checksum), initialize the accumulator with that partial sum, and continue as normal with the vectorized code. However, if the buffer starts at an odd address we run into trouble, since we are left with a partial 16-bit word to handle. Here the fact that the checksum is endian-neutral saves us: we can accumulate part of the checksum in the "wrong" byte order, and correct for it later.
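As a rough C sketch of that correction (assuming a little-endian machine and reusing checksum1 from earlier; the assembly below instead peels off the first byte and calls itself on the rest):

unsigned short checksum_odd(const unsigned char *buf, unsigned size)
{
    unsigned first = buf[0];                    /* the stray leading byte */
    unsigned sum = ~checksum1((const char *) buf + 1, size - 1) & 0xFFFF;
    sum = ((sum & 0xFF) << 8) | (sum >> 8);     /* undo the one-byte shift */
    sum += first;                               /* the first byte is a "low" byte */
    sum = (sum & 0xFFFF) + (sum >> 16);         /* end-around carry */
    return ~sum;
}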
Thus the weakness that SSE instructions don't handle mis-aligned data severely complicates the code. However, we can still get something to work:
.globl checksum12
.type checksum12,@function
.align 16
checksum12:
xor %eax, %eax
test $0xf, %edi
jne 7f # Not aligned?
0: mov %esi, %ecx
shr $6, %ecx
je 2f
movapd mask, %xmm0
clc
# The main loop assumes it has 16-byte alignment
1: adc 0(%rdi), %rax
adc 8(%rdi), %rax
movapd %xmm0, %xmm1
paddd 16(%rdi), %xmm0
adc 32(%rdi), %rax
adc 40(%rdi), %rax
pcmpgtd %xmm0, %xmm1
adc 48(%rdi), %rax
adc 56(%rdi), %rax
psubd %xmm1, %xmm0
lea 64(%rdi), %rdi
dec %ecx
jne 1b
xorpd mask, %xmm0
movq %xmm0, %rdx
movhlps %xmm0, %xmm0
adc %rdx, %rax
movq %xmm0, %rdx
adc %rdx, %rax
adc $0, %rax
2: mov %esi, %ecx
shr $3, %ecx
and $7, %ecx # Also clears the carry flag
je 4f
3: adc (%rdi), %rax
lea 8(%rdi), %rdi
dec %ecx
jne 3b
4: adc $0, %rax
mov %esi, %ecx
and $7, %ecx
jne 6f # Some odd bytes to do
# Fold down to 16 bits
5: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
# Add in last bytes
6: lea (,%ecx,8),%ecx
mov $-1, %rdx
neg %ecx
shr %cl, %rdx
and (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
jmp 5b
7: test $1, %edi
jne 11f # Odd-aligned is annoying
mov %edi, %ecx
neg %ecx
and $0xf, %ecx
cmp %ecx, %esi
cmovl %esi, %ecx # Work to do is min(size, alignment)
sub %ecx, %esi
test $8, %ecx
je 8f
movq (%rdi), %rax
add $8, %rdi
8: test $4, %ecx
je 9f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
9: test $2, %ecx
je 10f
movzwq (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
10: test $1, %ecx
je 0b
movzbq (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
add $1, %rdi
test %esi, %esi # Nothing left to do?
je 5b
# Continue with the rest
jmp 0b
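# Odd alignment: peel off the first byte, checksum the remaining (now
# even-aligned) bytes recursively, then undo the one-byte shift by
# byte-swapping the un-inverted sum and adding the peeled byte back in
# at the low position.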
11: movzbw (%rdi), %r8w
inc %rdi
dec %esi
je 12f # Only one byte to do?
push %rbp # Needed for 16-byte stack alignment
call checksum12
pop %rbp
not %ax
xchg %al, %ah
add %r8w, %ax
adc $0, %ax
not %ax
retq
12: mov %r8w, %ax
not %ax
retq
.size checksum12, .-checksum12
The cost of checking for alignment is overwhelmed by the other work the function needs to do, and the resulting performance doesn't change from the previous (incorrect) method:
Size | 64 | 1023 | 1024 |
Time (s) | 0.170 | 1.20 | 1.11 |
Unfortunately, on this machine, it doesn't look like you can do much better using the vectorized instructions. However, the above does exploit a trick that we haven't used with the non-vectorized code: by enforcing aligned accesses in the inner loop, we can clean up the code that deals with the odd-sized tail. Doing the same with the adc-based loop looks like:
.globl checksum13
.type checksum13,@function
.align 16
checksum13:
xor %eax, %eax
test $7, %edi
jne 7f
0: mov %esi, %ecx
shr $6, %ecx
je 2f
clc
.align 16
1: adc 0(%rdi), %rax
adc 8(%rdi), %rax
adc 16(%rdi), %rax
adc 24(%rdi), %rax
adc 32(%rdi), %rax
adc 40(%rdi), %rax
adc 48(%rdi), %rax
adc 56(%rdi), %rax
lea 64(%rdi), %rdi
dec %ecx
jne 1b
adc $0, %rax
2: mov %esi, %ecx
shr $3, %ecx
and $7, %ecx # Also clears the carry flag
je 4f
3: adc (%rdi), %rax
lea 8(%rdi), %rdi
dec %ecx
jne 3b
4: adc $0, %rax
and $7, %esi
jne 6f # Some odd bytes to do
# Fold down to 16 bits
5: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
# Add in last bytes
6: lea (,%esi,8),%ecx
mov $-1, %rdx
neg %ecx
shr %cl, %rdx
and (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
jmp 5b
7: test $1, %edi
jne 11f # Odd-aligned is annoying
mov %edi, %ecx
neg %ecx
and $0x7, %ecx
cmp %ecx, %esi
cmovl %esi, %ecx # Work to do is min(size, alignment)
sub %ecx, %esi
test $4, %ecx
je 8f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
8: test $2, %ecx
je 9f
movzwq (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
9: test $1, %ecx
je 10f
movzbq (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
add $1, %rdi
10: test %esi, %esi # Nothing left to do?
je 5b
# Continue with the rest
jmp 0b
11: movzbw (%rdi), %r8w
inc %rdi
dec %esi
je 12f # Only one byte to do?
call checksum13
not %ax
xchg %al, %ah
add %r8w, %ax
adc $0, %ax
not %ax
retq
12: mov %r8w, %ax
not %ax
retq
.size checksum13, .-checksum13
This gives:
Size | 64 | 1023 | 1024 |
Time (s) | 0.128 | 1.13 | 1.05 |
This is faster than before, except for the odd-length case, which is slightly slower. If we could only make those cases more like the ones that are multiples of 64 bytes (the unrolled loop's step size), we might be able to gain a little more speed. One way to do this is via the assembly equivalent of Duff's device: we can jump into the middle of the loop when we start it. Unfortunately, we still need to handle lengths that aren't exact multiples of the 64-bit register size in a special way, but we needed to do that anyway.
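In C, this control-flow idea is the classic Duff's device. The sketch below applies it to a simple 4-way unrolled summing loop (illustrative names; carry handling is omitted since the assembly keeps it in the carry flag; it assumes at least one word to sum):

static unsigned long long sum_words(const unsigned long long *p, unsigned words)
{
    unsigned long long sum = 0;
    unsigned n = (words + 3) / 4;       /* total loop iterations, rounded up */

    switch (words & 3)                  /* jump into the middle of the loop */
    {
    case 0: do { sum += *p++;
    case 3:      sum += *p++;
    case 2:      sum += *p++;
    case 1:      sum += *p++;
            } while (--n);
    }
    return sum;
}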
Normally, you would construct a jump table, and then use it to branch to the correct spot in the inner loop. However, here the interesting instructions are essentially all the same length: four bytes each, except the very first adc, which has no displacement byte and is only three bytes long (that is what the -1 in the lea below accounts for). This means we can do a little arithmetic to emulate a computed goto, and avoid a relatively expensive, unpredictable memory lookup. The resulting code looks like:
.globl checksum14
.type checksum14,@function
.align 16
checksum14:
xor %eax, %eax
test $7, %edi
jne 5f
0: mov %esi, %edx
mov %esi, %ecx
shr $3, %edx
je 2f
shr $6, %ecx
and $7, %edx
# A multiple of 64 means we do full executions of the loop
je 1f
add $1, %ecx
# We need to adjust the pointer, because we have a partial loop
lea -64(%rdi,%rdx,8), %rdi
# Compute the jump point
neg %edx
and $7, %edx
lea 1f-1(,%rdx,4), %rdx
jmp *%rdx
.align 16
1: adc (%rdi), %rax
adc 8(%rdi), %rax
adc 16(%rdi), %rax
adc 24(%rdi), %rax
adc 32(%rdi), %rax
adc 40(%rdi), %rax
adc 48(%rdi), %rax
adc 56(%rdi), %rax
lea 64(%rdi), %rdi
dec %ecx
jne 1b
adc $0, %rax
2: and $7, %esi
jne 4f # Some odd bytes to do
# Fold down to 16 bits
3: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
# Add in last bytes
4: lea (,%esi,8),%ecx
mov $-1, %rdx
neg %ecx
shr %cl, %rdx
and (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
jmp 3b
5: test $1, %edi
jne 9f # Odd-aligned is annoying
mov %edi, %ecx
neg %ecx
and $0x7, %ecx
cmp %ecx, %esi
cmovl %esi, %ecx # Work to do is min(size, alignment)
sub %ecx, %esi
test $4, %ecx
je 6f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
6: test $2, %ecx
je 7f
movzwq (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
7: test $1, %ecx
je 8f
movzbq (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
add $1, %rdi
8: test %esi, %esi # Nothing left to do?
je 3b
# Continue with the rest
jmp 0b
9: movzbw (%rdi), %r8w
inc %rdi
dec %esi
je 10f # Only one byte to do?
call checksum14
not %ax
xchg %al, %ah
add %r8w, %ax
adc $0, %ax
not %ax
retq
10: mov %r8w, %ax
not %ax
retq
.size checksum14, .-checksum14
This gives:
Size | 64 | 1023 | 1024 |
Time (s) | 0.118 | 1.10 | 1.04 |
The above is very good, the best code so far on this machine. The only thing left to try is shrinking the inner loop. The current code processes 64 bytes at a time, like the vectorized version did. However, the old adc-based algorithm found that 32 bytes at a time was best. Let's take one last look at what that might do:
.globl checksum15
.type checksum15,@function
.align 16
checksum15:
xor %eax, %eax
test $7, %edi
jne 5f
0: mov %esi, %edx
mov %esi, %ecx
shr $3, %edx
je 2f
shr $5, %ecx
and $3, %edx
je 1f
add $1, %ecx
lea -32(%rdi,%rdx,8), %rdi
neg %edx
and $3, %edx
lea 1f-1(,%rdx,4), %rdx
jmp *%rdx
# Now only 32 bytes at a time
.align 16
1: adc (%rdi), %rax
adc 8(%rdi), %rax
adc 16(%rdi), %rax
adc 24(%rdi), %rax
lea 32(%rdi), %rdi
dec %ecx
jne 1b
adc $0, %rax
2: and $7, %esi
jne 4f # Some odd bytes to do
# Fold down to 16 bits
3: mov %eax, %edx
shr $32, %rax
add %edx, %eax
adc $0, %eax
mov %eax, %edx
shr $16, %eax
add %dx, %ax
adc $0, %ax
not %ax
retq
# Add in last bytes
4: lea (,%esi,8),%ecx
mov $-1, %rdx
neg %ecx
shr %cl, %rdx
and (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
jmp 3b
5: test $1, %edi
jne 9f # Odd-aligned is annoying
mov %edi, %ecx
neg %ecx
and $0xf, %ecx
cmp %ecx, %esi
cmovl %esi, %ecx # Work to do is min(size, alignment)
sub %ecx, %esi
test $8, %ecx
je 6f
movq (%rdi), %rax
add $8, %rdi
6: test $4, %ecx
je 7f
movl (%rdi), %edx
add %rdx, %rax
adc $0, %rax
add $4, %rdi
7: test $2, %ecx
je 8f
movzwq (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
add $2, %rdi
8: test $1, %ecx
je 0b
movzbq (%rdi), %rdx
add %rdx, %rax
adc $0, %rax
add $1, %rdi
test %esi, %esi # Nothing left to do?
je 3b
# Continue with the rest
jmp 0b
9: movzbw (%rdi), %r8w
inc %rdi
dec %esi
je 10f # Only one byte to do?
call checksum15
not %ax
xchg %al, %ah
add %r8w, %ax
adc $0, %ax
not %ax
retq
10: mov %r8w, %ax
not %ax
retq
.size checksum15, .-checksum15
The results don't change all that much:
Size | 64 | 1023 | 1024 |
Time (s) | 0.128 | 1.10 | 1.04 |
The only major difference is that the time for small 64-byte packets has gone up, since they now require two loop iterations instead of one. However, it is still faster than the earlier unvectorized code. Given that slight loss on small packets, we may prefer to stay with the 64-byte unrolled version.
Summary
After trying many optimization tricks, we have found a method of calculating the TCP/IP checksum up to an order of magnitude faster than the original C code. Most of the gains came from computing on larger chunks of data (64 bits rather than 16 bits at a time). More gains came from unrolling the inner loop and controlling use of the carry flag. Small gains could be had by controlling alignment, and by moving slow paths out of the main control flow.
Unfortunately, vectorization was not a win on this machine for this problem. The non-orthogonality of the SSE instruction set can hurt quite a bit when the operation you need isn't supported. However, this isn't the full story. On more recent machines the adc instruction, which is key to the unvectorized algorithm, is much slower, while the vectorized instructions are much faster and more numerous. The combination of these changes could mean that one of the "slower" functions above would be the better choice there.
If vectorization does prove to be better on your machine, you may want to explore adding the Duff's device trick to the best vectorized version. You may also want to explore instructions added after SSE2, or perhaps even the 256-bit AVX extensions. As hardware changes over time, so does the optimal version of an algorithm. Most of the time it isn't worth hyper-optimizing something, but in the cases where you absolutely need the performance edge, it can be highly profitable.
Comments
yolo swag said...captcha is waay too hard
It is generally futile to unroll loops without breaking dependencies.
Sum loop based on adc is essentially one long chain of dependent ops.
I've run couple of brief tests on a Lynnfield box, here's how checksum6 performs:
$ perf stat ./a.out 1000
10000000 loops; len=1000; res = ...
6827798474 cycles # 2.960 GHz
5574096465 stalled-cycles-frontend # 81.64% frontend cycles idle
4505422340 stalled-cycles-backend # 65.99% backend cycles idle
2550630533 instructions # 0.37 insns per cycle
# 2.19 stalled cycles per insn
...
More results. checksum1 and checksum2, compiled with gcc-4.5.2 -O2, are
~ 1.05 clocks/byte (IPC 2.86), and 0.4 clocks/byte (IPC 2.25), respectively.
checksum3 is ~ 0.53 clocks/byte (IPC 1.51), checksum6 ~ 0.68 clocks/byte (IPC 0.37)
I wrote a version of checksum with unrolled loop of four uint32 adds into
four different uint64 registers, disassembles like this:
400620: 44 8b 10 mov (%rax),%r10d
400623: 4c 01 d2 add %r10,%rdx
400626: 44 8b 50 04 mov 0x4(%rax),%r10d
40062a: 4d 01 d1 add %r10,%r9
40062d: 44 8b 50 08 mov 0x8(%rax),%r10d
400631: 4d 01 d0 add %r10,%r8
400634: 44 8b 50 0c mov 0xc(%rax),%r10d
400638: 48 83 c0 10 add $0x10,%rax
40063c: 4c 01 d1 add %r10,%rcx
40063f: 4c 39 d8 cmp %r11,%rax
400642: 75 dc jne 400620 <checksum77+0x30>
$ perf stat ./a.out 1000
10000000 loops; len=1000; res = ...
2740821769 cycles # 2.959 GHz
905954412 stalled-cycles-frontend # 33.05% frontend cycles idle
29838920 stalled-cycles-backend # 1.09% backend cycles idle
7565283438 instructions # 2.76 insns per cycle
# 0.12 stalled cycles per insn
...
What cpu did you run your tests on?
Finally, I'd like to let you know that I'm a robot, and that I find your captchas way too simple!
I found your article on SSE2 optimization and the IP checksum via Google, as I have been interested in an Altivec/SSE2 version of the IP checksum for many years. I took some of the ideas in the article and coded up a slightly different version. This version uses three simple ideas:
1) No alignment checking (causes branch mispredictions)
2) Use two independent memory strides by breaking buffer in half
3) Sum each stride separately (for multi-execution CPUs)
I originally tried a 64-bit sum version that has overflow checking. VTune showed a large number of branch mispredictions, so the 32-bit sum without overflow checking performs better.
Unfortunately, this code crashes using gcc -O3, but it runs correctly on VS2008. In tracking down the problem, it appears that gcc for 64-bit platforms has -sse optimization enabled by default. Since the code below doesn't care about integer alignment, it should technically work on Intel CPUs (not Sparc or PPC).
What I found interesting is that gcc uses SSE2 instructions when your article in fact shows them as a no-win. I suspect the code is crashing due to an SSE2 alignment issue (or a gcc loop-unrolling defect).
0x00007ffff6b4c7c0 in in_cksum (data=0x68b2cd "\n\352\f8\220\204;\235\263\067P\030\372\360m\201", len=499,
initialValue=<optimized out>) at ../../modules/platform_linux64/platform/../../platform/checksum64.c:59
59 sum += *buf;
0x00007ffff6b4c7b4 <in_cksum+196>: 66 0f ef d2 pxor %xmm2,%xmm2
0x00007ffff6b4c7b8 <in_cksum+200>: 0f 1f 84 00 00 00 00 00 nopl 0x0(%rax,%rax,1)
=> 0x00007ffff6b4c7c0 <in_cksum+208>: 66 43 0f 6f 5c 15 00 movdqa 0x0(%r13,%r10,1),%xmm3
0x00007ffff6b4c7c7 <in_cksum+215>: 41 83 c3 01 add $0x1,%r11d
0x00007ffff6b4c7cb <in_cksum+219>: 66 0f 6f e3 movdqa %xmm3,%xmm4
0x00007ffff6b4c7cf <in_cksum+223>: 66 0f 6a da punpckhdq %xmm2,%xmm3
0x00007ffff6b4c7d3 <in_cksum+227>: 66 0f 62 e2 punpckldq %xmm2,%xmm4
0x00007ffff6b4c7d7 <in_cksum+231>: 66 0f d4 cc paddq %xmm4,%xmm1
0x00007ffff6b4c7db <in_cksum+235>: 66 0f d4 cb paddq %xmm3,%xmm1
uint_fast16 in_cksum(const unsigned char *data, size_t len, uint_fast32 initialValue) {
    uint64 sum = initialValue;
    uint64 sum2 = 0;
    uint16 final;
    uint_fast32 stride = ((len / sizeof(uint32)) * sizeof(uint32)) / (sizeof(uint32)*2);
    const uint32* buf = (const uint32*)data;
    const uint32* buf2 = (const uint32*)&buf[stride];

    /* Break the buffer up into two different strides - sum each one independently */
    for (; stride--; ++buf, ++buf2) {
        sum += *buf;
        sum2 += *buf2;
    }
    sum += sum2;

    /* Roll second sum back into equation */
    assert(buf2 >= data);
    len -= (((const unsigned char*)buf2) - data);
    data = (const unsigned char*)buf2;

    if (len & 4) {
        sum += *((const uint32*)data);
        data += sizeof(uint32);
        len -= 4;
    }
    if (len & 2) {
        sum += *((const uint16*)data);
        data += sizeof(uint16);
        len -= 2;
    }
    if (len) {
        sum += *data++;
        --len;
    }
    assert(len == 0);

    /* Convert 64-bit value into 16-bit CRC */
    {
        uint32 i1, i2;
        uint16 i4;
        i1 = (uint32)sum;
        i2 = sum >> 32;
        i1 += i2;
        if (i1 < i2)
            i1++;
        final = (uint16)i1;
        i4 = i1 >> 16;
        final += (uint16)i4;
        if (final < i4)
            final++;
    }
    return final;
} /* in_cksum */
instead of

    if (size)
    {
        unsigned char s = *(unsigned char *) buf;
        sum += s;
        if (sum < s) sum++;
    }

shouldn't we have:

    if (size & 1)
    {
        unsigned char s = *(unsigned char *) buf;
        sum += s;
        if (sum < s) sum++;
    }
Eduardo Panisset
The excuse seems to be that "C" cannot access the carry flag. Then when that does not prove a real advantage, you go ahead and proceed with the standard loop unrolling techniques that are much easier to show in C...
I have a strong feeling a C implementation would be as fast, and 100 times more readable. Probably not so much fun (for you!) though ;-)
Of course you can add as much loop unrolling as you want to make this go faster.
Finally, you can also make things go faster if you use a prefetch in your loop to ensure that data is available at all times.
Of course a hand coded assembly solution will beat a C solution, but you should really try to get to the best C solution first.
unsigned short checksum5(uint16_t *bs, unsigned size)
{
    uint32_t sum = 0; // the key is uint32 or just int here. Your code uses long which is 64bit size and it works twice slower
    while (size >= 2)
    {
        sum += bs[0];
        size -= 2;
        bs += 1;
    }
    if (size)
    {
        sum += *(unsigned char *) bs;
    }
    sum = (sum>>16)+(sum & 0xffff);
    sum = sum + (sum>>16);
    return ~(short)sum;
}
Your checksum15
csum(64) 0.143804
csum(515) 1.067394
csum(1023) 2.272981
csum(1024) 2.252651
My simple C checksum
csum(64) 0.184182
csum(515) 0.979094
csum(1023) 2.052713
csum(1024) 2.057040
slower on small packets, but faster on bigger.
Compiled with gcc 4.8.3 with -O3 -msse4, only faster when using -O3
Tested on Intel(R) Core(TM) i7-4810MQ
Also tested on Intel(R) Xeon(R) CPU X5650
And the asm is even worse than C
C:
./csumbench
csum(64) 0.453740
csum(515) 2.106023
csum(1023) 3.652448
csum(1024) 3.608719
ASM checksum15:
csum(64) 0.650974
csum(515) 3.960013
csum(1023) 7.688919
csum(1024) 7.578975
fuck your captcha, I'm robot! it's so hard >_< Is it a or o? Did case matter?
On some CPUs (especially Nehalem), you're badly hurting your vectorized code with bypass-delays between vec-int and vec-FP domains by using packed-double instructions like xorpd mask, %xmm0 instead of pxor %xmm7, %xmm0. (You should also definitely hoist the load of the constant out of the loop).
Using `pd` instructions is totally crazy. I could see trying to save code-size by using `movups` or `xorps`, but `movupd` is just as big as `movdqu`. I'd definitely recommend using integer loads and booleans.
-------------
Also, as Said points out, vectorizing by simply unpacking to wider accumulators that can't carry is sufficient and shortens the dep chains. gcc / clang auto-vectorize that C somewhat well, but you could do better by hand. (Well, not much better than clang, which uses pmovzxwd xmm3, [rdx] / paddd xmm1, xmm3 with 2 accumulators. But it bottlenecks on shuffle-port throughput on most Intel CPUs). https://godbolt.org/g/ETm7FJ
----
Your cleanup code can be much more efficient for the last 1-7 bytes: use an 8 byte load that ends with the last byte of the buffer, and shift out the bytes that overlap with work you already did:
In this version, I have RAX = an accumulator that does carry, and RDX = an accumulator that I only ever add 32-bit zero-extended things into, so it never carries. So I don't need a lot of adc reg,0 instructions, because ADC is 2 uops on Intel pre-Broadwell. I'm playing around with having the mainloop use three 64-bit operations, the first being ADD rax, [rdi] and the next two being ADC rax, [rdi+16/32]. Critically, I break loop-carried carry-flag dependency by doing the next 8 bytes as two separate adds of 32-bit zero-extended chunks (which takes 2 separate mov-loads, so there's a front-end bottleneck as well as using extra load-port bandwidth).
Anyway, this is the last part of the cleanup:
```
.under8:
and r9d, 7
jz .done
;; 1-7 bytes left: grab the last 8 bytes of the buffer and shift out the extras
mov rsi, [rdi + r9 - 8] ; e.g. for r9d=1, we want [rdi - 7] >> (8*7)
lea ecx, [r9 * 8 - 64] ; extra_bits = 8*8 - (leftover_bytes * 8)
neg ecx
shr rsi, cl
add rdx, rsi ; can't carry: rsi always has at least the top byte clear, and rdx is not close to carrying
.done:
```
Then fall through to the horizontal stuff getting down to one 16 bit value.
I didn't really play with SSE/AVX, because I was interested about what you might do in a kernel where it's not worth saving/restoring vector state to checksum a packet.
---
https://stackoverflow.com/users/224132/peter-cordes