Lockless Inc

The TCP/IP Checksum

The TCP/IP checksum is used to detect corruption of data over a TCP or IPv4 connection. If a bit is flipped, a byte mangled, or some other badness happens to a packet, then it is highly likely that the receiver of that broken packet will notice the problem due to a checksum mismatch. This provides end-to-end assurance that the data stream is correct.

IPv4 uses the checksum to detect corruption of packet headers, i.e. the source and destination addresses and other metadata. The TCP protocol includes an extra checksum that protects the packet "payload" as well as the header, in addition to IP's header checksum. IPv6 does not include a header checksum at all. This is due to the modern assumption that lower-level link protocols typically carry some form of error detection, together with the fact that higher-level protocols (like TCP) do as well. This makes IPv6 packet headers "leaner" than they might otherwise be (provided the size of the flow-label isn't altered).

The algorithm for the TCP and IPv4 checksums is identical. The data is processed a word (16 bits, two bytes) at a time. (An odd number of bytes just has a trailing zero byte added to make the total even.) The words are added using ones-complement arithmetic, with the checksum field itself set to zero. Once this long chain of additions is complete, the result is negated in ones-complement by taking the binary not, and stored in the checksum field. If the sum is then recomputed over the data with the stored checksum in place, it comes out as the binary all-ones (and the inverted result as zero), which is how a receiver verifies the packet.

The algorithm above has many interesting properties. The first is that since it is based on addition, it is both commutative and associative. This means that we can use some other calculation that effectively adds in some other order. By re-arranging the additions we might be able to increase performance.

The second property is more subtle: The result is endian independent. A ones-complement addition can be performed on twos-complement machines (most of them) by doing a normal twos-complement addition, followed by adding the potential carry back in the least-significant position. This means that we effectively have a carry from the first byte into the second, and from the second byte into the first. This symmetry means that no matter what endian order we internally place on those two bytes, the calculation will give the same result.
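
To make that carry wraparound concrete, here is a small helper, purely illustrative and not part of the code developed below, that adds two 16-bit words in ones-complement on a twos-complement machine:


unsigned short ones_add16(unsigned short a, unsigned short b)
{
	unsigned sum = (unsigned) a + b;

	/* Wrap any carry out of bit 15 back into the least-significant position */
	return (sum & 0xFFFF) + (sum >> 16);
}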

A final interesting property (not really used in this article) is that since the checksum is based on addition, we can alter a packet and update the checksum cheaply. We just need to do a ones-complement subtraction of the differences in the words we change. For example, this means that when a router updates the time-to-live (TTL) of a packet, it doesn't have to redo the whole checksum calculation.
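
As a sketch of that incremental update (again illustrative only; it reuses the ones_add16 helper above and follows the ~(~check + ~old + new) identity), patching a stored checksum when a single 16-bit word changes might look like:


unsigned short checksum_update16(unsigned short check,
		unsigned short old_word, unsigned short new_word)
{
	/* ~(~check + ~old + new), all in ones-complement arithmetic */
	unsigned short sum = ones_add16(ones_add16(~check, ~old_word), new_word);

	return ~sum;
}

A router decrementing the TTL, for instance, would pass the old and new values of the 16-bit header word that holds the TTL and protocol fields.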

Evaluating the checksum can be a bottleneck in network-heavy environments. This is why many network interfaces offer to do the checksum calculation in hardware. However, many network stacks still do the checksum in software, which is why optimizing its calculation can be important. This article explores how that might be done.

TCP/IP Checksum Algorithms

C's unsigned integers wrap modulo a power of two (in effect, two's-complement behaviour), and the language doesn't specify the representation of signed integers, so we can't use ones-complement arithmetic directly. (This is for portability: hardware might use either representation.) C also doesn't have access to the carry flag, or to the add-with-carry instruction. However, if we use integer types wider than a 16-bit word, we can accumulate many carries. Due to the commutativity and associativity of addition, we can fold them all in at the end. (Note that we don't have to worry about overflow of the wider type because the size of packets is limited.) Some C code that does that looks like:


unsigned short checksum1(const char *buf, unsigned size)
{
	unsigned sum = 0;
	int i;

	/* Accumulate checksum */
	for (i = 0; i + 1 < size; i += 2)
	{
		unsigned short word16 = *(unsigned short *) &buf[i];
		sum += word16;
	}

	/* Handle odd-sized case */
	if (size & 1)
	{
		unsigned short word16 = (unsigned char) buf[i];
		sum += word16;
	}

	/* Fold to get the ones-complement result */
	while (sum >> 16) sum = (sum & 0xFFFF)+(sum >> 16);

	/* Invert to get the negative in ones-complement arithmetic */
	return ~sum;
}

We can run the above 2^24 times as a benchmark, and time the result. Obviously, the times will vary depending on the size of the buffer used. We choose to look at 64-byte, 1023-byte and 1024-byte packets. The 64-byte case is indicative of the overhead for small packets. 1024 bytes tells us how performance varies in the large-packet limit. 1023 bytes helps diagnose the effect of odd-sized packets on the different variants of the checksum algorithm. Doing that, the results are:

Size (bytes)    64      1023    1024
Time (s)        0.88    10.99   11.03
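
A measurement of this kind can be made with a harness roughly like the following. This is a sketch only: the zero-filled buffer (the contents don't matter for timing) and the use of clock() are assumptions, while the 2^24 repeat count and the three sizes come from the text above.


#include <stdio.h>
#include <time.h>

unsigned short checksum1(const char *buf, unsigned size);

int main(void)
{
	static char buf[1024];
	unsigned sizes[] = {64, 1023, 1024};
	int i, j;

	for (j = 0; j < 3; j++)
	{
		clock_t start = clock();
		volatile unsigned short result;

		/* Repeat the checksum 2^24 times over the same buffer */
		for (i = 0; i < (1 << 24); i++)
		{
			result = checksum1(buf, sizes[j]);
		}

		(void) result;
		printf("%u bytes: %.2fs\n", sizes[j],
			(double) (clock() - start) / CLOCKS_PER_SEC);
	}

	return 0;
}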

The above obvious algorithm handles 16-bits at a time. Usually, if you process more data at a time, then performance is better. Since we have a 64-bit machine, we would like to modify the algorithm to handle chunks of that size. The problem there is the carries. Fortunately, it isn't hard to come up with an algorithm to detect an (unsigned) overflow. If a+b < a, then a carry occurred. Thus the main loop is easily modified to add an extra one to the total when needed. Finally, we need to "fold" the result down from 64 to 16 bits at the end. Such an algorithm looks like:


unsigned short checksum2(const char *buf, unsigned size)
{
	unsigned long long sum = 0;
	const unsigned long long *b = (unsigned long long *) buf;

	unsigned t1, t2;
	unsigned short t3, t4;

	/* Main loop - 8 bytes at a time */
	while (size >= sizeof(unsigned long long))
	{
		unsigned long long s = *b++;
		sum += s;
		if (sum < s) sum++;
		size -= 8;
	}

	/* Handle tail less than 8-bytes long */
	buf = (const char *) b;
	if (size & 4)
	{
		unsigned s = *(unsigned *)buf;
		sum += s;
		if (sum < s) sum++;
		buf += 4;
	}

	if (size & 2)
	{
		unsigned short s = *(unsigned short *) buf;
		sum += s;
		if (sum < s) sum++;
		buf += 2;
	}

	if (size & 1)
	{
		unsigned char s = *(unsigned char *) buf;
		sum += s;
		if (sum < s) sum++;
	}

	/* Fold down to 16 bits */
	t1 = sum;
	t2 = sum >> 32;
	t1 += t2;
	if (t1 < t2) t1++;
	t3 = t1;
	t4 = t1 >> 16;
	t3 += t4;
	if (t3 < t4) t3++;

	return ~t3;
}

As one might expect, this does a fair bit better:

Size (bytes)    64      1023    1024
Time (s)        0.29    2.90    2.93

It isn't quite the four-times faster that using four-times larger chunks might give, but it's close. The problem now is to do even better. However, since C doesn't have access to the carry flag, it becomes increasingly difficult to describe what we want the machine to do. Instead, we will make the jump to assembly language, which will give us the freedom to try any technique we wish.

A direct translation of the above C code into assembly language by hand might look like the following:


.globl checksum3
.type checksum3,@function
.align 16
checksum3:
	xor %eax, %eax
	cmp $8, %esi
	jl 2f

# The main loop
1:	add (%rdi), %rax
	adc $0, %rax
	add $8, %rdi
	sub $8, %esi
	cmp $8, %esi
	jge 1b

# Handle the tail
2:	test $4, %esi
	je  3f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

3:	test $2, %esi
	je 4f
	xor %edx, %edx
	movw (%rdi), %dx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

4:	test $1, %esi
	je 5f
	xor %edx, %edx
	movb (%rdi), %dl
	add %rdx, %rax
	adc $0, %rax

# Fold down to 16-bits
5:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

# Invert to get the final checksum
	not %ax
	retq
.size checksum3, .-checksum3

The main loop is quite simple. It uses the %rax register to accumulate the result, using the add instruction to do so. Carries are handled by the adc instruction afterwards. The rest of the loop just updates the pointer into the buffer, and counts down the number of iterations to do. Handling the tail, and the final fold down to the 16-bit result are also simplified by the ability to use the adc instruction explicitly.

Since the above is a faithful translation of the C code, you wouldn't expect too much of a difference in performance. That is the case, with the results being:

Size (bytes)    64      1023    1024
Time (s)        0.28    2.88    2.91

There is only a smidgeon of a difference, due to the intrinsically lower overhead of assembly language. To go faster we need to use more of the flexibility available. One of the most obvious things to do is to optimize the main loop some more. If we take some care, we can let the carry flag be "live" throughout. This means we can get rid of the adc-with-zero, and turn it into an adc of the next value to sum in the next iteration. However, the current code corrupts the carry flag with the other instructions in the loop. So we need to be creative and use the lea instruction to update the pointer, and the dec instruction to count down the loop. Neither of those affects the carry flag.


.globl checksum4
.type checksum4,@function
.align 16
checksum4:
	mov %esi, %ecx
	xor %eax, %eax

# Divide by 8 to get the total number of iterations
	shr $3, %ecx
	je 2f
	
# Clear the carry before starting the loop
	clc
	
# The new smaller main loop
1:	adc (%rdi), %rax
	lea 8(%rdi), %rdi
	dec %ecx
	jne 1b

# Fold in the final carry
	adc $0, %rax
	
2:	test $4, %esi
	je  3f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

3:	test $2, %esi
	je 4f
	xor %edx, %edx
	movw (%rdi), %dx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

4:	test $1, %esi
	je 5f
	xor %edx, %edx
	movb (%rdi), %dl
	add %rdx, %rax
	adc $0, %rax

5:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq
.size checksum4, .-checksum4

The above shrinks the hot inner loop enormously, and is obviously going to be much faster. However, reality isn't so kind:

Size (bytes)    64      1023    1024
Time (s)        0.27    2.87    2.90

Ouch. The transformation had little to no effect at all. It turns out that on this machine adc is a very fast instruction. The extra adc instruction inside the loop overlapped in execution with the other operations, so removing it didn't speed anything up much at all. However, if that's the case, we can use that knowledge to make some improvements. If we unroll the loop once, and make two additions in it, we can perhaps double the speed:


.globl checksum5
.type checksum5,@function
.align 16
checksum5:
	
	mov %esi, %ecx
	xor %eax, %eax
# Now handle 16 bytes at a time
	shr $4, %ecx
	je 2f
	clc

# The main loop now uses two 64-bit additions
1:	adc (%rdi), %rax
	adc 8(%rdi), %rax
	lea 16(%rdi), %rdi
	dec %ecx
	jne 1b
	
	adc $0, %rax

# We need to handle anything up to 15 tail bytes.
2:	test $8, %esi
	je  3f
	add (%rdi), %rax
	adc $0, %rax
	add $8, %rdi
	
3:	test $4, %esi
	je  4f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

4:	test $2, %esi
	je 5f
	xor %edx, %edx
	movw (%rdi), %dx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

5:	test $1, %esi
	je 6f
	xor %edx, %edx
	movb (%rdi), %dl
	add %rdx, %rax
	adc $0, %rax

#   Since we accumulate with 64-bits still, this doesn't change.
6:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq
.size checksum5, .-checksum5

This time our hopes are realized, and we get a significant performance increase:

Size (bytes)    64      1023    1024
Time (s)        0.20    1.54    1.56

Obviously, if unrolling once is good, unrolling even more must be better. If we unroll so that there are four additions in the inner loop, we get:


.globl checksum6
.type checksum6,@function
.align 16
checksum6:
	mov %esi, %ecx
	xor %eax, %eax

# 32 bytes at a time
	shr $5, %ecx
	je 2f
	clc

# Let's make sure our quite-large loop is now aligned
.align 16

# Four 64-bit adds per iteration
1:	adc (%rdi), %rax
	adc 8(%rdi), %rax
	adc 16(%rdi), %rax
	adc 24(%rdi), %rax
	lea 32(%rdi), %rdi
	dec %ecx
	jne 1b
	
	adc $0, %rax

# Handle the 31 bytes or less remaining
2:	test $16, %esi
	je  3f
	add (%rdi), %rax
	adc 8(%rdi), %rax
	adc $0, %rax
	add $16, %rdi	
	
3:	test $8, %esi
	je  4f
	add (%rdi), %rax
	adc $0, %rax
	add $8, %rdi
	
4:	test $4, %esi
	je  5f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

5:	test $2, %esi
	je 6f
	xor %edx, %edx
	movw (%rdi), %dx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

6:	test $1, %esi
	je 7f
	xor %edx, %edx
	movb (%rdi), %dl
	add %rdx, %rax
	adc $0, %rax

7:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq
.size checksum6, .-checksum6

Again, we get more improvement. Not quite as much as last time, but enough to make it worth it:

Size (bytes)    64      1023    1024
Time (s)        0.18    1.10    1.08

Unrolling more doesn't seem to help. To increase the speed further we will need to think of something else to change. As we unroll we are increasing the length of the dependent instruction chain in the loop. If we can somehow break this chain into smaller pieces, then the cpu might be able to exploit more inter-instruction parallelism. Unfortunately, there is only a single carry flag. However, we can combine what we have with something using the algorithm in the original C code. Using lea instructions to accumulate into a wide register, we can get extra additions in the loop, without many dependencies. Such code looks like:


.globl checksum7
.type checksum7,@function
.align 16
checksum7:
	mov %esi, %ecx
	xor %eax, %eax
	shr $5, %ecx
	je 2f

# Use %r8 to accumulate as well
	xor %r8, %r8

.align 16
1:	movl 24(%rdi), %edx
# Add 32-bits into %r8
	lea (%rdx,%r8),%r8
# Three adc's now into %rax instead of four
	adc (%rdi), %rax
	adc 8(%rdi), %rax
	adc 16(%rdi), %rax
	movl 28(%rdi), %edx
# Add another 32-bits into %r8
	lea (%rdx,%r8),%r8
	lea 32(%rdi), %rdi
	dec %ecx
	jne 1b

# Merge %r8 with other results in %rax
	adc %r8, %rax
	adc $0, %rax
2:	test $16, %esi
	je  3f
	add (%rdi), %rax
	adc 8(%rdi), %rax
	adc $0, %rax
	add $16, %rdi	
	
3:	test $8, %esi
	je  4f
	add (%rdi), %rax
	adc $0, %rax
	add $8, %rdi
	
4:	test $4, %esi
	je  5f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

5:	test $2, %esi
	je 6f
	xor %edx, %edx
	movw (%rdi), %dx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

6:	test $1, %esi
	je 7f
	xor %edx, %edx
	movb (%rdi), %dl
	add %rdx, %rax
	adc $0, %rax

7:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq
.size checksum7, .-checksum7

So does this save us any cycles? Unfortunately, it doesn't:

Size (bytes)    64      1023    1024
Time (s)        0.22    1.10    1.15

If anything, it is slightly worse than the previous case. Sometimes being tricky helps, sometimes it doesn't. This is a case of the latter. We will need to try something different to improve.

Vectorization

One obvious optimization technique for cpu-bound problems is vectorization. If we can use the power of the SSE instructions then we can do more work in parallel, and hopefully increase performance. However, there is one drawback. The set of SSE instructions is anything but orthogonal. There is a high likelihood that the exact instruction you might want may not be implemented. Instead, we have to make do with what exists, and re-design our algorithm around that.

In this case, the limiting factor is that the vectorization instructions have no concept of the carry flag (or of ones-complement arithmetic), so we can't use the best-performing technique above. However, we can still use the method of accumulating results into a double-length register. In this case, we can add 32-bit chunks to the two 64-bit halves of a 128-bit xmm register. In fact, we can go even further, and accumulate into two such registers simultaneously, further increasing parallelism. Such code looks like:


.globl checksum8
.type checksum8,@function
.align 16
checksum8:
	mov %esi, %ecx
	xor %eax, %eax
	shr $5, %ecx
	je 2f

# %xmm7 will hold zeros so that the unpack instructions work nicely
	xorpd %xmm7, %xmm7

# The two accumulation registers
	xorpd %xmm0, %xmm0
	xorpd %xmm2, %xmm2
	clc

.align 16
1:	movupd (%rdi), %xmm1
	movupd 16(%rdi), %xmm3
	
# Add the non-16 byte aligned cases with adc instructions
	adc 8(%rdi), %rax
	adc 24(%rdi), %rax

# Distribute the lower 8 bytes into 2 4-byte chunks
	punpckldq %xmm7, %xmm1
	punpckldq %xmm7, %xmm3
	
	lea 32(%rdi), %rdi
	
# Accumulate into the two xmm registers
	paddq %xmm1, %xmm0
	paddq %xmm3, %xmm2

	dec %ecx
	jne 1b

# Merge the %xmm register results
	paddq %xmm2, %xmm0

# Extract the information in %xmm0 by using the red zone.
	movapd %xmm0, -24(%rsp)
	adc -24(%rsp), %rax
	adc -16(%rsp), %rax
	adc $0, %rax

# Handle the remaining possible tail of 31 bytes like before.
2:	test $16, %esi
	je  3f
	add (%rdi), %rax
	adc 8(%rdi), %rax
	adc $0, %rax
	add $16, %rdi	
	
3:	test $8, %esi
	je  4f
	add (%rdi), %rax
	adc $0, %rax
	add $8, %rdi
	
4:	test $4, %esi
	je  5f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

5:	test $2, %esi
	je 6f
	xor %edx, %edx
	movw (%rdi), %dx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

6:	test $1, %esi
	je 7f
	xor %edx, %edx
	movb (%rdi), %dl
	add %rdx, %rax
	adc $0, %rax

7:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq
.size checksum8, .-checksum8

The above carefully overlaps use of the vectorization instructions with normal 64-bit arithmetic. (It is possible to use a purely vectorized algorithm, but that turns out to be slower.) However, the results are not encouraging:

Size (bytes)    64      1023    1024
Time (s)        0.28    2.27    2.29

This is much worse than not using vectorization at all. The problem here is that we are using three relatively expensive instructions (vectorized load, unpack, add) to do the work of a single 64-bit adc. The fact that the vectorized instructions can execute with more parallelism doesn't make up for their intrinsic inefficiency. We can unroll more, and do 64 bytes at a time:


.globl checksum9
.type checksum9,@function
.align 16
checksum9:
	mov %esi, %ecx
	xor %eax, %eax
	shr $6, %ecx
	je 2f

# Just accumulate into %xmm0 this time
	xorpd %xmm0, %xmm0
	xorpd %xmm7, %xmm7
	clc

.align 16
1:	adc 8(%rdi), %rax
	adc 24(%rdi), %rax

# Use three xmm loads though
	movupd (%rdi), %xmm1
	movupd 16(%rdi), %xmm2
	movupd 32(%rdi), %xmm3
	
	adc 40(%rdi), %rax
	adc 48(%rdi), %rax

# Three unpacks
	punpckldq %xmm7, %xmm1
	punpckldq %xmm7, %xmm2
	punpckldq %xmm7, %xmm3
	
	adc 56(%rdi), %rax
	lea 64(%rdi), %rdi

# Three semi-independent adds
	paddq %xmm1, %xmm2
	paddq %xmm3, %xmm0
	paddq %xmm2, %xmm0
	dec %ecx
	jne 1b

	movapd %xmm0, -24(%rsp)
	adc -24(%rsp), %rax
	adc -16(%rsp), %rax
	adc $0, %rax
	
2:	test $32, %esi
	je  3f
	add (%rdi), %rax
	adc 8(%rdi), %rax
	adc 16(%rdi), %rax
	adc 24(%rdi), %rax
	adc $0, %rax
	add $32, %rdi	

3:	test $16, %esi
	je  4f
	add (%rdi), %rax
	adc 8(%rdi), %rax
	adc $0, %rax
	add $16, %rdi	
	
4:	test $8, %esi
	je  5f
	add (%rdi), %rax
	adc $0, %rax
	add $8, %rdi
	
5:	test $4, %esi
	je  6f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

6:	test $2, %esi
	je 7f
	xor %edx, %edx
	movw (%rdi), %dx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

7:	test $1, %esi
	je 8f
	xor %edx, %edx
	movb (%rdi), %dl
	add %rdx, %rax
	adc $0, %rax

8:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq
.size checksum9, .-checksum9

This does a little better:

Size (bytes)    64      1023    1024
Time (s)        0.26    1.70    1.75

However, we still haven't caught up to the unvectorized code. We need to vectorize in a smarter way. The problem is that we are still using three "expensive" vector instructions to handle 64 bits of data. There is a way, however, to use four vector instructions to work with 128 bits of data. This is quite an improvement, and might be 50% faster.

The trick is to try to emulate the second C code that open-codes the add-with-carry instruction by using a comparison. There are plenty of comparison operations available, but none does exactly what we want. Basically, we would like a vectorized unsigned integer comparison, but all the integer comparison instructions are on signed integers. We can get around this by noticing that if a and b are unsigned integers then the signed comparison (a - msb) < (b - msb) will produce the correct result. (Basically, we are shifting zero to be at INT_MIN.) Since the msb (most significant bit) can be toggled with an addition, subtraction or an xor instruction, we have quite some flexibility here.
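
The comparison trick is easy to check in scalar C first. A small illustrative function (the name is mine, and it is not part of the checksum code) that detects an unsigned 32-bit carry using only a signed comparison looks like:


#include <stdint.h>

/* Returns 1 if a + b carries out of 32 bits. Flipping the msb maps unsigned
 * order onto signed order, so the signed comparison below is equivalent to
 * the unsigned test (a + b) < a used in the earlier C code. The casts rely
 * on the usual twos-complement conversion behaviour of the machines here. */
int carry_out32(uint32_t a, uint32_t b)
{
	const uint32_t msb = 0x80000000u;
	uint32_t sum = a + b;

	return (int32_t) (sum ^ msb) < (int32_t) (a ^ msb);
}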

We choose to have the accumulator in the form a + msb. Then, when we add a partial result b, we get (a + b + msb). Thus, by comparing (a + b + msb) against (a + msb) as signed values, we can detect a carry. Since the SSE comparison instructions set their "true" value to all-ones, which is equal to -1 in twos-complement arithmetic, we can subtract the comparison result to, in effect, add the carry back in as needed. Finally, when done, we can xor out the msb's to get the accumulated ones-complement results. So some lateral thinking gives:


.align 16
# Mask of most significant bits.
mask:
.quad	0x8000000080000000
.quad	0x8000000080000000
.globl checksum10
.type checksum10,@function
.align 16
checksum10:
	mov %esi, %ecx
	xor %eax, %eax
	shr $6, %ecx
	je 2f

	movapd mask, %xmm0
	movapd mask, %xmm2
	clc

.align 16
1:	adc 32(%rdi), %rax

# Load 32 bytes of data into xmm registers
	movapd %xmm0, %xmm1
	movapd %xmm2, %xmm3
	adc 40(%rdi), %rax

# Accumulate in %xmm0 and %xmm2
	paddd (%rdi), %xmm0
	paddd 16(%rdi), %xmm2
	adc 48(%rdi), %rax

# pcmpgtq is in SSE4.2, not SSE2, which limits us to 32-bit accumulators
	pcmpgtd %xmm0, %xmm1
	pcmpgtd %xmm2, %xmm3
	adc 56(%rdi), %rax

# Add in carry
	psubd %xmm1, %xmm0
	psubd %xmm3, %xmm2
	lea 64(%rdi), %rdi
	dec %ecx
	jne 1b

# Use the same algorithm to combine %xmm0 and %xmm2 when done
	xorpd mask, %xmm2
	movapd %xmm0, %xmm1
	paddd %xmm2, %xmm0
	pcmpgtd %xmm0, %xmm1
	psubd %xmm1, %xmm0
	xorpd mask, %xmm0

# Finally, extract the results from %xmm0
	movapd %xmm0, -24(%rsp)
	adc -24(%rsp), %rax
	adc -16(%rsp), %rax
	adc $0, %rax
	
2:	test $32, %esi
	je  3f
	add (%rdi), %rax
	adc 8(%rdi), %rax
	adc 16(%rdi), %rax
	adc 24(%rdi), %rax
	adc $0, %rax
	add $32, %rdi	

3:	test $16, %esi
	je  4f
	add (%rdi), %rax
	adc 8(%rdi), %rax
	adc $0, %rax
	add $16, %rdi
		
4:	test $8, %esi
	je  5f
	add (%rdi), %rax
	adc $0, %rax
	add $8, %rdi
	
5:	test $4, %esi
	je  6f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

6:	test $2, %esi
	je 7f
	movzxw (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

7:	test $1, %esi
	je 8f
	movzxb (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax

8:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq
.size checksum10, .-checksum10

This improved algorithm does better than the previous vectorized code:

Size (bytes)    64      1023    1024
Time (s)        0.26    1.28    1.21

Again, it still isn't fast enough. Perhaps the problem is that we now have extra overhead after the main loop ends in order to combine the two xmm register results. Let's just accumulate into one register:


.globl checksum11
.type checksum11,@function
.align 16
checksum11:
	mov %esi, %ecx
	xor %eax, %eax
	shr $6, %ecx
	je 2f

# Just use %xmm0 to accumulate
	movapd mask, %xmm0
	clc

.align 16
1:	adc 0(%rdi), %rax
	adc 8(%rdi), %rax

# A single load, add
	movapd %xmm0, %xmm1
	paddd 16(%rdi), %xmm0
	
	adc 32(%rdi), %rax
	adc 40(%rdi), %rax

# A single compare
	pcmpgtd %xmm0, %xmm1
	
	adc 48(%rdi), %rax
	adc 56(%rdi), %rax

# Handle the carries
	psubd %xmm1, %xmm0
	lea 64(%rdi), %rdi
	dec %ecx
	jne 1b

# Remove the msb's now that we are done
	xorpd mask, %xmm0
	
	movapd %xmm0, -24(%rsp)
	adc -24(%rsp), %rax
	adc -16(%rsp), %rax
	adc $0, %rax
	
2:	mov %esi, %ecx
	shr $3, %ecx
	and $7, %ecx  # Also clears the carry flag
	je 4f

3:	adc (%rdi), %rax
	lea 8(%rdi), %rdi
	dec %ecx
	jne 3b

4:	adc $0, %rax

	test $7, %edi
	jne 7f # Not aligned

	mov %esi, %ecx
	and $7, %ecx
	jne  6f # Some odd bytes to do
	
	# Fold down to 16 bits
5:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq

	# Add in last bytes
6:	lea (,%ecx,8),%ecx
	mov $-1, %rdx
	neg %ecx
	shr %cl, %rdx
	and (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	jmp 5b

7:	test $4, %esi
	je  8f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

8:	test $2, %esi
	je 9f
	movzxw (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

9:	test $1, %esi
	je 5b
	movzxb (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	jmp 5b
.size checksum11, .-checksum11

This does a little better than before:

Size (bytes)    64      1023    1024
Time (s)        0.170   1.19    1.11

We are catching up to the unvectorized code. However, there is something rather important that we have ignored up until now. The SSE instructions used here (movapd, and paddd with a memory operand) require their inputs to be 16-byte aligned. If that is not the case, then they will fail with a hardware exception. We need to make sure that the buffer is properly aligned. Sometimes that is trivial: just use an aligned allocator, or an aligned attribute for a static or automatic buffer. However, the use-case for the TCP checksum is networking. The data from packets can be unaligned, so we really need to handle that annoying case.

If the buffer is mis-aligned, but still evenly-aligned, then we can use a similar algorithm. Just handle the beginning bytes until we are 16-byte aligned (or run out of bytes to checksum). Initialize the accumulator to that value, and continue on as normal with the vectorized code. However, if we are at an odd-alignment we run into trouble. We will have a partial 16-bit word to handle. Here, the fact that the checksum is endian-neutral saves us. We can accumulate part of the checksum in the "wrong" endian form, and then correct for that later.
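
As a rough C rendering of that odd-alignment correction (a sketch only: it reuses the checksum2 function from earlier for convenience, assumes at least two bytes of input, and the name is made up), the idea looks like:


unsigned short checksum_odd(const char *buf, unsigned size)
{
	/* Checksum everything after the first byte as if it started on a
	   word boundary, then recover the raw (un-inverted) sum */
	unsigned sum = (unsigned short) ~checksum2(buf + 1, size - 1);

	/* That partial sum was computed with the bytes in the "wrong" lanes,
	   so swap its two bytes to line it up with the real word boundaries */
	sum = ((sum & 0xFF) << 8) | (sum >> 8);

	/* Fold in the leading byte, which belongs in the low-byte position */
	sum += (unsigned char) buf[0];
	sum = (sum & 0xFFFF) + (sum >> 16);

	return ~sum;
}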

Thus the weakness that SSE instructions don't handle mis-aligned data severely complicates the code. However, we can still get something to work:


.globl checksum12
.type checksum12,@function
.align 16
checksum12:
	xor %eax, %eax
	test $0xf, %edi
	jne 7f  # Not aligned?

0:	mov %esi, %ecx
	shr $6, %ecx
	je 2f

	movapd mask, %xmm0
	clc

# The main loop assumes it has 16-byte alignment
1:	adc 0(%rdi), %rax
	adc 8(%rdi), %rax
	movapd %xmm0, %xmm1
	paddd 16(%rdi), %xmm0
	
	adc 32(%rdi), %rax
	adc 40(%rdi), %rax
	pcmpgtd %xmm0, %xmm1
	
	adc 48(%rdi), %rax
	adc 56(%rdi), %rax
	psubd %xmm1, %xmm0
	lea 64(%rdi), %rdi
	dec %ecx
	jne 1b
	
	xorpd mask, %xmm0
	
	movq %xmm0, %rdx
	movhlps %xmm0, %xmm0
	adc %rdx, %rax
	movq %xmm0, %rdx
	adc %rdx, %rax
	adc $0, %rax
	
2:	mov %esi, %ecx
	shr $3, %ecx
	and $7, %ecx  # Also clears the carry flag
	je 4f

3:	adc (%rdi), %rax
	lea 8(%rdi), %rdi
	dec %ecx
	jne 3b

4:	adc $0, %rax
	mov %esi, %ecx
	and $7, %ecx
	jne  6f # Some odd bytes to do
	
	# Fold down to 16 bits
5:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq

	# Add in last bytes
6:	lea (,%ecx,8),%ecx
	mov $-1, %rdx
	neg %ecx
	shr %cl, %rdx
	and (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	jmp 5b

7:	test $1, %edi
	jne 11f # Odd-aligned is annoying
	
	mov %edi, %ecx
	neg %ecx
	and $0xf, %ecx
	
	cmp %ecx, %esi
	cmovl %esi, %ecx  # Work to do is min(size, alignment)
	sub %ecx, %esi
	
	test $8, %ecx
	je  8f
	movq (%rdi), %rax
	add $8, %rdi
	
8:	test $4, %ecx
	je  9f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

9:	test $2, %ecx
	je 10f
	movzwq (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

10:	test $1, %ecx
	je 0b
	movzbq (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	add $1, %rdi

	test %esi, %esi # Nothing left to do?
	je 5b
	
	# Continue with the rest
	jmp 0b

11:	movzbw (%rdi), %r8w
	inc %rdi
	dec %esi
	je 12f # Only one byte to do?
	push %rbp # Needed for 16-byte stack alignment
	call checksum12
	pop %rbp
	not %ax
	xchg %al, %ah
	add %r8w, %ax
	adc $0, %ax
	not %ax
	retq

12:	mov %r8w, %ax
	not %ax
	retq
.size checksum12, .-checksum12

The cost of checking for alignment is overwhelmed by the other work the function needs to do, and the resulting performance doesn't change from the previous (incorrect) method:

Size (bytes)    64      1023    1024
Time (s)        0.170   1.20    1.11

Unfortunately, on this machine, it doesn't look like we can do much better using the vectorized instructions. However, the above does exploit a trick that we haven't used with the non-vectorized code. By enforcing aligned accesses in the inner loop, we can clean up the code that deals with the odd-sized tail. Doing the same with the adc-based loop looks like:


.globl checksum13
.type checksum13,@function
.align 16
checksum13:
	xor %eax, %eax
	test $7, %edi
	jne 7f
0:	mov %esi, %ecx
	shr $6, %ecx
	je 2f
	clc

.align 16
1:	adc 0(%rdi), %rax
	adc 8(%rdi), %rax
	adc 16(%rdi), %rax
	adc 24(%rdi), %rax
	adc 32(%rdi), %rax
	adc 40(%rdi), %rax
	adc 48(%rdi), %rax
	adc 56(%rdi), %rax
	lea 64(%rdi), %rdi
	dec %ecx
	jne 1b
	
	adc $0, %rax
	
2:	mov %esi, %ecx
	shr $3, %ecx
	and $7, %ecx  # Also clears the carry flag
	je 4f

3:	adc (%rdi), %rax
	lea 8(%rdi), %rdi
	dec %ecx
	jne 3b

4:	adc $0, %rax
	and $7, %esi
	jne  6f # Some odd bytes to do
	
	# Fold down to 16 bits
5:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq

	# Add in last bytes
6:	lea (,%esi,8),%ecx
	mov $-1, %rdx
	neg %ecx
	shr %cl, %rdx
	and (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	jmp 5b

7:	test $1, %edi
	jne 11f # Odd-aligned is annoying
	
	mov %edi, %ecx
	neg %ecx
	and $0x7, %ecx
	
	cmp %ecx, %esi
	cmovl %esi, %ecx  # Work to do is min(size, alignment)
	sub %ecx, %esi
		
	test $4, %ecx
	je  8f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

8:	test $2, %ecx
	je 9f
	movzwq (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

9:	test $1, %ecx
	je 10f
	movzbq (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	add $1, %rdi

10:	test %esi, %esi # Nothing left to do?
	je 5b
	
	# Continue with the rest
	jmp 0b

11:	movzbw (%rdi), %r8w
	inc %rdi
	dec %esi
	je 12f # Only one byte to do?
	call checksum13
	not %ax
	xchg %al, %ah
	add %r8w, %ax
	adc $0, %ax
	not %ax
	retq

12:	mov %r8w, %ax
	not %ax
	retq
.size checksum13, .-checksum13

This gives:

Size (bytes)    64      1023    1024
Time (s)        0.128   1.13    1.05

This is faster than the earlier adc-based code, except for the odd-length (1023 byte) case, which is slightly slower. If we could only make those cases more like the ones that are multiples of 64 bytes (the unrolled loop step size), then we might be able to increase speed a little further. One way to do this is via the asm equivalent of Duff's device: we can jump into the middle of the unrolled loop when we start it. Unfortunately, we still need to handle lengths that aren't exact multiples of the 64-bit register size in a special way, but we needed to do that anyway.
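
For comparison with the C version of the trick, a hedged sketch of Duff's device applied to a plain summing loop (ignoring the carry handling, and with a made-up helper name) looks like:


/* Sum 'count' 64-bit words, entering the unrolled loop part-way through so
   that no separate tail loop is needed. Assumes count > 0. */
unsigned long long sum_words(const unsigned long long *p, unsigned count)
{
	unsigned long long sum = 0;
	unsigned n = (count + 3) / 4;

	switch (count & 3)
	{
	case 0:	do {	sum += *p++;
	case 3:		sum += *p++;
	case 2:		sum += *p++;
	case 1:		sum += *p++;
		} while (--n);
	}

	return sum;
}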

Normally, you would construct a jump table, and then use it to branch to the correct spot in the inner loop. However, here all the interesting instructions have a known length: each adc with a displacement is four bytes, and only the first (displacement-free) adc is one byte shorter, which the "1f-1" below accounts for. This means we can do some arithmetic and emulate the equivalent of a computed goto, avoiding a relatively expensive unpredictable memory lookup. The resulting code looks like:


.globl checksum14
.type checksum14,@function
.align 16
checksum14:
	xor %eax, %eax
	test $7, %edi
	jne 5f
0:	mov %esi, %edx
	mov %esi, %ecx
	shr $3, %edx
	je 2f
	shr $6, %ecx
	and $7, %edx
	# A multiple of 64 means we do full executions of the loop
	je 1f
	add $1, %ecx
	# We need to adjust the pointer, because we have a partial loop
	lea -64(%rdi,%rdx,8), %rdi
	# Compute the jump point
	neg %edx
	and $7, %edx
	lea 1f-1(,%rdx,4), %rdx
	jmp *%rdx

.align 16	
1:	adc (%rdi), %rax
	adc 8(%rdi), %rax
	adc 16(%rdi), %rax
	adc 24(%rdi), %rax
	adc 32(%rdi), %rax
	adc 40(%rdi), %rax
	adc 48(%rdi), %rax
	adc 56(%rdi), %rax
	lea 64(%rdi), %rdi
	dec %ecx
	jne 1b
	
	adc $0, %rax
2:	and $7, %esi
	jne  4f # Some odd bytes to do
	
	# Fold down to 16 bits
3:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq

	# Add in last bytes
4:	lea (,%esi,8),%ecx
	mov $-1, %rdx
	neg %ecx
	shr %cl, %rdx
	and (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	jmp 3b

5:	test $1, %edi
	jne 9f # Odd-aligned is annoying
	
	mov %edi, %ecx
	neg %ecx
	and $0x7, %ecx
	
	cmp %ecx, %esi
	cmovl %esi, %ecx  # Work to do is min(size, alignment)
	sub %ecx, %esi
		
	test $4, %ecx
	je  6f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

6:	test $2, %ecx
	je 7f
	movzwq (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

7:	test $1, %ecx
	je 8f
	movzbq (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	add $1, %rdi

8:	test %esi, %esi # Nothing left to do?
	je 3b
	
	# Continue with the rest
	jmp 0b

9:	movzbw (%rdi), %r8w
	inc %rdi
	dec %esi
	je 10f # Only one byte to do?
	call checksum14
	not %ax
	xchg %al, %ah
	add %r8w, %ax
	adc $0, %ax
	not %ax
	retq

10:	mov %r8w, %ax
	not %ax
	retq
.size checksum14, .-checksum14

This gives:

Size (bytes)    64      1023    1024
Time (s)        0.118   1.10    1.04

The above is very good, the best code so far on this machine. The only thing left to try is shrinking the inner loop. The current code processes 64 bytes at a time, like the vectorized version did. However, the old adc-based algorithm found that 32 bytes at a time was best. Let's take one last look at what that might do:


.globl checksum15
.type checksum15,@function
.align 16
checksum15:
	xor %eax, %eax
	test $7, %edi
	jne 5f
0:	mov %esi, %edx
	mov %esi, %ecx
	shr $3, %edx
	je 2f
	shr $5, %ecx
	and $3, %edx
	je 1f
	add $1, %ecx
	lea -32(%rdi,%rdx,8), %rdi
	neg %edx
	and $3, %edx
	lea 1f-1(,%rdx,4), %rdx
	jmp *%rdx

# Now only 32 bytes at a time
.align 16	
1:	adc (%rdi), %rax
	adc 8(%rdi), %rax
	adc 16(%rdi), %rax
	adc 24(%rdi), %rax
	lea 32(%rdi), %rdi
	dec %ecx
	jne 1b
	
	adc $0, %rax
2:	and $7, %esi
	jne  4f # Some odd bytes to do
	
	# Fold down to 16 bits
3:	mov %eax, %edx
	shr $32, %rax
	add %edx, %eax
	adc $0, %eax
	mov %eax, %edx
	shr $16, %eax
	add %dx, %ax
	adc $0, %ax

	not %ax
	retq

	# Add in last bytes
4:	lea (,%esi,8),%ecx
	mov $-1, %rdx
	neg %ecx
	shr %cl, %rdx
	and (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	jmp 3b

5:	test $1, %edi
	jne 9f # Odd-aligned is annoying
	
	mov %edi, %ecx
	neg %ecx
	and $0xf, %ecx
	
	cmp %ecx, %esi
	cmovl %esi, %ecx  # Work to do is min(size, alignment)
	sub %ecx, %esi
	
	test $8, %ecx
	je  6f
	movq (%rdi), %rax
	add $8, %rdi
	
6:	test $4, %ecx
	je  7f
	movl (%rdi), %edx
	add %rdx, %rax
	adc $0, %rax
	add $4, %rdi

7:	test $2, %ecx
	je 8f
	movzwq (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	add $2, %rdi

8:	test $1, %ecx
	je 0b
	movzbq (%rdi), %rdx
	add %rdx, %rax
	adc $0, %rax
	add $1, %rdi

	test %esi, %esi # Nothing left to do?
	je 3b
	
	# Continue with the rest
	jmp 0b

9:	movzbw (%rdi), %r8w
	inc %rdi
	dec %esi
	je 10f # Only one byte to do?
	call checksum15
	not %ax
	xchg %al, %ah
	add %r8w, %ax
	adc $0, %ax
	not %ax
	retq

10:	mov %r8w, %ax
	not %ax
	retq
.size checksum15, .-checksum15

The results don't change all that much:

Size (bytes)    64      1023    1024
Time (s)        0.128   1.10    1.04

The only major difference is that the small 64-byte packet time has gone up, due to it requiring two loop iterations instead of one. However, it is still faster than the earlier unvectorized code. On the other hand, we may want to stick with the 64-byte unrolled version to avoid this slight speed loss.

Summary

After trying many, many optimization tricks, we have found a method of calculating the TCP/IP checksum up to an order of magnitude faster than the original C code. Most of the gains were due to computing on larger chunks of data (64 bits rather than 16 bits at a time). More gains came from unrolling the inner loop, and from controlling use of the carry flag. Small gains could be obtained by controlling alignment, and by breaking slow paths out of the main control flow.

Unfortunately, vectorization was not a win on this machine for this problem. The non-orthogonality of the SSE instructions can hurt quite a bit when the operation you need isn't supported. However, this isn't the full story. On more recent machines the adc instruction, which is key to the unvectorized algorithm, is relatively much slower. In addition, the vectorized instructions are much faster, and there are more of them doing new things. The combination of these changes could mean that one of the "slower" functions above could be the better choice.

If vectorization does prove to be better on your machine, perhaps you may want to explore adding the Duff's device trick to the best vectorized version. You may also want to explore using instructions added after SSE2, or perhaps even the 256-bit AVX extensions. As hardware changes over time, so does the optimal version of an algorithm. Most of the time, it isn't worth it to hyper-optimize something. In some cases, where you absolutely need the performance edge, it can be highly profitable though.

Comments

yolo swag said...
best article i've read in forever

captcha is waay too hard
Sune said...
Very detailed and thorough. Thanks!
oiskuu said...
Hello, my few words of advice.

It is generally futile to unroll loops without breaking dependencies.
Sum loop based on adc is essentially one long chain of dependent ops.
I've run couple of brief tests on a Lynnfield box, here's how checksum6 performs:

$ perf stat ./a.out 1000
10000000 loops; len=1000; res = ...

        6827798474 cycles # 2.960 GHz
        5574096465 stalled-cycles-frontend # 81.64% frontend cycles idle
        4505422340 stalled-cycles-backend # 65.99% backend cycles idle
        2550630533 instructions # 0.37 insns per cycle
                                             # 2.19 stalled cycles per insn
...

More results. checksum1 and checksum2, compiled with gcc-4.5.2 -O2, are
~ 1.05 clocks/byte (IPC 2.86), and 0.4 clocks/byte (IPC 2.25), respectively.

checksum3 is ~ 0.53 clocks/byte (IPC 1.51), checksum6 ~ 0.68 clocks/byte (IPC 0.37)


I wrote a version of checksum with unrolled loop of four uint32 adds into
four different uint64 registers, disassembles like this:
  400620: 44 8b 10 mov (%rax),%r10d
  400623: 4c 01 d2 add %r10,%rdx
  400626: 44 8b 50 04 mov 0x4(%rax),%r10d
  40062a: 4d 01 d1 add %r10,%r9
  40062d: 44 8b 50 08 mov 0x8(%rax),%r10d
  400631: 4d 01 d0 add %r10,%r8
  400634: 44 8b 50 0c mov 0xc(%rax),%r10d
  400638: 48 83 c0 10 add $0x10,%rax
  40063c: 4c 01 d1 add %r10,%rcx
  40063f: 4c 39 d8 cmp %r11,%rax
  400642: 75 dc jne 400620 <checksum77+0x30>

$ perf stat ./a.out 1000
10000000 loops; len=1000; res = ...

        2740821769 cycles # 2.959 GHz
         905954412 stalled-cycles-frontend # 33.05% frontend cycles idle
          29838920 stalled-cycles-backend # 1.09% backend cycles idle
        7565283438 instructions # 2.76 insns per cycle
                                             # 0.12 stalled cycles per insn
...
What cpu did you run your tests on?

Finally, I'd like to let you know that I'm a robot, and that I find your captchas way too simple!
ddennerline said...
Thanks for writing an intelligent and insightful article.

I found your article on SSE2 optimization and IP checksum via Google as I have been interested in using Altivec/SSE2 version of IP checksum for many years. I took some of the ideas in article and coded up a slightly different version. This version uses three simple idea:
1) No alignment checking (causes branch mispredictions)
2) Use two independent memory strides by breaking buffer in half
3) Sum each stride separately (for multi-execution CPUs)

I originally tried verifying 64-bit sum version that has overflow checking. VTune showed a large number of branch mispredictions, so the 32-bit sum without overflow checking performs better.

Unfortunately, this code crashes using gcc -O3, but it runs correctly on VS2008. In tracking down the problem, it appears that gcc for 64-bit platforms has -sse optimization enabled by default. Since the code below doesn't care about integer alignment, it should technically work on Intel CPUs (not Sparc or PPC).

What I found interesting is that gcc uses SSE2 instruction when your article in-fact shows them as no-win. I suspect the code is crashing due to an SSE2 alignment issue (or gcc loop unrolling defect).

0x00007ffff6b4c7c0 in in_cksum (data=0x68b2cd "\n\352\f8\220\204;\235\263\067P\030\372\360m\201", len=499,
    initialValue=<optimized out>) at ../../modules/platform_linux64/platform/../../platform/checksum64.c:59
59 sum += *buf;
   0x00007ffff6b4c7b4 <in_cksum+196>: 66 0f ef d2 pxor %xmm2,%xmm2
   0x00007ffff6b4c7b8 <in_cksum+200>: 0f 1f 84 00 00 00 00 00 nopl 0x0(%rax,%rax,1)
=> 0x00007ffff6b4c7c0 <in_cksum+208>: 66 43 0f 6f 5c 15 00 movdqa 0x0(%r13,%r10,1),%xmm3
   0x00007ffff6b4c7c7 <in_cksum+215>: 41 83 c3 01 add $0x1,%r11d
   0x00007ffff6b4c7cb <in_cksum+219>: 66 0f 6f e3 movdqa %xmm3,%xmm4
   0x00007ffff6b4c7cf <in_cksum+223>: 66 0f 6a da punpckhdq %xmm2,%xmm3
   0x00007ffff6b4c7d3 <in_cksum+227>: 66 0f 62 e2 punpckldq %xmm2,%xmm4
   0x00007ffff6b4c7d7 <in_cksum+231>: 66 0f d4 cc paddq %xmm4,%xmm1
   0x00007ffff6b4c7db <in_cksum+235>: 66 0f d4 cb paddq %xmm3,%xmm1

uint_fast16 in_cksum(const unsigned char *data, size_t len, uint_fast32 initialValue) {
    uint64 sum = initialValue;
    uint64 sum2 = 0;
    uint16 final;
    uint_fast32 stride = ((len / sizeof(uint32)) * sizeof(uint32)) / (sizeof(uint32)*2);
    const uint32* buf = (const uint32*)data;
    const uint32* buf2 = (const uint32*)&buf[stride];

    /* Break the buffer up into two different strides - sum each one independently */
    for (; stride--; ++buf, ++buf2) {
        sum += *buf;
        sum2 += *buf2;
    }
    sum += sum2;

    /* Roll second sum back into equation */
    assert(buf2 >= data);
    len -= (((const unsigned char*)buf2) - data);
    data = (const unsigned char*)buf2;

    if (len & 4) {
        sum += *((const uint32*)data);
        data += sizeof(uint32);
        len -= 4;
    }
    if (len & 2) {
        sum += *((const uint16*)data);
        data += sizeof(uint16);
        len -= 2;
    }
    if (len) {
        sum += *data++;
        --len;
    }
    assert(len == 0);


    /* Convert 64-bit value into 16-bit CRC */
    {
        uint32 i1, i2;
        uint16 i4;

        i1 = (uint32)sum;
        i2 = sum >> 32;
        i1 += i2;
        if (i1 < i2)
            i1++;
        final = (uint16)i1;
        i4 = i1 >> 16;
        final += (uint16)i4;
        if (final < i4)
            final++;
    }
    return final;
} /* in_cksum */
Eduardo Panisset said...
There is an issue when handling a tail less than 8 bytes long:

instead of

      if (size)
      {
          unsigned char s = *(unsigned char *) buf;
          sum += s;
          if (sum < s) sum++;
      }

shouldn't we have :

      if (size & 1)
      {
          unsigned char s = *(unsigned char *) buf;
          sum += s;
          if (sum < s) sum++;
      }

Eduardo Panisset
robot said...
I am robot!
A grown up coder said...
Looks to me like you go to assembler space to have fun/show off.

The excuse seems to be that "C" cannot access the carry flag. Then when that does not prove a real advantage, you go ahead and proceed with the standard loop unrolling techniques that are much easier to show in C...

I have a strong feeling a C implementation would be as fast, and 100 times more readable. Probably not so much fun (for you!) though ;-)
