
The Euclidean Algorithm

The task for today is to find the fastest algorithm to calculate the greatest common divisor (GCD) of a pair of numbers. The GCD has a number of properties that allow us to express the GCD of a pair of larger numbers as the GCD of a smaller pair. Applying these properties recursively leads to a solution.

GCD(0, 0) = 0
GCD(x, 0) = x
GCD(x, y) = GCD(y, x)
GCD(x, y) = GCD(x - y, y)
GCD(a⋅x, a⋅y) = a⋅GCD(x, y)

The first of the above identities follows from convention: the GCD of zero and zero is defined to be zero. This makes some identities with the least common multiple simpler.
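
Taken together, the identities give a simple tail-recursive definition. As a minimal sketch (illustrative only, not the benchmarked code):


typedef unsigned long long u64b;

/* Direct recursive translation of the identities above */
u64b gcd_rec(u64b x, u64b y)
{
	if (!y) return x;                 /* GCD(x, 0) = x, also covering GCD(0, 0) = 0 */
	if (x < y) return gcd_rec(y, x);  /* GCD(x, y) = GCD(y, x) */
	return gcd_rec(x - y, y);         /* GCD(x, y) = GCD(x - y, y) */
}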

So how should we benchmark an algorithm for this problem? The difficulty is that the GCD of one pair of numbers can be quite different from the GCD of another pair, even if the two pairs are quite close in magnitude. To be fair, we need to test many pairs of numbers and obtain an overall average speed. Another issue is that generating such pairs of numbers can be quite slow. (We don't want to be benchmarking the speed of the random number generator.)

To solve these problems, we allocate a large array to hold 2^24 pairs of random numbers. Once the array is filled, we can use the rdtsc instruction before and after calculating the GCD of all the pairs. This avoids including the setup time in the benchmark results. Finally, by dividing by the number of pairs, we can find out how many cycles a given algorithm takes per pair, on average, on this machine.
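
In outline, such a harness might look like the sketch below. The names and structure are illustrative assumptions; only the timing method (rdtsc around the whole array, divided by the pair count) reflects the setup described above.


#include <stdio.h>

typedef unsigned long long u64b;

#define PAIRS (1ULL << 24)

/* Read the time-stamp counter */
static inline u64b rdtsc(void)
{
	unsigned lo, hi;
	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((u64b) hi << 32) | lo;
}

/* Time one gcd implementation over a pre-filled array of 2^24 pairs */
static void bench(const char *name, u64b (*gcd)(u64b, u64b), const u64b *pairs)
{
	u64b i, sink = 0, start, cycles;

	start = rdtsc();
	for (i = 0; i < PAIRS; i++)
	{
		sink += gcd(pairs[2 * i], pairs[2 * i + 1]);
	}
	cycles = rdtsc() - start;

	/* The sink stops the calls from being optimized away */
	printf("%s: %llu cycles/pair (checksum %llu)\n", name, cycles / PAIRS, sink);
}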

Euclidean Algorithm

Since the algorithm above is tail-recursive, we can re-express it in an iterative form:


typedef unsigned long long u64b;
u64b gcd0(u64b u, u64b v)
{
	/* Special case when input is zero */
	if (!u) return v;

	/* Repeatedly subtract off the smaller from the larger */
	while (v)
	{
		if (u > v)
		{
			u -= v;
		}
		else
		{
			v -= u;
		}
	}
	
	return u;
}

This subtractive algorithm takes 4700 cycles on average for random pairs of 64-bit numbers.

The above can be substantially improved. If we notice that repeated subtraction is the same as determining the remainder of a division, we can use a modulus operator instead. This is the classical Euclidean algorithm.

GCD(x, y) = GCD(x mod y, y)

In C, this becomes:


u64b gcd1(u64b u, u64b v)
{
	if (!u || !v) return u | v;
	
	/* Repeatedly obtain remainder of smaller from the larger */
	while (v)
	{
		u64b t = v;
		v = u % v;
		u = t;
	}
	
	return u;
}

The above takes about 2762 cycles per pair, approaching twice the speed of the simple subtraction method.

Binary Euclidean Algorithm

Unfortunately, on modern machines, the division instruction is quite slow. This greatly hinders the speed of the above code. To do better, we can explore some additional identities for GCDs:

GCD(2x, 2y) = 2GCD(x, y)
GCD(2x, 2y+1) = GCD(x, 2y+1)
GCD(2x+1, 2y+1) = GCD(2x+1, y - x)

i.e. the GCD of two even numbers is twice the GCD of half of those numbers. The GCD of an even and an odd number may be simplified by noticing that the factor of two cannot be a divisor of the odd number. Finally, the GCD of two odd numbers can be simplified by subtracting the smaller from the larger to yield an even number, which can then in turn be simplified by removing the extra powers of two. Solving for a GCD this way is known as the "Binary Euclidean Algorithm".


u64b gcd2(u64b u, u64b v)
{
	int shift;

	if (!u || !v) return u | v;

	/* Get common multiples of two */
	for (shift = 0; !((u | v) & 1); shift++)
	{
		u /= 2;
		v /= 2;
	}

	/* Make u odd */
	while (!(u & 1)) u /= 2;

	/* From here on, u is always odd. */
	do
	{
		while (!(v & 1)) v /= 2;

		/* Subtract the smaller from the larger */
		if (u < v)
		{
			v -= u;
		}
		else
		{
			u64b t = u - v;
			u = v;
			v = t;
		}
	} while (v);

	return u << shift;
}

The above takes an average of 1450 cycles, which is a nice speed-up.

Now the above can be improved a little more by using the bsf instruction. This allows us to obtain all the powers of two in one step, rather than using a loop. gcc exposes this with the __builtin_ctzll() function. Wrapping that in an inline function for easier typing, and using it leads to:


static inline u64b ffsq(u64b x)
{
	return __builtin_ctzll(x);
}

u64b gcd3(u64b u, u64b v)
{
	int shift;

	if (!u || !v) return u | v;

	shift = ffsq(u | v);
	u >>= ffsq(u);

	while (v)
	{
		v >>= ffsq(v);

		if (u < v)
		{
			v -= u;
		}
		else
		{
			u64b t = u - v;
			u = v;
			v = t;
		}
	}

	return u << shift;
}

This is yet again a very nice improvement, taking 735 cycles per pair.

Base six Euclidean Algorithm

The reason the Binary Euclidean Algorithm works so well is that two is the most common prime factor, and computers are very good at manipulating numbers in base two (binary). The next-most common prime factor is three. It would be nice if we could extend the binary algorithm to use it as well. Fortunately, testing whether a number is a multiple of three is relatively easy.

If 0xaaaaaaaaaaaaaaab * x <= 0x5555555555555555, then the 64-bit unsigned integer x is a multiple of three. To show this, note that the number 0xaaaaaaaaaaaaaaab is the inverse of three mod 2^64. When multiplied by a multiple of three, it produces the exact quotient. Since (2^64 - 1)/3 = 0x5555555555555555 is the largest possible such quotient, any larger result must have come from a non-multiple of three.

We can use a similar trick to test to see if a number is an exact multiple of five: 0xcccccccccccccccd * x <= 0x3333333333333333. However, multiples of five are rare enough that this perhaps isn't worth the cost of the multiplication instruction.
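
Both tests are a single multiply and compare. As a quick sketch (reusing the u64b typedef from earlier; not the benchmarked code):


/* Nonzero if x is an exact multiple of three (respectively five) */
static inline int is_mult3(u64b x)
{
	return x * 0xaaaaaaaaaaaaaaabULL <= 0x5555555555555555ULL;
}

static inline int is_mult5(u64b x)
{
	return x * 0xcccccccccccccccdULL <= 0x3333333333333333ULL;
}

Note that when the test succeeds, the product is the exact quotient x/3 (or x/5), so a single multiplication both tests for divisibility and performs the division. The code below exploits this.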

The idea here is that since the bsf instruction is slow, we can overlap the execution of the multiply for the multiple-of-three test with it. This should yield an algorithm with an inner loop that is faster than the Binary Euclidean Algorithm because it will shrink powers of two as well as three.

However, there is one other technicality we need to think about. The above doesn't check for multiples of 9, 27, 81, etc., only for a single factor of three. If we wish to "shrink" higher powers, we need to repeat the multiply-and-compare test. This is unfortunately slow, as the repeats will not overlap with the bsf instruction. Thus we will only reduce by a single power of three at each step.


u64b gcd4(u64b u, u64b v)
{
	int s1, s2;
	u64b p = 1;
	u64b c3 = 0x5555555555555555;
	u64b m3 = 0xaaaaaaaaaaaaaaab;
	u64b t1, t2;

	if (!u || !v) return u | v;

	s1 = ffsq(u);
	s2 = ffsq(v);
	
	u >>= s1;
	if (s1 > s2) s1 = s2;
	
	t1 = u * m3;
	t2 = v * m3;

	while ((t1 <= c3) && (t2 <= c3))
	{
		p *= 3;
		u = t1;
		v = t2;
		
		t1 = u * m3;
		t2 = v * m3;
	}

	while (t2 <= c3)
	{
		v = t2;
		t2 = v * m3;
	}
	
	while (t1 <= c3)
	{
		u = t1;
		t1 = u * m3;
	}
	
	while (v)
	{
		int t3 = ffsq(v);
		
		/* Try to shrink a power of three */
		t1 = v * m3;
		if (t1 <= c3) v = t1;
		
		/* Now remove all powers of two */
		v >>= t3;

		/* Binary Euclidean Algorithm part */
		if (u < v)
		{
			v -= u;
		}
		else
		{
			u64b t = u - v;
			u = v;
			v = t;
		}
	}

	return (u << s1) * p;
}

The results are unfortunately disappointing, taking 848 cycles per pair, slower than the Binary Euclidean Algorithm. However, looking at the generated assembly language, it is easy to see why. The use of large 64-bit constants seems to confuse gcc. Instead of keeping them in registers, it repeatedly uses the slow and bulky movabs instruction to reload them. It also isn't particularly good at optimizing the critical inner loop. To do better, we'll need to move from C to asm.

Using ASM

The Binary Euclidean Algorithm is relatively easy to convert into assembly language. However, there are a few optimizations we can make. The first is to notice that we end up shifting the result left by ffs(u | v), which is the same as multiplying by the lowest set bit of u | v. If we know the value of that bit, we can multiply by it instead. Fortunately, there is a bit-hack, x & (-x), that extracts exactly that bit. We can use it to avoid a slow bsf instruction.
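
As a quick illustration of the identity (a sketch, not part of the benchmark):


/* For nonzero x, x & -x isolates the lowest set bit,
   so x & -x == 1ULL << __builtin_ctzll(x) */
static inline u64b lowest_bit(u64b x)
{
	return x & -x;
}

Multiplying the final result by lowest_bit(u | v) then replaces the shift left by ffs(u | v).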

Another trick is to notice that the comparison used to determine whether to subtract u from v or v from u sets the same flags as a subtraction. Thus, we can pretend that the subtraction does things the "right" way, and then fix things up later if it doesn't. Since the fix-up only takes two instructions, this is quite nice. The resulting code is:


.globl gcd5
.type   gcd5,@function
.align 16
gcd5:
	mov    %rsi, %rax
	or     %rdi, %rax

# Check for zero arguments
	test   %rsi, %rsi
	je     2f
	test   %rdi, %rdi
	je     2f
	
# Start to calculate 1 << ffs(u | v)
	mov    %rax, %rdx

# Start to calculate u >> ffs(u)
	bsf    %rdi, %rcx
	neg    %rax
	shr    %cl, %rdi
	and    %rdx, %rax

# The inner loop
.align 16
1:	bsf    %rsi, %rcx
	shr    %cl, %rsi

# Case 1, u < v
	sub    %rdi, %rsi
	ja     1b

# Otherwise, we need to do more calculation
	add    %rsi, %rdi

# This in addition to correcting the sign of v, also tests v against zero
	neg    %rsi
	jne    1b

# Done, multiply by shift factor
	imul   %rdi, %rax
2:	retq
.size gcd5, .-gcd5

This takes 903 cycles per pair, slower than both of the C algorithms above! What is going on? It turns out that although the assembly code is very compact, it suffers from a fatal flaw: the jump in the inner loop is unpredictable. The resulting mispredictions cause a large slowdown. gcc is able to generate branchless code there, and so performs better.

So how shall we remove the conditional jump? One obvious trick is to use the sbb instruction to create a mask. We can make a conditional add by a logical-and with the mask followed by an add. We can make a conditional negation by xoring with the mask and then subtracting it. (This is how a branchless abs function is often implemented.)
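
In C terms, one iteration of the subtract step becomes something like the following sketch (a hypothetical helper, shown only to illustrate the mask trick):


/* After this step, u holds min(u, v) and v holds |v - u| */
static inline void gcd_step(u64b *u, u64b *v)
{
	u64b d = *v - *u;
	u64b mask = -(u64b)(*v < *u);  /* all-ones on borrow (v < u), else zero */

	*u += d & mask;                /* conditional add: u becomes v, the minimum */
	*v = (d ^ mask) - mask;        /* conditional negation: v becomes |v - u| */
}

Doing this in assembly, we have: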


.globl gcd6
.type   gcd6,@function
.align 16
gcd6:
	mov    %rsi, %rax
	or     %rdi, %rax
	test   %rsi, %rsi
	je     2f
	test   %rdi, %rdi
	je     2f
	mov    %rax, %rdx
	bsf    %rdi, %rcx
	neg    %rax
	shr    %cl, %rdi
	and    %rdx, %rax

# The inner loop
.align 16
1:	bsf    %rsi, %rcx
	shr    %cl, %rsi
	sub    %rdi, %rsi

# Create the mask
	sbb    %rdx, %rdx

# Duplicate the mask
	mov    %rdx, %rcx

# Start of conditional add
	and    %rsi, %rdx

# Start of conditional negation
	xor    %rcx, %rsi

# Complete conditional add
	add    %rdx, %rdi

# Complete conditional negation
	sub    %rcx, %rsi
	jne    1b
	
	imul   %rdi, %rax
2:	retq
.size gcd6, .-gcd6

This takes 753 cycles per pair. Faster, but still not good enough to beat the C version. Perhaps instead of being quite so tricky, we should use conditional move instructions. By finding the maximum and minimum, we can do the subtraction afterwards. This means we have both a compare and a subtraction... but it may still be faster.
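
In C, the step is a conditional swap followed by a subtraction, roughly as in this sketch (a hypothetical helper; the compare maps to the cmova instructions in the asm):


/* Sort the pair so u is the minimum, then subtract:
   afterwards u = min(u, v) and v = |v - u| */
static inline void gcd_step_cmov(u64b *u, u64b *v)
{
	u64b t = *u;

	if (*u > *v)  /* becomes two conditional moves rather than a branch */
	{
		*u = *v;
		*v = t;
	}

	*v -= *u;
}

The resulting code looks like: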


.globl gcd7
.type   gcd7,@function
.align 16
gcd7:
	mov    %rsi, %rax
	or     %rdi, %rax
	test   %rsi, %rsi
	je     2f
	test   %rdi, %rdi
	je     2f
	mov    %rax, %rdx
	bsf    %rdi, %rcx
	neg    %rax
	shr    %cl, %rdi
	and    %rdx, %rax

.align 16
1:	bsf    %rsi, %rcx

# Save u for later
	mov    %rdi, %rdx
	shr    %cl, %rsi

# Compare u and v
	cmp    %rsi, %rdi

# Get minimum in %rdi, and maximum in %rsi
	cmova  %rsi, %rdi
	cmova  %rdx, %rsi

# Do the subtraction
	sub    %rdi, %rsi

	jne    1b
	imul   %rdi, %rax
2:	retq
.size gcd7, .-gcd7

This takes 663 cycles per pair, and finally we are faster than C. However, with a little more thinking, we may be able to do even better by rearranging the order of the operations. We do this by subtracting twice, computing u - v and v - u, and then picking the needed results via conditional moves. This removes the cmp instruction and obtains more inter-instruction parallelism.
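
The selection logic is roughly the following sketch (again a hypothetical helper; the asm below selects with cmovc on the borrow of v - u):


/* Compute both differences, then keep the non-negative one.
   Afterwards, u holds min(u, v) and v holds |v - u| */
static inline void gcd_step2(u64b *u, u64b *v)
{
	u64b d1 = *u - *v;  /* correct when u >= v */
	u64b d2 = *v - *u;  /* correct when v >= u */
	u64b t = *v;

	if (*v < *u)        /* the borrow of v - u */
	{
		*u = t;
		*v = d1;
	}
	else
	{
		*v = d2;
	}
}

The resulting assembly is: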


.globl gcd8
.type   gcd8,@function
.align 16
gcd8:
	mov    %rsi, %rax
	or     %rdi, %rax
	test   %rsi, %rsi
	je     2f
	test   %rdi, %rdi
	je     2f
	mov    %rax, %rdx
	bsf    %rdi, %rcx
	neg    %rax
	shr    %cl, %rdi
	and    %rdx, %rax

#   Inner loop combining conditional moves and comparison-removal
.align 16
1:	bsf    %rsi, %rcx
	mov    %rdi, %rdx
	shr    %cl, %rsi
	sub    %rsi, %rdx
	mov    %rsi, %rcx
	sub    %rdi, %rsi
	cmovc  %rdx, %rsi
	cmovc  %rcx, %rdi
	jne    1b
	imul   %rdi, %rax
2:	retq
.size gcd8, .-gcd8

The above code takes 574 cycles per pair, about 20% fewer cycles than the C method. Unfortunately, it looks to be rather difficult to improve things further. One obvious trick is to use the fact that the initial test of %rdi for zero and the zero-flag result of the bsf instruction are redundant. However, removing the extra test doesn't result in any speedup. Perhaps the reason for this is that the inner loop is aligned on a 16-byte boundary. By coincidence, the length of the initialization code is already a multiple of 16 bytes, so the alignment directive actually does nothing in this case. However, if we start removing instructions, the assembler will need to add a nop, counteracting any speedup we had hoped to get.

Now that we have the best implementation of the Binary Euclidean Algorithm we could find, we can use the same tricks to speed up the base six algorithm. The resulting code looks like:


.globl gcd9
.type   gcd9,@function
.align 16
gcd9:
	mov    %rsi, %rax
	or     %rdi, %rax
	test   %rsi, %rsi
	je     5f
	test   %rdi, %rdi
	je     5f
	
	bsf    %rdi, %rcx
	mov    %rax, %r11
	mov    %rsi, %r9
	shr    %cl, %rdi
	movabs $0x5555555555555555, %r8
	neg    %rax
	mov    %rdi, %r10

# Use the fact that the multiplier and comparison are related to avoid movabs
	lea    0x1(%r8, %r8), %rdx
	and    %r11, %rax
	mov    %rdx, %r11
	jmp    2f

.align 8
# The loop to remove common powers of three
# Multiply %rax by three
1:	lea    (%rax, %rax, 2), %rax
	mov    %r10, %rdi
	mov    %r9, %rsi
	
2:	imul   %rdx, %r10
	cmp    %r8, %r10
	ja     4f
	
	imul   %rdx, %r9
	cmp    %r8, %r9
	jbe    1b

# Remove powers of three from u.
3:  mov    %r10, %rdi
    imul   %rdx, %r10
	cmp    %r8, %r10
	jbe    3b

# The inner loop
.align 16

# Overlap the execution of the multiply and bitscan
4:	imul   %rsi, %r11
	bsf    %rsi, %rcx
	mov    %rdi, %r10
	cmp    %r8, %r11
	cmovbe %r11, %rsi

# Exact division by two and three commute
	shr    %cl, %rsi

# Similar code to the Binary Euclidean Algorithm
	mov    %rdx, %r11
	sub    %rsi, %r10
	mov    %rsi, %rcx
	sub    %rdi, %rsi
	cmovc  %r10, %rsi
	cmovc  %rcx, %rdi
	jne    4b
	imul   %rdi, %rax
5:	retq
.size gcd9, .-gcd9

This takes a slightly disappointing 594 cycles per pair, just a little bit slower than the Binary Euclidean Algorithm. What is going on? It turns out that the inner loop here is actually faster than the binary case. However, the initial removal of powers of three is quite slow. The slow initialization makes all the difference, and the timings are worse. Perhaps with larger numbers (which need more inner loop iterations on average), things would be different.

Summary

It isn't easy to find the fastest way of doing something. You may need to try many different algorithms, and implementations of those algorithms. In this case, the assembly version of the Binary Euclidean Algorithm was the fastest. It is eight times faster than the naive code we started with, and nearly five times faster than the C implementation of the Euclidean Algorithm described in many textbooks.

Another thing to notice is that compilers are improving over time. The results from the pre-release version of gcc 4.6 used here are particularly good, with the tuned asm version only being about 28% faster. If you only write simple asm code, the compiler may well do a better job.

Comments

said...

How do the ideas presented here relate to finding the quotient ((int)i/j) and remainder (i%j) of integers?

Akira Shirase said...
Some of this code is pretty hard to understand. I suggest running the implementations on the same inputs and showing the process of solving them step by step; that would make a great visualization. I hope the masters who wrote these have time to improve this site, because it is very educational :) Hope you continue posting great stuff.
