# The Euclidean Algorithm

The task for today is to find the fastest algorithm to calculate the greatest common divisor (GCD) of a pair of numbers. The GCD has a number of properties that allow us to express the GCD of a pair of larger numbers as the GCD of smaller numbers. By using recursion, this leads to a solution.

    GCD(0, 0) = 0
    GCD(x, 0) = x
    GCD(x, y) = GCD(y, x)
    GCD(x, y) = GCD(x - y, x)
    GCD(a⋅x, a⋅y) = a⋅GCD(x, y)

The first of the above identities is a convention: we say that zero divides zero zero times (rather than one time). This makes some identities involving the least common multiple simpler.

So how should we benchmark an algorithm for this problem? The difficulty is that the GCD of one pair of numbers can be quite different from the GCD of another pair, even when the two pairs are close in magnitude. To be fair, we need to test many pairs of numbers and obtain an overall average speed. Another issue is that generating such pairs can be quite slow. (We don't want to be benchmarking the speed of the random number generator.)

To solve these problems, we allocate a large array to hold 2²⁴ pairs of random numbers. Once the array is filled, we can use the `rdtsc` instruction before and after calculating the GCD of all the pairs. This avoids including the setup time in the benchmark results. Finally, by dividing by the number of pairs, we can find out how many cycles a given algorithm takes per pair, on average, on this machine.

### Euclidean Algorithm

Since the algorithm above is tail-recursive, we can re-express it in an iterative form:

``````
typedef unsigned long long u64b;

u64b gcd0(u64b u, u64b v)
{
	/* Special case when an input is zero */
	if (!u) return v;

	/* Repeatedly subtract the smaller from the larger */
	while (v)
	{
		if (u > v)
		{
			u -= v;
		}
		else
		{
			v -= u;
		}
	}

	return u;
}
``````

This subtractive algorithm takes 4700 cycles on average for random pairs of 64-bit numbers.

The above can be substantially improved. Repeated subtraction is just a slow way of computing the remainder of a division, so we can use the modulus operator instead. This is the classical Euclidean algorithm.

    GCD(x, y) = GCD(x mod y, y)

In C, this becomes:

``````
u64b gcd1(u64b u, u64b v)
{
	if (!u || !v) return u | v;

	/* Repeatedly obtain the remainder of the larger by the smaller */
	while (v)
	{
		u64b t = v;
		v = u % v;
		u = t;
	}

	return u;
}
``````

The above takes about 2762 cycles per pair, nearly twice as fast as the simple subtraction method.

### Binary Euclidean Algorithm

Unfortunately, on modern machines the division instruction is quite slow, which greatly hinders the speed of the above code. To do better, we can explore some additional identities for GCDs:

    GCD(2x, 2y) = 2⋅GCD(x, y)
    GCD(2x, 2y+1) = GCD(x, 2y+1)
    GCD(2x+1, 2y+1) = GCD(2x+1, y - x)

That is, the GCD of two even numbers is twice the GCD of their halves. The GCD of an even and an odd number may be simplified by noting that a factor of two cannot divide the odd number. Finally, the GCD of two odd numbers can be simplified by subtracting the smaller from the larger, yielding an even number which can in turn be simplified by removing its factors of two. Solving for a GCD this way is known as the "Binary Euclidean Algorithm".

``````
u64b gcd2(u64b u, u64b v)
{
	int shift;

	if (!u || !v) return u | v;

	/* Remove common factors of two */
	for (shift = 0; !((u | v) & 1); shift++)
	{
		u /= 2;
		v /= 2;
	}

	/* Make u odd */
	while (!(u & 1)) u /= 2;

	/* From here on, u is always odd */
	do
	{
		/* Make v odd */
		while (!(v & 1)) v /= 2;

		/* Subtract the smaller from the larger */
		if (u < v)
		{
			v -= u;
		}
		else
		{
			u64b t = u - v;
			u = v;
			v = t;
		}
	} while (v);

	return u << shift;
}
``````

The above takes an average of 1450 cycles, which is a nice speed-up.

Now the above can be improved a little more by using the `bsf` instruction. This allows us to count all the trailing powers of two in one step, rather than using a loop. gcc exposes this with the `__builtin_ctzll()` builtin. Wrapping that in an inline function for easier typing, and using it, leads to:

``````
static inline u64b ffsq(u64b x)
{
	return __builtin_ctzll(x);
}

u64b gcd3(u64b u, u64b v)
{
	int shift;

	if (!u || !v) return u | v;

	shift = ffsq(u | v);
	u >>= ffsq(u);

	while (v)
	{
		v >>= ffsq(v);

		if (u < v)
		{
			v -= u;
		}
		else
		{
			u64b t = u - v;
			u = v;
			v = t;
		}
	}

	return u << shift;
}
``````

This is yet again a very nice improvement, taking 735 cycles per pair.

### Base six Euclidean Algorithm

The reason the Binary Euclidean Algorithm works so well is that two is the most common prime factor, and computers are very good at manipulating numbers in base two (binary). The next most common prime factor is three. It would be nice if we could extend the binary algorithm to use it as well. Fortunately, testing whether a number is a multiple of three is relatively easy.

If `0xaaaaaaaaaaaaaaab * x <= 0x5555555555555555`, then the 64-bit unsigned integer `x` is a multiple of three. To see why, note that `0xaaaaaaaaaaaaaaab` is the inverse of three mod 2⁶⁴, so multiplying it by any multiple of three, 3k, yields k exactly. Since the largest such result is `(2⁶⁴-1)/3 = 0x5555555555555555`, and multiplication by an invertible constant is a bijection, any larger result must come from a non-multiple of three.

We can use a similar trick to test whether a number is an exact multiple of five: `0xcccccccccccccccd * x <= 0x3333333333333333`. However, multiples of five are rare enough that this probably isn't worth the cost of the multiplication instruction.

The idea here is that since the `bsf` instruction is slow, we can overlap the execution of the multiply for the multiple-of-three test with it. This should yield an algorithm with an inner loop that is faster than the Binary Euclidean Algorithm because it will shrink powers of two as well as three.

However, there is one other technicality we need to think about. The above doesn't check for multiples of 9, 27, 81 and so on, only for a single factor of three. If we wish to "shrink" higher powers, we need to repeat the multiply-and-compare test. That is unfortunately slow, as the repeats will not overlap with the `bsf` instruction. Thus we will only reduce by a single power of three at each step.

``````
u64b gcd4(u64b u, u64b v)
{
	int s1, s2;
	u64b p = 1;
	u64b c3 = 0x5555555555555555;
	u64b m3 = 0xaaaaaaaaaaaaaaab;
	u64b t1, t2;

	if (!u || !v) return u | v;

	s1 = ffsq(u);
	s2 = ffsq(v);

	u >>= s1;
	if (s1 > s2) s1 = s2;

	t1 = u * m3;
	t2 = v * m3;

	/* Remove common powers of three */
	while ((t1 <= c3) && (t2 <= c3))
	{
		p *= 3;
		u = t1;
		v = t2;

		t1 = u * m3;
		t2 = v * m3;
	}

	/* Remove remaining powers of three from v */
	while (t2 <= c3)
	{
		v = t2;
		t2 = v * m3;
	}

	/* Remove remaining powers of three from u */
	while (t1 <= c3)
	{
		u = t1;
		t1 = u * m3;
	}

	while (v)
	{
		int t3 = ffsq(v);

		/* Try to shrink a power of three */
		t1 = v * m3;
		if (t1 <= c3) v = t1;

		/* Now remove all powers of two */
		v >>= t3;

		/* Binary Euclidean Algorithm part */
		if (u < v)
		{
			v -= u;
		}
		else
		{
			u64b t = u - v;
			u = v;
			v = t;
		}
	}

	return (u << s1) * p;
}
``````

The results are unfortunately disappointing: 848 cycles per pair, slower than the Binary Euclidean Algorithm. However, looking at the generated assembly language, it is easy to see why. The large 64-bit constants seem to confuse gcc. Instead of keeping them in registers, it repeatedly reloads them with the slow and bulky `movabs` instruction. It also isn't particularly good at optimizing the critical inner loop. To do better, we'll need to move from C to assembly.

### Using ASM

The Binary Euclidean Algorithm is relatively easy to convert into assembly language. However, there are a few optimizations we can make. The first is to notice that we end up shifting the result left by `ffs(u | v)`. If we instead isolate the lowest set bit as a power of two, we can multiply by that value at the end. Fortunately, the bit-hack `x & (-x)` does exactly this, letting us avoid a slow `bsf` instruction.

Another trick is to notice that the comparison used to determine whether to subtract u from v or v from u sets the same flags as a subtraction. Thus, we can pretend that the subtraction does things the "right" way, and then fix things up later if it doesn't. Since the fix-up only takes two instructions, this is quite nice. The resulting code is:

``````
.globl gcd5
.type   gcd5,@function
.align 16
gcd5:
	mov    %rsi, %rax
	or     %rdi, %rax

	# Check for zero arguments
	test   %rsi, %rsi
	je     2f
	test   %rdi, %rdi
	je     2f

	# Start to calculate 1 << ffs(u | v)
	mov    %rax, %rdx

	# Start to calculate u >> ffs(u)
	bsf    %rdi, %rcx
	neg    %rax
	shr    %cl, %rdi
	and    %rdx, %rax

	# The inner loop
	.align 16
1:	bsf    %rsi, %rcx
	shr    %cl, %rsi

	# Case 1, u < v
	sub    %rdi, %rsi
	ja     1b

	# Otherwise, we need to do more calculation.
	# This corrects the sign of v, and also tests v against zero
	neg    %rsi
	jne    1b

	# Done, multiply by the shift factor
	imul   %rdi, %rax
2:	retq
.size gcd5, .-gcd5
``````

This takes 903 cycles per pair, slower than both of the C algorithms above! What is going on? It turns out that although the assembly code is very compact, it suffers from a fatal flaw: the jump in the inner loop is unpredictable. The resulting branch mispredictions cause a large slowdown. gcc is able to generate branchless code there, and so performs better.

So how shall we remove the conditional jump? One obvious trick is to use the `sbb` instruction to create a mask. We can then make a conditional add by a logical-and with the mask followed by an add, and a conditional negation by xoring with the mask and then subtracting it. (This is how a branchless `abs` function is often implemented.) Doing this we have:

``````
.globl gcd6
.type   gcd6,@function
.align 16
gcd6:
	mov    %rsi, %rax
	or     %rdi, %rax
	test   %rsi, %rsi
	je     2f
	test   %rdi, %rdi
	je     2f
	mov    %rax, %rdx
	bsf    %rdi, %rcx
	neg    %rax
	shr    %cl, %rdi
	and    %rdx, %rax

	# The inner loop
	.align 16
1:	bsf    %rsi, %rcx
	shr    %cl, %rsi
	sub    %rdi, %rsi

	# Create the mask
	sbb    %rdx, %rdx
	mov    %rdx, %rcx
	and    %rsi, %rdx

	# Start of conditional negation
	xor    %rcx, %rsi

	# Complete conditional negation
	sub    %rcx, %rsi
	jne    1b

	imul   %rdi, %rax
2:	retq
.size gcd6, .-gcd6
``````

This takes 753 cycles per pair. Faster, but still not good enough to beat the C version. Perhaps instead of being quite so tricky, we should use conditional move instructions. By selecting the maximum and minimum first, we can do the subtraction afterwards. This costs a compare plus a subtraction... but may still be faster. The resulting code looks like:

``````
.globl gcd7
.type   gcd7,@function
.align 16
gcd7:
	mov    %rsi, %rax
	or     %rdi, %rax
	test   %rsi, %rsi
	je     2f
	test   %rdi, %rdi
	je     2f
	mov    %rax, %rdx
	bsf    %rdi, %rcx
	neg    %rax
	shr    %cl, %rdi
	and    %rdx, %rax

	.align 16
1:	bsf    %rsi, %rcx

	# Save u for later
	mov    %rdi, %rdx
	shr    %cl, %rsi

	# Compare u and v
	cmp    %rsi, %rdi

	# Get the minimum in %rdi, and the maximum in %rsi
	cmova  %rsi, %rdi
	cmova  %rdx, %rsi

	# Do the subtraction
	sub    %rdi, %rsi

	jne    1b
	imul   %rdi, %rax
2:	retq
.size gcd7, .-gcd7
``````

This takes 663 cycles per pair, and finally we are faster than C. However, with a little more thought we can do even better by rearranging the order of the operations. We subtract both ways, `u - v` and `v - u`, and then pick the needed results via conditional moves. This removes the `cmp` instruction and exposes more inter-instruction parallelism:

``````
.globl gcd8
.type   gcd8,@function
.align 16
gcd8:
	mov    %rsi, %rax
	or     %rdi, %rax
	test   %rsi, %rsi
	je     2f
	test   %rdi, %rdi
	je     2f
	mov    %rax, %rdx
	bsf    %rdi, %rcx
	neg    %rax
	shr    %cl, %rdi
	and    %rdx, %rax

	# Inner loop combining conditional moves and comparison-removal
	.align 16
1:	bsf    %rsi, %rcx
	mov    %rdi, %rdx
	shr    %cl, %rsi
	sub    %rsi, %rdx
	mov    %rsi, %rcx
	sub    %rdi, %rsi
	cmovc  %rdx, %rsi
	cmovc  %rcx, %rdi
	jne    1b
	imul   %rdi, %rax
2:	retq
.size gcd8, .-gcd8
``````

The above code takes 574 cycles per pair, 20% faster than the C method. Unfortunately, it looks rather difficult to improve things further. One obvious trick is to use the fact that the test of `%rdi` for zero and the zero-flag result of the `bsf` instruction are redundant. However, removing the extra test doesn't result in any speedup. The likely reason is that the inner loop is aligned on a 16-byte boundary. By coincidence, the initialization code is already a multiple of 16 bytes long, so the alignment directive does nothing in this case. If we start removing instructions, the assembler will need to insert a `nop`, counteracting any speedup we had hoped to get.

Now that we have the best implementation of the Binary Euclidean Algorithm we could find, we can use the same tricks to speed up the base six algorithm. The resulting code looks like:

``````
.globl gcd9
.type   gcd9,@function
.align 16
gcd9:
	mov    %rsi, %rax
	or     %rdi, %rax
	test   %rsi, %rsi
	je     5f
	test   %rdi, %rdi
	je     5f

	bsf    %rdi, %rcx
	mov    %rax, %r11
	mov    %rsi, %r9
	shr    %cl, %rdi
	movabs $0x5555555555555555, %r8
	neg    %rax
	mov    %rdi, %r10

	# Use the fact that the multiplier and comparison constant are
	# related to avoid a second movabs
	lea    0x1(%r8, %r8), %rdx
	and    %r11, %rax
	mov    %rdx, %r11
	jmp    2f

	.align 8
	# The loop to remove common powers of three
	# Multiply %rax by three
1:	lea    (%rax, %rax, 2), %rax
	mov    %r10, %rdi
	mov    %r9, %rsi

2:	imul   %rdx, %r10
	cmp    %r8, %r10
	ja     4f

	imul   %rdx, %r9
	cmp    %r8, %r9
	jbe    1b

	# Remove powers of three from u
3:	mov    %r10, %rdi
	imul   %rdx, %r10
	cmp    %r8, %r10
	jbe    3b

	# The inner loop
	.align 16

	# Overlap the execution of the multiply and bitscan
4:	imul   %rsi, %r11
	bsf    %rsi, %rcx
	mov    %rdi, %r10
	cmp    %r8, %r11
	cmovbe %r11, %rsi

	# Exact division by two and three commute
	shr    %cl, %rsi

	# Similar code to the Binary Euclidean Algorithm
	mov    %rdx, %r11
	sub    %rsi, %r10
	mov    %rsi, %rcx
	sub    %rdi, %rsi
	cmovc  %r10, %rsi
	cmovc  %rcx, %rdi
	jne    4b
	imul   %rdi, %rax
5:	retq
.size gcd9, .-gcd9
``````

This takes a slightly disappointing 594 cycles per pair, just a little slower than the Binary Euclidean Algorithm. What is going on? It turns out that the inner loop here is actually faster than the binary case; however, the initial removal of powers of three is quite slow. That slow initialization makes all the difference, and the overall timing is worse. Perhaps with larger numbers (which need more inner-loop iterations on average), things would be different.

### Summary

It isn't easy to find the fastest way of doing something. You may need to try many different algorithms, and implementations of those algorithms. In this case, the assembly version of the Binary Euclidean Algorithm was the fastest. It is eight times faster than the naive code we started with, and nearly five times faster than the C implementation of the Euclidean Algorithm described in many textbooks.

Another thing to notice is that compilers are improving over time. The results from the pre-release version of gcc 4.6 used here are particularly good, with the tuned asm version only about 28% faster. If you only write simple asm code, the compiler may well do a better job.
