The Euclidean Algorithm
The task for today is to find the fastest algorithm to calculate the greatest common divisor (GCD) of a pair of numbers. The GCD has a number of properties that allow us to express the GCD of a pair of large numbers in terms of the GCD of a pair of smaller numbers. Applying these identities recursively leads to a solution.
GCD(0, 0) = 0
GCD(x, 0) = x
GCD(x, y) = GCD(y, x)
GCD(x, y) = GCD(x - y, y)
GCD(a⋅x, a⋅y) = a⋅GCD(x, y)
The first of the above identities is a convention: every integer divides zero, so there is no single "greatest" common divisor of zero and zero, and we define it to be zero. This makes some identities involving the least common multiple simpler; for example, GCD(x, y)⋅LCM(x, y) = x⋅y then holds even when both arguments are zero.
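These identities translate almost directly into code. As a quick illustration (this recursive sketch is ours; the iterative versions actually benchmarked appear below):

typedef unsigned long long u64b;

/* A direct recursive transcription of the identities above */
u64b gcd_rec(u64b x, u64b y)
{
    if (y == 0) return x;            /* GCD(x, 0) = x, and GCD(0, 0) = 0 */
    if (x < y) return gcd_rec(y, x); /* GCD(x, y) = GCD(y, x) */
    return gcd_rec(x - y, y);        /* GCD(x, y) = GCD(x - y, y) */
}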
So how should we benchmark an algorithm for this problem? The difficulty is that the GCD of one pair of numbers can be quite different from the GCD of another pair, even if the two pairs are close in magnitude. To be fair, we need to test many pairs of numbers and obtain an overall average speed. Another issue is that generating such pairs of numbers can be quite slow. (We don't want to be benchmarking the speed of the random number generator.)
To solve these problems, we allocate a large array to hold 2^24 pairs of random numbers. Once the array is filled, we can use the rdtsc instruction before and after calculating the GCD of all the pairs. This avoids including the setup time in the benchmark results. Finally, by dividing by the number of pairs, we can find how many cycles it takes per pair, on average, for a given algorithm on this machine.
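A minimal harness along these lines might look like the following. (This sketch is ours rather than the article's actual benchmark code; the gcd prototype and the use of rand() are placeholders.)

#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h> /* __rdtsc() */

typedef unsigned long long u64b;

#define PAIRS (1ULL << 24)

u64b gcd(u64b u, u64b v); /* whichever implementation is being timed */

int main(void)
{
    u64b *data = malloc(2 * PAIRS * sizeof(u64b));
    u64b sink = 0;
    u64b i, start, stop;

    if (!data) return 1;

    /* Fill the array up front, keeping RNG cost out of the timed region */
    for (i = 0; i < 2 * PAIRS; i++)
    {
        data[i] = ((u64b) rand() << 32) ^ (u64b) rand();
    }

    start = __rdtsc();
    for (i = 0; i < PAIRS; i++)
    {
        /* Accumulate the results so the calls can't be optimized away */
        sink += gcd(data[2 * i], data[2 * i + 1]);
    }
    stop = __rdtsc();

    printf("%.1f cycles/pair (checksum %llu)\n",
        (double) (stop - start) / PAIRS, sink);

    free(data);
    return 0;
}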
Euclidean Algorithm
Since the algorithm above is tail-recursive, we can re-express it in an iterative form:
typedef unsigned long long u64b;

u64b gcd0(u64b u, u64b v)
{
    /* Special case when input is zero */
    if (!u) return v;

    /* Repeatedly subtract off the smaller from the larger */
    while (v)
    {
        if (u > v)
        {
            u -= v;
        }
        else
        {
            v -= u;
        }
    }

    return u;
}
This subtractive algorithm takes 4700 cycles on average for random pairs of 64-bit numbers.
The above can be substantially improved. If we notice that repeated subtraction is the same as determining the remainder of a division, we can use a modulus operator instead. This is the classical Euclidean algorithm.
GCD(x, y) = GCD(x mod y, y)
In C, this becomes:
u64b gcd1(u64b u, u64b v)
{
    if (!u || !v) return u | v;

    /* Repeatedly replace the larger by its remainder modulo the smaller */
    while (v)
    {
        u64b t = v;
        v = u % v;
        u = t;
    }

    return u;
}
The above takes about 2762 cycles per pair, roughly 1.7 times as fast as the simple subtraction method.
Binary Euclidean Algorithm
Unfortunately, on modern machines the division instruction is quite slow, and this greatly hinders the speed of the above code. To do better, we can explore some additional identities for GCDs:
GCD(2x, 2y) = 2⋅GCD(x, y)
GCD(2x, 2y+1) = GCD(x, 2y+1)
GCD(2x+1, 2y+1) = GCD(2x+1, y - x)
That is, the GCD of two even numbers is twice the GCD of half of those numbers. The GCD of an even and an odd number may be simplified by noticing that the factor of two cannot divide the odd number. Finally, the GCD of two odd numbers can be simplified by subtracting the smaller from the larger to yield an even number, which can in turn be simplified by removing the extra powers of two. Solving for a GCD this way is known as the "Binary Euclidean Algorithm".
u64b gcd2(u64b u, u64b v)
{
    int shift;

    if (!u || !v) return u | v;

    /* Get common factors of two */
    for (shift = 0; !((u | v) & 1); shift++)
    {
        u /= 2;
        v /= 2;
    }

    /* Make u odd */
    while (!(u & 1)) u /= 2;

    /* From here on, u is always odd. */
    do
    {
        /* Make v odd */
        while (!(v & 1)) v /= 2;

        /* Subtract the smaller from the larger */
        if (u < v)
        {
            v -= u;
        }
        else
        {
            u64b t = u - v;
            u = v;
            v = t;
        }
    } while (v);

    return u << shift;
}
The above takes an average of 1450 cycles, which is a nice speed-up.
Now the above can be improved a little more by using the bsf instruction. This allows us to obtain all the powers of two in one step, rather than using a loop. gcc exposes this with the __builtin_ctzll() function. Wrapping that in an inline function for easier typing, and using it leads to:
static inline u64b ffsq(u64b x)
{
    return __builtin_ctzll(x);
}

u64b gcd3(u64b u, u64b v)
{
    int shift;

    if (!u || !v) return u | v;

    /* Count the common power of two, and make u odd */
    shift = ffsq(u | v);
    u >>= ffsq(u);

    while (v)
    {
        /* Make v odd */
        v >>= ffsq(v);

        /* Subtract the smaller from the larger */
        if (u < v)
        {
            v -= u;
        }
        else
        {
            u64b t = u - v;
            u = v;
            v = t;
        }
    }

    return u << shift;
}
This is yet again a very nice improvement, taking 735 cycles per pair.
Base six Euclidean Algorithm
The reason the Binary Euclidean Algorithm works so well is that two is the most common prime factor, and computers are very good at manipulating numbers in base two (binary). The next most common prime factor is three. It would be nice if we could extend the binary algorithm to use it as well. Fortunately, testing whether a number is a multiple of three is relatively easy.
If 0xaaaaaaaaaaaaaaab * x <= 0x5555555555555555, then the 64-bit unsigned integer x is a multiple of three. To see why, note that 0xaaaaaaaaaaaaaaab is the multiplicative inverse of three mod 2^64. When multiplied by a multiple of three, it produces that multiple divided by three. Since (2^64 - 1) / 3 = 0x5555555555555555 is the largest possible such result, any larger product must have come from a non-multiple of three.
We can use a similar trick to test whether a number is an exact multiple of five: 0xcccccccccccccccd * x <= 0x3333333333333333. However, multiples of five are rare enough that this probably isn't worth the cost of the multiplication instruction.
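For reference, both tests are easy to check in C (the helper names here are ours):

#include <assert.h>

typedef unsigned long long u64b;

/* 0xaaaaaaaaaaaaaaab is the inverse of three mod 2^64, so for a multiple
   of three the product is exactly x / 3, which is at most (2^64 - 1) / 3.
   Any non-multiple wraps around to a larger value. */
static int is_multiple_of_3(u64b x)
{
    return x * 0xaaaaaaaaaaaaaaabULL <= 0x5555555555555555ULL;
}

/* The same trick, using the inverse of five mod 2^64 */
static int is_multiple_of_5(u64b x)
{
    return x * 0xcccccccccccccccdULL <= 0x3333333333333333ULL;
}

int main(void)
{
    assert(is_multiple_of_3(51) && !is_multiple_of_3(52));
    assert(is_multiple_of_5(35) && !is_multiple_of_5(36));
    return 0;
}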
The idea here is that since the bsf instruction is slow, we can overlap the execution of the multiply for the multiple-of-three test with it. This should yield an algorithm with an inner loop that is faster than the Binary Euclidean Algorithm because it will shrink powers of two as well as three.
However, there is one other technicality we need to think about. The above doesn't check for multiples of 9, 27, 81 and so on, only a single factor of three. If we wish to "shrink" higher powers, we need to repeat the multiply-and-compare test. That is unfortunately slow, as the repeats will not overlap with the bsf instruction. Thus we will only reduce by a single power of three at each step.
u64b gcd4(u64b u, u64b v)
{
    int s1, s2;
    u64b p = 1;
    u64b c3 = 0x5555555555555555;
    u64b m3 = 0xaaaaaaaaaaaaaaab;
    u64b t1, t2;

    if (!u || !v) return u | v;

    /* Remove all factors of two from u, recording the common power in s1 */
    s1 = ffsq(u);
    s2 = ffsq(v);
    u >>= s1;
    if (s1 > s2) s1 = s2;

    /* Remove common factors of three, recording them in p */
    t1 = u * m3;
    t2 = v * m3;
    while ((t1 <= c3) && (t2 <= c3))
    {
        p *= 3;
        u = t1;
        v = t2;
        t1 = u * m3;
        t2 = v * m3;
    }

    /* Strip any remaining factors of three from v */
    while (t2 <= c3)
    {
        v = t2;
        t2 = v * m3;
    }

    /* ... and from u */
    while (t1 <= c3)
    {
        u = t1;
        t1 = u * m3;
    }

    while (v)
    {
        int t3 = ffsq(v);

        /* Try to shrink a power of three */
        t1 = v * m3;
        if (t1 <= c3) v = t1;

        /* Now remove all powers of two */
        v >>= t3;

        /* Binary Euclidean Algorithm part */
        if (u < v)
        {
            v -= u;
        }
        else
        {
            u64b t = u - v;
            u = v;
            v = t;
        }
    }

    return (u << s1) * p;
}
The results are unfortunately disappointing: 848 cycles per pair, slower than the Binary Euclidean Algorithm. However, looking at the generated assembly language, it is easy to see why. The large 64-bit constants seem to confuse gcc. Instead of keeping them in registers, it repeatedly uses the slow and bulky movabs instruction to reload them. It also isn't particularly good at optimizing the critical inner loop. To do better, we'll need to move from C to asm.
Using ASM
The Binary Euclidean Algorithm is relatively easy to convert into assembly language. However, there are a few optimizations we can make. The first is to notice that we end up shifting the result left by ffs(u | v). If we instead know the value of the lowest set bit itself, we can multiply by it. Fortunately, there is a bit-hack, x & (-x), that extracts exactly that value, letting us avoid a slow bsf instruction.
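In C, the bit-hack looks like this (a quick illustration only; the helper name is ours):

typedef unsigned long long u64b;

/* In two's complement, -x shares the trailing zeros and lowest set bit
   of x, while all higher bits differ. So x & (-x) evaluates to
   1 << ffs(x), the value of the lowest set bit. */
u64b lowest_bit(u64b x)
{
    return x & (-x); /* e.g. lowest_bit(12) == 4 */
}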
Another trick is to notice that the comparison used to decide whether to subtract u from v or v from u sets the same flags as the subtraction itself. Thus, we can pretend that the subtraction went the "right" way, and fix things up afterwards if it didn't. Since the fix-up only takes two instructions, this is quite cheap. The resulting code is:
.globl gcd5
.type gcd5,@function
.align 16
gcd5:
        mov %rsi, %rax
        or %rdi, %rax
        # Check for zero arguments
        test %rsi, %rsi
        je 2f
        test %rdi, %rdi
        je 2f
        # Start to calculate 1 << ffs(u | v)
        mov %rax, %rdx
        # Start to calculate u >> ffs(u)
        bsf %rdi, %rcx
        neg %rax
        shr %cl, %rdi
        and %rdx, %rax
        # The inner loop
.align 16
1:      bsf %rsi, %rcx
        shr %cl, %rsi
        # Case 1, u < v
        sub %rdi, %rsi
        ja 1b
        # Otherwise, we need to do more calculation
        add %rsi, %rdi
        # In addition to correcting the sign of v, this also tests v against zero
        neg %rsi
        jne 1b
        # Done, multiply by the shift factor
        imul %rdi, %rax
2:      retq
.size gcd5, .-gcd5
This takes 903 cycles per pair, slower than both of the C algorithms above! What is going on? It turns out that although the assembly code is very compact, it suffers from a fatal flaw: the jump in the inner loop is unpredictable. The resulting mispredictions cause a large slowdown. gcc is able to generate branchless code there, and so performs better.
So how shall we remove the conditional jump? One obvious trick is to use the sbb instruction to create a mask. We can make a conditional add by a logical-and with the mask followed by an add. We can make a conditional negation by xoring with the mask and then subtracting it. (This is how a branchless abs function is often implemented.)
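Expressed in C, the mask trick looks like the following sketch (the helper name and pointer interface are ours, and a compiler may or may not choose sbb for the mask):

typedef unsigned long long u64b;

/* One branchless step: maps (u, v) to (u, v - u) when v >= u,
   and to (v, u - v) when v < u */
static void gcd_step(u64b *u, u64b *v)
{
    u64b d = *v - *u;             /* like "sub %rdi, %rsi" */
    u64b mask = -(u64b)(*v < *u); /* like "sbb %rdx, %rdx": all-ones on borrow */

    *u += d & mask;               /* conditional add: u becomes the old v */
    *v = (d ^ mask) - mask;       /* conditional negation: v becomes |v - u| */
}

Doing the same in assembly, we have: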
.globl gcd6
.type gcd6,@function
.align 16
gcd6:
        mov %rsi, %rax
        or %rdi, %rax
        test %rsi, %rsi
        je 2f
        test %rdi, %rdi
        je 2f
        mov %rax, %rdx
        bsf %rdi, %rcx
        neg %rax
        shr %cl, %rdi
        and %rdx, %rax
        # The inner loop
.align 16
1:      bsf %rsi, %rcx
        shr %cl, %rsi
        sub %rdi, %rsi
        # Create the mask
        sbb %rdx, %rdx
        # Duplicate the mask
        mov %rdx, %rcx
        # Start of conditional add
        and %rsi, %rdx
        # Start of conditional negation
        xor %rcx, %rsi
        # Complete conditional add
        add %rdx, %rdi
        # Complete conditional negation
        sub %rcx, %rsi
        jne 1b
        imul %rdi, %rax
2:      retq
.size gcd6, .-gcd6
This takes 753 cycles per pair. Faster, but still not good enough to beat the C version. Perhaps instead of being quite so tricky, we should use conditional move instructions. By selecting the maximum and minimum, we can do the subtraction afterwards. This means we have both a compare and a subtraction... but it may still be faster. The resulting code looks like:
.globl gcd7
.type gcd7,@function
.align 16
gcd7:
        mov %rsi, %rax
        or %rdi, %rax
        test %rsi, %rsi
        je 2f
        test %rdi, %rdi
        je 2f
        mov %rax, %rdx
        bsf %rdi, %rcx
        neg %rax
        shr %cl, %rdi
        and %rdx, %rax
.align 16
1:      bsf %rsi, %rcx
        # Save u for later
        mov %rdi, %rdx
        shr %cl, %rsi
        # Compare u and v
        cmp %rsi, %rdi
        # Get the minimum in %rdi, and the maximum in %rsi
        cmova %rsi, %rdi
        cmova %rdx, %rsi
        # Do the subtraction
        sub %rdi, %rsi
        jne 1b
        imul %rdi, %rax
2:      retq
.size gcd7, .-gcd7
This takes 663 cycles per pair, and finally we are faster than C. However, with a little more thinking, we may be able to do even better by rearranging the order of operations. We subtract twice, computing u - v and v - u, and then pick the needed results via conditional moves. This removes the cmp instruction and exposes more instruction-level parallelism:
.globl gcd8
.type gcd8,@function
.align 16
gcd8:
        mov %rsi, %rax
        or %rdi, %rax
        test %rsi, %rsi
        je 2f
        test %rdi, %rdi
        je 2f
        mov %rax, %rdx
        bsf %rdi, %rcx
        neg %rax
        shr %cl, %rdi
        and %rdx, %rax
        # Inner loop combining conditional moves and comparison-removal
.align 16
1:      bsf %rsi, %rcx
        mov %rdi, %rdx
        shr %cl, %rsi
        sub %rsi, %rdx
        mov %rsi, %rcx
        sub %rdi, %rsi
        cmovc %rdx, %rsi
        cmovc %rcx, %rdi
        jne 1b
        imul %rdi, %rax
2:      retq
.size gcd8, .-gcd8
The above code takes 574 cycles per pair, 20% faster than the C method. Unfortunately, it looks rather difficult to improve things further. One obvious trick is to use the fact that the test of %rdi against zero and the zero-flag result of the bsf instruction are redundant. However, removing the extra test doesn't result in any speedup. The reason appears to be the 16-byte alignment of the inner loop: by coincidence, the length of the initialization code is already a multiple of 16 bytes, so the alignment directive currently does nothing. If we start removing instructions, the assembler will need to add a nop, counteracting any speedup we had hoped to get.
Now that we have the best implementation of the Binary Euclidean Algorithm we could find, we can use the same tricks to speed up the base six algorithm. The resulting code looks like:
.globl gcd9
.type gcd9,@function
.align 16
gcd9:
        mov %rsi, %rax
        or %rdi, %rax
        test %rsi, %rsi
        je 5f
        test %rdi, %rdi
        je 5f
        bsf %rdi, %rcx
        mov %rax, %r11
        mov %rsi, %r9
        shr %cl, %rdi
        movabs $0x5555555555555555, %r8
        neg %rax
        mov %rdi, %r10
        # Use the fact that the multiplier and comparison are related to avoid movabs
        lea 0x1(%r8, %r8), %rdx
        and %r11, %rax
        mov %rdx, %r11
        jmp 2f
.align 8
        # The loop to remove common powers of three
        # Multiply %rax by three
1:      lea (%rax, %rax, 2), %rax
        mov %r10, %rdi
        mov %r9, %rsi
2:      imul %rdx, %r10
        cmp %r8, %r10
        ja 4f
        imul %rdx, %r9
        cmp %r8, %r9
        jbe 1b
        # Remove powers of three from u
3:      mov %r10, %rdi
        imul %rdx, %r10
        cmp %r8, %r10
        jbe 3b
        # The inner loop
.align 16
        # Overlap the execution of the multiply and bitscan
4:      imul %rsi, %r11
        bsf %rsi, %rcx
        mov %rdi, %r10
        cmp %r8, %r11
        cmovbe %r11, %rsi
        # Exact division by two and three commute
        shr %cl, %rsi
        # Similar code to the Binary Euclidean Algorithm
        mov %rdx, %r11
        sub %rsi, %r10
        mov %rsi, %rcx
        sub %rdi, %rsi
        cmovc %r10, %rsi
        cmovc %rcx, %rdi
        jne 4b
        imul %rdi, %rax
5:      retq
.size gcd9, .-gcd9
This takes a slightly disappointing 594 cycles per pair, a little slower than the Binary Euclidean Algorithm. What is going on? It turns out that the inner loop here is actually faster than in the binary case; however, the initial removal of powers of three is quite slow. That slow initialization makes all the difference, and the overall timing is worse. Perhaps with larger numbers (which need more inner-loop iterations on average), things would be different.
Summary
It isn't easy to find the fastest way of doing something. You may need to try many different algorithms, and implementations of those algorithms. In this case, the assembly version of the Binary Euclidean Algorithm was the fastest. It is eight times faster than the naive code we started with, and nearly five times faster than the C implementation of the Euclidean Algorithm described in many textbooks.
Another thing to notice is that compilers are improving over time. The results from the pre-release version of gcc 4.6 used here are particularly good, with the tuned asm version only being about 28% faster than the best C version. If you only write simple asm code, the compiler may well do a better job.