## The Euclidean AlgorithmThe task for today is to find the fastest algorithm to calculate the greatest common divisor (GCD) of a pair of numbers. The GCD has a number of properties that allow us to express the GCD of a pair of larger numbers as the GCD of smaller numbers. By using recursion, this leads to a solution.
The first of the above identities follows from convention, where zero divides zero zero times (instead of one time). This makes some identities with the least-common-multiple simpler. So how should we benchmark an algorithm for this problem? The difficulty is that the GCD of of a pair of two numbers can be quite different to the GCD of another pair, even if the two pairs are quite close in magnitude. To be fair, we need to test many pairs of numbers, and obtain an overall average speed. Another issue is that generating such pairs of numbers can be quite slow. (We don't want to be benchmarking the speed of the random number generator.) To solve these problems, we allocate a large array to hold 2 ## Euclidean AlgorithmSince the algorithm above is tail-recursive, we can re-express it in an iterative form:
This subtractive algorithm takes 4700 cycles on average for random pairs of 64 bit numbers. The above can be substantially improved. If we notice that repeated subtraction is the same as determining the remainder of a division, we can use a modulus operator instead. This is the classical Euclidean algorithm.
In C, this becomes:
The above takes about 2762 cycles per pair. This approaches being twice as fast as the simple subtraction method. ## Binary Euclidean AlgorithmUnfortunately, on modern machines, the division instruction is quite slow. This greatly hinders the speed of the above code. To do better, we can explore some additional identities for GCD's:
i.e. the GCD of two even numbers is twice the GCD of half of those numbers. The GCD of an even and an odd number may be simplified by noticing that the factor of two cannot be a divisor of the odd number. Finally, the GCD of two odd numbers can be simplified by subtracting the smaller from the larger to yield an even number, which can then in turn be simplified by removing the extra powers of two. Solving for a GCD this way is known as the "Binary Euclidean Algorithm".
The above takes an average of 1450 cycles, which is a nice speed-up. Now the above can be improved a little more by using the
This is yet again a very nice improvement, taking 735 cycles per pair. ## Base six Euclidean AlgorithmThe reason the Binary Euclidean Algorithm works so well is that two is a common prime factor of numbers, and computers are very good at manipulating numbers in base two (binary). The next-most common prime factor is three. It would be nice if we could extend the binary algorithm to use it as well. Fortunately, testing to see if a number is a multiple of three is relatively easy. If We can use a similar trick to test to see if a number is an exact multiple of five: The idea here is that since the However, there is one other technicality we need to think about. The above doesn't check for multiples of 9, 27, 81 etc., only single multiples of three. If we wish to "shrink" higher powers, we need to repeat the multiply and comparison test. This is unfortunately slow, as then they will not overlap with the
The results are unfortunately disappointing, taking 848 cycles per pair, slower than the Binary Euclidean Algorithm. However, looking at the generated assembly language, it is easy to see why. The use of large 64bit constants seems to confuse gcc. Instead of storing them in registers, it repeatedly uses the slow and bulky ## Using ASMThe Binary Euclidean Algorithm is relatively easy to convert into assembly language. However, there are a few optimizations we can make. The first is to notice that we end up shifting the result by Another trick is to notice that the comparison used to determine whether to subtract u from v or v from u sets the same flags as a subtraction. Thus, we can pretend that the subtraction does things the "right" way, and then fix things up later if it doesn't. Since the fix-up only takes two instructions, this is quite nice. The resulting code is:
This takes 903 cycles per pair, slower than both the C algorithms above! What is going on? It turns out although the assembly code is very compact, it suffers from a fatal flaw. The jump in the inner loop is unpredictable. The resulting miss-predictions cause a large slowdown. gcc is able to generate branchless code there, so performs better. So how shall we remove the conditional jump? On obvious trick is to use the
This takes 753 cycles per pair. Faster, but still not good enough to beat the C version. Perhaps instead of being quite so tricky, we should use conditional move instructions. By finding the maximum and minimum we can then do the subtraction afterwards. This means we have a compare and a subtraction... but this may be faster. The resulting code looks like:
This takes 663 cycles per pair, and finally we are faster than C. However, with a little more thinking, we may be able to do even better by rearranging the order of the operations. We do this by subtracting twice:
The above code takes 574 cycles per pair, 20% faster than the C method. Unfortunately, it looks to be rather difficult to improve things further. One obvious trick is to use the fact that the test for zero of Now that we have the best implementation of the Binary Euclidean Algorithm we could find, we can use the same tricks to speed up the base six algorithm. The resulting code looks like:
This takes a slightly disappointing 594 cycles per pair, just a little bit slower than the Binary Euclidean Algorithm. What is going on? It turns out that the inner loop here is actually faster than the binary case. However, the initial removal of powers of three is quite slow. The slow initialization makes all the difference, and the timings are worse. Perhaps with larger numbers (which need more inner loop iterations on average), things would be different. ## SummaryIt isn't easy to find the fastest way of doing something. You may need to try many different algorithms, and implementations of those algorithms. In this case, the assembly version of the Binary Euclidean Algorithm was the fastest. It is eight times faster than the naive code we started with, and nearly five times faster than the C implementation of the Euclidean Algorithm described in many textbooks. Another thing to notice is that compilers are improving over time. The results from the pre-release version of gcc 4.6 used here are particularly good, with the tuned asm version only being about 28% faster. If you try to use simple asm code, then the compiler may do a better job. |

About Us | Returns Policy | Privacy Policy | Send us Feedback |

Company Info |
Product Index |
Category Index |
Help |
Terms of Use
Copyright © Lockless Inc All Rights Reserved. |

## Comments

said...How do the ideas presented here relate to finding the quotient ((int)i/j) and remainder (i%j) of integers?