There is an interesting article at google research that discusses the fact that many implementations of the binary search algorithm are incorrect. The example given is that in Programming Pearls translated into java. Reproducing the original code in the book, together with some explanatory comments:
The article notes that this is typically how a binary search is implemented, and that it has a subtle bug. The problematic line is
Obviously we need to fix the overflow. One given solution is to replace the line with
Whether it does this or not is a function of how smart your compiler is. The smarter the compiler, the better it is at finding optimizations like this that undo your hard work of trying to get numerically correct answers. By rearranging further, it may be possible to confuse a compiler enough in order to give the correct answer. i.e. if we move the
Another solution from the article is to use the substitution
A solution given in the comments below the main article does finally give the right answer. It uses the fact that the addition operation can be split up bitwise. The carry between bits can be obtained by the logical-and of the two summands. The resultant bit can be obtained by the exclusive or. Thus we have
Are we done? Unfortunately, the above misses the forest for the trees. We are implementing a binary search algorithm. What does this imply? Firstly, the array we are searching through must fit in memory. Secondly it must make sense for us to be searching using this algorithm. The first implication means that the size of the array is constrained. If we have a 64bit machine, then the maximal address space for user-space (or kernel-space) is 63bits, as the address space is divided between them. Since a program must be running, and it itself takes up room, then less than 63bits of address space must be devoted to the array. (This is ignoring the fact that implementations of the 64bit architecture actually only use 48 address bits currently - we might as well be a little future-proof.) This means that no matter what happens, the overflow error doesn't exist on the 64bit case provided we use a 64bit integer.
The 32bit case is harder. It is possible to have a 3GiB/1GiB or even 4GiB/4GiB user to kernel-space split there, so we need to think a little more. Looking at the code, we know that
The argument for short integers is a little more precarious. We can have 231 of those in our address space, and this may cause an overflow if 231+231 is calculated. However, to get this large a number of short integers requires our program implementing the binary search to take up zero space. If our program takes up a grand total of two bytes (or more, obviously), then the overflow cannot happen. So pragmatically, we can ignore the problem here as well.
This leaves the final case, we are doing a binary search on individual bytes. The above address-space arguments fail for this case, there are indeed realistic hardware setups which could allow an overflow condition. However, we still have the second implication above which we haven't used until now. Why are we using a binary search? We are trying to find the location of an element with a given value in a sorted array. Can you imagine doing this on 4GiB of sorted bytes? The whole operation doesn't make sense. The search can easily be replaced by a 256-entry table with the cumulative sum of the number bytes with a given value. The algorithm can then be implemented by a trivial table-lookup.
So what is the correct solution? The easiest solution is to simply replace the type of
The article has been used by the proponents of higher-level languages to say that C is fundamentally unsound, and that their pet language could never suffer this problem. Unfortunately for them, I think their argument fails . As has been shown, the only actual case where the bug appears, is where the algorithm is so obviously inefficient that it would be replaced with something better. If you are using a low-level language like C, then speed tends to be one of the things you care about otherwise you wouldn't be using it. Thus the C programmer tends to be on the lookout for silliness like this. However, taken from the high-level language point of view, the lack of fixed-sized integers does prevent some classes of bugs. Unfortunately, it does this at a cost of an order of magnitude or more of performance.
The extra speed of languages like C comes at a price. The programmer simply needs to think about the underlying architecture much more often.
Company Info |
Product Index |
Category Index |
Copyright © Lockless Inc All Rights Reserved.