## Interval Arithmetic

In some cases one needs absolute confidence in numerical calculations. The problem is that the input data may be uncertain, and the resulting error can grow until it dominates the results. To find out whether this is happening, one can use interval arithmetic to model how the range of possible values changes throughout the calculation. Instead of holding a single floating point number in a variable, we now use two: the minimum possible value and the maximum possible value given the "error". Thus we have for example
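As a small illustration (the numbers are chosen here purely for the example), two quantities measured as $1.0 \pm 0.1$ and $2.0 \pm 0.1$ become intervals, and sums and differences combine the bounds so that every possible true value stays inside:

$$
x = [0.9,\ 1.1], \qquad y = [1.9,\ 2.1]
$$

$$
x + y = [0.9 + 1.9,\ 1.1 + 2.1] = [2.8,\ 3.2], \qquad x - y = [0.9 - 2.1,\ 1.1 - 1.9] = [-1.2,\ -0.8]
$$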
Another example where this can be useful is numerically finding the extrema of a function. Using interval arithmetic, we can obtain limits on the possible range of the function within a given segment. We can then recursively subdivide the segments we are interested in to find the minimum or maximum. Things proceed even faster if we also know the function's derivative. We can require that the derivative is zero somewhere within the segment. By rejecting segments where this isn't the case, and segments which are known to be either too high or too low in value compared to the best ones, we can home in on the solution. For multidimensional minimization problems of "difficult" analytic functions, this method can be among the best.

The problem with interval arithmetic is that it is guaranteed to give the best limits only when each variable is used once in the formula being calculated (i.e. the assumption is that there is no correlation between inputs). If one calculates something like

## Interval Floating Point

Because intervals are pairs of numbers, it is extremely convenient to pass them around in SSE registers, which can hold vectors of two double precision floats. We will use the low entry as the minimum, and the high entry as the maximum of the interval. Code to create such an encoding, and then access the component parts is:
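A minimal sketch of such an encoding follows. The type and helper names (`interval`, `interval_make`, `interval_min`, `interval_max`) are illustrative assumptions, not the article's own identifiers; lane indexing relies on the GCC/Clang vector extension.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics and the __m128d type */

/* Low lane = minimum, high lane = maximum. */
typedef __m128d interval;

/* Build an interval from its bounds (hypothetical helper name). */
static inline interval interval_make(double min, double max)
{
    assert(min <= max);
    return _mm_set_pd(max, min);   /* _mm_set_pd takes the high lane first */
}

/* Lane access by indexing: a GCC >= 4.6 (and Clang) vector extension. */
static inline double interval_min(interval x) { return x[0]; }
static inline double interval_max(interval x) { return x[1]; }
```

For example, `interval_make(0.9, 1.1)` holds 0.9 in the low lane and 1.1 in the high lane.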
In the above we use a new extension in gcc 4.6 to access the high and low parts via indexing. Older versions of gcc require using a union and writing to an array of two doubles. This is cumbersome, and extra slow when optimization is turned off.

Adding two numbers in interval form then isn't too difficult. We extract the minimums and maximums, and then add them. The trick here is rounding. By default, the computer will round to the nearest value, with ties going to even ("round to even"). This isn't what we want, as half the time the rounding will be in the wrong direction. We need to make sure that the rounding always increases the size of the region, otherwise points inside can "escape", and we will have incorrect results. Thus we will need to fiddle with the SSE hardware's rounding modes. Firstly, we need to be able to save and restore the state, so that when we are done with interval arithmetic we can return things to the way they were. (Round to even is much more accurate for normal floating point, as errors do not accumulate steadily in one direction and tend to cancel out instead.)
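One way to save, change and restore the SSE rounding state is through the MXCSR register, using the standard `_mm_getcsr`/`_mm_setcsr` intrinsics. The wrapper names below are hypothetical, and this is a sketch rather than the article's own routines.

```c
#include <assert.h>
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr, _MM_SET_ROUNDING_MODE */

/* Save the whole MXCSR control/status register. */
static inline unsigned int rounding_save(void)
{
    return _mm_getcsr();
}

/* Restore a previously saved MXCSR value. */
static inline void rounding_restore(unsigned int saved)
{
    _mm_setcsr(saved);
}

/* Select round-toward-minus-infinity for subsequent SSE arithmetic. */
static inline void rounding_down(void)
{
    _MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
}

/* Select round-toward-plus-infinity for subsequent SSE arithmetic. */
static inline void rounding_up(void)
{
    _MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
}
```

Note that compilers may fold or move floating point operations across these calls; with GCC, `-frounding-math` is the usual way to ask it not to assume the default rounding mode.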
Now, we will need to make sure the rounding is in the right direction for each addition. The maximums will need to round towards positive infinity, and the minimums to negative infinity:
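A sketch of addition with directed rounding, assuming x86-64 (where scalar `double` arithmetic uses SSE and so obeys MXCSR). The helper `add_rounded` and its use of `volatile` are choices made for this sketch so that the compiler cannot fold or move the sum across the mode changes.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2; also provides _MM_SET_ROUNDING_MODE */

typedef __m128d interval;   /* low lane = minimum, high lane = maximum */

/* Add two doubles under the given MXCSR rounding mode, then put the old
   mode back. The volatile temporaries stop the compiler from folding the
   sum at compile time or moving it across the mode changes. */
static inline double add_rounded(double a, double b, unsigned int mode)
{
    unsigned int old = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(mode);
    volatile double va = a, vb = b;
    volatile double sum = va + vb;
    _MM_SET_ROUNDING_MODE(old);
    return sum;
}

/* [a,b] + [c,d] = [a+c rounded down, b+d rounded up] */
static inline interval interval_add(interval x, interval y)
{
    double lo = add_rounded(x[0], y[0], _MM_ROUND_DOWN);
    double hi = add_rounded(x[1], y[1], _MM_ROUND_UP);
    assert(lo <= hi);   /* holds for well-formed, non-NaN inputs */
    return _mm_set_pd(hi, lo);
}
```

Adding the point intervals for 0.1 and 0.2 gives a result whose bounds differ by one unit in the last place, because the exact sum is not representable.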
Subtraction works similarly:
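Subtraction pairs each bound with the opposite bound of the second operand. The sketch below uses the same assumptions as before: x86-64, a hypothetical `sub_rounded` helper, and `volatile` temporaries to pin the operation between the mode changes.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2; also provides _MM_SET_ROUNDING_MODE */

typedef __m128d interval;   /* low lane = minimum, high lane = maximum */

/* Subtract under the given MXCSR rounding mode, then restore the old mode. */
static inline double sub_rounded(double a, double b, unsigned int mode)
{
    unsigned int old = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(mode);
    volatile double va = a, vb = b;
    volatile double diff = va - vb;
    _MM_SET_ROUNDING_MODE(old);
    return diff;
}

/* [a,b] - [c,d] = [a-d rounded down, b-c rounded up] */
static inline interval interval_sub(interval x, interval y)
{
    double lo = sub_rounded(x[0], y[1], _MM_ROUND_DOWN);
    double hi = sub_rounded(x[1], y[0], _MM_ROUND_UP);
    assert(lo <= hi);   /* holds for well-formed, non-NaN inputs */
    return _mm_set_pd(hi, lo);
}
```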
Multiplication is quite a bit more complex. Fortunately, we don't need to do the minimum and maximum of eight different multiplies. By investigating whether or not the intervals overlap zero, or are always positive or negative we can break the problem down into cases where a maximum of two multiplies are needed, and where usually only one is required. The code for this looks like:
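The case analysis can be sketched as follows. This is one standard sign-based breakdown, not necessarily the article's exact listing; the helper names are hypothetical, and the sketch ignores the 0 × ∞ corner case, which would produce NaN.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2; also provides _MM_SET_ROUNDING_MODE */

typedef __m128d interval;   /* low lane = minimum, high lane = maximum */

/* Multiply under the given MXCSR rounding mode, then restore the old mode. */
static inline double mul_rounded(double a, double b, unsigned int mode)
{
    unsigned int old = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(mode);
    volatile double va = a, vb = b;
    volatile double prod = va * vb;
    _MM_SET_ROUNDING_MODE(old);
    return prod;
}

static inline double mul_dn(double a, double b) { return mul_rounded(a, b, _MM_ROUND_DOWN); }
static inline double mul_up(double a, double b) { return mul_rounded(a, b, _MM_ROUND_UP); }

static inline interval iv(double lo, double hi)
{
    assert(lo <= hi);   /* holds for finite, well-formed inputs */
    return _mm_set_pd(hi, lo);
}

/* [a,b] * [c,d], choosing the bounding products from the signs. */
static inline interval interval_mul(interval x, interval y)
{
    double a = x[0], b = x[1], c = y[0], d = y[1];

    if (a >= 0.0) {                                        /* x >= 0 */
        if (c >= 0.0) return iv(mul_dn(a, c), mul_up(b, d));
        if (d <= 0.0) return iv(mul_dn(b, c), mul_up(a, d));
        return iv(mul_dn(b, c), mul_up(b, d));             /* y overlaps 0 */
    }
    if (b <= 0.0) {                                        /* x <= 0 */
        if (c >= 0.0) return iv(mul_dn(a, d), mul_up(b, c));
        if (d <= 0.0) return iv(mul_dn(b, d), mul_up(a, c));
        return iv(mul_dn(a, d), mul_up(a, c));             /* y overlaps 0 */
    }
    /* x overlaps 0 */
    if (c >= 0.0) return iv(mul_dn(a, d), mul_up(b, d));
    if (d <= 0.0) return iv(mul_dn(b, c), mul_up(a, c));

    /* Both overlap 0: the only case needing two products per bound. */
    double lo1 = mul_dn(a, d), lo2 = mul_dn(b, c);
    double hi1 = mul_up(a, c), hi2 = mul_up(b, d);
    return iv(lo1 < lo2 ? lo1 : lo2, hi1 > hi2 ? hi1 : hi2);
}
```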
Division is much more complex, as we need to worry about division by zero. Fortunately, there is an easier way. We can break the problem down into obtaining the reciprocal, and then multiplying that reciprocal by the dividend. The extra work isn't too much compared to how slow a division is, and saves quite a bit of code:
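A sketch of the reciprocal step, under the same x86-64 and naming assumptions as the earlier sketches. Returning the whole real line when the interval contains zero is a simplifying choice made here; the article's own handling of that case may differ.

```c
#include <assert.h>
#include <math.h>        /* INFINITY macro */
#include <emmintrin.h>   /* SSE2; also provides _MM_SET_ROUNDING_MODE */

typedef __m128d interval;   /* low lane = minimum, high lane = maximum */

/* Divide under the given MXCSR rounding mode, then restore the old mode. */
static inline double div_rounded(double a, double b, unsigned int mode)
{
    unsigned int old = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(mode);
    volatile double va = a, vb = b;
    volatile double q = va / vb;
    _MM_SET_ROUNDING_MODE(old);
    return q;
}

/* 1/[lo,hi] = [1/hi rounded down, 1/lo rounded up] when 0 is excluded. */
static inline interval interval_recip(interval x)
{
    double lo = x[0], hi = x[1];
    assert(lo <= hi);

    if (lo <= 0.0 && hi >= 0.0)   /* contains zero: no finite bound exists */
        return _mm_set_pd(INFINITY, -INFINITY);

    return _mm_set_pd(div_rounded(1.0, lo, _MM_ROUND_UP),
                      div_rounded(1.0, hi, _MM_ROUND_DOWN));
}
```

Division is then the product of the dividend with `interval_recip(divisor)`, using whatever interval multiply routine is in use.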
Finally, we can obtain square roots in the obvious manner:
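A sketch of the square root with per-bound rounding, assuming a non-negative minimum. It uses the SSE2 `sqrtpd` intrinsic rather than the C library `sqrt`, which is a choice made for this sketch; the helper names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2; also provides _MM_SET_ROUNDING_MODE */

typedef __m128d interval;   /* low lane = minimum, high lane = maximum */

/* Square root under the given MXCSR rounding mode, then restore the mode.
   The volatile temporaries pin the operation between the mode changes. */
static inline double sqrt_rounded(double x, unsigned int mode)
{
    unsigned int old = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(mode);
    volatile double vx = x;
    volatile double r = _mm_cvtsd_f64(_mm_sqrt_pd(_mm_set_sd(vx)));
    _MM_SET_ROUNDING_MODE(old);
    return r;
}

/* sqrt([a,b]) = [sqrt(a) rounded down, sqrt(b) rounded up], for a >= 0. */
static inline interval interval_sqrt(interval x)
{
    assert(x[0] >= 0.0 && x[0] <= x[1]);
    return _mm_set_pd(sqrt_rounded(x[1], _MM_ROUND_UP),
                      sqrt_rounded(x[0], _MM_ROUND_DOWN));
}
```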
## SSE Vectorization

The above code works, and is very easy to follow. However, it isn't very fast. The problem is that changing the rounding mode is a slow operation. It would be very nice if we could maintain a single rounding mode throughout, and use some trickery to make things work correctly. Fortunately, this has been shown to be possible by Branimir Lambov. Instead of storing the maximum, he stores its negative. Thus both the minimum and maximum round in the same direction: towards negative infinity. Using this, he describes how one can make simple routines for addition and subtraction. Finally, he shows how, with the addition of a single new instruction, interval multiplication can be made efficient. Unfortunately, such an instruction doesn't exist, and trying to emulate it leads to relatively slow code.

Let's see what is possible with the SSE2 instruction set as it exists, rather than how we might like it to be. We will use a similar trick to B. Lambov, but store the negative of the minimum instead of the maximum, and round towards positive infinity. Thus we have the helper functions:
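A sketch of helpers for this encoding. Which lane holds the negated minimum is an assumption here (low lane = −minimum, high lane = maximum), and the names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics and the __m128d type */

/* Assumed layout: low lane = -minimum, high lane = maximum. */
typedef __m128d interval;

/* Build an interval from its ordinary bounds. Negation is exact, so no
   rounding is involved in the conversion. */
static inline interval interval_make(double min, double max)
{
    assert(min <= max);
    return _mm_set_pd(max, -min);
}

/* Recover the ordinary bounds from the encoded lanes. */
static inline double interval_min(interval x) { return -x[0]; }
static inline double interval_max(interval x) { return x[1]; }
```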
Addition in this encoding is quite simple:
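With both lanes rounding upward, addition reduces to a single packed add. The sketch below assumes the low-lane-is-negated-minimum layout, hypothetical helper names, and that the caller has already selected round-toward-plus-infinity. With GCC, code like this generally also wants `-frounding-math`.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2; also provides _MM_SET_ROUNDING_MODE */

/* Assumed layout: low lane = -minimum, high lane = maximum. */
typedef __m128d interval;

static inline interval interval_make(double min, double max)
{
    assert(min <= max);
    return _mm_set_pd(max, -min);
}

static inline double interval_min(interval x) { return -x[0]; }
static inline double interval_max(interval x) { return x[1]; }

/* (-a, b) + (-c, d) = (-(a+c), b+d). Rounding both lanes up widens the
   interval at both ends. The caller must have selected _MM_ROUND_UP. */
static inline interval interval_add(interval x, interval y)
{
    return _mm_add_pd(x, y);
}
```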
Subtraction is a little more complex. However, by breaking it into two steps it is easy to see how to convert into SSE instructions. Firstly, we have negation:
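The two steps can be sketched as a lane swap followed by an add. As before, the lane layout and helper names are assumptions, and the caller is assumed to have selected round-toward-plus-infinity.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2; also provides _MM_SET_ROUNDING_MODE */

/* Assumed layout: low lane = -minimum, high lane = maximum. */
typedef __m128d interval;

static inline interval interval_make(double min, double max)
{
    assert(min <= max);
    return _mm_set_pd(max, -min);
}

static inline double interval_min(interval x) { return -x[0]; }
static inline double interval_max(interval x) { return x[1]; }

/* -[min,max] = [-max,-min], which in this layout is just a lane swap. */
static inline interval interval_neg(interval x)
{
    return _mm_shuffle_pd(x, x, 1);
}

/* x - y = x + (-y). The caller must have selected _MM_ROUND_UP. */
static inline interval interval_sub(interval x, interval y)
{
    return _mm_add_pd(x, interval_neg(y));
}
```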
Multiplication with SSE acceleration isn't trivial. As can be seen from the non-accelerated version, the underlying algorithm is quite complex. However, as we shall see, it is possible to simplify things enormously without any cost in accuracy.

Firstly, we shall benchmark the original algorithm. To do this, we multiply a series of random pairs of intervals, making sure to include extreme values such as ±0 and ±∞. By pre-computing the set of intervals to use, and storing them in an array, we try to minimize the benchmark overhead. The end result is that the "empty" loop consisting of just a component-wise multiply takes about 4 cycles per iteration, and the algorithm above takes a very slow 110 cycles per call.

Next, we can do a direct translation of the old interval multiply code. The only difficulty here is making sure all the rounding is done in the right way. Thus there are a few new minus signs in the formulae, and a few of the comparisons change direction:
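A sketch of what such a translation might look like, assuming the low-lane-is-negated-minimum layout and a caller that has selected round-toward-plus-infinity. It is a sign-case breakdown written for this sketch, not the article's listing; it ignores 0 × ∞, and with GCC it generally needs `-frounding-math` so that the negations are not rearranged.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2; also provides _MM_SET_ROUNDING_MODE */

/* Assumed layout: low lane = -minimum, high lane = maximum. */
typedef __m128d interval;

static inline interval interval_make(double min, double max)
{
    assert(min <= max);
    return _mm_set_pd(max, -min);
}

static inline double interval_min(interval x) { return -x[0]; }
static inline double interval_max(interval x) { return x[1]; }

/* Pack an already negated minimum and a maximum. */
static inline interval iv(double neg_min, double max)
{
    return _mm_set_pd(max, neg_min);
}

/* x = [-na, b], y = [-nc, d]. Every product is rounded up; the products
   feeding the low lane are arranged to equal -(lower bound). */
static inline interval interval_mul(interval x, interval y)
{
    double na = x[0], b = x[1];
    double nc = y[0], d = y[1];

    if (na <= 0.0) {                                    /* x >= 0 */
        if (nc <= 0.0) return iv(na * -nc, b * d);      /* y >= 0 */
        if (d <= 0.0)  return iv(b * nc, -na * d);      /* y <= 0 */
        return iv(b * nc, b * d);                       /* y overlaps 0 */
    }
    if (b <= 0.0) {                                     /* x <= 0 */
        if (nc <= 0.0) return iv(na * d, b * -nc);
        if (d <= 0.0)  return iv(-b * d, na * nc);
        return iv(na * d, na * nc);
    }
    /* x overlaps 0 */
    if (nc <= 0.0) return iv(na * d, b * d);
    if (d <= 0.0)  return iv(b * nc, na * nc);

    /* Both overlap 0: take the larger candidate in each lane. */
    double n1 = na * d, n2 = b * nc;     /* candidates for -minimum */
    double m1 = na * nc, m2 = b * d;     /* candidates for maximum */
    return iv(n1 > n2 ? n1 : n2, m1 > m2 ? m1 : m2);
}
```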
Simply by removing all of the rounding-mode changes we double the speed, and it now takes 50 cycles on average. Next we can replace the component-wise operations with their SSE analogues. The only trick here is noticing that there are two different ways to swap the upper and lower parts of an SSE register. One way is via
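For illustration, two SSE2 instructions that can swap the halves of a register are `shufpd` and `pshufd`. Whether these are the two the text has in mind is an assumption of this sketch, as are the function names.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics; _MM_SHUFFLE comes via xmmintrin.h */

/* Swap with the floating point shuffle (shufpd): the low result lane takes
   the high lane of the first operand, the high result lane takes the low
   lane of the second operand. */
static inline __m128d swap_shufpd(__m128d x)
{
    return _mm_shuffle_pd(x, x, 1);
}

/* Swap with the integer shuffle (pshufd): treat the register as four
   32-bit words and exchange the two 64-bit halves. The casts only
   reinterpret the bits and generate no instructions. */
static inline __m128d swap_pshufd(__m128d x)
{
    __m128i bits = _mm_castpd_si128(x);
    bits = _mm_shuffle_epi32(bits, _MM_SHUFFLE(1, 0, 3, 2));
    return _mm_castsi128_pd(bits);
}
```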
This is slightly faster, taking about 46.5 cycles on average, since we gain some instruction-level parallelism. Unfortunately, this particular computer has slow SSE2 operations (it internally breaks them down into two 64-bit micro-ops rather than having a full 128-bit unit), so the speedup isn't as great as we might predict. On other, newer, machines the results could be much better.

The next thing to worry about is all the branching in the above algorithm. This slows it down quite a bit. Perhaps with a little data massaging we can improve things. One way to do this is to remember that negating an interval with this encoding is a trivial swap of the high and low parts. Thus if we conditionally swap
This code is faster again, taking 44 cycles on average. However, by accessing the upper parts of the SSE register in order to determine whether to negate or not, the compiler has to generate rather bulky code. If instead, we use the
Unfortunately, this is a disappointing 48.5 cycles per call. It seems on this machine, the
However, this isn't the only way. By using the power of the
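A branchless conditional swap can be built from bitwise operations. The sketch below uses the xor-and-mask idiom; it is one possible formulation and not necessarily the article's routine, and `cond_swap` is a hypothetical name.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Swap the two lanes of x when mask is all-ones, leave x untouched when
   mask is all-zeros. Both lanes of mask must carry the same value, as
   produced for example by a packed compare on broadcast operands. */
static inline __m128d cond_swap(__m128d x, __m128d mask)
{
    __m128d swapped = _mm_shuffle_pd(x, x, 1);
    /* Bits in which x and its swapped form differ, kept only if selected. */
    __m128d diff = _mm_and_pd(_mm_xor_pd(x, swapped), mask);
    /* Flipping those bits turns x into swapped; flipping none keeps x. */
    return _mm_xor_pd(x, diff);
}
```

Since negation in this encoding is a lane swap, such a routine conditionally negates an interval without a branch.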
Using the above conditional swap subroutine, we can replace the
Unfortunately, the resulting timings are horrible: taking 62.5 cycles on average. However, there is another optimization we have forgotten. Since we have reduced the number of cases, it is now possible to remove the "one overlap" code. (The two-overlap case works for it.) Doing this small change:
This leads to 62 cycles per call. Thus we have a small improvement, but still not as good as the earlier algorithms. However, we have now reduced the complexity of the branching enough that we can start to think about using SSE operations for that. Also, noticing that we really don't need to think about conditional swaps if we are in the double-zero-overlap or single-overlap cases, we can simplify even more:
This completes in an impressive 37 cycles per call, and is the fastest algorithm so far. There are a few subtleties in the above though. The first is that we use the

The above isn't quite optimal. It has three conditional swaps, where it is actually possible to do it in two. By looking at what is actually required we can rearrange the logic so that the condition for
This takes 33 cycles on average on this machine, more than three times faster than the naive method. Unfortunately, there doesn't seem to be a simple way to convert the final test and branch into SSE code without slowdowns. (The logic in each path is just too different.) However, if anyone finds a better way with standard SSE2 instructions it would be interesting to know. (Note that the

Now division isn't quite so difficult as multiplication if we simply reuse the above algorithm. The trick again is to use a reciprocal + multiply:
Finally, this leaves us with the square root operation. The problem with it is that we cannot use the negation trick in order to get correct rounding for the minimum bound of the interval. Square roots of negative numbers yield a NaN, so we will need to try something else. One way to do it is to do component-wise square roots, and then adjust the minimum if rounding occurred:
Unfortunately, the

So how much do we need to scale the minimum? By doing a Taylor expansion, it is possible to show that we need to scale the minimum down by 2 ulp. However, the multiply instruction to do that also rounds the wrong way. Through some experimentation, it seems 8 ulp is what is needed, including the double rounding effects. Using that we have:
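A sketch of a scaled square root in the negated-minimum encoding. It assumes the low lane holds −minimum, that the caller has selected round-toward-plus-infinity, that the minimum is non-negative, and that "8 ulp" is realised by multiplying the minimum by 1 − 2⁻⁵⁰ (eight units in the last place just below 1). The exact constant and the names are assumptions of this sketch.

```c
#include <assert.h>
#include <float.h>       /* DBL_EPSILON = 2^-52 */
#include <emmintrin.h>   /* SSE2; also provides _MM_SET_ROUNDING_MODE */

/* Assumed layout: low lane = -minimum, high lane = maximum. */
typedef __m128d interval;

static inline interval interval_make(double min, double max)
{
    assert(min <= max);
    return _mm_set_pd(max, -min);
}

static inline double interval_min(interval x) { return -x[0]; }
static inline double interval_max(interval x) { return x[1]; }

/* Component-wise square root with the minimum pulled down afterwards.
   The caller must have selected _MM_ROUND_UP. */
static inline interval interval_sqrt(interval x)
{
    /* Sign bit of the low lane only: xor flips -minimum <-> minimum. */
    const __m128d flip_lo = _mm_set_pd(0.0, -0.0);
    /* Low lane: 1 - 2^-50 (exactly representable). High lane: 1.0. */
    const __m128d scale = _mm_set_pd(1.0, 1.0 - 4.0 * DBL_EPSILON);

    __m128d v = _mm_xor_pd(x, flip_lo);   /* (minimum, maximum) */
    v = _mm_sqrt_pd(v);                   /* both lanes round up */
    v = _mm_mul_pd(v, scale);             /* shrink minimum, keep maximum */
    return _mm_xor_pd(v, flip_lo);        /* back to (-minimum, maximum) */
}
```

The price of avoiding a second rounding mode is a lower bound that is a few units in the last place looser than strictly necessary.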
Thus even highly non-linear functions like the interval square root can be optimized by using instruction-level parallelism with SSE.

Copyright © Lockless Inc All Rights Reserved.

## Comments

sfuerst said...It includes a fully-branchless two-mulpd version. I wonder what your measurement of its performance will be.