Interval Arithmetic
In some cases one needs absolute confidence in numerical calculations. The problem is that the input data may be uncertain, and the resulting error can grow until it dominates the results. To know whether this is happening, one can use interval arithmetic to model how the range of possible values changes throughout the calculation. Instead of storing a single floating point number in a variable, we store two: the minimum and maximum values possible given the accumulated "error".
Thus we have, for example, x = [a, b] and y = [c, d], with x ranging between a and b, and y between c and d. Using this we can obtain simple algorithms for the basic arithmetic operations:
x + y = [a+c, b+d]
x - y = [a-d, b-c]
x × y = [min(ac, ad, bc, bd), max(ac, ad, bc, bd)]
x / y = [min(a/c, a/d, b/c, b/d), max(a/c, a/d, b/c, b/d)] (provided y does not contain zero)
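For example, with x = [1, 2] and y = [-3, 4] these rules give x + y = [-2, 6] and x × y = [min(-3, 4, -6, 8), max(-3, 4, -6, 8)] = [-6, 8].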
Another example where this can be useful is numerically finding the extrema of a function. Using interval arithmetic, we can obtain limits on the possible range of the function within a given segment. We can then recursively subdivide the segments we are interested in to find the minimum or maximum. Things proceed even faster if we also know the function's derivative: we can require that the derivative is zero somewhere within the segment. By rejecting segments where this isn't the case, or which are known to be too high (for a minimum) or too low (for a maximum) in value compared to the best found so far, we can home in on the solution. For multidimensional minimization problems of "difficult" analytic functions, this method can be among the best.
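To make the subdivision idea concrete, here is a minimal branch-and-bound sketch. The interval struct and the f_range() interval extension of the target function are hypothetical placeholders rather than code from this article, and the derivative test is omitted for brevity:

#include <math.h>

typedef struct { double lo, hi; } interval;

/* Assumed to return guaranteed bounds on f over the segment x */
extern interval f_range(interval x);

static double best = INFINITY; /* best upper bound on the minimum so far */

static void minimize(interval x, double tol)
{
    interval fx = f_range(x);

    /* Reject segments whose lowest possible value cannot beat the best */
    if (fx.lo >= best) return;

    /* Any single point gives an upper bound on the true minimum */
    double mid = 0.5 * (x.lo + x.hi);
    interval fm = f_range((interval){mid, mid});
    if (fm.hi < best) best = fm.hi;

    /* Stop when the segment is small enough, otherwise subdivide */
    if (x.hi - x.lo < tol) return;
    minimize((interval){x.lo, mid}, tol);
    minimize((interval){mid, x.hi}, tol);
}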
The problem with interval arithmetic is that it is guaranteed to give the tightest limits only when each variable appears just once in the formula being calculated (i.e. it assumes there is no correlation between the inputs of each operation). If one calculates something like f(x) = x², the two copies of x fed into the multiply are obviously perfectly correlated, so the resulting bounds aren't as tight as they could be. Fortunately, it is often possible to rearrange formulae to reduce this effect.
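For instance, squaring x = [-1, 2] via the multiplication rule gives x × x = [min(1, -2, -2, 4), max(1, -2, -2, 4)] = [-2, 4], even though the true range of x² over that interval is only [0, 4].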
Interval Floating Point
Since intervals are pairs of numbers, it is convenient to pass them around in SSE registers, each of which holds a vector of two double precision floats. We will use the low entry as the minimum, and the high entry as the maximum of the interval. Code to create such an encoding, and to access its component parts, is:
#include <math.h>
#include <emmintrin.h>
double in1_min(__m128d x)
{
return x[0];
}
double in1_max(__m128d x)
{
return x[1];
}
__m128d in1_create(double min, double max)
{
return (__m128d){min, max};
}
In the above we use a new extension in gcc 4.6 to access the high and low parts via indexing. Older versions of gcc require using a union and writing to an array of two doubles. This is cumbersome, and extra slow when optimization is turned off.
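For reference, the union-based version for older compilers might look something like the following (the in1_*_old names are just for illustration):

typedef union
{
    __m128d v;
    double d[2];
} in1_union;

double in1_min_old(__m128d x)
{
    in1_union u;
    u.v = x;
    return u.d[0];
}

double in1_max_old(__m128d x)
{
    in1_union u;
    u.v = x;
    return u.d[1];
}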
Adding two numbers in interval form then isn't too difficult. We extract the minimum and maximum parts, and add them. The trick here is rounding. By default, the computer will "round to even". This isn't what we want, as half the time the rounding will be in the wrong direction. We need to make sure that the rounding always increases the size of the region, otherwise points inside can "escape" and we will get incorrect results.
Thus we will need to fiddle with the SSE hardware's rounding modes. Firstly, we need to be able to save and restore the state, so that when we are done with interval arithmetic we can return things to the way they were. (Round to even is much more accurate for normal floating point work, as errors do not accumulate systematically and tend to cancel out instead.)
static unsigned msr_orig;
void interval_on(void)
{
/* Save rounding modes */
msr_orig = _mm_getcsr();
}
void interval_off(void)
{
/* Restore them */
_mm_setcsr(msr_orig);
}
Now, we will need to make sure the rounding is in the right direction for each addition. The maximums will need to round towards positive infinity, and the minimums to negative infinity:
__m128d in1_add(__m128d x, __m128d y)
{
double rmin, rmax;
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
rmax = in1_max(x) + in1_max(y);
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
rmin = in1_min(x) + in1_min(y);
return in1_create(rmin, rmax);
}
Subtraction works similarly:
__m128d in1_sub(__m128d x, __m128d y)
{
double rmin, rmax;
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
rmax = in1_max(x) - in1_min(y);
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
rmin = in1_min(x) - in1_max(y);
return in1_create(rmin, rmax);
}
Multiplication is quite a bit more complex. Fortunately, we don't need to compute the minimum and maximum over eight different multiplies. By checking whether each interval straddles zero, or is entirely positive or negative, we can break the problem down into cases where at most two multiplies are needed for each bound, and usually only one is required. The code for this looks like:
__m128d in1_mul(__m128d x, __m128d y)
{
double rmin, rmax;
double t1, t2;
if (in1_max(x) <= 0)
{
/* Both negative */
if (in1_max(y) <= 0)
{
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
rmax = in1_min(x) * in1_min(y);
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
rmin = in1_max(x) * in1_max(y);
return in1_create(rmin, rmax);
}
/* One negative, one positive */
if (in1_min(y) >= 0)
{
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
rmax = in1_max(x) * in1_min(y);
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
rmin = in1_min(x) * in1_max(y);
return in1_create(rmin, rmax);
}
/* y is both negative and positive, x is only negative */
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
rmax = in1_min(x) * in1_min(y);
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
rmin = in1_min(x) * in1_max(y);
return in1_create(rmin, rmax);
}
if (in1_min(x) >= 0)
{
/* One negative, one positive */
if (in1_max(y) <= 0)
{
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
rmax = in1_min(x) * in1_max(y);
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
rmin = in1_max(x) * in1_min(y);
return in1_create(rmin, rmax);
}
/* Both positive */
if (in1_min(y) >= 0)
{
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
rmax = in1_max(x) * in1_max(y);
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
rmin = in1_min(x) * in1_min(y);
return in1_create(rmin, rmax);
}
/* y is both negative and positive, x is only positive */
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
rmax = in1_max(x) * in1_max(y);
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
rmin = in1_max(x) * in1_min(y);
return in1_create(rmin, rmax);
}
/* Both overlap zero */
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
t1 = in1_max(x) * in1_max(y);
t2 = in1_min(x) * in1_min(y);
rmax = t1;
if (t2 > rmax) rmax = t2;
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
t1 = in1_max(x) * in1_min(y);
t2 = in1_min(x) * in1_max(y);
rmin = t1;
if (t2 < rmin) rmin = t2;
return in1_create(rmin, rmax);
}
Division is much more complex, as we need to worry about division by zero. Fortunately, there is an easier way. We can break the problem down into obtaining the reciprocal, and then multiplying that reciprocal by the dividend. The extra work isn't too much compared to how slow a division is, and saves quite a bit of code:
__m128d in1_recip(__m128d x)
{
double rmin, rmax;
/* Possible divide by zero */
if ((in1_max(x) >= 0) && (in1_min(x) <= 0))
{
return in1_create(-INFINITY, INFINITY);
}
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
rmax = 1.0/in1_min(x);
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
rmin = 1.0/in1_max(x);
return in1_create(rmin, rmax);
}
__m128d in1_div(__m128d x, __m128d y)
{
return in1_mul(x, in1_recip(y));
}
Finally, we can obtain square roots in the obvious manner:
__m128d in1_sqrt(__m128d x)
{
double rmin, rmax;
if (in1_min(x) < 0) return in1_create(NAN, NAN);
_MM_SET_ROUNDING_MODE(_MM_ROUND_UP);
rmax = sqrt(in1_max(x));
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);
rmin = sqrt(in1_min(x));
return in1_create(rmin, rmax);
}
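As a quick illustration of how these routines fit together, the following sketch (not from the original article) evaluates (x + y) / sqrt(y) in interval form and prints the resulting bounds:

#include <stdio.h>

int main(void)
{
    interval_on();

    __m128d x = in1_create(1.0, 2.0);
    __m128d y = in1_create(3.0, 4.0);

    /* z = (x + y) / sqrt(y), with outward rounding at every step */
    __m128d z = in1_div(in1_add(x, y), in1_sqrt(y));
    printf("z = [%.17g, %.17g]\n", in1_min(z), in1_max(z));

    interval_off();
    return 0;
}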
SSE Vectorization
The above code works, and is very easy to follow. However, it isn't very fast. The problem is that changing the rounding mode is a slow operation. It would be very nice if we could keep a single rounding mode throughout, and use some trickery to make things work correctly. Fortunately, this has been shown to be possible by Branimir Lambov. Instead of storing the maximum, he stores its negative. Both the minimum and the negated maximum then round in the same direction: towards negative infinity.
Using this, he describes how one can make simple routines for addition and subtraction. Finally, he shows how, with the addition of a single new instruction, interval multiplication could be made efficient. Unfortunately, such an instruction doesn't exist, and trying to emulate it leads to relatively slow code. Let's see what is possible with the SSE2 instruction set as it exists, rather than as we might like it to be.
We will use a trick similar to Lambov's, but store the negative of the minimum instead of the maximum, and round towards positive infinity. The interval [a, b] is thus stored as the pair (-a, b). This gives the helper functions:
double in2_min(__m128d x)
{
return -x[0];
}
double in2_max(__m128d x)
{
return x[1];
}
/* Get the real value stored in the SSE register */
double in2_rmin(__m128d x)
{
return x[0];
}
__m128d in2_create(double min, double max)
{
return (__m128d){-min, max};
}
__m128d in2_raw_create(double nmin, double max)
{
return (__m128d){nmin, max};
}
Addition in this encoding is quite simple: [-a, b] + [-c, d] = [-(a+c), b+d]. (Assuming of course that we have first set the rounding mode towards positive infinity via _MM_SET_ROUNDING_MODE(_MM_ROUND_UP);.) Since the result of the component-wise addition is exactly what we need, we can use a single SSE instruction for this task:
__m128d in2_add(__m128d x, __m128d y)
{
/* Component-wise addition */
return x + y;
}
Subtraction is a little more complex. However, by breaking it into two steps it is easy to see how to convert it into SSE instructions. First, negation: negating [a, b] gives [-b, -a], whose stored pair (b, -a) is simply the original stored pair (-a, b) with its high and low halves swapped. Then, since x - y = x + (-y), we can reuse the addition routine to complete the subtraction. This gives:
__m128d in2_sub(__m128d x, __m128d y)
{
/* Swap the high and low parts of y to negate it */
y = _mm_shuffle_pd(y, y, 1);
/* Component-wise addition */
return x + y;
}
Multiplication with SSE acceleration isn't trivial. As can be seen from the non-accelerated version, the underlying algorithm is quite complex. However, as we shall see, it is possible to simplify things enormously without any cost in accuracy. Firstly, we shall benchmark the original algorithm. To do this, we test a series of random pairs of intervals to multiply, making sure to include extreme values such as ±0 and ±∞. By pre-computing the set of intervals to use and storing them in an array, we try to minimize the benchmark overhead. The end result is that an "empty" loop consisting of just a component-wise multiply takes about 4 cycles per iteration, while the algorithm above takes a very slow 110 cycles per call.
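The timing harness itself is not shown; a rough sketch of the sort of loop involved might look like the following, where the array size, the rdtsc() helper and the xor accumulator are assumptions used only to keep the compiler from optimizing the calls away:

#define N_TESTS 1024

static __m128d xs[N_TESTS], ys[N_TESTS]; /* pre-computed random intervals */
static __m128d sink;                     /* keeps the results live */

static inline unsigned long long rdtsc(void)
{
    unsigned lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((unsigned long long) hi << 32) | lo;
}

unsigned long long bench_mul(void)
{
    __m128d acc = {0.0, 0.0};
    unsigned long long t = rdtsc();

    for (int i = 0; i < N_TESTS; i++)
    {
        acc = _mm_xor_pd(acc, in1_mul(xs[i], ys[i]));
    }

    t = rdtsc() - t;
    sink = acc;

    return t / N_TESTS; /* approximate cycles per call */
}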
As a first step, we can do a direct translation of the old interval multiply code. The only difficulty here is making sure all the rounding is done the right way. Thus there are a few new minus signs in the formulae, and a few of the comparisons change direction:
__m128d in2_mul1(__m128d x, __m128d y)
{
double rmin, rmax;
double t1;
double t2;
if (in2_max(x) <= 0)
{
/* Both negative */
if (in2_max(y) <= 0)
{
rmax = in2_rmin(x) * in2_rmin(y);
rmin = (-in2_max(x)) * in2_max(y);
return in2_raw_create(rmin, rmax);
}
/* One negative, one positive */
if (in2_rmin(y) <= 0)
{
rmax = (-in2_max(x)) * in2_rmin(y);
rmin = in2_rmin(x) * in2_max(y);
return in2_raw_create(rmin, rmax);
}
/* y is both negative and positive, x is only negative */
rmax = in2_rmin(x) * in2_rmin(y);
rmin = in2_rmin(x) * in2_max(y);
return in2_raw_create(rmin, rmax);
}
if (in2_rmin(x) <= 0)
{
/* One negative, one positive */
if (in2_max(y) <= 0)
{
rmax = (-in2_rmin(x)) * in2_max(y);
rmin = in2_max(x) * in2_rmin(y);
return in2_raw_create(rmin, rmax);
}
/* Both positive */
if (in2_rmin(y) <= 0)
{
rmax = in2_max(x) * in2_max(y);
rmin = (-in2_rmin(x)) * in2_rmin(y);
return in2_raw_create(rmin, rmax);
}
/* y is both negative and positive, x is only positive */
rmax = in2_max(x) * in2_max(y);
rmin = in2_max(x) * in2_rmin(y);
return in2_raw_create(rmin, rmax);
}
/* Both overlap zero */
t1 = in2_max(x) * in2_max(y);
t2 = in2_rmin(x) * in2_rmin(y);
rmax = t1;
if (t2 > rmax) rmax = t2;
t1 = in2_max(x) * in2_rmin(y);
t2 = in2_rmin(x) * in2_max(y);
rmin = t1;
if (t2 > rmin) rmin = t2;
return in2_raw_create(rmin, rmax);
}
Simply by removing all of the rounding-mode changes we double the speed: it now takes 50 cycles on average. Next we can replace the component-wise operations with their SSE analogues. The only trick here is noticing that there are two different ways to swap the upper and lower parts of an SSE register. One way is via x = _mm_shuffle_pd(x, x, 1); , and the other via x = (__m128d)_mm_shuffle_epi32((__m128i) x, 0x4e); . The first is best when the result is required in the original register. The second is better when the result is needed in a different register. Thus we have:
__m128d in2_mul2(__m128d x, __m128d y)
{
__m128d t1;
__m128d t2;
if (in2_max(x) <= 0)
{
/* Both negative */
if (in2_max(y) <= 0)
{
x = _mm_xor_pd(x, (__m128d) {0.0, -0.0});
t1 = x * y;
return _mm_shuffle_pd(t1, t1, 1);
}
/* One negative, one positive */
if (in2_rmin(y) <= 0)
{
x = _mm_xor_pd(x, (__m128d) {0.0, -0.0});
y = _mm_shuffle_pd(y, y, 1);
return x * y;
}
/* y is both negative and positive, x is only negative */
x = _mm_unpacklo_pd(x, x);
y = _mm_shuffle_pd(y, y, 1);
return x * y;
}
if (in2_rmin(x) <= 0)
{
/* Both positive */
if (in2_rmin(y) <= 0)
{
y = _mm_xor_pd(y, (__m128d) {-0.0, 0.0});
return x * y;
}
/* One negative, one positive */
if (in2_max(y) <= 0)
{
x = _mm_shuffle_pd(x, x, 1);
y = _mm_xor_pd(y, (__m128d) {0.0, -0.0});
return x * y;
}
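/* y is both negative and positive, x is only positive */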
x = _mm_unpackhi_pd(x, x);
return x * y;
}
/* Both overlap zero */
t1 = (__m128d)_mm_shuffle_epi32((__m128i) x, 0x4e) * _mm_unpacklo_pd(y, y);
t2 = x * _mm_unpackhi_pd(y, y);
return _mm_max_pd(t1, t2);
}
This is slightly faster, taking about 46.5 cycles on average, since we gain some instruction-level parallelism. Unfortunately, this particular computer has slow SSE2 operations (it internally breaks them down into two 64 bit micro-ops rather than having a full 128 bit unit), so the speed up isn't as great as we might predict. On other, newer, machines the results could be much better.
The next thing to worry about is all the branching in the above algorithm. This slows it down quite a bit. Perhaps with a little data massaging we can improve things. One way to do this is to remember that negating an interval with this encoding is a trivial swap of the high and low parts. Thus if we conditionally swap x and y so that the maximum is always positive, we can remove quite a few of the possibilities. If we count the number of negations, we can then correct the output at the end with an extra swap if required:
__m128d in2_mul3(__m128d x, __m128d y)
{
__m128d t1;
__m128d t2;
int neg = 0;
if (in2_max(x) < 0)
{
x = _mm_shuffle_pd(x, x, 1);
neg = 1;
}
if (in2_max(y) < 0)
{
y = _mm_shuffle_pd(y, y, 1);
neg = !neg;
}
if (in2_rmin(x) <= 0)
{
if (in2_rmin(y) <= 0)
{
/* Both positive */
y = _mm_xor_pd(y, (__m128d) {-0.0, 0.0});
}
else
{
/* One overlaps zero */
x = _mm_unpackhi_pd(x, x);
}
x *= y;
}
else
{
/* Both overlap zero */
t1 = (__m128d)_mm_shuffle_epi32((__m128i) x, 0x4e) * _mm_unpacklo_pd(y, y);
t2 = x * _mm_unpackhi_pd(y, y);
x = _mm_max_pd(t1, t2);
}
if (neg) x = _mm_shuffle_pd(x, x, 1);
return x;
}
This code is faster again, taking 44 cycles on average. However, because we access the upper part of an SSE register to determine whether to negate or not, the compiler has to generate rather bulky code. If instead we use the movmskpd instruction, we can avoid some of this:
__m128d in2_mul4(__m128d x, __m128d y)
{
__m128d t1;
__m128d t2;
int n1 = _mm_movemask_pd(x);
int n2 = _mm_movemask_pd(y);
if (n1 & 2) x = _mm_shuffle_pd(x, x, 1);
if (n2 & 2) y = _mm_shuffle_pd(y, y, 1);
if (in2_rmin(x) <= 0)
{
if (in2_rmin(y) <= 0)
{
/* Both positive */
y = _mm_xor_pd(y, (__m128d) {-0.0, 0.0});
}
else
{
/* One overlaps zero */
x = _mm_unpackhi_pd(x, x);
}
x *= y;
}
else
{
/* Both overlap zero */
t1 = (__m128d)_mm_shuffle_epi32((__m128i) x, 0x4e) * _mm_unpacklo_pd(y, y);
t2 = (__m128d)_mm_shuffle_epi32((__m128i) y, 0xee) * x;
x = _mm_max_pd(t1, t2);
}
if ((n1 ^ n2) & 2) x = _mm_shuffle_pd(x, x, 1);
return x;
}
Unfortunately, this is a disappointing 48.5 cycles per call. It seems that on this machine the movmskpd instruction is annoyingly slow. Let's try a different tack: replacing the branches in the conditional swap with SSE code. The most common way to do this with SSE is to create a condition mask, and then use the and and andn instructions to obtain the required parts of the result. By logically or-ing them together, the answer is obtained. Such code for a conditional swap of the high and low parts might look like:
__m128d c_swap(__m128d x, __m128d cond)
{
__m128d t = _mm_and_pd((__m128d)_mm_shuffle_epi32((__m128i) x, 0x4e), cond);
cond = _mm_andnot_pd(cond, x);
return _mm_or_pd(cond, t);
}
However, this isn't the only way. By using the power of the xor operation, we can do a little better. The reason is that the operations can then be merged more easily with other calculations. (xor commutes and is associative, allowing more possible rearrangements than the non-commutative andn operation.)
__m128d c_swap(__m128d x, __m128d cond)
{
__m128d t = _mm_xor_pd((__m128d)_mm_shuffle_epi32((__m128i) x, 0x4e), x);
cond = _mm_and_pd(cond, t);
return _mm_xor_pd(cond, x);
}
Using the above conditional swap subroutine, we can replace the movmskpd instruction with SSE comparisons:
__m128d in2_mul5(__m128d x, __m128d y)
{
__m128d t1;
__m128d t2;
__m128d c = {NAN, 0.0};
__m128d c1 = _mm_cmple_pd(x, c);
__m128d c2 = _mm_cmple_pd(y, c);
__m128d c3;
c1 = _mm_unpackhi_pd(c1, c1);
c2 = _mm_unpackhi_pd(c2, c2);
c3 = _mm_xor_pd(c1, c2);
x = c_swap(x, c1);
y = c_swap(y, c2);
if (in2_rmin(x) <= 0)
{
if (in2_rmin(y) <= 0)
{
/* Both positive */
y = _mm_xor_pd(y, (__m128d) {-0.0, 0.0});
}
else
{
/* One overlaps zero */
x = _mm_unpackhi_pd(x, x);
}
x *= y;
}
else
{
/* Both overlap zero */
t1 = (__m128d)_mm_shuffle_epi32((__m128i) x, 0x4e) * _mm_unpacklo_pd(y, y);
t2 = (__m128d)_mm_shuffle_epi32((__m128i) y, 0xee) * x;
x = _mm_max_pd(t1, t2);
}
return c_swap(x, c3);
}
Unfortunately, the resulting timings are horrible: 62.5 cycles on average. However, there is another optimization we have forgotten. Since we have reduced the number of cases, it is now possible to remove the "one overlap" code. (The code for the two-overlap case handles it correctly as well.) Making this small change:
__m128d in2_mul6(__m128d x, __m128d y)
{
__m128d t1;
__m128d t2;
__m128d c = {NAN, 0.0};
__m128d c1 = _mm_cmple_pd(x, c);
__m128d c2 = _mm_cmple_pd(y, c);
__m128d c3;
c1 = _mm_unpackhi_pd(c1, c1);
c2 = _mm_unpackhi_pd(c2, c2);
c3 = _mm_xor_pd(c1, c2);
x = c_swap(x, c1);
y = c_swap(y, c2);
if ((in2_rmin(x) <= 0) && (in2_rmin(y) <= 0))
{
/* Both positive */
x *= _mm_xor_pd(y, (__m128d) {-0.0, 0.0});
}
else
{
/* Both overlap zero */
t1 = (__m128d)_mm_shuffle_epi32((__m128i) x, 0x4e) * _mm_unpacklo_pd(y, y);
t2 = (__m128d)_mm_shuffle_epi32((__m128i) y, 0xee) * x;
x = _mm_max_pd(t1, t2);
}
return c_swap(x, c3);
}
This leads to 62 cycles per call. Thus we have a small improvement, but it is still not as good as the earlier algorithms. However, we have now reduced the complexity of the branching enough that we can start to think about using SSE operations for it. Noticing also that we don't need to think about conditional swaps at all in the cases where one or both intervals straddle zero, we can simplify even more:
__m128d in2_mul7(__m128d x, __m128d y)
{
__m128d t1 = (__m128d)_mm_shuffle_epi32((__m128i) x, 0xee);
__m128d t2 = (__m128d)_mm_shuffle_epi32((__m128i) y, 0xee);
__m128d t3 = _mm_xor_pd(x, t1);
__m128d t4 = _mm_xor_pd(y, t2);
if (_mm_movemask_pd(_mm_and_pd(t3, t4)))
{
__m128d c = {NAN, 0.0};
__m128d c1 = _mm_cmple_pd(x, c);
__m128d c2 = _mm_cmple_pd(y, c);
__m128d c3;
c1 = _mm_unpackhi_pd(c1, c1);
c2 = _mm_unpackhi_pd(c2, c2);
c3 = _mm_xor_pd(c1, c2);
x = c_swap(x, c1);
y = c_swap(y, c2);
/* Both positive */
x *= _mm_xor_pd(y, (__m128d) {-0.0, 0.0});
return c_swap(x, c3);
}
/* There is a zero overlap */
t1 = (__m128d)_mm_shuffle_epi32((__m128i) x, 0x4e) * _mm_unpacklo_pd(y, y);
t2 *= x;
return _mm_max_pd(t1, t2);
}
This completes in an impressive 37 cycles per call, and is the fastest algorithm so far. There are a few subtleties in the above though. The first is that we use the _mm_shuffle_epi32 intrinsic, which encodes into the pshufd instruction, instead of the obvious unpckhpd instruction. Since we still need x and y, this combines the move and modify operations into a single instruction. Another issue is that we need to be careful about the handling of negative zero. This is the reason for the less-than-or-equal comparison in the swapping code, rather than the obvious less-than.
The above isn't quite optimal. It has three conditional swaps, where two are actually enough. By looking at what is really required we can rearrange the logic so that the condition for x depends on the state of y, and vice versa. We also notice that we can reuse t1 and t2 instead of performing extra unpack operations for the comparison. Doing this yields the very compact:
__m128d in2_mul8(__m128d x, __m128d y)
{
__m128d t1 = (__m128d)_mm_shuffle_epi32((__m128i) x, 0xee);
__m128d t2 = (__m128d)_mm_shuffle_epi32((__m128i) y, 0xee);
__m128d t3 = _mm_xor_pd(x, t1);
__m128d t4 = _mm_xor_pd(y, t2);
if (_mm_movemask_pd(_mm_and_pd(t3, t4)))
{
__m128d c = {0.0, 0.0};
__m128d c1 = _mm_cmple_pd(t2, c);
__m128d c2 = _mm_cmple_pd(t1, c);
__m128d c3 = (__m128d) {-0.0, 0.0};
x = c_swap(_mm_xor_pd(x, c3), c1);
y = c_swap(_mm_xor_pd(y, c3), c2);
return x * _mm_xor_pd(y, c3);
}
/* There is a zero overlap */
t1 = (__m128d)_mm_shuffle_epi32((__m128i) x, 0x4e) * _mm_unpacklo_pd(y, y);
t2 *= x;
return _mm_max_pd(t1, t2);
}
This takes 33 cycles on average on this machine, more than three times faster than the naive method. Unfortunately, there doesn't seem to be a simple way to convert the final test and branch into SSE code without slowdowns. (The logic in each path is just too different.) However, if anyone finds a better way with standard SSE2 instructions it would be interesting to know. (Note that the pblendvb instruction in SSE4 might help though...)
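As an aside, if SSE4.1 were available, the conditional swap could plausibly be written with blendvpd, the double-precision relative of pblendvb. A speculative, unbenchmarked sketch:

#include <smmintrin.h>

__m128d c_swap_sse4(__m128d x, __m128d cond)
{
    __m128d swapped = (__m128d)_mm_shuffle_epi32((__m128i) x, 0x4e);

    /* Pick the swapped value where the mask lane is all-ones, x otherwise */
    return _mm_blendv_pd(x, swapped, cond);
}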
Now division isn't quite so difficult as multiplication if we simply reuse the above algorithm. The trick again is to use a reciprocal + multiply:
__m128d in2_recip(__m128d x)
{
if (_mm_movemask_pd(_mm_cmple_pd((__m128d) {0, 0}, x)) == 3)
{
return in2_raw_create(INFINITY, INFINITY);
}
x = _mm_shuffle_pd(x, x, 1);
return _mm_div_pd((__m128d) {-1.0, -1.0}, x);
}
__m128d in2_div(__m128d x, __m128d y)
{
return in2_mul8(x, in2_recip(y));
}
Finally, this leaves us with the square root operation. The problem is that we cannot use the negation trick to get correct rounding for the minimum bound of the interval: square roots of negative numbers yield a NaN, so we will need something else. One approach is to take component-wise square roots, and then adjust the minimum if rounding occurred:
__m128d in2_sqrt1(__m128d x)
{
double rmin, rmax;
if (in2_min(x) < 0) return in2_create(NAN, NAN);
rmax = sqrt(in2_max(x));
rmin = sqrt(in2_min(x));
/* Rounding occurred? */
if (rmin * rmin != in2_min(x))
{
rmin = nextafter(rmin, -INFINITY);
}
return in2_create(rmin, rmax);
}
Unfortunately, the nextafter() library function is very slow, so we need a better way. There is a trick that we can use: offset the inputs to the sqrtpd instruction so that even with the wrongly-rounded result, we still obtain correct bounds. If we scale the minimum down by enough, then even though the rounding is upwards it cannot push the result back above the true lower bound. This can lead to slightly looser limits than otherwise, but not by much.
So how much do we need to scale the minimum down? Since sqrt(x(1-ε)) ≈ sqrt(x)(1-ε/2), a Taylor expansion suggests that scaling the minimum down by about 2 ulp is enough. However, the multiply instruction used to do that also rounds the wrong way. Through some experimentation, it seems 8 ulp is what is needed once the double rounding effects are included. Using that we have:
__m128d in2_sqrt2(__m128d x)
{
if (in2_rmin(x) > 0) return in2_create(NAN, NAN);
/* Two roundings to counteract with multiply... */
x *= (__m128d) {-0x1.ffffffffffff8p-1, 1.0};
x = _mm_sqrt_pd(x);
return _mm_xor_pd(x, (__m128d) {-0.0, 0});
}
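To tie the pieces together, a short illustrative driver (not from the original article) for the in2_* routines might look like this. Note that the round-up mode must be selected once before any of them are used:

#include <stdio.h>

int main(void)
{
    interval_on();
    _MM_SET_ROUNDING_MODE(_MM_ROUND_UP);

    __m128d x = in2_create(-1.0, 2.0);
    __m128d y = in2_create(3.0, 4.0);

    /* z = (x + y) * sqrt(y), all bounds rounded outwards */
    __m128d z = in2_mul8(in2_add(x, y), in2_sqrt2(y));
    printf("z = [%.17g, %.17g]\n", in2_min(z), in2_max(z));

    interval_off();
    return 0;
}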
Thus even highly non-linear functions like the interval square root can be optimized by exploiting the SIMD parallelism of SSE, operating on both bounds at once.