# Classifying Floating Point Numbers

IEEE floating point numbers consist of three components. The least significant bits hold the "significand", an integer representing the fractional part of the number. Next comes the "exponent", a biased integer describing what power of two the significand needs to be multiplied by. Finally, the most significant bit contains the sign of the number.
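As a concrete illustration, the three fields of a `double` can be pulled apart with plain bit operations. This is only a sketch: the `decompose` helper and its parameter names are made up for illustration, and are not part of any library.

``````c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: split an IEEE double into its three fields. */
static void decompose(double d, int *sign, int *exponent, uint64_t *significand)
{
	uint64_t bits;
	memcpy(&bits, &d, sizeof(bits));	/* safe type pun, no aliasing issues */

	*sign = (int)(bits >> 63);		/* most significant bit */
	*exponent = (int)((bits >> 52) & 0x7ff);	/* 11 biased exponent bits */
	*significand = bits & 0xfffffffffffffull;	/* 52 fractional bits */
}
``````

For example, 1.0 decomposes into sign 0, a biased exponent of 1023 (the bias for `double`), and a zero fractional part.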

The C programming language on x86_64 systems exposes four different floating point types. Three of these are standard IEEE types of 32, 64 and 128 bits in size. The remaining type is an Intel-specific 80-bit floating point number. In addition to the difference in size, the Intel format stores the leading one of the significand explicitly, whereas the IEEE formats leave it implicit and store only the fractional part.

The different floating point types contain different precisions for the exponents and significands. They are also passed in different registers.

| C Type | Size in bits | Exponent size in bits | Significand size in bits | Register |
|---|---|---|---|---|
| `float` | 32 | 8 | 23 | SSE single precision |
| `double` | 64 | 11 | 52 | SSE double precision |
| `long double` | 80 | 15 | 63 + leading 1 | memory + x87 stack |
| `__float128` | 128 | 15 | 112 | SSE 128bit |

Most of the above floating point types are well known. However, the `__float128` type is a little obscure. It is currently poorly supported in the GNU glibc library, lacking most of the math library bindings. The reason is that there is currently no hardware support for this type on x86 machines. Arithmetic with this type instead falls back to a software floating point implementation, which is quite slow.

IEEE floating point numbers can represent more than just normal numbers. Special bit patterns are used for ±∞, ±0, subnormal numbers, and signaling and non-signaling NaN's. If the exponent has all bits set, or all bits clear, then the number isn't "normal". The two infinities have all exponent bits set, together with all significand bits clear. If a bit is set in the significand, then the result is a NaN. If the highest fractional significand bit is set (the second highest significand bit on the 80-bit format, just below the explicit integer bit), then the NaN is "quiet"; otherwise it is signaling.

Zero is represented by all bits clear, and negative zero has just the sign bit set. If the exponent has all bits clear, and there are still bits set in the significand, then the result is a subnormal number. These numbers do not have the implicit leading one, and allow gradual underflow.
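These encodings are easy to verify directly. The sketch below views a `double` as a 64-bit integer; the `double_bits` helper name is made up here, not taken from any library.

``````c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* View the raw bits of an IEEE double. */
static uint64_t double_bits(double d)
{
	uint64_t u;
	memcpy(&u, &d, sizeof(u));	/* safe type pun */
	return u;
}
``````

With this, +∞ is `0x7ff0000000000000`, negative zero is `0x8000000000000000` (just the sign bit), and any NaN has the exponent field all ones together with a non-zero significand.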

Many algorithms depend on testing floating point numbers to see how they are classified. For example, a data stream may represent missing points as NaN's, which then need to be tested for later on. So what is the fastest way to test for a NaN?

### isnan()

The most obvious function to use is the `isnan()` function/macro from the C math library. This will return non-zero if the floating point number is a NaN. Benchmarking 2²⁷ repetitions of this test takes, on this machine:

| Type | Time in sec |
|---|---|
| `float` | 0.68 |
| `double` | 0.74 |
| `long double` | 1.45 |
| `__float128` | 6.98 |

As can be seen, the larger the type, the slower the operation. The `__float128` case is extremely slow because glibc contains no `isnan()` implementation optimized for that type. The compiler makes an implicit conversion to the `long double` type, which slows things down enormously.

Can we do better? It turns out that `gcc` is smarter than the C library. If we use the fact that NaN's always return false for a comparison we can write a test for them. The compiler is smart enough to recognize this, and use an optimized test in the resulting machine code.

``````
int isnan1(float x)
{
	return !(x == x);
}

int isnan2(double x)
{
	return !(x == x);
}

int isnan3(long double x)
{
	return !(x == x);
}
``````

The fact that `__float128` types have no hardware support makes this technique a loser for them. Instead, we can use SSE instructions to manipulate the 128 bit values directly. With careful instruction selection, we can get a very fast function. The trick here is that the `paddusw` instruction (implied by the `_mm_adds_epu16()` intrinsic) allows us to test the exponent and significand simultaneously. After clearing the sign bit and performing the saturating addition, the result will only have the high bit of the top word set if the exponent was originally all ones. The other bits extracted by `pmovmskb` can then be massaged to test for any bits in the significand.

``````
#include <immintrin.h>

int isnan4(__float128 x)
{
	__m128i c1 = {0xffffffffffffffffull, 0x7fffffffffffffffull};
	__m128i c2 = {0x7fff7fff7fff7fffull, 0x00017fff7fff7fffull};
	__m128i x2 = *(__m128i *) &x;

	/* Clear the sign bit */
	x2 = _mm_and_si128(x2, c1);

	/* Saturating add: sets the high bit of every non-zero significand
	   word, and of the top word only if the exponent was all ones */
	x2 = _mm_adds_epu16(x2, c2);

	return (_mm_movemask_epi8(x2) & 0xaaaa) > 0x8000;
}
``````

The optimized functions are significantly faster:

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.68 | 0.40 |
| `double` | 0.74 | 0.40 |
| `long double` | 1.45 | 0.57 |
| `__float128` | 6.98 | 0.50 |

This is quite a bit faster for `float` and `double` types, nearly three times faster for `long double`'s and over an order of magnitude faster for `__float128`'s.
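For reference, the 128-bit NaN test can also be written in portable C by viewing the value as two 64-bit words. This is only a sketch, not the SSE code above: the `isnan_u128` name is made up, and the high/low split assumes the usual little-endian layout.

``````c
#include <stdint.h>

/* NaN iff the 15 exponent bits are all ones and the 112-bit
   significand (split across hi and lo) is non-zero. */
static int isnan_u128(uint64_t hi, uint64_t lo)
{
	uint64_t hi_abs = hi & 0x7fffffffffffffffull;	/* clear the sign bit */

	return hi_abs > 0x7fff000000000000ull ||
	       (hi_abs == 0x7fff000000000000ull && lo != 0);
}
``````

This makes the logic of the SSE version easier to follow: the exponent must be all ones, and at least one significand bit must be set.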

### isfinite()

The `isfinite()` function/macro returns non-zero if the floating point number passed to it is not an infinity or a NaN; otherwise it returns zero. This can be implemented in C, but it is much faster to use assembly. We can use the `pextrw` SSE instruction to extract the part of the float holding the exponent, and then compare it against the all-bits-set pattern.

``````
isfinite1:
	pextrw  $0x1, %xmm0, %rax
	and     $0x7f80, %ax
	sub     $0x7f80, %ax
	sbb     %eax, %eax
	retq

isfinite2:
	pextrw  $0x3, %xmm0, %rax
	and     $0x7ff0, %ax
	sub     $0x7ff0, %ax
	sbb     %eax, %eax
	retq

isfinite3:
	movw    0x10(%rsp), %ax
	and     $0x7fff, %ax
	sub     $0x7fff, %ax
	sbb     %eax, %eax
	retq

isfinite4:
	pextrw  $0x7, %xmm0, %rax
	and     $0x7fff, %ax
	sub     $0x7fff, %ax
	sbb     %eax, %eax
	retq
``````

The above assembly functions correspond to the `float`, `double`, `long double` and `__float128` versions respectively. Note how the `sbb` instruction creates the result from the carry flag set, or cleared, by the previous `sub` instruction: minus one is returned when the test passes, and zero if not.

This code is about twice as fast as the functions within glibc, and again over an order of magnitude faster for the `__float128` version.

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.68 | 0.35 |
| `double` | 0.68 | 0.35 |
| `long double` | 0.68 | 0.38 |
| `__float128` | 5.5 | 0.35 |
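For comparison, the same exponent test is easy to express in portable C. This is only a sketch for the `double` case with a made-up name; it is not the code benchmarked above.

``````c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Finite iff the 11 exponent bits are not all ones. */
static int isfinite_bits(double d)
{
	uint64_t u;
	memcpy(&u, &d, sizeof(u));
	return ((u >> 52) & 0x7ff) != 0x7ff;
}
``````

Infinities and NaNs share the all-ones exponent, so this single comparison rejects both at once, just as the `sub`/`sbb` pair does above.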

### signbit()

We can use a similar trick to extract the sign bit of a floating point type. However, it is faster to simply shift all of the non-required bits away rather than use a mask. The resulting optimized asm routines are only three instructions long:

``````
signbit1:
	pextrw  $0x1, %xmm0, %eax
	shr     $15, %eax
	retq

signbit2:
	pextrw  $0x3, %xmm0, %eax
	shr     $15, %eax
	retq

signbit3:
	movzwl  0x10(%rsp), %eax
	shr     $15, %eax
	retq

signbit4:
	pextrw  $0x7, %xmm0, %eax
	shr     $15, %eax
	retq
``````

Since the task is so simple, the glibc library code is the same speed for `float` and `double` types. However, the above is a fair bit more compact, since it avoids writing to the stack simply to transfer the information from the SSE register.

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.40 | 0.40 |
| `double` | 0.45 | 0.45 |
| `long double` | 1.18 | 0.52 |
| `__float128` | 5.6 | 0.40 |

Again, the `__float128` version is much faster, and in addition the performance increase for `long double` types is also significant.
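The shift trick translates directly into portable C. Again this is a sketch for the `double` case only, with a made-up helper name.

``````c
#include <stdint.h>
#include <string.h>

/* The sign is the top bit, so a single shift removes everything else. */
static int signbit_bits(double d)
{
	uint64_t u;
	memcpy(&u, &d, sizeof(u));
	return (int)(u >> 63);
}
``````

Note that, unlike a comparison against zero, this correctly reports the sign of negative zero and of negative NaNs.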

### isnormal()

The `isnormal()` function/macro returns non-zero if its parameter is not an infinity, zero, subnormal or NaN. This corresponds to a number with an exponent that is neither all bits set nor all bits clear. Since the masking `and` instruction sets the zero flag, we can simultaneously check whether the exponent bits are all zero. This means only one explicit comparison is needed in the optimized functions.
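The test can first be sketched in portable C (a made-up helper for the `double` case, not the routines below):

``````c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Normal iff the exponent field is neither all zeros nor all ones. */
static int isnormal_bits(double d)
{
	uint64_t u;
	unsigned exp;

	memcpy(&u, &d, sizeof(u));
	exp = (unsigned)(u >> 52) & 0x7ff;
	return exp != 0 && exp != 0x7ff;
}
``````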

``````
isnormal1:
	pextrw  $0x1, %xmm0, %eax
	and     $0x7f80, %ax
	je      1f
	cmp     $0x7f80, %ax
	sbb     %eax, %eax
1:	repz; retq

isnormal2:
	pextrw  $0x3, %xmm0, %eax
	and     $0x7ff0, %ax
	je      1f
	cmp     $0x7ff0, %ax
	sbb     %eax, %eax
1:	repz; retq

isnormal3:
	movzwl  0x10(%rsp), %eax
	and     $0x7fff, %ax
	je      1f
	cmp     $0x7fff, %ax
	sbb     %eax, %eax
1:	repz; retq

isnormal4:
	pextrw  $0x7, %xmm0, %eax
	and     $0x7fff, %ax
	je      1f
	cmp     $0x7fff, %ax
	sbb     %eax, %eax
1:	repz; retq
``````

The `repz` prefix is used on the return instruction to work around a limitation of the AMD64 branch predictor: it avoids having the `ret` instruction be the target of a branch. If a `ret` is a branch target, the machine cannot correctly predict the target of the return, causing a slow-down.

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 1.02 | 0.45 |
| `double` | 1.2 | 0.46 |
| `long double` | 1.65 | 0.58 |
| `__float128` | 6.6 | 0.46 |

As can be seen by the timing results, the optimized assembly versions are about twice as fast as the glibc library, and even faster for `long double` and `__float128` types.

### isinf()

The `isinf()` function/macro is a little different from the previous functions. Those may return any non-zero value when their condition is true, and `isinf()` is specified the same way in the C99 standard. However, the glibc library defines its version to return 1 for positive infinity, and -1 for negative infinity. This means that we need to make our return value depend on the sign if we are to remain compatible. Also note that `isinf()` needs to check both the exponent and the significand, since a NaN has the same exponent bit pattern as an infinity. This makes the function much more complex than the previous routines.

The big question is which is better: SSE code, or 64bit code? SSE instructions work on a large amount of data at a time, but tend to be relatively slow and inflexible. They are also rather non-orthogonal, and the exact instruction you might like to use may not exist. This makes programming using SSE instructions difficult. Conversely, the 64bit instructions are very flexible, but only work with a small amount of data at a time. It turns out that the fastest code uses both instruction sets simultaneously, extracting the best of both worlds.
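As a reference point before the assembly, the glibc-compatible semantics can be sketched in portable C (made-up helper name, `double` case only):

``````c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* glibc convention: 1 for +inf, -1 for -inf, 0 for everything else.
   A NaN shares the all-ones exponent, so the significand must be
   checked to be zero as well. */
static int isinf_bits(double d)
{
	uint64_t u;
	memcpy(&u, &d, sizeof(u));

	if ((u & 0x7fffffffffffffffull) != 0x7ff0000000000000ull)
		return 0;
	return (u >> 63) ? -1 : 1;
}
``````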

``````
.align 16
isinf1c1:	.long 0x7fffffff, 0, 0, 0	# mask to clear the sign bit
isinf1c2:	.long 0x7f7fffff, 0, 0, 0	# largest finite float

isinf1:
	movq        %xmm0, %rdx
	xor         %eax, %eax
	pand        isinf1c1(%rip), %xmm0
	sar         $30, %edx
	ucomiss     isinf1c2(%rip), %xmm0
	cmova       %edx, %eax
	retq

.align 16
isinf2c1:	.quad 0x7fffffffffffffff, 0	# mask to clear the sign bit
isinf2c2:	.quad 0x7fefffffffffffff, 0	# largest finite double

isinf2:
	movq        %xmm0, %rdx
	xor         %eax, %eax
	pand        isinf2c1(%rip), %xmm0
	sar         $62, %rdx
	ucomisd     isinf2c2(%rip), %xmm0
	cmova       %edx, %eax
	retq

isinf3:
	movzwl      0x10(%rsp), %eax
	add         %eax, %eax          # sign bit into the carry flag
	sbb         %edx, %edx          # %edx = all ones if negative
	cmp         $0xfffe, %ax        # doubled exponent all ones?
	je          2f
1:	xor         %eax, %eax
	retq
	.align 8
2:	mov         0x8(%rsp), %rax
	add         %rax, %rax          # shift out the explicit integer bit
	jne         1b                  # remaining bits set: a NaN, not infinity
	lea         0x1(%edx, %edx, 1), %eax	# ±1 from the sign
	retq

.align 16
isinf4c1:	.quad 0xffffffffffffffff, 0x7fffffffffffffff	# clear the sign bit
isinf4c2:	.quad 0x0000000000000000, 0x7fff000000000000	# +infinity pattern

isinf4:
	movmskpd    %xmm0, %edx
	pand        isinf4c1(%rip), %xmm0
	mov         $0x1, %eax
	pcmpeqw     isinf4c2(%rip), %xmm0
	pmovmskb    %xmm0, %edi
	cmp         $0xffff, %di
	cmovne      %eax, %edx
	sub         %edx, %eax
	retq
``````

How the above routines work is non-obvious. The basic algorithm for the `float` and `double` versions is to first mask off the sign bit. This turns a negative infinity into a positive infinity. We then compare with the largest non-infinite number. If we are larger, and not a NaN, then we need to return the sign. While the SSE instructions are running, we simultaneously calculate this value. The trick is that the arithmetic shift yields ±1, since we know the second most significant bit is set. Finally, the results are combined with a conditional move instruction.

The `long double` algorithm is a little different. Here we extract the sign bit into the carry flag via the first `add` instruction. The `sbb` instruction then sets `%edx` to either all ones or all zeros based on this carry value. If both comparisons pass, then the expression `eax = 2*edx + 1` converts this into the ±1 result we need.

The `__float128` version is also different. In this case, it is because the 128bit SSE instructions operate on fragments of the 128 bit register independently. It is difficult to transfer information from the lower to the upper parts, and vice-versa. Thus to obtain the result, we need to extract the information we need into normal 64bit registers. The first `movmskpd` instruction extracts the sign information, so `%edx` will be either 2 or 0 depending on the sign of the infinity. We next compare with the bit pattern for positive infinity, and construct a 16 bit result containing the status of the exponent and significand via the `pcmpeqw` and `pmovmskb` instructions. Finally, the information is combined in a branch-free manner to return ±1 or 0 depending on the input.

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.75 | 0.35 |
| `double` | 0.78 | 0.35 |
| `long double` | 1.55 | 0.45 |
| `__float128` | 6.3 | 0.46 |

Again, the resulting optimized assembly routines are much faster than the glibc versions, being at least twice as fast.

### fpclassify()

The above functions are defined in terms of the `fpclassify` function/macro. This returns an integer based on the classification of its parameter. To remain compatible with the glibc library, we will use the same values:

``````
#define FP_NAN       0
#define FP_INFINITE  1
#define FP_ZERO      2
#define FP_SUBNORMAL 3
#define FP_NORMAL    4
``````

The algorithm we will use is to first check the exponent. If it is all ones, branch to code that checks the significand: if the significand is all zeros, return 1, otherwise return zero. If the exponent isn't all ones, check whether it is all zeros. If not, then we have a normal number, and return 4. Finally, we return 2 or 3 depending on whether or not the significand is clear. Unfortunately, this algorithm requires a fair amount of jumping, so it is quite slow. However, the last steps can be done branchlessly via various `setcc` instructions, which helps somewhat.
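This algorithm can be sketched in portable C before dropping to assembly. The sketch below uses made-up names; the enum values mirror the glibc defines above.

``````c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Same values as glibc's FP_NAN .. FP_NORMAL, renamed to avoid
   clashing with the macros from <math.h>. */
enum { MY_FP_NAN, MY_FP_INFINITE, MY_FP_ZERO, MY_FP_SUBNORMAL, MY_FP_NORMAL };

static int fpclassify_bits(double d)
{
	uint64_t u, sig;
	unsigned exp;

	memcpy(&u, &d, sizeof(u));
	exp = (unsigned)(u >> 52) & 0x7ff;
	sig = u & 0xfffffffffffffull;

	if (exp == 0x7ff)			/* exponent all ones */
		return sig ? MY_FP_NAN : MY_FP_INFINITE;
	if (exp != 0)				/* anything in between */
		return MY_FP_NORMAL;
	return sig ? MY_FP_SUBNORMAL : MY_FP_ZERO;	/* exponent all zeros */
}
``````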

``````
fpclassify1:
	movq    %xmm0, %rdx
	mov     $0x4, %eax
	mov     %edx, %edi
	and     $0x7f800000, %edx
	je      1f
	cmp     $0x7f800000, %edx
	je      2f
	repz; retq
	.align 8
1:	shl     $9, %edi        # drop sign and exponent; ZF iff significand zero
	setne   %al
	lea     0x2(%rax), %eax # 2 = zero, 3 = subnormal
	retq
	.align 8
2:	shl     $9, %edi
	sete    %al             # 0 = NaN, 1 = infinity
	retq

.align 8
fp2c:	.quad 0x7ff0000000000000	# double exponent mask

fpclassify2:
	movq    %xmm0, %rdx
	mov     $0x4, %eax
	mov     fp2c(%rip), %rsi
	mov     %rdx, %rdi
	and     %rsi, %rdx
	je      1f
	cmp     %rsi, %rdx
	je      2f
	repz; retq
	.align 8
1:	shl     $12, %rdi       # drop sign and exponent; ZF iff significand zero
	setne   %al
	lea     0x2(%rax), %eax # 2 = zero, 3 = subnormal
	retq
	.align 8
2:	shl     $12, %rdi
	sete    %al             # 0 = NaN, 1 = infinity
	retq

fpclassify3:
	movzwl  0x10(%rsp), %edx
	mov     $0x4, %eax
	and     $0x7fff, %edx
	je      1f
	cmp     $0x7fff, %edx
	je      2f
	repz; retq
	.align 8
1:	mov     0x8(%rsp), %rdx
	test    %rdx, %rdx      # significand zero?
	setne   %al
	lea     0x2(%rax), %eax # 2 = zero, 3 = subnormal
	retq
	.align 8
2:	mov     0x8(%rsp), %rdx
	add     %rdx, %rdx      # shift out the explicit integer bit
	sete    %al             # 0 = NaN, 1 = infinity
	retq

.align 16
fp4c1:	.quad 0xffffffffffffffff, 0x7fffffffffffffff	# clear the sign bit
fp4c2:	.quad 0x7fff7fff7fff7fff, 0x00017fff7fff7fff	# saturating add constant

fpclassify4:
	pand    fp4c1(%rip), %xmm0
	mov     $0x4, %eax
	paddusw fp4c2(%rip), %xmm0
	pmovmskb %xmm0, %edx
	and     $0xaaaa, %edx
	cmp     $0x8000, %edx
	jge     1f
	pextrw  $0x7, %xmm0, %edi
	cmp     $0x1, %edi      # incremented exponent = 1: all zeros originally
	je      2f
	repz; retq
	.align 8
2:	cmp     $0x0, %edx      # any significand bits?
	setne   %al
	lea     0x2(%rax), %eax # 2 = zero, 3 = subnormal
	retq
	.align 8
1:	sete    %al             # 0 = NaN, 1 = infinity
	retq
``````

The above functions have execution times that depend on the type of numbers tested, which makes constructing a benchmark harder. We will thus test these functions in two ways. The first simply uses a normal number over and over again, corresponding to usage where most numbers are normal. The second tests all possibilities by repeatedly enumerating values stored in an array.

**Normals**

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.74 | 0.35 |
| `double` | 0.92 | 0.45 |
| `long double` | 1.17 | 0.40 |
| `__float128` | 6.1 | 0.46 |

**Mixed numbers**

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.75 | 0.51 |
| `double` | 0.88 | 0.57 |
| `long double` | 1.56 | 0.58 |
| `__float128` | 8.5 | 0.60 |

As can be seen, the mixed-input results aren't quite as much of an improvement, due to the branchy nature of the algorithm. However, the normal-only results are at least twice as fast as the glibc versions.

### Summary

Using assembly code we can obtain about a 2× speedup over the C standard library for the bit-manipulation algorithms used to classify floating point types. By using SSE instructions in non-standard ways, it is possible to construct 128bit comparisons. By using SSE and normal 64 bit code simultaneously, it is possible to construct branch-free versions of some quite complex algorithms.

The above contains quite a few assembly tricks, and it may be worthwhile investigating the source in more detail. For example, note how the `sbb` instruction can be used after a comparison to store the result. The use of shifts instead of an `and` instruction + comparison is also interesting. Another trick is to test a set of contiguous bits with an `add` causing a carry. Finally, the arithmetic shift of the infinity exponent bit-pattern to yield ±1 is also something you won't see all that often.

Obviously, spending quite so much effort optimizing normal code isn't worth it. However, the floating point classification algorithms are simple test cases that allow one to explore what the extremes of optimization make possible. By spending similar effort on the inner loops of performance-critical code, similar speed improvements may be possible. The most important thing is to think laterally: you don't have to use instructions the way they are normally used...

said...
In the article 'Classifying Floating Point Numbers' the term "magnitude" is used wrongly. You should replace all occurrences of "magnitude" with "exponent". If one were to use the term "magnitude" at all in regard to floating point numbers, it would refer to the combination of the exponent and the significand, in other words, everything except the sign.
sfuerst said...
Thanks, I've fixed the notation problem and replaced "magnitude" with the more precise "exponent" where applicable.
Jeremy said...
Have you ever compared the speed of the "fxam" instruction for classifying floating point numbers on x86/x87? It seems to do everything you provide explicit code for, in one instruction!

Herbert said...
Great article! Thanks a lot. However, I couldn't find a date of this article? How old is it? Are the statements e.g. regarding glibc still true? Please add dates to your great posts.
