# Classifying Floating Point Numbers

IEEE floating point numbers consist of three components. The least significant bits hold the "significand", an integer representing the fractional part of the number. Next comes the "exponent", a biased integer describing what power of two the significand needs to be multiplied by. Finally, the most significant bit contains the sign of the number.
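As a concrete illustration, the three fields of a `double` can be pulled apart with plain bit operations. This is only a sketch: the `decompose` helper and its parameter names are made up for illustration, and are not part of any library.

``````c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: split an IEEE double into its three fields. */
static void decompose(double d, int *sign, int *exponent, uint64_t *significand)
{
	uint64_t bits;
	memcpy(&bits, &d, sizeof(bits));	/* safe type pun, no aliasing issues */

	*sign = (int)(bits >> 63);		/* most significant bit */
	*exponent = (int)((bits >> 52) & 0x7ff);	/* 11 biased exponent bits */
	*significand = bits & 0xfffffffffffffull;	/* 52 fractional bits */
}
``````

For example, 1.0 decomposes into sign 0, a biased exponent of 1023 (the bias for `double`), and a zero fractional part.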

The C programming language on x86_64 systems exposes four different floating point types. Three of these are standard IEEE types of 32, 64 and 128 bits in size. The remaining type is an Intel-specific 80-bit floating point number. In addition to the difference in size, the Intel format stores the leading one of the significand explicitly, whereas the IEEE formats leave it implicit and store only the fractional part.

The different floating point types contain different precisions for the exponents and significands. They are also passed in different registers.

| C Type | Size in bits | Exponent size in bits | Significand size in bits | Register |
|---|---|---|---|---|
| `float` | 32 | 8 | 23 | SSE single precision |
| `double` | 64 | 11 | 52 | SSE double precision |
| `long double` | 80 | 15 | 63 + leading 1 | memory + x87 stack |
| `__float128` | 128 | 15 | 112 | SSE 128bit |

Most of the above floating point types are well known. However, the `__float128` type is a little obscure. It is currently poorly supported in the GNU glibc library, lacking most of the math library bindings. The reason is that there is currently no hardware support for this type on x86 machines. Arithmetic with this type instead falls back to a software floating point implementation, which is quite slow.

IEEE floating point numbers can represent more than just normal numbers. Special bit patterns are used for ±∞, ±0, subnormal numbers, and signaling and non-signaling NaN's. If the exponent has all bits set, or all bits clear, then the number isn't "normal". The two infinities have all exponent bits set, together with all significand bits clear. If a bit is set in the significand, then the result is a NaN. If the highest fractional significand bit is set (the second highest significand bit on the 80-bit format, just below the explicit integer bit), then the NaN is "quiet"; otherwise it is signaling.

Zero is represented by all bits clear, and negative zero has just the sign bit set. If the exponent has all bits clear, and there are still bits set in the significand, then the result is a subnormal number. These numbers do not have the implicit leading one, and allow gradual underflow.
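These encodings are easy to verify directly. The sketch below views a `double` as a 64-bit integer; the `double_bits` helper name is made up here, not taken from any library.

``````c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* View the raw bits of an IEEE double. */
static uint64_t double_bits(double d)
{
	uint64_t u;
	memcpy(&u, &d, sizeof(u));	/* safe type pun */
	return u;
}
``````

With this, +∞ is `0x7ff0000000000000`, negative zero is `0x8000000000000000` (just the sign bit), and any NaN has the exponent field all ones together with a non-zero significand.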

Many algorithms depend on testing floating point numbers to see how they are classified. For example, a data stream may represent missing points as NaN's, which then need to be tested for later on. So what is the fastest way to test for a NaN?

### isnan()

The most obvious function to use is the `isnan()` function/macro from the C math library. This will return non-zero if the floating point number is a NaN. Benchmarking 2²⁷ repetitions of this test takes, on this machine:

| Type | Time in sec |
|---|---|
| `float` | 0.68 |
| `double` | 0.74 |
| `long double` | 1.45 |
| `__float128` | 6.98 |

As can be seen, the larger the type, the slower the operation. The `__float128` case is extremely slow because glibc contains no `isnan()` implementation optimized for that type. The compiler makes an implicit conversion to the `long double` type, which slows things down enormously.

Can we do better? It turns out that `gcc` is smarter than the C library. If we use the fact that NaN's always return false for a comparison we can write a test for them. The compiler is smart enough to recognize this, and use an optimized test in the resulting machine code.

``````
int isnan1(float x)
{
	return !(x == x);
}

int isnan2(double x)
{
	return !(x == x);
}

int isnan3(long double x)
{
	return !(x == x);
}
``````

The fact that `__float128` types have no hardware support makes this technique a loser for them. Instead, we can use SSE instructions to manipulate the 128 bit values directly. With careful instruction selection, we can get a very fast function. The trick here is that the `paddusw` instruction (implied by the `_mm_adds_epu16()` intrinsic) allows us to test the exponent and significand simultaneously. After clearing the sign bit and performing the saturating addition, the result will only have the high bit of the top word set if the exponent was originally all ones. The other bits extracted by `pmovmskb` can then be massaged to test for any bits in the significand.

``````
#include <immintrin.h>

int isnan4(__float128 x)
{
	__m128i c1 = {0xffffffffffffffffull, 0x7fffffffffffffffull};
	__m128i c2 = {0x7fff7fff7fff7fffull, 0x00017fff7fff7fffull};
	__m128i x2 = *(__m128i *) &x;

	/* Clear the sign bit */
	x2 = _mm_and_si128(x2, c1);

	/* Saturating add: sets the high bit of every non-zero significand
	   word, and of the top word only if the exponent was all ones */
	x2 = _mm_adds_epu16(x2, c2);

	return (_mm_movemask_epi8(x2) & 0xaaaa) > 0x8000;
}
``````

The optimized functions are significantly faster:

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.68 | 0.40 |
| `double` | 0.74 | 0.40 |
| `long double` | 1.45 | 0.57 |
| `__float128` | 6.98 | 0.50 |

This is quite a bit faster for `float` and `double` types, nearly three times faster for `long double`'s and over an order of magnitude faster for `__float128`'s.
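For reference, the 128-bit NaN test can also be written in portable C by viewing the value as two 64-bit words. This is only a sketch, not the SSE code above: the `isnan_u128` name is made up, and the high/low split assumes the usual little-endian layout.

``````c
#include <stdint.h>

/* NaN iff the 15 exponent bits are all ones and the 112-bit
   significand (split across hi and lo) is non-zero. */
static int isnan_u128(uint64_t hi, uint64_t lo)
{
	uint64_t hi_abs = hi & 0x7fffffffffffffffull;	/* clear the sign bit */

	return hi_abs > 0x7fff000000000000ull ||
	       (hi_abs == 0x7fff000000000000ull && lo != 0);
}
``````

This makes the logic of the SSE version easier to follow: the exponent must be all ones, and at least one significand bit must be set.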

### isfinite()

The `isfinite()` function/macro returns non-zero if the floating point number passed to it is not an infinity or a NaN; otherwise it returns zero. This can be implemented in C, but it is much faster to use assembly. We can use the `pextrw` SSE instruction to extract the part of the float holding the exponent, and then compare it against the all-bits-set pattern.

``````
isfinite1:
	pextrw  $0x1, %xmm0, %rax
	and     $0x7f80, %ax
	sub     $0x7f80, %ax
	sbb     %eax, %eax
	retq

isfinite2:
	pextrw  $0x3, %xmm0, %rax
	and     $0x7ff0, %ax
	sub     $0x7ff0, %ax
	sbb     %eax, %eax
	retq

isfinite3:
	movw    0x10(%rsp), %ax
	and     $0x7fff, %ax
	sub     $0x7fff, %ax
	sbb     %eax, %eax
	retq

isfinite4:
	pextrw  $0x7, %xmm0, %rax
	and     $0x7fff, %ax
	sub     $0x7fff, %ax
	sbb     %eax, %eax
	retq
``````

The above assembly functions correspond to the `float`, `double`, `long double` and `__float128` versions respectively. Note how the `sbb` instruction creates the result from the carry flag set, or cleared, by the previous `sub` instruction: minus one is returned when the test passes, and zero if not.

This code is about twice as fast as the functions within glibc, and again over an order of magnitude faster for the `__float128` version.

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.68 | 0.35 |
| `double` | 0.68 | 0.35 |
| `long double` | 0.68 | 0.38 |
| `__float128` | 5.5 | 0.35 |
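For comparison, the same exponent test is easy to express in portable C. This is only a sketch for the `double` case with a made-up name; it is not the code benchmarked above.

``````c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Finite iff the 11 exponent bits are not all ones. */
static int isfinite_bits(double d)
{
	uint64_t u;
	memcpy(&u, &d, sizeof(u));
	return ((u >> 52) & 0x7ff) != 0x7ff;
}
``````

Infinities and NaNs share the all-ones exponent, so this single comparison rejects both at once, just as the `sub`/`sbb` pair does above.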

### signbit()

We can use a similar trick to extract the sign bit of a floating point type. However, it is faster to simply shift all of the non-required bits away rather than use a mask. The resulting optimized asm routines are only three instructions long:

``````
signbit1:
	pextrw  $0x1, %xmm0, %eax
	shr     $15, %eax
	retq

signbit2:
	pextrw  $0x3, %xmm0, %eax
	shr     $15, %eax
	retq

signbit3:
	movzwl  0x10(%rsp), %eax
	shr     $15, %eax
	retq

signbit4:
	pextrw  $0x7, %xmm0, %eax
	shr     $15, %eax
	retq
``````

Since the task is so simple, the glibc library code is the same speed for `float` and `double` types. However, the above is a fair bit more compact, since it avoids writing to the stack simply to transfer the information from the SSE register.

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.40 | 0.40 |
| `double` | 0.45 | 0.45 |
| `long double` | 1.18 | 0.52 |
| `__float128` | 5.6 | 0.40 |

Again, the `__float128` version is much faster, and in addition the performance increase for `long double` types is also significant.
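The shift trick translates directly into portable C. Again this is a sketch for the `double` case only, with a made-up helper name.

``````c
#include <stdint.h>
#include <string.h>

/* The sign is the top bit, so a single shift removes everything else. */
static int signbit_bits(double d)
{
	uint64_t u;
	memcpy(&u, &d, sizeof(u));
	return (int)(u >> 63);
}
``````

Note that, unlike a comparison against zero, this correctly reports the sign of negative zero and of negative NaNs.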

### isnormal()

The `isnormal()` function/macro returns non-zero if its parameter is not an infinity, zero, subnormal or NaN. This corresponds to a number with an exponent that is neither all bits set nor all bits clear. Since the masking `and` instruction sets the zero flag, we can simultaneously check whether the exponent bits are all zero. This means only one explicit comparison is needed in the optimized functions.
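The test can first be sketched in portable C (a made-up helper for the `double` case, not the routines below):

``````c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Normal iff the exponent field is neither all zeros nor all ones. */
static int isnormal_bits(double d)
{
	uint64_t u;
	unsigned exp;

	memcpy(&u, &d, sizeof(u));
	exp = (unsigned)(u >> 52) & 0x7ff;
	return exp != 0 && exp != 0x7ff;
}
``````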

``````
isnormal1:
	pextrw  $0x1, %xmm0, %eax
	and     $0x7f80, %ax
	je      1f
	cmp     $0x7f80, %ax
	sbb     %eax, %eax
1:	repz; retq

isnormal2:
	pextrw  $0x3, %xmm0, %eax
	and     $0x7ff0, %ax
	je      1f
	cmp     $0x7ff0, %ax
	sbb     %eax, %eax
1:	repz; retq

isnormal3:
	movzwl  0x10(%rsp), %eax
	and     $0x7fff, %ax
	je      1f
	cmp     $0x7fff, %ax
	sbb     %eax, %eax
1:	repz; retq

isnormal4:
	pextrw  $0x7, %xmm0, %eax
	and     $0x7fff, %ax
	je      1f
	cmp     $0x7fff, %ax
	sbb     %eax, %eax
1:	repz; retq
``````

The `repz` prefix is used on the return instruction to work around a limitation of the AMD64 branch predictor: it avoids having the `ret` instruction be the target of a branch. If a `ret` is a branch target, the machine cannot correctly predict the target of the return, causing a slow-down.

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 1.02 | 0.45 |
| `double` | 1.2 | 0.46 |
| `long double` | 1.65 | 0.58 |
| `__float128` | 6.6 | 0.46 |

As can be seen by the timing results, the optimized assembly versions are about twice as fast as the glibc library, and even faster for `long double` and `__float128` types.

### isinf()

The `isinf()` function/macro is a little different from the previous functions. Those may return any non-zero value when their condition is true, and `isinf()` is specified the same way in the C99 standard. However, the glibc library defines its version to return 1 for positive infinity, and -1 for negative infinity. This means that we need to make our return value depend on the sign if we are to remain compatible. Also note that `isinf()` needs to check both the exponent and the significand, since a NaN has the same exponent bit pattern as an infinity. This makes the function much more complex than the previous routines.

The big question is which is better: SSE code, or 64bit code? SSE instructions work on a large amount of data at a time, but tend to be relatively slow and inflexible. They are also rather non-orthogonal, and the exact instruction you might like to use may not exist. This makes programming using SSE instructions difficult. Conversely, the 64bit instructions are very flexible, but only work with a small amount of data at a time. It turns out that the fastest code uses both instruction sets simultaneously, extracting the best of both worlds.
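As a reference point before the assembly, the glibc-compatible semantics can be sketched in portable C (made-up helper name, `double` case only):

``````c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* glibc convention: 1 for +inf, -1 for -inf, 0 for everything else.
   A NaN shares the all-ones exponent, so the significand must be
   checked to be zero as well. */
static int isinf_bits(double d)
{
	uint64_t u;
	memcpy(&u, &d, sizeof(u));

	if ((u & 0x7fffffffffffffffull) != 0x7ff0000000000000ull)
		return 0;
	return (u >> 63) ? -1 : 1;
}
``````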

``````
.align 16
isinf1c1:	.long 0x7fffffff, 0, 0, 0	# mask to clear the sign bit
isinf1c2:	.long 0x7f7fffff, 0, 0, 0	# largest finite float

isinf1:
	movq        %xmm0, %rdx
	xor         %eax, %eax
	pand        isinf1c1(%rip), %xmm0
	sar         $30, %edx
	ucomiss     isinf1c2(%rip), %xmm0
	cmova       %edx, %eax
	retq

.align 16
isinf2c1:	.quad 0x7fffffffffffffff, 0	# mask to clear the sign bit
isinf2c2:	.quad 0x7fefffffffffffff, 0	# largest finite double

isinf2:
	movq        %xmm0, %rdx
	xor         %eax, %eax
	pand        isinf2c1(%rip), %xmm0
	sar         $62, %rdx
	ucomisd     isinf2c2(%rip), %xmm0
	cmova       %edx, %eax
	retq

isinf3:
	movzwl      0x10(%rsp), %eax
	add         %eax, %eax          # sign bit into the carry flag
	sbb         %edx, %edx          # %edx = all ones if negative
	cmp         $0xfffe, %ax        # doubled exponent all ones?
	je          2f
1:	xor         %eax, %eax
	retq
	.align 8
2:	mov         0x8(%rsp), %rax
	add         %rax, %rax          # shift out the explicit integer bit
	jne         1b                  # remaining bits set: a NaN, not infinity
	lea         0x1(%edx, %edx, 1), %eax	# ±1 from the sign
	retq

.align 16
isinf4c1:	.quad 0xffffffffffffffff, 0x7fffffffffffffff	# clear the sign bit
isinf4c2:	.quad 0x0000000000000000, 0x7fff000000000000	# +infinity pattern

isinf4:
	movmskpd    %xmm0, %edx
	pand        isinf4c1(%rip), %xmm0
	mov         $0x1, %eax
	pcmpeqw     isinf4c2(%rip), %xmm0
	pmovmskb    %xmm0, %edi
	cmp         $0xffff, %di
	cmovne      %eax, %edx
	sub         %edx, %eax
	retq
``````

How the above routines work is non-obvious. The basic algorithm for the `float` and `double` versions is to first mask off the sign bit. This turns a negative infinity into a positive infinity. We then compare with the largest non-infinite number. If we are larger, and not a NaN, then we need to return the sign. While the SSE instructions are running, we simultaneously calculate this value. The trick is that the arithmetic shift yields ±1, since we know the second most significant bit is set. Finally, the results are combined with a conditional move instruction.

The `long double` algorithm is a little different. Here we extract the sign bit into the carry flag via the first `add` instruction. The `sbb` instruction then sets `%edx` to either all ones or all zeros based on this carry value. If both comparisons pass, then the expression `eax = 2*edx + 1` converts this into the ±1 result we need.

The `__float128` version is also different. In this case, it is because the 128bit SSE instructions operate on fragments of the 128 bit register independently. It is difficult to transfer information from the lower to the upper parts, and vice-versa. Thus to obtain the result, we need to extract the information we need into normal 64bit registers. The first `movmskpd` instruction extracts the sign information, so `%edx` will be either 2 or 0 depending on the sign of the infinity. We next compare with the bit pattern for positive infinity, and construct a 16 bit result containing the status of the exponent and significand via the `pcmpeqw` and `pmovmskb` instructions. Finally, the information is combined in a branch-free manner to return ±1 or 0 depending on the input.

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.75 | 0.35 |
| `double` | 0.78 | 0.35 |
| `long double` | 1.55 | 0.45 |
| `__float128` | 6.3 | 0.46 |

Again, the resulting optimized assembly routines are much faster than the glibc versions, being at least twice as fast.

### fpclassify()

The above functions are defined in terms of the `fpclassify` function/macro. This returns an integer based on the classification of its parameter. To remain compatible with the glibc library, we will use the same values:

``````
#define FP_NAN       0
#define FP_INFINITE  1
#define FP_ZERO      2
#define FP_SUBNORMAL 3
#define FP_NORMAL    4
``````

The algorithm we will use is to first check the exponent. If it is all ones, branch to code that checks the significand: if the significand is all zeros, return 1, otherwise return zero. If the exponent isn't all ones, check whether it is all zeros. If not, then we have a normal number, and return 4. Finally, we return 2 or 3 depending on whether or not the significand is clear. Unfortunately, this algorithm requires a fair amount of jumping, so it is quite slow. However, the last steps can be done branchlessly via various `setcc` instructions, which helps somewhat.
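This algorithm can be sketched in portable C before dropping to assembly. The sketch below uses made-up names; the enum values mirror the glibc defines above.

``````c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Same values as glibc's FP_NAN .. FP_NORMAL, renamed to avoid
   clashing with the macros from <math.h>. */
enum { MY_FP_NAN, MY_FP_INFINITE, MY_FP_ZERO, MY_FP_SUBNORMAL, MY_FP_NORMAL };

static int fpclassify_bits(double d)
{
	uint64_t u, sig;
	unsigned exp;

	memcpy(&u, &d, sizeof(u));
	exp = (unsigned)(u >> 52) & 0x7ff;
	sig = u & 0xfffffffffffffull;

	if (exp == 0x7ff)			/* exponent all ones */
		return sig ? MY_FP_NAN : MY_FP_INFINITE;
	if (exp != 0)				/* anything in between */
		return MY_FP_NORMAL;
	return sig ? MY_FP_SUBNORMAL : MY_FP_ZERO;	/* exponent all zeros */
}
``````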

``````
fpclassify1:
	movq    %xmm0, %rdx
	mov     $0x4, %eax
	mov     %edx, %edi
	and     $0x7f800000, %edx
	je      1f
	cmp     $0x7f800000, %edx
	je      2f
	repz; retq
	.align 8
1:	shl     $9, %edi        # drop sign and exponent; ZF iff significand zero
	setne   %al
	lea     0x2(%rax), %eax # 2 = zero, 3 = subnormal
	retq
	.align 8
2:	shl     $9, %edi
	sete    %al             # 0 = NaN, 1 = infinity
	retq

.align 8
fp2c:	.quad 0x7ff0000000000000	# double exponent mask

fpclassify2:
	movq    %xmm0, %rdx
	mov     $0x4, %eax
	mov     fp2c(%rip), %rsi
	mov     %rdx, %rdi
	and     %rsi, %rdx
	je      1f
	cmp     %rsi, %rdx
	je      2f
	repz; retq
	.align 8
1:	shl     $12, %rdi       # drop sign and exponent; ZF iff significand zero
	setne   %al
	lea     0x2(%rax), %eax # 2 = zero, 3 = subnormal
	retq
	.align 8
2:	shl     $12, %rdi
	sete    %al             # 0 = NaN, 1 = infinity
	retq

fpclassify3:
	movzwl  0x10(%rsp), %edx
	mov     $0x4, %eax
	and     $0x7fff, %edx
	je      1f
	cmp     $0x7fff, %edx
	je      2f
	repz; retq
	.align 8
1:	mov     0x8(%rsp), %rdx
	test    %rdx, %rdx      # significand zero?
	setne   %al
	lea     0x2(%rax), %eax # 2 = zero, 3 = subnormal
	retq
	.align 8
2:	mov     0x8(%rsp), %rdx
	add     %rdx, %rdx      # shift out the explicit integer bit
	sete    %al             # 0 = NaN, 1 = infinity
	retq

.align 16
fp4c1:	.quad 0xffffffffffffffff, 0x7fffffffffffffff	# clear the sign bit
fp4c2:	.quad 0x7fff7fff7fff7fff, 0x00017fff7fff7fff	# saturating add constant

fpclassify4:
	pand    fp4c1(%rip), %xmm0
	mov     $0x4, %eax
	paddusw fp4c2(%rip), %xmm0
	pmovmskb %xmm0, %edx
	and     $0xaaaa, %edx
	cmp     $0x8000, %edx
	jge     1f
	pextrw  $0x7, %xmm0, %edi
	cmp     $0x1, %edi      # incremented exponent = 1: all zeros originally
	je      2f
	repz; retq
	.align 8
2:	cmp     $0x0, %edx      # any significand bits?
	setne   %al
	lea     0x2(%rax), %eax # 2 = zero, 3 = subnormal
	retq
	.align 8
1:	sete    %al             # 0 = NaN, 1 = infinity
	retq
``````

The above functions have execution times that depend on the type of numbers tested, which makes constructing a benchmark harder. We will thus test these functions in two ways. The first simply uses a normal number over and over again, corresponding to usage where most numbers are normal. The second tests all possibilities by repeatedly enumerating values stored in an array.

**Normals**

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.74 | 0.35 |
| `double` | 0.92 | 0.45 |
| `long double` | 1.17 | 0.40 |
| `__float128` | 6.1 | 0.46 |

**Mixed numbers**

| Type | Original Time in sec | Optimized Time in sec |
|---|---|---|
| `float` | 0.75 | 0.51 |
| `double` | 0.88 | 0.57 |
| `long double` | 1.56 | 0.58 |
| `__float128` | 8.5 | 0.60 |

As can be seen, the mixed-input results aren't quite as much of an improvement, due to the branchy nature of the algorithm. However, the normal-only results are at least twice as fast as the glibc versions.

### Summary

Using assembly code we can obtain about a 2× speedup over the C standard library for the bit-manipulation algorithms used to classify floating point types. By using SSE instructions in non-standard ways, it is possible to construct 128bit comparisons. By using SSE and normal 64 bit code simultaneously, it is possible to construct branch-free versions of some quite complex algorithms.

The above contains quite a few assembly tricks, and it may be worthwhile investigating the source in more detail. For example, note how the `sbb` instruction can be used after a comparison to store the result. The use of shifts instead of an `and` instruction + comparison is also interesting. Another trick is to test a set of contiguous bits with an `add` causing a carry. Finally, the arithmetic shift of the infinity exponent bit-pattern to yield ±1 is also something you won't see all that often.

Obviously, spending quite so much effort optimizing normal code isn't worth it. However, the floating point classification algorithms are simple test cases that allow one to explore what the extremes of optimization make possible. By spending similar effort on the inner loops of performance-critical code, similar speed improvements may be possible. The most important thing is to think laterally: you don't have to use instructions the way they are normally used...

said...
In the article 'Classifying Floating Point Numbers' the term "magnitude" is used wrongly. You should replace all occurrences of "magnitude" with "exponent". If one were to use the term "magnitude" at all in regard to floating point numbers, it would refer to the combination of the exponent and the significand, in other words, everything except the sign.
sfuerst said...
Thanks, I've fixed the notation problem and replaced "magnitude" with the more precise "exponent" where applicable.
Jeremy said...
Have you ever compared the speed of the "fxam" instruction for classifying floating point numbers on x86/x87? It seems to do everything you provide explicit code for, in one instruction!

Herbert said...
Great article! Thanks a lot. However, I couldn't find a date of this article? How old is it? Are the statements e.g. regarding glibc still true? Please add dates to your great posts.
