Lockless Inc

Classifying Floating Point Numbers

IEEE Floating point numbers consist of three components. The first part is the "significand" which contains an integer which represents the fractional part of the number. Next, follows the "exponent" which is an offseted integer describing what power of two the significand needs to be multiplied by. Finally, the most significant bit contains the sign of the number.

The C programming language on x86_64 systems exposes four different floating point types. Three of these are standard IEEE types of 32, 64 and 128 bits in size. The remaining type is an Intel specific 80 bit floating point number. In addition to the difference in size, the Intel numbers include the leading one in the significand, whereas the IEEE floating point numbers only include the fractional part leaving the leading one implicit.

The different floating point types contain different precisions for the exponents and significands. They are also passed in different registers.

C TypeSize in bitsExponent size in bitsSignificand size in bitsRegister
float32823SSE single precision
double641152SSE double precision
long double801563 + leading 1memory + x87 stack
__float12812815112SSE 128bit

Most of the above floating point types are well known. However, the __float128 type is a little obscure. It is currently poorly supported in the gnu glibc library, lacking most of the math library bindings. The reason for this is that there is currently no hardware support for this type on x86 machines. Software based floating point is used for arithmetic with this type instead, which is quite slow.

IEEE floating point numbers can represent more than just normal numbers. Special bit patterns are used for ±∞, ±0, subnormal numbers, and signaling and non-signaling NaN's. If the exponent has all bits set, or all bits clear, then the number isn't "normal". The two infinities have all exponent bits set, together with all significand bits clear. If a bit is set in the significand, then the result is a NaN. If the highest (or second highest on 80 bit floating point) significand bit is set, then the NaN is signaling, otherwise it is a "quiet" NaN.

Zero is represented by all bits clear, and negative zero has just the sign bit set. If the exponent has all bits clear, and there are still bits set in the significand, then the result is a subnormal number. These numbers do not have the implicit leading one, and allow gradual underflow.

Many algorithms depend on testing floating point numbers to see which type they are classified as. For example, a data stream may represent missing points as NaN's, which will then need to be tested for later on. So what is the fastest way for testing for a NaN?

isnan()

The most obvious function to use is the isnan() function/macro from the C math library. This will return non-zero if the floating point number is a NaN. Benchmarking repeating this 227 times, takes on this machine:

TypeTime in sec
float0.68
double0.74
long double1.45
__float1286.98

As can be seen, the larger the type, the slower the operation is. The __float128 case is extremely slow. This is due to the non-existence of a isnan() function optimized for that type in the glibc library. The compiler makes an implicit conversion to long double type, which slows things down enormously.

Can we do better? It turns out that gcc is smarter than the C library. If we use the fact that NaN's always return false for a comparison we can write a test for them. The compiler is smart enough to recognize this, and use an optimized test in the resulting machine code.


int isnan1(float x)
{
	return !(x == x);
}

int isnan2(double x)
{
	return !(x == x);
}

int isnan3(long double x)
{
	return !(x == x);
}

The fact that __float128 types do not have hardware support makes this technique a loser for them. Instead, we can use SSE instructions to manipulate the 128 bit values. With careful instruction selection, we can get a very fast function. The trick here is that the paddusw instruction (implied by the _mm_adds_epu16() intrinsic) allows us to test the exponent and significand simultaneously. The result after clearing the sign followed by the bounded addition, will only have the high bit set if the exponent was originally all ones. The other bits extracted by the pmovmskb can then be massaged to test for any bits in the significand.


int isnan4(__float128 x)
{
	__m128i c1 = {0xffffffffffffffffull, 0x7fffffffffffffffull};
	__m128i c2 = {0x7fff7fff7fff7fffull, 0x00017fff7fff7fffull};
	__m128i x2 = *(__m128i *) &x;
	
	x2 &= c1;
	x2 = _mm_adds_epu16(c2, x2);
	return (_mm_movemask_epi8(x2) & 0xaaaa) > 0x8000;
}

The optimized functions are significantly faster:

TypeOriginal Time in secOptimized Time in sec
float0.680.40
double0.740.40
long double1.450.57
__float1286.980.50

This is quite a bit faster for float and double types, nearly three times faster for long double's and over an order of magnitude faster for __float128's.

isfinite()

The isfinite() function/macro returns non-zero if the floating point number passed to it is not an infinity or NaN. Otherwise it returns zero. This can be implemented in C, but it is much faster to use assembly. We can use the pextrw SSE instruction to extract part of the float, and then do a compare to the all-bits set pattern to test for this.


isfinite1:
	pextrw  $0x1, %xmm0, %rax
	and     $0x7f80, %ax
	sub     $0x7f80, %ax
	sbb     %eax, %eax
	retq

isfinite2:
	pextrw  $0x3, %xmm0, %rax
	and     $0x7ff0, %ax
	sub     $0x7ff0, %ax
	sbb     %eax, %eax
	retq

isfinite3:
	movw    0x10(%rsp), %ax
	and     $0x7fff, %ax
	sub     $0x7fff, %ax
	sbb     %eax, %eax
	retq

isfinite4:
	pextrw  $0x7, %xmm0, %rax
	and     $0x7fff, %ax
	sub     $0x7fff, %ax
	sbb     %eax, %eax
	retq

Where the above assembly functions correspond to the float, double, long double and __float128 versions respectively. Note how the sbb instruction is used to create the result via the carry flag set, or unset, by the previous sub instruction. Minus one is returned when the test passes, and zero if not.

This code is about twice as fast as the functions within glibc, and again over an order of magnitude faster for the __float128 version.

TypeOriginal Time in secOptimized Time in sec
float0.680.35
double0.680.35
long double0.680.38
__float1285.50.35

signbit()

We can use a similar trick to extract the sign bit of a floating point type. However, it is faster to simply shift all of the non-required bits away rather than use a mask. The resulting optimized asm routines are only three instructions long:


signbit1:
	pextrw  $0x1, %xmm0, %eax
	shr     $15, %eax
	retq

signbit2:
	pextrw  $0x3, %xmm0, %eax
	shr     $15, %eax
	retq

signbit3:
	movzwl  0x10(%rsp),%eax
	shr     $15, %eax
	retq

signbit4:
	pextrw  $0x7, %xmm0, %eax
	shr     $15, %eax
	retq

Since the task is so simple, the glibc library code is the same speed for float and double types. However, the above is a fair bit more compact, since it avoids writing to the stack simply to transfer the information from the SSE register.

TypeOriginal Time in secOptimized Time in sec
float0.400.40
double0.450.45
long double1.180.52
__float1285.60.40

Again, the __float128 version is much faster, and in addition the performance increase for long double types is also significant.

isnormal()

The isnormal() function/macro returns nonzero if its parameter is not an infinity, zero, subnormal or NaN. This corresponds to a number with a exponent that is not either all bits set, or all bits clear. By masking with an and instruction, we can simultaneously check to see if the exponent bits are all zero. This allows us to only need one explicit comparison in the optimized functions.


isnormal1:
	pextrw  $0x1, %xmm0, %eax
	and     $0x7f80, %ax
	je      1f
	cmp     $0x7f80, %ax
	sbb %eax, %eax
1:	repz; retq

isnormal2:
	pextrw  $0x3, %xmm0, %eax
	and     $0x7ff0, %ax
	je      1f
	cmp     $0x7ff0, %ax
	sbb     %eax, %eax
1:	repz; retq

isnormal3:
	movzwl  0x10(%rsp), %eax
	and     $0x7fff, %ax
	je      1f
	cmp     $0x7fff, %ax
	sbb     %eax, %eax
1:	repz; retq

isnormal4:
	pextrw  $0x7, %xmm0, %eax
	and     $0x7fff, %ax
	je      1f
	cmp     $0x7fff, %ax
	sbb     %eax, %eax
1:	repz; retq

The repz prefix is used on the return instruction to work around a limitation of the AMD64 branch predictor. It avoids having the ret as a target of a branch. Apparently, if this is the case, then the machine cannot correctly predict the target of the return and thus causing a slow-down.

TypeOriginal Time in secOptimized Time in sec
float1.020.45
double1.20.46
long double1.650.58
__float1286.60.46

As can be seen by the timing results, the optimized assembly versions are about twice as fast as the glibc library, and even faster for long double and __float128 types.

isinf()

The isinf() function/macro is a little different than the previous functions. They can return any non-zero value if their condition is true. isinf() is specified the same way in the C99 standard. However, the glibc library defines its version to return 1 for positive infinity, and -1 for negative infinity. This means that we need to make our return value dependent on the sign if we are to remain compatible. Also note how the isinf() function needs to check both the exponent and the significand since a NaN will have the same exponent bit pattern as an infinity. This makes this function much more complex than the previous routines.

The big question is which is better: SSE code, or 64bit code? SSE instructions work on a large amount of data at a time, but tend to be relatively slow and inflexible. They are also rather non-orthogonal, and the exact instruction you might like to use may not exist. This makes programming using SSE instructions difficult. Conversely, the 64bit instructions are very flexible, but only work with a small amount of data at a time. It turns out that the fastest code uses both instruction sets simultaneously, extracting the best of both worlds.


  	.align 16
isinf1c1:
	.quad 0x000000007fffffff
	.quad 0x0000000000000000
isinf1c2:	
	.quad 0x000000007f7fffff
	.quad 0x0000000000000000

isinf1:
	movq        %xmm0, %rdx
	xor         %eax, %eax
	pand        isinf1c1(%rip), %xmm0
	sar         $30, %edx
	ucomiss     isinf1c2(%rip), %xmm0
	cmova       %edx, %eax
	retq

	.align 16
isinf2c1:
	.quad 0x7fffffffffffffff
	.quad 0x0000000000000000
isinf2c2:	
	.quad 0x7fefffffffffffff
	.quad 0x0000000000000000

isinf2:
	movq        %xmm0, %rdx
	xor         %eax, %eax
	pand        isinf2c1(%rip), %xmm0
	sar         $62, %rdx
	ucomisd     isinf2c2(%rip), %xmm0
	cmova       %edx, %eax
	retq

isinf3:
	movzwl      0x10(%rsp), %eax
	add         %ax, %ax
	sbb         %edx, %edx
	cmp         $0xfffe, %ax
	je          2f
1:	xor         %eax, %eax
	retq
	.align8
2:	mov         0x8(%rsp), %rax
	add         %rax, %rax
	jne         1b
	lea         0x1(%edx, %edx, 1), %eax
	retq
	
	.align 16
isinf4c1:
	.quad 0xffffffffffffffff
	.quad 0x7fffffffffffffff
isinf4c2:	
	.quad 0x0000000000000000
	.quad 0x7fff000000000000
isinf4:
	movmskpd    %xmm0, %edx
	pand        isinf4c1(%rip), %xmm0
	mov         $0x1, %eax
	pcmpeqw     isinf4c2(%rip), %xmm0
	pmovmskb    %xmm0, %edi
	cmp         $0xffff, %di
	cmovne      %eax, %edx
	sub         %edx, %eax
	retq

How the above routines work is non-obvious. The basic algorithm for the float and double versions is to first mask of the sign bit. This turns a negative infinity into a positive infinity. We then do a comparison with the largest non-infinite number. If we are larger, and not a NaN, then we need to return the sign. While the SSE instructions are running, we are simultaneously calculating this value. The trick is that the arithmetic shift will yield ±1 since we know the second most significant bit is set. Finally, the results are combined with a conditional move instruction.

The long double algorithm is a little different. Here we extract the sign bit into the carry flag via the first add instruction. The sbb instruction then sets %edx to either all ones or all zeros based on this carry value. If both comparisons pass, then we use the expression eax = 2*edx+1 to convert this into ±1 result we need.

The __float128 version is also different. In this case, it is because the 128bit SSE instructions operate on fragments of the 128 bit registers independently. It is difficult to transfer information from the lower to the upper parts, and vice-versa. Thus to obtain the results, we need to extract the information we need, and store it in normal 64bit registers. The first movmskpd instruction extracts the sign information, thus %edx will be either 2 or 0 depending on the sign of the infinity. We next compare with the bit pattern for possitive infinity, and construct a 16 bit result that contains the information about the status of the exponent and significand via the pcmpeqw and pmovmskb instructions. Finally, the information is combined to return ±1 or 0 depending on the input in a branch-free manner.

TypeOriginal Time in secOptimized Time in sec
float0.750.35
double0.780.35
long double1.550.45
__float1286.30.46

Again, the resulting optimized assembly routines are much faster than the glibc versions, being at least twice as fast.

fpclassify()

The above functions are defined in terms of the fpclassify function/macro. This returns an integer based on the classification of its parameter. To remain compatible with the glibc library, we will use the same values:


#define FP_NAN       0
#define FP_INFINITE  1
#define FP_ZERO      2
#define FP_SUBNORMAL 3
#define FP_NORMAL    4

The algorithm we will use is to firstly check the expoenent. If it is all-ones, branch to some code that checks the significand. If the significand is all-zeros, return 1, otherwise return zero. If the exponent isn't all-ones, check to see if it is all-zeros. If not, then we have a normal number, and return 4. Finally, we need to return 2 or 3 depending on whether or not the significand is clear. Unfortunately, this algorithm requires a fair amount of jumping, so is quite slow. However, the last steps can be done branchlessly via various setcc instructions, which helps somewhat.


fpclassify1:
	movq    %xmm0, %rdx
	mov     $0x4, %eax
	mov     %edx, %edi
	and     $0x7f800000, %edx
	je      1f
	cmp     $0x7f800000, %edx
	je      2f
	repz; retq
	.align 8
1:	shl     $9, %edi
	setne   %al
	add     $0x2, %eax
	retq
	.align 8
2:	shl     $9, %edi
	sete    %al
	retq

fp2c: .quad 0x7ff0000000000000

fpclassify2:
	movq    %xmm0, %rdx
	mov     $0x4, %eax
	mov     fp2c(%rip), %rsi
	mov     %rdx, %rdi
	and     %rsi, %rdx
	je      1f
	cmp     %rsi, %rdx
	je      2f
	repz; retq
	.align 8
1:	shl     $12, %rdi
	setne   %al
	add     $0x2, %eax
	retq
	.align 8
2:	shl     $12, %rdi
	sete    %al
	retq

fpclassify3:
	movzwl  0x10(%rsp), %edx
	mov     $0x4, %eax
	and     $0x7fff, %edx
	je      1f
	cmp     $0x7fff, %edx
	je      2f
	repz; retq
	.align 8
1:	mov     0x8(%rsp), %rdx
	add     %rdx, %rdx
	setne   %al
	add     $0x2, %eax
	retq
	.align 8
2:	mov     0x8(%rsp), %rdx
	add     %rdx, %rdx
	sete    %al
	retq

	.align 16
fp4c1:
	.quad 0xffffffffffffffff
	.quad 0x7fffffffffffffff
fp4c2:	
	.quad 0x7fff7fff7fff7fff
	.quad 0x00017fff7fff7fff

fpclassify4:
	pand    fp4c1(%rip), %xmm0
	paddusw fp4c2(%rip), %xmm0
	mov     $0x4, %eax
	pmovmskb %xmm0, %edx
	and     $0xaaaa, %edx
	cmp     $0x8000, %edx
	jge     1f
	pextrw  $0x7, %xmm0, %edi
	cmp     $0x1, %edi
	je      2f
	repz; retq
	.align 8
2:	cmp     $0x0, %edx
	setne   %al
	add     $0x2, %eax
	retq
	.align 8
1:	sete    %al
	retq

The above functions have execution times that depend on the the type of numbers tested. This makes constructing a benchmark to test them harder. We will thus test these functions in two ways. The first is to simply use a normal number over and over again. This corresponds to use where most numbers are normal. The other case will test all possibilities by enumerating the values within an array repeatedly.

Normals
TypeOriginal Time in secOptimized Time in sec
float0.740.35
double0.920.45
long double1.170.40
__float1286.10.46

 

Mixed numbers
TypeOriginal Time in secOptimized Time in sec
float0.750.51
double0.880.57
long double1.560.58
__float1288.50.60

As can be seen, the mixed input results aren't quite as much of an improvement. However, the normal-only results are at least twice as the glibc. This is due to the branchy nature of the algorithm.

Summary

Using assembly code we can obtain about a 2× speedup for the bit-manipulation algorithms used to classify floating point types, over the C standard library. By using SSE instructions in non-standard ways, it is possible construct 128bit comparisons. By using SSE and normal 64 bit code simultaneously it is possible to construct branch-free versions of some quite complex algorithms.

The above contains quite a few assembly tricks, and it may be worthwhile investigating the source in more detail. For example, note how the sbb instruction can be used after a comparison to store the result. The use of shifts instead of an and instruction + comparison is also interesting. Another trick is to test a set of contiguous bits with an add causing a carry. Finally, the arithmetic shift of the infinity exponent bit-pattern to yield ±1 is also something you won't see all that often.

Obviously spending quite so much effort optimizing normal code isn't worth it. However, the floating point classification algorithms are simple test cases that allow one to explore what possibilities the extremes of optimization allow. By spending similar effort on your inner loops in performance-critical code, similar speed improvements may be possible. The most important thing is to think laterally, you don't have to use instructions the way they are normally used...

Comments

said...
In the article 'Classifying Floating Point Numbers' the term "magnitude" is used wrongly. You should replace all ocurrences of "magnitude" with "exponent". If one was to use the term "magnitude" at all in regard to floating point numbers it would refer to the combination of the exponent and the significand, in other words, everything except the sign.
sfuerst said...
Thanks, I've fixed the notation problem and replaced "magnitude" with the more precise "exponent" where applicable.
Jeremy said...
Have you ever compared the speed of the "fxam" instruction for classifying floating point numbers on x86/x87? It seems to do everything you provide explicit code for, in one instruction!


Enter the 10 characters above here


Name
About Us Returns Policy Privacy Policy Send us Feedback
Company Info | Product Index | Category Index | Help | Terms of Use
Copyright © Lockless Inc All Rights Reserved.