Lockless Inc

Implementing int2float()

The Linux Kernel exists within a special coding context. It has different constraints than user-space code. For example, stack space is limited, and various run-time features of the C standard library may not exist. (Use kmalloc() instead of malloc().) In addition, there may be hardware-specific differences to the way code executes in a kernel context. On x86_64, the kernel lives in a different address space than user space. The uppermost addressing bits are ones instead of zeros. Finally, you can't use floating point within the kernel

That isn't quite true. The RAID code does use floating point, but it pays some cost in doing so. The reason is that floating point state isn't saved and restored by the context-switch entry point. This means that if the kernel executes any floating point instructions, it can corrupt userspace information.

Note that you can't simply save and restore all the floating point registers. You also need to save and restore internal floating point flag state as well. For example, non-standard rounding modes may be enabled, or a floating point exception from a signalling NaN may be pending. Finally, the size of the floating point registers is hardware dependent. On newer X86, there are AVX registers that are 256 bits in size, whereas older machines only have the 128 bit SSE registers. This complexity (just for a single architecture, supporting others is even more difficult) means that floating point use is highly frowned upon.

This poses a problem. Some hardware actually depends on getting IEEE floating point numbers as part of its operation. If those numbers can come from user-space, it isn't too much of an issue. We can pass the information via memory, or by integer registers. However, sometimes the kernel itself needs to calculate floating point numbers. An example of this are GPU drivers. OpenGL uses floats to describe 3d space, and it is natural for hardware to mimic this interface.

Thus we need a way of calculating the IEEE floating point bitwise representation of an integer, but without using the convenient integer to floating point conversion instructions. The task today is to find the most efficient way of constructing such a function.

Straight-Forward Method

Recent versions of the Linux kernel use the following routine to calculate the floating point number corresponding to an unsigned integer: [It is actually called i2f()]


#define I2F_MAX_BITS 15
#define I2F_MAX_INPUT  ((1 << I2F_MAX_BITS) - 1)
#define I2F_SHIFT (24 - I2F_MAX_BITS)

/*
 * Converts unsigned integer into 32-bit IEEE floating point representation.
 * Conversion is not universal and only works for the range from 0
 * to 2^I2F_MAX_BITS-1. Currently we only use it with inputs between
 * 0 and 16384 (inclusive), so I2F_MAX_BITS=15 is enough. If necessary,
 * I2F_MAX_BITS can be increased, but that will add to the loop iterations
 * and slow us down. Conversion is done by shifting the input and counting
 * down until the first 1 reaches bit position 23. The resulting counter
 * and the shifted input are, respectively, the exponent and the fraction.
 * The sign is always zero.
 */
uint32_t u2f(uint32_t input)
{
	uint32_t result, i, exponent, fraction;

	if ((input & I2F_MAX_INPUT) == 0)
		result = 0;
	else {
		exponent = 126 + I2F_MAX_BITS;
		fraction = (input & I2F_MAX_INPUT) << I2F_SHIFT;

		for (i = 0; i < I2F_MAX_BITS; i++) {
			if (fraction & 0x800000)
				break;
			else {
				fraction = fraction << 1;
				exponent = exponent - 1;
			}
		}
		result = exponent << 23 | (fraction & 0x7fffff);
	}
	return result;
}

As can be seen by the comment, the function isn't perfect. It only works for numbers up to 215-1, and is quite inefficient. The larger the numbers to support, the slower the algorithm is.

The problem is that the code finds out what the exponent and fraction are by doing a linear scan. We need a better method that does a faster search. Fortunately, there is the fls() function that obtains the location of the most significant bit. On x86, this corresponds to using an asm intrinsic to invoke the bsr opcode.

Thus we can replace the loop above with a single call to fls(). Due to the simpler algorithm, we can also expand the supported range to all numbers less than 224 or so. In general, numbers above this cannot be exactly represented as floating point, due to the limited number of fraction bits.

What we would like to do is support all unsigned 32 bit integers. The problem there is that we need to modify the algorithm somewhat. Numbers below 224 need to have their fraction bits shifted to the left. Numbers above, need to have the fraction bits shifted to the right. So it looks like we may need to have a branch for either case.

There is a trick, however. We can use the magic of the rotate instruction. A rotate (followed by a mask) can behave like a shift either to the left, or to the right. If we are careful about what values are passed to the rotate, we can also avoid some arithmetic. On x86, rotate instructions are taken modulo the size of the register i.e. 32 in this case, which allows further simplification. The result is the following routine:


/* 23 bits of float fractional data */
#define I2F_FRAC_BITS	23
#define I2F_MASK ((1 << I2F_FRAC_BITS) - 1)

/*
 * Converts unsigned integer into 32-bit IEEE floating point representation.
 * Will be exact from 0 to 2^24.  Above that, we round towards zero
 * as the fractional bits will not fit in a float.  (It would be better to
 * round towards even as the fpu does, but that is slower.)
 */
static uint32_t u2f(uint32_t x)
{
	uint32_t msb, exponent, fraction;

	/* Zero is special */
	if (!x) return 0;

	/* Get location of the most significant bit */
	msb = __fls(x);

	/*
	 * Use a rotate instead of a shift because that works both leftwards
	 * and rightwards due to the mod(32) behaviour.  This means we don't
	 * need to check to see if we are above 2^24 or not.
	 */
	fraction = ror32(x, (msb - I2F_FRAC_BITS) & 0x1f) & I2F_MASK;
	exponent = (127 + msb) << I2F_FRAC_BITS;

	return fraction + exponent;
}

As described in the comment, this function doesn't quite output the same values as your floating-point co-processor would. Above 224 the different choices of rounding mode matter. We chose the faster method of rounding towards zero. This is equivalent to truncation. Your fpu will likely choose "round to even", which has better numerical properties. Finally, note that the mod 32 (via &0x31) is there to support cpus other than x86. GCC helpfully removes that redundant operation when the rotate instruction doesn't need it.

Assembly

The above function is much more efficient than its earlier counterpart. It gets compiled by GCC into:


.globl u2f_c
.type u2f_c,@function
.align 16
u2f_c:
	xor    %eax, %eax
	test   %edi, %edi
	je     1f
	mov    %edi, %eax
	bsr    %rax, %rax
	lea    -0x17(%rax), %ecx
	add    $0x7f, %eax
	shl    $0x17, %eax
	ror    %cl, %edi
	and    $0x7fffff, %edi
	add    %edi, %eax
1:	rep; retq
.size u2f_c, .-u2f_c

There is some extra redundancy in the above. The test for zero can be removed. The result of doing the bsr instruction will set the zero flag in that case. We can also use the fact that bsr doesn't affect its output register if it has a zero input on modern x86. (This behaviour is officially undocumented, but is now used elsewhere within the Linux Kernel at the blessing of the CPU manufacturers.) Thus we can simplify the function to:


.globl u2f_asm
.type u2f_asm,@function
.align 16
u2f_asm:
	mov    %edi, %eax
	bsr    %eax, %eax
	je     1f
	lea    -0x17(%rax), %ecx
	add    $0x7f, %eax
	shl    $0x17, %eax
	ror    %cl, %edi
	and    $0x7fffff, %edi
	add    %edi, %eax
1:	rep; retq
.size u2f_asm, .-u2f_asm

The question is is this worth it? The original function takes 34 seconds to convert all possible 32 bit unsigned integers. The new function takes 30 seconds. The 10% speedup may, or may not be worth the maintainence burden of an assembly language routine.

Signed integers

The previous routines dealt with unsigned integers. Creating functions that work with signed integers isn't much more difficult. We just need to check the sign of the input, and then toggle the sign bit of the output floating point number. (Of course, always using the positive absolute value for the calculation.)

Thus a C version of the signed case might look like:


uint32_t i2f(int32_t x)
{
	if (x < 0) return u2f(-x) | 0x80000000;
	return u2f(x);
}

Unfortunately, this isn't handled very efficiently by GCC. It doesn't understand that the negative case could never also equal zero. It also doesn't fold the setting of the sign bit into other constants. The result thus looks like:


.globl i2f_c
.type i2f_c,@function
.align 16
i2f_c:
	cmp    $0x0, %edi
	jl     2f
	mov    $0x0, %eax
	je     1f
	mov    %edi, %eax
	bsr    %rax, %rax
	lea    -0x17(%rax), %ecx
	add    $0x7f, %eax
	shl    $0x17, %eax
	ror    %cl, %edi
	and    $0x7fffff, %edi
	add    %edi, %eax
1:  rep; retq 

2:  nopw   0x0(%rax,%rax,1)
	neg    %edi
	mov    $0x80000000, %eax
	je     1b
	mov    %edi, %edx
	bsr    %rdx, %rdx
	lea    -0x17(%rdx), %ecx
	add    $0x7f, %edx
	shl    $0x17, %edx
	ror    %cl, %edi
	and    $0x7fffff, %edi
	lea    (%rdi,%rdx,1), %eax
	or     $0x80000000, %eax
	retq   
.size i2f_c, .-i2f_c

This function takes 47 seconds to convert all 32 bit signed integers to their floating point counterparts.

We can be a little more intelligent. With a single neg instruction, we can simultaneously test for zero, positive or negativeness, and also flip the sign of the input. Using this, we can use conditional moves to control the body of the function. The resulting routine looks like:


.globl i2f_asm_cmov
.type i2f_asm_cmov,@function
.align 16
i2f_asm_cmov:
	mov    %edi, %eax
	neg    %edi
	je     1f
	mov    $0x17f, %edx
	mov    $0x7f, %ecx
	cmovl  %eax, %edi
	cmovl  %ecx, %edx
	bsr    %edi, %eax
	lea    -0x17(%rax), %ecx
	add    %edx, %eax
	shl    $0x17, %eax
	ror    %cl, %edi
	and    $0x7fffff, %edi
	add    %edi, %eax
1:	rep; retq
.size i2f_asm_cmov, .-i2f_asm_cmov

The above is a little faster than the C routine, taking 44 seconds to run over all inputs.

What happens if we try to reproduce the method that gcc uses, and have two separate branches for the positive and negative cases? We can optimize things a bit by using smarter constants, and by simplifying the check for zero. Doing that looks like:


.globl i2f_asm_jmp2
.type i2f_asm_jmp2,@function
.align 16
i2f_asm_jmp2:
	mov    %edi, %eax
	neg    %edi
	je     1f
	jl     2f
	bsr    %edi, %eax
	lea    -0x17(%rax), %ecx
	add    $0x17f, %eax
	shl    $0x17, %eax
	ror    %cl, %edi
	and    $0x7fffff, %edi
	add    %edi, %eax
1:  rep; retq
2:  bsr    %eax, %edi
	lea    -0x17(%rdi), %ecx
	add    $0x7f, %edi
	shl    $0x17, %edi
	ror    %cl, %eax
	and    $0x7fffff, %eax
	add    %edi, %eax
    rep; retq
.size i2f_asm_jmp2, .-i2f_asm_jmp2

It also takes 44 seconds to run. Perhaps we can do better. There is a well-known way to calculate the absolute value of a number within assembly. You can use sign-extension followed by an xor and subtract. Finally, we can use the sign-mask to also perturb the calculation constants. Using those bit-tricks, our routine looks like:


.globl i2f_asm_abs
.type i2f_asm_abs,@function
.align 16
i2f_asm:
	mov    %edi, %eax
	cdq
	xor    %edx, %eax
	sub    %edx, %eax
	je     1f
	mov    %eax, %edi
	bsr    %eax, %eax
	and    $0x100, %edx
	lea    -0x17(%rax), %ecx
	lea    0x7f(%rdx, %rax), %eax
	shl    $0x17, %eax
	ror    %cl, %edi
	and    $0x7fffff, %edi
	add    %edi, %eax
1:	rep; retq
.size i2f_asm_abs, .-i2f_asm_abs

It executes one second faster than the previous routine, at 43 seconds for all inputs.

Finally, we could try a dumber method. Just use a simple branch around setting the constants depending on the sign:


.globl i2f_asm_jmp
.type i2f_asm_jmp,@function
.align 16
i2f_asm_jmp:
	mov    %edi, %eax
	neg    %eax
	je     2f
	mov    $0x7f, %edx
	jl     1f
	mov    $0x17f, %edx
	mov    %eax, %edi

1:	bsr    %edi, %eax
	lea    -0x17(%rax), %ecx
	add    %edx, %eax
	shl    $0x17, %eax
	ror    %cl, %edi
	and    $0x7fffff, %edi
	add    %edi, %eax
2:	rep; retq
.size i2f_asm_jmp, .-i2f_asm_jmp

This, surprisingly, is the fastest method of all. It takes 35 seconds to run over all possible inputs. The shorter code works better than the other branching algorithm. The fact that the benchmark (which scans linearly) is extremely predictable makes it better than the cmov-using function. In general, cmov works better with unpredictable input, and jumps are better in pretty much any other situation.

Of course, much of this testing is highly micro-arch dependent. This machine has a slow bsr implementation. Other machines may show bottlenecks elsewhere, and a different version of the above functions may prove to be the best. However, the above does show that it is possible to efficiently implement some floating point features purely with integer code. The fact that the Linux Kernel doesn't use floating point isn't much of a constraint after all.

The above routines convert 32 bit signed and unsigned integers into 32 bit single-precision floating point numbers. The conversion of 64 bit integers into double-precision floating point numbers isn't too different. All that is required is the use of 64 bit registers, and the modifying of the constants used. The code structure and algorithm can stay exactly the same.

Comments

Steven said...
Have you compared with specialized CPU instructions such as FILD (float load int into double) and FIST (float store as int)? Somehow, even if they should be much faster than the procedural conversions, I have the impression that they're not all that fast. I'd be curious to see their speed.


(also, the code for i2f is incorrect if the number is the smallest (largest negative) value. For example, on 16 bits, it ranges from -32768 to 32767. -(-32768) is not 32768)
sfuerst said...
The whole point is that you can't use the specialized CPU instructions inside the kernel, as floating point is off-limits. It is unlikely that these routines are as fast as the hardware ones.

Are you sure the code for i2f() is broken? It's been tested against the obvious float cast for all 32bit values. The negation of INT_MIN gives INT_MIN, but then the implicit cast to the unsigned type will yield 0x80000000 as an unsigned value. u2f() then works as intended...
Steven said...
I wasn't aware that this was a hard limitation in the kernel. Why prevent the use of floating point operations? To prevent to have a context-save?
sfuerst said...
The floating point registers are not saved on kernel entry. This increases performance due to avoiding the memory bandwidth cost of saving+restoring them.

However, the cost of supporting floating point within the kernel is even worse than that. You also need to save+restore things like the rounding mode and exception state. It is a horrible pain, that is only worth doing in rare cases. One case where the pain is outweighed by the gain is within SSE-accelerated RAID code.

Note that the kernel may be entered at any time via an interrupt... The interface it exposes is greater than that listed by system calls.
reader said...
Steven, perhaps you should re-read the first three paragraphs of this article.
Timo said...
You give timings for the your code, but not for the original one from the kernel with the loop. I'm curious as to how big the speedup is.
Conn said...
would you be willing to rewrite and give out your u2f and i2f assembly examples as inlined asm?
Conn said...
Well I gave u2f a try. It works as a stand alone function but when gcc inlines it it screws things up.

uint32_t u2f(uint32_t x)
{
        uint32_t out, rotate, mask = 0x7fffff;

asm volatile(
        "mov %1, %0 \n\t"
        "bsr %0, %0 \n\t"
        "je 1f\n\t"
        "lea -0x17(%0), %2\n\t"
        "add $0x7f, %0\n\t"
        "shl $0x17, %0\n\t"
        "ror %%cl, %1\n\t"
        "and %3, %1\n\t"
        "add %1, %0\n\t"
        "1:\n\t"
        : "=&a" (out) : "r" (x), "c" (rotate), "g" (mask) : "cc");

        return out;
}

Maybe you could look at it and see where I'm going wrong.
Conn said...
Ok, I got it to work but I'm sure there is room for improvement from a more experienced gcc inline asm programmer.


uint32_t u2f(uint32_t x)
{
        uint32_t out, rotate ;

asm volatile(
        "mov %1, %0 \n\t"
        "bsr %0, %0 \n\t"
        "je 1f\n\t"
        "lea -0x17(%0), %2\n\t"
        "add $0x7f, %0\n\t"
        "shl $0x17, %0\n\t"
        "ror %%cl, %1\n\t"
        "and $0x7fffff, %1\n\t"
        "add %1, %0\n\t"
        "1:\n\t"
        : "=&a" (out), "=r" (x), "=c" (rotate)
        : "1" (x), "2" (rotate)
        : "cc");

        return out;
}

It even works in the linux kernel :)
Satria said...
Ahh Eres nice idea. But if we care about performance, I'm not sure whehter inserting an expression all the times, where sometimes they won't be used, is good enough. Though on the other hand, it depends on how quick you know if there was a change.The easiest, ok not the easiest, but the best solution would be to have an IP register and a shadow IP register. The shadow register will contain the current IP, and will be used for rvalues only. The IP register will be reset to undefined every instruction. Then if the real IP register was set, you will know to use it. Otherwise continue to next instruction, and set the shadow one again.This wasy even if an instruction jumps to itself it will keep on working
Tommy said...
Unpeeallrlad accuracy, unequivocal clarity, and undeniable importance! http://ixswkdvzthq.com [url=http://btcswfsuhpg.com]btcswfsuhpg[/url] [link=http://scqdnzhdixu.com]scqdnzhdixu[/link]
Tommy said...
Unpeeallrlad accuracy, unequivocal clarity, and undeniable importance! http://ixswkdvzthq.com [url=http://btcswfsuhpg.com]btcswfsuhpg[/url] [link=http://scqdnzhdixu.com]scqdnzhdixu[/link]
Evaline said...
<a href="http://brirgw.com">Townudohc!</a> That's a really cool way of putting it!

Enter the 10 characters above here


Name
About Us Returns Policy Privacy Policy Send us Feedback
Company Info | Product Index | Category Index | Help | Terms of Use
Copyright © Lockless Inc All Rights Reserved.