Implementing int2float()

The Linux Kernel exists within a special coding context. It has different constraints than user-space code. For example, stack space is limited, and various run-time features of the C standard library may not exist. (Use kmalloc() instead of malloc().) In addition, there may be hardware-specific differences to the way code executes in a kernel context. On x86_64, the kernel lives in a different address space than user space. The uppermost addressing bits are ones instead of zeros. Finally, you can't use floating point within the kernel

That isn't quite true. The RAID code does use floating point, but it pays some cost in doing so. The reason is that floating point state isn't saved and restored by the context-switch entry point. This means that if the kernel executes any floating point instructions, it can corrupt userspace information.

Note that you can't simply save and restore all the floating point registers. You also need to save and restore internal floating point flag state as well. For example, non-standard rounding modes may be enabled, or a floating point exception from a signalling NaN may be pending. Finally, the size of the floating point registers is hardware dependent. On newer X86, there are AVX registers that are 256 bits in size, whereas older machines only have the 128 bit SSE registers. This complexity (just for a single architecture, supporting others is even more difficult) means that floating point use is highly frowned upon.

This poses a problem. Some hardware actually depends on getting IEEE floating point numbers as part of its operation. If those numbers can come from user-space, it isn't too much of an issue. We can pass the information via memory, or by integer registers. However, sometimes the kernel itself needs to calculate floating point numbers. An example of this are GPU drivers. OpenGL uses floats to describe 3d space, and it is natural for hardware to mimic this interface.

Thus we need a way of calculating the IEEE floating point bitwise representation of an integer, but without using the convenient integer to floating point conversion instructions. The task today is to find the most efficient way of constructing such a function.

Straight-Forward Method

Recent versions of the Linux kernel use the following routine to calculate the floating point number corresponding to an unsigned integer: [It is actually called i2f()]

As can be seen by the comment, the function isn't perfect. It only works for numbers up to 2¹⁵-1, and is quite inefficient. The larger the numbers to support, the slower the algorithm is.

The problem is that the code finds out what the exponent and fraction are by doing a linear scan. We need a better method that does a faster search. Fortunately, there is the fls() function that obtains the location of the most significant bit. On x86, this corresponds to using an asm intrinsic to invoke the bsr opcode.

Thus we can replace the loop above with a single call to fls(). Due to the simpler algorithm, we can also expand the supported range to all numbers less than 2²⁴ or so. In general, numbers above this cannot be exactly represented as floating point, due to the limited number of fraction bits.

What we would like to do is support all unsigned 32 bit integers. The problem there is that we need to modify the algorithm somewhat. Numbers below 2²⁴ need to have their fraction bits shifted to the left. Numbers above, need to have the fraction bits shifted to the right. So it looks like we may need to have a branch for either case.

There is a trick, however. We can use the magic of the rotate instruction. A rotate (followed by a mask) can behave like a shift either to the left, or to the right. If we are careful about what values are passed to the rotate, we can also avoid some arithmetic. On x86, rotate instructions are taken modulo the size of the register i.e. 32 in this case, which allows further simplification. The result is the following routine:

As described in the comment, this function doesn't quite output the same values as your floating-point co-processor would. Above 2²⁴ the different choices of rounding mode matter. We chose the faster method of rounding towards zero. This is equivalent to truncation. Your fpu will likely choose "round to even", which has better numerical properties. Finally, note that the mod 32 (via &0x31) is there to support cpus other than x86. GCC helpfully removes that redundant operation when the rotate instruction doesn't need it.

Assembly

The above function is much more efficient than its earlier counterpart. It gets compiled by GCC into:

There is some extra redundancy in the above. The test for zero can be removed. The result of doing the bsr instruction will set the zero flag in that case. We can also use the fact that bsr doesn't affect its output register if it has a zero input on modern x86. (This behaviour is officially undocumented, but is now used elsewhere within the Linux Kernel at the blessing of the CPU manufacturers.) Thus we can simplify the function to:

The question is is this worth it? The original function takes 34 seconds to convert all possible 32 bit unsigned integers. The new function takes 30 seconds. The 10% speedup may, or may not be worth the maintainence burden of an assembly language routine.

Signed integers

The previous routines dealt with unsigned integers. Creating functions that work with signed integers isn't much more difficult. We just need to check the sign of the input, and then toggle the sign bit of the output floating point number. (Of course, always using the positive absolute value for the calculation.)

Unfortunately, this isn't handled very efficiently by GCC. It doesn't understand that the negative case could never also equal zero. It also doesn't fold the setting of the sign bit into other constants. The result thus looks like:

This function takes 47 seconds to convert all 32 bit signed integers to their floating point counterparts.

We can be a little more intelligent. With a single neg instruction, we can simultaneously test for zero, positive or negativeness, and also flip the sign of the input. Using this, we can use conditional moves to control the body of the function. The resulting routine looks like:

The above is a little faster than the C routine, taking 44 seconds to run over all inputs.

What happens if we try to reproduce the method that gcc uses, and have two separate branches for the positive and negative cases? We can optimize things a bit by using smarter constants, and by simplifying the check for zero. Doing that looks like:

It also takes 44 seconds to run. Perhaps we can do better. There is a well-known way to calculate the absolute value of a number within assembly. You can use sign-extension followed by an xor and subtract. Finally, we can use the sign-mask to also perturb the calculation constants. Using those bit-tricks, our routine looks like:

It executes one second faster than the previous routine, at 43 seconds for all inputs.

Finally, we could try a dumber method. Just use a simple branch around setting the constants depending on the sign:

This, surprisingly, is the fastest method of all. It takes 35 seconds to run over all possible inputs. The shorter code works better than the other branching algorithm. The fact that the benchmark (which scans linearly) is extremely predictable makes it better than the cmov-using function. In general, cmov works better with unpredictable input, and jumps are better in pretty much any other situation.

Of course, much of this testing is highly micro-arch dependent. This machine has a slow bsr implementation. Other machines may show bottlenecks elsewhere, and a different version of the above functions may prove to be the best. However, the above does show that it is possible to efficiently implement some floating point features purely with integer code. The fact that the Linux Kernel doesn't use floating point isn't much of a constraint after all.

The above routines convert 32 bit signed and unsigned integers into 32 bit single-precision floating point numbers. The conversion of 64 bit integers into double-precision floating point numbers isn't too different. All that is required is the use of 64 bit registers, and the modifying of the constants used. The code structure and algorithm can stay exactly the same.

Comments

Steven said...

Have you compared with specialized CPU instructions such as FILD (float load int into double) and FIST (float store as int)? Somehow, even if they should be much faster than the procedural conversions, I have the impression that they're not all that fast. I'd be curious to see their speed.

(also, the code for i2f is incorrect if the number is the smallest (largest negative) value. For example, on 16 bits, it ranges from -32768 to 32767. -(-32768) is not 32768)

sfuerst said...

The whole point is that you can't use the specialized CPU instructions inside the kernel, as floating point is off-limits. It is unlikely that these routines are as fast as the hardware ones.

Are you sure the code for i2f() is broken? It's been tested against the obvious float cast for all 32bit values. The negation of INT_MIN gives INT_MIN, but then the implicit cast to the unsigned type will yield 0x80000000 as an unsigned value. u2f() then works as intended...

Steven said...

I wasn't aware that this was a hard limitation in the kernel. Why prevent the use of floating point operations? To prevent to have a context-save?

sfuerst said...

The floating point registers are not saved on kernel entry. This increases performance due to avoiding the memory bandwidth cost of saving+restoring them.

However, the cost of supporting floating point within the kernel is even worse than that. You also need to save+restore things like the rounding mode and exception state. It is a horrible pain, that is only worth doing in rare cases. One case where the pain is outweighed by the gain is within SSE-accelerated RAID code.

Note that the kernel may be entered at any time via an interrupt... The interface it exposes is greater than that listed by system calls.

reader said...

Steven, perhaps you should re-read the first three paragraphs of this article.

Timo said...

You give timings for the your code, but not for the original one from the kernel with the loop. I'm curious as to how big the speedup is.

Conn said...

would you be willing to rewrite and give out your u2f and i2f assembly examples as inlined asm?

Conn said...

Well I gave u2f a try. It works as a stand alone function but when gcc inlines it it screws things up.

uint32_t u2f(uint32_t x)
{
        uint32_t out, rotate, mask = 0x7fffff;

asm volatile(
        "mov %1, %0 \n\t"
        "bsr %0, %0 \n\t"
        "je 1f\n\t"
        "lea -0x17(%0), %2\n\t"
        "add $0x7f, %0\n\t"
        "shl $0x17, %0\n\t"
        "ror %%cl, %1\n\t"
        "and %3, %1\n\t"
        "add %1, %0\n\t"
        "1:\n\t"
        : "=&a" (out) : "r" (x), "c" (rotate), "g" (mask) : "cc");

        return out;
}

Maybe you could look at it and see where I'm going wrong.

Conn said...

Ok, I got it to work but I'm sure there is room for improvement from a more experienced gcc inline asm programmer.

uint32_t u2f(uint32_t x)
{
        uint32_t out, rotate ;

asm volatile(
        "mov %1, %0 \n\t"
        "bsr %0, %0 \n\t"
        "je 1f\n\t"
        "lea -0x17(%0), %2\n\t"
        "add $0x7f, %0\n\t"
        "shl $0x17, %0\n\t"
        "ror %%cl, %1\n\t"
        "and $0x7fffff, %1\n\t"
        "add %1, %0\n\t"
        "1:\n\t"
        : "=&a" (out), "=r" (x), "=c" (rotate)
        : "1" (x), "2" (rotate)
        : "cc");

        return out;
}

It even works in the linux kernel :)

Satria said...

Ahh Eres nice idea. But if we care about performance, I'm not sure whehter inserting an expression all the times, where sometimes they won't be used, is good enough. Though on the other hand, it depends on how quick you know if there was a change.The easiest, ok not the easiest, but the best solution would be to have an IP register and a shadow IP register. The shadow register will contain the current IP, and will be used for rvalues only. The IP register will be reset to undefined every instruction. Then if the real IP register was set, you will know to use it. Otherwise continue to next instruction, and set the shadow one again.This wasy even if an instruction jumps to itself it will keep on working

Tommy said...

Unpeeallrlad accuracy, unequivocal clarity, and undeniable importance! http://ixswkdvzthq.com [url=http://btcswfsuhpg.com]btcswfsuhpg[/url] [link=http://scqdnzhdixu.com]scqdnzhdixu[/link]

Tommy said...

Unpeeallrlad accuracy, unequivocal clarity, and undeniable importance! http://ixswkdvzthq.com [url=http://btcswfsuhpg.com]btcswfsuhpg[/url] [link=http://scqdnzhdixu.com]scqdnzhdixu[/link]

Evaline said...

<a href="http://brirgw.com">Townudohc!</a> That's a really cool way of putting it!

Yakov said...

I'm not sure what you describe is completely kosher as I'm almost sure BSR/BSF use floating point internally. They convert to float (or double) and read the exponent.