The Linux kernel exists within a special coding context, with different constraints than user-space code. For example, stack space is limited, and many run-time features of the C standard library are unavailable. (Use kmalloc() instead of malloc().) In addition, there may be hardware-specific differences in the way code executes in a kernel context: on x86_64, the kernel lives in a different address space than user space, with the uppermost address bits set to ones instead of zeros. Finally, you can't use floating point within the kernel.
That isn't quite true. The RAID code does use floating point, but it pays a cost in doing so. The reason is that floating point state isn't saved and restored by the context-switch entry point. This means that if the kernel executes floating point instructions without special care, it can corrupt user-space state.
Note that you can't simply save and restore all the floating point registers; you also need to save and restore the internal floating point flag state. For example, non-standard rounding modes may be enabled, or a floating point exception from a signalling NaN may be pending. Finally, the size of the floating point registers is hardware dependent: on newer x86, AVX registers are 256 bits in size, whereas older machines only have the 128-bit SSE registers. This complexity (just for a single architecture; supporting others is even more difficult) means that floating point use is highly frowned upon.
This poses a problem: some hardware actually depends on receiving IEEE floating point numbers as part of its operation. If those numbers come from user space, it isn't much of an issue; we can pass the information via memory or in integer registers. However, sometimes the kernel itself needs to calculate floating point values. An example of this is GPU drivers: OpenGL uses floats to describe 3D space, and it is natural for hardware to mimic that interface.
Thus we need a way of calculating the IEEE floating point bitwise representation of an integer, but without using the convenient integer to floating point conversion instructions. The task today is to find the most efficient way of constructing such a function.
Recent versions of the Linux kernel use the following routine to calculate the floating point number corresponding to an unsigned integer: [It is actually called i2f()]
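A sketch of that routine, paraphrased from the radeon driver rather than quoted verbatim, so treat the details as approximate:

```c
#include <stdint.h>

/* Sketch of the kernel's i2f(), paraphrased from the radeon driver.
 * Returns the IEEE-754 single-precision bit pattern for a small
 * unsigned integer, locating the leading bit by a linear scan. */
static uint32_t i2f(uint32_t input)
{
	uint32_t result, i, exponent, fraction;

	if ((input & 0x3fff) == 0) {
		result = 0;	/* 0 is a special case */
	} else {
		exponent = 140;	/* exponent, biased by 127 */
		fraction = (input & 0x3fff) << 10; /* cheat: only handle
						      numbers below 2^15 */
		for (i = 0; i < 14; i++) {
			if (fraction & 0x800000)
				break;	/* leading 1 reached bit 23 */
			fraction <<= 1;	/* keep shifting left */
			exponent--;	/* and adjust the exponent */
		}
		/* mask off the top bit; it is implicit in the format */
		result = (exponent << 23) | (fraction & 0x7fffff);
	}
	return result;
}
```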
As can be seen from the comment, the function isn't perfect. It only works for numbers up to 2^15-1, and it is quite inefficient: the larger the numbers to be supported, the slower the algorithm.
The problem is that the code finds the exponent and fraction by doing a linear scan. We need a better method that does a faster search. Fortunately, there is the bsr instruction on x86, which returns the index of the most significant set bit in a register; GCC exposes the same operation portably via the __builtin_clz() intrinsic.
Thus we can replace the loop above with a single bit-scan.
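A sketch of how that might look, using GCC's __builtin_clz() (the function name is an assumption; this version is valid for inputs below 2^24, where the shift count cannot go negative):

```c
#include <stdint.h>

/* Sketch: the linear scan replaced by a single bit-scan.
 * Valid for 0 < input < 2^24, so the left shift below is
 * never negative. __builtin_clz() is undefined for zero
 * input, hence the explicit check. */
static uint32_t i2f_clz(uint32_t input)
{
	uint32_t msb, exponent, fraction;

	if (input == 0)
		return 0;

	msb = 31 - __builtin_clz(input); /* index of highest set bit */
	exponent = msb + 127;		 /* bias the exponent */
	/* shift so the leading 1 lands on bit 23, then drop it
	 * (it is implicit in the IEEE format) */
	fraction = (input << (23 - msb)) & 0x7fffff;
	return (exponent << 23) | fraction;
}
```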
What we would like is to support all unsigned 32 bit integers. The problem is that we then need to modify the algorithm somewhat. Numbers below 2^24 need their fraction bits shifted to the left; numbers above need them shifted to the right. So it looks like we may need a branch for each case.
There is a trick, however. We can use the magic of the rotate instruction. A rotate (followed by a mask) can behave like a shift either to the left, or to the right. If we are careful about what values are passed to the rotate, we can also avoid some arithmetic. On x86, rotate instructions are taken modulo the size of the register i.e. 32 in this case, which allows further simplification. The result is the following routine:
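A C rendering of the idea (a sketch; the names u2f() and ror32() are assumptions, and the real routine may differ in detail):

```c
#include <stdint.h>

/* Portable rotate-right; compiles to a single ror instruction
 * on x86, where rotate counts are already taken mod 32. */
static uint32_t ror32(uint32_t x, uint32_t n)
{
	n &= 31;
	return (x >> n) | (x << ((32 - n) & 31));
}

/* Sketch of the rotate trick (names assumed). For msb < 23 the
 * rotate acts as a left shift; for msb > 23 it acts as a right
 * shift, with the wrapped-around low bits removed by the fraction
 * mask. Rounds toward zero (truncation) for inputs above 2^24. */
static uint32_t u2f(uint32_t x)
{
	uint32_t msb, exponent, fraction;

	if (x == 0)
		return 0;

	msb = 31 - __builtin_clz(x);	/* index of highest set bit */
	exponent = (msb + 127) << 23;
	fraction = ror32(x, (msb - 23) & 31) & 0x7fffff;
	return exponent + fraction;
}
```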
As described in the comment, this function doesn't quite output the same values as your floating point coprocessor would. Above 2^24 the choice of rounding mode matters. We chose the faster method of rounding towards zero, which is equivalent to truncation. Your FPU will likely use "round to nearest even", which has better numerical properties. Finally, note that the mod 32 (via the & 31) is there to support CPUs other than x86. GCC helpfully removes the redundant operation when the rotate instruction doesn't need it.
The above function is much more efficient than its earlier counterpart. It gets compiled by GCC into:
There is some extra redundancy in the above. The test for zero can be removed: the bsr instruction itself sets the zero flag when its source operand is zero, so a hand-written assembly version can branch on that result instead of performing a separate test.
The question is: is this worth it? The original function takes 34 seconds to convert all possible 32 bit unsigned integers; the new function takes 30 seconds. The roughly 10% speedup may or may not be worth the maintenance burden of an assembly language routine.
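Correctness, at least, is easy to settle from user space: below 2^24 every integer is exactly representable in single precision, so a correct conversion must match the FPU bit-for-bit there regardless of rounding mode. A self-contained sketch of such a check (function names are assumptions; the rotate-based conversion is restated so the example compiles on its own):

```c
#include <stdint.h>
#include <string.h>

/* Restatement of the rotate-based conversion sketch
 * (names u2f()/ror32() assumed, not the kernel's own). */
static uint32_t ror32(uint32_t x, uint32_t n)
{
	n &= 31;
	return (x >> n) | (x << ((32 - n) & 31));
}

static uint32_t u2f(uint32_t x)
{
	uint32_t msb;

	if (x == 0)
		return 0;
	msb = 31 - __builtin_clz(x);
	return ((msb + 127) << 23) + (ror32(x, (msb - 23) & 31) & 0x7fffff);
}

/* Count disagreements with the FPU over [0, limit). Below 2^24
 * every integer is exactly representable in single precision,
 * so any correct routine must match bit-for-bit there. */
static uint32_t mismatches_below(uint32_t limit)
{
	uint32_t x, bad = 0;

	for (x = 0; x < limit; x++) {
		float f = (float)x;	/* the FPU's answer */
		uint32_t bits;

		memcpy(&bits, &f, sizeof bits);
		if (bits != u2f(x))
			bad++;
	}
	return bad;
}
```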
The previous routines dealt with unsigned integers. Creating functions that work with signed integers isn't much more difficult. We just need to check the sign of the input, and then toggle the sign bit of the output floating point number. (Of course, always using the positive absolute value for the calculation.)
Thus a C version of the signed case might look like:
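A sketch under assumed names (the unsigned helper is restated so the example is self-contained; the article's actual listing may differ):

```c
#include <stdint.h>

static uint32_t ror32(uint32_t x, uint32_t n)
{
	n &= 31;
	return (x >> n) | (x << ((32 - n) & 31));
}

/* Unsigned conversion as sketched earlier (names assumed) */
static uint32_t u2f(uint32_t x)
{
	uint32_t msb;

	if (x == 0)
		return 0;
	msb = 31 - __builtin_clz(x);
	return ((msb + 127) << 23) + (ror32(x, (msb - 23) & 31) & 0x7fffff);
}

/* Signed version: convert the absolute value, then set the sign
 * bit of the result for negative inputs. Negating as unsigned
 * keeps INT_MIN well-defined. */
static uint32_t s2f(int32_t x)
{
	if (x < 0)
		return 0x80000000u | u2f(-(uint32_t)x);
	return u2f((uint32_t)x);
}
```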
Unfortunately, GCC doesn't handle this very efficiently. It doesn't realize that the negative case can never also be zero, and it doesn't fold the setting of the sign bit into the other constants. The result looks like:
This function takes 47 seconds to convert all 32 bit signed integers to their floating point counterparts.
We can be a little more intelligent. With a single cmov instruction we can select the constants for the negative case without branching:
The above is a little faster than the C routine, taking 44 seconds to run over all inputs.
What happens if we try to reproduce the method GCC uses, with two separate branches for the positive and negative cases? We can optimize things a bit by using smarter constants and by simplifying the check for zero. Doing that looks like:
It also takes 44 seconds to run. Perhaps we can do better. There is a well-known way to calculate the absolute value of a number in assembly: sign-extend, then xor and subtract. We can then also use the resulting sign mask to perturb the calculation constants. Using those bit tricks, our routine looks like:
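The core of the trick, sketched in C (compilers typically emit the cdq/xor/sub pattern for this; the full assembly routine additionally folds the sign mask into its constants):

```c
#include <stdint.h>

/* Branchless absolute value via sign extension: mask is 0 for
 * non-negative x and all-ones for negative x, so (x ^ mask) - mask
 * equals x or -x respectively. The mask doubles as a source for
 * the floating point sign bit. Note: right-shifting a negative
 * int is implementation-defined; GCC uses an arithmetic shift,
 * which is what this sketch assumes. */
static uint32_t abs_bits(int32_t x, uint32_t *sign)
{
	uint32_t mask = (uint32_t)(x >> 31); /* 0 or 0xffffffff */

	*sign = mask & 0x80000000u;	/* IEEE sign bit */
	return ((uint32_t)x ^ mask) - mask;
}
```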
It executes one second faster than the previous routine, at 43 seconds for all inputs.
Finally, we could try a dumber method. Just use a simple branch around setting the constants depending on the sign:
This, surprisingly, is the fastest method of all, taking 35 seconds to run over all possible inputs. The shorter code beats the other branching algorithm, and because the benchmark (which scans linearly) makes the branch extremely predictable, it also beats the cmov-using function. In general, cmov works better with unpredictable input, and branches are better in pretty much any other situation.
Of course, much of this testing is highly micro-architecture dependent. This machine has a relatively slow cmov, and the results may differ on other processors.
The above routines convert 32 bit signed and unsigned integers into 32 bit single-precision floating point numbers. Converting 64 bit integers into double-precision floating point numbers isn't too different: all that is required is the use of 64 bit registers and modified constants. The code structure and algorithm stay exactly the same.
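For illustration, a 64-bit sketch under the same assumptions (names assumed; a double has an 11-bit exponent biased by 1023 and a 52-bit fraction):

```c
#include <stdint.h>

/* 64-bit rotate-right; rotate counts taken mod 64, matching x86-64. */
static uint64_t ror64(uint64_t x, uint64_t n)
{
	n &= 63;
	return (x >> n) | (x << ((64 - n) & 63));
}

/* Same structure as the 32-bit routine, with the constants changed:
 * bias 1023, fraction in bits 0-51, rotate counts mod 64. Rounds
 * toward zero (truncation) for inputs above 2^53. */
static uint64_t u2d(uint64_t x)
{
	uint64_t msb, exponent, fraction;

	if (x == 0)
		return 0;

	msb = 63 - __builtin_clzll(x);	/* index of highest set bit */
	exponent = (msb + 1023) << 52;
	fraction = ror64(x, (msb - 52) & 63) & 0xfffffffffffffULL;
	return exponent + fraction;
}
```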
Copyright © Lockless Inc All Rights Reserved.