To inline or not to inline
Most programmers know that marking commonly used code with the inline attribute is good for optimization. So what exactly does inlining code do, and why is it helpful? A function call consists of several steps:

1. Move the arguments to where the calling convention requires them.
2. Call the function.
3. Construct a stack frame.
4. Actually do the work.
5. Destroy the stack frame.
6. Return.
7. Move the results to where they are needed.
The first step exists because the function that you are calling needs to know how to find the information contained in its arguments. The convention used for this is part of the Application Binary Interface (ABI), and varies between machine types and operating systems. For example, 32-bit x86 machines tend to pass arguments on the stack, whereas 64-bit x86 machines tend to pass them in specific registers. There is an obvious cost in physically getting the arguments into the correct spots, although a smart optimizer can alleviate some of it through intelligent register and stack-slot allocation.
The second step exists because a function's address might not be a compile-time constant. If you are calling a function in a dynamic library, then the address of that function needs to be obtained at runtime. Whether this is done at program startup or lazily during execution depends on your system. Either way, this cost exists, but it is usually ignored.
The third step exists because the called function may need somewhere to spill temporaries. Once the spare registers run out, the stack is used. Space for this must be allocated at the function's start by adding to or subtracting from the stack pointer, depending on whether the stack grows upwards or downwards on your machine/operating-system combination.
The fourth step is self-explanatory. A function is called to do some work. However, as we can see, quite a lot needs to happen before this is done.
The fifth step may or may not exist, depending on the ABI. Some machine/OS combinations allow the use of a return instruction that automatically pops off the local stack frame and arguments. Others do not, in order to keep the calling convention simpler and more flexible.
The sixth step is usually a single return instruction. However, a smart optimizer may change it into a jump via a tail-call optimization if a function ends with a call to a second, daughter function.
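As a sketch of that last point (function names invented for illustration), a call in tail position can be compiled to a plain jump, since the caller's frame is no longer needed once the daughter function takes over:

```c
/* Hypothetical daughter function. */
long scale(long x)
{
    return x * 10;
}

/* parse_and_scale() ends with a call to scale(). Because nothing
 * happens after that call, an optimizer may replace the
 * "call scale; ret" pair with a single "jmp scale" (tail-call
 * optimization), reusing the current stack frame. */
long parse_and_scale(long x)
{
    long y = x + 1;
    return scale(y);   /* tail position: eligible for the optimization */
}
```

Whether the jump is actually emitted depends on the compiler and optimization level; the observable behaviour is identical either way.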
Finally, we have the results. Unfortunately, they are typically not in the place we want them: on x86, integers and pointers are returned in %eax or %rax, but we may need them stored somewhere else. For example, if we are about to call another function in 32-bit mode, we'll need to store %eax on the stack; in 64-bit mode, we might need to move %rax to %rdi or %rsi to match the ABI of the function we are calling.
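A tiny invented example makes the result shuffling concrete. In the nested call below, the value returned by one function immediately becomes the argument of the next, so the compiler has to move it from the return location to the argument location:

```c
int g(int x) { return x + 1; }   /* result comes back in %eax/%rax on x86 */
int f(int x) { return x * 2; }   /* expects its argument per the ABI      */

/* In f(g(v)), g()'s return value must be moved from the return
 * register to f()'s argument position: onto the stack in 32-bit
 * code, or into %edi in the 64-bit System V ABI. */
int chain(int v)
{
    return f(g(v));
}
```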
This is a rather long list, and inlining removes most of these steps, leaving just the "Actually do the work" one. (Perhaps with a few register saves and restores remaining, depending on how complex the inlined function is and how register-starved the architecture is.) So obviously inlining is always a good thing, and should be done as much as possible. Unfortunately, reality isn't so simple. Making code inline can actually slow things down. Most of the time the slowdown is indirect and not correctly attributed by profilers, so it is hard to detect.
The problem is due to the finite size of your computer's cache. Cache is so much faster than main memory on modern machines that it is extremely noticeable when a problem set no longer fits within it. The larger the program, the larger its cache footprint. The inevitable cache misses are distributed virtually randomly throughout the program during its execution. Inlining a function called in multiple places can increase the size of the program, and thus the number of cache misses. However, the cache misses will probably not happen in the inlined code itself, which makes this slowdown hard to spot.
Thus only small functions that optimize down greatly, or things called in key inner loops, should probably be inlined. The class of functions called only once sounds like a good category to inline as well, since the resulting compiled code will probably be shorter. However, there is another cost to inlining: it makes debugging much harder. Since there can now be multiple copies of a function, and a function may be "smeared out" in the middle of another, it is much harder for the debugger to determine which asm instructions correspond to a given line of source code. Inlining a function also exposes the function arguments to the optimizer, which may do all sorts of transformations on them, including eliminating them entirely. Debugging a function whose variables you cannot read, and whose lines you cannot set breakpoints on, can be a challenge.
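As a sketch of the kind of small function that is worth inlining (the names here are invented), a trivial accessor optimizes down to a single load, far cheaper than the call sequence described above:

```c
struct point { int x, y; };

/* Inlined, this collapses to one memory load at each call site,
 * so none of the call overhead listed earlier is paid. */
static inline int point_x(const struct point *p)
{
    return p->x;
}

const struct point origin = {7, 9};
```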
Inlining for performance is well known. However, the inverse is rarely discussed. The inline attribute is a keyword in modern C and C++, but forcing something not to be inlined is not standardized, and needs to be done through some compiler-specific method. The gcc way is to use the noinline function attribute, __attribute__((noinline)).
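A minimal sketch of the gcc syntax (the function itself is an invented stand-in):

```c
/* gcc-specific: forbid the optimizer from inlining this function,
 * even at -O3. Other compilers spell it differently, e.g. MSVC's
 * __declspec(noinline). */
__attribute__((noinline)) int slow_setup(int x)
{
    return x + 100;   /* stand-in for expensive, rarely run work */
}
```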
The situation where it is advantageous to prevent part of a function from being inlined is when the function is usually extremely simple, but contains an if statement that is rarely true and triggers a slow, bloated special case.
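A hypothetical sketch of the pattern (all names invented): the fast path is a single array load, but the rare initialization branch drags in a function call.

```c
static int table[256];
static int initialized;

static void build_table(void)
{
    int i;
    for (i = 0; i < 256; i++)
        table[i] = i * i;
}

/* Usually just a test and an array load. But because the rare
 * branch calls build_table(), the compiler sets up a stack frame
 * on every call, bloating the common case. */
int lookup(int i)
{
    if (!initialized)
    {
        build_table();
        initialized = 1;
    }
    return table[i & 255];
}
```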
Code following this pattern does something rather simple in the common case. However, the rare initialization case causes some extreme bloat in the compiled output.
The reason for the "bloat" is that the function calls in the rare case require a stack frame to be built. Even though the slow path is almost never executed, the compiler must construct the frame at function entry, so the common case pays the cost on every call.
It turns out that the code can be optimized by splitting the rare case out into a separate function. By forcing that new function not to be inlined, we avoid the stack frame setup and teardown costs on the fast path.
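A hypothetical sketch of the split (names invented): the rare case lives in a noinline helper, and the common path reduces to a test and a load with no frame of its own.

```c
static int config_value;
static int config_ready;

/* The slow, rare path lives out of line; gcc's noinline attribute
 * stops the optimizer from pulling it back into the caller. */
__attribute__((noinline)) static int init_config(void)
{
    config_value = 42;   /* stand-in for expensive one-time setup */
    config_ready = 1;
    return config_value;
}

/* The common path is now leaf-like: no stack frame setup needed. */
int get_config(void)
{
    if (!config_ready)
        return init_config();
    return config_value;
}
```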
Note how the call to the out-of-line function happens only on the rare path, leaving the common path free of stack frame manipulation. The only deficiency is that the compiler doesn't generate a direct conditional jump to the out-of-line function, so the rare path is slightly less efficient than it could be.
The noinline trick is used within the Lockless Memory Allocator to remove the cost of rare initialization from the fast paths. The common calls to functions like malloc() and free() stay small and fast, while the one-time setup work is pushed out of line.
Copyright © Lockless Inc All Rights Reserved.