Lockless Inc

To inline or not to inline

Most programmers know that giving commonly used functions the inline attribute is good for optimization. So what exactly does inlining do, and why is it helpful? A function call consists of several steps:

  1. Marshal the parameters into the stack locations or registers required by the ABI
  2. Obtain the address of the function to call, and then call it
  3. Construct a stack frame for local variables in that function
  4. Actually do the work
  5. Deconstruct the stack frame of local variables
  6. Return to the calling function
  7. Move the results into the registers or locations where they actually need to be

The first step exists because the function that you are calling needs to know how to get at the information contained in its arguments. The convention used for this is part of the Application Binary Interface (ABI), and varies between machine types and operating systems. For example, 32-bit x86 machines tend to pass arguments on the stack, whereas 64-bit x86 machines tend to pass them in specific registers. There is an obvious cost in physically getting the arguments into the correct spots. However, a smart optimizer may alleviate some of this through intelligent register and stack-slot allocation.

The second step exists because a function's address might not be a compile-time constant. If you are calling a function in a dynamic library, then the address of that function needs to be obtained at runtime. Whether this is done at program startup or gradually during execution depends on your system. Either way, this cost exists, but is usually ignored by most people.

The third step exists because the called function may need somewhere to spill temporaries. Once the spare registers run out, the stack is used. The space for this needs to be allocated at the function's start by adding to or subtracting from the stack pointer, depending on whether the stack grows upwards or downwards on your machine/operating system combination.

The fourth step is self-explanatory. A function is called to do some work. However, as we can see, quite a lot needs to happen before this is done.

The fifth step may or may not exist depending on the ABI. Some machine/OS combinations allow the use of a return instruction that automatically pops off the local stack frame and arguments. Others do not, in order to keep the calling convention simpler and more flexible.

The sixth step is usually a single "return" instruction. However, a smart optimizer may change it into a jump via a tail-call optimization if a function ends with a call to a second daughter function.

Finally, we have the results. Unfortunately, they typically are not in the place we want them. For example, on x86, integers and pointers are returned in %eax or %rax, but we may need the results stored somewhere else. If we are to call another function, we'll need to store %eax on the stack if we are running in 32-bit mode; if we are in 64-bit mode, we might need to move %rax to %rdi or %rsi to match the ABI of the function we are calling.

This is a rather long list, and inlining removes most of these steps, just leaving the "actually do the work" one. (With perhaps a few register saves and restores remaining, depending on how complex the inlined function is, and how register-starved the architecture is.) So obviously inlining is always a good thing, and should be done as much as possible. Unfortunately, reality isn't so simple. Making code inline can actually slow things down. Most of the time the slowdown is indirect, and not correctly attributed by profilers, so it's hard to detect.

The problem is due to the finite size of your computer's cache. Cache is so much faster than main memory on modern machines that it is extremely noticeable when a problem-set no longer fits within it. The larger the program, the larger its cache footprint. The inevitable cache misses are distributed virtually randomly throughout the program during its execution. Inlining a function called in multiple places can increase the size of the program, and thus the number of cache misses. However, the cache misses will probably not happen in the inlined code itself. This makes the slow-down hard to spot.

Thus only small functions that optimize down greatly, or functions called in key inner loops, should probably be inlined. The class of functions only called once sounds like a good category to inline as well, since the resulting compiled code will probably be shorter. However, there is another cost to inlining: it makes debugging much harder. Since there can now be multiple copies of a function, and a function may be "smeared out" in the middle of another function, it becomes much harder for the debugger to find which asm instructions correspond to a given line of source code. Inlining a function also exposes the function arguments to the optimizer, which may do all sorts of transformations on them, including eliminating them entirely. Debugging a function whose variables you cannot read, and whose lines you cannot set breakpoints on, can be a challenge.

Inlining for performance is well known. However, the inverse is rarely discussed. The inline attribute is a keyword in modern C and C++, but forcing something not to be inlined is not standardized, and needs to be done through some compiler-specific method. The gcc way is to use __attribute__((noinline)) on the function declaration. It turns out that in some relatively common cases, it is possible to make code faster by deliberately not inlining. How is this possible? Inlining removes a large number of instructions; how can this advantage be overwhelmed?

The situation where it is advantageous to not inline part of a function is when the function is usually extremely simple, but contains a rarely-true if statement that triggers a slow, bloated special case. An example of this is:

static int *array;

int use_array(size_t x, size_t y)
{
	if (!array)
	{
		int result;

		array = malloc(1024);
		if (!array)
		{
			printf("We are out of memory in use_array()\n");
			exit(0);
		}

		result = init_array(array);
		test_array(result, array);
	}

	array[x] += y;

	return array[x];
}

The above code does something rather simple in the common case. However, the rare initialization case causes some extreme bloating:

<use_array+0>:		mov    %rbx,-0x10(%rsp) 					
<use_array+5>:		mov    %rbp,-0x8(%rsp)  					
<use_array+10>:		sub    $0x18,%rsp						  
<use_array+14>:		cmpq   $0x0,0x200622(%rip) 	  # 0x600cd8 <array>
<use_array+22>:		mov    %rdi,%rbx									
<use_array+25>:		mov    %rsi,%rbp									
<use_array+28>:		je     0x4006e8 <use_array+72>					
<use_array+30>:		lea    0x0(,%rbx,4),%rdx							
<use_array+38>:		add    0x20060b(%rip),%rdx  	  # 0x600cd8 <array>
<use_array+45>:		mov    %ebp,%eax									
<use_array+47>:		add    (%rdx),%eax  								
<use_array+49>:		mov    %eax,(%rdx)  								
<use_array+51>:		mov    0x8(%rsp),%rbx								
<use_array+56>:		mov    0x10(%rsp),%rbp  							
<use_array+61>:		add    $0x18,%rsp									
<use_array+65>:		retq												
<use_array+66>:		nopw   0x0(%rax,%rax,1) 							
<use_array+72>:		mov    $0x400,%edi  								
<use_array+77>:		callq  0x400478 <malloc@plt>  					
<use_array+82>:		test   %rax,%rax									
<use_array+85>:		mov    %rax,0x2005dc(%rip)  	  # 0x600cd8 <array>
<use_array+92>:		je     0x400708 <use_array+104>					
<use_array+94>:		mov    %rax,%rdi									
<use_array+97>:		callq  0x400590 <init_array>  					
<use_array+102>:	jmp    0x4006be <use_array+30>					
<use_array+104>:	mov    $0x400918,%edi								
<use_array+109>:	callq  0x400458 <puts@plt>						
<use_array+114>:	xor    %edi,%edi									
<use_array+116>:	callq  0x400468 <exit@plt>

The reason for the "bloat" is that the function calls in the rare case require a stack frame to be built. Even though x and y are not modified within use_array(), the function needs a place to store them across the other function calls. It does this by saving %rbx and %rbp, and then using those registers to hold the two arguments. However, in the common case this saving and restoring is a complete waste: the actual work done is extremely small compared to all of the overhead.

It turns out that the code can be optimized by splitting the rare case out into a separate function. By forcing that new function not to be inlined, we can keep the stack frame setup and teardown costs out of the fast path.

int use_array2(size_t x, size_t y);

__attribute__((noinline)) int use_array2_aux(int x, int y)
{
	int result;

	array = malloc(1024);
	if (!array)
	{
		printf("We are out of memory in use_array()\n");
		exit(0);
	}

	result = init_array(array);
	test_array(result, array);

	return use_array2(x, y);
}

int use_array2(size_t x, size_t y)
{
	if (!array) return use_array2_aux(x, y);

	array[x] += y;

	return array[x];
}

Note how the call to use_array2_aux() becomes a tail call. This means no stack frame is needed, and the resulting assembler for use_array2() is extremely fast:

<use_array2+0>:		mov 	0x200541(%rip),%rax		# 0x600cd8 <array>
<use_array2+7>:		test	%rax,%rax
<use_array2+10>:	je	0x4007b0 <use_array2+32>
<use_array2+12>:	lea 	(%rax,%rdi,4),%rdx
<use_array2+16>:	mov 	%esi,%eax
<use_array2+18>:	add 	(%rdx),%eax
<use_array2+20>:	mov 	%eax,(%rdx)
<use_array2+22>:	retq
<use_array2+23>:	nopw	0x0(%rax,%rax,1)
<use_array2+32>:	jmpq	0x400720 <use_array2_aux>

The only deficiency is that the compiler doesn't generate a direct conditional jump to use_array2_aux(). However, since that function is only called once, the extra instructions don't really matter much. For completeness, the assembler for use_array2_aux() is still as bloated as ever:

<use_array2_aux+0>:	mov    %rbx,-0x18(%rsp)
<use_array2_aux+5>:	mov    %rbp,-0x10(%rsp)
<use_array2_aux+10>:	mov    %edi,%ebp
<use_array2_aux+12>:	mov    %r12,-0x8(%rsp)
<use_array2_aux+17>:	mov    $0x400,%edi
<use_array2_aux+22>:	sub    $0x18,%rsp
<use_array2_aux+26>:	mov    %esi,%r12d
<use_array2_aux+29>:	callq  0x400478 <malloc@plt>
<use_array2_aux+34>:	test   %rax,%rax
<use_array2_aux+37>:	mov    %rax,%rbx
<use_array2_aux+40>:	mov    %rax,0x200589(%rip)  	  # 0x600cd8 <array>
<use_array2_aux+47>:	je     0x40077a <use_array2_aux+90>
<use_array2_aux+49>:	mov    %rax,%rdi
<use_array2_aux+52>:	callq  0x400590 <init_array>
<use_array2_aux+57>:	movslq %ebp,%rdx
<use_array2_aux+60>:	mov    %r12d,%eax
<use_array2_aux+63>:	lea    (%rbx,%rdx,4),%rdx
<use_array2_aux+67>:	add    (%rdx),%eax
<use_array2_aux+69>:	mov    %eax,(%rdx)
<use_array2_aux+71>:	mov    (%rsp),%rbx
<use_array2_aux+75>:	mov    0x8(%rsp),%rbp
<use_array2_aux+80>:	mov    0x10(%rsp),%r12
<use_array2_aux+85>:	add    $0x18,%rsp
<use_array2_aux+89>:	retq
<use_array2_aux+90>:	mov    $0x400918,%edi
<use_array2_aux+95>:	callq  0x400458 <puts@plt>
<use_array2_aux+100>: 	xor    %edi,%edi
<use_array2_aux+102>: 	callq  0x400468 <exit@plt>

The noinline trick is used within the Lockless Memory Allocator to remove the cost of rare initialization from fast paths. The common calls to functions like malloc() and free() do not need to set up a stack frame, increasing the speed of the library.

