Optimization Tricks used by the Lockless Memory Allocator

With the releasing of the Lockless Memory Allocator under the GPL version 3.0 license, we can now discuss more of the optimization tricks used inside it. Many of these are things you wouldn't want to use in normal code. However, when speed is the ultimate goal, sometimes we need to break a few rules and use code that is a little sneaky.

The Slab

A slab is a well-known technique for allocating fixed size objects. For a given object size, a chunk of memory is divided up into smaller regions of that length. Since all the interior objects have the same size, fragmentation is eliminated. The only reason an object will not be able to be allocated will be when every single sub-region has been used.

Since a generic allocator doesn't know how many objects of a given size will eventually be allocated, it doesn't know how large to make the chunk of memory for a slab. This means that a trade-off is made, and those chunks are set to a given size, and there can be many of them for a given object length.

Thus the algorithm for slab allocation is quite simple. Firstly look for a chunk that contains the size of objects you want. Secondly grab a free object from that chunk. By using linked lists, both operations can be made to be O(1). This simplicity, and lack of algorithmic complexity means that slab allocations are very fast. The only disadvantage with slabs is that they can take up too much memory. Beyond a certain object size, other allocation techniques are better due to being more compact.

So if slabs are so simple, how come the Lockless Allocator manages to be faster than other allocators that use them for small allocations? There doesn't seem to be much room for improvement.

The first trick is to notice that many objects of slightly different sizes will use the same slab. Due to the fact that the ABI requires that all allocations be 16-byte aligned, it means that objects with sizes from say 33 to 48 bytes can all be placed in the 48-byte slab. So what is the fastest way of converting from the allocation size to a linked-list pointer to the correct chunk?

The Lockless Memory Allocator does it on 64bit machines in three instructions. The first is an addition of 15 to offset the alignment. The second is a logical-and to clear the lower four bits. Finally, a simple memory load of the result, offset from the start of the array of list pointers to chunks produces what we are after. The reason this works is that the size of a doubly-linked list is 2×8=16 bytes, the same as the alignment constraints. So by noticing that, we can avoid some left and right shift instructions by doing the pointer arithmetic "manually".

The second trick is to notice that allocating something of size zero is special. The obvious way to handle this is to have a conditional somewhere in the above function. However, since zero-sized allocations are rare, we don't want the code handling them in the fast-path. The Lockless Memory Allocator avoids this simply by having the chunk list for zero-sized allocations empty. So the check for an available chunk will fail (which could happen anyway), and we will fall back to the slow-path function slab_alloc_nolist() Finally, in that function, which is deliberately kept out-of-line, we can check for zero-sizes and fix them up. These tricks may seem not particularly important, but since slab allocation is so fast, any spare cycles you can gain are fairly signifcant.

The Btree

To store medium-sized free regions of memory, the Lockless Memory Allocator uses a btree data structure. Inside it, the free regions are sorted by size. This allows O(log(n)) lookup of the best-fit region for a given allocation request.

The standard way of constructing a btree is to use an array of n pointers and size-keys at each level in the tree, corresponding to a branch-factor of n. To insert a node in the tree then requires a recursive set of inserts into those arrays. Conversely, a deletion requires a recursive set of array deletions. However, inserting and deleting at arbitrary locations within an array is slow, taking a time proportional to the length of that array. Typically, the branch factor is relatively small, making this not too bad.

The Lockless Memory Allocator uses a different technique. It uses arrays... but also manages to embed a linked list within one of them. By using the low-order byte of the size information to store which btree branch follows a given branch, we can convert insertion and deletion into a O(1) algorithm at each level in the tree. Also, by overlapping the size and "next" information, we reduce memory fetches in the important lookup operation. (The extra information doesn't alter the sort-order, so we don't even need to mask it off.) The resulting search operation is extremely compact:

Another trick is noticing that since we are using linked lists within the btree size information arrays, we need some way to denote the begining and end of that list. We choose to use location zero in the array for that. By using an "intrinsic" list with a header at location zero, we can use the spare bytes of the size-key value there for other purposes. Since we use up one byte for the "next" pointer, we have three other bytes. In them we store the index of this node in the parent node in the btree, and the bit-mask of the unused branches in the size-array. It may not seem like much, but by keeping everything compact, we save on cache misses and increase speed.

The Fast Lists

Most medium sized allocations will hit the fast-lists instead of travelling into the btree code. This is a cache of recently freed allocations that we can quickly return when required. Since many applications allocate and free the same sizes over and over again, this is a big win. The trick here is trying to make this look-aside cache as fast as possible.

The first trick is deciding how to store the differing sizes of free chunks of memory in this cache. What we want is some set of lists that cover a logarithmically distributed set of sizes. The difficulty is coming up with a fast way of converting the free-size into the index for the cache. One obvious way to do it is a set of nested if statements doing some kind of binary search. However, we are a little more subtle.

In 32-bit mode we use the floating-point hardware to do the conversion for us. By converting the size as an integer into a float, and then reading the bits within that float back as an integer, we can trick the processor into performing some sort of psuedo-log operation for us. If we choose the right constants, we can get a nice distribution that neatly covers the range of sizes we are interested in.

We could do a similar trick in 64-bit mode. However, since those machines tend to have more memory, the distribution of allocated and freed memory needs to stretch to higher values. Also, the instructions required to move data into and out of the floating point units are relatively slow. Instead, we use the find-last-set bit-hack via the bsr instruction to do our magic:

There remains one particularly evil trick in the fast-lists code. We go to great difficulty here to avoid constructing a stack frame. However, the fast_alloc function even when inlined simply requires too many registers. If we can avoid constructing a stack frame, we can avoid many extra instructions in the fast path.

Since we know exactly what code we are calling, we know when and where floating-point instructions are used. It turns out that in the one key situation where an extra register is needed, no possible floating point code can be called. This allows us to cheat, and store information into a floating point SSE register. Normally, this would be slower than using the stack. However, in this one case, the reverse is the case. The result is that we use some inline-asm in a particularly hackish way:

Random Numbers

It might sound strange, but it is useful to use random numbers within a memory allocator. By slightly randomizing the layout of structures in memory, we can avoid cache-aliasing. This occurs when multiple objects are placed a multiple of a power of two apart in memory. If those objects are rapidly used in succession, the internal cache of the processor can mistakenly evict memory that will be soon used again. This happens because the processor uses the middle bits of the address of a piece of memory as a sort of hash into its cache hardware data structure. Since only a relatively small number of chunks of memory are allowed to share the same hash, if we try to access more than that within a small amount of time, performance plummets.

Knowing this, the Lockless Memory Allocator tries to "cache colour" the slab by not always aligning the objects nicely against the header. Since there is usually some slack space, we can shift some of this from the end of the large chunk to the begininning to alter the layout. Similarly, the per-thread memory used for storing the allocator state is also randomly aligned.

To do this, we need a nice and quick way of constructing a "random" number. The obvious way of doing it via the C library is a bit slow. We also don't need things to be particularly random, just enough that the bad cache behaviour becomes rare. So instead, we use the bits within a pointer to construct a hashed value. By sourcing those pointers from the operating system, we get a relatively good amount of randomness:

Synchronization

The Lockless Memory Allocator uses a wait-free technique to free memory to remote threads. Each thread has a queue of freed memory as a linked list. When it runs out of local memory, it scans its queue to see if any remote threads have returned some to it. There is one subtle problem with this technique. What do you do when a remote thread exits? Someone needs to reclaim its queue. (A thread may exit after allocating memory, but before a second thread frees it.)

The problem is that if a queue goes away, some other thread could be in the middle of trying to free memory to it. This is obviously bad, so some sort of synchronization is needed to prevent the garbage collection of dead threads from being a problem. One obvious way to fix the problem is to use a lock. If a thread is dying, prevent all frees to it until it is completely dead. However, the Lockless Memory Allocator tries to avoid synchronization like this if at all possible. Instead, we use a different technique.

The Lockless Memory Allocator uses a "hazard pointer" to show when a thread is currently trying to free to a queue. When a thread garbage collects anothers data, it updates all the references to the dead queue to point to its own. So a thread in the middle of freeing something needs to watch for the queue changing. This is what the hazard pointer is for. It acts like a temporary hashed-lock on the queue:

Once no threads refer to the dead queue, we can then garbage collect it. This is what the middle of reap_dead() does:

This means that we only need to lock thread creation and destruction around the addition and removal from the hazard pointer list. Since the operating system and C library also use locks around thread creation and destruction, the amount of extra synchronization overhead is negligible. The result is that the vast majority of the time malloc() and free() will use no locks at all.

Microsoft Windows Hacks

Getting the allocator working on Microsoft Windows presented unique challenges. The biggest problem is that many of the things that are taken for granted in the Unix world do not exist there, or exist in a fundamentally different way. For example... how do you determine when a thread exits? If you are in a DLL, then you can use the "reason" part of the arguments to DllMain(). If you aren't in a DLL... well you are kind of out luck with documented functions.

Fortunately, there exists an undocumented function that does what we want. By using __tlregdtor(), we can be notified on a thread exiting. This allows us to clean up the memory allocated by that thread, and then put its data on the "dead list" for later garbage collection. (The corresponding functionality on Linux is handled by the documented behaviour of the pthread_key_create() function.)

However, the biggest issue with the allocator on windows is that the linker doesn't really support overloading by symbol like it does in the Linux world. If we make a function called "malloc()", that doesn't mean that that function will actually be called by other code. In particular, the CRT library will use its own memory allocation routines, and there is no way to easily override them.

What we have to do is manually link the internal calls within the CRT to point to the allocation routines within the Lockless Memory Allocator. This is a well-known technique in the malware universe, and it is unfortunate that we have to stoop to using it as well. Basically, we patch in a jump instruction at the start of every hooked routine, and point it to our replacement:

The trick is finding all of the places we need to hook. The easy ones are the functions that are exported by the CRT. We can use the linker to tell us their address, and then patch each one in turn. However, there is one called _calloc_impl() which isn't exported. This function implements the core memory allocation action within the CRT. If we don't hook it, then memory allocated by the CRT on program start will then cause problems at program exit. In short, we will get a segfault when the program ends.

To fix this particular problem, we need to find out where _calloc_impl() is hiding in memory. One way to do this would be to do some sort of memory-search for a byte pattern within it. However, that is unreliable. We may accidentally end up patching the wrong thing. Instead, what the Lockless Memory Allocator does is notice that one of the things that calloc() does is call _calloc_impl(). If we trace the code within calloc(), we can find the call, and then get the address we require. This is why init_crt_funcs() does what it does.

Summary

The Lockless Memory Allocator uses all sorts of tricks to try to increase performance. (Or in the case of the Microsoft Windows port, simply to work at all.) Some of these tricks are things you probably wouldn't want to do in "clean" code. However, when speed is the ultimate goal, sometimes we need to bend the rules a little bit. The resulting code can be a little bit messy and non-standard, but the results speak for themselves.

Comments

david z said...

While this looks fairly cool for server work, it doesn't obviously address the problems we have on embedded systems:

We have threads which talk to devices and are quiescent for long periods, then very busy acquiring large amounts of data and transferring it into a set of worker threads which eventually free that data. In general the total amount of work across all devices is self-limiting, but the balance across the devices can shift radically.

The net effect we discovered is that each of the device threads ends up with a maximally sized arena, even when quiescent. This puts too much RAM pressure on the system, even though we do have plenty to actually handle the amount allocated at any one time.

Linux's ml-alloc system adds a little fun to this process by shuffling the arenas around. Apparently it locks the source arena when it frees, and if the source thread concurrently allocs it grabs a completely new arena rather than blocking. However multiple threads freeing to the same arena block each other, that's generally fairly "fast".

Of course I wrote something to get us out of this hole when we discovered it, but we'd love to have a 3rd party product that is properly maintained.

sfuerst said...

The problem you have seems to be that thread A allocates some memory, and then passes it to thread B. Sometime later, thread B then frees it. When it does, it grabs the arena corresponding to that memory. Later, thread A sends memory to thread C, and the same things happens... and so eventually all threads end up with maximally sized arenas.

The allocator described here doesn't work like that. It doesn't have arenas. When a thread frees "remote" memory, that memory is sent back to the thread that originally allocated it. This means that when thread A wants to allocate again, it will reuse the blocks freed by B, even if B is concurrently releasing them. The end result should be that this works very well with producer-consumer type code, with the total memory allocated being roughly equal to the amount "in flight" at that time.

Of course, there is indeed a pathological example where this allocator does perform horribly: If each thread allocates memory, sends it to a remote thread, and then never calls any allocation functions again; If those remote threads free that memory, it will never be properly released. Fortunately, this pattern doesn't happen very often. If a thread has allocated in the past, it tends to do so in the future as well.

Mina said...

I'm ipemrssed! You've managed the almost impossible.

Mica said...

Hi,

Is LLAlloc a wait-free mem allocator? or lock-free?
To my knowledge, there's no wait-free allocator out there.

Regards
Mica

Thomas Chan said...

compile error:
ll_alloc.c: In function fast_alloc:
ll_alloc.c:4197: warning: ISO C90 forbids mixed declarations and code
ll_asm.h: Assembler messages:
ll_asm.h:158: Error: bad register name `%sil'
ll_asm.h:158: Error: bad register name `%sil'
make: *** [ll_alloc.o] Error 1

said...

Enter your comments here