Optimization Tricks used by the Lockless Memory Allocator
With the release of the Lockless Memory Allocator under the GPL version 3 license, we can now discuss more of the optimization tricks used inside it. Many of these are things you wouldn't want to use in normal code. However, when speed is the ultimate goal, sometimes we need to break a few rules and use code that is a little sneaky.
A slab is a well-known technique for allocating fixed-size objects. For a given object size, a chunk of memory is divided up into smaller regions of that length. Since all the interior objects have the same size, fragmentation is eliminated. An allocation can fail only when every single sub-region is already in use.
Since a generic allocator doesn't know how many objects of a given size will eventually be allocated, it doesn't know how large to make the chunk of memory for a slab. This means that a trade-off is made, and those chunks are set to a given size, and there can be many of them for a given object length.
Thus the algorithm for slab allocation is quite simple. First, look for a chunk that contains objects of the size you want. Second, grab a free object from that chunk. By using linked lists, both operations can be made O(1). This simplicity and lack of algorithmic overhead mean that slab allocations are very fast. The only disadvantage of slabs is that they can take up too much memory. Beyond a certain object size, other allocation techniques are better due to being more compact.
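As a rough illustration, the slab fast path can be sketched like this (illustrative structures and names, not the allocator's actual ones):

```c
#include <stddef.h>

/* Hypothetical sketch of a slab fast path: each chunk threads a singly
 * linked free list through its unused objects, and chunks with free
 * space hang off a per-size-class list. */
struct chunk {
    struct chunk *next;   /* next chunk with free objects, or NULL */
    void *free_list;      /* first free object; each free object holds a
                             pointer to the next in its first word */
};

void *slab_alloc(struct chunk **chunks)
{
    struct chunk *c = *chunks;            /* O(1): any listed chunk has space */
    if (!c)
        return NULL;                      /* no chunk: fall back to slow path */
    void *obj = c->free_list;             /* O(1): pop the first free object */
    c->free_list = *(void **)obj;
    if (!c->free_list)
        *chunks = c->next;                /* chunk now full: unlink it */
    return obj;
}

void slab_free(struct chunk **chunks, struct chunk *c, void *obj)
{
    if (!c->free_list) {                  /* chunk was full: relink it */
        c->next = *chunks;
        *chunks = c;
    }
    *(void **)obj = c->free_list;         /* O(1): push the object back */
    c->free_list = obj;
}
```

Both paths touch only a couple of cache lines, which is where the speed comes from.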
So if slabs are so simple, how come the Lockless Allocator manages to be faster than other allocators that use them for small allocations? There doesn't seem to be much room for improvement.
The first trick is to notice that many objects of slightly different sizes will use the same slab. Because the ABI requires that all allocations be 16-byte aligned, objects with sizes from, say, 33 to 48 bytes can all be placed in the 48-byte slab. So what is the fastest way of converting from the allocation size to a linked-list pointer to the correct chunk?
The Lockless Memory Allocator does it on 64-bit machines in three instructions. The first is an addition of 15 to offset the alignment. The second is a logical-and to clear the lower four bits. Finally, a simple memory load of the result, offset from the start of the array of list pointers to chunks, produces what we are after. The reason this works is that the size of a doubly-linked list head is 2×8=16 bytes, the same as the alignment constraint. By noticing that, we can avoid some left and right shift instructions by doing the pointer arithmetic "manually".
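In C the trick looks something like this (a sketch; the array size and names are ours):

```c
#include <stddef.h>
#include <stdint.h>

/* A doubly-linked list head is exactly 16 bytes on a 64-bit machine,
 * the same as the alignment granularity. */
typedef struct dlist { struct dlist *next, *prev; } dlist;

/* One list of chunks per 16-byte size class (sizes 0, 16, ..., 496 here). */
static dlist slab_lists[32];

static dlist *size_to_list(size_t size)
{
    size_t off = (size + 15) & ~(size_t)15;      /* add 15, clear low 4 bits */
    return (dlist *)((char *)slab_lists + off);  /* rounded size IS the byte
                                                    offset: no shifts needed */
}
```

Note how a size of zero lands on list 0; keeping that list empty is exactly what diverts zero-sized allocations to the slow path, as described below.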
The second trick is to notice that allocating something of size zero is special. The obvious way to handle this is to have a conditional somewhere in the above function. However, since zero-sized allocations are rare, we don't want the code handling them in the fast path. The Lockless Memory Allocator avoids this simply by keeping the chunk list for zero-sized allocations empty. The check for an available chunk will then fail (which could happen anyway), and we fall back to the slow-path function.
To store medium-sized free regions of memory, the Lockless Memory Allocator uses a btree data structure. Inside it, the free regions are sorted by size. This allows O(log(n)) lookup of the best-fit region for a given allocation request.
The standard way of constructing a btree is to use an array of n pointers and size-keys at each level in the tree, corresponding to a branch factor of n. Inserting a node into the tree then requires a recursive set of inserts into those arrays. Conversely, a deletion requires a recursive set of array deletions. However, inserting and deleting at arbitrary locations within an array is slow, taking time proportional to the length of that array. Typically the branch factor is relatively small, making this not too bad.
The Lockless Memory Allocator uses a different technique. It uses arrays... but also manages to embed a linked list within one of them. By using the low-order byte of the size information to store which btree branch follows a given branch, we can convert insertion and deletion into O(1) operations at each level in the tree. Also, by overlapping the size and "next" information, we reduce memory fetches in the important lookup operation. (The extra information doesn't alter the sort order, so we don't even need to mask it off.) The resulting search operation is extremely compact:
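A simplified model of the idea (our layout, not the allocator's exact node format): each key stores the region size in its upper bits and the array index of the next-larger entry in its low byte, with entry 0 doubling as the list head and end-of-list marker.

```c
#include <stdint.h>

#define BRANCH 16

struct bnode {
    uint64_t key[BRANCH];   /* size << 8 | index of next entry in sorted order */
};

/* Best-fit search: return the index of the first entry whose key is at
 * least `want` (a size shifted into key position), or 0 if none fits. */
static unsigned bt_search(const struct bnode *n, uint64_t want)
{
    unsigned i = (unsigned)(n->key[0] & 0xff);  /* head: smallest entry */
    while (i && n->key[i] < want)               /* link bits only make a key a
                                                   tiny bit "larger", so the
                                                   sort order is preserved and
                                                   no masking is needed */
        i = (unsigned)(n->key[i] & 0xff);
    return i;
}
```

The loop does one comparison and one masked load per entry, with no pointer array to chase.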
Another trick comes from noticing that since we are using linked lists within the btree size-information arrays, we need some way to denote the beginning and end of that list. We choose to use location zero in the array for that. By using an "intrinsic" list with a header at location zero, we can use the spare bytes of the size-key value there for other purposes. Since we use up one byte for the "next" pointer, we have three other bytes. In them we store the index of this node in the parent node in the btree, and the bit-mask of the unused branches in the size-array. It may not seem like much, but by keeping everything compact, we save on cache misses and increase speed.
The Fast Lists
Most medium sized allocations will hit the fast-lists instead of travelling into the btree code. This is a cache of recently freed allocations that we can quickly return when required. Since many applications allocate and free the same sizes over and over again, this is a big win. The trick here is trying to make this look-aside cache as fast as possible.
The first trick is deciding how to store the differing sizes of free chunks of memory in this cache. What we want is some set of lists that cover a logarithmically distributed set of sizes. The difficulty is coming up with a fast way of converting the free-size into the index for the cache. One obvious way to do it is a set of nested if statements doing some kind of binary search. However, we are a little more subtle.
In 32-bit mode we use the floating-point hardware to do the conversion for us. By converting the size as an integer into a float, and then reading the bits within that float back as an integer, we can trick the processor into performing a sort of pseudo-log operation for us. If we choose the right constants, we get a nice distribution that neatly covers the range of sizes we are interested in.
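A sketch of the pseudo-log trick (the shift of 21 is an illustrative choice, keeping the exponent plus the top two mantissa bits):

```c
#include <stdint.h>
#include <string.h>

/* Convert the size to a float, then read the float's bits back as an
 * integer.  The exponent field behaves like a log2, and the top mantissa
 * bits subdivide each octave, giving smooth logarithmic buckets. */
static unsigned size_to_index(unsigned size)
{
    float f = (float)size;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* type-pun without aliasing trouble */
    return bits >> 21;               /* exponent + 2 mantissa bits */
}
```

In real use one would subtract a base constant so the smallest interesting size maps to index 0; the raw value is shown here to keep the bit pattern visible.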
We could do a similar trick in 64-bit mode. However, since those machines tend to have more memory, the distribution of allocated and freed memory needs to stretch to higher values. Also, the instructions required to move data into and out of the floating-point units are relatively slow. Instead, we use the find-last-set bit-hack via the processor's bit-scan instruction.
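The 64-bit variant can be sketched with the GCC/Clang find-last-set builtin, which compiles down to a single bit-scan instruction (this assumes sizes of at least 4 bytes, which holds for real allocation requests):

```c
/* Logarithmic bucketing without the FPU: the position of the highest set
 * bit plays the role of the float exponent, and the two bits below it
 * refine the bucket, as with the float trick. */
static unsigned size_to_index64(unsigned long size)
{
    unsigned top = 63 - (unsigned)__builtin_clzl(size);      /* find last set */
    return (top << 2) | (unsigned)((size >> (top - 2)) & 3); /* refine bucket */
}
```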
There remains one particularly evil trick in the fast-lists code. We go to great lengths here to avoid constructing a stack frame. However, there is one spot where the compiler needs one more register than is freely available, and spilling to the stack would force a frame into existence.
Since we know exactly what code we are calling, we know when and where floating-point instructions are used. It turns out that in the one key situation where an extra register is needed, no possible floating-point code can be called. This allows us to cheat, and store information in a floating-point SSE register. Normally this would be slower than using the stack. However, in this one case the reverse is true. The result is that we use some inline asm in a particularly hackish way:
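A sketch of the idea (x86-64 with SSE2; the register choice and names are ours, and a plain memory fallback is included so the sketch compiles elsewhere):

```c
#include <stdint.h>

static uint64_t stash_slot;  /* fallback for non-x86-64 builds of this sketch */

/* Park a value in xmm7 instead of spilling it to the stack.  This is only
 * safe because we know nothing on this code path touches the SSE registers. */
static inline void stash(uint64_t v)
{
#if defined(__x86_64__)
    __asm__ volatile("movq %0, %%xmm7" : : "r"(v) : "xmm7");
#else
    stash_slot = v;
#endif
}

static inline uint64_t unstash(void)
{
#if defined(__x86_64__)
    uint64_t v;
    __asm__ volatile("movq %%xmm7, %0" : "=r"(v));
    return v;
#else
    return stash_slot;
#endif
}
```

Register-to-register `movq` avoids the store/load round trip through memory, which is why it wins here.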
It might sound strange, but it is useful to use random numbers within a memory allocator. By slightly randomizing the layout of structures in memory, we can avoid cache-aliasing. This occurs when multiple objects are placed a multiple of a power of two apart in memory. If those objects are rapidly used in succession, the internal cache of the processor can mistakenly evict memory that will be soon used again. This happens because the processor uses the middle bits of the address of a piece of memory as a sort of hash into its cache hardware data structure. Since only a relatively small number of chunks of memory are allowed to share the same hash, if we try to access more than that within a small amount of time, performance plummets.
Knowing this, the Lockless Memory Allocator tries to "cache colour" the slab by not always aligning the objects nicely against the header. Since there is usually some slack space, we can shift some of it from the end of the large chunk to the beginning to alter the layout. Similarly, the per-thread memory used for storing the allocator state is also randomly aligned.
To do this, we need a nice and quick way of constructing a "random" number. The obvious way of doing it via the C library is a bit slow. We also don't need things to be particularly random, just enough that the bad cache behaviour becomes rare. So instead, we use the bits within a pointer to construct a hashed value. By sourcing those pointers from the operating system, we get a relatively good amount of randomness:
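A sketch of such a pointer hash (the mixing constant is MurmurHash3's 64-bit finalizer; any good mixer would do, and the colouring helper is our illustration):

```c
#include <stdint.h>

/* Hash the bits of an address (e.g. one handed to us by mmap) into a
 * "random enough" value.  Determinism is fine; we only need to break up
 * power-of-two layout patterns. */
static unsigned ptr_hash(void *p)
{
    uint64_t x = (uint64_t)(uintptr_t)p;
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;  /* MurmurHash3 fmix64 constant */
    x ^= x >> 29;
    return (unsigned)x;
}

/* Pick a 16-byte-aligned colour offset that fits in the slack space. */
static unsigned cache_colour(void *p, unsigned slack)
{
    return (ptr_hash(p) % (slack / 16 + 1)) * 16;
}
```

The offset shifts each slab's objects by a different amount, so hot objects no longer collide in the same cache sets.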
The Lockless Memory Allocator uses a wait-free technique to free memory to remote threads. Each thread has a queue of freed memory as a linked list. When it runs out of local memory, it scans its queue to see if any remote threads have returned some to it. There is one subtle problem with this technique. What do you do when a remote thread exits? Someone needs to reclaim its queue. (A thread may exit after allocating memory, but before a second thread frees it.)
The problem is that if a queue goes away, some other thread could be in the middle of trying to free memory to it. This is obviously bad, so some sort of synchronization is needed to prevent the garbage collection of dead threads from being a problem. One obvious way to fix the problem is to use a lock. If a thread is dying, prevent all frees to it until it is completely dead. However, the Lockless Memory Allocator tries to avoid synchronization like this if at all possible. Instead, we use a different technique.
The Lockless Memory Allocator uses a "hazard pointer" to show when a thread is currently trying to free to a queue. When a thread garbage collects another's data, it updates all the references to the dead queue to point to its own. So a thread in the middle of freeing something needs to watch for the queue changing. This is what the hazard pointer is for. It acts like a temporary hashed-lock on the queue:
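A minimal sketch of the handshake (our names, not the allocator's; the real code keeps one hazard pointer per freeing thread rather than a single global):

```c
#include <stdatomic.h>
#include <stddef.h>

struct queue { _Atomic(void *) head; };

static _Atomic(struct queue *) owner_queue;  /* where frees for a thread go;
                                                a reaper may redirect this */
static _Atomic(struct queue *) hazard;       /* queue we are about to touch */

static struct queue *acquire_queue(void)
{
    struct queue *q;
    do {
        q = atomic_load(&owner_queue);
        atomic_store(&hazard, q);            /* announce: "I am using q" */
        /* Re-check: a reaper may have redirected owner_queue in between. */
    } while (q != atomic_load(&owner_queue));
    return q;                                /* q stays valid while hazard == q */
}

static void release_queue(void)
{
    atomic_store(&hazard, NULL);             /* the reaper may now reclaim q */
}

/* The reaper side: a dead queue may be freed once no hazard points at it. */
static int queue_unreferenced(struct queue *dead)
{
    return atomic_load(&hazard) != dead;
}
```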
Once no threads refer to the dead queue, we can then garbage collect it. This is what the middle of the queue-reclamation code does.
This means that we only need to lock thread creation and destruction around the addition and removal from the hazard-pointer list. Since the operating system and C library also use locks around thread creation and destruction, the amount of extra synchronization overhead is negligible. The result is that the vast majority of the time, allocations and frees proceed without taking any locks at all.
Microsoft Windows Hacks
Getting the allocator working on Microsoft Windows presented unique challenges. The biggest problem is that many of the things that are taken for granted in the Unix world do not exist there, or exist in a fundamentally different way. For example... how do you determine when a thread exits? If you are in a DLL, then you can use the "reason" argument passed to DllMain(). Statically linked code gets no such notification.
Fortunately, there exists an undocumented function that does what we want. By using it, we can arrange to be notified as a thread shuts down.
However, the biggest issue with the allocator on Windows is that the linker doesn't really support overriding by symbol like it does in the Linux world. If we make a function called "malloc()", that doesn't mean that that function will actually be called by other code. In particular, the CRT library will use its own memory allocation routines, and there is no way to easily override them.
What we have to do is manually link the internal calls within the CRT to point to the allocation routines within the Lockless Memory Allocator. This is a well-known technique in the malware universe, and it is unfortunate that we have to stoop to using it as well. Basically, we patch in a jump instruction at the start of every hooked routine, and point it to our replacement:
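The patch itself can be sketched as follows (shown with Linux's mprotect() so the sketch is self-contained; the Windows version uses VirtualProtect() the same way, and the stub names are ours):

```c
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Overwrite the first five bytes of `from` with "jmp rel32" so every call
 * lands in `to` instead (x86/x86-64 only). */
static void hook(void *from, void *to)
{
    long page = sysconf(_SC_PAGESIZE);
    void *start = (void *)((uintptr_t)from & ~((uintptr_t)page - 1));
    /* The five patch bytes may straddle a page boundary: unprotect two. */
    mprotect(start, 2 * page, PROT_READ | PROT_WRITE | PROT_EXEC);
    int32_t rel = (int32_t)((intptr_t)to - ((intptr_t)from + 5));
    unsigned char *p = from;
    p[0] = 0xE9;                      /* jmp rel32 opcode */
    memcpy(p + 1, &rel, sizeof rel);  /* relative displacement */
    mprotect(start, 2 * page, PROT_READ | PROT_EXEC);
}

#if defined(__i386__) || defined(__x86_64__)
/* Two toy routines to demonstrate the hook on ourselves. */
__attribute__((noinline)) static int crt_malloc_stub(void) { return 1; }
__attribute__((noinline)) static int our_malloc_stub(void) { return 2; }

static int demo(void)
{
    int (*volatile f)(void) = crt_malloc_stub;  /* defeat constant folding */
    int before = f();
    hook((void *)crt_malloc_stub, (void *)our_malloc_stub);
    return before * 10 + f();
}
#else
static int demo(void) { return 12; }  /* the patch is x86-specific; skipped */
#endif
```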
The trick is finding all of the places we need to hook. The easy ones are the functions that are exported by the CRT. We can use the linker to tell us their addresses, and then patch each one in turn. However, one of the internal allocation routines is not exported at all, so its address cannot be found this way.
To fix this particular problem, we need to find out where that internal routine lives inside the CRT, and then patch it as well.
The Lockless Memory Allocator uses all sorts of tricks to try to increase performance. (Or in the case of the Microsoft Windows port, simply to work at all.) Some of these tricks are things you probably wouldn't want to do in "clean" code. However, when speed is the ultimate goal, sometimes we need to bend the rules a little bit. The resulting code can be a little bit messy and non-standard, but the results speak for themselves.
Copyright © Lockless Inc All Rights Reserved.