How does it work?
The Lockless memory allocator uses several different algorithms to optimize speed and minimize memory fragmentation.
The smallest allocation sizes are handled by a slab allocator. The slab is constructed from 64KiB chunks that are traded between parallel threads as they are filled and emptied. To minimize allocation memory overhead, these blocks are adjacent to one another, with no separator fields between them. The total overhead for the slab is 128 bytes per 64KiB, plus a small amount of internal fragmentation averaging half an allocation bin size. This is a maximum of about 1% overhead, dropping to a fraction of a percent for the smaller allocation sizes.
There are many slabs, one for each block bin size up to 512 bytes. The bins increase in increments of 16 bytes. These slabs are allocated dynamically for each thread, so only the allocation-size bins actually used incur the overhead of a 64KiB chunk.
The slab allocations are aligned on 16-byte boundaries, which provides the optimal alignment for SSE code. As a special case, 64-byte slab allocations are aligned to 64 bytes, the cache-line size. This allows optimal message-passing implementations to be created, with fine-grained control over exactly how many cache-line transfers a synchronization algorithm uses. To reduce TLB aliasing, the offsets of the slab blocks are randomized within the slack space created when n blocks do not fit exactly into a 64KiB chunk.
The slab allocator uses sbrk() to requisition address space from the operating system. This results in the slab allocations being contiguous in memory. (On Windows, sbrk() is emulated through calls to VirtualAlloc().) The contiguous address-space range allows efficient determination of whether or not a block passed to free() is part of the slab.
The slab uses per-CPU locking for yielding and acquiring empty slabs between threads, which improves scalability over earlier versions of the Lockless memory allocator. On Windows it now uses the faster slim reader-writer locks instead of critical sections; on Linux, it uses fast mutexes based directly on futexes, rather than the slower pthread mutexes.
Allocations above 512 bytes are forwarded to another allocator. This allocator uses mmap() (or VirtualAlloc() on Windows) to obtain chunks of up to 256MiB in size from the operating system. (The exact sizes depend on the conflicting demands of reducing address space usage, and decreasing internal fragmentation.) These chunks are then sub-divided using extents that delineate every allocation. Separating every allocation are 8-16 bytes of information that allow the allocator to find the next and previous blocks.
The overhead of an allocated block is 8 bytes, and that of a free block 16 bytes. This, combined with a natural alignment of 16 bytes for all allocations, means that the block sizes in this data structure are 16n + 8 bytes in usable size. Since this method is only used for allocations above 512 bytes, the overhead is again less than a couple of percent.
The free blocks in all of these multi-megabyte chunks are stored within a btree. The btree allows best-fit lookup at a cost of log(n) time. Since the btree may need to allocate memory for itself during insert operations, we require that all blocks referenced by the btree be larger than two btree nodes. Blocks smaller than this must therefore be stored in some other data structure. To do this, we use a table of doubly-linked lists, one for each size down to 24 bytes. Finally, 8-byte blocks, which may rarely be created through the use of memalign() and similar functions, are handled by using spare bits in the separator as flags.
The log(n)-time btree lookup is fast, but to improve speed even more there is a cache in front of the btree algorithms. This cache is implemented as a series of fast singly-linked look-aside lists, separated into size bins. Each look-aside list implements a last-in, first-out stack for its bin. This improves memory locality, as recently freed blocks are given out to new allocations first.
As allocations are created and freed, the btree allocator will obtain and release megabyte-sized chunks from the operating system. Note that a whole chunk needs to be unused for it to be released. This can cause memory to be retained if a program allocates some memory and only frees part of it. The retained memory will be reused by future allocations. Under typical usage patterns this effect is small, due to the periodic coalescing of the allocation cache.
The largest allocation sizes (those above 256MiB, or those much larger than the total allocated so far) are obtained directly from the operating system and are handled at page-size granularity. Since the allocation size is so large, the overhead of this granularity is extremely small. Unlike the other allocation sizes, these large allocations are not recorded in per-thread data structures. This means that they may be freed or reallocated without synchronization overhead. It also means that the malloc_stats() function call will ignore them in its statistics calculations.
The smaller allocation data structures (the slab and the btree) are per-thread. This means that if one thread allocates something on a slab, and it is freed in a second thread, this other thread needs to communicate with the first. Conversely, if you free something allocated in the same thread, no synchronization overhead is required. The second case is much more common in most programs.
The synchronization method used in the Lockless memory allocator is a wait-free queue, with one queue for each thread. A thread freeing a block owned by another thread will place the block on the correct queue and quickly return. The owning thread will eventually empty its queue when it is under memory pressure. In common situations, one thread will be a "producer" allocating memory, and another a "consumer" freeing it. This case is handled extremely efficiently by the Lockless memory allocator through the queues. Note that other allocators may suffer from "memory blow-up", as memory freed at the consumer may not be reused by the producer.
The main situation where the Lockless memory allocator will use locks is when a thread is born or dies. Since the operating system and C library also have locking to protect their internal data structures in this case, the extra locking has minimal performance impact. The dead thread's memory can be used by newly born threads, and also by other threads currently under memory pressure. This use of locks means that the pthread_atfork() API is used on Linux. Thus fork() may not be called within a signal handler, if the signal handler could ever interrupt something that allocates memory. Note that the glibc memory allocator has the same limitation. Finally, note that even though the Lockless memory allocator uses wait-free algorithms for synchronization, it is not async-signal-safe. Per-thread data structures may be in an inconsistent state midway through a call to the allocator.
Copyright © Lockless Inc All Rights Reserved.