Futex Cheat Sheet

Futexes are an important part of the multithreading API in the Linux kernel. They allow userspace code to efficiently implement synchronization primitives and thus simplify the job of exploiting parallelism. The name "Futexes" come from a contraction of "Fast Userspace Mutexes", which describes the original goal of the API - simply to implement the mutex primitive. However, over time it has expanded into a generic interface allowing the manipulation of wait-queues of sleeping threads and processes.

The Futex API is unfortunately poorly documented. Many of the more recent extensions are not discussed in the man pages, and source-diving is required to understand what is available and how it should be used. Fortunately, the hard work of using it to construct a more portable interface in the form of the glibc pthreads library has already been done. Most programmers should use the pthreads interface. However, those aiming for ultimate performance need to explore "under the covers" and construct their own threading primitives.

Nearly all of the Futex API is exposed via the sys_futex(). This uses the second parameter to multiplex into several related operations. Currently, there is a total of thirteen of these, together with a couple of extra option flags which increase the possibilities further. The system call takes six arguments, but not all of these are required for every operation. Since Futexes are a low-level interface, the programmer is expected to be able to use assembly language and call the API avoiding the overhead of setting the unneeded values.

Due to this, no explicit function is exposed by the futex.h header file. However, it is possible to construct our own, ignoring the overhead of the extra unneeded parameters:

In general the first parameter addr1 allows the kernel to determine which wait-queue to use. (It is used as a key in a hashtable of queues.) As described above, the second parameter is the multiplexer. The third parameter is an argument to the multiplexing function. The fourth is usually a timeout, but in some cases may be a second integer argument. (This is an example where the slow accretion of features to the Futex API is noticable.) The fifth parameter, addr2 is used to refer to a second wait-queue when required, and finally the sixth parameter is yet another argument to the multiplexing function, but is often ignored.

FUTEX_WAIT

The first multiplexed function option tells the current thread to go to sleep on a wait-queue. The trick here is that you only want to sleep if you know you are going to be woken up. (Not being woken up is a source of dead-lock.) Thus the sleeping thread needs to atomically check that it should do so. In order to do this, the val1 parameter is used. Only if the 32 bit integer pointed to by addr1 has the value in val1 will the thread be put to sleep on the addr1 wait-queue. Thus some other thread can alter that value to prevent the first thread from going to sleep and avoiding the lost-wakeup bug that may occur in some other APIs.

The timeout parameter points to the a structure that describes how long the thread should sleep. Usually, this is NULL, signifying that the thread should sleep forever. The other parameters are not used, and an efficient assembly-based call of this Futex function can use them to store other information, since the kernel ABI preserves the corresponding registers.

Note how the above sets the mutex to a known value, and then checks for that value within the FUTEX_WAIT call.

FUTEX_WAKE

The reverse process to the above is waking a thread (or multiple threads) currently sleeping on a wait-queue. This is the use of the FUTEX_WAKE multiplex function. This function is extremely simple, just taking the address corresponding to the wait-queue in addr1 and the number to wake in val1. If you wish to wake all the sleeping threads, simply pass INT_MAX to this function.

Together with the previous multiplexing function an efficient mutex implementation can be created. Such a mutex only enters kernelspace when there is contention. So in the common case, where the lock is uncontended only userspace atomic instructions are needed. This increases performance due to the fact that relatively slow system calls are not required.

However, there are some subtle issues with constructing a mutex if there are multiple threads executing concurrently. Note how the above code fragment unconditionally sets the mutex into the used (1-bit set), and contended (256-bit set). This is because there may be more threads sleeping in the wait-queue. This causes a small inefficiency where the wake-up code may call into the kernel more than strictly necessary, but this is much better than having missed wake-ups and dead-lock.

FUTEX_FD

This multiplex function was designed to construct a file descriptor to correspond to a wait queue. A thread could then use the normal select/poll/epoll file based interfaces to wait for a FUTEX_WAKE event. Unfortunately this suffered a fatal flaw. Note how the val1 argument which was necessary in FUTEX_WAIT is not available. This means that missed wakeups are unavoidable. For this reason, this part of the Futex API is depreciated, and in fact removed on recent kernels. Something which can partially replace it, is the eventfd() API, which doesn't suffer the missed-wakeups problem.

FUTEX_REQUEUE

The Futex API allows more control over the wait-queues than just adding and removing sleeping threads. Sleeping threads can also be moved between two different wait queues. This provides an important optimization for pthreads condition variables.

The typical usage of a condition variable requires using a mutex to lock the access to the condition being tested. If a thread is waiting for the condition to become true, then it sleeps on condition variable's wait queue. However, a "broadcast" event may wake up all such sleepers. They will then need to acquire the mutex in order to test the condition. Since being woken up simply to fall back asleep on the lock's wait-queue is inefficient, it is much better to transfer the majority of the waiters onto the lock's wait-queue. This multiplex function provides this operation.

It is similar to the FUTEX_WAKE operation, with addr1 referring to the wait-queue to transfer sleepers from. The val1 argument describes how many of the waiters to wake rather than transfer. The timeout argument is overloaded as the number of waiters to transfer. If this is zero, then the operation is identical to FUTEX_WAKE. Typically this number will be 1, to just move one, or INT_MAX to transfer all waiters. Finally, the addr2 argument describes the destination Futex, where the waiters will end up on.

See the condition variable implementation article for an example usage of this multiplex function.

FUTEX_CMP_REQUEUE

The mutex and condition variable implementations within glibc are quite different than those described here. They are much more complex because they support many more features, such as testing for deadlock, and recursive locking. Due to this, they have an internal lock protecting the extra state. This extra lock means that they cannot use the FUTEX_REQUEUE multiplex function due to a possible race.

Note that the condition variable functions are not guaranteed to be called with the mutex in the locked state, so the mutex lock cannot be used to synchronize condition variable internal state. This can greatly increase the complexity of the signal and broadcast functions. The way glibc solves this problem is to use multiple sequence numbers for the condition variable operations. By using the comparison part of the FUTEX_CMP_REQUEUE operation, it can test to see if some other operation raced with it. If so, the operation can be retried.

The arguments to this multiplex function are identical to those to FUTEX_REQUEUE, with the addition of val3 which contains the expected value of the integer at the address of the Futex (addr1). Like FUTEX_WAIT, if the integer does not match, this function will fail, returning EAGAIN.

FUTEX_WAKE_OP

This multiplexed function extends FUTEX_WAKE. It was created to fill yet another bug in the glibc condition variable implementation. However, that algorithm has since changed, and this function is no longer used by glibc. You probably shouldn't use this function, and instead change your threading primitive algorithm to not require it just as glibc has.

This function is rather complex, updating the integer at the memory location pointed by addr2, together with conditionally waking up waiters. The futex.h header file describes this operation as:

The rest of the arguments are similar to FUTEX_WAKE. addr1 refers to the Futex wait-queue where sleepers are woken. val1 describes how many sleepers to unconditionally wake. Again, the timeout argument is overloaded, this time describing the maximum number of sleepers to be conditionally woken when the operation succeeds. The addr2 argument is used in the atomic wakeup decision described above, and finally val3 encodes this operation.

FUTEX_WAIT_BITSET and FUTEX_WAKE_BITSET

A recent addition to the Futex API that hasn't been documented by the man pages are the bitset multiplex functions. These add a new argument to the FUTEX_WAIT and FUTEX_WAKE operations, stored in val3. The new detail acts as a bitmask.

The normal FUTEX_WAIT and FUTEX_WAKE operations act like this mask has all bits set. The FUTEX_WAKE_BITSET operation will only wake threads that have a bit set after the expression mask_wait & mask_wake is evaluated. Thus the FUTEX_WAIT_BITSET operation allows waiters that can ignore some wakeup requests, and FUTEX_WAKE_BITSET operation allows one to choose which subset of waiters to wake.

The original reason for the addition of these extensions was to improve the performance of pthread read-write locks in glibc. However, the pthreads library no longer uses the same locking algorithm, and these extensions are not used without the bitset parameter being all ones.

The bitset functions effectively allow up to 32 different wait queues to share a single Futex location in addr1. However, using the functionality like this is inefficient. It is much better to simply have a larger lock object, and several Futex locations within it. That way only the waiters that you are interested in need to be scanned for wakeups.

Another difference between FUTEX_WAIT_BITSET and FUTEX_WAIT is that the latter uses a relative timeout, and the former an absolute timeout. This makes FUTEX_WAIT_BITSET much more useful than it would otherwise be. Note how the kernel expects a relative timeout like all the other sleeping syscalls (select, poll, epoll) for FUTEX_WAIT. However, the pthreads library requires the timeouts to be absolute.

This means that a thread that would like to only wait a few seconds must call the kernel three times. The first time to determine what time it is to construct the absolute timeout, the second time to convert back to relative time within the pthreads library, and then thirdly to finally make Futex syscall. Using "vsyscalls" speeds up the first and second operations, but it would be nice to eliminate one of these redundant time checks altogether.

Thus implementing functions like pthread_mutex_timedlock() is easier using FUTEX_WAIT_BITSET to directly use the absolute timeout passed from the user. Even though the bitsets are not used any more within glibc, FUTEX_WAIT_BITSET is, because of how it is indirectly useful with respect to timeouts.

Priority Inheritance

The new priority inheritance multiplex functions are undocumented in the Futex man pages. However, they are described in pi-futex.txt and futex-requeue-pi.txt within the Linux Kernel documentation directory. These functions were created due to the problem of priority inversion with normal locks.

Imagine there is a high-priority thread. This thread wants to obtain a lock. However, that lock could be already owned by low-priority thread. If that low-priority thread is not run due to other medium-priority threads existing, then the high-priority thread is stuck.

The kernel is required to fix this because it is the final arbitrator of which threads execute. However, in order to do so, it needs to know who owns a lock in order to temporarily boost the priority of the low-priority thread. Thus these operations are tightly coupled to glibc with an inflexible ABI. For every lock operation required by glibc, a corresponding priority-inheritance version is required as a multiplex function.

The ABI expects that the integer pointed to by the addr1 parameter is either zero, which specifies that the priority inheritance lock is unlocked, or equal to the tid (Thread ID) of a thread which owns the lock, with perhaps a couple of upper bits set signifying the presence of waiters, or "robustness".

Thus the user-space locking operation should atomically try to alter the mutex from zero to the current tid via a cmpxchg operation. If this succeeds, then the thread has the lock, and no other work is required. The fact that the kernel doesn't need to be entered increases efficiency. However, if the lock is already held, the task will need to call the FUTEX_LOCK_PI multiplex function.

This function will return once the lock pointed to by addr1 is locked by the current thread. If a timeout is required, then the timeout argument can be used signifying an absolute timeout, otherwise a NULL value signifies that the thread should wait forever.

Once the thread returns from the kernel, it may have the top bit (FUTEX_WAITERS) set signifying that the lock is contended and other threads remain on the wait-queue. To unlock the priority inheritance lock, this bit can be atomically checked for via another cmpxchg. If it is unset, then all that needs to be done is to write zero, the unlocked value, into the lock. However, if there are waiters, then the FUTEX_UNLOCK_PI multiplex function should be called. It takes no other parameters other than addr1 specifying the lock to unlock.

In addition to locking and unlocking the priority inheritance mutex, pthreads also needs to implement the trylock function. This can be implemented like the lock function, mostly in userspace, and only calling into the kernel if there is contention. The FUTEX_TRYLOCK_PI multiplex function implements this operation.

Finally, there are two more priority inheritance operations used to implement condition variables. The first of these is FUTEX_CMP_REQUEUE_PI which is identical to FUTEX_CMP_REQUEUE other than requiring priority inheritance Futexes as the destination wait-queue. Note that priority-inheritance Futex to priority-inheritance Futex requeues are currently unsupported, so addr1 needs to point to a non-priority-inheritance Futex.

The last Futex multiplex function is FUTEX_WAIT_REQUEUE_PI. This is the inverse operation of FUTEX_CMP_REQUEUE. It puts the current thread to sleep on a Futex wait-queue, and then waits to be requeued onto a priority-inheritance Futex. addr1 is the non-priority-inheritance wait-queue to wait on, and addr2 is the priority-inheritance wait-queue we will be moved to when FUTEX_CMP_REQUEUE is called. Note that we must match the addr2 passed to that multiplex function. The only other argument to this operation is an absolute timeout. The bitset defaults to all bits set.

Robust Locks

Not quite part of the Futex API, but extremely related are the robust locks syscalls. These allow a program to recover if a thread that holds a lock exits. Normally, this would cause a deadlock as no other thread could ever lock that lock. However, if the kernel is informed about all locks that every thread obtains via these syscalls and the "robust locks" list, then the kernel can recover the situation for us.

What it will do is scan the robust locks list in the exiting thread, and then set the FUTEX_OWNER_DIED bit in each lock, and wake a waiter per lock if any exist. Then, when a thread tries to lock the "dead" lock userspace can try to recover.

This is related to the Futex API because the kernel also needs to understand this ABI for the priority-inheritance locks. Thus FUTEX_LOCK_PI and FUTEX_TRYLOCK_PI will do the right thing if they are called on a priority-inheritance lock that has had its owner die.

The futex.h header file and the Linux Kernel documentation describes the form of the robust list in detail. In order to inform the kernel of its location the sys_set_robust_list() system call is used. Conversely, to discover its location, the sys_get_robust_list() may be used.

Extra flags

Futexes are extremely useful, more than you might expect at first glance. Their most powerful feature is that they allow synchronization between different processes, not just threads within the same process. All you need to do is map the same piece of shared memory, and the Futex API will work provided each process uses the corresponding location in addr1.

This power comes at a cost. Since processes may map the shared memory into separate locations in their address space, the value of addr1 needs to be supplemented by extra information internally for the kernel to find the correct wait-queue in its hash table. In order to do this, it needs to inspect virtual memory areas (vma's) under the mmap_sem lock. Unfortunately, this provides a bottleneck in some cases. The contention on mmap_sem can greatly reduce performance.

Since the most common case of using a Futex is to refer to a process-internal lock, there is an optimization available here. If the FUTEX_PRIVATE_FLAG is set on the multiplex function number passed to the syscall, then the kernel will assume that the Futex is private to the process. It can then use addr1 as the raw key into its hash table, saving the locking of mmap_sem. Thus every normal Futex operation has a "private" variant selected by the use of that flag.

The final flag which alters the behavior of the Futex multiplexing functions is FUTEX_CLOCK_REALTIME. Unfortunately, this doesn't affect all timeout-using functions. In fact, only FUTEX_WAIT_BITSET and FUTEX_WAIT_REQUEUE_PI will work. (Otherwise ENOSYS is returned.). This changes the clock source for the timeout from CLOCK_MONOTONIC to CLOCK_REALTIME. The difference between the two is that the user can change the current time (or more realistically, the ntpd daemon can). CLOCK_MONOTONIC is unaffected by these clock changes, whereas CLOCK_REALTIME is.

Possible Future Extensions

As can be seen, the Futex API has evolved considerably from its roots as a mutex implementation. However, it still has other possible changes. The most obvious of these is to remove some of the limitations of the current implementation. The biggest of these is that Futexes are limited to point to addresses that are naturally aligned. The reason for that is to avoid the case where a Futex integer overlaps two adjacent pages in memory. This would add a few extra error cases where one but not the other of the two pages is invalid. Ignoring this technicality, this change would be useful to make more efficient read-write lock implementations.

The current implementation of read-write locks in glibc uses a mutex to synchronize access to the lock, combined with two more wait-queues, one each for readers and writers. It would be nice to be able to remove the outer lock, and just have the two wait-queues. This requires some other form of synchronization between the two queues. The easiest way to do that is to make the integers the two point to partially overlap.

Another possibility is to increase the size of the Futex integer from 32 bits to 64 bits in size. This will become necessary eventually as computers capable of running billions rather than millions of threads become available. This is a few years off yet though.

One missing operation is FUTEX_REQUEUE_PI. This wasn't implemented because the glibc condition variable algorithm has a race condition if FUTEX_REQUEUE is used rather than FUTEX_CMP_REQUEUE. However, it is indeed possible to construct a working condition variable algorithm that uses the simpler (and faster) FUTEX_REQUEUE. Allowing this to work with priority-inheritance based locks would be useful.

Finally, it may be worth investigating more efficient ways to implement priority-inherited read-write locks. Potential implementations may require hypothetical extensions like FUTEX_WRLOCK_PI and FUTEX_RDLOCK_PI. The difficulty here is that multiple readers can simultaneously own the lock. Having the kernel know enough information to priority-boost them all when required is difficult to efficiently implement, but still may be possible...

Comments

Borislav Trifonov said...

Windows keyed events seem to be interruptible with a user asynchronous procedure call, which another thread can implement by calling QueueUserAPC, and the keyed event wait then would return STATUS_USER_APC (0xC0, same value as the documented WAIT_IO_COMPLETION which is used for the alertable WaitForSingleObjectEx, SleepEx, etc.). This makes implementing thread cancellation easy on Windows, and if the return code check throws an exception, this would make for nice cleanup as well before the thread ends. How do we accomplish this on Linux with futexes if we use say the fast mutex you designed in your other article? It seems there would need to be a table keeping track of the futex objects for every thread so that they can be woken up, but that seems inefficient. Any ideas?

HdcDst said...

It's a pity, that possibility to wait on a futex by a file descriptor was dropped.
This could have been the first step to introduce a uniform way to wait on different objects at the same time like it is possible on Windows with WaitForMultipleObjects.
Sure, there is the problem with missing a wake in this way, which makes this possibility at first unusable for mutexes.
But this could have been solved by adding a set semantic to the futex.
So that it is possible to say set the futex untill n threads are woken up, instead of wake at least n threads which are currently waiting on the futex.

woodrowdouglass said...

HdcDst - look at the man page for eventfd. EFD_SEMAPHORE may be what you're looking for.

Angel Genchev said...

Yes, Borislav Trifonov and HdcDst need EventFD, but they don`t know it exist. We have a nice poll(), epoll() calls to wait on file and eventFD descriptors.

Lolek said...

LOL

Michael K said...

The alignment requirement is not just to avoid corner cases like futex variables spanning two pages, but is required by many processor architectures. Futex variables may be modified by user space while another thread is running a futex system call, so the read accesses to the futex variable to compare against the expected value needs to be atomic. Most non-x86 architectures do not support atomic memory access to unaligned variables, even if it is one of the architecures that support unaligned access at all.