Obscure Synchronization PrimitivesBlocking synchronization primitives on Linux are built using the Futex system call. Futexes were originally designed to allow user-space fast-paths in the creation of mutexes. Thus the kernel doesn't need to be notified if there is no contention. Only if a thread needs to block, does a system call need to be executed. Later, additions to the Futex system call made it a more powerful beast that allowed the construction of other primitives such as condition variables and read-write locks. Every other synchronization primitive can be built from mutexes and condition variables, so those two are technically all that we require. However, in the search for performance, sometimes it is necessary to investigate how other primitives that might better-fit a given task may be created. Another possibility where they are important, is in porting software between operating systems. The Microsoft Windows threading system is very different from the Linux one. Condition variables are a new addition there, so most older codes use other techniques that have no direct correspondence to pthreads on Linux. Some synchronization primitives are easier to use than others. Fortunately for Linux programmers, mutexes and condition variables are rather easy to use correctly. The following primitives tend not to be so forgiving. However, if you have source code that already uses similar methods on other operating systems and you need to port it to Linux, then this article may be useful. The first thing we will need is some atomic intrinsics:
We also need to use the
Finally, in order to block on a kernel wait-list, we need to use the Futex system call, which unfortunately isn't exposed by
Single Producer, Multiple Consumer, Single Item QueueA useful primitive that allows communication between threads is some form of queue. The simplest possible queue only allows one item to be stored within itself. Constructing one with normal primitives isn't hard. You use a mutex to protect the internal state of the object, and a condition variable to allow the consumers to wait for the producer. The problem here is that by default, mutexes and condition variables are quite large and heavy-weight objects. We can do better using Futexes directly. Such an object has three different states. The first state is "empty", with nothing stored. We can represent this as a NULL pointer. The second state is "full", where a pointer to the item enqueued is stored within the object. Finally, we need a third state "empty but contended" to denote when we may have consumers waiting for a producer on a Futex wait-list within the kernel. The problem here is what to choose for the value of the third state. For 32 bit systems, the answer is obvious. We can use the bit pattern 0x00000001, which will never be a pointer to a real item. However, 64 bit systems aren't so simple. The difficulty is that Futexes only deal with 32 bit quantities. It is perfectly possible to have a real pointer to an item with the last 32 bits in that particular pattern. Fortunately, this doesn't happen very often. Only pointers to unaligned objects, or to raw bytes or characters can have this occur. If we restrict the use of the SPMCSI queue to only aligned pointers, it isn't that much of a restriction in practice. Using a
A function to produce to such an object needs to check if the state is empty. If it is, a simple compare and swap will enqueue the object in a wait-free manner. Otherwise, we need to wake up possible waiters as well.
Finally, the most complex function is that for the consumers. They need to try to remove an item if one exists in a wait-free manner for performance. They also need to transition to the contended state if no item exists. The trick here is making sure that the contended state is maintained even when we finally get the enqueued item within the sleeping slow-path. We do this because we don't know if there are any other consumers waiting within the kernel. We transition to a no-waiter state only if the producer tries to wake someone up and fails.
EventsAnother very simple synchronization primitive is an "event". An event can either be set or unset. If an event is unset, then a waiter will wait until it is set. If you set an event, all such waiters will be awoken. This primitive is used quite often in the Microsoft Windows world. It allows one thread to notify others waiting on a particular result. The corresponding pattern in the Linux programming world would involve a mutex and condition variable. However, we again can simplify the problem when a pure Futex is used. We define an event to be a 32 bit quantity manipulated by the Futex system call.
Reading whether or not the event is set or unset is easy:
Transitioning to the unset state is also quite simple. We just store a zero in the requisite variable.
Setting the event it more complex. We need to have a fast-path that checks to see if anyone is waiting on us or not. If not, we can bail out early. Otherwise we need to wake the waiters. One trick to notice is the use of the
Finally, we need to wait on the event. Doing this is simpler than the SPMCSI queue wait function because all waiters are woken rather than just one at a time. Thus we don't need to be quite so tricky with setting the contended state and checking when to bail out of the wait-loop.
Countdown EventsA "countdown" event is an event that is only triggered after a certain number of signals is sent to it. One possible use for such a thing is barrier-like semantics where one thread is waiting on the results from multiple others. We will use the bottom bit of the 32 bit Futex quantity to denote whether there are any waiters or not. The upper 31 bits can then be used as a counter.
By using an atomic add of -2, we can decrement the counter. Once we reach zero or one, then we are done, and have to wake up any waiters.
Finally, we need the wait function which will block until the counter reaches zero (actually one, since the lowest bit will be set if we sleep.)
Unfortunately this API is rather hard to use correctly. Note how there is a race between EventcountsOne rather useful synchronization primitive is an "eventcount". It is somewhat like the reverse of a seq-lock, where the waiter blocks if no signals have occurred between two points in time, rather than the converse. This primitive can be used to construct "almost lock free" data structures, that rarely call into the kernel. The trick is making sure the signal/broadcast functions are wait free in the common case where there are no waiters, and making the wait-check operation as cheap as possible. Like other primitives, this one fits inside a 32bit Futex integer. Two billion or so possible signals between checks should be enough. (A similar amount is used to prevent the ABA bug in lock free linked lists.) We will use the bottom bit to denote whether there are waiters or not, just like the countdown event object.
The broadcast and signal operations are quite similar. The first needs to wake all waiters, whereas the second just one. However, we can quickly check if no waiters exist, and avoid making the relatively slow syscall into the kernel. The signal operation is quite useful in preventing the "thundering herd" issue from occurring.
The first part of the waiter code needs to determine the "key" used to represent the instance in time. To do this, we use the integer value of the object. As an optimization, we set the lower bit to make comparisons faster. (We don't care about the state of that bit in the waiter code.)
The most complex part of the code is in the wait function. It needs to determine whether or not there have been any signals (or broadcasts) since the thread has obtained its key. We can do that by checking the integer against our key. Again, we don't care about the least significant bit, so we make sure that is set to match what we have done in the
SummaryThe above synchronization primitives, while obscure, can be quite useful in some situations. As can be seen, they all can be efficiently implemented via the Futex API on Linux. No complex compare-exchange based loops are needed, as all atomic operations required exist within the x86 instruction set. Thus the above code should be quite fast, performing much better than generic mutex & condition variable implementations of the same tasks. You may want to investigate whether any performance critical code you may have could also be sped up in the same manner. |
About Us | Returns Policy | Privacy Policy | Send us Feedback |
Company Info |
Product Index |
Category Index |
Help |
Terms of Use
Copyright © Lockless Inc All Rights Reserved. |
Comments
Jeff K said...You could fix the bug by reading and updating the two flags atomically (e.g. combining them into a single byte), but event_reset() and event_wait() would need to use atomic_bit_set() and atomic_bit_clear() or xchg_8().
The futex system call checks that the state matches the given integer before going to sleep. If the event bit has been set, then that comparison will fail (It simultaneously checks for contended = true, event = false), and the thread will spin around to check the event state again.
If there is a bug, it probably is due to things leaking across event_reset(). I've added another compile barrier there to fix that.
I'll have to admit that I haven't checked the functions in this article nearly as much as those in the mutex code I derived them from. It is entirely possible that a few bugs slipped in, but good to know that none have been spotted yet. :-)
Another issue is that events, like all other synchronization events on Windows, can be either anonymous or named. This means that an application can try to open an event created by another application based on shared knowledge of what that event name would be. What this all means is that despite the temptation to use futexes, in order to have the whoe range of functionality of Windows synchronization events on Linux we would have to use full-fledged IPC together with sockets and some form of poll() and select().