Phasers are a generalization of the Barrier synchronization primitive. Barriers synchronize a given set of threads. Once all the threads have reached the barrier, it lets them continue. Until then, threads must wait until their tardy compatriots arrive. This is obviously inefficient. Threads could perhaps be doing other work whilst they are waiting. This is the key difference with phasers, where the act of "arriving" and "waiting" are separated.
The key operations for a phaser are "signaling" (or arriving), and "waiting". A thread signals a phaser to let it know that it has been reached. In return, a phaser gives that thread a handle (a sequence number). The thread can later use that sequence number to test to see if all other threads have reached the phaser. It can either poll the phaser, or sleep on it using that handle. Another possibility is to have an interface function that does both, in which case the phaser behaves exactly like a barrier would.
A phaser can thus be made by splitting a barrier in two. Instead of having one wait function, we will now have a signal and a wait function. We will use the fast pool barrier as the skeleton for this. Abstracting away the internals of the data structure gives:
Where the code to wake all the sleeping threads is:
There is some subtlety to the above. We need to correctly handle the case where more threads than planned reach the phaser before it ticks to the next sequence number. By marking the low bit (byte) of the sequence handle, we can notify the wait routine that we either succeeded passing through the phaser or not. Since this miss-usage is likely to be rare, we won't bother "fixing" the
In the above we have deliberately abstracted away the internals to the phaser data structure. The reason for this is that phasers have one other feature that is different to barriers. Threads are allowed to dynamically join and leave a phaser. This means that the total number of threads we are synchronizing can change at any time. This is problematic as we need to atomically handle the cases where threads are simultaneously joining, leaving and entering a phaser. If not done correctly, we could have missed wakeups.
Thus we will place the
What saves us is that the Intel platform allows non-aligned atomic instructions. These work as you might expect, but may be slower than their aligned counterparts. By ordering the three fields such that the count is in the middle and with the total and sequence number on either side, we can satisfy all the constraints. For speed, we will place
Note how in the above we haven't bothered to use unions to describe the overlapping atomic accesses used in the code. That just complicates the description, without any real gain. We can simply use casts to get the correctly sized operation.
The initialization and destruction look identical to their barrier counterparts:
In the code for the signal, and wait routines we abstracted out the refcounting code. Refcounting is needed, just like in the barriers case because threads leaving the phaser will read the phaser's state. To safely destroy a phaser, we need to wait until all such reads have occurred. Otherwise those threads may attempt to read freed memory, with potentially disastrous results. The refcounting code is simple:
Next we need the routines that atomically add to the count of threads waiting at the phaser, and spin waiting for that phaser to be ready. The two are very similar to their barriers counterparts. The only difference is that we cannot simply read the
With the above, we can also construct a "sigwait" routine, one that both signals and waits. It's operation is identical to a barrier:
Finally, we need the routines that add and remove threads from a phaser. Adding is simple, we just atomically increment the count. Removing is more complex. We could be the last thread to arrive. If so, we need to wake the other waiting threads. Thus, the code looks similar to the
The result is an efficient phaser implementation. By splitting up the signal and wait, we can speed up code that uses barriers. The extra overhead in doing so is minimal, and we gain a much increased overlap of synchronization and computation. By allowing threads to add and remove themselves from phasers, it is possible to efficiently construct deadlock-free synchronization graphs where threads only wait on those that they need to. You might want to try phasers in your code to see how much they can improve performance...
Company Info |
Product Index |
Category Index |
Copyright © Lockless Inc All Rights Reserved.