Barriers

Barriers are a fundamental synchronization primitive that let you tell when a set of threads has completed a task. Threads are allowed to pass through a barrier only once all of them have reached it, and not before. Because of this, barriers are a rather heavy-weight operation. However, many parallel algorithms can be greatly simplified by breaking them into phases of computation synchronized by barriers. This ease of use makes them a nice option to keep in your virtual toolkit of parallel idioms.

The pthreads library supports barriers via the pthread_barrier_t data type, and the pthread_barrier_init, pthread_barrier_destroy, and pthread_barrier_wait functions. To use a pthreads barrier, you first initialize it with the number of threads that are required to synchronize, together with optional attributes such as whether the barrier is shared between processes rather than used within a single process. Then threads needing to be synchronized call pthread_barrier_wait(). Finally, pthread_barrier_destroy() should be called when the barrier is no longer needed.
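As a small illustration of the init / wait / destroy sequence, here is a minimal sketch of a two-phase computation. The names phase_worker and run_two_phases, and the choice of four threads, are made up for this example; only the pthread_barrier_* calls come from the library.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

#define NTHREADS 4

static pthread_barrier_t phase_barrier;
static int phase1_done[NTHREADS];   /* written in phase 1 */
static int saw_all[NTHREADS];       /* written in phase 2 */

static void *phase_worker(void *arg)
{
    int id = (int)(long)arg;
    int all = 1;

    phase1_done[id] = 1;            /* phase 1: do our share of the work */

    /* Nobody gets past this point until all NTHREADS have arrived. */
    pthread_barrier_wait(&phase_barrier);

    /* Phase 2: every thread's phase-1 result should now be visible. */
    for (int i = 0; i < NTHREADS; i++)
        all = all && phase1_done[i];
    saw_all[id] = all;
    return NULL;
}

/* Returns how many threads saw a complete phase 1, or -1 on error. */
int run_two_phases(void)
{
    pthread_t t[NTHREADS];
    int ok = 0;

    for (int i = 0; i < NTHREADS; i++)
        phase1_done[i] = saw_all[i] = 0;
    if (pthread_barrier_init(&phase_barrier, NULL, NTHREADS) != 0)
        return -1;
    /* Error checks on pthread_create are omitted for brevity. */
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, phase_worker, (void *)(long)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&phase_barrier);

    for (int i = 0; i < NTHREADS; i++)
        ok += saw_all[i];
    return ok;
}
```

Here the barrier is destroyed only after every thread has been joined, which sidesteps the destruction subtlety discussed next.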

Destruction of a barrier primitive is a little bit subtle. Most other synchronization primitives can have a destruction operation that is a no-op. They don't allocate memory, and the futex system call that they use for waiting doesn't need explicit deallocation. Barriers also don't have to allocate memory, and also use futexes (assuming a Linux-based implementation). However, they do have some work to do.

The problem is that a destroy call can occur whilst other threads are still in the process of leaving the barrier. These threads need to re-check the state of the barrier, because the futex system call can have "false wakeups". If the memory used by the barrier is freed out from underneath them, then undefined behaviour will occur. For example, if the memory is unmapped by the munmap() system call, a thread that touches it afterwards will segfault.

Thus the destroying thread needs to wait until all other threads have finished exiting the barrier. In order for that to happen, extra synchronization is required. This subtle issue was ignored in the winpthreads library. Fortunately, it is easy to fix by reusing the condition variable and mutex in the destroy function. That article has been updated with the corrected version.

The winpthreads library shows how we can implement our own version of a barrier from simpler mutex and condition variable primitives. Translating back from Microsoft Windows functions to pthreads gives:
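To make the idea concrete, here is a minimal sketch of a barrier built from a mutex and a condition variable. It is written for this illustration and is not the winpthreads code: the cv_barrier_* names, the separate arrived and leaving counters, and the small demo harness are all assumptions. It does follow the behaviour described in the text: entering threads wait while a previous round is still draining, and the destroy function reuses the same mutex and condition variable to wait for exiting threads.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

typedef struct {
    pthread_mutex_t m;
    pthread_cond_t cv;
    unsigned count;    /* threads needed to release the barrier */
    unsigned arrived;  /* threads that have entered the current round */
    unsigned leaving;  /* released waiters that have not yet exited */
} cv_barrier_t;

int cv_barrier_init(cv_barrier_t *b, unsigned count)
{
    if (count == 0) return -1;
    b->count = count;
    b->arrived = b->leaving = 0;
    if (pthread_mutex_init(&b->m, NULL) != 0) return -1;
    return pthread_cond_init(&b->cv, NULL) != 0 ? -1 : 0;
}

int cv_barrier_wait(cv_barrier_t *b)
{
    int ret = 0;

    pthread_mutex_lock(&b->m);
    while (b->leaving > 0)              /* previous round still draining */
        pthread_cond_wait(&b->cv, &b->m);
    if (++b->arrived == b->count) {
        /* Last to arrive: open the exit phase and wake the waiters. */
        b->arrived = 0;
        b->leaving = b->count - 1;
        pthread_cond_broadcast(&b->cv);
        ret = PTHREAD_BARRIER_SERIAL_THREAD;
    } else {
        while (b->leaving == 0)         /* also absorbs spurious wakeups */
            pthread_cond_wait(&b->cv, &b->m);
        if (--b->leaving == 0)          /* last one out: let others in */
            pthread_cond_broadcast(&b->cv);
    }
    pthread_mutex_unlock(&b->m);
    return ret;
}

void cv_barrier_destroy(cv_barrier_t *b)
{
    pthread_mutex_lock(&b->m);
    while (b->leaving > 0)              /* wait for exiting threads */
        pthread_cond_wait(&b->cv, &b->m);
    pthread_mutex_unlock(&b->m);
    pthread_cond_destroy(&b->cv);
    pthread_mutex_destroy(&b->m);
}

/* Hypothetical demo harness: checks that no thread passes a round early. */
struct cv_demo { cv_barrier_t b; unsigned nthreads, rounds, arrivals, bad; };

static void *cv_demo_thread(void *arg)
{
    struct cv_demo *d = arg;
    long serial = 0;
    for (unsigned i = 0; i < d->rounds; i++) {
        __atomic_fetch_add(&d->arrivals, 1, __ATOMIC_SEQ_CST);
        serial += (cv_barrier_wait(&d->b) == PTHREAD_BARRIER_SERIAL_THREAD);
        /* After round i, nthreads * (i + 1) arrivals must have happened. */
        if (__atomic_load_n(&d->arrivals, __ATOMIC_SEQ_CST) < d->nthreads * (i + 1))
            __atomic_store_n(&d->bad, 1, __ATOMIC_SEQ_CST);
    }
    return (void *)serial;
}

/* Returns the number of "serial thread" results, or -1 on any failure. */
long cv_barrier_demo(unsigned nthreads, unsigned rounds)
{
    struct cv_demo d = { .nthreads = nthreads, .rounds = rounds };
    pthread_t t[64];
    long serial = 0;
    void *r;

    if (nthreads > 64 || cv_barrier_init(&d.b, nthreads) != 0) return -1;
    for (unsigned i = 0; i < nthreads; i++)   /* error checks omitted */
        pthread_create(&t[i], NULL, cv_demo_thread, &d);
    for (unsigned i = 0; i < nthreads; i++) {
        pthread_join(t[i], &r);
        serial += (long)r;
    }
    cv_barrier_destroy(&d.b);
    return d.bad ? -1 : serial;
}
```

Exactly one thread per round should get PTHREAD_BARRIER_SERIAL_THREAD, so cv_barrier_demo(4, 1000) is expected to return 1000. The __atomic builtins in the harness assume GCC or Clang.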

We can benchmark the above by calling the barrier wait function many times repeatedly in a loop. By changing the number of threads to synchronize, we can investigate the scalability of the algorithm. Such a benchmark might look like:
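A rough sketch of such a benchmark follows. It is not the harness used for the timings quoted in this article: the BARRIER_* macro names, bench_barrier, and the 64-thread cap are assumptions made for illustration. The macros default to the pthreads library barrier and can be overridden at compile time to time another implementation with the same shape of interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Override these four macros to benchmark a different barrier. */
#ifndef BARRIER_T
#define BARRIER_T           pthread_barrier_t
#define BARRIER_INIT(b, n)  pthread_barrier_init((b), NULL, (n))
#define BARRIER_WAIT(b)     pthread_barrier_wait(b)
#define BARRIER_DESTROY(b)  pthread_barrier_destroy(b)
#endif

#define MAX_THREADS 64

struct bench { BARRIER_T b; long iters; };

static void *bench_thread(void *arg)
{
    struct bench *s = arg;
    for (long i = 0; i < s->iters; i++)
        BARRIER_WAIT(&s->b);
    return NULL;
}

/* Returns elapsed wall-clock seconds, or -1.0 on error. */
double bench_barrier(int nthreads, long iters)
{
    struct bench s;
    pthread_t t[MAX_THREADS];
    struct timespec t0, t1;

    if (nthreads < 1 || nthreads > MAX_THREADS) return -1.0;
    s.iters = iters;
    if (BARRIER_INIT(&s.b, (unsigned)nthreads) != 0) return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* Error checks on pthread_create are omitted for brevity. */
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, bench_thread, &s);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    BARRIER_DESTROY(&s.b);
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

The measured time includes thread creation and joining, so the iteration count should be large enough for the barrier waits to dominate.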

We can then switch between various barrier implementations with a few macros. (The default in the above is to use the pthreads library barriers.) i.e.

We use runs with two threads, four threads, and twenty threads on this quad-core machine. The two-thread case investigates a best-case situation. The four-thread case has a small amount of interference from other processes running on the system, so we should expect it to run a little slower. The twenty-thread case determines whether there are any pathological issues with running large numbers of threads through the barrier.

The first thing to notice in the above is the large variability in the results. (We tested five times for each combination, and recorded the range of times seen.) There seems to be about a 10% intrinsic variation, which can make it difficult to see which of two algorithms is faster. For this particular pair, it seems that only at large numbers of threads is the pthreads library significantly faster than the open-coded version.

This shouldn't be too surprising as the pthreads library actually uses a similar algorithm. It just uses a faster low-level lock than a pthreads mutex. It also is implemented in assembly for speed.

The real question now is: can we do better? The first trick is to realize that the current barrier implementation suffers from quite a large inefficiency. No threads can enter a barrier whilst others are leaving it. They need to wait until all other threads have left, and the count reaches zero, before they can start. If we can allow new threads to start waiting on the barrier whilst old threads are being woken up and beginning to leave it, we should be able to improve the times.

The first thing to notice is that we will have to use something more low-level than mutexes and condition variables. We will also have to use the futex system call directly. This calls for the usual set of atomic primitives:
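One possible spelling of such primitives is sketched below, using the GCC/Clang __atomic builtins and the raw Linux futex system call. The macro names and the four-argument sys_futex wrapper are assumptions for this illustration; there is no futex() function in glibc, so the call goes through syscall().

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Atomic read of *p. */
#define atomic_read(p)     __atomic_load_n((p), __ATOMIC_SEQ_CST)
/* Atomic add that returns the value *p held before the addition. */
#define atomic_xadd(p, v)  __atomic_fetch_add((p), (v), __ATOMIC_SEQ_CST)
/* Atomic add whose result is discarded. */
#define atomic_add(p, v)   ((void)__atomic_fetch_add((p), (v), __ATOMIC_SEQ_CST))
/* Compiler barrier: stops the compiler reordering memory accesses across it. */
#define barrier()          __asm__ __volatile__("" ::: "memory")

/* Hint to the CPU that we are in a spin loop (x86 "pause"; a no-op elsewhere). */
#if defined(__x86_64__) || defined(__i386__)
#define cpu_relax()        __asm__ __volatile__("pause" ::: "memory")
#else
#define cpu_relax()        barrier()
#endif

/*
 * Thin wrapper over the futex system call.
 * FUTEX_WAIT_PRIVATE sleeps only if *addr still equals val;
 * FUTEX_WAKE_PRIVATE wakes up to val sleepers and returns how many it woke.
 */
static inline long sys_futex(void *addr, int op, int val,
                             const struct timespec *timeout)
{
    return syscall(SYS_futex, addr, op, val, timeout, NULL, 0);
}
```

A wait on a word whose value no longer matches fails immediately with EAGAIN rather than sleeping, which is what makes the "check, then sleep" pattern free of lost wakeups.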

The Pool Barrier

Now we can use the above to construct a "pool barrier". It works by filling a pool of waiting threads. Once the pool is full, a sequence number is incremented. The waiting threads can see the sequence number has changed, and they then can exit the barrier. New threads will use the new sequence number. If too many threads try to enter the barrier at the same time, some will get a count number greater than the size of the pool. If so, they also need to wait until the sequence number changes in order to try again.

However, there is one important thing missing. We need to be able to safely destroy the barrier. Thus we need some way of knowing how many threads are currently accessing it. The simplest way of doing that is to use a reference count. A thread entering the barrier increments the count. A thread exiting decrements it. If the count starts at one, and the destroying thread decrements and waits until the count reaches zero, we have a nice solution. Such a barrier looks like:
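The following is a sketch of how a pool barrier along these lines might be written; it should be read as an illustration under stated assumptions rather than as the original listing. It assumes Linux, GCC or Clang, and a little-endian machine, so that the sequence number is the low 32 bits of a 64-bit word. Taking a slot with one 64-bit fetch-and-add, which also reads the sequence number, is a design choice of this sketch. The do_futex helper and the demo harness are hypothetical.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

_Static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__, "sketch assumes little-endian");

typedef struct {
    union {
        struct { uint32_t seq, count; };  /* seq: futex word, count: slots taken */
        uint64_t both;                    /* seq | (uint64_t)count << 32 */
    };
    uint32_t refcount;  /* 1 for the owner, plus threads currently inside */
    uint32_t total;     /* pool size minus one */
} pool_barrier_t;

static long do_futex(uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void pool_barrier_init(pool_barrier_t *b, unsigned count)
{
    b->both = 0;
    b->refcount = 1;
    b->total = count - 1;
}

int pool_barrier_wait(pool_barrier_t *b)
{
    int ret;
    __atomic_fetch_add(&b->refcount, 1, __ATOMIC_SEQ_CST);
    for (;;) {
        /* Take a slot and read the sequence number in one atomic step. */
        uint64_t old = __atomic_fetch_add(&b->both, 1ULL << 32, __ATOMIC_SEQ_CST);
        uint32_t seq = (uint32_t)old, slot = (uint32_t)(old >> 32);

        if (slot < b->total) {            /* in the pool: sleep until released */
            while (__atomic_load_n(&b->seq, __ATOMIC_SEQ_CST) == seq)
                do_futex(&b->seq, FUTEX_WAIT_PRIVATE, seq);
            ret = 0;
            break;
        }
        if (slot == b->total) {           /* pool is full: reset count, bump seq */
            __atomic_store_n(&b->both, (uint64_t)(seq + 1), __ATOMIC_SEQ_CST);
            do_futex(&b->seq, FUTEX_WAKE_PRIVATE, INT_MAX);
            ret = PTHREAD_BARRIER_SERIAL_THREAD;
            break;
        }
        do_futex(&b->seq, FUTEX_WAIT_PRIVATE, seq);  /* too late: retry next round */
    }
    if (__atomic_fetch_sub(&b->refcount, 1, __ATOMIC_SEQ_CST) == 1)
        do_futex(&b->refcount, FUTEX_WAKE_PRIVATE, 1);  /* last out wakes destroyer */
    return ret;
}

void pool_barrier_destroy(pool_barrier_t *b)
{
    /* Drop the initial reference, then wait for the count to reach zero. */
    uint32_t ref = __atomic_sub_fetch(&b->refcount, 1, __ATOMIC_SEQ_CST);
    while (ref != 0) {
        do_futex(&b->refcount, FUTEX_WAIT_PRIVATE, ref);
        ref = __atomic_load_n(&b->refcount, __ATOMIC_SEQ_CST);
    }
}

struct pool_demo { pool_barrier_t b; int rounds; };

static void *pool_demo_thread(void *arg)
{
    struct pool_demo *d = arg;
    long serial = 0;
    for (int i = 0; i < d->rounds; i++)
        serial += (pool_barrier_wait(&d->b) == PTHREAD_BARRIER_SERIAL_THREAD);
    return (void *)serial;
}

/* Hypothetical harness: returns how many times a thread released the pool. */
long pool_barrier_demo(int nthreads, int rounds)
{
    struct pool_demo d = { .rounds = rounds };
    pthread_t t[64];
    long serial = 0;
    void *r;
    if (nthreads < 1 || nthreads > 64) return -1;
    pool_barrier_init(&d.b, (unsigned)nthreads);
    for (int i = 0; i < nthreads; i++)    /* error checks omitted */
        pthread_create(&t[i], NULL, pool_demo_thread, &d);
    for (int i = 0; i < nthreads; i++) {
        pthread_join(t[i], &r);
        serial += (long)r;
    }
    pool_barrier_destroy(&d.b);
    return serial;
}
```

Because the releasing thread resets the count and bumps the sequence number with a single 64-bit store, a thread that arrives during the release either lands in the old pool (and retries) or in the new one, never in between.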

The result is a barrier that is twice as fast as the pthreads barrier! The extra speed comes from the new overlap in entry and exit.

Ticket Barrier

Can we do even better? The above pool barrier uses two different count variables to control access. The first counts how many threads have tried to enter the pool. The second is the sequence number, that counts which pool we are currently using. Perhaps we can try to combine them in some way?

One way to try to simplify the code is to follow the example of the "ticket lock". That fast locking algorithm has threads take a ticket, and wait until it is that ticket's turn to continue. If we allow a bunch of tickets to continue at the same time, then we will have something equivalent to a barrier.

Unfortunately, we still need some way to know when all threads have left the barrier. This means we need something like the refcount in the pool barrier algorithm. Since each thread must take a ticket on entry, we can use that as the initial increment. If they also take a ticket on exit, we can see when those counters match. If they do, then all threads have left. Such an algorithm looks like:
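A sketch of a ticket barrier in this style is given below, under the same assumptions as before (Linux, GCC or Clang, little-endian); it is an illustration and not the original listing. The field names, the do_futex helper, and the demo harness are hypothetical. The entry and exit counters share one 64-bit word so that taking an exit ticket can read the entry count in the same atomic operation.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

_Static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__, "sketch assumes little-endian");

typedef struct {
    uint32_t next;   /* first ticket not yet released (futex word) */
    uint32_t total;  /* threads per batch, minus one */
    union {
        struct { uint32_t in, out; };  /* tickets taken on entry / on exit */
        uint64_t both;                 /* in | (uint64_t)out << 32 */
    };
} ticket_barrier_t;

static long do_futex(uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void ticket_barrier_init(ticket_barrier_t *b, unsigned count)
{
    b->next = 0;
    b->total = count - 1;
    b->both = 0;
}

int ticket_barrier_wait(ticket_barrier_t *b)
{
    uint32_t ticket = __atomic_fetch_add(&b->in, 1, __ATOMIC_SEQ_CST);
    int ret;
    for (;;) {
        uint32_t next = __atomic_load_n(&b->next, __ATOMIC_SEQ_CST);
        if (ticket - next == b->total) {   /* last ticket of the batch */
            __atomic_store_n(&b->next, next + b->total + 1, __ATOMIC_SEQ_CST);
            do_futex(&b->next, FUTEX_WAKE_PRIVATE, INT_MAX);
            ret = PTHREAD_BARRIER_SERIAL_THREAD;
            break;
        }
        if (ticket - next >= (1u << 31)) { /* ticket < next: we were released */
            ret = 0;
            break;
        }
        do_futex(&b->next, FUTEX_WAIT_PRIVATE, next);
    }
    /* Take an exit ticket and read the entry count in one atomic step. */
    uint64_t old = __atomic_fetch_add(&b->both, 1ULL << 32, __ATOMIC_SEQ_CST);
    if ((uint32_t)(old >> 32) == (uint32_t)old)  /* only once destroy has begun */
        do_futex(&b->out, FUTEX_WAKE_PRIVATE, 1);
    return ret;
}

void ticket_barrier_destroy(ticket_barrier_t *b)
{
    /* Lower `in` by one so the last thread out sees a match and wakes us. */
    __atomic_fetch_sub(&b->in, 1, __ATOMIC_SEQ_CST);
    for (;;) {
        uint32_t out = __atomic_load_n(&b->out, __ATOMIC_SEQ_CST);
        if (out == __atomic_load_n(&b->in, __ATOMIC_SEQ_CST) + 1) return;
        do_futex(&b->out, FUTEX_WAIT_PRIVATE, out);
    }
}

struct ticket_demo { ticket_barrier_t b; int rounds; };

static void *ticket_demo_thread(void *arg)
{
    struct ticket_demo *d = arg;
    long serial = 0;
    for (int i = 0; i < d->rounds; i++)
        serial += (ticket_barrier_wait(&d->b) == PTHREAD_BARRIER_SERIAL_THREAD);
    return (void *)serial;
}

/* Hypothetical harness: returns how many times a thread released a batch. */
long ticket_barrier_demo(int nthreads, int rounds)
{
    struct ticket_demo d = { .rounds = rounds };
    pthread_t t[64];
    long serial = 0;
    void *r;
    if (nthreads < 1 || nthreads > 64) return -1;
    ticket_barrier_init(&d.b, (unsigned)nthreads);
    for (int i = 0; i < nthreads; i++)    /* error checks omitted */
        pthread_create(&t[i], NULL, ticket_demo_thread, &d);
    for (int i = 0; i < nthreads; i++) {
        pthread_join(t[i], &r);
        serial += (long)r;
    }
    ticket_barrier_destroy(&d.b);
    return serial;
}
```

The comparison ticket - next >= 2³¹ is a wrap-around test for "ticket is behind next", which is where the 2³¹ limit on outstanding tickets comes from.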

The ticket barrier uses fewer locked instructions than the pool barrier (two instead of four). This means that it should be faster.

However, as the above table shows, that isn't the case! It seems the overhead of having to synchronize with other threads via the futex system call is much greater than the time spent in the atomic locked instructions. At least on this machine, there is very little difference in timings between the two algorithms. The intrinsic 10% variability explains all the variation.

The ticket barrier has one weakness though. It requires that no more than 2³¹ threads cumulatively take the barrier before all waiting threads leave it. If more do, then the count will overflow, and a deadlock may occur. The pool barrier has a similar weakness, but there the limit is 2³² − total threads per sequence number, with 2³² different sequence numbers before a problem occurs. This is a much larger number, and overflow is much less likely to occur. If futexes could wait on 64-bit numbers rather than 32-bit numbers, this problem wouldn't be so much of an issue. As it is though, for safety, the ticket barrier shouldn't really be used, as its advantages are small.

Further Improvements

The above algorithms are twice as fast as the standard pthreads barriers. However, it may be possible to improve them further. The trick is to notice that we always call the kernel via a futex no matter what happens. If few threads are synchronizing, then it may be profitable to spin for a while before sleeping instead. Such an extension to the pool barrier looks like:
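The core of the spin-then-sleep idea can be sketched in isolation as a helper that waits for a 32-bit word (such as the sequence number) to change. This is an illustration only: wait_for_change, the releaser thread, and spin_wait_demo are hypothetical names, and the spin count is a parameter rather than a fixed constant. Linux and GCC or Clang are assumed.

```c
#define _GNU_SOURCE
#include <limits.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__x86_64__) || defined(__i386__)
#define cpu_relax() __asm__ __volatile__("pause" ::: "memory")
#else
#define cpu_relax() __asm__ __volatile__("" ::: "memory")
#endif

/* Wait until *addr differs from old: spin first, then sleep in the kernel. */
static void wait_for_change(uint32_t *addr, uint32_t old, unsigned spins)
{
    for (unsigned i = 0; i < spins; i++) {
        if (__atomic_load_n(addr, __ATOMIC_SEQ_CST) != old)
            return;                       /* changed while spinning: no syscall */
        cpu_relax();
    }
    while (__atomic_load_n(addr, __ATOMIC_SEQ_CST) == old)
        syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, old, NULL, NULL, 0);
}

static uint32_t demo_seq;

/* Plays the role of the thread that releases the barrier. */
static void *demo_releaser(void *arg)
{
    (void)arg;
    __atomic_fetch_add(&demo_seq, 1, __ATOMIC_SEQ_CST);
    syscall(SYS_futex, &demo_seq, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
    return NULL;
}

/* Returns how far the sequence number moved while we waited. */
uint32_t spin_wait_demo(unsigned spins)
{
    pthread_t t;
    uint32_t old = __atomic_load_n(&demo_seq, __ATOMIC_SEQ_CST);

    if (pthread_create(&t, NULL, demo_releaser, NULL) != 0)
        return 0;
    wait_for_change(&demo_seq, old, spins);
    pthread_join(t, NULL);
    return __atomic_load_n(&demo_seq, __ATOMIC_SEQ_CST) - old;
}
```

Passing a spin count of zero reduces the helper to the plain futex wait used earlier, which makes it easy to compare the two behaviours.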

In the above, the wait function spins 1000 times before sleeping. Why 1000? If we spin too little, then we get no advantage at all. If we spin too much, then the case with more threads than cpus becomes much much slower. The result for this number is:

And so we have a mixed blessing. When we have two threads, the result is ten times faster than before. When we have four or more threads, the result is slower than before. In the case of twenty threads, much slower. Unfortunately, picking a spin number smaller than 1000 rapidly makes the 10× improvement disappear. The slow-down doesn't go away so rapidly. There appears to be no good number that gets a good improvement without a disadvantage somewhere else. (The spin version of the ticket barrier gives similar numbers.)

There is a way out though. If we count how many CPUs we have at initialization time, we can choose the spin number according to whether more or fewer threads than CPUs are synchronizing at the barrier. We can also support the inter-process barrier feature of pthreads with little overhead. The result is a uniformly fast barrier that is a drop-in replacement for the standard one:
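The CPU-counting step on its own might look like the sketch below. The policy shown (spin 1000 times when the threads fit on the available CPUs, do not spin at all otherwise) is an assumption made for illustration, as are the names choose_spins and spins_for_barrier; sysconf(_SC_NPROCESSORS_ONLN) is the usual way to ask Linux for the number of online CPUs.

```c
#include <unistd.h>

#define DEFAULT_SPINS 1000u

/* Pick a spin count from the CPU count and the barrier's thread count. */
static unsigned choose_spins(long ncpus, unsigned nthreads)
{
    if (ncpus < 1)
        ncpus = 1;                 /* sysconf can fail and return -1 */
    /* Spinning only pays off if every waiter can be running at once. */
    return nthreads <= (unsigned long)ncpus ? DEFAULT_SPINS : 0u;
}

/* Intended to be called once, at barrier initialization time. */
unsigned spins_for_barrier(unsigned nthreads)
{
    return choose_spins(sysconf(_SC_NPROCESSORS_ONLN), nthreads);
}
```

On a quad-core machine this would give 1000 spins for the two- and four-thread cases and none for the twenty-thread case.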

It is twenty times as fast for small numbers of threads, and twice as fast for large numbers of threads, compared to the pthreads barrier. However, we have one more trick up our sleeves. The pthreads library implements its barrier in assembly language for speed. We can do the same. The resulting .S file contains:

Unfortunately, there is little difference in speed between the C and asm versions of the barrier. (At least within the 10% variation seen here.) However, using the above doesn't hurt, and may even help on other machines.

Summary

Barriers are a useful synchronization primitive. However, they are relatively slow and heavy-weight in the glibc pthreads library. By changing to a different algorithm, we can speed things up by a factor of between 2 and 20.

Comments

Borislav Trifonov said...

Have you looked at phasers? It's a recent generalization of barriers that looks quite interesting; see for example "Hierarchical Phasers for Scalable Synchronization and Reductions in Dynamic Parallelism".

Bryan LaPorte said...

Definitely a better methodology, although I would put the spin loop/sleep in the straight-line path as opposed to after a branch, since it's statistically the most likely (branch prediction optimization here is a bit of a formality, I realize). It's worth pointing out that the spin loop optimization (in a loop as it is here) will be highly sensitive to the lengths of varying payloads. It might be worthwhile to use an in-out parameter to keep a per-thread spin count which can be tuned (obviously maintaining the non-spinning over-threading case): say, doubled after each thread sleep up to a maximum count.

sfuerst said...

There is now an article on Phasers. You can use the same tricks to speed those up as well. :-)

Basically, you want the spin-loop time to be roughly equal to a syscall/context switch time. Any longer, and you are wasting cpu. Any shorter, and timing jitter will get you. You could make it adaptive... but having a good default is probably better.

CT Chou said...

Suppose there are N threads in total and a pool barrier is initialized with count=N. Is it true that in pool_barrier_wait, it is impossible for any thread to reach the sys_futex call preceded by the comment: "We were too slow... wait for the barrier to be released"? So, in this case, is the while loop in pool_barrier_wait necessary?

Thanks in advance!

jessej said...

I got the code to work for 64 bit, however I could not get it to work with 32 bit. Any suggestions?