
Unfairness and Locking

There are many different ways to construct a spinlock. Some of them perform better than others, and some are more flexible than others. For example, the MCS lock cannot use the same API as a typical spinlock. On the other hand, the ticket lock enforces "fairness", whereas the exchange-based lock deliberately avoids it.

The article on spinlocks benchmarks their performance. However, the results there are not the whole story. In particular, the ticket lock can be slower than the exchange-based lock, even though the results given are the reverse. So what has gone wrong?

The key piece of information ignored is that the length of time a lock is held is an important variable. As will be seen, different lock algorithms behave differently depending on this. Firstly, we need to describe the locking benchmark in more detail. What we want is to have a number of threads try to take the lock a given number of times, and do a given amount of "work" whilst that lock is held. Something that does this is:


#include <pthread.h>
#include <stdint.h>
#include <err.h>

#define NUM_THREADS 4
#define PAUSE_COUNT 16384

#define LOCK_TAKE_COUNT ((1LL<<32)/PAUSE_COUNT)

static unsigned long long pause_count;
static unsigned long long lock_take_count;

/* locker() is defined below, once per lock type under test */
static void *locker(void *arg);

int main(void)
{
	pthread_t t[NUM_THREADS];
	
	int i;
	
	pause_count = PAUSE_COUNT;
	lock_take_count = LOCK_TAKE_COUNT/NUM_THREADS;
		
	/* Create threads */
	for (i = 0; i < NUM_THREADS; i++)
	{
		if (pthread_create(&t[i], NULL, locker, (void*)(uintptr_t)i)) errx(1, "pthread_create() failed\n");
	}
	
	/* Wait for them to finish */
	for (i = 0; i < NUM_THREADS; i++)
	{
		if (pthread_join(t[i], NULL)) errx(1, "pthread_join() failed\n");
	}
	return 0;
}

The code above prepares and launches the required number of threads. Each thread then executes the locker() routine, whose implementation depends on the type of lock being tested. By altering the NUM_THREADS and PAUSE_COUNT #defines we can change what we test. For exchange-based locks, we have:


static spinlock lock;
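/* spinlock, spin_lock() and spin_unlock() are the exchange-based lock from
   the spinlocks article; barrier() stops the compiler from optimizing the
   empty "work" loop away. */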
static void *locker(void *arg)
{
	unsigned long long i, j;
	
	(void) arg;
	
	for (i = 0; i < lock_take_count; i++)
	{
		spin_lock(&lock);
		for (j = 0; j < pause_count; j++)
		{
			barrier();
		}
		
		spin_unlock(&lock);
	}
	
	return NULL;
}

Similarly, for the ticket lock, we have:


static ticketlock lock;
static void *locker(void *arg)
{
	unsigned long long i, j;
	
	(void) arg;
	
	for (i = 0; i < lock_take_count; i++)
	{
		ticket_lock(&lock);
		for (j = 0; j < pause_count; j++)
		{
			barrier();
		}
		
		ticket_unlock(&lock);
	}
	
	return NULL;
}

Through the use of the pre-processor it is trivial to toggle which of the two functions is used.
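One way this selection might be done is sketched below. Note that the USE_TICKET_LOCK macro and the test_lock()/test_unlock() wrapper names are illustrative only, not part of the original benchmark code:


/* Hypothetical compile-time switch to pick the lock under test.
   USE_TICKET_LOCK and the test_* names are illustrative only. */
#ifdef USE_TICKET_LOCK
typedef ticketlock test_lock_t;
#define test_lock(l)	ticket_lock(l)
#define test_unlock(l)	ticket_unlock(l)
#else
typedef spinlock test_lock_t;
#define test_lock(l)	spin_lock(l)
#define test_unlock(l)	spin_unlock(l)
#endif

locker() can then be written once in terms of test_lock_t, test_lock() and test_unlock(), and the variant under test chosen at compile time with something like gcc -DUSE_TICKET_LOCK.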

In the spinlocks (and read-write spinlocks) article, the number of spins the inner loop takes was chosen to be 1024. This number was set to be large enough that it would realistically model the cache misses typically taken within a lock. It is, however, small enough that the overhead of the lock itself is still significant. Remember, a lock will wrap accesses to data that was most likely previously altered by a different thread. This means that cache misses are likely, and that the total hold time will be longer than you might otherwise expect.

However, it is possible that code might not wait as long as the 1024-count loop. If so, the timings could be different. Similarly, if a large amount of work is done within a spinlock, then the number chosen could be too small. What we need to see is how the benchmark results change when the PAUSE_COUNT is altered.

The times in seconds for the exchange-based spinlock are:

                          Threads
Pause count      1        2        3        4        5
16               8.4      28       30-70    80       90-100
64               6.1      10.2     12.5     14-18    19-21
256              5.6      6.5      6.5      6.6-7.1  6.9-8.2
1024             5.5      5.6      5.7      5.7      5.8
4096             5.4      5.5      5.5      5.5      5.7-6.1
16384            5.4      5.4      5.4      5.4      5.9-6.5

The times in seconds for the ticket spinlock are:

                          Threads
Pause count      1        2        3        4
16               6.1      53       70       100-135
64               4.2      16.0     18-19    25
256              3.8      6.7      7.2      8.7-9.5
1024             3.6      4.4      4.7      4.9-5.4
4096             3.6      3.8      3.9      4.7-4.9
16384            3.6      3.7      3.7      3.9-4.7

The ticket lock is catastrophically slow when the number of threads exceeds the number of cpus. Thus the 5-thread column is missing from the table above, since this is a quad-core machine.

The first thing one can do is compare the numbers above with those in the article from two years ago via the 1024 pause-count row. Fortunately, the numbers haven't changed too much, even though a different kernel (with a different HZ rate) is used, and this machine now has more (but slower) ram.

So what can we learn from the above? Most locks in a highly-optimized program should be rarely contended. Thus, the single-thread column is the one to look at. There, the ticket lock is always faster, no matter how long the lock is held. The question is why? Unfortunately, an answer is difficult to give. This may simply be an artifact of the micro-architecture of this machine (a machine with two dual-core Opterons), and not be representative of other cpu types. However, the results (at least on this machine) are definitive. Use the ticket lock if you have fewer threads than cpus, and your locks are uncontended.

So what happens as we increase contention? By adding more threads, the time taken universally increases. The reason is that transferring ownership of the lock between threads takes time. This may be handled internally by the cpu's cache-coherency protocol, or, on an older cpu, it may require a trip out to slow system ram. The more often the lock changes hands between threads, the more time is added.

When will ownership of a lock change? This is the key issue of "unfairness". If a thread drops a lock, and can quickly re-take it before another thread can, then the lock will not have to be transferred between cpu cores. The total time taken can be quite a bit less. Theoretically, the benchmark results can approach those of the single-threaded case with maximal unfairness. Each thread will complete its set amount of work in sequence, and only transfer the lock to the next cpu when it is complete. Latency will be horrible, but throughput will be maximized.
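To see why an exchange-based lock behaves this way, consider a minimal test-and-set lock. This is only a sketch using the gcc __sync built-ins and the x86 pause instruction, and is not necessarily the implementation benchmarked above:


typedef volatile int xchg_lock_t;

static void xchg_lock(xchg_lock_t *l)
{
	/* Whoever's atomic exchange returns zero owns the lock.  There is no
	   queue, so a thread that has just dropped the lock can win the race
	   again before the waiters even notice it was free. */
	while (__sync_lock_test_and_set(l, 1))
	{
		/* Spin on a plain read to avoid hammering the bus */
		while (*l) asm volatile("pause" ::: "memory");
	}
}

static void xchg_unlock(xchg_lock_t *l)
{
	__sync_lock_release(l);
}

The releasing thread still owns the lock's cache line, so it can often re-take the lock before the release is even visible on the other cores. That is the source of the unfairness, and of the throughput advantage under contention.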

In the real world, maximal unfairness is a bad thing. Latency does matter. If a thread can never acquire a lock, then starvation occurs, and forward progress fails. Depending on exactly how the kernel scheduler allocates time-slices, the number of lock ownership changes will vary from run to run, and so will the total time taken. For low pause counts and "high" thread numbers, this is apparent in the benchmark results of the exchange-based spinlock. The runtimes were highly variable, by up to a factor of two or more.

The ticket lock, on the other hand, is a "fair" lock. Threads take the lock in the order they asked for it. This greatly decreases timing jitter, but greatly increases the number of times the lock ownership changes. However, a little bit of jitter still remains. With four threads running on the quad-core machine, the scheduler will occasionally decide to run a background task instead of the thread that is next in line to acquire the lock. Since no other thread can proceed, they all must wait until the sleeping thread is re-woken. Since this occurs at random, the total time taken for the benchmark will be variable.
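A minimal ticket lock sketch makes the FIFO behaviour explicit. Again, this uses the gcc __sync built-ins and assumes x86 store ordering, and is not necessarily the implementation benchmarked above:


typedef struct
{
	volatile unsigned serving;	/* ticket currently allowed to hold the lock */
	volatile unsigned next;		/* next ticket to hand out */
} tkt_lock_t;

static void tkt_lock(tkt_lock_t *l)
{
	/* Take a ticket, then wait for it to be called: strict FIFO order,
	   so every contended hand-over moves the lock to a different thread. */
	unsigned ticket = __sync_fetch_and_add(&l->next, 1);

	while (l->serving != ticket)
	{
		asm volatile("pause" ::: "memory");	/* x86 spin-wait hint */
	}

	/* Compiler barrier: keep the critical section after the wait */
	asm volatile("" ::: "memory");
}

static void tkt_unlock(tkt_lock_t *l)
{
	/* Compiler barrier: keep the critical section before the release */
	asm volatile("" ::: "memory");
	l->serving++;
}

Because the next ticket almost always belongs to a thread on another core, every contended release forces the lock's cache line to move, which is exactly the cost the unfair exchange-based lock avoids.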

So what wins? Ticket locks are faster to acquire and release, but cause more ownership changes than exchange-based locks. It turns out that the answer depends on the length of time the lock is held. For short hold times, exchange locks are quicker. For medium to long hold times, ticket locks are better.

What about real programs, rather than benchmarks? Here, things are cloudier. We might not know how many cpu cores our application or library will run on. If so, we really cannot use the ticket lock, as its catastrophic slowdown when threads outnumber cpus is difficult to avoid. Also, real programs have differing amounts of tuning. They will most likely have several classes of locks, from coarse to fine grained. Coarse-grained locks typically have long hold times and high contention, whereas fine-grained locks have short hold times and low contention.

We can now look at the benchmark tables to investigate which lock algorithm performs best in the coarse and fine grained situations. The coarse-grained case corresponds to the bottom right of the tables, and the fine-grained case to the upper left. In both, the ticket lock looks to be the best. However, fine-grained locks with a bit of contention may be better implemented with exchange-based locks, as the 2-thread column shows.

This isn't the end of the story though. If you have a coarse-grained lock with lots of contention, you are throwing lots of performance away. You are probably better off thinking of some other data structure or algorithm to improve the parallelizability of your code. That could yield an order of magnitude improvement, rather than the few percent gained by switching to a different spinlock type. Finally, don't forget sleeping mutexes. If you have contention, having threads go to sleep can waste much less cpu than having them spin...

Comments

Aaron Altman said...
This is helpful, thanks. I'm working on a scheme to introduce thread safety into a profiling tool that has a bunch of unprotected writes to global state. These writes occur in wrappers for each function specifying an API, recording times before and after the execution of the underlying API call.

This article made me think that maybe exchange-based locks could be best for the profiling tool's average use case. If the locks are released during the underlying API call:

 enter_critical_section();
 do_pre_profiling();
 exit_critical_section();
 underlying_api_call();
 enter_critical_section();
 do_post_profiling();
 exit_critical_section();

it might actually make for more accurate profiling if the lock is usually reacquired by the releasing thread. Bonus if it's faster for overall throughput as well.
