
Sleeping Read-Write Locks

Posix Locks

Read-write locks are useful when a large fraction of the users of a critical section do not change any of the state protected by the lock. This allows these "readers" or "shared owners" to potentially execute in parallel. On the other hand, since the protected state also needs to change, another type of locking is required for "writers" or "exclusive owners". (Obviously, if the data never changes, then no locking is required at all.) The complexity of having two types of ownership makes read-write locks slower than traditional mutexes. In general, this form of locking is only faster when readers outnumber writers.

If the length of time one holds the lock is small, then it may be worth using a spinning read-write lock. However, due to the overhead of this synchronization primitive, it is usually only worth using when the lock hold time is relatively long. If this is the case, then the waiters are better off sleeping within the kernel until the lock is released, as compared to wasting cycles spinning. On Microsoft Windows, such a locking API is implemented by the "Slim Read-write locks" (SRWLOCK). On Linux, and other Unix-based operating systems, the pthreads pthread_rwlock_t is available.
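For reference, the benchmarks below use the standard pthreads calls. A minimal usage sketch:

#include <pthread.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static int shared_state;

int reader(void)
{
	int value;

	/* Many readers may hold the lock simultaneously */
	pthread_rwlock_rdlock(&lock);
	value = shared_state;
	pthread_rwlock_unlock(&lock);

	return value;
}

void writer(int value)
{
	/* A writer gets exclusive ownership */
	pthread_rwlock_wrlock(&lock);
	shared_state = value;
	pthread_rwlock_unlock(&lock);
}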

Glibc implements these locks roughly as follows:


typedef struct rwlock_t rwlock_t;
struct rwlock_t
{
	mutex m;		/* Mutex protecting the lock state below */
	unsigned wrlock;	/* Non-zero when write-locked */
	unsigned rcount;	/* Number of readers holding the lock */
	unsigned wwait;		/* Number of waiting writers */
	unsigned rwait;		/* Number of waiting readers */
	char writer_preferred;	/* If set, waiting writers block new readers */
};

int rwlock_rdlock(rwlock_t *l)
{
	mutex_lock(&l->m);
	
	/* Wait until there are no writers */
	while (l->wrlock || (l->wwait && l->writer_preferred))
	{
		l->rwait++;
		mutex_unlock(&l->m);
		sys_futex(&l->rwait, FUTEX_WAIT_PRIVATE, 1, NULL, NULL, 0);
		mutex_lock(&l->m);

		l->rwait--;
	}
	
	/* There is a reader */
	l->rcount++;
			
	mutex_unlock(&l->m);
			
	return 0;
}

int rwlock_wrlock(rwlock_t *l)
{
	mutex_lock(&l->m);
	
	/* Wait until there are no readers */
	while (l->rcount || l->wrlock)
	{
		l->wwait++;
		mutex_unlock(&l->m);
		sys_futex(&l->wwait, FUTEX_WAIT_PRIVATE, 1, NULL, NULL, 0);
		mutex_lock(&l->m);

		l->wwait--;
	}
	
	/* There is a writer */
	l->wrlock = 1;

	mutex_unlock(&l->m);

	return 0;
}

/* pthreads has merged read and write unlock */
int rwlock_unlock(rwlock_t *l)
{
	mutex_lock(&l->m);
	
	/* If there are readers, then we are read-locked */
	if (l->rcount)
	{
		/* One less reader */
		l->rcount--;
		
		/* Still have other readers */
		if (l->rcount)
		{
			mutex_unlock(&l->m);
			return 0;
		}
	}
	else
	{
		/* No longer write locked */
		l->wrlock = 0;
	}
	mutex_unlock(&l->m);
	
	/* Wake waiting writers first */
	if (l->wwait)
	{
		sys_futex(&l->wwait, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
		return 0;
	}
	
	/* Then wake all waiting readers */
	if (l->rwait)
	{
		sys_futex(&l->rwait, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
	}

	return 0;
}

The above code is obviously not exactly how Glibc implements them as the real code is written in assembly and has different identifiers. However, the above is algorithmically how it works. Note how there is only one type of unlock routine (as mandated by the Posix specification), which determines whether to read or write unlock based on the current owner. There are also two different types of lock based on whether the writer_preferred parameter is set or not. If it is set, then waiting writers will prevent new readers from taking the lock. If it is unset, then new readers can preempt writers so long as other readers still have the lock. This gives two different sets of performance characteristics.

We benchmark the Posix read-write locks by having four threads (on this quad-core machine) randomly read and write lock the lock. They then spin-wait for a given number of cycles, and then unlock the lock. By varying the fraction of time we write-lock the lock as compared to read-lock it, we can see how this affects the speed.
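In outline, each benchmark thread does something like the following sketch. (This is a reconstruction of the harness described above; the iteration count, hold time and write fraction are illustrative, and cpu_relax is the spin-wait hint used throughout this article.)

#include <pthread.h>
#include <stdlib.h>

#define ITERS		(1 << 20)
#define HOLD_CYCLES	100
#define WRITE_FRAC	25	/* Writers per 256 lock operations */

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

static void *bench_thread(void *arg)
{
	unsigned seed = (unsigned) (unsigned long) arg;
	int i, j;

	for (i = 0; i < ITERS; i++)
	{
		/* Randomly read- or write-lock, by the chosen fraction */
		if ((rand_r(&seed) & 255) < WRITE_FRAC)
		{
			pthread_rwlock_wrlock(&lock);
		}
		else
		{
			pthread_rwlock_rdlock(&lock);
		}

		/* Hold the lock by spin-waiting for a given number of cycles */
		for (j = 0; j < HOLD_CYCLES; j++) cpu_relax();

		pthread_rwlock_unlock(&lock);
	}

	return NULL;
}

Four of these threads are started, one per core, and the total time to completion is measured.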

The default initialization of a lock using PTHREAD_RWLOCK_INITIALIZER gives a read-preferring version:

Reader-Preferring Posix Lock
Writers per 256:    0     1     25    128   250
Time (s):           1.1   2.0   4.4   6.1   6.4

Using PTHREAD_RWLOCK_WRITER_NONRECURSIVE_INITIALIZER_NP gives the writer-preferring variant:

Writer-Preferring Posix Lock
Writers per 256:    0     1     25    128   250
Time (s):           1.1   1.6   6.7   10.7  6.8

Notice how the performance characteristics of the two lock types are quite different. Writer-preferring locks are better when there are mostly readers because they prevent writer starvation. However, as the writer fraction rises they perform worse than their competitor, with the worst case at about 50-50 readers to writers. Finally, when the number of writers dominates, the two types of lock perform comparably. (However, in this situation, you'd probably want to use a simple mutex instead.)

Futex Bitset Locks

The above code "obviously" isn't optimal. It uses a low level mutex to synchronize the internal state of the read-writer lock. If we can avoid this, then perhaps we can gain extra parallelism. For example, it may be possible for many readers to simultaneously lock at the same time instead of being synchronized. The obvious way to do this is to use the bitset capability of Futexes. This allows one to record classes of threads waiting on the same futex. By using one bit for readers, and one for writers, we can then wake the correct group when desired.

A writer-preferring version might look like:


/* Sleeping Read-Write Lock (writer preferring) */
typedef union rwlock_t rwlock_t;

union rwlock_t
{
	unsigned long long ull;
	unsigned u;
	struct
	{
		unsigned char wlocked;
		unsigned char wcontended;
		unsigned char rlocked;
		unsigned char rcontended;
	} b;
};

#define RW_WLOCKED		(1ULL<<0)
#define RW_WCONTEND		(1ULL<<8)
#define RW_RLOCKED		(1ULL<<16)
#define RW_RCONTEND		(1ULL<<24)
#define RW_READS		(1ULL<<32)

#define RW_WRITER		1
#define RW_READER		2


void rwlock_wlock(rwlock_t *rw)
{
	while (1)
	{
		unsigned state = rw->u;

		/* Atomic read of lock state */
		barrier();
		
		/* Try to set write-locked if there are no readers */
		if (!(state & (RW_WLOCKED | RW_RLOCKED)))
		{
			/* If no readers, set write-locked bit */
			if (cmpxchg(&rw->ull, state, state | RW_WLOCKED) == state) return;
			
			/* Don't burn too much cpu waiting for RW_RLOCKED to be set */
			cpu_relax();
			
			/* Try again */
			continue;
		}
		
		/* Wait */
		rw->b.wcontended = 1;
		state |= RW_WCONTEND;

		sys_futex(rw, FUTEX_WAIT_BITSET_PRIVATE, state, NULL, NULL, RW_WRITER);
		
		/* Other writers may exist */
		rw->b.wcontended = 1;
		barrier();
	}
}

void rwlock_wunlock(rwlock_t *rw)
{
	unsigned state = rw->u;
	int i;
	
	/* Locked and uncontended */
	if ((state == 1) && (cmpxchg(&rw->u, 1, 0) == 1)) return;
	
	/* Unlock */
	rw->b.wlocked = 0;
	
	barrier();
	
	/* Spin and hope someone takes the lock */
	for (i = 0; i < 200; i++)
	{
		if (rw->b.wlocked || rw->b.rlocked) return;
		
		cpu_relax();
	}
	
	/* Wake a writer */
	if (rw->b.wcontended)
	{
		rw->b.wcontended = 0;
		if (sys_futex(rw, FUTEX_WAKE_BITSET_PRIVATE, 1, NULL, NULL, RW_WRITER)) return;
	}
	
	/* Wake all readers */
	if (rw->b.rcontended)
	{
		rw->b.rcontended = 0;
		sys_futex(rw, FUTEX_WAKE_BITSET_PRIVATE, INT_MAX, NULL, NULL, RW_READER);
	}
}

void rwlock_rlock(rwlock_t *rw)
{
	while (1)
	{
		unsigned long long state = atomic_xadd(&rw->ull, RW_READS);
	
		/* Success? */
		if (!(state & (RW_WLOCKED | RW_WCONTEND)))
		{
			/* Already marked read-locked */
			if (state & RW_RLOCKED) return;
			
			/* Set read-locked bit so write-lock futex wait works correctly */
			rw->b.rlocked = 1;
			return;
		}

		/* Read unlock */
		state = atomic_xadd(&rw->ull, -RW_READS);
	
		/* Is it not write locked yet? */
		if (!(state & RW_WLOCKED))
		{
			/* Not write-contended, try again */
			if (!(state & RW_WCONTEND)) continue;

			state -= RW_READS;

			/* Wake a writer if needed */
			if (!(state >> 32))
			{
				/* Only wake on a readlocked -> writelocked transition */
				if (xchg_8(&rw->b.rlocked, 0) == 1)
				{
					rw->b.wcontended = 0;
					if (!sys_futex(rw, FUTEX_WAKE_BITSET_PRIVATE, 1, NULL, NULL, RW_WRITER))
					{
						/* None there, so mustn't be contended any more */
						continue;
					}
				}
			}
		}
		
		state |= RW_RCONTEND;
		rw->b.rcontended = 1;
		
		sys_futex(rw, FUTEX_WAIT_BITSET_PRIVATE, state, NULL, NULL, RW_READER);
	}
}

void rwlock_runlock(rwlock_t *rw)
{
	/* Read unlock */
	unsigned long long state = atomic_xadd(&rw->ull, -RW_READS);
	int i;
	
	state -= RW_READS;
		
	while (1)
	{
		/* Other readers there, don't do anything */
		if (state >> 32) return;
		
		/* Try to turn off read-locked flag */
		if (cmpxchg(&rw->ull, state, state & ~RW_RLOCKED) == state) break;
		
		/* Failure, try again */
		state = rw->ull;
		cpu_relax();
	}
	
	/* Spin and hope someone takes the lock, or have no one waiting */
	for (i = 0; i < 200; i++)
	{
		if (rw->b.wlocked || rw->b.rlocked || !rw->b.wcontended) return;
		
		cpu_relax();
	}
	
	/* We need to wake someone up */
	rw->b.wcontended = 0;
	sys_futex(rw, FUTEX_WAKE_BITSET_PRIVATE, 1, NULL, NULL, RW_WRITER);
}

This (provided gcc is prevented from inlining the functions) gives the results:

Writer-Preferring Bitset Lock
Writers per 256:    0     1     25    128   250
Time (s):           2.2   2.8   4.9   6.5   5.3

The above lock is unfortunately quite a bit slower than the Posix locks when there are many readers. Only when the writer count is unreasonably large does it perform well.

The reader-preferring variant is a bit simpler because the read-lock routine doesn't have to wait for writers:


/* Sleeping Read-Write Lock (reader preferring) */
typedef union rwlock_t rwlock_t;

union rwlock_t
{
	unsigned long long ull;
	unsigned u;
	struct
	{
		unsigned char wlocked;
		unsigned char wcontended;
		unsigned char rlocked;
		unsigned char rcontended;
	} b;
};

#define RW_WLOCKED		(1ULL<<0)
#define RW_WCONTEND		(1ULL<<8)
#define RW_RLOCKED		(1ULL<<16)
#define RW_RCONTEND		(1ULL<<24)
#define RW_READS		(1ULL<<32)

#define RW_WRITER		1
#define RW_READER		2

void rwlock_wlock(rwlock_t *rw)
{
	while (1)
	{
		unsigned state = rw->u;

		/* Atomic read of lock state */
		barrier();
		
		if (!(state & RW_WLOCKED))
		{
			if (!(state & RW_RLOCKED))
			{
				/* If no readers, set write-locked bit */
				if (cmpxchg(&rw->ull, state, state | RW_WLOCKED) == state) return;
			}
			
			/* Wake sleeping readers, if there are any */
			if (rw->b.rcontended)
			{
				rw->b.rcontended = 0;
				if (!sys_futex(rw, FUTEX_WAKE_BITSET_PRIVATE, INT_MAX, NULL, NULL, RW_READER)) continue;
			}
		}
		
		/* Wait until not write-locked */
		rw->b.wcontended = 1;
		state |= RW_WCONTEND;

		sys_futex(rw, FUTEX_WAIT_BITSET_PRIVATE, state, NULL, NULL, RW_WRITER);
		
		/* Other writers may exist */
		rw->b.wcontended = 1;
		barrier();
	}
}

void rwlock_wunlock(rwlock_t *rw)
{
	int i;
	
	/* Unlock */
	rw->b.wlocked = 0;
	
	barrier();
		
	/* Wake all readers */
	if (rw->b.rcontended)
	{
		rw->b.rcontended = 0;
		if (sys_futex(rw, FUTEX_WAKE_BITSET_PRIVATE, INT_MAX, NULL, NULL, RW_READER)) return;
	}
	
	/* Wake a writer */
	if (rw->b.wcontended)
	{
		rw->b.wcontended = 0;
		sys_futex(rw, FUTEX_WAKE_BITSET_PRIVATE, 1, NULL, NULL, RW_WRITER);
	}
}

void rwlock_rlock(rwlock_t *rw)
{
	unsigned long long state = atomic_xadd(&rw->ull, RW_READS);
	
	while (1)
	{
		/* Success? */
		if (!(state & RW_WLOCKED))
		{
			/* Set read-locked bit so write-lock futex wait works correctly */
			if (!(state & RW_RLOCKED)) rw->b.rlocked = 1;
			
			return;
		}
		
		/* Wait until not write-locked */
		state |= RW_RCONTEND;
		rw->b.rcontended = 1;
		
		sys_futex(rw, FUTEX_WAIT_BITSET_PRIVATE, state, NULL, NULL, RW_READER);
		
		state = rw->ull;
		
		barrier();
	}
}

void rwlock_runlock(rwlock_t *rw)
{
	/* Read unlock */
	unsigned long long state = atomic_add(&rw->ull, -RW_READS);
	int i;
	
	while (1)
	{
		/* Other readers there, don't do anything */
		if (state >> 32) return;
		
		/* Try to turn off read-locked flag */
		if (cmpxchg(&rw->ull, state, state & ~RW_RLOCKED) == state) break;
		
		/* Failure, try again */
		state = rw->ull;
		cpu_relax();
	}

	/* Spin and hope someone takes the lock, or have no one waiting */
	for (i = 0; i < 200; i++)
	{
		if (rw->b.wlocked || rw->b.rlocked || !rw->b.wcontended) return;
		
		cpu_relax();
	}
	
	/* We need to wake someone up */
	rw->b.wcontended = 0;
	sys_futex(rw, FUTEX_WAKE_BITSET_PRIVATE, 1, NULL, NULL, RW_WRITER);
}

Reader-Preferring Bitset Lock
Writers per 256:    0     1     25    128   250
Time (s):           2.1   2.9   4.9   5.4   6.5

The read-preferring variant is also very slow compared to the Posix locks when there are many readers. However, it is the fastest so far for the case where the number of readers and writers is the same. There are two reasons for this slow-down. The first is that gcc doesn't know that the Futex syscall doesn't clobber any registers, so it saves and restores them all. In order to do this, it constructs a stack frame for each function. Unfortunately, the overhead of doing this can be quite large compared to the very little work done in the fast paths. Thus, in order to get optimal speed, using assembly is required (just as Glibc has done).

The second problem is the bitset Futex API itself. The problem here is that the kernel needs to scan the entire waitlist. Since the waitlist may contain both readers and writers, instead of just one or the other, it is longer than it otherwise would be. This increases the length of time the kernel holds the internal locks protecting the Futex waitlist, and thus reduces scalability. It probably doesn't matter much with only four threads hammering on the read-write lock API, but it will once hundreds, thousands or millions of threads do. In order to short-circuit this problem, we can simply make readers and writers wait on different lists.

Futex Eventcount Locks

The simplest way of splitting the single bitset-based lock into two is to use eventcounts. These allow a thread to notice if some event has occurred in the time since it registered its interest in that event. If the event hasn't occurred, the thread may then sleep until it does. This allows threads to notice if other threads alter the lock state, without having some sort of global mutex synchronizing everything.
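To make the idea concrete, here is a minimal eventcount sketch in the same style as the locks below. (The ec_key, ec_wait and ec_signal names are hypothetical; sys_futex, atomic_add and barrier are the primitives used throughout this article.) A waiter captures the count, re-tests its condition, and sleeps; a signaller bumps the count and wakes any registered sleepers:

typedef union eventcount eventcount;
union eventcount
{
	unsigned u;		/* Sequence count in bits 8 and up */
	unsigned char contend;	/* Low byte: set if a sleeper exists */
};

#define EC_CONTEND	1
#define EC_INC		256

/* Capture the count before testing the condition we care about */
static unsigned ec_key(eventcount *ec)
{
	unsigned key = ec->u | EC_CONTEND;
	barrier();
	return key;
}

/* Sleep, unless the count has already advanced past the captured key */
static void ec_wait(eventcount *ec, unsigned key)
{
	ec->contend = 1;
	sys_futex(ec, FUTEX_WAIT_PRIVATE, key, NULL, NULL, 0);
}

/* Advance the count, and wake sleepers if any registered themselves */
static void ec_signal(eventcount *ec)
{
	atomic_add(&ec->u, EC_INC);
	if (ec->contend)
	{
		ec->contend = 0;
		sys_futex(ec, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
	}
}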

The more complex writer-preferring algorithm using eventcounts looks like:


/* Sleeping Read-Write Lock using event counts (writer preferring) */
typedef struct rwlock_ec rwlock_ec;

struct rwlock_ec
{
	union
	{
		unsigned u;
		unsigned char wlocked;
	} l;
	
	union
	{
		unsigned u;
		unsigned char contend;
	} write_ec;
	
	union
	{
		unsigned u;
		unsigned char contend;
	} read_ec;
};

#define RW_WLOCKED		(1ULL<<0)
#define RW_READS		(1ULL<<8)

#define RW_EC_CONTEND	(1ULL<<0)
#define RW_EC_INC		(1ULL<<8)

void rwlock_wlock(rwlock_ec *rw)
{
	while (1)
	{
		unsigned ec = rw->write_ec.u | RW_EC_CONTEND;
		
		/* Try to set write-locked if there are no readers */
		if (!cmpxchg(&rw->l.u, 0, RW_WLOCKED)) return;
		
		/* Wait */
		rw->write_ec.contend = 1;
		sys_futex(&rw->write_ec, FUTEX_WAIT_PRIVATE, ec, NULL, NULL, 0);
		
		/* Other writers may exist */
		rw->write_ec.contend = 1;
		barrier();
	}
}

void rwlock_wunlock(rwlock_ec *rw)
{
	/* Unlock */
	rw->l.wlocked = 0;
	
	/* Wake a writer */
	atomic_add(&rw->write_ec.u, RW_EC_INC);
	if (rw->write_ec.contend)
	{
		rw->write_ec.contend = 0;
		if (sys_futex(&rw->write_ec, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0)) return;
	}
	
	/* Wake all readers */
	atomic_add(&rw->read_ec.u, RW_EC_INC);
	if (rw->read_ec.contend)
	{
		rw->read_ec.contend = 0;
		sys_futex(&rw->read_ec, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
	}
}

void rwlock_rlock(rwlock_ec *rw)
{
	while (1)
	{
		unsigned ec = rw->read_ec.u | RW_EC_CONTEND;
		unsigned state;
		
		/* Make sure event count is read before trying to take lock */
		barrier();
		
		state = atomic_xadd(&rw->l.u, RW_READS);
	
		/* Success? */
		if (!(state & RW_WLOCKED) && !rw->write_ec.contend) return;

		/* Read unlock */
		state = atomic_xadd(&rw->l.u, -RW_READS);
		
		/*
		 * Is it not write locked yet, but a writer is sleeping,
		 * and we are at the readlocked -> writelocked transition.
		 */
		if (state == RW_READS)
		{
			atomic_add(&rw->write_ec.u, RW_EC_INC);
			if (rw->write_ec.contend)
			{
				rw->write_ec.contend = 0;
				
				/* Wake the writer, and then try again */
				sys_futex(&rw->write_ec, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
				continue;
			}
		}
		
		rw->read_ec.contend = 1;
		
		barrier();
		
		/* No need to enter the kernel... quickly test beforehand */
		if (rw->read_ec.u != ec) continue;
		
		/* Wait */
		sys_futex(&rw->read_ec, FUTEX_WAIT_PRIVATE, ec, NULL, NULL, 0);
	}
}

void rwlock_runlock(rwlock_ec *rw)
{
	/* Read unlock */
	unsigned state = atomic_add(&rw->l.u, -RW_READS);
	
	/* Other readers there, don't do anything */
	if (state >> 8) return;
	
	/* We need to wake someone up */
	atomic_add(&rw->write_ec.u, RW_EC_INC);
	if (rw->write_ec.contend)
	{
		rw->write_ec.contend = 0;
		sys_futex(&rw->write_ec, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
	}
}

The reader-preferring version is again slightly simpler:


/* Sleeping Read-Write Lock using event counts (reader preferring) */
#define RW_WLOCKED		(1ULL<<0)
#define RW_READS		(1ULL<<8)

#define RW_EC_CONTEND	(1ULL<<0)
#define RW_EC_INC		(1ULL<<8)

void rwlock_wlock(rwlock_ec *rw)
{
	while (1)
	{
		unsigned ec = rw->write_ec.u;
		
		/* Try to set write-locked if there are no readers */
		if (!cmpxchg(&rw->l.u, 0, RW_WLOCKED)) return;
		
		rw->write_ec.contend = 1;
		ec |= RW_EC_CONTEND;
		
		sys_futex(&rw->write_ec, FUTEX_WAIT_PRIVATE, ec, NULL, NULL, 0);
		
		/* Other writers may exist */
		rw->write_ec.contend = 1;
		barrier();
	}
}

void rwlock_wunlock(rwlock_ec *rw)
{
	/* Unlock */
	rw->l.wlocked = 0;
	
	/* Wake all readers */
	atomic_add(&rw->read_ec.u, RW_EC_INC);
	if (rw->read_ec.contend)
	{
		rw->read_ec.contend = 0;
		if (sys_futex(&rw->read_ec, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0)) return;
	}
	
	/* Wake a writer */
	atomic_add(&rw->write_ec.u, RW_EC_INC);
	if (rw->write_ec.contend)
	{
		rw->write_ec.contend = 0;
		sys_futex(&rw->write_ec, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
	}
}

void rwlock_rlock(rwlock_ec *rw)
{
	unsigned ec = rw->read_ec.u;
	unsigned state;
	
	/* Make sure event count is read before trying to take lock */
	barrier();
		
	state = atomic_xadd(&rw->l.u, RW_READS);
	
	while (state & RW_WLOCKED)
	{
		ec |= RW_EC_CONTEND;
		
		/* Sleep until no longer held by a writer */
		rw->read_ec.contend = 1;
		sys_futex(&rw->read_ec, FUTEX_WAIT_PRIVATE, ec, NULL, NULL, 0);
				
		ec = rw->read_ec.u;
		
		barrier();
		
		state = rw->l.u;
	}
}

void rwlock_runlock(rwlock_ec *rw)
{
	/* Read unlock */
	unsigned state = atomic_add(&rw->l.u, -RW_READS);
	
	/* Other readers there, don't do anything */
	if (state) return;
	
	/* We need to wake someone up */
	atomic_add(&rw->write_ec.u, RW_EC_INC);
	if (rw->write_ec.contend)
	{
		rw->write_ec.contend = 0;
		sys_futex(&rw->write_ec, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
	}
}

However, for optimal speed, we need to use assembly language. The optimized writer-preferring version is:


    .globl rwlock_wlock
    .type rwlock_wlock, @function
    .align 16
	
rwlock_wlock:
1:  mov    $0x1, %ecx
    xor    %eax, %eax
    mov    0x4(%rdi), %edx

    # Try to set write-locked
    lock cmpxchg %ecx, (%rdi)
    jne    2f
    rep; retq
	
    # Failed, go to sleep
2:  or     $0x1, %dl
    mov    $SYS_futex, %eax
    xor    %r10, %r10
    mov    $FUTEX_WAIT_PRIVATE, %esi
    movb   $0x1, 0x4(%rdi)
    add    $0x4, %rdi
    syscall
    movb   $0x1, (%rdi)
    sub    $0x4, %rdi
    jmp    1b
    .size rwlock_wlock, .-rwlock_wlock


    .globl rwlock_wunlock
    .type rwlock_wunlock, @function
    .align 16
	
rwlock_wunlock:
    mov    $0x100, %eax
    add    $0x4, %rdi
    movb   $0x0, -0x4(%rdi)
	
    # Any writers sleeping?
    lock addl %eax, (%rdi)
    cmpb   $0x0, (%rdi)
    jne    2f
	
    # Any readers sleeping?
    lock xadd %eax, 0x4(%rdi)
    test   %al, %al
    jne    3f
1:  rep; retq
	
    # Wake a writer
2:  mov    $SYS_futex, %eax
    mov    $FUTEX_WAKE_PRIVATE, %esi
    mov    $0x1, %edx
    movb   $0x0, (%rdi)
    syscall
    test   %eax, %eax
    jne    1b
	
    # Any readers sleeping?
    mov    $0x100, %eax
    lock xadd %eax, 0x4(%rdi)
    test   %al, %al
    je     1b

    # Wake all readers
3:  add    $0x4, %rdi
    mov    $SYS_futex, %eax
    mov    $FUTEX_WAKE_PRIVATE, %esi
    mov    $0x7fffffff, %edx
    movb   $0x0, (%rdi)
    syscall
    retq
	
    .size rwlock_wunlock, .-rwlock_wunlock


    .globl rwlock_rlock
    .type rwlock_rlock, @function
    .align 16

rwlock_rlock:
    add    $0x8, %rdi
1:  mov    $0x100, %eax
    mov    (%rdi), %edx
    mov    %eax, %ecx
	
    # Attempt to grab lock
    lock xadd %eax, -0x8(%rdi)
    or     -0x4(%rdi), %al
    jne    2f
    rep; retq
	
    # Failed, unlock
2:  mov    $SYS_futex, %eax
    lock sub %ecx, -0x8(%rdi)
    je     4f
	
    # Sleep on read eventcount
3:  or     $0x1, %dl
    mov    $FUTEX_WAIT_PRIVATE, %esi
    xor    %r10, %r10
    movb   $0x1, (%rdi)
    syscall
    jmp    1b
	
    # Test to see if any writers are sleeping
4:  lock add %ecx, -0x4(%rdi)
    cmpb   $0x0, -0x4(%rdi)
    je     3b
	
    # Wake a writer, and then try again
    sub    $0x4, %rdi
    mov    $FUTEX_WAKE_PRIVATE, %esi
    mov    %rdx, %r8
    movb   $0x0, (%rdi)
    mov    $0x1, %edx
    syscall
    add    $0x4, %rdi
    mov    %r8, %rdx
    jmp    3b	
    .size rwlock_rlock, .-rwlock_rlock


    .globl rwlock_runlock
    .type rwlock_runlock,@function
    .align 16

rwlock_runlock:
    # One less reader
    mov    $0x100, %eax
    lock sub %eax, (%rdi)
    je     1f
    rep; retq
	
    # Do we need to wake someone up?
1:  lock xaddl %eax, 0x4(%rdi)
    test   %al, %al
    jne    2f
    rep; retq
	
    # Wake via the write eventcount
2:  add    $0x4, %rdi
    mov    $SYS_futex, %eax
    mov    $FUTEX_WAKE_PRIVATE, %esi
    mov    $0x1, %edx
    movb   $0x0, (%rdi)
    syscall
    retq
    .size rwlock_runlock, .-rwlock_runlock

This code gives the following timings:

Writer-Preferring Eventcount Lock
Writers per 256:    0     1     25    128   250
Time (s):           2.1   2.6   4.7   7.3   8.5

These results are slightly faster than the bitset locks for larger reader fractions, but still horrible compared to the pthreads code. In addition, they are much worse for large writer fractions.

The reader-preferring variant looks like:


    .globl rwlock_wlock
    .type rwlock_wlock, @function
    .align 16
	
rwlock_wlock:
1:  mov    $0x1, %ecx
    xor    %eax, %eax
    mov    0x4(%rdi), %edx
	
    # Try to set to write-locked
    lock cmpxchg %ecx, (%rdi)
    jne    2f
    rep; retq
	
    # Failed, sleep until woken via an eventcount
2:  or     $0x1, %dl
    mov    $SYS_futex, %eax
    xor    %r10, %r10
    mov    $FUTEX_WAIT_PRIVATE, %esi
    movb   $0x1, 0x4(%rdi)
    add    $0x4, %rdi
    syscall
    movb   $0x1, (%rdi)
    sub    $0x4, %rdi
    jmp    1b
    .size rwlock_wlock, .-rwlock_wlock


    .globl rwlock_wunlock
    .type rwlock_wunlock, @function
    .align 16

rwlock_wunlock:
    mov    $0x100, %eax
    movb   $0x0, (%rdi)
	
    # Are there any sleeping readers?
    lock add %eax, 0x8(%rdi)
    cmpb    $0x0, 0x8(%rdi)
    jne    2f
	
    # Are there any sleeping writers?
    lock xadd %eax, 0x4(%rdi)
    test   %al, %al
    jne    3f
1:  rep; retq
	
    # Wake all readers
2:  movb   $0x0, 0x8(%rdi)
    mov    $0x7fffffff, %rdx
    mov    $SYS_futex, %eax
    mov    $FUTEX_WAKE_PRIVATE, %esi
    add    $0x8, %rdi
    syscall
    test   %eax, %eax
    jne    1b

    # Any sleeping writers?
    lock addl $0x100, -0x4(%rdi)
    cmpb    $0x0, -0x4(%rdi)
    je     1b
    sub    $0x8, %rdi
	
    # Wake a writer
3:  movb   $0x0, 0x4(%rdi)
    mov    $0x1, %edx
    add    $0x4, %rdi
    mov    $SYS_futex, %eax
    mov    $FUTEX_WAKE_PRIVATE, %esi
    syscall
    retq
    .size rwlock_wunlock, .-rwlock_wunlock

    .globl rwlock_rlock
    .type rwlock_rlock, @function
    .align 16
	
rwlock_rlock:
    mov    $0x100, %eax
    mov    0x8(%rdi), %edx
	
    # Attempt to take the lock
    lock xadd %eax, (%rdi)
    test   %al, %al
    jne    1f
    rep; retq
	
    # Failure, sleep until not held by a writer
1:  xor    %r10, %r10
    mov    $FUTEX_WAIT_PRIVATE, %esi
    add    $0x8, %rdi

2:  or     $0x1, %dl
    mov    $SYS_futex, %eax
    mov    %dl, (%rdi)
    syscall
    mov    (%rdi), %edx
    cmpb   $0x1, -0x8(%rdi)
    je     2b
    rep; retq
    .size rwlock_rlock, .-rwlock_rlock


    .globl rwlock_runlock
    .type rwlock_runlock,@function
    .align 16
	
rwlock_runlock:
    # One less reader
    mov    $0x100, %eax
    lock sub %eax, (%rdi)
    je     1f
    rep; retq
	
    # Do we need to wake someone up?
1:  lock xaddl %eax, 0x4(%rdi)
    test   %al, %al
    jne    2f
    rep; retq
	
    # Wake a writer
2:  add    $0x4, %rdi
    mov    $SYS_futex, %eax
    mov    $FUTEX_WAKE_PRIVATE, %esi
    mov    $0x1, %edx
    movb   $0x0, (%rdi)
    syscall
    retq
    .size rwlock_runlock, .-rwlock_runlock

This gives the following benchmark results:

Reader-Preferring Eventcount Lock
Writers per 256:    0     1     25    128   250
Time (s):           2.1   2.9   4.6   5.7   6.3

These are slightly worse for the reader-dominated cases, but much better in the writer-dominated ones. They still aren't nearly as good as the Glibc results.

Fair Read-Write Locks

The above locks are either reader-preferring or writer-preferring. However, there is a third type of read-write lock, one that prefers neither. Such a "Fair Lock" maintains an order such that neither readers nor writers can be starved. This can improve throughput since forward progress is guaranteed. The simplest way to create such a lock uses two separate mutexes. One mutex maintains the ordering, and the second controls the mutual exclusivity between readers and writers:


/* Fair rwlock using two mutexes */

typedef struct fair_rwlock fair_rwlock;
struct fair_rwlock
{
	mutex m;
	mutex writer;
	unsigned read_count;
};

void rwlock_wlock(fair_rwlock *rw)
{
	/* Use mutex m to communicate with read-lock code */
	mutex_lock(&rw->m);
	mutex_lock(&rw->writer);
	mutex_unlock(&rw->m);
}

void rwlock_wunlock(fair_rwlock *rw)
{
	mutex_unlock(&rw->writer);
}

void rwlock_rlock(fair_rwlock *rw)
{
	/* The first reader also locks the write mutex */
	mutex_lock(&rw->m);
	if (!atomic_xadd(&rw->read_count, 1)) mutex_lock(&rw->writer);
	mutex_unlock(&rw->m);
}

void rwlock_runlock(fair_rwlock *rw)
{
	/* When done, unlock the write mutex */
	if (atomic_xadd(&rw->read_count, -1) == 1) mutex_unlock(&rw->writer);
}

Notice how in the above, most of the complexity has been hidden within the mutex implementation. We will use a low-level lock the size of a Futex (integer) for this:


    .globl ll_lock
    .type ll_lock,@function
    .align 16
	
ll_lock:
    xor     %ecx, %ecx

    # Spin a bit to try to get lock
1:  mov     $1, %al
    xchgb   (%rdi), %al
    test    %al, %al
    jz      3f
    pause
    add     $-1, %cl
    jnz     1b
	
    # Set up syscall details
    mov     $0x0101, %edx
    mov     $FUTEX_WAIT_PRIVATE, %esi
    xor     %r10, %r10
	
    # Wait loop
2:  mov     $SYS_futex, %eax
    syscall
    mov     %edx, %eax
    xchgl   (%rdi), %eax
    test    $1, %eax
    jnz     2b
3:  rep; ret
    .size   ll_lock, .-ll_lock

    .globl ll_unlock
    .type ll_unlock,@function
    .align 16
	
ll_unlock:
    xor     %ecx, %ecx
    cmpl    $1, (%rdi)
    jne     1f
    mov     $1, %eax
    # Fast path... no one is sleeping
    lock; cmpxchgl %ecx, (%rdi)
    jz      3f
1:  movb    $0, (%rdi) 

    # Spin, and hope someone takes the lock
2:  testb   $1, (%rdi)
    jnz     3f
    pause
    add     $-1, %cl
    jnz     2b

    # Wake up someone
    movb    $0, 1(%rdi)
    mov     $SYS_futex, %eax
    mov     $FUTEX_WAKE_PRIVATE, %esi
    mov     $1, %edx
    syscall
3:  ret
    .size   ll_unlock, .-ll_unlock
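In C, this pair behaves roughly like the following sketch. (Here xchg_32 is an assumed 32-bit exchange helper; xchg_8, cmpxchg, cpu_relax, barrier and sys_futex are the helpers used elsewhere in this article.)

typedef union llmutex llmutex;
union llmutex
{
	unsigned u;
	struct
	{
		unsigned char locked;
		unsigned char contended;
	} b;
};

void ll_lock(llmutex *m)
{
	int i;

	/* Spin a bit, hoping to grab the lock */
	for (i = 0; i < 256; i++)
	{
		if (!xchg_8(&m->b.locked, 1)) return;
		cpu_relax();
	}

	/* Mark as locked + contended, and sleep until we take the lock */
	while (xchg_32(&m->u, 0x0101) & 1)
	{
		sys_futex(m, FUTEX_WAIT_PRIVATE, 0x0101, NULL, NULL, 0);
	}
}

void ll_unlock(llmutex *m)
{
	int i;

	/* Fast path: locked and uncontended */
	if ((m->u == 1) && (cmpxchg(&m->u, 1, 0) == 1)) return;

	/* Unlock */
	m->b.locked = 0;
	barrier();

	/* Spin, and hope someone takes the lock */
	for (i = 0; i < 256; i++)
	{
		if (m->b.locked) return;
		cpu_relax();
	}

	/* Wake someone up */
	m->b.contended = 0;
	sys_futex(m, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}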

Using the above primitives, we can construct the asm versions of the mutex-based fair read-write lock. The main trick here is that the asm routines do not modify %rdi, which holds the address of the mutex. This means that the calling routine does not have to save and restore this register, avoiding the construction of a stack frame.


    .globl rwlock_wlock
    .type rwlock_wlock,@function
    .align 16
	
rwlock_wlock:
    call    ll_lock
    add     $0x4, %rdi
    call    ll_lock
    sub     $0x4, %rdi
    jmp     ll_unlock
    .size rwlock_wlock, .-rwlock_wlock
	
	
    .globl rwlock_wunlock
    .type rwlock_wunlock,@function
    .align 16
	
rwlock_wunlock:
    add     $0x4, %rdi
    jmp     ll_unlock
    .size rwlock_wunlock, .-rwlock_wunlock

	
    .globl rwlock_rlock
    .type rwlock_rlock,@function
    .align 16
	
rwlock_rlock:
    call    ll_lock
    mov     $1, %eax
	
    # Are we the first reader?
    lock xadd %eax, 0x8(%rdi)
    test    %eax, %eax
    jnz     ll_unlock
    add     $0x4, %rdi
    call    ll_lock
    sub     $0x4, %rdi
    jmp     ll_unlock
    .size rwlock_rlock, .-rwlock_rlock


    .globl rwlock_runlock
    .type rwlock_runlock,@function
    .align 16
	
rwlock_runlock:
    lock addl $-1, 0x8(%rdi)
    jz      rwlock_wunlock
    rep; ret
    .size rwlock_runlock, .-rwlock_runlock

The benchmark results of the mutex-based fair lock are:

Mutex Fair Lock
Writers per 256:    0     1     25    128   250
Time (s):           1.1   1.3   2.6   6.0   6.1

This lock is amazingly fast, beating or equalling the Posix locks in all situations. However, some of the other lock implementations above do beat it when the number of readers is the same as writers. Perhaps we can improve speed some more by noticing that the above algorithm doesn't exploit the fact that multiple readers may be awoken simultaneously when a reader gains control of the lock. The simplest way of arranging this using "standard" threading primitives is to use a condition variable. By broadcasting to the condition variable, multiple threads can be awoken at once. C code that does this looks like:


typedef struct fair_rwlock fair_rwlock;
struct fair_rwlock
{
	mutex m;
	
	mutex waitlist;
	
	cv read_wait;
	
	unsigned read_count;
	unsigned write_count;
	char read_flag;
};


void rwlock_wlock(fair_rwlock *rw)
{
	/* Use mutex m to synchronize the lock state with the read path */
	mutex_lock(&rw->m);
	rw->write_count++;
	mutex_unlock(&rw->m);
	
	mutex_lock(&rw->waitlist);
}

void rwlock_wunlock(fair_rwlock *rw)
{
	/* Use mutex m to synchronize the lock state with the read path */
	mutex_lock(&rw->m);
	rw->write_count--;
	mutex_unlock(&rw->m);
	
	mutex_unlock(&rw->waitlist);
}

void rwlock_rlock(fair_rwlock *rw)
{
	mutex_lock(&rw->m);
	
	while (1)
	{
		/* Fast path, only readers */
		if (!rw->write_count && rw->read_count)
		{
			rw->read_count++;
			mutex_unlock(&rw->m);

			return;
		}

		/* Any other waiting readers? */
		if (!rw->read_flag) break;
		
		/* Have a waiting reader, append to wait-list */
		cond_wait(&rw->read_wait, &rw->m);
		
		/* If other readers are there, then we can read-lock. */
		if (rw->read_count)
		{
			rw->read_count++;
			mutex_unlock(&rw->m);
			return;
		}
	}

	/* Need to wait until writers are done */
	rw->read_flag = 1;
	mutex_unlock(&rw->m);
	mutex_lock(&rw->waitlist);
	mutex_lock(&rw->m);
	
	/* Wake other waiting readers, and lock the write-lock */
	rw->read_flag = 0;
	rw->read_count++;
	cond_broadcast(&rw->read_wait);
	mutex_unlock(&rw->m);
}

void rwlock_runlock(fair_rwlock *rw)
{
	mutex_lock(&rw->m);
	rw->read_count--;
	if (!rw->read_count)
	{
		/* The last reader unlocks the write-lock */
		mutex_unlock(&rw->waitlist);
	}
	mutex_unlock(&rw->m);
}

And similarly, in assembly language:


    .globl rwlock_wlock
    .type rwlock_wlock,@function
    .align 16
	
rwlock_wlock:
    call    ll_lock
    addl    $1, 0x10(%rdi)
    call    ll_unlock
	
    add     $0x4, %rdi
    jmp     ll_lock
    .size rwlock_wlock, .-rwlock_wlock
	
	
    .globl rwlock_wunlock
    .type rwlock_wunlock,@function
    .align 16
	
rwlock_wunlock:
    call    ll_lock
    subl    $1, 0x10(%rdi)
    call    ll_unlock
	
    add     $0x4, %rdi
    jmp     ll_unlock
    .size rwlock_wunlock, .-rwlock_wunlock

	
    .globl rwlock_rlock
    .type rwlock_rlock,@function
    .align 16

rwlock_rlock:
    call    ll_lock
	
1:  cmpl    $0, 0x10(%rdi)
    jne     2f
    cmpl    $0, 0xc(%rdi)
    je      2f
	
    addl    $1, 0xc(%rdi)
    jmp     ll_unlock

2:  cmpb    $0, 0x14(%rdi)
    je      3f
	
    # Save seq for condvar
    movl    0x8(%rdi), %r8d

    # Unlock mutex
    call    ll_unlock
    add     $0x8, %rdi

    # Wait on condvar
    mov     %r8d, %edx
    mov     $SYS_futex, %eax
    xor     %r10, %r10
    mov     $FUTEX_WAIT_PRIVATE, %esi
    movb    $1, (%rdi)
    syscall

    # Lock the mutex again
    sub     $0x8, %rdi
    call    ll_lock

    #  Mark as contended
    movb    $1, 0x1(%rdi)

    cmpl    $0, 0xc(%rdi)
    je      1b
    addl    $1, 0xc(%rdi)
    jmp     ll_unlock

    # We are the first reader - write lock it
3:  movb    $1, 0x14(%rdi)
    call    ll_unlock
    add     $0x4, %rdi
    call    ll_lock
    sub     $0x4, %rdi
    call    ll_lock
	
    # Convert to a read lock
    movb    $0, 0x14(%rdi)
    mov     $0x100, %eax
    addl    $1, 0xc(%rdi)
	
    # Any readers waiting?
    lock xadd %eax, 0x8(%rdi)
    test    $1, %al
    jz      ll_unlock
	
    # Wake them
    mov     %rdi, %r8
    add     $0x8, %rdi
    mov     $SYS_futex, %eax
    mov     $FUTEX_REQUEUE_PRIVATE, %esi
    mov     $1, %edx
    mov     $0x7fffffff, %r10
    syscall
    sub     $0x8, %rdi
    jmp ll_unlock
    .size rwlock_rlock, .-rwlock_rlock


    .globl rwlock_runlock
    .type rwlock_runlock,@function
    .align 16
	
rwlock_runlock:
    call    ll_lock
    subl    $1, 0xc(%rdi)
    jnz     ll_unlock
    call    ll_unlock
    add     $0x4, %rdi
    jmp     ll_unlock
    .size rwlock_runlock, .-rwlock_runlock

This algorithm benchmarks as:

Mutex + CV Fair Lock
Writers per 256:    0     1     25    128   250
Time (s):           1.6   4.2   7.8   6.9   5.4

This performs horribly for medium numbers of readers, but very well for large numbers of writers. However, again, such a situation would usually be solved through the use of a simple mutex rather than a read-write lock. The reason for the poor performance is that the condition variable at the lowest level doesn't perform in the way we would wish. Instead of waking all sleeping readers, it wakes one, and re-queues the rest on the mutex. We can fix this by replacing the condvar with an eventcount. Similarly, we can merge the waiter count with the mutex, and the read flag with the reader count, shrinking the lock data structure quite a bit. Doing this yields the rather complex:


typedef struct fair_rwlock fair_rwlock;
struct fair_rwlock
{
	union
	{
		unsigned long long l;	/* Bit 0: locked, bit 1: readers may enter,
					   bits 8-31: waiter count, bit 32: read flag,
					   bits 33-63: reader count */
		unsigned waiters;	/* Low 32 bits, used as the futex word */
		unsigned char locked;	/* Low byte, used for byte-wise lock ops */
	} lock;
	
	union
	{
		unsigned seq;		/* Eventcount sequence for waiting readers */
		unsigned char contend;	/* Low byte: a reader is sleeping */
	} read_wait;
};

void rwlock_wlock(fair_rwlock *rw)
{
	int i;
	
	/* There is a waiting writer */
	atomic_add(&rw->lock.waiters, 256);
	
	/* Spin for a bit to try to get the lock */
	for (i = 0; i < 100; i++)
	{
		if (!(xchg_8(&rw->lock.locked, 1) & 1)) return;

		cpu_relax();
	}
	
	/* Failed, so we need to sleep */
	while (xchg_8(&rw->lock.locked, 1) & 1)
	{
		sys_futex(&rw->lock, FUTEX_WAIT_PRIVATE, rw->lock.waiters | 1, NULL, NULL, 0);
	}
}

void rwlock_wunlock(fair_rwlock *rw)
{
	/* One less writer and also unlock simultaneously */
	if (atomic_add(&rw->lock.waiters, -257))
	{
		int i;
		
		/* Spin for a bit hoping someone will take the lock */
		for (i = 0; i < 200; i++)
		{
			if (rw->lock.locked) return;

			cpu_relax();
		}
		
		/* Failed, we need to wake someone */
		sys_futex(&rw->lock, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
	}
}

void rwlock_rlock(fair_rwlock *rw)
{
	while (1)
	{
		unsigned long long val;
		unsigned seq = rw->read_wait.seq;
		
		barrier();
		
		val = rw->lock.l;
		
		/* Fast path, no readers or writers */
		if (!val)
		{
			if (!cmpxchg(&rw->lock.l, 0, 1 + 2 + (1ULL << 33))) return;
			
			/* Failed, try again */
			cpu_relax();
			continue;
		}
	
		/* Fast path, only readers */
		if ((((unsigned) val) == 1 + 2) && (val >> 33))
		{
			if (cmpxchg(&rw->lock.l, val, val + (1ULL << 33)) == val) return;
			
			/* Failed, try again */
			cpu_relax();
			continue;
		}
		
		/* read flag set? */
		if (val & (1ULL << 32))
		{
			/* Have a waiting reader, append to wait-list */
			rw->read_wait.contend = 1;

			sys_futex(&rw->read_wait, FUTEX_WAIT_PRIVATE, seq | 1, NULL, NULL, 0);
			
			val = rw->lock.l;
			
			/* If there are readers, we can take the lock */
			if (((val & 3) == 3) && (val >> 33))
			{
				/* Grab a read lock so long as we are allowed to */
				if (cmpxchg(&rw->lock.l, val, val + (1ULL << 33)) == val) return;
			}
			
			/* Too slow, try again */
			cpu_relax();
			continue;
		}
		
		/* Try to set read flag + waiting */
		if (cmpxchg(&rw->lock.l, val, val + 256 + (1ULL << 32)) == val) break;
		
		cpu_relax();
	}
	
	/* Grab write lock */
	while (xchg_8(&rw->lock.locked, 1) & 1)
	{
		sys_futex(&rw->lock, FUTEX_WAIT_PRIVATE, rw->lock.waiters | 1, NULL, NULL, 0);
	}
	
	/* Convert write lock into read lock. Use 2-bit to denote that readers can enter */
	atomic_add(&rw->lock.l, 2 - 256 - (1ULL << 32) + (1ULL << 33));
	
	/* We are waking everyone up */
	if (atomic_xadd(&rw->read_wait.seq, 256) & 1)
	{
		rw->read_wait.contend = 0;

		/* Wake all sleeping readers at once */
		sys_futex(&rw->read_wait, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
	}
}

void rwlock_runlock(fair_rwlock *rw)
{
	/* Unlock... and test to see if we are the last reader */
	if (!(atomic_add(&rw->lock.l, -(1ULL << 33)) >> 33))
	{
		rw->lock.locked = 0;

		barrier();
		
		/* We need to wake someone up? */
		if (rw->lock.waiters)
		{
			sys_futex(&rw->lock, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
		}
	}
}

Or, in asm:


    .globl rwlock_wlock
    .type rwlock_wlock,@function
    .align 16
	
rwlock_wlock:
    # There is now a waiting writer
    lock addl $0x100, (%rdi)
	
    # Try to write-lock
1:  mov    $0x1, %al
    xchg   %al, (%rdi)
    test   $0x1, %al
    jnz    2f
    rep; retq
 	
    # Wait until we can try to take the lock
2:  mov    (%rdi), %edx
    mov    $SYS_futex, %eax
    or     $1, %dl
    xor    %r10, %r10
    mov    $FUTEX_WAIT_PRIVATE, %esi
    syscall
    jmp    1b
    .size rwlock_wlock, .-rwlock_wlock


    .globl rwlock_wunlock
    .type rwlock_wunlock,@function
    .align 16

rwlock_wunlock:
    # One less writer + unlock simultaneously
    mov    $0xfffffeff, %eax
    lock xadd %eax, (%rdi)
    cmp    $0x101, %eax
    jne    1f
    rep; retq
	
    # We need to wake someone
1:  mov    $SYS_futex, %eax
    mov    $FUTEX_WAKE_PRIVATE, %esi
    mov    $0x1, %edx
    syscall
    retq
    .size rwlock_wunlock, .-rwlock_wunlock


    .globl rwlock_rlock
    .type rwlock_rlock,@function
    .align 16

runlock_c1:
.quad 0x200000003

runlock_c2:
.quad 0x200000000

runlock_c3:
.quad 0x100000100

rwlock_rlock:
    mov    (%rdi), %rax
    jmp    2f
1:  pause 
2:  test   %rax, %rax
    jnz    3f
	
    # Fast path - no one has the lock, so try to grab it
    mov    runlock_c1(%rip), %rcx
    lock cmpxchg %rcx, (%rdi)
    jnz    3f
    rep; retq
	
    # Are there other readers but no writers?
3:  mov    %rax, %rcx
    cmp    $0x3, %eax
    jne    5f
    shr    $0x21, %rcx
    je     5f
4:  mov    %rax, %rcx
    add    runlock_c2(%rip), %rcx
    lock cmpxchg %rcx, (%rdi)
    jnz    1b
    rep; retq
	
    # Is there a waiting reader?
5:  mov    8(%rdi), %edx
    testb  $0x1, 4(%rdi)
    je     6f
	
    # Wait until that reader is woken
    add    $0x8, %rdi
    orb    $0x1, %dl
    xor    %r10, %r10
    mov    $FUTEX_WAIT_PRIVATE, %esi
    mov    $SYS_futex, %eax
    movb   $0x1, (%rdi)
    syscall
	
    # Try to grab the lock once we are woken
    mov    -0x8(%rdi), %rax
    sub    $0x8, %rdi
    cmp    $0x3, %al
    jne    1b
    mov    %rax, %rcx
    shr    $0x21, %rcx
    je     1b
    jmp    4b

    # Try to be the first reader
6:  mov    %rax, %rcx
    add    runlock_c3(%rip), %rcx
    lock cmpxchg %rcx, (%rdi)
    jnz    1b
	
    # We are the first reader to wait
    xor    %r10, %r10
    mov    $FUTEX_WAIT_PRIVATE, %esi

    # Try to get the write lock
7:  mov    $0x1, %al
    xchg   %al, (%rdi)
    test   $0x1, %al
    jz     8f
	
    # Failed, sleep until we are next in line
    mov    (%rdi), %edx
    mov    $SYS_futex, %eax
    or     $1, %dl
    syscall
    jmp    7b
	
    # Convert into a read lock
8:  mov    $0xffffff02, %eax
    lock add %rax, (%rdi)
    mov    $0x100, %eax
	
    # Are there any sleeping readers?
    lock xadd %eax, 0x8(%rdi)
    test   $0x1, %al
    jz     9f
	
    # Wake them
    add    $0x8, %rdi
    mov    $SYS_futex, %eax
    mov    $FUTEX_WAKE_PRIVATE, %esi
    mov    $0x7fffffff, %edx
    movb   $0x0, (%rdi)
    syscall
9:  retq   
    .size rwlock_rlock, .-rwlock_rlock

    .globl rwlock_runlock
    .type rwlock_runlock,@function
    .align 16
	
rwlock_runlock:
    mov    $-2, %eax
    mov    %eax, %edx
	
    # Unlock, and test to see if we are the last reader
    lock xadd %eax, 4(%rdi)
    add    %edx, %eax
    and    %edx, %eax
    jz     1f
    rep; retq
	
    # Is there anyone sleeping?
1:  movb   $0, (%rdi)
    cmp    %eax, (%rdi)
    je     2f
	
    # Wake someone
    mov    $SYS_futex, %eax
    mov    $FUTEX_WAKE_PRIVATE, %esi
    mov    $0x1, %edx
    syscall
2:  retq
    .size rwlock_runlock, .-rwlock_runlock

Yielding the results:

Event Count Fair Lock
Writers per 256:    0     1     25    128   250
Time (s):           1.1   1.6   6.9   9.3   7.7

This is much improved compared to the condition variable based lock, being comparable to the writer-preferring Posix lock. However, it is quite slow compared to the mutex-based fair lock. Unfortunately, there is a race on reader-thread wakeup. A waking reader may miss the window where it can take the lock, in which case it goes back to sleep. This causes a loss of performance. One might hope that making the newly-wakened readers spin with a cmpxchg loop would help. However, that produces even worse benchmark results.

Summary

The best performing read-write locks seem to be the simplest. Using complex algorithms with many atomic instructions seems to result in worse performance. Even though the Posix read-write locks in Glibc are implemented via a mutex, they perform extremely well. However, it is still possible to do better. By using a mutex-based fair read-write lock, one can gain extra speed. That particular lock is about twice as fast as the default reader-preferring lock when there are many readers, and around twice as fast as the writer-preferring lock when the numbers of readers and writers are similar. In other situations, it performs as well as the competition.

It may be possible to improve the fair lock by allowing multiple readers to awaken simultaneously. However, doing so is non-trivial. The reader which wakes the others may finish with the critical section before the wakeups are complete. This causes a race which is hard to plug without extra locking that slows things down more than what is gained. The other possibility, of allowing such newly awakened readers to fail to grab the lock, causes the fairness guarantees to be weakened. This again results in a loss of performance due to starvation.

Comments

Samy Al Bahra said...
http://repnop.org/ck/doc/ck_bytelock.html has a good fast path for readers.
Borislav Trifonov said...
Would it be possible to increase the isolation between readers by having them use separate mutexes and counters, shifting more work to the writer which would have to check all? Joe Duffy got an improvement for doing this with spinning RW locks and 16*CPUcount slots: http://www.bluebytesoftware.com/blog/2009/02/21/AMoreScalableReaderwriterLockAndABitLessHarshConsiderationOfTheIdea.aspx (at time of this post the link is down but Google's cache at http://webcache.googleusercontent.com/search?q=cache:GP1RCH1S5gcJ:www.bluebytesoftware.com/blog/2009/02/21/AMoreScalableReaderwriterLockAndABitLessHarshConsiderationOfTheIdea.aspx+/AMoreScalableReaderwriterLockAndABitLessHarshConsiderationOfTheIdea&cd=1&hl=en&ct=clnk&gl=ca&client=firefox-a&source=www.google.ca works).
Bryan LaPorte said...
Actually, for the event counting code, there is a bug in the assembly for the write-preferring rwlock_wunlock which I believe is skewing your results by causing many spurious wake-ups.

The check for writer contention should be a "cmpb" and not "cmp". As it is, we'll make the wake system call for exclusive waiters every time we do a write unlock. I reconstructed the test scenario described here and found that the corrected code is very competitive with the native rwlocks.

Really the worst issue is with the interface itself. The acquisition and release semantics should be exposed to the consumer of the lock, which has never been done entirely (if I recall, the Windows kernel has a shared-acquire-starve-exclusive call which is a step in this direction).

I also flipped over and ran the scenario on Windows SRW locks. In all but the uncontentested case (predictably, but sadly), they basically get toasted.

I dug into the SRW disassembly and found that unsurprisingly they're style over substance, with what appear to be some pretty silly things going on. I'd be happy to type something up on they're design and send it on, if you think your readers would be interested.
Bryan LaPorte said...
"Their", not "they're" in that last sentence. Hate typos.

@Borislav: I read through the article you posted and I'm not sure what you're asking about. Do you mean can a lock be constructed which separates the contention between readers and writers better?

The lock that Joe Duffy is describing is a C# algorithm being compared to other theoretical C# algorithms. The locks described above will more than likely beat it hands down in any scenario (the performance difference between C# and hand-optimized assembly is typically vast).
Alex said...
In the glibc's version of rwlock_rdlock(), shouldn't the waiting condition be instead:

 /* Wait until there are no writers */
-while (l->wwait || (l->wrlock && l->writer_preferred))
+while (l->wrlock || (l->wwait && l->writer_preferred))
 {
 l->rwait++;
 ...

Otherwise we cannot recursively rdlock as required by POSIX.
Borislav Trifonov said...
How would one go about implementing trylock() and timedlock() for the fair version of the R/W mutex, assuming the underlying mutex type supports these? It seems to me that for the timed version, an absolute timeout is needed where both the m and writer mutexes are part of the timed wait, because the m could be contended. Here's my attempt, but I'm not sure it's right:

bool rwlock_wtrylock(fair_rwlock *rw)
{
        if (!mutex_trylock(&rw->m)) return false;
        if (mutex_trylock(&rw->writer))
        {
                mutex_unlock(&rw->m);
                return true;
        }
        mutex_unlock(&rw->m);
        return false;
}

bool rwlock_wtrylockuntil(fair_rwlock *rw, unsigned int tEnd)
{
        if (!mutex_trylockuntil(&rw->m, tEnd)) return false;
        if (mutex_trylockuntil(&rw->writer, tEnd))
        {
                mutex_unlock(&rw->m);
                return true;
        }
        mutex_unlock(&rw->m);
        return false;
}

bool rwlock_rtrylock(fair_rwlock *rw)
{
        if (!mutex_trylock(&rw->m)) return false;
        if (atomic_xadd(&rw->read_count, 1) || mutex_trylock(&rw->writer))
        {
                mutex_unlock(&rw->m);
                return true;
        }
        mutex_unlock(&rw->m);
        return false;
}

bool rwlock_rtrylockuntil(fair_rwlock *rw, unsigned int tEnd)
{
        if (!mutex_trylockuntil(&rw->m, tEnd)) return false;
        if (atomic_xadd(&rw->read_count, 1) || mutex_trylockuntil(&rw->writer, tEnd))
        {
                mutex_unlock(&rw->m);
                return true;
        }
        mutex_unlock(&rw->m);
        return false;
}