Signal-Safe Locks

Signals are a feature of Unix-like operating systems. They interrupt a program and transfer control to a signal handler. Since the interruptions can in general come at any time, signals are very hard to use correctly. As a good rule of thumb, a signal handler should do as little as possible, and return control to the rest of the program as soon as possible.

Due to the difficulty of getting signal-using code to work correctly, most programs will either not use them, or be very careful about exactly when a certain signal is not blocked. (Through careful use of functions like sigsuspend(), pselect(), or epoll_pwait() etc. a program can delay signal handling to only occur at certain spots.) A typical pattern is to have a signal handler just write a byte into a pipe. The main polling event loop can then listen to the other end of that pipe and handle the signal at its leisure. This can be optimized a bit by using eventfd or signalfd in more recent Linux kernels, but the basic idea remains the same.

Sometimes, however, the above pattern is not available. The program may need to be reliably interrupted while it is doing arbitrary work. The problem here is that the code may own locks to important data structures when the signal handler is called. A classical example are the internal locks in the memory allocator. A signal handler is not allowed to use functions like malloc() or free() because the memory allocation data structures could be in an inconsistent state. Due to limitations on what it is safely possible to do within a signal handler, it can be very difficult to write them.

It would be very nice if we could use the "standard" method of mutual exclusion with signal handlers: locks. Unfortunately, you cannot use normal locking techniques. Lock implementations are usually not designed to be able to be interrupted, and then re-entered by the signal handler. (A recursive mutex design might still not be async-signal-safe.) Even if we solve that problem through careful design, another issue is what do we do when the program code owns a lock needed by the signal handler. If we allow recursive locking then that lock doesn't protect against the signal code "stomping" on the data structure. If we don't allow recursive locking, then the result will be a deadlock if the signal is triggered at an inopportune time.

However, if we could just get locks to magically work with signals, then writing correct signal-safe code would be easy. Fortunately, this isn't too hard. All we need to do is disable signals while we own a lock. To do that, we can use the pthread_sigmask function to disable them for our thread. An optimization is to notice that we only need to disable signals when the first lock in a set of nested locks is taken. The deeper, inner locks are already protected. Thus by keeping track of the lock-nesting depth, we can gain some efficiency.

Firstly, we will construct some signal-safe primitives. It is convenient to be able to locally-atomically increment or decrement a counter. Since we wont need to synchronize with other cpu's here, we don't need bus-locked instructions. However, we do need to enforce the compiler to generate a single increment or add. (An unoptimized load, add, store sequence would not be signal safe.) Fortunately, with inline asm these aren't too difficult:

Where in the above we construct a compile-barrier, cpu-atomic (but not smp-atomic) increment, decrement and decrement-and-test primitives.

If we disable signals, we need to record the locking depth, and the original disposition of the signals mask. We can store these in thread-local variables:

The code that handles the lock nesting, and turning on and off signal handling looks like:

The above code turns off all signals. In a real application, you probably only want to turn off the signal handler you are interested in.

Finally, the code to use locks in a signal-safe way becomes trivial with the above helper functions:

The above locking routines can safely be used in normal code, and in signals handlers. Mutual exclusion is preserved. What is even nicer, is that sometimes you just want to protect against concurrent modifications by a signal handler. In that case, you can just use bare calls to inc_siglock() and dec_siglock, no mutex is needed.

So why aren't all locks signal-safe? The problem is that the above is slow. Really slow. In well-designed programs, the degree of lock-nesting is low. This means that the above will call into the kernel every time it takes or releases a lock in order to change the signal mask. The whole point of using futexes within the Linux pthread mutex implementation is to avoid system calls. The above breaks that optimization by re-introducing extra system calls in the fast paths.

What we would like is the ability to again do fast user-space locking, but still have some degree of signal safety. Unfortunately, it is impossible to do this without modifying the signal handler. The problem is that a signal handler in general will run on the same stack frame as the original program. If both the signal handler, and original code are allowed to run "at the same time" via manual context-switching, they will stomp on each other. If they don't run pseudo-simultaneously, then deadlocks are inevitable.

So the obvious thing to do is use the sigaltstack() function to get the signal delivered to an alternate stack. The problem with that is that the new stack is used for all signals (if they decide to use it) as each signal doesn't get a separate alternate stack. Another option is to somehow at the start of the signal handler switch to a new stack... but that again is modifying the signal handler. If we are going to modify a signal handler anyway, lets keep the modification as simple as possible. There is actually no need for complex stack-managing code if we are smart.

The trick is to implement the algorithm used in the original method in a different way. What we want to do is to avoid calling a signal handler whilst the main program holds a lock. We did this by telling the kernel to turn off signal delivery. However, there is another way. If we modify the signal handler, we can notice that a signal was delivered - and then process that fact at some later time.

So at the start of our signal handler we check if locks are held. If so, we record that we need to call the signal handler at some later time. Later, once all locks are released, the unlock code can notice that signal handlers should be invoked. It can then call them manually. The trick is to remember that the number of signals invoked on an application (or thread) isn't the same thing as the number of times the signal handler is called. The kernel is free to merge a few signals and only call the signal handler once. Thus we are free to only record a fixed number of waiting signals. (Real-time signals are different... but again there is a limit to the number queueable.)

Thus we will need to store an array of waiting signals, one for each signal number. (This is technically incorrect for real-time signals, but storing a bounded number for each signal number isn't much of a change.) We can record the siginfo_t data passed to the signal, along with the handler. (Obviously, if in your code you only care about one signal, you won't need an array. If you always use the same signal handler, you won't need to record a function pointer to it. Also, if you don't care about the data in the siginfo_t parameter there is no point saving it, so some obvious optimizations are available.)

Since we don't need to block signals with this method, the inc_siglock() routine simplifies enormously. Conversely, the dec_siglock() routine needs to sometimes scan for extra work to do. We force the extra code out-of-line so that these two routines can be nicely inlined into their callers. They are only a few asm instructions long, after all.

The siglock(), sigunlock(), sigtrylock() routines don't change. However, we need to hook the signal handler in order to record the needed information in the sig_data array. Code which does this is:

The above is very efficient. Disabling signals compiles down into very few instructions. Re-enabling them is also very efficient as long as none are queued. No system calls are needed.

Using a simple benchmark where threads increment a counter protected by a signal-safe lock, the second method is 3-5 times faster than the first technique that uses system calls to block and unblock signals.

Note that a similar trick may be used in kernel-mode. Instead of using an expensive clear-interrupt operation, it is possible to modify locking primitives so that interrupts are handled when data structures are in a consistent state. (This comes at a cost of latency though.)

A simplified version of this technique is used within Lockless MPI to efficiently handle MPI sends to ranks that are computing away rather than calling MPI functions. We protect the message passing data-structures with the signal-safe mutual exclusion. This means that sends can complete in a timely manner even if the remote rank is not cooperating.

Comments

Suhaila said...

Gabriel May 31, 2011 at 7:18 am I can't believe how much yo saved my life and how slpime and stupid clearing this error is. I googled like crazy and many people have the same problems with no solutions whatsoever or offering and incredibly long and unworkable solution. After 3 days of googling I stumbled into your blog and these 2 lines of code got rid of the problem. Incredible.Thanks a lot for your help!

Daniel said...

HAHA! This is such an adorable post ;) Hmmm <a href="http://ujobwzet.com">imiange</a> if that was the way everything worked?? My brain would be sending signals to my tummy, thighs, and butt to get smaller!!!

Anon said...

The code discussed here seems to be fundamentally broken in multiple ways; do not try to implement it.

- Say, for example, thread 1 calls siglock(). It will do inc_sigLock and set your depth to 1, then mask the signals and lock the mutex.
- Next, say thread 2 calls siglock() while thread 1 is still locked. Thread 2 will call inc_sigLock, setting the depth to 2 and not masking any signals, then wait on the mutex.
- When thread 1 finishes holding the mutex, thread 2 will lock the mutex without being masked. If thread 2 were to receive a signal and enter a handler which uses the lock, it can attempt to reenter the mutex lock function. Boom.

Unfortunately, I cannot think of a way this code can be fixed to work, either. If you enter the mutex before masking signals, a signal handler can reenter the mutex lock. If you always mask the signals before locking the mutex, then two threads can store their previous mask and end up restoring the wrong mask on completion. If you store a depth counter as you do, threads have no way of knowing if the depth is their own lock depth or another thread's depth. If you store the thread ID instead of a depth, it will not be able to recursively enter the lock and will deadlock in a signal handler. I'm not convinced you could safely create a spin lock with a double compare and swap to set both a depth and owner thread atomically, either.

If I am misunderstanding this code, then my apologies, but I attempted to implement it and stress test it; it failed and broke.