Signal-Safe Locks

Signals are a feature of Unix-like operating systems. They interrupt a program and transfer control to a signal handler. Since these interruptions can in general arrive at any time, signals are very hard to use correctly. As a good rule of thumb, a signal handler should do as little as possible, and return control to the rest of the program as soon as possible. Due to the difficulty of getting signal-using code right, most programs either avoid signals entirely, or are very careful about exactly when a given signal is unblocked (through careful use of functions like sigprocmask() and pthread_sigmask()).

Sometimes, however, that pattern is not available. The program may need to be reliably interrupted while it is doing arbitrary work. The problem here is that the code may own locks on important data structures when the signal handler is called. A classical example is the internal locks in the memory allocator: a signal handler is not allowed to use functions like malloc() and free(), because the interrupted code may be inside the allocator with those locks held.

It would be very nice if we could use the "standard" method of mutual exclusion with signal handlers: locks. Unfortunately, you cannot use normal locking techniques. Lock implementations are usually not designed to be interrupted and then re-entered by the signal handler. (Even a recursive mutex design might not be async-signal-safe.) Even if we solve that problem through careful design, another issue is what to do when the program code owns a lock needed by the signal handler. If we allow recursive locking, then that lock doesn't protect against the signal code "stomping" on the data structure. If we don't allow recursive locking, the result is a deadlock whenever the signal is triggered at an inopportune time.

However, if we could just get locks to magically work with signals, then writing correct signal-safe code would be easy. Fortunately, this isn't too hard. All we need to do is disable signals while we own a lock. To do that, we can use the sigprocmask() function (or pthread_sigmask() in multithreaded code) to change the signal mask.

Firstly, we will construct some signal-safe primitives. It is convenient to be able to locally-atomically increment or decrement a counter. Since we won't need to synchronize with other cpus here, we don't need bus-locked instructions. However, we do need to force the compiler to generate a single increment or decrement instruction. (An unoptimized load, add, store sequence would not be signal-safe.) Fortunately, with inline asm these aren't too difficult.
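A minimal sketch of such primitives, assuming x86 and gcc-style inline asm (the names local_inc, local_dec and local_dec_test are illustrative):

static inline void local_inc(unsigned *x)
{
    /* A single memory-destination increment: cannot be torn by a
       signal on the same cpu.  "memory" makes it a compiler barrier. */
    asm volatile ("incl %0" : "+m" (*x) : : "memory");
}

static inline void local_dec(unsigned *x)
{
    asm volatile ("decl %0" : "+m" (*x) : : "memory");
}

static inline int local_dec_test(unsigned *x)
{
    unsigned char z;

    /* Decrement, and capture whether the result hit zero. */
    asm volatile ("decl %0; setz %1" : "+m" (*x), "=q" (z) : : "memory");
    return z;
}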
The above gives us compiler-barrier, cpu-atomic (but not smp-atomic) increment, decrement and decrement-and-test primitives. When we disable signals, we need to record the locking depth and the original signal mask. We can store these in thread-local variables.
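For example (a sketch; the names lock_depth and saved_mask are assumptions):

#include <signal.h>

/* Per-thread lock nesting depth, plus the signal mask that was in
   force when the outermost lock was taken. */
static __thread unsigned lock_depth;
static __thread sigset_t saved_mask;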
The code that handles the lock nesting, and turns signal delivery on and off, then follows.
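Continuing the sketch, using pthread_sigmask() to save and restore the mask (helper names are assumptions):

#include <pthread.h>
#include <signal.h>

static void disable_signals(void)
{
    if (!lock_depth)
    {
        sigset_t all;

        /* Outermost lock: block everything, saving the old mask. */
        sigfillset(&all);
        pthread_sigmask(SIG_BLOCK, &all, &saved_mask);
    }

    local_inc(&lock_depth);
}

static void enable_signals(void)
{
    /* Outermost unlock: restore the original signal mask. */
    if (local_dec_test(&lock_depth))
        pthread_sigmask(SIG_SETMASK, &saved_mask, NULL);
}

The mask is saved only when the first lock is taken and restored only when the last one is released, so nested locks cost no extra system calls beyond the outermost pair.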
The above code turns off all signals. In a real application, you probably want to block only the signals you are actually interested in. Finally, with these helper functions, the code to use locks in a signal-safe way becomes trivial.
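A sketch of the wrappers, pairing each mutex operation with the signal-mask management:

static void siglock(pthread_mutex_t *m)
{
    /* Block signals first, so a handler can never interrupt the
       critical section on this thread. */
    disable_signals();
    pthread_mutex_lock(m);
}

static void sigunlock(pthread_mutex_t *m)
{
    pthread_mutex_unlock(m);
    enable_signals();
}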
The above locking routines can safely be used in normal code and in signal handlers, and mutual exclusion is preserved. What is even nicer is that sometimes you just want to protect against concurrent modifications by a signal handler on the same thread. In that case, you can use bare calls to the signal-disabling and signal-enabling helpers, without any mutex at all.

So why aren't all locks signal-safe? The problem is that the above is slow. Really slow. In well-designed programs the degree of lock nesting is low. This means that the above will call into the kernel nearly every time it takes or releases a lock, in order to change the signal mask. The whole point of using futexes within the Linux pthread mutex implementation is to avoid system calls; the above breaks that optimization by re-introducing system calls into the fast paths.

What we would like is the ability to do fast user-space locking again, yet still have some degree of signal safety. Unfortunately, it is impossible to do this without modifying the signal handler. The problem is that a signal handler in general runs on the same stack as the interrupted code. If both the signal handler and the original code are allowed to run "at the same time" via manual context-switching, they will stomp on each other. If they don't run pseudo-simultaneously, then deadlocks are inevitable. So the obvious thing to do is to modify the signal handler so that its work can be deferred.

The trick is to implement the algorithm used in the original method in a different way. What we want is to avoid calling a signal handler whilst the main program holds a lock. We previously did this by telling the kernel to turn off signal delivery. However, there is another way. If we modify the signal handler, we can notice that a signal was delivered, and then process that fact at some later time. So at the start of our signal handler we check whether any locks are held. If so, we record that we need to call the handler later. Once all locks are released, the unlock code can notice that signal handlers are waiting, and call them manually.

The key observation is that the number of signals sent to an application (or thread) isn't the same as the number of times the signal handler is called. The kernel is free to merge several signals and call the handler only once. Thus we are free to record only a fixed number of waiting signals. (Real-time signals are different... but again there is a limit to how many can be queued.) We will therefore store an array of waiting signals, one slot per signal number. (This is technically incorrect for real-time signals, but storing a bounded number per signal number isn't much of a change.) We can record the deferred signals in thread-local state.
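A sketch of the bookkeeping, with one slot per signal number (names are assumptions, and a single user handler, real_handler, is shown for simplicity):

#include <signal.h>

/* Per-thread deferred-signal state: which signals are waiting, and
   the siginfo_t each one arrived with.  NSIG is from <signal.h>. */
static __thread sig_atomic_t sig_waiting[NSIG];
static __thread siginfo_t sig_info[NSIG];
static __thread sig_atomic_t sig_deferred;

/* The handler the application actually wanted installed. */
static void (*real_handler)(int, siginfo_t *, void *);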
Since we don't need to block signals with this method, the lock and unlock helpers no longer touch the signal mask at all. The unlock path simply checks, once the last lock is released, whether any deferred handlers need to run.
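Continuing the sketch, the fast-path helpers might look like:

static void disable_signals2(void)
{
    /* Just bump the per-thread depth: no system call needed. */
    local_inc(&lock_depth);
}

static void enable_signals2(void)
{
    if (local_dec_test(&lock_depth) && sig_deferred)
    {
        int i;

        sig_deferred = 0;

        /* All locks dropped: run the handlers we postponed.  No saved
           ucontext is available for deferred delivery, so pass NULL. */
        for (i = 1; i < NSIG; i++)
        {
            if (sig_waiting[i])
            {
                sig_waiting[i] = 0;
                real_handler(i, &sig_info[i], NULL);
            }
        }
    }
}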
The signal handler itself is replaced by a wrapper that checks the lock depth: if a lock is held it records the signal for later, otherwise it invokes the real handler immediately.
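A sketch of the wrapper, which would be installed via sigaction() with SA_SIGINFO in place of the user's handler:

static void handler_wrapper(int sig, siginfo_t *si, void *ctx)
{
    if (lock_depth)
    {
        /* A lock is held: record the signal and return at once.
           Duplicate signals merge into one slot, as the kernel does. */
        sig_info[sig] = *si;
        sig_waiting[sig] = 1;
        sig_deferred = 1;
        return;
    }

    /* No locks held: safe to run the real handler immediately. */
    real_handler(sig, si, ctx);
}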
The above is very efficient. Disabling signals compiles down to very few instructions. Re-enabling them is also very cheap as long as none are queued. No system calls are needed. In a simple benchmark where threads increment a counter protected by a signal-safe lock, this second method is 3-5 times faster than the first technique, which uses system calls to block and unblock signals.

Note that a similar trick may be used in kernel mode. Instead of using an expensive clear-interrupt operation, it is possible to modify the locking primitives so that interrupts are handled only when data structures are in a consistent state. (This comes at a cost in latency, though.)

A simplified version of this technique is used within Lockless MPI to efficiently handle MPI sends to ranks that are busy computing rather than calling MPI functions. We protect the message-passing data structures with signal-safe mutual exclusion. This means that sends can complete in a timely manner even if the remote rank is not cooperating.
Comments
Suhaila said...
- Say, for example, thread 1 calls siglock(). It will do inc_sigLock and set your depth to 1, then mask the signals and lock the mutex.
- Next, say thread 2 calls siglock() while thread 1 is still locked. Thread 2 will call inc_sigLock, setting the depth to 2 and not masking any signals, then wait on the mutex.
- When thread 1 finishes holding the mutex, thread 2 will lock the mutex without being masked. If thread 2 were to receive a signal and enter a handler which uses the lock, it can attempt to reenter the mutex lock function. Boom.
Unfortunately, I cannot think of a way this code can be fixed to work, either. If you enter the mutex before masking signals, a signal handler can reenter the mutex lock. If you always mask the signals before locking the mutex, then two threads can store their previous mask and end up restoring the wrong mask on completion. If you store a depth counter as you do, threads have no way of knowing if the depth is their own lock depth or another thread's depth. If you store the thread ID instead of a depth, it will not be able to recursively enter the lock and will deadlock in a signal handler. I'm not convinced you could safely create a spin lock with a double compare and swap to set both a depth and owner thread atomically, either.
If I am misunderstanding this code, then my apologies, but I attempted to implement it and stress test it; it failed and broke.