Keyed Events
The primitives used for synchronization between threads in a Linux process are based on the Futex multiplexing syscall. This syscall allows wait-queues of sleeping threads to be manipulated from userspace. The interface efficiently allows userspace locking to avoid transitions to kernel space in the fast-path where locks based on Futexes are uncontended. Microsoft Windows has a different design. The exposed interface is a rather complex API involving pre-made library functions that implement critical sections, read-write locks and condition variables. However, underneath the surface all of these functions are implemented in terms of a lower-level construct.
The key (pun intended) to understanding synchronization between threads on Windows is the Keyed Event interface. This API is rather similar to the Futex API on Linux. However, it had a rather different origin, and as such is a bit more limited in scope.
Originally, synchronization on Windows between threads was done by rather heavy-weight methods that are also applicable between different processes. These are slow because a userspace-kernelspace transition is required for their use. To improve performance, Microsoft added the CriticalSection API. This, similar to mutexes on pthreads in Linux, has a fast-path where no transition to kernelspace via a system call is needed.
However, there was a subtle issue with critical sections as they were originally implemented. A critical section is "uninitialized" when it is first created. To initialize one, memory needs to be allocated, and a HANDLE object needed to be associated with it. Unfortunately these operations could fail. It is possible to run out of memory. It is also possible for the Windows Kernel to run out of HANDLE resources. To deal with this, critical sections could return via an exception when you tried to enter one. For more obscure reasons, it was also possible to raise an exception when leaving an critical section.
This poses a problem. To be correct, programs then need to wrap every single use of a critical section with a try/catch block. Unfortunately, this isn't enough. The problem is that after the exception is raised, the CRITICAL_SECTION object is in an undefined state, and cannot be used. This means that the program basically cannot recover from the situation.
To fix this, Microsoft changed the design of Critical Sections so that they could never fail. This change was implemented via the Keyed Event API. It allowed the back-up possibility of using a HANDLE (called \KernelObjects\CritSecOutOfMemoryEvent) for all non-initialized Critical Sections. The Keyed Event API can distinguish between different objects using it, so only one handle (initialized at process startup) is required.
Keyed Events work like normal events, except that along with the HANDLE you pass in a separate pointer-sized object to use as a "key". Only objects waiting with a matching key will be woken by the corresponding NtReleaseKeyedEvent() call. This allows many synchronization primitives to multiplex via the same HANDLE . By using the address of the primitive as the key, you can be sure that no one else matches, and you get the behaviour you desire. The only limitation is that the lowest bit of the key must be unset. (This is similar to the limitation of Futexes where the key pointer must be 4-byte aligned.)
The Keyed Event API is undocumented, and as such may change at any future time. However, it is interesting to explore how it works. There are currently two functions that can be used to create a Keyed Event handle:
NTSTATUS NTAPI NtCreateKeyedEvent(OUT PHANDLE handle, IN ACCESS_MASK access,
IN POBJECT_ATTRIBUTES attr, IN ULONG flags)
NTSTATUS NTAPI NtOpenKeyedEvent(OUT PHANDLE handle, IN ACCESS_MASK access,
IN POBJECT_ATTRIBUTES attr)
The first of these creates a new handle to be used as a Keyed Event. The second allows "opening via name" if the attributes structure is used. The flags parameter according to the Nt Internals website is reserved, and should be set to zero.
Once a handle is obtained, there are two functions that use them:
NTSTATUS NTAPI NtWaitForKeyedEvent(IN HANDLE handle, IN PVOID key,
IN BOOLEAN alertable, IN PLARGE_INTEGER mstimeout)
NTSTATUS NTAPI NtReleaseKeyedEvent(IN HANDLE handle, IN PVOID key,
IN BOOLEAN alertable, IN PLARGE_INTEGER mstimeout)
The first function allows a thread to go to sleep, waiting on the event signaled by the second function. The major different between this API and the Futex API is that there is no comparison value that is checked as the thread goes to sleep. Instead, the NtReleaseKeyedEvent() function blocks, waiting for a thread to wake if there is none. (Compared to the Futex wake function which will immediately return.) Thus to use this API we need to keep track of how many waiters there are to prevent the release function from hanging.
The limitation that some sort of waiter-count is required means that even though there is no specific limit to what state can be protected by this API, it will be limited to something roughly the size of an integer, just like Futexes. To prevent races, you need to do a compare-exchange to simultaneously increase the wait count and check that the condition you are going to sleep on hasn't occurred. You also need to do a compare-exchange immediately before waking someone up to make sure that there is indeed a waiter to wake.
Fast Mutexes
Using the above, it is possible to construct a very fast mutex implementation:
#pragma comment (linker, "/defaultlib:ntdll.lib")
typedef union fast_mutex fast_mutex;
union fast_mutex
{
unsigned long waiters;
unsigned char owned;
};
#define FAST_MUTEX_INITIALIZER {0}
HANDLE mhandle;
/* Allows up to 2^23-1 waiters */
#define FAST_M_WAKE 256
#define FAST_M_WAIT 512
void init_fast_mutex_handle(void)
{
NtCreateKeyedEvent(&mhandle, -1, NULL, 0);
}
void fast_mutex_init(fast_mutex *m)
{
m->waiters = 0;
}
void fast_mutex_lock(fast_mutex *m)
{
/* Try to take lock if not owned */
while (m->owned || _interlockedbittestandset(&m->waiters, 0))
{
unsigned long waiters = m->waiters | 1;
/* Otherwise, say we are sleeping */
if (_InterlockedCompareExchange(&m->waiters, waiters + FAST_M_WAIT, waiters) == waiters)
{
/* Sleep */
NtWaitForKeyedEvent(mhandle, m, 0, NULL);
/* No longer in the process of waking up */
_InterlockedAdd(&m->waiters, -FAST_M_WAKE);
}
}
}
int fast_mutex_trylock(fast_mutex *m)
{
if (!m->owned && !_interlockedbittestandset(&m->waiters, 0)) return 0;
return EBUSY;
}
void fast_mutex_unlock(fast_mutex *m)
{
/* Atomically unlock without a bus-locked instruction */
_ReadWriteBarrier();
m->owned = 0;
_ReadWriteBarrier();
/* While we need to wake someone up */
while (1)
{
unsigned long waiters = m->waiters;
if (waiters < FAST_M_WAIT) return;
/* Has someone else taken the lock? */
if (waiters & 1) return;
/* Someone else is waking up */
if (waiters & FAST_M_WAKE) return;
/* Try to decrease wake count */
if (_InterlockedCompareExchange(&m->waiters, waiters - FAST_M_WAIT + FAST_M_WAKE, waiters) == waiters)
{
/* Wake one up */
NtReleaseKeyedEvent(mhandle, m, 0, NULL);
return;
}
YieldProcessor();
}
}
The above is quite similar to the fast mutex implementation using the Futex API. The differences are the explicit waiter-count, and the addition of a bit describing a wake-up in progress. The extra bit prevents a large amount of inefficiency due to the relatively slow windows system calls. Note that to compile the above, ntdll.lib is needed in order to link to the low-level API exposed by the Windows kernel. This library file doesn't come with the standard Visual C++ install, and instead can be obtained from the free Windows Driver Development Kit from Microsoft.
So how fast is the above compared to the normal Windows synchronization primitives. We run a simple benchmark on a quad-core machine where four different threads simultaneously increment a global variable 224 times each, protected by a lock. The "standard" way of synchronizing is to use a CRITICAL_SECTION object, using one for this purpose takes 12.5 seconds to run the benchmark. Critical sections are quite heavy weight, so it is interesting to compare them to the slim reader-writer locks (SRWLOCK API). Using exclusive acquires and releases, they take 1.65 seconds, which is quite a bit faster. Finally, using the above low-level code using the Keyed Event API takes 1.25 seconds on average. This is about 25% faster. Note that in addition, the above locks are only 32 bits in size, which is half the size of a SRWLOCK object on a 64 bit machine.
Fast Condition Variables
The above is interesting... but to really use a locking interface it is usually also necessary to have a way of signalling changes in state efficiently between threads. A nice API for this is that of condition variables. The in-built CONDITION_VARIABLE objects in the Microsoft Windows synchronization API are quite flexible, and work with both critical sections and slim reader-writer locks. We require something similar for the fast mutexes. Fortunately, doing this with the Keyed Event API isn't difficult:
typedef struct fast_cv fast_cv;
struct fast_cv
{
fast_mutex lock;
unsigned long wlist;
};
#define FAST_CV_INITIALIZER {0, 0}
void fast_cv_init(fast_cv *cv)
{
fast_mutex_init(&cv->lock);
cv->wlist = 0;
}
void fast_cv_wait(fast_cv *cv, fast_mutex *m)
{
fast_mutex_lock(&cv->lock);
fast_mutex_unlock(m);
cv->wlist++;
fast_mutex_unlock(&cv->lock);
NtWaitForKeyedEvent(mhandle, &cv->wlist, 0, NULL);
fast_mutex_lock(m);
}
void fast_cv_signal(fast_cv *cv)
{
fast_mutex_lock(&cv->lock);
if (cv->wlist)
{
cv->wlist--;
NtReleaseKeyedEvent(mhandle, &cv->wlist, 0, NULL);
}
fast_mutex_unlock(&cv->lock);
}
void fast_cv_broadcast(fast_cv *cv)
{
fast_mutex_lock(&cv->lock);
while (cv->wlist)
{
cv->wlist--;
NtReleaseKeyedEvent(mhandle, &cv->wlist, 0, NULL);
}
fast_mutex_unlock(&cv->lock);
}
The above exposes a limitation of the Keyed Event API. There is no way to describe how many threads to awaken in the release function. It will only awake a single thread per call. Thus the broadcast function is quite inefficient because it needs to loop, and call the API many times. A second limitation is that there is no "requeue" function like in the Futex API. As each thread is awoken by the broadcast function, it will immediately try to lock the mutex. This will typically fail because of all the other threads that want the lock. Thus the thread will have to go straight back to sleep. The Futex API avoids this inefficiency by allowing the sleeping waiters in the condition variable to be requeued onto the wait-list for the mutex within the kernel.
Finally, the above obviously uses a lock for synchronization. The Futex version of the condition variable API can be lock-free. Unfortunately, the missing multiple-wake or multiple-requeue functionality means that locking is required. We need to prevent fast_cv_wait() from manipulating the waiter-count while the broadcast function is scanning its loop. If we don't it is possible to get into a live-lock situation where each thread awoken by the broadcast immediately tries to wait again, causing the broadcast to never finish.
Summary
As can be seen, the Keyed Event API is quite powerful, allowing all of the Windows synchronization primitives to be constructed. It would be nice it were fully documented, and perhaps further expanded with some of the power of the Futex API in the Linux kernel. As can be seen, it is relatively easy to construct new synchronization functions that are more efficient than those already existing within Windows. By dropping debugging overhead and unused functionality, speed can be gained...
|
Comments
Borislav Trifonov said...By the way, FAST_MUTEX_INITIALIZER is defined but doesn't seem to be used anywhere in the code...
NtWaitForKeyedEvent() takes a timeout parameter. Might a timed lock operation be implementable using keyed events as well, similarly to the way WaitForSingleObject with a timeout is used in winpthreads.h? Nt Internals is not clear on the format of the timeout (milliseconds?) and whether it is relative or absolute.
Unfortunately, I don't have any information about the return values from the undocumented functions. :-( All I know is the code here does seem to work on my Vista 64bit machine.
FAST_MUTEX_INITIALIZER allows you to do something like:
fast_mutex m = FAST_MUTEX_INITIALIZER;
and not call fast_mutex_init() if you don't want to. (Just like PHTREAD_MUTEX_INITIALIZER)
Are you sure _InterlockedAdd() needs Itanium? It seems to compile fine here on normal x86. Replacing it with the exchange version shouldn't hurt though. :-)
Yes, I'm fairly sure a timed-wait version of the locking functions can be made by using the timeout parameter. I haven't investigated exactly how that works though...
There is a bit more documentation for functions with a related interface like NtWaitForSingleObject. Looks like the timeout parameter is in units of 100s of nanoseconds, and considered absolute if the number is positive, and relative if the number is negative. My attempt at timed_lock (if you'll excuse the C++):
bool FastMutex::TimedLock(long msTimeout) // Return true if lock acquired, false if timeout
{
if (Trylock()) return true;
unsigned long long tNow(timeGetTime()), tEnd(tNow + msTimeout);
while (_owned || _interlockedbittestandset(&_waiters, 0))
{
unsigned long waiters(_waiters | 1);
if (_InterlockedCompareExchange(&_waiters, waiters + FAST_M_WAIT, waiters) == waiters)
{
long long diff(-10000ll * (tEnd - tNow)); // The timeout value is in 100s of ns; negative value indicates relative instead of absolute
long const ret(NtWaitForKeyedEvent(_handle, reinterpret_cast<void **>(this), 0, &diff));
_InterlockedAdd(&_waiters, -FAST_M_WAKE);
if (ret)
{
if (ret == STATUS_TIMEOUT) return false; // 0x102
else throw runtime_error("Failed to wait for keyed event");
}
}
tNow = timeGetTime();
if (tNow > tEnd) return false;
}
return true;
}
By the way, do you think there would be any benefit trying to spin for a bit before going for the keyed event wait in the lock function in the article, similarly to the version in your FUTEX based mutex article?
Their excuse is that it is too confusing! How ridiculous. As a low-level programmer, I know exactly which instruction I want the compiler to emit. Trying to "protect" me due to the ignorance of others is patronizing.
I highly recommend not using their compiler. gcc generates much better code in 64bit mode, supports C99 (which has existed for over a decade now), and has an extremely powerful interface for inline assembly. The difference for low-level use is like night and day.
Thanks for looking at the timeout parameter. How does the timeout change if you alter the clock? On Linux, you have the choice of using CLOCK_REALTIME or CLOCK_MONOTONIC. My guess is that windows always uses a monotonic clock... but I may be wrong.
When I tried a version with the spin-loop it was slower. This could be due to the differing speed of syscalls on Microsoft Windows verses Linux. It also might be the result of using a slightly different processor/ram mix than the Linux box. You might like to benchmark it to see if the lack of spinning is also best for your particular case.
I wonder, is it possible that you lock() a mutex with some maximum timeout value?
The comment above from Borislav Trifonov seems to have the problem that it does not reduce the wait count: suppose a case where only one thread is waiting on the mutex (with a timeout), and at the same time the timeout elapses (the wait returns with NTSTATUS_TIMEOUT), the mutex is unlocked by it's owner. The owner may observe waiters == FAST_M_WAIT and block forever in NtReleaseKeyedEvent(), because no-one is actually waiting on the mutex.
Also, it seems to have the problem if try_lock() fails after one loop iteration, he will wait ANOTHER full interval (potentially his timedlock blocks forever, if try_lock() keeps failing after the thread is woken)
I would propose:
int fast_mutex_lock_timed(fast_mutex *m, int milliseconds)
{
/* Timeout, absolute value, since we may loop */
LARGE_INTEGER timeout;
NtQuerySystemTime(&timeout);
timeout.QuadPart += ms * 10000ll; /* milliseconds to 100 * nanoseconds */
/* Try to take lock if not owned */
while (m->owned || _interlockedbittestandset(&m->waiters, 0))
{
unsigned long waiters = m->waiters | 1;
/* Otherwise, say we are sleeping */
if (_InterlockedCompareExchange(&m->waiters, waiters + FAST_M_WAIT, waiters) == waiters) while(1) /* Now a loop */
{
/* Sleep */
NTSTATUS code = NtWaitForKeyedEvent(mhandle, m, 0, &timeout);
if(code == NTSTATUS_TIMEOUT)
{
/* Reduce waiter count, because we woke up */
waiters = m->waiters & ~FAST_M_WAKE;
if(waiters >= FAST_M_WAIT && _InterlockedCompareExchange(&m->waiters, waiters - FAST_M_WAIT, waiters) == waiters)
{
/* Timeout elapsed, waiter count reduced successfully */
return EBUSY;
}
/* Cannot reduce waiter count, we need to wait more to prevent unlock() from blocking*/
}
else
{
/* No longer in the process of waking up */
_InterlockedAdd(&m->waiters, -FAST_M_WAKE);
break;
}
}
}
return 0;
}
If not, the compiler may optimize the access to this members and use cpu registers in some places instead always accessing memory.
I also don't think that following is enough to protect parallel access.
_ReadWriteBarrier();
m->owned = 0;
_ReadWriteBarrier();
What if another thread is changing m->owned after first _ReadWriteBarrier() ?
struct fast_cv {
std::atomic<uint32_t> waiters;
};
void
fast_cv_init(fast_cv &cv) {
cv.store(0, std::memory_order_relaxed);
}
void
fast_cv_wait(fast_cv &cv, fast_mutex &m) {
cv.wlist.fetch_add(1, std::memory_order_relaxed);
fast_mutex_unlock(&m);
NtWaitForKeyedEvent(mhandle, &cv.waiters, 0, NULL);
fast_mutex_lock(&m);
}
void
fast_cv_signal(fast_cv &cv) {
auto const snapshot = cv.exchange(0, std::memory_order_relaxed);
if (snapshot > 0) {
cv.fetch_add(snapshot - 1, std::memory_order_relaxed);
NtReleaseKeyedEvent(mhandle, &cv.waiters, 0, NULL);
}
}
void
fast_cv_broadcast(fast_cv &cv) {
auto snapshot = cv.exchange(0, std::memory_order_relaxed);
while (snapshot > 0) {
NtReleaseKeyedEvent(mhandle, &cv.waiters, 0, NULL);
--snapshot;
}
}
(msvc2015/2017 seem to always generate `lock xadd` rather than `lock add` for `fetch_add`, even if the result isn't used, which is unfortunate.)
void
fast_cv_signal(fast_cv &cv) {
retry:
auto const snapshot = cv.state.load(std::memory_order_relaxed);
if (snapshot > 0) {
auto expected = snapshot;
auto const success = cv.state.compare_exchange_weak(expected, snapshot - 1, std::memory_order_relaxed);
if (!success) goto retry;
NtReleaseKeyedEvent(mhandle, &cv.waiters, FALSE, nullptr);
}
}
I'm missing why we need m->owned in the example. It's only ever checked and written to zero, wouldn't we need to set it as something other than zero when we actually acquire the lock?
Cheers!