Keyed Events

The primitives used for synchronization between threads in a Linux process are based on the Futex multiplexing syscall. This syscall allows wait-queues of sleeping threads to be manipulated from userspace. The interface lets userspace locks avoid the transition to kernel space on the fast path, where a Futex-based lock is uncontended. Microsoft Windows has a different design. The exposed interface is a rather complex API of pre-made library functions that implement critical sections, read-write locks and condition variables. However, underneath the surface all of these functions are implemented in terms of a lower-level construct.

The key (pun intended) to understanding synchronization between threads on Windows is the Keyed Event interface. This API is rather similar to the Futex API on Linux. However, it had a rather different origin, and as such is a bit more limited in scope.

Originally, synchronization between threads on Windows was done via rather heavy-weight methods that also work between different processes. These are slow because every use requires a userspace-kernelspace transition. To improve performance, Microsoft added the Critical Section API. Like pthreads mutexes on Linux, it has a fast path where no transition to kernelspace via a system call is needed.

However, there was a subtle issue with critical sections as they were originally implemented. A critical section is "uninitialized" when it is first created. To initialize one, memory needs to be allocated, and a HANDLE object needs to be associated with it. Unfortunately, these operations can fail: it is possible to run out of memory, and it is also possible for the Windows kernel to run out of HANDLE resources. To deal with this, entering a critical section could fail by raising an exception. For more obscure reasons, it was also possible for an exception to be raised when leaving a critical section.

This poses a problem. To be correct, a program then needs to wrap every single use of a critical section in a try/catch block. Unfortunately, even this isn't enough. After the exception is raised, the CRITICAL_SECTION object is in an undefined state and cannot be used again, so the program essentially cannot recover from the situation.

To fix this, Microsoft changed the design of Critical Sections so that they could never fail. This change was implemented via the Keyed Event API, which provides a fallback: a single HANDLE (named \KernelObjects\CritSecOutOfMemoryEvent) is used for every Critical Section that could not be fully initialized. Since the Keyed Event API can distinguish between the different objects using it, only one such handle (initialized at process startup) is required.

Keyed Events work like normal events, except that along with the HANDLE you pass in a separate pointer-sized object to use as a "key". Only objects waiting with a matching key will be woken by the corresponding NtReleaseKeyedEvent() call. This allows many synchronization primitives to multiplex via the same HANDLE. By using the address of the primitive as the key, you can be sure that no one else matches, and you get the behaviour you desire. The only limitation is that the lowest bit of the key must be unset. (This is similar to the limitation of Futexes where the key pointer must be 4-byte aligned.)

The Keyed Event API is undocumented, and as such may change at any future time. However, it is interesting to explore how it works. There are currently two functions that can be used to create a Keyed Event handle:


NTSTATUS NTAPI NtCreateKeyedEvent(OUT PHANDLE handle, IN ACCESS_MASK access,
	IN POBJECT_ATTRIBUTES attr, IN ULONG flags);

NTSTATUS NTAPI NtOpenKeyedEvent(OUT PHANDLE handle, IN ACCESS_MASK access,
	IN POBJECT_ATTRIBUTES attr);

The first of these creates a new handle to be used as a Keyed Event. The second allows an existing one to be "opened via name" through the attributes structure. According to the Nt Internals website, the flags parameter is reserved and should be set to zero.

Once a handle is obtained, there are two functions that use it:


NTSTATUS NTAPI NtWaitForKeyedEvent(IN HANDLE handle, IN PVOID key,
	IN BOOLEAN alertable, IN PLARGE_INTEGER mstimeout);

NTSTATUS NTAPI NtReleaseKeyedEvent(IN HANDLE handle, IN PVOID key,
	IN BOOLEAN alertable, IN PLARGE_INTEGER mstimeout);

The first function allows a thread to go to sleep, waiting on the event signaled by the second function. The major difference between this API and the Futex API is that there is no comparison value that is checked as the thread goes to sleep. Instead, if there is no waiting thread, the NtReleaseKeyedEvent() function blocks until one arrives. (The Futex wake function, in contrast, returns immediately.) Thus to use this API we need to keep track of how many waiters there are, to prevent the release function from hanging.

Since some sort of waiter-count is required, the state protected by this API ends up being roughly the size of an integer, just as with Futexes, even though the API itself places no limit on the state it can protect. To prevent races, you need to do a compare-exchange that simultaneously increases the wait count and checks that the condition you are going to sleep on hasn't occurred. You also need a compare-exchange immediately before waking someone up, to make sure that there is indeed a waiter to wake.
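
To make the rendezvous behaviour concrete, here is a minimal self-contained sketch (my own illustration, not from the original article; the waiter() and rendezvous names are arbitrary). The release call blocks until the waiter arrives, so this toy example needs no extra synchronization:


#include <windows.h>
#include <winternl.h>
#include <stdio.h>

/* These live in ntdll.dll, but have no public header */
NTSTATUS NTAPI NtCreateKeyedEvent(PHANDLE handle, ACCESS_MASK access,
	POBJECT_ATTRIBUTES attr, ULONG flags);
NTSTATUS NTAPI NtWaitForKeyedEvent(HANDLE handle, PVOID key,
	BOOLEAN alertable, PLARGE_INTEGER timeout);
NTSTATUS NTAPI NtReleaseKeyedEvent(HANDLE handle, PVOID key,
	BOOLEAN alertable, PLARGE_INTEGER timeout);

static HANDLE keh;
static int rendezvous;	/* Its address is the key: aligned, so the low bit is unset */

static DWORD WINAPI waiter(LPVOID arg)
{
	(void) arg;

	/* Sleeps until a release arrives on the same handle with the same key */
	NtWaitForKeyedEvent(keh, &rendezvous, 0, NULL);
	puts("woken");
	return 0;
}

int main(void)
{
	HANDLE t;

	NtCreateKeyedEvent(&keh, -1, NULL, 0);
	t = CreateThread(NULL, 0, waiter, NULL, 0, NULL);

	/* Blocks until a matching waiter exists, then wakes exactly one */
	NtReleaseKeyedEvent(keh, &rendezvous, 0, NULL);

	WaitForSingleObject(t, INFINITE);
	CloseHandle(t);
	return 0;
}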

Fast Mutexes

Using the above, it is possible to construct a very fast mutex implementation:


#pragma comment (linker, "/defaultlib:ntdll.lib")

#include <windows.h>
#include <winternl.h>
#include <intrin.h>
#include <errno.h>

/* Exported by ntdll.dll, but not declared in the public headers */
NTSTATUS NTAPI NtCreateKeyedEvent(PHANDLE handle, ACCESS_MASK access,
	POBJECT_ATTRIBUTES attr, ULONG flags);
NTSTATUS NTAPI NtWaitForKeyedEvent(HANDLE handle, PVOID key,
	BOOLEAN alertable, PLARGE_INTEGER timeout);
NTSTATUS NTAPI NtReleaseKeyedEvent(HANDLE handle, PVOID key,
	BOOLEAN alertable, PLARGE_INTEGER timeout);

typedef union fast_mutex fast_mutex;
union fast_mutex
{
	unsigned long waiters;
	unsigned char owned;
};

#define FAST_MUTEX_INITIALIZER {0}

HANDLE mhandle;

/* Allows up to 2^23-1 waiters */
#define FAST_M_WAKE		256
#define FAST_M_WAIT		512

void init_fast_mutex_handle(void)
{
	NtCreateKeyedEvent(&mhandle, -1, NULL, 0);
}

void fast_mutex_init(fast_mutex *m)
{
	m->waiters = 0;
}

void fast_mutex_lock(fast_mutex *m)
{
	/* Try to take lock if not owned */
	while (m->owned || _interlockedbittestandset(&m->waiters, 0))
	{
		unsigned long waiters = m->waiters | 1;
		
		/* Otherwise, say we are sleeping */
		if (_InterlockedCompareExchange(&m->waiters, waiters + FAST_M_WAIT, waiters) == waiters)
		{
			/* Sleep */
			NtWaitForKeyedEvent(mhandle, m, 0, NULL);
			
			/* No longer in the process of waking up */
			_InterlockedAdd(&m->waiters, -FAST_M_WAKE);
		}
	}
}

int fast_mutex_trylock(fast_mutex *m)
{
	if (!m->owned && !_interlockedbittestandset(&m->waiters, 0)) return 0;
	
	return EBUSY;
}

void fast_mutex_unlock(fast_mutex *m)
{
	/* Atomically unlock without a bus-locked instruction */
	_ReadWriteBarrier();
	m->owned = 0;
	_ReadWriteBarrier();
	
	/* While we need to wake someone up */
	while (1)
	{
		unsigned long waiters = m->waiters;
		
		if (waiters < FAST_M_WAIT) return;
		
		/* Has someone else taken the lock? */
		if (waiters & 1) return;
		
		/* Someone else is waking up */
		if (waiters & FAST_M_WAKE) return;
		
		/* Try to decrease wake count */
		if (_InterlockedCompareExchange(&m->waiters, waiters - FAST_M_WAIT + FAST_M_WAKE, waiters) == waiters)
		{
			/* Wake one up */
			NtReleaseKeyedEvent(mhandle, m, 0, NULL);
			return;
		}
		
		YieldProcessor();
	}
}

The above is quite similar to the fast mutex implementation using the Futex API. The differences are the explicit waiter-count, and the addition of a bit describing a wake-up in progress. The extra bit prevents a large amount of inefficiency due to the relatively slow Windows system calls. Note that to compile the above, ntdll.lib is needed in order to link to the low-level API exposed by the Windows kernel. This library file doesn't come with the standard Visual C++ install; it can be obtained from the free Windows Driver Kit (WDK) from Microsoft.
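
If you would rather not depend on ntdll.lib at all, another option is to resolve the functions at run time. The sketch below is an alternative of my own, not how the article's code was built; the pNt* pointers and the resolve_keyed_events() helper are hypothetical names. It works because ntdll.dll is mapped into every Win32 process:


#include <windows.h>

typedef LONG (NTAPI *NtCreateKeyedEvent_t)(PHANDLE, ACCESS_MASK, void *, ULONG);
typedef LONG (NTAPI *NtWaitForKeyedEvent_t)(HANDLE, PVOID, BOOLEAN, PLARGE_INTEGER);
typedef LONG (NTAPI *NtReleaseKeyedEvent_t)(HANDLE, PVOID, BOOLEAN, PLARGE_INTEGER);

static NtCreateKeyedEvent_t pNtCreateKeyedEvent;
static NtWaitForKeyedEvent_t pNtWaitForKeyedEvent;
static NtReleaseKeyedEvent_t pNtReleaseKeyedEvent;

/* Returns 0 on success, -1 if the exports are missing */
static int resolve_keyed_events(void)
{
	/* ntdll.dll is always loaded, so GetModuleHandle suffices */
	HMODULE ntdll = GetModuleHandleA("ntdll.dll");
	if (!ntdll) return -1;

	pNtCreateKeyedEvent = (NtCreateKeyedEvent_t) GetProcAddress(ntdll, "NtCreateKeyedEvent");
	pNtWaitForKeyedEvent = (NtWaitForKeyedEvent_t) GetProcAddress(ntdll, "NtWaitForKeyedEvent");
	pNtReleaseKeyedEvent = (NtReleaseKeyedEvent_t) GetProcAddress(ntdll, "NtReleaseKeyedEvent");

	if (!pNtCreateKeyedEvent || !pNtWaitForKeyedEvent || !pNtReleaseKeyedEvent) return -1;
	return 0;
}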

So how fast is the above compared to the normal Windows synchronization primitives? We run a simple benchmark on a quad-core machine where four threads simultaneously increment a global variable 2^24 times each, protected by a lock. The "standard" way of synchronizing is to use a CRITICAL_SECTION object; using one for this purpose, the benchmark takes 12.5 seconds. Critical sections are quite heavy-weight, so it is interesting to compare them to the slim reader-writer locks (the SRWLOCK API). Using exclusive acquires and releases, these take 1.65 seconds, which is quite a bit faster. Finally, the low-level code above using the Keyed Event API takes 1.25 seconds on average, about 25% faster than the SRW locks. Note that in addition, the above locks are only 32 bits in size, half the size of an SRWLOCK object on a 64-bit machine.
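
The benchmark program itself isn't included here, but the method described is easy to reproduce. Below is a sketch of my reconstruction, assuming the fast_mutex code above is in scope; GetTickCount() is just one plausible timer choice, not necessarily what was used for the timings quoted:


#include <stdio.h>

#define BENCH_THREADS	4
#define BENCH_ITERS	(1UL << 24)

static fast_mutex bench_lock;
static unsigned long bench_counter;

static DWORD WINAPI bench_worker(LPVOID arg)
{
	unsigned long i;
	(void) arg;

	for (i = 0; i < BENCH_ITERS; i++)
	{
		fast_mutex_lock(&bench_lock);
		bench_counter++;
		fast_mutex_unlock(&bench_lock);
	}
	return 0;
}

int main(void)
{
	HANDLE t[BENCH_THREADS];
	DWORD start;
	int i;

	init_fast_mutex_handle();
	fast_mutex_init(&bench_lock);

	start = GetTickCount();
	for (i = 0; i < BENCH_THREADS; i++)
	{
		t[i] = CreateThread(NULL, 0, bench_worker, NULL, 0, NULL);
	}
	WaitForMultipleObjects(BENCH_THREADS, t, TRUE, INFINITE);

	printf("%lu ms, counter = %lu\n", (unsigned long) (GetTickCount() - start), bench_counter);
	return 0;
}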

Fast Condition Variables

The above is interesting... but to really use a locking interface it is usually also necessary to have a way of signalling changes in state efficiently between threads. A nice API for this is that of condition variables. The in-built CONDITION_VARIABLE objects in the Microsoft Windows synchronization API are quite flexible, and work with both critical sections and slim reader-writer locks. We require something similar for the fast mutexes. Fortunately, doing this with the Keyed Event API isn't difficult:


typedef struct fast_cv fast_cv;
struct fast_cv
{
	fast_mutex lock;
	unsigned long wlist;
};

#define FAST_CV_INITIALIZER	{0, 0}

void fast_cv_init(fast_cv *cv)
{
	fast_mutex_init(&cv->lock);
	cv->wlist = 0;
}

void fast_cv_wait(fast_cv *cv, fast_mutex *m)
{
	fast_mutex_lock(&cv->lock);
	fast_mutex_unlock(m);
	cv->wlist++;
	fast_mutex_unlock(&cv->lock);
	
	NtWaitForKeyedEvent(mhandle, &cv->wlist, 0, NULL);

	fast_mutex_lock(m);
}

void fast_cv_signal(fast_cv *cv)
{
	fast_mutex_lock(&cv->lock);
	if (cv->wlist)
	{
		cv->wlist--;
		NtReleaseKeyedEvent(mhandle, &cv->wlist, 0, NULL);
	}
	fast_mutex_unlock(&cv->lock);
}

void fast_cv_broadcast(fast_cv *cv)
{
	fast_mutex_lock(&cv->lock);
	while (cv->wlist)
	{
		cv->wlist--;
		NtReleaseKeyedEvent(mhandle, &cv->wlist, 0, NULL);
	}
	fast_mutex_unlock(&cv->lock);
}

The above exposes a limitation of the Keyed Event API: there is no way to specify how many threads to wake in the release function. It will only wake a single thread per call. Thus the broadcast function is quite inefficient, because it needs to loop and call the API many times. A second limitation is that there is no "requeue" operation like in the Futex API. As each thread is woken by the broadcast function, it will immediately try to lock the mutex. This will typically fail because of all the other threads that want the lock, so the thread has to go straight back to sleep. The Futex API avoids this inefficiency by allowing the sleeping waiters in the condition variable to be requeued onto the wait-list for the mutex within the kernel.

Finally, the above obviously uses a lock for synchronization, whereas the Futex version of the condition variable API can be lock-free. Unfortunately, the missing multiple-wake and requeue functionality means that locking is required here. We need to prevent fast_cv_wait() from manipulating the waiter-count while the broadcast function is scanning its loop. If we don't, it is possible to get into a live-lock situation where each thread woken by the broadcast immediately tries to wait again, so the broadcast never finishes.
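
Usage mirrors pthreads condition variables: hold the mutex, and re-check the predicate in a loop around fast_cv_wait(). A small sketch of my own follows; item_push(), item_pop() and nitems are illustrative names standing in for whatever state a real program protects:


static fast_mutex qlock;
static fast_cv qcv;
static int nitems;

void queue_setup(void)
{
	fast_mutex_init(&qlock);
	fast_cv_init(&qcv);
}

/* Producer: publish an item, then wake one sleeping consumer */
void item_push(void)
{
	fast_mutex_lock(&qlock);
	nitems++;
	fast_mutex_unlock(&qlock);
	fast_cv_signal(&qcv);
}

/* Consumer: re-check the predicate after every wake, as usual
   for condition variables */
void item_pop(void)
{
	fast_mutex_lock(&qlock);
	while (!nitems) fast_cv_wait(&qcv, &qlock);
	nitems--;
	fast_mutex_unlock(&qlock);
}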

Summary

As can be seen, the Keyed Event API is quite powerful, allowing all of the Windows synchronization primitives to be constructed. It would be nice if it were fully documented, and perhaps further expanded with some of the power of the Futex API in the Linux kernel. It is also relatively easy to construct new synchronization functions that are more efficient than those already existing within Windows: by dropping debugging overhead and unused functionality, speed can be gained...

Comments

Borislav Trifonov said...
Is CloseHandle() necessary on the handle created by NtCreateKeyedEvent()? And any information about return values of the keyed event functions? I see for example NtCreateKeyedEvent() as having NTSTATUS return which is unsigned long.

By the way, FAST_MUTEX_INITIALIZER is defined but doesn't seem to be used anywhere in the code...
Borislav Trifonov said...
It seems _InterlockedAdd() requires Itanium; I'm guessing that on x86(-64) _InterlockedExchangeAdd() should work.

NtWaitForKeyedEvent() takes a timeout parameter. Might a timed lock operation be implementable using keyed events as well, similarly to the way WaitForSingleObject with a timeout is used in winpthreads.h? Nt Internals is not clear on the format of the timeout (milliseconds?) and whether it is relative or absolute.
sfuerst said...
You can call CloseHandle() on program exit, if you wish. However, the operating system should clean up after you, making this unnecessary.

Unfortunately, I don't have any information about the return values from the undocumented functions. :-( All I know is the code here does seem to work on my Vista 64bit machine.

FAST_MUTEX_INITIALIZER allows you to do something like:

fast_mutex m = FAST_MUTEX_INITIALIZER;

and not call fast_mutex_init() if you don't want to. (Just like PTHREAD_MUTEX_INITIALIZER)


Are you sure _InterlockedAdd() needs Itanium? It seems to compile fine here on normal x86. Replacing it with the exchange version shouldn't hurt though. :-)

Yes, I'm fairly sure a timed-wait version of the locking functions can be made by using the timeout parameter. I haven't investigated exactly how that works though...



Borislav Trifonov said...
_InterlockedAdd in Visual Studio 2010 is only supported on Itanium, it's even mentioned on MSDN. The x86/x64 compiler doesn't have an intrinsic to generate lock add anymore: http://connect.microsoft.com/VisualStudio/feedback/details/524100/interlockedadd-is-where


There is a bit more documentation for functions with a related interface, like NtWaitForSingleObject. Looks like the timeout parameter is in 100-nanosecond units, and is treated as absolute if the number is positive and relative if it is negative. My attempt at timed_lock (if you'll excuse the C++):

bool FastMutex::TimedLock(long msTimeout) // Return true if lock acquired, false if timeout
{
	if (Trylock()) return true;
	unsigned long long tNow(timeGetTime()), tEnd(tNow + msTimeout);
	while (_owned || _interlockedbittestandset(&_waiters, 0))
	{
		unsigned long waiters(_waiters | 1);
		if (_InterlockedCompareExchange(&_waiters, waiters + FAST_M_WAIT, waiters) == waiters)
		{
			LARGE_INTEGER diff;
			diff.QuadPart = -10000ll * (tEnd - tNow); // Timeout is in 100ns units; negative indicates relative instead of absolute
			long const ret(NtWaitForKeyedEvent(_handle, reinterpret_cast<void **>(this), 0, &diff));
			_InterlockedAdd(&_waiters, -FAST_M_WAKE);
			if (ret)
			{
				if (ret == STATUS_TIMEOUT) return false; // 0x102
				else throw runtime_error("Failed to wait for keyed event");
			}
		}
		tNow = timeGetTime();
		if (tNow > tEnd) return false;
	}
	return true;
}

By the way, do you think there would be any benefit trying to spin for a bit before going for the keyed event wait in the lock function in the article, similarly to the version in your FUTEX based mutex article?
sfuerst said...
Ouch... they must have removed it then. On VS 8 it worked. I wish Microsoft would support inline asm in 64 bit mode. It would make it possible to ignore stupidities like them removing useful intrinsics like that.

Their excuse is that it is too confusing! How ridiculous. As a low-level programmer, I know exactly which instruction I want the compiler to emit. Trying to "protect" me due to the ignorance of others is patronizing.

I highly recommend not using their compiler. gcc generates much better code in 64bit mode, supports C99 (which has existed for over a decade now), and has an extremely powerful interface for inline assembly. The difference for low-level use is like night and day.

Thanks for looking at the timeout parameter. How does the timeout change if you alter the clock? On Linux, you have the choice of using CLOCK_REALTIME or CLOCK_MONOTONIC. My guess is that windows always uses a monotonic clock... but I may be wrong.

When I tried a version with the spin-loop it was slower. This could be due to the differing speed of syscalls on Microsoft Windows versus Linux. It also might be the result of using a slightly different processor/RAM mix than the Linux box. You might like to benchmark it to see if the lack of spinning is also best for your particular case.
neill said...
To see why you should not use SwitchToThread you should run a low priority compute bound thread on the processors. You will get starved and your performance will be really bad.
lukcho said...
IMO one very major problem with those keyed events is that they require using their own blocking API, so one cannot use the multiplexing WaitForMultipleObjects to wait for many things at the same time.
Borislav Trifonov said...
lukcho, it's the same thing on Linux: you cannot wait on multiple futexes and have to use a different API.
cb said...
What about a spinlock instead for the lock protecting the cv waiter-count? Might be an improvement with the exception of cases where there's a broadcast to a significant number of waiters.
sfuerst said...
A spinlock makes sense when the hold-time is short, perhaps a few instructions long. In this case, the critical section contains kernel calls. (And in the case of fast_cv_broadcast(), perhaps many of them.) So we are better off if the waiting threads sleep instead of spinning for thousands of cycles wasting cpu.
MLM said...
Very interesting article.
I wonder, is it possible to lock() a mutex with some maximum timeout value?

The comment above from Borislav Trifonov seems to have the problem that it does not reduce the wait count: suppose only one thread is waiting on the mutex (with a timeout), and at the same moment the timeout elapses (the wait returns STATUS_TIMEOUT), the mutex is unlocked by its owner. The owner may observe waiters == FAST_M_WAIT and block forever in NtReleaseKeyedEvent(), because no one is actually waiting on the mutex.

It also has the problem that if the lock attempt fails after one loop iteration, the thread will wait ANOTHER full interval (potentially the timed lock blocks forever, if the attempt keeps failing after the thread is woken).

I would propose:
int fast_mutex_lock_timed(fast_mutex *m, int milliseconds)
{
	/* Timeout, absolute value, since we may loop */
	LARGE_INTEGER timeout;
	NtQuerySystemTime(&timeout);
	timeout.QuadPart += milliseconds * 10000ll; /* milliseconds to 100ns units */

	/* Try to take lock if not owned */
	while (m->owned || _interlockedbittestandset(&m->waiters, 0))
	{
		unsigned long waiters = m->waiters | 1;

		/* Otherwise, say we are sleeping */
		if (_InterlockedCompareExchange(&m->waiters, waiters + FAST_M_WAIT, waiters) == waiters) while (1) /* Now a loop */
		{
			/* Sleep */
			NTSTATUS code = NtWaitForKeyedEvent(mhandle, m, 0, &timeout);

			if (code == STATUS_TIMEOUT)
			{
				/* Reduce waiter count, because we woke up */
				waiters = m->waiters & ~FAST_M_WAKE;
				if (waiters >= FAST_M_WAIT && _InterlockedCompareExchange(&m->waiters, waiters - FAST_M_WAIT, waiters) == waiters)
				{
					/* Timeout elapsed, waiter count reduced successfully */
					return EBUSY;
				}

				/* Cannot reduce waiter count, we need to wait more to prevent unlock() from blocking */
			}
			else
			{
				/* No longer in the process of waking up */
				_InterlockedAdd(&m->waiters, -FAST_M_WAKE);
				break;
			}
		}
	}
	return 0;
}
Benjamin K said...
Some comments about memory access. Don't m->owned and m->waiters need to be declared volatile?

If not, the compiler may optimize accesses to these members and use CPU registers in some places instead of always accessing memory.

I also don't think that the following is enough to protect parallel access.

_ReadWriteBarrier();
m->owned = 0;
_ReadWriteBarrier();

What if another thread changes m->owned after the first _ReadWriteBarrier()?
JeffR said...
I tried to write my own simple benchmark to compare CRITICAL_SECTION to fast_mutex, on a 2 core Toshiba Windows Vista machine. fast_mutex is about twice as fast as CRITICAL_SECTION, not 10X as reported above. Is the benchmark described above available so I can verify that I am testing the same way you tested?

Jesse T said...
Important! I've discovered that the internals of NT Keyed Events may have changed during one of the Windows 10 updates, and it now seems to use the two least significant bits (bit 0 and bit 1) inside of NtReleaseKeyedEvent & NtWaitForKeyedEvent, so you must take care not to store information in those two bits. Before Windows 10 (and at the very least, the initial release of Windows 10), only bit 0 was reserved for internal use.
Imran Hameed said...
I think you can remove the mutex protecting the waiter count inside the condition variable:

struct fast_cv {
    std::atomic<uint32_t> waiters;
};

void
fast_cv_init(fast_cv &cv) {
    cv.store(0, std::memory_order_relaxed);
}

void
fast_cv_wait(fast_cv &cv, fast_mutex &m) {
    cv.wlist.fetch_add(1, std::memory_order_relaxed);
    fast_mutex_unlock(&m);
    NtWaitForKeyedEvent(mhandle, &cv.waiters, 0, NULL);
    fast_mutex_lock(&m);
}

void
fast_cv_signal(fast_cv &cv) {
    auto const snapshot = cv.exchange(0, std::memory_order_relaxed);
    if (snapshot > 0) {
        cv.fetch_add(snapshot - 1, std::memory_order_relaxed);
        NtReleaseKeyedEvent(mhandle, &cv.waiters, 0, NULL);
    }
}

void
fast_cv_broadcast(fast_cv &cv) {
    auto snapshot = cv.exchange(0, std::memory_order_relaxed);
    while (snapshot > 0) {
        NtReleaseKeyedEvent(mhandle, &cv.waiters, 0, NULL);
        --snapshot;
    }
}

(msvc2015/2017 seem to always generate `lock xadd` rather than `lock add` for `fetch_add`, even if the result isn't used, which is unfortunate.)
Imran Hameed said...
Oof, I'm apparently incapable of writing even small program fragments without making a bunch of mundane errors. `cv.store`, `cv.wlist.fetch_add`, `cv.exchange`, and `cv.fetch_add` should be replaced with `cv.waiters.whatever`... :-)
Imran Hameed said...
While slapping together a quick test of the above in relacy I noticed a more serious error in my previous comment; it's possible for concurrent invocations of my version of `fast_cv_signal` to result in missed wakes (signaling threads could interfere with each other in between the `exchange` and the `fetch_add`, resulting in fewer than min(waiting-threads, invocations-of-signal) wakes). But that's easy to fix with a compare-exchange loop (or a hypothetical saturating-fetch-sub). Another sketch follows:

void
fast_cv_signal(fast_cv &cv) {
    retry:
    auto const snapshot = cv.state.load(std::memory_order_relaxed);
    if (snapshot > 0) {
        auto expected = snapshot;
        auto const success = cv.state.compare_exchange_weak(expected, snapshot - 1, std::memory_order_relaxed);
        if (!success) goto retry;

        NtReleaseKeyedEvent(mhandle, &cv.waiters, FALSE, nullptr);
    }
}
Osorio said...
Hey all!

I'm missing why we need m->owned in the example. It's only ever checked and written to zero, wouldn't we need to set it as something other than zero when we actually acquire the lock?

Cheers!
V said...
@Osorio, look up how a union works: owned overlays the lowest byte of waiters, so the _interlockedbittestandset() on bit 0 of waiters is what sets it.
Amy said...
Note that it seems rare, but possible, to have NtWaitForKeyedEvent with a timeout return indicating that it timed out, while at the same time unblocking an NtReleaseKeyedEvent with the same key. I believe this is due to an unavoidable race between the signal and timeout in the kernel, as this is observed most frequently on a multiprocessor system. If you don't account for this and always assume that a timeout return from NWFKE means a waker wasn't unblocked, you'll get lost wakeups and stuck threads.
