Lockless Inc

Obscure Synchronization Primitives

Blocking synchronization primitives on Linux are built using the Futex system call. Futexes were originally designed to allow user-space fast-paths in the creation of mutexes. Thus the kernel doesn't need to be notified if there is no contention. Only if a thread needs to block, does a system call need to be executed. Later, additions to the Futex system call made it a more powerful beast that allowed the construction of other primitives such as condition variables and read-write locks.

Every other synchronization primitive can be built from mutexes and condition variables, so those two are technically all that we require. However, in the search for performance, sometimes it is necessary to investigate how other primitives that might better-fit a given task may be created. Another possibility where they are important, is in porting software between operating systems. The Microsoft Windows threading system is very different from the Linux one. Condition variables are a new addition there, so most older codes use other techniques that have no direct correspondence to pthreads on Linux.

Some synchronization primitives are easier to use than others. Fortunately for Linux programmers, mutexes and condition variables are rather easy to use correctly. The following primitives tend not to be so forgiving. However, if you have source code that already uses similar methods on other operating systems and you need to port it to Linux, then this article may be useful.

The first thing we will need is some atomic intrinsics:


/* Compile Barrier */
#define barrier() asm volatile("": : :"memory")

/* Compare Exchange */
#define cmpxchg(P, O, N) __sync_val_compare_and_swap((P), (O), (N))

/* Atomic add, returning the new value after the addition */
#define atomic_add(P, V) __sync_add_and_fetch((P), (V))

/* Atomic set and clear a bit, given by position */
#define atomic_set_bit(P, V) __sync_or_and_fetch((P), 1<<(V))
#define atomic_clear_bit(P, V) __sync_and_and_fetch((P), ~(1<<(V)))

We also need to use the xchg instruction. Since this isn't quite as portable, pure assembly is used:


static inline unsigned xchg_8(void *ptr, unsigned char x)
{
	__asm__ __volatile__("xchgb %0,%1"
				:"=r" ((unsigned char) x)
				:"m" (*(volatile unsigned char *)ptr), "0" (x)
				:"memory");

	return x;
}

static inline void *xchg_64(void **ptr, void *x)
{
	__asm__ __volatile__("xchgq %0,%1"
				:"=r" ((void *) x)
				:"m" (*(volatile void **)ptr), "0" (x)
				:"memory");

	return x;
}

Finally, in order to block on a kernel wait-list, we need to use the Futex system call, which unfortunately isn't exposed by linux/futex.h.


#include <linux/futex.h>
#include <stdio.h>
#include <stdint.h>
#include <limits.h>

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/time.h>
static long sys_futex(void *addr1, int op, int val1, struct timespec *timeout, void *addr2, int val3)
{
	return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3);
}

Single Producer, Multiple Consumer, Single Item Queue

A useful primitive that allows communication between threads is some form of queue. The simplest possible queue only allows one item to be stored within itself. Constructing one with normal primitives isn't hard. You use a mutex to protect the internal state of the object, and a condition variable to allow the consumers to wait for the producer. The problem here is that by default, mutexes and condition variables are quite large and heavy-weight objects. We can do better using Futexes directly.

Such an object has three different states. The first state is "empty", with nothing stored. We can represent this as a NULL pointer. The second state is "full", where a pointer to the item enqueued is stored within the object. Finally, we need a third state "empty but contended" to denote when we may have consumers waiting for a producer on a Futex wait-list within the kernel.

The problem here is what to choose for the value of the third state. For 32 bit systems, the answer is obvious. We can use the bit pattern 0x00000001, which will never be a pointer to a real item. However, 64 bit systems aren't so simple. The difficulty is that Futexes only deal with 32 bit quantities. It is perfectly possible to have a real pointer to an item with the last 32 bits in that particular pattern. Fortunately, this doesn't happen very often. Only pointers to unaligned objects, or to raw bytes or characters can have this occur. If we restrict the use of the SPMCSI queue to only aligned pointers, it isn't that much of a restriction in practice.

Using a struct to allow a minimal amount of data hiding, we can define the SPMCSI object to be:


/*
 * Single producer, multiple consumer, single item.
 * Assumes that no pointer ends in 0x00000001, which is always true
 * on 32 bit, and true on 64 bit with even or greater alignment.
 */
typedef struct spmcsi spmcsi;

struct spmcsi
{
	void *ptr;
}; 

A function to produce to such an object needs to check if the state is empty. If it is, a simple compare and swap will enqueue the object in a wait-free manner. Otherwise, we need to wake up possible waiters as well.


void spmcsi_produce(spmcsi *s, void *ptr)
{
	/* No one waiting - try to produce */
	if (!s->ptr && !cmpxchg(&s->ptr, NULL, ptr)) return;
	
	/* Make sure data pointed to by ptr is valid */
	barrier();
	
	/* Store the value, and turn off wait-state */
	s->ptr = ptr;
		
	/* Wake up someone, if they exist */
	sys_futex(s, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

Finally, the most complex function is that for the consumers. They need to try to remove an item if one exists in a wait-free manner for performance. They also need to transition to the contended state if no item exists. The trick here is making sure that the contended state is maintained even when we finally get the enqueued item within the sleeping slow-path. We do this because we don't know if there are any other consumers waiting within the kernel. We transition to a no-waiter state only if the producer tries to wake someone up and fails.


void *spmcsi_consume(spmcsi *s)
{
	void *ptr = s->ptr;
	
	/* Something there, try to take it */
	if (((uintptr_t) ptr > 1) && (cmpxchg(&s->ptr, ptr, NULL) == ptr)) return ptr;
	
	/* A sleeping loop slow-path */
	while (1)
	{
		/* Announce intention to wait, plus grab current data */
		ptr = xchg_64(&s->ptr, (void *) 1);
			
		/*
		 * Successfully got data, leave wait state set as
		 * more waiters may exist in this loop.
		 */
		if ((uintptr_t) ptr > 1) return ptr;
		
		/* Wait, so long as bottom 32 bits of ptr are still 0x00000001 */
		sys_futex(s, FUTEX_WAIT_PRIVATE, 1, NULL, NULL, 0);
	}
}

Events

Another very simple synchronization primitive is an "event". An event can either be set or unset. If an event is unset, then a waiter will wait until it is set. If you set an event, all such waiters will be awoken. This primitive is used quite often in the Microsoft Windows world. It allows one thread to notify others waiting on a particular result. The corresponding pattern in the Linux programming world would involve a mutex and condition variable. However, we again can simplify the problem when a pure Futex is used.

We define an event to be a 32 bit quantity manipulated by the Futex system call.


typedef union event event;
union event
{
	/* Futexes deal with 32bit quantities */
	unsigned e;
	
	struct
	{
		unsigned char contended;
		unsigned char set;
	} b;
	
};

Reading whether or not the event is set or unset is easy:


int event_state(event *e)
{
	return e->b.set;
}

Transitioning to the unset state is also quite simple. We just store a zero in the requisite variable.


void event_reset(event *e)
{
	/* Make sure things don't leak across this call */
	barrier();
	
	e->b.set = 0;
	
	/* Make sure things don't leak across this call */
	barrier();
}

Setting the event it more complex. We need to have a fast-path that checks to see if anyone is waiting on us or not. If not, we can bail out early. Otherwise we need to wake the waiters. One trick to notice is the use of the xchg instruction in the place of a memory fence and an atomic store. According to Intel, using an xchg is faster. (It's also one asm instruction instead of two.)


void event_set(event *e)
{
	/* Set event, plus be an efficient memory barrier */
	xchg_8(&e->b.set, 1);
	
	/* Fast path - no waiters */
	if (!e->b.contended) return;
	
	/* No more waiters */
	e->b.contended = 0;
		
	/* Wake everyone */
	sys_futex(e, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
}

Finally, we need to wait on the event. Doing this is simpler than the SPMCSI queue wait function because all waiters are woken rather than just one at a time. Thus we don't need to be quite so tricky with setting the contended state and checking when to bail out of the wait-loop.


void event_wait(event *e)
{
	unsigned v;
	
	while (!e->b.set)
	{
		/* Announce intention to wait */
		e->b.contended = 1;
		
		/* Wait, as long as not set, and still marked as contended */
		sys_futex(e, FUTEX_WAIT_PRIVATE, 1, NULL, NULL, 0);
	}
}

Countdown Events

A "countdown" event is an event that is only triggered after a certain number of signals is sent to it. One possible use for such a thing is barrier-like semantics where one thread is waiting on the results from multiple others.

We will use the bottom bit of the 32 bit Futex quantity to denote whether there are any waiters or not. The upper 31 bits can then be used as a counter.


typedef struct countdown_event countdown_event;
struct countdown_event
{
	unsigned e;
};

void ce_init(countdown_event *e, unsigned val)
{
	/* Check maximum range */
	if (val >= 0x80000000) errx(1, "countdown too large\n");
	
	/* Make the following read of e->e atomic */
	barrier();
	
	/* Convert to value to store in counter */
	val += val + (e->e & 1);
	
	/* Make the compiler use a single write to store into e->e */
	barrier();
	
	e->e = val;
}

By using an atomic add of -2, we can decrement the counter. Once we reach zero or one, then we are done, and have to wake up any waiters.


void ce_decrement(countdown_event *e)
{
	unsigned v = atomic_add(&e->e, -2);
	
	/* Not done yet? */
	if (v >= 2) return;
	
	/* No waiters? */
	if (!v) return;
	
	/* Check for underflow */
	if (v >= 0xfffffffe) errx(1, "counter underflow\n");
	
	/* No more waiters, clear lsb */
	atomic_clear_bit(&e->e, 0);
		
	/* Wake everyone */
	sys_futex(e, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
}

Finally, we need the wait function which will block until the counter reaches zero (actually one, since the lowest bit will be set if we sleep.)


void ce_wait(countdown_event *e)
{
	unsigned v = e->e;

	while (v >= 2)
	{
		/* Announce intention to wait by setting lsb */
		atomic_set_bit(&e->e, 0);
		v |= 1;
		
		/* Wait */
		sys_futex(e, FUTEX_WAIT_PRIVATE, v, NULL, NULL, 0);
		
		v = e->e;
	}
}

Unfortunately this API is rather hard to use correctly. Note how there is a race between ce_wait() and ce_init() A just-woken waiter may be put back to sleep if the countdown is re-initialized before it has time to leave the wait-loop. Thus some other form of synchronization is required if you want to use such an object more than once.

Eventcounts

One rather useful synchronization primitive is an "eventcount". It is somewhat like the reverse of a seq-lock, where the waiter blocks if no signals have occurred between two points in time, rather than the converse. This primitive can be used to construct "almost lock free" data structures, that rarely call into the kernel. The trick is making sure the signal/broadcast functions are wait free in the common case where there are no waiters, and making the wait-check operation as cheap as possible.

Like other primitives, this one fits inside a 32bit Futex integer. Two billion or so possible signals between checks should be enough. (A similar amount is used to prevent the ABA bug in lock free linked lists.) We will use the bottom bit to denote whether there are waiters or not, just like the countdown event object.


typedef struct eventcount eventcount;
struct eventcount
{
	unsigned e;
};

The broadcast and signal operations are quite similar. The first needs to wake all waiters, whereas the second just one. However, we can quickly check if no waiters exist, and avoid making the relatively slow syscall into the kernel. The signal operation is quite useful in preventing the "thundering herd" issue from occurring.


void ec_broadcast(eventcount *e)
{
	/* Increment sequence counter */
	unsigned v = atomic_add(&e->e, 2);
	
	/* No waiters? */
	if (!v & 1) return;
	
	/* No more waiters, clear lsb */
	atomic_clear_bit(&e->e, 0);
	
	/* Wake everyone */
	sys_futex(e, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
}

void ec_signal(eventcount *e)
{
	/* Increment sequence counter */
	unsigned v = atomic_add(&e->e, 2);
	
	/* No waiters? */
	if (!v & 1) return;
	
	/* Assume no more waiters, clear lsb */
	atomic_clear_bit(&e->e, 0);
	
	/* Wake someone */
	sys_futex(e, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

The first part of the waiter code needs to determine the "key" used to represent the instance in time. To do this, we use the integer value of the object. As an optimization, we set the lower bit to make comparisons faster. (We don't care about the state of that bit in the waiter code.)


unsigned ec_getkey(eventcount *e)
{
	unsigned v;
	
	/* Prevent this from being reordered by the compiler */
	barrier();
	
	/* Set wait-bit in key, which makes comparison easier */
	v = e->e | 1;
	
	/* Prevent this from being reordered by the compiler */
	barrier();
	
	return v;
}

The most complex part of the code is in the wait function. It needs to determine whether or not there have been any signals (or broadcasts) since the thread has obtained its key. We can do that by checking the integer against our key. Again, we don't care about the least significant bit, so we make sure that is set to match what we have done in the ec_getkey() function. If there have been no relevant changes, then we drop into the wait loop.


void ec_wait(eventcount *e, unsigned key)
{
	unsigned v;
	
	/* Prevent out-of order reads */
	barrier();
	
	/* Compare with wait-bit set */
	v = e->e | 1;
		
	/* Fastpath - notice that the event has been signaled */
	if (v != key) return;
		
	while (1)
	{
		/* We want to wait, or make sure wait-bit is set for other waiters after ec_signal() */
		atomic_set_bit(&e->e, 0);
		
		/* Get new value */
		v = e->e | 1;
		
		/* Has the event has been signaled? */
		if (v != key) return;
		
		/* Wait for event to be triggered */
		sys_futex(e, FUTEX_WAIT_PRIVATE, v, NULL, NULL, 0);
	}
}

Summary

The above synchronization primitives, while obscure, can be quite useful in some situations. As can be seen, they all can be efficiently implemented via the Futex API on Linux. No complex compare-exchange based loops are needed, as all atomic operations required exist within the x86 instruction set. Thus the above code should be quite fast, performing much better than generic mutex & condition variable implementations of the same tasks. You may want to investigate whether any performance critical code you may have could also be sped up in the same manner.

Comments

Jeff K said...
event_wait() has a bug in that it can result in threads blocking on events that have been set. If one thread is halted after entering the while loop but before setting the contended flag, then it may not be woken up by another thread calling event_set(). The second thread may erroneously conclude that nobody is waiting and exit without calling sys_futex().

You could fix the bug by reading and updating the two flags atomically (e.g. combining them into a single byte), but event_reset() and event_wait() would need to use atomic_bit_set() and atomic_bit_clear() or xchg_8().
sfuerst said...
Hello Jeff, Are you sure? I can't see how that can happen.

The futex system call checks that the state matches the given integer before going to sleep. If the event bit has been set, then that comparison will fail (It simultaneously checks for contended = true, event = false), and the thread will spin around to check the event state again.

If there is a bug, it probably is due to things leaking across event_reset(). I've added another compile barrier there to fix that.
Jeff K said...
Oh! Sorry, then you are right, and there is no bug. I haven't ever used the futex system calls, and I didn't know that they actually checked the memory location.
sfuerst said...
No problem. Code like this can be very difficult to verify. (Hence the article "Futexes are Tricky".)

I'll have to admit that I haven't checked the functions in this article nearly as much as those in the mutex code I derived them from. It is entirely possible that a few bugs slipped in, but good to know that none have been spotted yet. :-)
Borislav Trifonov said...
The events here seem to be usable for porting Windows code that uses Windows events for IPC (using the non-_PRIVATE version of the futex operations). Then event_wait() amended with a timeout becomes analogous to WaitForSingleObject. But what is the proper way to emulate WaitForMultipleObjects?
Samy Al Bahra said...
The volatile qualifier for ptr should be unnecessary if you list the memory clobber.
Z.G. said...
One thing to keep in mind is that Events on Windows NT can be used not only for thread-, but also process-synchronization. And with ZwDuplicateObject, we can even create an event in our own process, pass it to another process using LPC, sockets, pipes, or any other method, and then have that other process signal the event, or have us signal the event to unblock that other process.

Another issue is that events, like all other synchronization events on Windows, can be either anonymous or named. This means that an application can try to open an event created by another application based on shared knowledge of what that event name would be. What this all means is that despite the temptation to use futexes, in order to have the whoe range of functionality of Windows synchronization events on Linux we would have to use full-fledged IPC together with sockets and some form of poll() and select().
Fernanda said...
Windows has the CRITICAL_SECTION, which sounds like it is the same as the Linux futex. A CRITICAL_SECTION tries doing the test-and-set in user space. If that doesn't work then it can be told to spin for a while (on multi-core syetmss). If that doesn't work then it uses a kernel mutex to wait. Thus, it gives great performance, without ending up in a long spin loop.I've seen many people try to implement their own mutexes. They all handle contention poorly. Either they yield under contention with Sleep(), meaning that they stay idle too long, or they spin in a busy loop for hundreds or thousands of milliseconds, wasting power and causing priority inversions. Some of them even omit memory barriers so that they are inefficient *and* incorrect.Use the operating system primitives. Set a spin count on a CRITICAL_SECTION and use it rather than rolling your own. Please.
Ahmad said...
file <a href="http://lkewqws.com">potienr</a> is a <a href="http://lkewqws.com">potienr</a> which store the address of file n we can perform operation like read write on this file file <a href="http://lkewqws.com">potienr</a> is defined in header file stdio.h and its syntax is FILE*.

Enter the 10 characters above here


Name
About Us Returns Policy Privacy Policy Send us Feedback
Company Info | Product Index | Category Index | Help | Terms of Use
Copyright © Lockless Inc All Rights Reserved.