Priority Spin Locks

Often one wishes to use threads with differing intrinsic priorities within a program. For example, a thread that deals with the user interface should have lower latency than a compute thread, so that the user experience is improved. When such threads interact, the parallel primitives need to understand these priorities in order to satisfy them. If your program synchronizes using mutexes, the pthreads library supports this through the pthread_mutexattr_setprotocol() function with the values PTHREAD_PRIO_INHERIT or PTHREAD_PRIO_PROTECT.
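
For example, a priority-inheriting mutex can be created as follows. This is only a minimal sketch using the standard pthreads API, with error checking omitted for brevity:


#include <pthread.h>

static pthread_mutex_t pi_mutex;

static void create_pi_mutex(void)
{
	pthread_mutexattr_t attr;
	
	pthread_mutexattr_init(&attr);
	
	/* The owner inherits the priority of the highest-priority waiter */
	pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
	
	pthread_mutex_init(&pi_mutex, &attr);
	pthread_mutexattr_destroy(&attr);
}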

In order to implement priority-understanding mutexes, the pthreads library then has to tell the kernel extra information compared to standard mutex locks. On Linux, this is done through the futex API. The reason the kernel is involved is that mutexes cause waiting threads to sleep. Thus the kernel needs to know when the priority of a sleeping thread changes in order to decide when to awaken it. (The highest priority threads should awaken first.) Similarly, it also needs to know which threads own which locks. Fortunately, the futex system call hides this complexity, and implementing these mutex types simply involves storing the thread id (tid) of the thread that has the lock within the futex location corresponding to the lock. The kernel then knows which threads are waiting on the lock via the internal wait-queue, and knows which thread has the lock via that tid.
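
In outline, the user-space side of this protocol looks something like the following. Note that this is only a rough sketch of the priority-inheritance futex fast paths (the real glibc implementation handles error returns, robust lists and much more), and the helper names here are our own:


#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

static void pi_futex_lock(uint32_t *f)
{
	uint32_t tid = syscall(SYS_gettid);
	
	/* Uncontended fast path: store our tid into the empty lock word */
	if (__sync_val_compare_and_swap(f, 0, tid) == 0) return;
	
	/* Contended: the kernel queues us, and boosts the owner if needed */
	syscall(SYS_futex, f, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

static void pi_futex_unlock(uint32_t *f)
{
	uint32_t tid = syscall(SYS_gettid);
	
	/* No waiters: simply clear the lock word */
	if (__sync_val_compare_and_swap(f, tid, 0) == tid) return;
	
	/* The kernel set FUTEX_WAITERS in the word, so let it hand off the lock */
	syscall(SYS_futex, f, FUTEX_UNLOCK_PI, 0, NULL, NULL, 0);
}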

However, mutexes are not the only thread synchronization primitive. Another very commonly used primitive is the spinlock. Spinlocks are more efficient when the time spent within a critical section is small compared to the context-switch time, and they are used in many places within the kernel itself. However, the problem with spinlocks is that most implementations do not support thread priorities, so most users have to use the mutex equivalent instead. This article shows how priority-aware spinlocks can be constructed purely within user-space.

High Priority Thread Test

To test thread priorities we need to construct a benchmark. A simple one measures the latency between a high priority thread wanting to take a lock and actually getting it. By contending for that lock with many lower-priority threads, we can investigate the performance impact of using different algorithms. The lower the latency, the better.


#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int done = 0;
static plock lock1;

static unsigned long long ttaken;
static unsigned long long locks_taken;

/* Compile Barrier */
#define barrier() asm volatile("": : :"memory")

/* Release CPU for other hyperthread */
#define cpu_relax() asm volatile("pause\n": : :"memory")

/* Compare Exchange */
#define cmpxchg(P, O, N) __sync_val_compare_and_swap((P), (O), (N))

/* Atomic Ops for spinlock code */
#define atomic_or(P, V) __sync_or_and_fetch((P), (V))
#define atomic_and(P, V) __sync_and_and_fetch((P), (V))

/* Read the TSC for benchmarking cycle counts */
static __inline__ unsigned long long readtsc(void)
{
	unsigned hi, lo;
	__asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
	return lo + ((unsigned long long)hi << 32);
}

#define HIGH_SPIN	1000
#define MED_SPIN	10000

static void *high_thread1(void *dummy)
{
	unsigned long long tstart;
	unsigned long long tend;
	int i;
	
	init_prio(0);
	
	while (!done)
	{
		tstart = readtsc();
		plock_lock(&lock1);
		tend = readtsc();
		if (tend > tstart)
		{
			ttaken += tend - tstart;
			locks_taken++;
		}
		for (i = 0; i < HIGH_SPIN; i++) barrier();
		plock_unlock(&lock1);
		for (i = 0; i < HIGH_SPIN; i++) barrier();
	}
	
	fini_prio();

	return NULL;
}

static void *med_thread(void *dummy)
{
	int i;
	
	init_prio(1);
	
	while (!done)
	{
		plock_lock(&lock1);
		for (i = 0; i < MED_SPIN; i++) barrier();
		plock_unlock(&lock1);
	}
	
	fini_prio();
	
	return NULL;
}

int main(void)
{
	pthread_t high, med1, med2, med3;
	
	pthread_create(&high, NULL, high_thread1, NULL);
	pthread_create(&med1, NULL, med_thread, NULL);
	pthread_create(&med2, NULL, med_thread, NULL);
	pthread_create(&med3, NULL, med_thread, NULL);
	
	sleep(5);
	
	barrier();
	done = 1;
	
	pthread_join(high, NULL);
	pthread_join(med1, NULL);
	pthread_join(med2, NULL);
	pthread_join(med3, NULL);
	
	printf("Total cycles waiting per lock: %llu\n", ttaken / locks_taken);
	
	return 0;
}

Now, by supplying different definitions for the type plock and the functions plock_lock(), plock_unlock(), init_prio() and fini_prio(), we can investigate how long (in cycles) it takes on average for the high priority thread to obtain the contended lock.

The first case we will try is a normal xchg-based spinlock. This type of lock doesn't understand priorities at all. It also suffers from the problem that a thread which recently unlocked the lock is more likely to be able to take it again due to cache effects. Thus live-lock, where the high priority thread is out-competed by the medium priority threads, is likely.
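
For this test, something like the following minimal sketch can define the plock interface. It matches the classic test-and-set lock, with the priority hooks left empty; it is one plausible definition rather than the exact code benchmarked:


typedef unsigned int plock;

/* Atomic exchange */
#define xchg_32(P, V) __sync_lock_test_and_set((P), (V))

static void plock_lock(plock *p)
{
	while (xchg_32(p, 1))
	{
		/* Spin reading the lock word to avoid bouncing its cache line */
		while (*p) cpu_relax();
	}
}

static void plock_unlock(plock *p)
{
	barrier();
	*p = 0;
}

/* This lock knows nothing about priorities, so the hooks do nothing */
static void init_prio(int pr) { (void) pr; }
static void fini_prio(void) { }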

Unfortunately, the bad case mentioned above does happen, and it takes 1-2 billion cycles on average for the "high priority" thread to take this lock. This is an eternity in computer time, and obviously such a simple spinlock shouldn't be used in this case.

By changing to a different type of traditional spinlock, things could be better. The fastest lock we benchmarked on the locks page was the ticket lock. The ticket lock is also fair, preventing the live-lock situation, so the high priority thread shouldn't have to wait quite as long.
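
For reference, a ticket lock fitting the plock interface looks something like this sketch, which assumes the number of contending threads fits in an unsigned short:


typedef union plock plock;
union plock
{
	unsigned u;
	struct
	{
		unsigned short ticket;
		unsigned short users;
	} s;
};

static void plock_lock(plock *p)
{
	/* Take the next ticket, then wait until it is being served */
	unsigned short me = __sync_fetch_and_add(&p->s.users, 1);
	
	while (p->s.ticket != me) cpu_relax();
}

static void plock_unlock(plock *p)
{
	/* Serve the next ticket */
	barrier();
	p->s.ticket++;
}

static void init_prio(int pr) { (void) pr; }
static void fini_prio(void) { }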

Running the benchmark shows that the ticket lock algorithm takes about 70 thousand cycles on average. This is many orders of magnitude better. However, we are still not exploiting the knowledge that the "high priority" thread actually has a higher priority. By switching to another lock algorithm we can try to lower the latency even further.

Firstly, we will need some sort of structure to store the thread's priority information. The simplest case only requires a single integer for this. However, we will be a little clever and store 1<<priority instead, because this makes the lock routine a little simpler. (For example, a thread with priority 3 stores 1<<3 = 8, and 8-1 = 7 is then exactly the mask of the bits for the higher priorities 0 to 2.) Thus the number of bits in an unsigned long long limits us to 64 different priorities, with zero the highest and 63 the lowest.


typedef struct prio prio;
struct prio
{
	unsigned long long priority;
};

The lock structure will need to know the priorities of the waiters. We will bitwise-or them together and store them into an unsigned long long. It also needs to know if the lock is taken or not. To simplify later code, we store a pointer to the prio structure for this. If the pointer is NULL, then the lock is untaken.


typedef struct plock plock;
struct plock
{
	prio *owner;
	unsigned long long waiters;
};

We can use a per-thread variable to store the priority information. To initialize, we simply store our priority into this, and to finalize we don't actually need to do anything, which is nice.


static __thread prio my_prio;

static void priority_set(int pr)
{
	/* Priorities 0 (highest) to 63 (lowest) */
	my_prio.priority = 1ULL << pr;
}

static void init_prio(int pr)
{
	priority_set(pr);
}

static void fini_prio(void)
{

}

Finally, we are left with the locking code itself. We need to check whether there are any waiters with a higher priority than us. If so, we should spin. If not, we can try to take the lock by storing the address of our my_prio variable in the owner field. The only subtlety is that we need to atomically set the right bit in the waiters bitfield. By avoiding setting our bit when we don't have to, we save two atomic instructions in the fast path where the lock is untaken.


static void plock_lock(plock *p)
{
	/* Bits below our own bit correspond to higher priorities */
	unsigned long long mask = my_prio.priority - 1;
	
	/* Wait until no higher-priority waiters remain */
	while (p->waiters & mask)
	{
		cpu_relax();
	}

	/* Fast path: grab the untaken lock without touching the waiters field */
	if (!cmpxchg(&p->owner, NULL, &my_prio)) return;
	
	/* Slow path: advertise ourselves as a waiter */
	atomic_or(&p->waiters, my_prio.priority);
	
	while (1)
	{
		while (p->waiters & mask)
		{
			/* A same-priority winner may have cleared our bit; re-set it */
			if (!(my_prio.priority & p->waiters))
			{
				atomic_or(&p->waiters, my_prio.priority);
			}
			
			cpu_relax();
		}

		if (!cmpxchg(&p->owner, NULL, &my_prio))
		{
			/* Got the lock, so we are no longer a waiter */
			atomic_and(&p->waiters, ~my_prio.priority);
			return;
		}
		
		cpu_relax();
	}
}

static void plock_unlock(plock *p)
{
	/* On x86 a plain store has release semantics; the compiler
	   barrier stops it being reordered into the critical section */
	barrier();
	p->owner = NULL;
}

The above code has an average latency of 21 thousand cycles for the high priority thread in the benchmark. This is quite a bit better than the non-priority ticket lock algorithm. Unfortunately, the above isn't a full solution. There exists a situation where the algorithm breaks down badly.

Priority Inversion Test

A priority inversion happens when a high-priority thread is waiting on a lock owned by a low-priority thread. If the low priority thread cannot make progress (for example, because it is blocked by medium priority threads), then the high priority thread also cannot make progress. Since the high priority thread is indirectly blocked by the medium priority threads, this example shows a case where latency guarantees can break down.

To prevent priority inversion, the high priority thread needs to be able to boost the priority of the low priority thread whilst it owns the lock. Sometime afterwards, the excess priority of the low-priority thread needs to be removed, returning it to its low-priority state. This is called "priority inheritance", and it complicates the above algorithm.

Firstly, the prio structure needs to change. We need to store the original priority of the thread, as well as its boosted priority. In addition, a pointer for a linked list is also useful.


typedef struct prio prio;
struct prio
{
	unsigned long long priority;
	unsigned long long orig_priority;
	prio *next;
};

The lock struct does not need to change. The only difference is that this time we actually need to use the pointer stored in the owner field. By following that pointer we can access the owning thread's priority information, and possibly boost its priority if required. The problem is that we have no guarantee that the thread still owns the lock, or even still exists, while we are following the pointer. Thus the prio structs need to be handled carefully. We cannot free them back to the operating system without further checks.


typedef struct plock plock;
struct plock
{
	prio *owner;
	unsigned long long waiters;
};

Since the prio structs cannot be freed on thread exit, we change the per-thread variable to be a pointer to such a struct, and recycle retired structs through a lock-protected free list instead. Since the number of threads running at any one time is bounded, we only need a bounded number of such structs, so there is no memory leak. (Note that it is possible to construct an algorithm that can safely free these structs, but it is quite slow and not worth the effort.)


static __thread prio *my_prio;
static prio *prio_free = NULL;
static pthread_mutex_t prio_lock = PTHREAD_MUTEX_INITIALIZER;

/* Priorities 0 (highest) to 63 (lowest) */
static void init_prio(int pr)
{
	/* Scale the priority to speed up the lock code; an unsigned
	   long long is needed to hold the bits above 1 << 31 */
	unsigned long long p = 1ULL << pr;
	
	pthread_mutex_lock(&prio_lock);
	
	/* Try to grab an unused structure from the free list */
	if (prio_free)
	{
		/* Remove from the free list */
		my_prio = prio_free;
		prio_free = my_prio->next;
	}
	else
	{
		/* Otherwise allocate a new one */
		my_prio = malloc(sizeof(prio));
		/* errx() comes from <err.h>, and appends its own newline */
		if (!my_prio) errx(1, "Out of memory allocating thread priority structure");
	}
	
	my_prio->priority = pr;
	my_prio->orig_priority = pr;
		
	pthread_mutex_unlock(&prio_lock);
}

static void fini_prio(void)
{
	pthread_mutex_lock(&prio_lock);

	/* Add to the free list */
	my_prio->next = prio_free;
	prio_free = my_prio;
	
	pthread_mutex_unlock(&prio_lock);
}

static void priority_set(int pr)
{
	/* Priorities 0 (highest) to 63 (lowest) */
	my_prio->orig_priority = 1ULL << pr;
}

The lock routine now needs to handle priority boosting where required. It also needs to notice when its own priority is changed by another thread. This unfortunately slows things down a little compared to the non-priority-inheritance version.


static void plock_lock(plock *p)
{
	unsigned long long pr = my_prio->priority;
	prio *ow;
	unsigned long long mask;
	
	/* Read priority atomically */
	barrier();
	
	mask = pr - 1;

	/* Wait until higher priority tasks are done */
	while (p->waiters & mask)
	{
		/* Notice if my priority has changed */
		if (my_prio->priority != pr)
		{
			pr = my_prio->priority;
			
			barrier();
			
			/* Update the mask */
			mask = pr - 1;
		}
		
		cpu_relax();
	}

	/* Fast path, no one with higher priority waiting, and no one has the lock */
	ow = cmpxchg(&p->owner, NULL, my_prio);
	if (!ow) return;
	
	atomic_or(&p->waiters, pr);
	
	while (1)
	{
		while (p->waiters & mask)
		{
			/* Notice if my priority has changed */
			if (my_prio->priority != pr)
			{
				atomic_and(&p->waiters, ~pr);
				pr = my_prio->priority;
				atomic_or(&p->waiters, pr);

				/* Update the mask */
				mask = pr - 1;
			}

			cpu_relax();
		}
		
		ow = cmpxchg(&p->owner, NULL, my_prio);
		if (!ow)
		{
			atomic_and(&p->waiters, ~pr);
			return;
		}
		
		if (ow->priority > pr) ow->priority = pr;
		
		cpu_relax();
	}
}

static void plock_unlock(plock *p)
{
	/* Drop the lock */
	barrier();
	p->owner = NULL;
	
	/*
	 * Then revert to my original priority.
	 * If I am still blocking something of higher priority,
	 * they will boost me again.
	 */
	barrier();
	my_prio->priority = my_prio->orig_priority;
}

This code completes the benchmark with an average latency of about 22 thousand cycles. This is slightly slower than the old code.

Now we should test all of the spinlocks against the worst-case priority inversion situation. To do this, we construct another benchmark, adding a low-priority thread to the high and medium priority ones. We also change the high-priority thread to use the same lock as the low priority thread:


static plock lock2;

#define LOW_SPIN	1000

static void *high_thread2(void *dummy)
{
	unsigned long long tstart;
	unsigned long long tend;
	int i;
	
	init_prio(0);
	
	while (!done)
	{
		tstart = readtsc();
		plock_lock(&lock2);
		tend = readtsc();
		if (tend > tstart)
		{
			ttaken += tend - tstart;
			locks_taken++;
		}
		for (i = 0; i < HIGH_SPIN; i++) barrier();
		plock_unlock(&lock2);
		for (i = 0; i < HIGH_SPIN; i++) barrier();
	}
	
	fini_prio();

	return NULL;
}


static void *low_thread(void *dummy)
{
	int i;
	
	init_prio(2);
	
	while (!done)
	{
		plock_lock(&lock2);
		plock_lock(&lock1);
		for (i = 0; i < LOW_SPIN; i++) barrier();
		
		/* The order of unlock shouldn't matter */
		plock_unlock(&lock2);
		plock_unlock(&lock1);
	}
	
	fini_prio();
	
	return NULL;
}


int main(void)
{
	pthread_t high, med1, med2, med3, low;
	
	pthread_create(&high, NULL, high_thread2, NULL);
	pthread_create(&med1, NULL, med_thread, NULL);
	pthread_create(&med2, NULL, med_thread, NULL);
	pthread_create(&low, NULL, low_thread, NULL);
	
	sleep(5);
	
	barrier();
	done = 1;
	
	pthread_join(high, NULL);
	pthread_join(med1, NULL);
	pthread_join(med2, NULL);
	pthread_join(low, NULL);
	
	printf("Total cycles waiting per lock: %llu\n", ttaken / locks_taken);
	
	return 0;
}

The xchg-based spinlock takes a few hundred million cycles for the high priority thread to get the lock in this benchmark. The ticket lock is much faster, taking 45 thousand cycles on average. The first priority-aware algorithm is very slow (this is its worst case, after all), taking a hundred million cycles on average. Finally, the priority inheritance spinlock takes about 70 thousand cycles.

Thus for this worst-case situation, the ticket lock is the most efficient. Its intrinsic fairness, combined with its lack of knowledge of thread priorities, gives it the minimal latency.

Summary

As is shown above, it isn't too difficult to construct a priority-aware spinlock. Such a spinlock can be made to have priority inheritance at minimal runtime cost. Since we know that all threads participating in spinlock code must be running (and not sleeping in the kernel), no system calls are required. This makes the above code operating-system agnostic. (It is also possible to make it run in kernel mode: just convert all the per-thread variables to per-cpu ones.)

In the common situation, where priority inversion is rare, the use of priority-aware locks can improve latency over traditional spinlocks. However, for complex locking schemes where inversion can occur, the speed of the ticket lock may make it the better choice.
