Lockless Inc

Barriers

Barriers are a fundamental synchronization primitive that allow you to tell when a set of threads have completed a task. Once all threads reach a barrier, and not before, will they be allowed to pass through it. Due to this, they are a rather heavy-weight operation. However, many parallel algorithms can be simplified greatly by being broken into phases of computation synchronized by barriers. This ease of use means that they are a nice option to keep in your virtual toolkit of parallel idioms.

The pthreads library supports barriers via the pthread_barrier_t data type, and the pthread_barrier_init, pthread_barrier_destroy, and pthread_barrier_wait functions. To use a pthreads barrier, you first initialize it with the number of threads that are required to be synchronized. (Together with extra optional attributes like whether the barrier is required to be inter-process, rather than intra-process.) Then threads needing to be synchronized call pthread_barrier_wait(). Finally, pthread_barrier_destroy should be called when the barrier primitive is no longer needed.

Destruction of a barrier primitive is a little bit subtle. Most other synchronization primitives can have a destruction operation that is a no-op. They don't allocate memory, and the futex system call that they use for waiting doesn't need explicit deallocation. Barriers also don't have to allocate memory, and also use futexes (assuming a Linux-based implementation). However, they do have some work to do.

The problem is that a destroy call can occur whilst other threads are in the process of leaving the barrier. These threads need to check the state of the barrier due to the fact that the futex system call can have "false wakeups". If the memory used by the barrier is freed out from underneath them, then undefined behaviour will occur. For example, if the memory is unmapped by the munmap() system call, then a segfault is inevitable.

Thus the destroying thread needs to wait until all other threads have finished exiting the barrier. In order for that to happen, extra synchronization is required. This subtle issue was ignored in the winpthreads library. Fortunately, it is easy to fix by reusing the condition variable and mutex in the destroy function. That article has been updated with the corrected version.

The winpthreads library shows how we can implement our own version of a barrier from simpler mutex and condition variable primitives. Translating back from Microsoft Windows functions to pthreads gives:


typedef struct barrier_t barrier_t;
struct barrier_t
{
	unsigned count;
	unsigned total;
	pthread_mutex_t m;
	pthread_cond_t cv;
};

#define BARRIER_FLAG (1UL<<31)
void barrier_destroy(barrier_t *b)
{
	pthread_mutex_lock(&b->m);

	while (b->total > BARRIER_FLAG)
	{
		/* Wait until everyone exits the barrier */
		pthread_cond_wait(&b->cv, &b->m);
	}

	pthread_mutex_unlock(&b->m);

	pthread_cond_destroy(&b->cv);
	pthread_mutex_destroy(&b->m);
}

void barrier_init(barrier_t *b, unsigned count)
{
	pthread_mutex_init(&b->m, NULL);
	pthread_cond_init(&b->cv, NULL);
	b->count = count;
	b->total = BARRIER_FLAG;
}

int barrier_wait(barrier_t *b)
{
	pthread_mutex_lock(&b->m);

	while (b->total > BARRIER_FLAG)
	{
		/* Wait until everyone exits the barrier */
		pthread_cond_wait(&b->cv, &b->m);
	}

	/* Are we the first to enter? */
	if (b->total == BARRIER_FLAG) b->total = 0;

	b->total++;

	if (b->total == b->count)
	{
		b->total += BARRIER_FLAG - 1;
		pthread_cond_broadcast(&b->cv);

		pthread_mutex_unlock(&b->m);

		return PTHREAD_BARRIER_SERIAL_THREAD;
	}
	else
	{
		while (b->total < BARRIER_FLAG)
		{
			/* Wait until enough threads enter the barrier */
			pthread_cond_wait(&b->cv, &b->m);
		}

		b->total--;

		/* Get entering threads to wake up */
		if (b->total == BARRIER_FLAG) pthread_cond_broadcast(&b->cv);

		pthread_mutex_unlock(&b->m);

		return 0;
	}
}

We can benchmark the above by calling the barrier wait function many times repeatedly in a loop. By changing the number of threads to synchronize, we can investigate the scalability of the algorithm. Such a benchmark might look like:


/* Number of threads to use */
#define N 2

/* Number of times to call barrier primitive */
#define COUNT 1000000

static pthread_barrier_t bench_b;

static void *tbench(void *arg)
{
	int i;

	(void) arg;

	for (i = 0; i < COUNT; i++)
	{
		pthread_barrier_wait(&bench_b);
	}

	return NULL;
}

static void bench(void)
{
	pthread_t th[N - 1];
	int i;

	pthread_barrier_init(&bench_b, NULL, N);

	for (i = 0; i < N - 1; i++)
	{
		if (pthread_create(&th[i], NULL, tbench, NULL))
		{
			errx(1, "pthread_create failed");
		}
	}

	tbench(NULL);

	pthread_barrier_destroy(&bench_b);

	for (i = 0; i < N - 1; i++)
	{
		if (pthread_join(th[i], NULL)) errx(1, "pthread_join failed");
	}

	puts("bench OK");
}

We can then switch between various barrier implementations with a few macros. (The default in the above is to use the pthreads library barriers.) i.e.


#ifdef USE_MUTEX_CV_BARRIER
#define pthread_barrier_t barrier_t
#define pthread_barrier_init(B, A, N) (barrier_init((B), (N)), 0)
#define pthread_barrier_destroy(B) (barrier_destroy(B), 0)
#define pthread_barrier_wait barrier_wait
#endif

We decide to use runs with two threads, four threads, and twenty threads on this quad-core machine. The two-thread case investigates a best-case situation. The four-thread case has a small amount of interference from other processes running on the system, so we should expect it to run a little slow. The twenty-thread case determines if there are any pathological issues with running large numbers of threads through the barrier.

The pthreads library takes

Threads2420
Time (s)6.8-8.017.9-20.5102-112

The mutex/cv implementation takes

Threads2420
Time (s)7.2-7.819.6-22.5126-158

The first thing to notice in the above is the large variability in the results. (We tested five times for each combination, and recorded the range of times seen.) There seems to be about a 10% intrinsic variation, which can make it difficult to see which of two algorithms is faster. For this particular pair, it seems that only at large numbers of threads is the pthreads library significantly faster than the open-coded version.

This shouldn't be too surprising as the pthreads library actually uses a similar algorithm. It just uses a faster low-level lock than a pthreads mutex. It also is implemented in assembly for speed.

The real question now is can we do better? The first trick is to realize that the current barrier implementation is suffering from quite a large inefficiency. No threads can enter a barrier, whilst others are leaving it. They need to wait until all other threads have left, and the count reaches zero before they can start. If we can simultaneously allow new threads to start waiting on the barrier, whilst old threads are woken up and begin to leave it, we should be able to improve the times.

The first thing to notice is that we will have to use something more low-level than mutexes and condition variables. We will also have to use the futex system call directly. This calls for the usual set of atomic primitives:


#include <pthread.h>
#include <linux/futex.h>
#include <stdint.h>
#include <limits.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/time.h>

/* Compile Barrier */
#define barrier() asm volatile("": : :"memory")

/* Atomic add, returning the new value after the addition */
#define atomic_add(P, V) __sync_add_and_fetch((P), (V))

/* Atomic add, returning the value before the addition */
#define atomic_xadd(P, V) __sync_fetch_and_add((P), (V))

/* Atomic or */
#define atomic_or(P, V) __sync_or_and_fetch((P), (V))

/* Force a read of the variable */
#define atomic_read(V) (*(volatile typeof(V) *)&(V))

/* Atomic 32 bit exchange */
static inline unsigned xchg_32(void *ptr, unsigned x)
{
	__asm__ __volatile__("xchgl %0,%1"
				:"=r" ((unsigned) x)
				:"m" (*(volatile unsigned *)ptr), "0" (x)
				:"memory");

	return x;
}

/* Atomic 64 bit exchange */
static inline unsigned long long xchg_64(void *ptr, unsigned long long x)
{
	__asm__ __volatile__("xchgq %0,%1"
				:"=r" ((unsigned long long) x)
				:"m" (*(volatile unsigned long long *)ptr), "0" (x)
				:"memory");

	return x;
}


static long sys_futex(void *addr1, int op, int val1, struct timespec *timeout, void *addr2, int val3)
{
	return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3);
}

The Pool Barrier

Now we can use the above to construct a "pool barrier". It works by filling a pool of waiting threads. Once the pool is full, a sequence number is incremented. The waiting threads can see the sequence number has changed, and they then can exit the barrier. New threads will use the new sequence number. If too many threads try to enter the barrier at the same time, some will get a count number greater than the size of the pool. If so, they also need to wait until the sequence number changes in order to try again.

The key part of the algorithm looks like:


int pool_barrier_wait(pool_barrier_t *b)
{
	int ret;

	while (1)
	{
		unsigned seq = atomic_read(b->seq);
		unsigned count = atomic_xadd(&b->count, 1);

		if (count < b->total)
		{
			/* Can we proceed? */
			while (atomic_read(b->seq) == seq)
			{
				/* Sleep on it instead */
				sys_futex(&b->seq, FUTEX_WAIT_PRIVATE, seq, NULL, NULL, 0);
			}

			ret = 0;
			break;
		}

		if (count == b->total)
		{
			/* Simultaneously clear count, and increment sequence number */
			barrier();
			b->reset = b->seq + 1;
			barrier();

			/* Wake up sleeping threads */
			sys_futex(&b->seq, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);

			ret = PTHREAD_BARRIER_SERIAL_THREAD;
			break;
		}

		/* We were too slow... wait for the barrier to be released */
		sys_futex(&b->seq, FUTEX_WAIT_PRIVATE, seq, NULL, NULL, 0);
	}

	return ret;
}

However, there is one important thing missing. We need to be able to safely destroy the barrier. Thus we need some way of knowing how many threads are currently accessing it. The simplest way of doing that is to use a reference count. A thread entering the barrier increments the count. A thread exiting decrements it. If the count starts at one, and the destroying thread decrements and waits until the count reaches zero, we have a nice solution. Such a barrier looks like:


typedef struct pool_barrier_t pool_barrier_t;
struct pool_barrier_t
{
	union
	{
		struct
		{
			unsigned seq;
			unsigned count;
		};
		unsigned long long reset;
	};
	unsigned refcount;
	unsigned total;
};

void pool_barrier_init(pool_barrier_t *b, unsigned count)
{
	b->seq = 0;
	b->count = 0;
	b->refcount = 1;
	/* Total of waiting threads */
	b->total = count - 1;
}

void pool_barrier_destroy(pool_barrier_t *b)
{
	/* Trigger futex wake */
	atomic_add(&b->refcount, -1);

	/* Wait until all references to the barrier are dropped */
	while (1)
	{
		unsigned ref = atomic_read(b->refcount);

		if (!ref) return;

		sys_futex(&b->refcount, FUTEX_WAIT_PRIVATE, ref, NULL, NULL, 0);
	}
}

int pool_barrier_wait(pool_barrier_t *b)
{
	int ret;

	atomic_add(&b->refcount, 1);

	while (1)
	{
		unsigned seq = atomic_read(b->seq);
		unsigned count = atomic_xadd(&b->count, 1);

		if (count < b->total)
		{
			/* Can we proceed? */
			while (atomic_read(b->seq) == seq)
			{
				/* Sleep on it instead */
				sys_futex(&b->seq, FUTEX_WAIT_PRIVATE, seq, NULL, NULL, 0);
			}

			ret = 0;
			break;
		}

		if (count == b->total)
		{
			/* Simultaneously clear count, and increment sequence number */
			barrier();
			b->reset = b->seq + 1;
			barrier();

			/* Wake up sleeping threads */
			sys_futex(&b->seq, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);

			ret = PTHREAD_BARRIER_SERIAL_THREAD;
			break;
		}

		/* We were too slow... wait for the barrier to be released */
		sys_futex(&b->seq, FUTEX_WAIT_PRIVATE, seq, NULL, NULL, 0);
	}

	/* Are we the last to wake up? */
	if (atomic_xadd(&b->refcount, -1) == 1)
	{
		/* Wake destroying thread */
		sys_futex(&b->refcount, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
	}
	return ret;
}

The pool barrier takes:

Threads2420
Time (s)3.2-3.78.7-11.052-61

The result is a barrier that is twice as fast as the pthreads barrier! The extra speed comes from the new overlap in entry and exit.

Ticket Barrier

Can we do even better? The above pool barrier uses two different count variables to control access. The first counts how many threads have tried to enter the pool. The second is the sequence number, that counts which pool we are currently using. Perhaps we can try to combine them in some way?

One way to try to simplify the code is to follow the example of the "ticket lock". That fast locking algorithm has threads take a ticket, and wait until it is that ticket's turn to continue. If we allow a bunch of tickets to continue at the same time, then we will have something equivalent to a barrier.

Unfortunately, we still need some way to know when all threads have left the barrier. This means we need something like the refcount in the pool barrier algorithm. Since each thread must take a ticket on entry, we can use that as the initial increment. If they also take a ticket on exit, we can see when those counters match. If they do, then all threads have left. Such an algorithm looks like:


typedef struct ticket_barrier_t ticket_barrier_t;
struct ticket_barrier_t
{
	unsigned count_next;
	unsigned total;
	union
	{
		struct
		{
			unsigned count_in;
			unsigned count_out;
		};
		unsigned long long reset;
	};
};

/* Range from count_next to count_next - (1U << 31) is allowed to continue */


void ticket_barrier_init(ticket_barrier_t *b, unsigned count)
{
	b->count_in = 0;
	b->count_out = 0;
	b->count_next = -1;
	b->total = count;
}

void ticket_barrier_destroy(ticket_barrier_t *b)
{
	/* Alter the refcount so we trigger futex wake */
	atomic_add(&b->count_in, -1);

	/* However, threads can be leaving... so wait for them */
	while (1)
	{
		unsigned count_out = atomic_read(b->count_out);

		/* Make sure everything is synchronized */
		barrier();

		if (count_out == atomic_read(b->count_in) + 1) return;

		sys_futex(&b->count_out, FUTEX_WAIT_PRIVATE, count_out, NULL, NULL, 0);
	}
}

int ticket_barrier_wait(ticket_barrier_t *b)
{
	unsigned wait = atomic_xadd(&b->count_in, 1);
	unsigned next;
	unsigned long long temp;

	int ret;

	while (1)
	{
		next = atomic_read(b->count_next);

		/* Have the required number of threads entered? */
		if (wait - next == b->total)
		{
			/* Move to next bunch */
			b->count_next += b->total;

			/* Wake up waiters */
			sys_futex(&b->count_next, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);

			ret = PTHREAD_BARRIER_SERIAL_THREAD;

			break;
		}

		/* Are we allowed to go? */
		if (wait - next >= (1UL << 31))
		{
			ret = 0;
			break;
		}

		/* Go to sleep until our bunch comes up */
		sys_futex(&b->count_next, FUTEX_WAIT_PRIVATE, next, NULL, NULL, 0);
	}

	/* Add to count_out, simultaneously reading count_in */
	temp = atomic_xadd(&b->reset, 1ULL << 32);

	/* Does count_out == count_in? */
	if ((temp >> 32) == (unsigned) temp)
	{
		/* Notify destroyer */
		sys_futex(&b->count_out, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
	}

	return ret;
}

The ticket lock has less locked instructions than the pool algorithm. (Two instead of four.) This means that it should be faster.

Threads2420
Time (s)3.0-3.77.6-9.451-62

However, as the above table shows, that isn't the case! It seems the overhead of having to synchronize with other threads via the futex system call is much greater than the time spent in the atomic locked instructions. At least on this machine, there is very little difference in timings between the two algorithms. The intrinsic 10% variability explains all the variation.

The ticket barrier has one weakness though. It requires that no more than 231 threads can cumulatively take the barrier before all waiting threads leave it. If more do, then the count will overflow, and a deadlock may occur. The pool barrier has a similar weakness... but there the limit is 232-total threads per sequence number, and with 232 different sequence numbers before a problem occurs. This is a much larger number, and much less likely to occur. If only futexes could wait on 64bit numbers rather than 32bit numbers, and this problem wouldn't be so much of an issue. As it is though, for safety, the ticket barrier shouldn't really be used as its advantages are small.

Further Improvements

The above algorithms are twice as fast as the standard pthreads barriers. However, they may be able to be improved further. The trick is to notice that we always call the kernel via a futex no matter what happens. If few threads are synchronizing, then it may be profitable to spin a bit waiting instead. Such an extension to the pool barrier looks like:


int pool_barrier_wait2(pool_barrier_t *b)
{
	int ret;

	atomic_add(&b->refcount, 1);

	while (1)
	{
		unsigned seq = atomic_read(b->seq);
		unsigned count = atomic_xadd(&b->count, 1);

		if (count < b->total)
		{
			int i;
			seq |= 1;

			/* Spin a bit */
			for (i = 0; i < 1000; i++)
			{
				if ((atomic_read(b->seq) | 1) != seq) break;
			}

			/* Can we proceed? */
			while ((atomic_read(b->seq) | 1) == seq)
			{
				/* Hack - set a flag that says we are sleeping */
				*(volatile char *) &b->seq = 1;

				/* Sleep on it instead */
				sys_futex(&b->seq, FUTEX_WAIT_PRIVATE, seq, NULL, NULL, 0);
			}

			ret = 0;
			break;
		}

		if (count == b->total)
		{
			/* Simultaneously clear count, increment sequence number, and clear wait flag */
			seq = atomic_read(b->seq);

			if (xchg_64(&b->reset, (seq | 1) + 255) & 1)
			{
				/* Wake up sleeping threads */
				sys_futex(&b->seq, FUTEX_WAKE_PRIVATE, INT_MAX, NULL, NULL, 0);
			}

			ret = PTHREAD_BARRIER_SERIAL_THREAD;
			break;
		}

		/* We could spin a bit here as well... but reaching here is rare */

		seq |= 1;
		/* Hack - set a flag that says we are sleeping */
		*(volatile char *) &b->seq = 1;

		/* We were too slow... wait for the barrier to be released */
		sys_futex(&b->seq, FUTEX_WAIT_PRIVATE, seq, NULL, NULL, 0);
	}

	/* Are we the last to wake up? */
	if (atomic_xadd(&b->refcount, -1) == 1)
	{
		/* Wake destroying thread */
		sys_futex(&b->refcount, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
	}
	return ret;
}

In the above, the wait function spins 1000 times before sleeping. Why 1000? If we spin too little, then we get no advantage at all. If we spin too much, then the case with more threads than cpus becomes much much slower. The result for this number is:

Threads2420
Time (s)0.40-0.58~11.5~90

And so we have a mixed blessing. When we have two threads, the result is ten times faster than before. When we have four or more threads, the result is slower than before. In the case of twenty threads, much slower. Unfortunately, picking a spin number smaller than 1000 rapidly makes the 10× improvement disappear. The slow-down doesn't go away so rapidly. There appears to be no good number that gets a good improvement without a disadvantage somewhere else. (The spin version of the ticket barrier gives similar numbers.)

There is a way out though. If we count how many cpus we have at initialization time, we can change the spin number if we have more or less threads synchronizing at the barrier. We can also support the inter-process barrier feature of pthreads with little overhead. The result is a uniformly fast barrier that is a drop-in replacement for the standard one:


typedef struct fast_barrier_t fast_barrier_t;
struct fast_barrier_t
{
	union
	{
		struct
		{
			unsigned seq;
			unsigned count;
		};
		unsigned long long reset;
	};
	unsigned refcount;
	unsigned total;
	int spins;
	unsigned flags;
};

void fast_barrier_init(fast_barrier_t *b, pthread_barrierattr_t *a, unsigned count)
{
	b->seq = 0;
	b->count = 0;
	b->refcount = 1;
	/* Total of waiting threads */
	b->total = count - 1;

	if (count < sysconf(_SC_NPROCESSORS_ONLN))
	{
		b->spins = 1000;
	}
	else
	{
		b->spins = 1;
	}

	/* Default to process private */
	b->flags = FUTEX_PRIVATE_FLAG;
	if (a)
	{
		/* Check for a shared barrier */
		int shared = 0;
		pthread_barrierattr_getpshared(a, &shared);
		if (shared) b->flags = 0;
	}
}

void fast_barrier_destroy(fast_barrier_t *b)
{
	/* Trigger futex wake */
	atomic_add(&b->refcount, -1);

	/* Wait until all references to the barrier are dropped */
	while (1)
	{
		unsigned ref = atomic_read(b->refcount);

		if (!ref) return;

		sys_futex(&b->refcount, FUTEX_WAIT | b->flags, ref, NULL, NULL, 0);
	}
}

int fast_barrier_wait(fast_barrier_t *b)
{
	int ret;

	atomic_add(&b->refcount, 1);

	while (1)
	{
		unsigned seq = atomic_read(b->seq);
		unsigned count = atomic_xadd(&b->count, 1);

		if (count < b->total)
		{
			int i;
			seq |= 1;

			for (i = 0; i < b->spins; i++)
			{
				if ((atomic_read(b->seq) | 1) != seq) break;
			}

			/* Can we proceed? */
			while ((atomic_read(b->seq) | 1) == seq)
			{
				/* Hack - set a flag that says we are sleeping */
				*(volatile char *) &b->seq = 1;

				/* Sleep on it instead */
				sys_futex(&b->seq, FUTEX_WAIT | b->flags, seq, NULL, NULL, 0);
			}

			ret = 0;
			break;
		}

		if (count == b->total)
		{
			/* Simultaneously clear count, increment sequence number, and clear wait flag */
			seq = atomic_read(b->seq);

			if (xchg_64(&b->reset, (seq | 1) + 255) & 1)
			{
				/* Wake up sleeping threads */
				sys_futex(&b->seq, FUTEX_WAKE | b->flags, INT_MAX, NULL, NULL, 0);
			}

			ret = PTHREAD_BARRIER_SERIAL_THREAD;
			break;
		}

		seq |= 1;
		/* Hack - set a flag that says we are sleeping */
		*(volatile char *) &b->seq = 1;

		/* We were too slow... wait for the barrier to be released */
		sys_futex(&b->seq, FUTEX_WAIT | b->flags, seq, NULL, NULL, 0);
	}

	/* Are we the last to wake up? */
	if (atomic_xadd(&b->refcount, -1) == 1)
	{
		/* Wake destroying thread */
		sys_futex(&b->refcount, FUTEX_WAKE | b->flags, 1, NULL, NULL, 0);
	}
	return ret;
}

This barrier gets the timings:

Threads2420
Time (s)0.38-0.599.7-10.552-57

It is twenty times as fast for small numbers of threads, and twice as fast for large numbers of threads, compared to the pthreads barrier. However, we have one more trick up our sleeves. The pthreads library implements its barrier in assembly language for speed. We can do the same. The resulting .S file contains:


#define FUTEX_WAIT      0
#define FUTEX_WAKE		1

#define SYS_futex 202

#define PTHREAD_BARRIER_SERIAL_THREAD	-1

#define offset_refcount 0x8
#define offset_total    0xc
#define offset_spins    0x10
#define offset_flags    0x14

.globl fast_barrier_wait_asm
.type   fast_barrier_wait_asm,@function
.align 16
fast_barrier_wait_asm:
	lock incl offset_refcount(%rdi)
	
1:	movabs $0x100000000, %rdx
	lock xadd %rdx, (%rdi)  # Simultaneously increment count + read sequence number
	mov %rdx, %r8
	or  $1, %edx
	shr $32, %r8
	
	sub offset_total(%rdi), %r8d  # compare + save result
	jne 4f

	add $255, %edx  # Also clears top 32 bits of %rdx
	xchg (%rdi), %rdx
	test $1, %dl
	jne  3f  # Any threads to wake up?

2:	mov $PTHREAD_BARRIER_SERIAL_THREAD, %eax
	lock decl offset_refcount(%rdi)
	je  8f # Do we need to wake the destroying thread?
	rep; retq
	
3:	mov $SYS_futex, %eax
	mov offset_flags(%rdi), %esi
	mov $0x7fffffff, %edx
	add $FUTEX_WAKE, %esi
	syscall  # Wake threads sleeping on the barrier
	jmp 2b
	
4:	mov offset_spins(%rdi), %eax
5:	cmp (%rdi), %edx  # Spin, waiting for the sequence number to change
	js  7f
	pause
	add $-1, %eax
	jne 5b
  
	xor %r10, %r10
	mov offset_flags(%rdi), %esi

6:	movb $1, (%rdi)
	mov $SYS_futex, %eax
	syscall  # Sleep, waiting for the sequence number to change
	cmp (%rdi), %edx
	je  6b
	
7:	test %r8d, %r8d  # Did we pass the barrier?
	jns  1b
	xor %eax, %eax
	lock decl offset_refcount(%rdi)
	je  8f # Do we need to wake the destroying thread?
	rep; retq

8:	mov %eax, %r8d  # Save return value
	mov offset_flags(%rdi), %esi
	mov $SYS_futex, %eax
	add $offset_refcount, %rdi
	mov $1, %edx
	add $FUTEX_WAKE, %esi
	syscall  # Wake up the destroying thread
	mov %r8d, %eax  # Restore return value
	retq

.size fast_barrier_wait_asm, .-fast_barrier_wait_asm

Unfortunately, there is little difference in speed between the C and asm versions of the barrier. (At least within the 10% variation seen here.) However, using the above doesn't hurt, and may even help on other machines.

Summary

Barriers are a useful synchronization primitive. However, they are relatively slow and heavy-weight in the glibc pthreads library. By changing to a different algorithm, we can speed things up by a factor ranging from 2-20.

Comments

Borislav Trifonov said...
Have you looked at phasers? It's a recent generalization of barriers that looks quite interesting; see for example "Hierarchical Phasers for Scalable Synchronization and Reductions in Dynamic Parallelism".
Bryan LaPorte said...
Definitely a better methodology, although I would put the spin loop/sleep in the straight-line path as opposed to after a branch, since it's statistically the most likely (branch prediction optimization here is a bit of a formality, I realize). It's worth pointing out that the spin loop optimization (in a loop as it is here) will be highly sensitive to the lengths of varying payloads. It might be worthwhile to use an in-out parameter to keep a per-thread spin count which can be tuned (obviously maintaining the non-spinning over-threading case): say, doubled after each thread sleep up to a maximum count.
sfuerst said...
There is now an article on Phasers. You can use the same tricks to speed those up as well. :-)

Basically, you want to spin-loop time to be roughly equal to a syscall/context switch time. Any longer, and you are wasting cpu. Any shorter, and timing jitter will get you. You could make it adaptive... but having a good default is probably better.
CT Chou said...
Suppose there are N threads in total and a pool barrier is initialized with count=N. Is it true that in pool_barrier_wait, it is impossible for any thread to reach the sys_futex call preceded by the comment: "We were too slow... wait for the barrier to be released"? So, in this case, is the while loop in pool_barrier_wait necessary?

Thanks in advance!
jessej said...
I got the code to work for 64 bit, however I could not get to work with 32 bit. Any suggestions?

Enter the 10 characters above here


Name
About Us Returns Policy Privacy Policy Send us Feedback
Company Info | Product Index | Category Index | Help | Terms of Use
Copyright © Lockless Inc All Rights Reserved.