
The Singleton Pattern

The Singleton programming pattern is an extension of the idea of a global variable. It describes a class that can only ever have one instance in existence at a time. This is useful because certain resources, like the keyboard or an error log file, should only have one piece of code driving them at a time. The way to construct such a class seems obvious, and looks something like:


class singleton
{
public:
	static singleton& instance()
	{
		/* Create it if it doesn't exist. */
		if (!pinstance_)
		{
			pinstance_ = new singleton;
		}
		
		return *pinstance_;
	}
	
	/* Example body function */
	static int do_something()
	{
		return 0;
	}

private:
	/* The one instance, if it has been created */
	static singleton *pinstance_;

	/* Protects creation in the multithreaded version below */
	static mutex mut_;
	
	/* Cannot explicitly construct or copy the singleton */
	singleton();
	singleton(const singleton&);
	singleton& operator=(const singleton&);

	~singleton()
	{
		/* Clear the pointer so a destroyed instance is never handed out */
		pinstance_ = 0;
	}
};

/* Definitions for the static members */
singleton *singleton::pinstance_ = 0;
mutex singleton::mut_;

/* Example usage */
int global = singleton::instance().do_something();

Note how the pointer-to-implementation programming pattern is used. This is done instead of using a class with static initialization because the order of static initialization across translation units is unspecified. If two singletons have a dependency relationship in their initialization, that unspecified order poses a problem. It may be possible to order the object files such that the code happens to work, but that is brittle, and using a different compiler or linker may cause things to break as one singleton attempts to use the other before it has been fully constructed. The pointer-to-implementation pattern solves this problem by making sure that the singleton::instance() function always has something fully constructed to return.
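To make the hazard concrete, here is a minimal sketch of the problem being avoided. The names config and logger are made up for illustration; in a real program the two globals would live in different object files, where their relative construction order is unspecified:

#include <string>

struct config
{
	config() : path_("/var/log/app.log") {}
	std::string path_;
};

extern config the_config;	/* normally defined in another translation unit */

struct logger
{
	logger()
	{
		/* If this runs before the_config's constructor, it reads an
		   object whose lifetime has not yet begun. */
		std::string p = the_config.path_;
		(void) p;
	}
};

logger the_logger;	/* constructed here, possibly before the_config */
config the_config;

Routing every access through singleton::instance() removes this ordering dependency, because the object is created the first time it is actually needed.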

The above code works well in single-threaded programs. However, it needs to be modified for multithreaded ones. The obvious approach is to add a lock and synchronize access, i.e. something like:


	static singleton& instance()
	{
		/* The lock must be named; an unnamed temporary would be destroyed,
		   and the mutex released, at the end of this statement. */
		scoped_lock lock(mut_);
	
		/* Create it if it doesn't exist. */
		if (!pinstance_)
		{
			pinstance_ = new singleton;
		}
		
		return *pinstance_;
	}
private:
	static mutex mut_;

The above works very well. One thing to note is that the destructor does not require a lock. Why? The reason is that instance() returns a reference. The client code may use such a reference at any time, and such use cannot be prevented by any lock within the singleton class itself. If the singleton object destroys itself then the reference points to a deleted object and any use of the reference, including calling any member function, is undefined. This means that when the client code decides to delete the singleton, it must synchronize itself with all other users, for example via a lock.

Why is a reference returned from instance()? If a smart pointer were used instead, its internal reference count could prevent deletion until all users are done. The problem with this is the overhead: in a threaded context a smart pointer requires atomic instructions to update the reference count, and atomic instructions are slow, so we want to minimize their use whenever possible.
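For comparison, here is a sketch of what a reference-counted accessor might look like, written with the C++11 standard library purely for illustration; it is not the design used in this article, but it shows where the extra atomic traffic comes from:

#include <memory>
#include <mutex>

class counted_singleton
{
public:
	/* Every call copies the shared_ptr, which atomically increments the
	   reference count, and every copy that goes out of scope atomically
	   decrements it again. */
	static std::shared_ptr<counted_singleton> instance()
	{
		std::lock_guard<std::mutex> lock(mut_);
		if (!pinstance_)
		{
			pinstance_.reset(new counted_singleton);
		}
		return pinstance_;
	}

private:
	counted_singleton() {}

	static std::shared_ptr<counted_singleton> pinstance_;
	static std::mutex mut_;
};

std::shared_ptr<counted_singleton> counted_singleton::pinstance_;
std::mutex counted_singleton::mut_;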

Finally, note that the lifetime of a singleton is normally such that it lasts until program exit. During program exit, the last remaining thread calls the destructor. If so, no other threads are executing, and thus no locking is required. So for both performance and pragmatic reasons, locking in the destructor is omitted here.

Locking is slow, and in the code above every call to instance() must take the lock. Can performance be improved by skipping the lock in some cases? The result of trying to do so is the "double-checked locking" pattern.


	static singleton& instance()
	{
		if (!pinstance_)
		{
			scoped_lock lock(mut_);
	
			/* Create it if it doesn't exist. */
			if (!pinstance_)
			{
				pinstance_ = new singleton;
			}
		}
		
		return *pinstance_;
	}

Unfortunately, the double-checked locking technique is broken by design. To see this well known fact, we need to move to a lower level. The problem lies with the pinstance_ = new singleton; statement, and with the fact that new has two tasks to do. The first is to allocate memory, the second is to construct the object in that memory. There is no guarantee at which point the assignment to pinstance_ will happen relative to the construction. The generated code may allocate the memory, let pinstance_ point to it, and only then construct the object. The problem is that the unlocked first check of pinstance_ in another thread can then succeed even though the singleton object isn't fully constructed yet.
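In effect, the compiler is free to behave as if the single statement had been split up like this (a sketch of one legal ordering, not code anyone would write by hand):

/* pinstance_ = new singleton; may effectively become: */
void *raw = operator new(sizeof(singleton));	/* 1: allocate              */
pinstance_ = static_cast<singleton *>(raw);	/* 2: publish the pointer   */
new (pinstance_) singleton;			/* 3: construct the object  */

Another thread performing the unlocked check between steps 2 and 3 sees a non-null pinstance_ and happily uses an object that has not been constructed.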

The only way to fix this problem portably is to move back to the original method of locking on every access. The obvious trick of doing

singleton *ptemp = new singleton;
pinstance_ = ptemp;

does not work. This is because the compiler's optimizer may notice that ptemp is never used after the assignment and eliminate it, resulting in code that is identical to what we had before. There are other tricks of course: one might try using the volatile qualifier, or putting fragments of code in different compilation units. However, for a sufficiently advanced compiler these also fail, because the optimizer may transform them back into the original buggy code.

To truly fix this, we need to use memory barriers. We need to make sure that the singleton is fully constructed before its address is published in pinstance_. This can be done via a write barrier (which is of course non-portable):

singleton *ptemp = new singleton;
write_barrier();
pinstance_ = ptemp;

Assuming we are on an x86 variant, which assembly instruction does write_barrier() correspond to? The one that sticks out is the sfence instruction, which makes sure that all stores have become globally visible before continuing. However, this isn't the whole story. There are two kinds of store on the x86 architecture: normal memory writes, which the hardware already keeps in program order with respect to each other, and non-temporal (streaming) writes, which are weakly ordered. Only the second of these actually needs the sfence instruction, and non-temporal writes are extremely rare. Since code that uses such non-temporal writes will follow them with a fence anyway, it turns out that no instruction at all is required here.
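For reference, the rare case where sfence genuinely matters looks something like the following sketch, written with the SSE intrinsics (assuming a compiler that provides <emmintrin.h>):

#include <emmintrin.h>

void publish(int *dst, const int *src, int n, volatile int *flag)
{
	/* Non-temporal stores bypass the cache and are weakly ordered... */
	for (int i = 0; i < n; i++)
	{
		_mm_stream_si32(&dst[i], src[i]);
	}
	
	/* ...so a store fence is needed before signalling that the data is ready. */
	_mm_sfence();
	
	*flag = 1;
}

For the plain stores used by the singleton code, none of this applies.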

On the x86 architecture, the write barrier therefore reduces to a compiler barrier. With gcc, such a barrier may be constructed via an asm statement which gcc is told can touch any memory:


static inline void barrier(void)
{
	asm volatile("" : : : "memory");
}
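
Putting the pieces together, the repaired double-checked version might look like the following sketch (using the same class members as before, and valid only where a compiler barrier is a sufficient write barrier, such as x86):

	static singleton& instance()
	{
		if (!pinstance_)
		{
			scoped_lock lock(mut_);
	
			/* Create it if it doesn't exist. */
			if (!pinstance_)
			{
				singleton *ptemp = new singleton;
				
				/* Make sure the object is fully constructed before
				   its address becomes visible to other threads. */
				barrier();
				
				pinstance_ = ptemp;
			}
		}
		
		return *pinstance_;
	}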

This isn't the final story, as there is one last optimization to make. Notice that during the lifetime of the singleton, the lock only ever needs to be taken once; it is only there to prevent two threads from creating two instances at the same time. Since we are already non-portable through the use of a barrier, can we replace the lock and lower memory usage? The answer is yes: it is possible to overlap the lock with the instance pointer itself, and use raw atomic instructions to implement the mutual exclusion. The result looks like:


	static singleton& instance()
	{
		while (1)
		{
			/* Already allocated? */
			if ((uintptr_t) pinstance_ > 1) break;
		
			/* Try to get "lock" if unallocated, and not locked */
			if (!pinstance_)
			{
				/* Try to change pinstance_ from 0->1 */
				if (__sync_bool_compare_and_swap(&pinstance_, 0, (singleton *) 1))
				{
					/* Success, create it */
					singleton *ptemp = new singleton;
					
					/* Make sure it is created */
					asm volatile("": : :"memory");
					
					/* Save value, and "unlock" */
					pinstance_ = ptemp;
					break;
				}
			}
			
			/* Don't burn too much cpu on hyperthreaded machines */
			asm volatile("rep; nop\n": : :"memory");
		}
		
		return *pinstance_;
	}

This code is equivalent to using a fast spin lock instead of a more heavy-weight mutex. Notice how it uses rep; nop (the pause instruction) to prevent excess resource usage on hyperthreaded cpus. That asm statement also tells the compiler that it may touch memory, even though it doesn't; this prevents gcc from caching the value of pinstance_ in a register and thus creating an unbreakable infinite loop. Note that in real code, the asm statements should probably be abstracted away into inline functions or macros for readability.
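For example, together with the barrier() helper shown earlier, the remaining asm might be wrapped like this (gcc-specific, and only a sketch of the kind of helper meant above):

/* Polite spin for the wait loop: the pause instruction, with a "memory"
   clobber so that pinstance_ is re-read on every iteration. */
static inline void cpu_relax(void)
{
	asm volatile("rep; nop\n" : : : "memory");
}

The creation path then calls barrier() before publishing pinstance_, and the wait loop simply calls cpu_relax(), leaving no inline asm in the singleton class itself.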

Comments

J. Cenpedes said...
You can avoid the lock altogether in multithreaded environments by simply ensuring the construction takes place at static runtime initialization (which is by definition single-threaded):

namespace {

class CreateSingleton
{
public:
	CreateSingleton() { singleton::instance(); }
};

CreateSingleton create_singleton;

}

You still get the advantage that the instance is created on first use, but you're also guaranteed it's been created before concurrency becomes a concern.
