Parallel programming depends on the atomic primitives available on the processor the program runs on. Some form of synchronization between threads of execution is required so that they do not interfere with each other when updating shared state. However, different primitives differ in power: simple atomic loads and stores are less powerful than an atomic compare-exchange instruction, for example.
By providing a wide range of atomic instructions, the x86 platform is well suited to implementing many parallel algorithms. Note that any algorithm may be implemented via a set of locks that prevent concurrent execution within critical sections. So technically, as long as we can construct locking primitives from the atomic instructions, we are covered. However, we would like to construct algorithms that execute faster than lock-based ones.
By allowing multiple threads to execute simultaneously and concurrently alter the same data structure, we can improve performance. The trick is to use lock-free and wait-free algorithms. The problem is that this is easier said than done: it can be extremely difficult to construct a new lock-free algorithm from the atomic primitives currently available. Thus adding new primitives can open up new possibilities.
An alternate method of improving parallel algorithms is to use transactional memory. This is the approach taken by AMD's proposed Advanced Synchronization Facility. In such a scheme, one tentatively makes changes to a data structure, and then atomically "commits" those changes if no other thread has interfered in the meantime. The advantage of such a scheme is that transactions can be nested, allowing parallel algorithms to be composed from smaller, simpler ones. In general, lock-based algorithms do not compose, because lock-ordering issues create the possibility of deadlock.
Unfortunately, transactional memory isn't a perfect solution for efficient parallel algorithms. Its problem is that progress isn't guaranteed. Threads may "live-lock" by contending for the same resources and interfering with each other's updates. Since a change may only be committed if there is no interference, no real work gets done as the changes are repeatedly rolled back. If transactions are composed, the probability of failure increases, since more memory locations need to be touched. Thus real-world transactional algorithms require hacks like exponential back-off in order to work.
Adding a transactional memory subsystem to the machine architecture is a difficult task. (Which is probably why the AMD proposal is still just a proposal, and not implemented in any machine you can buy today.) However, there are simpler atomic primitives that could be added which would also simplify the task of constructing lock-free algorithms.
The most common of these which hasn't been added to the x86 instruction set yet is the double-word compare and swap instruction (DCAS). This instruction compares two (not necessarily adjacent) words in memory with values in registers, and then, if both are equal, replaces them with two new values. In other words, it is an instruction that does twice as much work as the already existing single-word compare-exchange.
Compare-exchange instructions are powerful, and most lock-free algorithms use them for that reason. However, they have a subtle downside: the comparison can falsely succeed. This happens when the memory location compared against holds the right value, but the wrong semantic meaning within the data structure. This is known as the ABA problem. The memory location originally held the value A. Some other thread then modified that value to B; had the compare-exchange instruction seen this, the swap would not have occurred. However, the location is changed back to A before the compare happens. This new A doesn't necessarily carry the same meaning as the old one. If it doesn't, then the compare succeeds when it shouldn't, and bad things can happen.
A concrete example of the ABA problem occurs within lock-free algorithms implementing lists or stacks. If concurrent threads can add and remove nodes, then the topology of the data structure can be altered by an ABA situation. If this isn't handled correctly, nodes may be incorrectly disconnected. One way around this is to include a sequence number with every pointer that suffers from the ABA problem. If the sequence number is monotonically updated for every add or remove, then the new "A" value will no longer match, and the comparison will correctly fail.
Thus double-width (as opposed to double-word) compare and swaps are often needed. Unfortunately, the first machines on the x86-64 platform only have 64-bit compare and swaps, whereas 128-bit ones are required. Fortunately, a few spare bits are available within pointers, since the virtual address space is 48 bits in size rather than the full 64 bits. This leaves 16 bits for use as a counter on these particular machines. Unfortunately, that may not be enough: there is a 1 in 2^16 (about 65 thousand) chance of failure due to a chance coincidence of counter values.
What other (currently unimplemented) atomic primitives could be useful? One set which doesn't appear to have been discussed much are those based on "double dereference". Basically, a large part of the problem with constructing a new lock-free algorithm is that the simple act of following a pointer is extremely hard to get right: another thread may alter or free what the pointer refers to after you have read it, but before you can notify anyone else of your interest in it.
Thus the simple act of freeing memory from a lock-free data structure becomes quite difficult to do. This is why using garbage collection makes such algorithms so much easier to develop. Unfortunately, the extra communication that the garbage collection imposes acts as overhead that removes some of the performance gain obtained. We would thus like some way to do the above without any races.
This is where double-dereference helps. It allows a reference count to live within each freeable chunk of memory by removing the pointer-following race. A double-dereference addressing mode would follow a pointer stored in memory and apply the instruction's operation to the location that pointer refers to, with the pointer read and the operation performed as a single atomic step. Thus after the operation, you know the update was applied to whichever object the pointer referred to at that instant, even if the pointer has since been changed.
Using the above we can construct a reference-counting scheme where the reference count lives within the object being tracked. The resulting primitives would be an acquire operation that atomically follows the shared pointer and increments the count of the object it reaches, and a release operation that decrements the count, freeing the object when it hits zero.
By increasing the refcount every time a pointer is created that points to this object, and decreasing it every time a pointer is removed, we can safely delete the object when no more pointers remain. This obviously doesn't avoid the circularly-linked garbage problem inherent to refcounting schemes... but allows wait-free refcounting whereas the typical method requires some form of locking or non-scalable hazard pointers.
The fact that the double-dereference operations allow us to safely follow pointers allows very fast read and write locks to be written. In effect, we can use the pointer itself as a lock. If the pointer is NULL, then no one can follow it, and the data is exclusively owned by a writer. If the pointer is non-NULL, then multiple readers can read through it simultaneously.
The reader can access a counter that is incremented within the data structure with every write. If that counter is not altered through all the reads required, then no writer has changed anything. If that is the case, then the reader has successfully read the data. If not, then a re-read is required. Thus the above is best for the case where writing is rare, and there are many readers. Note how the readers do not have to synchronize with each other in the above algorithm, making it extremely fast.
Another possibility is that the writers do not need exclusive access. In that case, they can use an atomic exchange to update in a wait-free manner.
The safe atomic pointer-following allows the writer to safely free the old object obtained from the swap. Normally this isn't much of a problem, as the old object could simply be reused for the next update, but sometimes the writer algorithm may require an arbitrarily sized buffer. In that case, the exchange-based scheme is quite convenient.
Singly Linked List via Double-Dereference
We can use a slightly different type of double-dereference to implement a lock-free singly-linked list. This algorithm doesn't require counters for each node to prevent the ABA problem. It also doesn't require garbage collection to clean up deleted nodes, making it quite simple compared to a traditional lock-free list which requires spare bits within the pointers.
The extra instruction required is a variant of the atomic exchange. The trick here is that both the source and the destination are memory operands, as opposed to the normal version, which only allows a single memory reference.
Such a memory-to-memory exchange allows a very simple insert operation to be constructed.
This works even if concurrent insertions are occurring elsewhere in the list. Since the exchange operation is atomic, the list is always in a consistent state.
Deletion is similarly simple. Here we don't actually unlink anything; we just set the contents of the list node to a logically "deleted" state. A reference to the node is required... and perhaps a node-internal lock if other threads may be concurrently changing node-internal state.
The freeing of deleted nodes is handled as threads move their "cursors" along the list. Each cursor maintains a reference to the list node it points to. This prevents that node from disappearing, and the thread then trying to access possibly freed memory. The trickiest algorithm then is the one which moves a cursor along the list.
The cursor-advance operation scans forward from the current location, trying to remove any deleted nodes it finds. If removal fails, it reconnects the dead node and moves on. By using the safe pointer-following primitive, the cursor never dereferences a node that another thread has already freed.
It is possible to add new atomic primitives to the x86 instruction set. The AMD Advanced Synchronization Facility is one such proposed extension. Double-dereference instructions are another possibility. The best hypothetical extension would allow the construction of many wait-free algorithms and be free of live-lock. It would hopefully also simplify the construction of new lock-free algorithms. By providing more powerful primitives, it may be possible to remove much of the complexity, and perhaps even allow advanced lock-free data structures without requiring some form of garbage collection.
Copyright © Lockless Inc All Rights Reserved.