Concurrent Data Structures:Stacks and Queues
Stacks and Queues
Stacks and queues are among the simplest sequential data structures. Numerous issues arise in designing concurrent versions of these data structures, clearly illustrating the challenges involved in designing data structures for shared-memory multiprocessors.
Stacks
A concurrent stack is a data structure linearizable to a sequential stack that provides push and pop operations with the usual LIFO semantics. Various alternatives exist for the behavior of these data structures in full or empty states, including returning a special value indicating the condition, raising an exception, or blocking.
Michael and Scott present several linearizable lock-based concurrent stack implementations: they are based on sequential linked lists with a top pointer and a global lock that controls access to the stack [102]. They typically scale poorly because even if one reduces contention on the lock, the top of the stack is a sequential bottleneck. Combining funnels [130] have been used to implement a linearizable stack that provides parallelism under high load. As with all combining structures, it is blocking, and it has a high overhead which makes it unsuitable for low loads.
Treiber [133] was the first to propose a lock-free concurrent stack implementation. He represented the stack as a singly-linked list with a top pointer and used CAS to modify the value of the top pointer atomically. Michael and Scott [102] compare the performance of Treiber’s stack to an optimized nonblocking algorithm based on Herlihy’s methodology [50], and several lock-based stacks (such as an MCS lock [96]) in low load situations. They concluded that Treiber’s algorithm yields the best overall performance, and that this per- formance gap increases as the degree of multiprogramming grows. However, because the top pointer is a sequential bottleneck, even with an added backoff mechanism to reduce contention, the Treiber stack offers little scalability as concurrency increases [47].
Hendler et al. [47] observe that any stack implementation can be made more scalable using the elimination technique of Shavit and Touitou [126]. Elimination allows pairs of operations with reverse semantics—like pushes and pops on a stack—to complete without any central coordination, and therefore substantially aids scalability. The idea is that if a pop operation can find a concurrent push operation to “partner” with, then the pop operation can take the push operation’s value, and both operations can return immediately. The net effect of each pair is the same as if the push operation was followed immediately by the pop operation, in other words, they eliminate each other’s effect on the state of the stack. Elimination can be achieved by adding a collision array from which each operation chooses a location at random, and then attempts to coordinate with another operation that concurrently chose the same location [126]. The number of eliminations grows with concurrency, resulting in a high degree of parallelism. This approach, especially if the collision array is used as an adaptive backoff mechanism on the shared stack, introduces a high degree of parallelism with little contention [47], and delivers a scalable lock-free linearizable stack.
There is a subtle point in the Treiber stack used in the implementations above that is typical of many CAS-based algorithms. Suppose several concurrent threads all attempt a pop operation that removes the first element, located in some node “A,” from the list by using a CAS to redirect the head pointer to point to a previously-second node “B.” The problem is that it is possible for the list to change completely just before a particular pop operation attempts its CAS, so that by the time it does attempt it, the list has the node “A” as the first node as before, but the rest of the list including “B” is in a completely different order. This CAS of the head pointer from “A” to “B” may now succeed, but “B” might be anywhere in the list and the stack will behave incorrectly. This is an instance of the “ABA” problem [110], which plagues many CAS-based algorithms. To avoid this problem, Treiber augments the head pointer with a version number that is incremented every time the head pointer is changed. Thus, in the above scenario, the changes to the stack would cause the CAS to fail, thereby eliminating the ABA problem.§
Queues
A concurrent queue is a data structure linearizable to a sequential queue that provides
enqueue and dequeue operations with the usual FIFO semantics.
Michael and Scott [102] present a simple lock-based queue implementation that improves on the naive single-lock approach by having separate locks for the head and tail pointers of a linked-list-based queue. This allows an enqueue operation to execute in parallel with a dequeue operation (provided we avoid false sharing by placing the head and tail locks in separate cache lines). This algorithm is quite simple, with one simple trick: a “dummy” node is always in the queue, which allows the implementation to avoid acquiring both the head and tail locks in the case that the queue is empty, and therefore it avoids deadlock.
It is a matter of folklore that one can implement an array-based lock-free queue for a single enqueuer thread and a single dequeuer thread using only load and store operations [81]. A linked-list-based version of this algorithm appears in [46]. Herlihy and Wing [57] present a lock-free array-based queue that works if one assumes an unbounded size array. A survey in [102] describes numerous flawed attempts at devising general (multiple enqueuers, multiple dequeuers) nonblocking queue implementations. It also discusses some correct implementations that involve much more overhead than the ones discussed below.
Michael and Scott [102] present a linearizable CAS-based lock-free queue with parallel access to both ends. The structure of their algorithm is very simple and is similar to the two-lock algorithm mentioned above: it maintains head and tail pointers, and always keeps a dummy node in the list. To avoid using a lock, the enqueue operation adds a new node to the end of the list using CAS, and then uses CAS to update the tail pointer to reflect the addition. If the enqueue is delayed between these two steps, another enqueue operation can observe the tail pointer “lagging” behind the end of the list. A simple helping technique [50] is used to recover from this case, ensuring that the tail pointer is always behind the end of the list by at most one element.
While this implementation is simple and efficient enough to be used in practice, it does have a disadvantage. Operations can access nodes already removed from the list, and therefore the nodes cannot be freed. Instead, they are put into a freelist —a list of nodes stored for reuse by future enqueue operations—implemented using Treiber’s stack. This use of a freelist has the disadvantage that the space consumed by the nodes in the freelist cannot be freed for arbitrary reuse. Herlihy et al. [52] and Michael [100] have presented nonblocking memory management techniques that overcome this disadvantage.
It is interesting to note that the elimination technique is not applicable to queues: we cannot simply pass a value from an enqueue operation to a concurrent dequeue operation, because this would not respect the FIFO order with respect to other values in the queue.
Deques
A concurrent double-ended queue (deque) is a linearizable concurrent data structure that generalizes concurrent stacks and queues by allowing pushes and pops at both ends [73].
(See Chapter 2 for an introduction to stacks, queues and deques.)
As with queues, implementations that allow operations on both ends to proceed in parallel without interfering with each other are desirable.
Lock-based deques can be implemented easily using the same two-lock approach used for queues. Given the relatively simple lock-free implementations for stacks and queues, it is somewhat surprising that there is no known lock-free deque implementation that allows concurrent operations on both ends. Martin et al. [95] provide a summary of concurrent deque implementations, showing that, even using nonconventional two-word synchronization primitives such as double-compare-and-swap (DCAS) [106], it is difficult to design a lock- free deque. The only known nonblocking deque implementation for current architectures that supports noninterfering operations at opposite ends of the deque is an obstruction-free CAS-based implementation due to Herlihy et al. [53].
Comments
Post a Comment