Concurrent Data Structures: Designing Concurrent Data Structures

Designing Concurrent Data Structures

Several features of shared-memory multiprocessors make concurrent data structures significantly more difficult to design and to verify as correct than their sequential counterparts.

The primary source of this additional difficulty is concurrency: Because threads are executed concurrently on different processors, and are subject to operating system scheduling decisions, page faults, interrupts, etc., we must think of the computation as completely asynchronous, so that the steps of different threads can be interleaved arbitrarily. This significantly complicates the task of designing correct concurrent data structures.

Designing concurrent data structures for multiprocessor systems also provides numerous challenges with respect to performance and scalability. On today’s machines, the layout of processors and memory, the layout of data in memory, and the communication load on the various elements of the multiprocessor architecture all influence performance. Furthermore, the issues of correctness and performance are closely tied to each other: algorithmic enhancements that seek to improve performance often make it more difficult to design and verify a correct data structure implementation.

The following example illustrates various features of multiprocessors that affect concurrent data structure design. Suppose we wish to implement a shared counter data structure that supports a fetch-and-inc operation that adds one to the counter and returns the value of the counter immediately before the increment. A trivial sequential implementation of the fetch-and-inc operation contains code like that shown on the left in Figure 47.1. If we allow concurrent invocations of the fetch-and-inc operation by multiple threads, this implementation does not behave correctly. To see why, observe that most compilers will translate this source code into machine instructions that load X into a register, then add one to that register, then store that register back to X. Suppose that the counter is initially 0, and two fetch-and-inc operations execute on different processors concurrently. Then there is a risk that both operations read 0 from X, and therefore both store back 1 and return 0. This is clearly incorrect: one of the operations should return 1.
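Since Figure 47.1 is not reproduced here, the following is a minimal sketch of the trivial sequential implementation the text describes; the variable X and the operation name follow the discussion above.

int X = 0;

int fetch_and_inc(void) {
    /* Compiles to three separate steps: load X, add one, store X.       */
    /* Two concurrent callers can both load the same value, both store   */
    /* the same incremented value, and both return it, losing an update. */
    return X++;
}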

The incorrect behavior described above results from a “bad” interleaving of the steps of the two fetch-and-inc operations. A natural and common way to prevent such interleavings is to use a mutual exclusion lock (also known as a mutex or a lock). A lock is a construct that, at any point in time, is unowned or is owned by a single thread. If a thread t1 wishes to acquire ownership of a lock that is already owned by another thread t2, then t1 must wait until t2 releases ownership of the lock.

We can obtain a correct concurrent implementation of the fetch-and-inc operation by using a lock as shown on the right in Figure 47.1. With this arrangement, we prevent the bad interleavings by preventing all interleavings. While it is easy to achieve a correct shared counter this way, this simplicity comes at a price: Locking introduces a host of problems related to both performance and software engineering.
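Again as a sketch only (the right-hand code of Figure 47.1 is not shown), the lock-based version might look as follows; a POSIX mutex stands in for the abstract lock described above.

#include <pthread.h>

static int X = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

int fetch_and_inc(void) {
    pthread_mutex_lock(&lock);    /* exclude all other fetch-and-inc calls */
    int old = X;                  /* the value immediately before the increment */
    X = old + 1;
    pthread_mutex_unlock(&lock);
    return old;
}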

Performance

The speedup of an application when run on P processors is the ratio of its execution time on a single processor to its execution time on P processors. It is a measure of how effectively the application is utilizing the machine it is running on. Ideally, we want linear speedup: we would like to achieve a speedup of P when using P processors. Data structures whose speedup grows with P are called scalable. In designing scalable data structures we must take care: naive approaches to synchronization can severely undermine scalability.

Returning to the lock-based counter, observe that the lock introduces a sequential bottleneck: at any point in time, at most one fetch-and-inc operation is doing useful work, i.e., incrementing the variable X. Such sequential bottlenecks can have a surprising effect on the speedup one can achieve. The effect of the sequentially executed parts of the code on performance is illustrated by a simple formula based on Amdahl’s law [109]. Let b be the fraction of the program that is subject to a sequential bottleneck. If the program takes 1 time unit when executed on a single processor, then on a P-way multiprocessor the sequential part takes b time units, and the concurrent part takes (1 − b)/P time units in the best case, so the speedup S is at most 1/(b + (1 − b)/P). This implies that if just 10% of our application is subject to a sequential bottleneck, the best possible speedup we can achieve on a 10-way machine is about 5.3: we are running the application at half of the machine’s capacity. Reducing the number and length of sequentially executed code sections is thus crucial to performance. In the context of locking, this means reducing the number of locks acquired, and reducing lock granularity, a measure of the number of instructions executed while holding a lock.
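Restating the bound from Amdahl’s law in LaTeX form, together with the worked example above:

S \le \frac{1}{\,b + \frac{1-b}{P}\,},
\qquad
b = 0.1,\; P = 10
\;\Longrightarrow\;
S \le \frac{1}{0.1 + 0.09} \approx 5.3.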

A second problem with our simple counter implementation is that it suffers from memory contention: an overhead in traffic in the underlying hardware as a result of multiple threads concurrently attempting to access the same locations in memory. Contention can be appreciated only by understanding some aspects of common shared-memory multiprocessor architectures. If the lock protecting our counter is implemented in a single memory location, as many simple locks are, then in order to acquire the lock, a thread must repeatedly attempt to modify that location. On a cache-coherent multiprocessor, for example, exclusive ownership of the cache line containing the lock must be repeatedly transferred from one processor to another; this results in long waiting times for each attempt to modify the location, and is further exacerbated by the additional memory traffic associated with unsuccessful attempts to acquire the lock. In Section 47.1.7, we discuss lock implementations that are designed to avoid such problems for various types of shared memory architectures.

A third problem with our lock-based implementation is that, if the thread that currently holds the lock is delayed, then all other threads attempting to access the counter are also delayed. This phenomenon is called blocking, and is particularly problematic in multiprogrammed systems, in which there are multiple threads per processor and the operating system can preempt a thread while it holds the lock. For many data structures, this difficulty can be overcome by devising nonblocking algorithms in which the delay of a thread does not cause the delay of others. By definition, these algorithms cannot use locks.

Below we continue with our shared counter example, discussing blocking and nonblocking techniques separately; we introduce more issues related to performance as they arise.

Blocking Techniques

In many data structures, the undesirable effects of memory contention and sequential bottlenecks discussed above can be reduced by using a fine-grained locking scheme. In fine-grained locking, we use multiple locks of small granularity to protect different parts of the data structure. The goal is to allow concurrent operations to proceed in parallel when they do not access the same parts of the data structure. This approach can also help to avoid excessive contention for individual memory locations. For some data structures, this happens naturally; for example, in a hash table, operations concerning values that hash to different buckets naturally access different parts of the data structure.

For other structures, such as the lock-based shared counter, it is less clear how contention and sequential bottlenecks can be reduced because—abstractly—all operations modify the same part of the data structure. One approach to dealing with contention is to spread operations out in time, so that each operation accesses the counter in a separate time interval from the others. One widely used technique for doing so is called backoff [3]. However, even with reduced contention, our lock-based shared counter still lacks parallelism, and is therefore not scalable.

Fortunately, more sophisticated techniques can improve scalability. One approach, known as a combining tree [36, 37, 51, 137], can help implement a scalable counter. This approach employs a binary tree with one leaf per thread. The root of the tree contains the actual counter value, and the other internal nodes of the tree are used to coordinate access to the root. The key idea is that threads climb the tree from the leaves towards the root, attempting to “combine” with other concurrent operations. Every time the operations of two threads are combined in an internal node, one of those threads—call it the loser—simply waits at that node until a return value is delivered to it. The other thread—the winner—proceeds towards the root carrying the sum of all the operations that have combined in the subtree rooted at that node; a winner thread that reaches the root of the tree adds its sum to the counter in a single addition, thereby effecting the increments of all of the combined operations. It then descends the tree distributing a return value to each waiting loser thread with which it previously combined. These return values are distributed so that the effect is as if all of the increment operations were executed one after the other at the moment the root counter was modified.

The technique losers employ while waiting for winners in the combining tree is crucial to its performance. A loser operation waits by repeatedly reading a memory location in a tree node; this is called spinning. An important consequence in a cache-coherent multiprocessor is that this location will reside in the local cache of the processor executing the loser operation until the winner operation reports the result. This means that the waiting loser does not generate any unnecessary memory traffic that may slow down the winner’s performance. This kind of waiting is called local spinning, and has been shown to be crucial for scalable performance [96].

In so-called non-uniform memory access (NUMA) architectures, processors can access their local portions of shared memory faster than they can access the shared memory portions of other processors. In such architectures, data layout—the way nodes of the combining tree are placed in memory—can have a significant impact on performance. Performance can be improved by locating the leaves of the tree near the processors running the threads that own them. (We assume here that threads are statically bound to processors.)

Data layout issues also affect the design of concurrent data structures for cache-coherent multiprocessors. Recall that one of the goals of the combining tree is to reduce contention for individual memory locations in order to improve performance. However, because cache-coherent multiprocessors manage memory in cache-line-sized chunks, if two threads are accessing different memory locations that happen to fall into the same cache line, performance suffers just as if they had been accessing the same memory location. This phenomenon is known as false sharing, and is a common source of perplexing performance problems.
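A common remedy, sketched below under the assumption of a 64-byte cache line (real line sizes vary), is to pad or align each independently accessed item so that it occupies its own cache line; the names used here are illustrative.

#include <stdalign.h>

#define CACHE_LINE  64
#define NUM_THREADS 8     /* assumed bound on the number of threads */

struct padded_counter {
    alignas(CACHE_LINE) long value;   /* each counter starts its own cache line */
};

/* sizeof(struct padded_counter) rounds up to CACHE_LINE, so no two
 * per-thread counters share a cache line and false sharing is avoided. */
struct padded_counter per_thread[NUM_THREADS];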

By reducing contention for individual memory locations, reducing memory traffic by using local spinning, and allowing operations to proceed in parallel, counters implemented using combining trees scale with the number of concurrent threads much better than the single lock version does [51]. If all threads manage to repeatedly combine, then a tree of width P will allow P threads to return P values after every O(log P ) operations required to ascend and descend the tree, offering a speedup of O(P/ log P ) (but see Section 47.1.4).

Despite the advantages of the combining tree approach, it also has several disadvantages. It requires a known bound P on the number of threads that access the counter, and it requires O(P) space. While it provides better throughput under heavy loads, that is, when accessed by many concurrent threads, its best-case performance under low loads is poor: It must still traverse O(log P) nodes in the tree, whereas a fetch-and-inc operation of the single-lock-based counter completes in constant time. Moreover, if a thread fails to combine because it arrived at a node immediately after a winner left it on its way up the tree, then it must wait until the winner returns before it can continue its own ascent up the tree. The coordination among ascending winners, losers, and ascending late threads, if handled incorrectly, may lead to deadlocks: threads may block each other in a cyclic fashion such that none ever makes progress. Avoiding deadlocks significantly complicates the task of designing correct and efficient implementations of blocking concurrent data structures.

In summary, blocking data structures can provide powerful and efficient implementations if a good balance can be struck between using enough blocking to maintain correctness and minimizing blocking so that concurrent operations can proceed in parallel.

Nonblocking Techniques

As discussed earlier, nonblocking implementations aim to overcome the various problems associated with the use of locks. To formalize this idea, various nonblocking progress conditions—such as wait-freedom [49, 82], lock-freedom [49], and obstruction-freedom [53]—have been considered in the literature. A wait-free operation is guaranteed to complete after a finite number of its own steps, regardless of the timing behavior of other operations. A lock-free operation guarantees that after a finite number of its own steps, some operation completes. An obstruction-free operation is guaranteed to complete within a finite number of its own steps after it stops encountering interference from other operations.

Clearly, wait-freedom is a stronger condition than lock-freedom, and lock-freedom in turn is stronger than obstruction-freedom. However, all of these conditions are strong enough to preclude the use of blocking constructs such as locks. While stronger progress conditions seem desirable, implementations that make weaker guarantees are generally simpler, more efficient in the common case, and easier to design and to verify as correct. In practice, we can compensate for the weaker progress conditions by employing backoff [3] or more sophisticated contention management techniques [54].

Let us return to our shared counter. It follows easily from results of Fischer et al. [32] (extended to shared memory by Herlihy [49] and Loui and Abu-Amara [92]) that such a shared counter cannot be implemented in a lock-free (or wait-free) manner using only load and store instructions to access memory. These results show that, in any proposed implementation, a bad interleaving of operations can cause incorrect behavior. These bad interleavings are possible because the load and store are separate operations. This problem can be overcome by using a hardware operation that atomically combines a load and a store. Indeed, all modern multiprocessors provide such synchronization instructions, the most common of which are compare-and-swap (CAS) [61, 63, 136] and load-linked/store-conditional (LL/SC) [62, 69, 131]. The semantics of the CAS instruction is shown in Figure 47.2. For purposes of illustration, we assume an atomically keyword which requires the code block it labels to be executed atomically, that is, so that no thread can observe a state in which the code block has been partially executed. The CAS operation atomically loads from a memory location, compares the value read to an expected value, and stores a new value to the location if the comparison succeeds. Herlihy [49] showed that instructions such as CAS and LL/SC are universal: there exists a wait-free implementation for any concurrent data structure in a system that supports such instructions.
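Since Figure 47.2 is not reproduced here, the following sketch restates the CAS semantics just described in C-like form; the comments mark where the chapter’s illustrative atomically keyword would apply (the hardware executes the whole block as one indivisible step).

#include <stdbool.h>

bool CAS(int *addr, int expected_value, int new_value) {
    /* atomically { */
    if (*addr == expected_value) {
        *addr = new_value;     /* store the new value only if the load matched */
        return true;
    }
    return false;
    /* } */
}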

A simple lock-free counter can be implemented using CAS. The idea is to perform the fetch-and-inc by loading the counter’s value and then using CAS to atomically change the counter to a value one greater than the value read. The CAS fails to increment the counter only if the counter value changes between the load and the CAS. In this case, the operation can retry, as the failing CAS had no effect. Because the CAS fails only as a result of another fetch-and-inc operation succeeding, the implementation is lock-free. However, it is not wait-free because a single fetch-and-inc operation can repeatedly fail its CAS.
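A sketch of this lock-free counter using C11 atomics is shown below; atomic_compare_exchange_strong plays the role of CAS and, on failure, reloads the current counter value into old, so the loop simply retries.

#include <stdatomic.h>

static atomic_int X = 0;

int fetch_and_inc(void) {
    int old = atomic_load(&X);
    /* Each CAS failure means some other fetch-and-inc succeeded, so the
     * implementation is lock-free; a single caller may nevertheless retry
     * indefinitely, so it is not wait-free. */
    while (!atomic_compare_exchange_strong(&X, &old, old + 1)) {
        /* old now holds the value that beat us; try again from there */
    }
    return old;   /* the value immediately before our increment */
}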

The above example illustrates an optimistic approach to synchronization: the fetch-and-inc operation completes quickly in the hopefully common case in which it does not encounter interference from a concurrent operation, but must employ more expensive techniques under contention (e.g., backoff).

While the lock-free counter described above is simple, it has many of the same disadvantages that the original counter based on a single lock has: a sequential bottleneck and high contention for a single location. It is natural to attempt to apply similar techniques as those described above in order to improve the performance of this simple implementation. However, it is usually more difficult to incorporate such improvements into nonblocking implementations of concurrent data structures than blocking ones. Roughly, the reason for this is that a thread can use a lock to prevent other threads from “interfering” while it performs some sequence of actions. Without locks, we have to design our implementations to be correct despite the actions of concurrent operations; in current architectures, this often leads to the use of complicated and expensive techniques that undermine the improvements we are trying to incorporate. As discussed further in Section 47.1.7, transactional mechanisms make it much easier to design and modify efficient implementations of complex concurrent data structures. However, hardware support for such mechanisms does not yet exist.

Complexity Measures

A wide body of research is directed at analyzing the asymptotic complexity of concurrent data structures and algorithms in idealized models such as parallel random access machines [35, 121, 134]. However, there is less work on modeling such data structures in a realistic multiprocessor setting. There are many reasons for this, most of which have to do with the interplay of the architectural features of the hardware and the asynchronous execution of threads. Consider the combining tree example. Though we argued a speedup of O(P/ log P ) by counting instructions, this is not reflected in empirical studies [51, 128]. Real-world behavior is dominated by other features discussed above, such as cost of contention, cache behavior, cost of universal synchronization operations (e.g. CAS), arrival rates of requests, effects of backoff delays, layout of the data structure in memory, and so on. These factors are hard to quantify in a single precise model spanning all current architectures. Complexity measures that capture some of these aspects have been proposed by Dwork et al. [28] and by Anderson and Yang [7]. While these measures provide useful insight into algorithm design, they cannot accurately capture the effects of all of the factors listed above. As a result, concurrent data structure designers compare their designs empirically by running them using micro-benchmarks on real machines and simulated architectures [9, 51, 96, 102]. In the remainder of this chapter, we generally qualify data structures based on their empirically observed behavior and use simple complexity arguments only to aid intuition.

Correctness

It is easy to see that the behavior of the simple lock-based counter is “the same” as that of the sequential implementation. However, it is significantly more difficult to see this is also true for the combining tree. In general, concurrent data structure implementations are often subtle, and incorrect implementations are not uncommon. Therefore, it is important to be able to state and prove rigorously that a particular design correctly implements the required concurrent data structure. We cannot hope to achieve this without a precise way of specifying what “correct” means.

Data structure specifications are generally easier for sequential data structures. For example, we can specify the semantics of a sequential data structure by choosing a set of states, and providing a transition function that takes as arguments a state, an operation name and arguments to the operation, and returns a new state and a return value for the operation. Together with a designated initial state, the transition function specifies all acceptable sequences of operations on the data structure. The sequential semantics of the counter is specified as follows: The set of states for the counter is the set of integers, and the initial state is 0. The transition function for the fetch-and-inc operation adds one to the old state to obtain the new state, and the return value is the old state of the counter.
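The counter specification just described can be transcribed almost directly; the type and function names below are illustrative only.

typedef int counter_state;               /* the set of states: the integers */
static const counter_state initial = 0;  /* the designated initial state    */

/* Transition function for fetch-and-inc: the new state is the old state
 * plus one, and the return value is the old state. */
void fetch_and_inc_spec(counter_state s, counter_state *new_state, int *ret) {
    *new_state = s + 1;
    *ret = s;
}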

Operations on a sequential data structure are executed one-at-a-time in order, and we simply require that the resulting sequence of operations respects the sequential semantics specified as discussed above. However, with concurrent data structures, operations are not necessarily totally ordered. Correctness conditions for concurrent data structures generally require that some total order of the operations exists that respects the sequential semantics. Different conditions are distinguished by their different requirements on this total ordering. A common condition is Lamport’s sequential consistency [80], which requires that the total order preserves the order of operations executed by each thread. Sequential consistency has a drawback from the software engineering perspective: a data structure implemented using sequentially consistent components may not be sequentially consistent itself.

A natural and widely used correctness condition that overcomes this problem is Herlihy and Wing’s linearizability [57], a variation on the serializability [16] condition used for database transactions. Linearizability requires two properties: (1) that the data structure be sequentially consistent, and (2) that the total ordering which makes it sequentially consistent respect the real-time ordering among the operations in the execution. Respecting the real-time ordering means that if an operation O1 finishes execution before another operation O2 begins (so the operations are not concurrent with each other), then O1 must be ordered before O2. Another way of thinking of this condition is that it requires us to be able to identify a distinct point within each operation’s execution interval, called its linearization point, such that if we order the operations according to the order of their linearization points, the resulting order obeys the desired sequential semantics.

It is easy to see that the counter implementation based on a single lock is linearizable. The state of the counter is always stored in the variable X. We define the linearization point of each fetch-and-inc operation as the point at which it stores its incremented value to X. The linearizability argument for the CAS-based lock-free implementation is similarly simple, except that we use the semantics of the CAS instruction, rather than reasoning about the lock, to conclude that the counter is incremented by one each time it is modified.

For the combining tree, it is significantly more difficult to see that the implementation is linearizable because the state of the counter is incremented by more than one at a time, and because the increment for one fetch-and-inc operation may in fact be performed by another operation with which it has combined. We define the linearization points of fetch-and-inc operations on the combining-tree-based counter as follows. When a winner thread reaches the root of the tree and adds its accumulated value to the counter, we linearize each of the operations with which it has combined in sequence immediately after that point. The operations are linearized in the order of the return values that are subsequently distributed to those operations. While a detailed linearizability proof is beyond the scope of our presentation, it should be clear from this discussion that rigorous correctness proofs for even simple concurrent data structures can be quite challenging.

The intuitive appeal and modularity of linearizability makes it a popular correctness condition, and most of the concurrent data structure implementations we discuss in the remainder of this chapter are linearizable. However, in some cases, performance and scalability can be improved by satisfying a weaker correctness condition. For example, the quiescent consistency condition [10] drops the requirement that the total ordering of operations respects the real-time order, but requires that every operation executed after a quiescent state—one in which no operations are in progress—must be ordered after every operation executed before that quiescent state. Whether an implementation satisfying such a weak condition is useful is application-dependent. In contrast, a linearizable implementation is always usable, because designers can view it as atomic.

Verification Techniques

In general, to achieve a rigorous correctness proof for a concurrent data structure implementation, we need mathematical machinery for specifying correctness requirements, accurately modeling a concurrent data structure implementation, and ensuring that a proof that the implementation is correct is complete and accurate. Most linearizability arguments in the literature treat at least some of this machinery informally, and as a result, it is easy to miss cases, make incorrect inferences, etc. Rigorous proofs inevitably contain an inordinate amount of mundane detail about trivial properties, making them tedious to write and to read. Therefore, computer-assisted methods for verifying implementations are required. One approach is to use a theorem prover to prove a series of assertions which together imply that an implementation is correct. Another approach is to use model checking tools, which exhaustively check all possible executions of an implementation to ensure that each one meets specified correctness conditions. The theorem proving approach usually requires significant human insight, while model checking is limited by the number of states of an implementation that can be explored.

Tools of the Trade

Below we discuss three key types of tools one can use in designing concurrent data structures: locks, barriers, and transactional synchronization mechanisms. Locks and barriers are traditional low-level synchronization mechanisms that are used to restrict certain interleavings, making it easier to reason about implementations based on them. Transactional mechanisms seek to hide the tricky details of concurrency from programmers, allowing them to think in a more traditional sequential way.

Locks

As discussed earlier, locks are used to guarantee mutually exclusive access to (parts of) a data structure, in order to avoid “bad” interleavings. A key issue in designing a lock is what action to take when trying to acquire a lock already held by another thread. On uniprocessors, the only sensible option is to yield the processor to another thread. However, in multiprocessors, it may make sense to repeatedly attempt to acquire the lock, because the lock may soon be released by a thread executing on another processor. Locks based on this technique are called spinlocks. The choice between blocking and spinning is often a difficult one because it is hard to predict how long the lock will be held. When locks are supported directly by operating systems or threads packages, information such as whether the lock-holder is currently running can be used in making this decision.

A simple spinlock repeatedly uses a synchronization primitive such as compare-and-swap to atomically change the lock from unowned to owned. As mentioned earlier, if locks are not designed carefully, such spinning can cause heavy contention for the lock, which can have a severe impact on performance. A common way to reduce such contention is to use exponential backoff [3]. In this approach, which is illustrated in Figure 47.3, a thread that is unsuccessful in acquiring the lock waits some time before retrying; repeated failures cause longer waiting times, with the idea that threads will “spread themselves out” in time, resulting in lower contention and less memory traffic due to failed attempts.
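Since Figure 47.3 is not reproduced here, the following is a sketch in its spirit: a test-and-set spinlock with exponential backoff. The backoff cap and the use of sched_yield() as the waiting step are illustrative choices, not part of the original figure.

#include <stdatomic.h>
#include <sched.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void lock_acquire(void) {
    unsigned backoff = 1;
    /* test-and-set atomically changes the lock from unowned to owned and
     * reports whether it was already owned */
    while (atomic_flag_test_and_set(&lock_flag)) {
        for (unsigned i = 0; i < backoff; i++)
            sched_yield();              /* wait before retrying           */
        if (backoff < 1024)
            backoff *= 2;               /* repeated failures wait longer  */
    }
}

void lock_release(void) {
    atomic_flag_clear(&lock_flag);
}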

Exponential backoff has the disadvantage that the lock can be unowned, but all threads attempting to acquire it have backed off too far, so none of them is making progress. One way to overcome this is to have threads form a queue and have each thread that releases the lock pass ownership of the lock to the next queued thread. Locks based on this approach are called queuelocks. Anderson [8] and Graunke and Thakkar [38] introduce array-based queuelocks, and these implementations are improved upon by the list-based MCS queue locks of Mellor-Crummey and Scott [96] and the CLH queue locks of Craig and Landin and Hagersten [25, 93].

Threads using CLH locks form a virtual linked list of nodes, each containing a done flag; a thread enters the critical section only after the done flag of the node preceding its own node in the list is raised. The lock object has a pointer to the node at the tail of the list, the last one to join it. To acquire the lock, a thread creates a node, sets its done flag to false to indicate that it has not yet released the critical section, and uses a synchronization primitive such as CAS to place its own node at the tail of the list while determining the node of its predecessor. It then spins on the done flag of the predecessor node. Note that each thread spins on a different memory location. Thus, in a cache-based architecture, when a thread sets its done flag to inform the next thread in the queue that it can enter the critical section, the done flags on which all other threads are spinning are not modified, so those threads continue to spin on a local cache line, and do not produce additional memory traffic. This significantly reduces contention and improves scalability in such systems. However, if this algorithm is used in a non-coherent NUMA machine, threads will likely have to spin on remote memory locations, again causing excessive memory traffic. The MCS queuelock [96] overcomes this problem by having each thread spin on a done flag in its own node. This way, nodes can be allocated in local memory, eliminating the problem.
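A sketch of a CLH-style lock following this description is given below. An atomic exchange (swap) on the tail pointer stands in for the CAS mentioned above, memory management is simplified (nodes are allocated on acquire and the predecessor’s node is freed on release), and the thread-local handles assume each thread holds at most one such lock at a time.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct clh_node {
    atomic_bool done;              /* true once the owner has released   */
} clh_node;

typedef struct {
    _Atomic(clh_node *) tail;      /* the last node to join the queue    */
} clh_lock;

static _Thread_local clh_node *my_node, *my_pred;

void clh_init(clh_lock *l) {
    clh_node *dummy = malloc(sizeof *dummy);
    atomic_init(&dummy->done, true);            /* an already-released dummy node */
    atomic_init(&l->tail, dummy);
}

void clh_acquire(clh_lock *l) {
    my_node = malloc(sizeof *my_node);
    atomic_init(&my_node->done, false);         /* not yet released */
    my_pred = atomic_exchange(&l->tail, my_node);   /* join the tail, learn our predecessor */
    while (!atomic_load(&my_pred->done))
        ;                                       /* local spinning on the predecessor's flag */
}

void clh_release(clh_lock *l) {
    (void)l;
    atomic_store(&my_node->done, true);         /* let our successor proceed */
    free(my_pred);                              /* nobody references the predecessor's node now */
}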

There are several variations on standard locks that are of interest to the data structure designer in some circumstances. The queuelock algorithms discussed above have more advanced abortable versions that allow threads to give up on waiting to acquire the lock, for example, if they are delayed beyond some limit in a real-time application [122, 124], or if they need to recover from deadlock. The algorithms of [122] provide an abort that is nonblocking. Finally, [102] presents preemption-safe locks, which attempt to reduce the negative performance effects of preemption by ensuring that a thread that is in the queue but preempted does not prevent the lock from being granted to another running thread.

In many data structures it is desirable to have locks that allow concurrent readers. Such reader-writer locks allow threads that only read data in the critical section (but do not modify it) to access the critical section exclusively from the writers but concurrently with each other. Various algorithms have been suggested for this problem. The reader-writer queuelock algorithms of Mellor-Crummey and Scott [97] are based on MCS queuelocks and use read counters and a special pointer to writer nodes. In [75] a version of these algorithms is presented in which readers remove themselves from the lock’s queue. This is done by keeping a doubly-linked list of queued nodes, each having its own simple “mini-lock.” Readers remove themselves from the queuelock list by acquiring mini-locks of their neighboring nodes and redirecting the pointers of the doubly-linked list. The above-mentioned real-time queuelock algorithms of [122] provide a similar ability without locking nodes.
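For completeness, the reader-writer idiom itself looks as follows with the standard POSIX rwlock; this sketch illustrates only the interface, not the queuelock-based algorithms cited above, and the data and function names are illustrative.

#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int shared_data;

int reader(void) {
    pthread_rwlock_rdlock(&rw);     /* excludes writers, admits other readers */
    int v = shared_data;
    pthread_rwlock_unlock(&rw);
    return v;
}

void writer(int v) {
    pthread_rwlock_wrlock(&rw);     /* excludes both readers and writers */
    shared_data = v;
    pthread_rwlock_unlock(&rw);
}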

The reader-writer lock approach can be extended to arbitrarily many operation types through a construct called group mutual exclusion or room synchronization. The idea is that operations are partitioned into groups, such that operations in the same group can execute concurrently with each other, but operations in different groups must not. An interesting application of this approach separates push and pop operations on a stack into different groups, allowing significant simplifications to the implementations of those operations because they do not have to deal with concurrent operations of different types [18]. Group mutual exclusion was introduced by Joung [68]. Implementations based on mutual exclusion locks or fetch-and-inc counters appear in [18, 70].

More complete and detailed surveys of the literature on locks can be found in [6, 116].

Barriers

A barrier is a mechanism that collectively halts threads at a given point in their code, and allows them to proceed only when all threads have arrived at that point. The use of barriers arises whenever access to a data structure or application is layered into phases whose execution should not overlap in time. Examples include the repeated iterations of a numerical algorithm that converges by iterating a computation on the same data structure, and the marking and sweeping phases of a parallel garbage collector.

One simple way to implement a barrier is to use a counter that is initialized to the total number of threads: Each thread decrements the counter upon reaching the barrier, and then spins, waiting for the counter to become zero before proceeding. Even if we use the techniques discussed earlier to implement a scalable counter, this approach still has the problem that waiting threads produce contention. For this reason, specialized barrier implementations have been developed that arrange for each thread to spin on a different location [21, 48, 123]. Alternatively, a barrier can be implemented using a diffusing computation tree in the style of Dijkstra and Scholten [27]. In this approach, each thread is the owner of one node in a binary tree. Each thread awaits the arrival of its children, then notifies its parent that it has arrived. Once all threads have arrived, the root of the tree releases all threads by disseminating the release information down the tree.
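A sketch of the simple counter-based barrier described above follows; it is single-use (reuse requires a sense-reversing variant), and all waiting threads spin on the same counter, which is exactly the contention the specialized barriers cited above are designed to avoid.

#include <stdatomic.h>

typedef struct {
    atomic_int remaining;
} barrier_t;

void barrier_init(barrier_t *b, int nthreads) {
    atomic_init(&b->remaining, nthreads);
}

void barrier_wait(barrier_t *b) {
    atomic_fetch_sub(&b->remaining, 1);        /* announce our arrival        */
    while (atomic_load(&b->remaining) != 0)
        ;                                      /* spin until everyone arrives */
}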

Transactional Synchronization Mechanisms

A key use of locks in designing concurrent data structures is to allow threads to modify multiple memory locations atomically, so that no thread can observe partial results of an update to the locations. Transactional synchronization mechanisms are tools that allow the programmer to treat sections of code that access multiple memory locations as a single atomic step. This substantially simplifies the design of correct concurrent data structures because it relieves the programmer of the burden of deciding which locks should be held for which memory accesses and of preventing deadlock.

As an example of the benefits of transactional synchronization, consider Figure 47.4, which shows a concurrent queue implementation achieved by requiring operations of a simple sequential implementation to be executed atomically. Such atomicity could be ensured either by using a global mutual exclusion lock, or via a transactional mechanism. However, the lock-based approach prevents concurrent enqueue and dequeue operations from executing in parallel. In contrast, a good transactional mechanism will allow them to do so in all but the empty state because when the queue is not empty, concurrent enqueue and dequeue operations do not access any memory locations in common.
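Since Figure 47.4 is not reproduced here, the following is a sketch in its spirit: a plain sequential bounded queue whose enqueue and dequeue bodies are intended to execute as single atomic steps, whether via one global lock or a transactional mechanism. The array size and field names are illustrative. Note that when the queue is non-empty, enqueue touches only the tail and its slot while dequeue touches only the head and its slot, which is why a good transactional mechanism can run them in parallel.

#include <stdbool.h>

#define QSIZE 16

typedef struct {
    int items[QSIZE];
    int head, tail;       /* dequeue from head, enqueue at tail */
} queue_t;

/* Each function body below is assumed to run as one atomic step. */

bool enqueue(queue_t *q, int v) {
    if ((q->tail + 1) % QSIZE == q->head)
        return false;                          /* full  */
    q->items[q->tail] = v;
    q->tail = (q->tail + 1) % QSIZE;
    return true;
}

bool dequeue(queue_t *q, int *v) {
    if (q->head == q->tail)
        return false;                          /* empty */
    *v = q->items[q->head];
    q->head = (q->head + 1) % QSIZE;
    return true;
}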

The use of transactional mechanisms for implementing concurrent data structures is inspired by the widespread use of transactions in database systems. However, the problem of supporting transactions over shared memory locations is different from supporting transactions over database elements stored on disk. Thus, more lightweight support for transactions is possible in this setting.

Kung and Robinson’s optimistic concurrency control (OCC) [79] is one example of a transactional mechanism for concurrent data structures. OCC uses a global lock, which is held only for a short time at the end of a transaction. Nonetheless, the lock is a sequential bottleneck, which has a negative impact on scalability. Ideally, transactions should be supported without the use of locks, and transactions that access disjoint sets of memory locations should not synchronize with each other at all.

Transactional support for multiprocessor synchronization was originally suggested by Herlihy and Moss, who also proposed a hardware-based transactional memory mechanism for supporting it [55]. Recent extensions to this idea include lock elision [113, 114], in which the hardware automatically translates critical sections into transactions, with the benefit that two critical sections that do not in fact conflict with each other can be executed in parallel. For example, lock elision could allow concurrent enqueue and dequeue operations of the above queue implementation to execute in parallel, even if the atomicity is implemented using locks. To date, hardware support for transactional memory has not been built.

Various forms of software transactional memory have been proposed by Shavit and Touitou [127], Harris et al. [44], Herlihy et al. [54], and Harris and Fraser [43].

Transactional mechanisms can easily be used to implement most concurrent data structures, and when efficient and robust transactional mechanisms become widespread, this will likely be the preferred method. In the following sections, we mention implementations based on transactional mechanisms only when no direct implementation is known.
