Concurrent Data Structures:Shared Counters and Fetch-and-φ Structures
Shared Counters and Fetch-and-φ Structures
Counters have been widely studied as part of a broader class of fetch-and-φ coordination structures, which support operations that fetch the current value of a location and apply some function from an allowable set φ to its contents. As discussed earlier, simple lock- based implementations of fetch-and-φ structures such as counters suffer from contention and sequential bottlenecks. Below we describe some approaches to overcoming these problems.
Combining
The combining tree technique was originally invented by Gottlieb et al. [37] to be used in the hardware switches of processor-to-memory networks. In Section 47.1.2 we discussed a software version of this technique, first described by Goodman et al. [36] and Yew et al. [137], for implementing a fetch-and-add counter. (The algorithm in [137] has a slight bug; see [51].)
This technique can also be used to implement fetch-and-φ operations for a variety of sets of combinable operations, including arithmetic and boolean operations, and synchronization operations such as load, store, swap, test-and-set, etc. [76].
As explained earlier, scalability is achieved by sizing the tree such that the there is one leaf node per thread. Under maximal load, the throughput of such a tree is proportional to O(P/ log P ) operations per time unit, offering a significant speedup. Though it is pos-
sible to construct trees with fan-out greater than two in order to reduce tree depth, that would sacrifice the simplicity of the nodes and, as shown by Shavit and Zemach [130], will most likely result in reduced performance. Moreover, Herlihy et al. [51] have shown that combining trees are extremely sensitive to changes in the arrival rate of requests: as the load decreases, threads must still pay the price of traversing the tree while attempting to combine, but the likelihood of combining is reduced because of the reduced load.
Shavit and Zemach overcome the drawbacks of the static combining tree structures by introducing combining funnels [130]. A combining funnel is a linearizable fetch-and-φ structure that allows combining trees to form dynamically, adapting its overall size based on load patterns. It is composed of a (typically small) number of combining layers. Each such layer is implemented as a collision array in memory. Threads pass through the funnel layer by layer, from the first (widest) to the last (narrowest). These layers are used by threads to locate each other and combine their operations. As threads pass through a layer, they read a thread ID from a randomly chosen array element, and write their own in its place. They then attempt to combine with the thread whose ID they read. A successful combination allows threads to exchange information, allowing some to continue to the next layer, and others to await their return with the resulting value. Combining funnels can also support the elimination technique (described in Section 47.3) to allow two operations to complete without accessing the central data structure in some cases.
Counting Networks
Combining structures provide scalable and linearizable fetch-and-φ operations. However, they are blocking. An alternative approach to parallelizing a counter that overcomes this problem is to have multiple counters instead of a single one, and to use a counting network to coordinate access to the separate counters so as to avoid problems such as duplicated or omitted values. Counting networks, introduced by Aspnes et al. [10], are a class of data structures for implementing, in a highly concurrent and nonblocking fashion, a restricted class of fetch-and-φ operations, the most important of which is fetch-and-inc.
Counting networks, like sorting networks [24], are acyclic networks constructed from simple building blocks called balancers. In its simplest form, a balancer is a computing element with two input wires and two output wires. Tokens arrive on the balancer’s input wires at arbitrary times, and are output on its output wires in a balanced way. Given a stream of input tokens, a balancer alternates sending one token to the top output wire, and one to the bottom, effectively balancing the number of tokens between the two wires.
We can wire balancers together to form a network. The width of a network is its number of output wires (wires that are not connected to an input of any balancer). Let y0, .., yw−1 respectively represent the number of tokens output on each of the output wires of a network of width w. A counting network is an acyclic network of balancers whose outputs satisfy the following step property:
In any quiescent state, 0 ≤ yi − yj ≤ 1 for any i < j.
Figure 47.5 shows a sequence of tokens traversing a counting network of width four based on Batcher’s Bitonic sorting network structure [13]. The horizontal lines are wires and the vertical lines are balancers, each connected to two input and output wires at the dotted points. Tokens (numbered 1 through 5) traverse the balancers starting on arbitrary input wires and accumulate on specific output wires meeting the desired step-property. Aspnes et al. [10] have shown that every counting network has a layout isomorphic to a sorting network, but not every sorting network layout is isomorphic to a counting network.
On a shared memory multiprocessor, balancers are records, and wires are pointers from one record to another. Threads performing increment operations traverse the data structure from some input wire (either preassigned or chosen at random) to some output wire, each time shepherding a new token through the network.
The counting network distributes input tokens to output wires while maintaining the step property stated above. Counting is done by adding a local counter to each output wire, so that tokens coming out of output wire i are assigned numbers i, i + w,...,i + (yi − 1)w. Because threads are distributed across the counting network, there is little contention on the balancers, and the even distribution on the output wires lowers the load on the shared counters. However, as shown by Shavit and Zemach [128], the dynamic patterns through the networks increase cache miss rates and make optimized layout almost impossible.
There is a significant body of literature on counting networks, much of which is surveyed by Herlihy and Busch [22]. An empirical comparison among various counting techniques can be found in [51]. Aharonson and Attiya [4] and Felten et al. [31] study counting networks with arbitrary fan-in and fan-out. Shavit and Touitou [126] show how to perform decrements on counting network counters by introducing the notion of “anti-tokens” and elimination. Busch and Mavronicolas [23] provide a combinatorial classification of the various properties of counting networks. Randomized counting networks are introduced by Aiello et al. [5] and fault-tolerant networks are presented by Riedel and Bruck [117].
The classical counting network structures in the literature are lock-free but not lineariz- able, they are only quiescently consistent. Herlihy et al. [56] show the tradeoffs involved in making counting networks linearizable.
Klugerman and Plaxton present an optimal log w-depth counting network [72]. How-ever, this construction is not practical, and all practical counting network implementations have log2 w depth. Shavit and Zemach introduce diffracting trees [128], improved count- ing networks made of balancers with one input and two output wires laid out as a binary tree. The simple balancers of the counting network are replaced by more sophisticated diffracting balancers that can withstand high loads by using a randomized collision array approach, yielding lower depth counting networks with significantly improved throughput. An adaptive diffracting tree that adapts its size to load is presented in [26].
Comments
Post a Comment