Data Structures for Sets: The Disjoint Set Union-Find Problem

The Disjoint Set Union-Find Problem

In this section we cover the best-studied set problem: the disjoint set union-find problem. The study of this problem and its variants has generated dozens of papers; indeed, it would not be unreasonable to devote an entire chapter to this problem alone. Fortunately, much of the work on this problem took place from the 1960s through to the early 1990s, and this work is surveyed in an excellent article by Galil and Italiano [18]. In this section, we summarise the main results prior to 1991 but focus on developments since then.

We begin by formulating the problem in a recent, more general way [19] that fits better into our framework. We start with the base repertoire, but now require that insert(x, A) is only permitted if x ∉ B for all sets B ∈ F − {A}. This ensures that all sets in F are pairwise disjoint. We now add the following operations (a sketch of the resulting interface is given after the operation descriptions):

dunion(A, B, C) This sets C ← A ∪ B, but destroys (removes from F) A and B.

find(x) For any x ∈ ⋃_{A ∈ F} A, returns the name of the set that x is in.
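To make the interface concrete, the following is a naive Python sketch (class and method names are ours, chosen for illustration): F is represented as a dictionary from set names to Python sets, the disjointness restriction is enforced on insert, and delete(x, A) from the base repertoire is assumed simply to remove x from A. The sketch pins down the semantics only and makes no attempt to meet the time bounds discussed below.

class DisjointSetFamily:
    """Naive sketch of the generalized disjoint set union-find interface."""

    def __init__(self):
        self.sets = {}  # F: maps a set name to the set of its elements

    def create(self, name):
        # Add a new, empty set named `name` to F.
        self.sets[name] = set()

    def insert(self, x, a):
        # insert(x, A): only permitted if x is in no set B in F - {A}.
        if any(x in s for name, s in self.sets.items() if name != a):
            raise ValueError("x already belongs to another set in F")
        self.sets[a].add(x)

    def delete(self, x, a):
        # delete(x, A): remove x from A (assumed semantics of the base repertoire).
        self.sets[a].discard(x)

    def dunion(self, a, b, c):
        # dunion(A, B, C): set C <- A union B and destroy (remove from F) A and B.
        self.sets[c] = self.sets.pop(a) | self.sets.pop(b)

    def find(self, x):
        # find(x): return the name of the unique set in F containing x.
        for name, s in self.sets.items():
            if x in s:
                return name
        raise KeyError("x is not in any set of F")

Of course, in this naive representation insert and find take time proportional to the total size of F; the data structures surveyed below achieve the bounds of Theorems 33.5 and 33.6.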

This problem has a number of applications, beginning historically with the efficient implementation of EQUIVALENCE and COMMON statements in early FORTRAN compilers in the 1960s; it continues to find applications in diverse areas such as dynamic graph algorithms, meldable data structures and implementations of unification algorithms.

It is difficult to discuss the disjoint set union-find problem without reference to Ackermann's function, which is defined for integers i, n ≥ 0 as follows:

[Definition of Ackermann's function A(i, n) and of its functional inverse α.]
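For reference, one common variant of the definition is the following (the precise definitions of A and of its inverse α vary slightly between papers, but only in ways that do not affect the asymptotic bounds quoted in this section):

A(0, n) = n + 1 for n ≥ 0,
A(i, 0) = A(i − 1, 1) for i ≥ 1,
A(i, n) = A(i − 1, A(i, n − 1)) for i, n ≥ 1,

with a corresponding two-argument inverse, e.g. α(m, n) = min{ i ≥ 1 : A(i, ⌊m/n⌋) ≥ log₂ n } for m ≥ n ≥ 1. For all practical values of its arguments, α is at most 4 or 5.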

THEOREM 33.5

(i) An intermixed sequence of f find, m insert, m create, d ≤ m delete operations and at most m − 1 dunion operations takes O(m + (f + d)α(f + m, m)) time. The size of the data structure at any time during the sequence is proportional to the number of live items in it.

(ii) For any parameter k > 1, an intermixed sequence of find, dunion, insert, create, and delete operations can be processed in the following worst-case time bounds: find and delete in O(log m/ log k) time, create and insert in O(1) time and dunion in O(k) time, where m is the current number of elements in the data structure.

In fact, [20] showed that this problem could be reduced to the classical union-find problem (see below). This reduction is such that the time bounds for find and dunion change only by a constant factor, but the time to delete an element x is the same as the time it takes to find the set containing x plus the time it takes to unite a singleton set with this set. The bounds of Theorem 33.5 follow by applying this reduction, respectively, to the classical union-find data structures given in Theorem 33.6(i) and Theorem 33.6(ii) (in fact they use a variant of the latter result due to Smid [37]).

An interesting variant of the above problem was considered by the same authors in [19]. They replaced the find operation with the seemingly simpler:

bfind(x, A) Return ‘true’ if x ∈ A and ‘false’ otherwise.

This ‘boolean’ union-find problem was motivated by the problem of implementing meldable priority queues. Kaplan et al. showed that the lower bounds established for the classical union-find problem apply to this problem as well (see Theorem 33.7).

The Classical Union-Find Problem and Variants

In the classical version of this problem, we have U = {1, ..., m} and F initially equals {{1}, {2}, ..., {m}}; we assume that the name of the set {i} is simply i. The operation dunion is modified as follows:

dunion(A, B) This sets A ← A ∪ B, but destroys (removes from F) B.

The operation find(x) operates as described above. We now mention the main upper bounds that are known for the classical version of the problem:

THEOREM 33.6 A sequence of m − 1 dunion and f find operations can be processed as follows:

[Statement of bounds (i)–(iv).]

We begin by remarking that the bounds (i) and (ii) are in fact implied by Theorem 33.5, but are included here for completeness (see [18] for a more complete history).

The upper bound of Theorem 33.6(i) is due to Tarjan [39] (with many interesting refinements due to [41]). Note that since α(f + m, m) is essentially constant, the running time is linear for all practical purposes. In (i), each dunion takes O(1) worst-case time, but a single find may take Ω(log m) time. Result (iii) allows us to process each find quickly, in essentially constant time, but an individual dunion could be quite slow and may in fact take Ω(m) time. The overall cost of the sequence remains as in (i), though. This was explicitly proved in [27], but the main ideas were already present in [16]. In (iv), the time for a dunion is traded off against that for find, and the overall running time remains essentially linear (at least for k ≥ 2) [6].
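For concreteness, here is a minimal Python sketch of the standard forest representation with union by rank and path compression, which is the classical way of achieving the amortized bound of (i) (variable names are ours). Elements are taken to be 0, ..., m − 1, each set is named by the root of its tree, and dunion is given the current names (roots) of its two argument sets, so that it runs in O(1) worst-case time.

class UnionFind:
    """Classical union-find forest with union by rank and path compression."""

    def __init__(self, m):
        self.parent = list(range(m))  # parent[i] == i exactly when i is a root
        self.rank = [0] * m           # rank[i]: upper bound on the height of i's tree

    def find(self, x):
        # Walk up to the root, then compress the path so that every node
        # visited points directly to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def dunion(self, a, b):
        # a and b are assumed to be distinct set names, i.e. roots of their trees.
        # Link the root of smaller rank below the root of larger rank.
        if self.rank[a] < self.rank[b]:
            a, b = b, a
        self.parent[b] = a
        if self.rank[a] == self.rank[b]:
            self.rank[a] += 1
        return a  # name of the resulting set

Note that with union by rank the surviving root, and hence the name of the resulting set, may correspond to either argument; if the convention that the result is named A is to be followed literally, a separate table mapping names to roots can be maintained at no asymptotic cost.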

In either (i) or (iii), although the overall cost of the operation sequence is very low, an individual operation could take Ω(log m) or Ω(m) time, respectively. If we wish to minimise the maximum cost of a single operation over the sequence, then choosing (say) k = Θ(log m/ log log m) in (ii), we get that no operation takes more than O(log m/ log log m) time. However, the cost of the entire sequence increases to Ω(f log m/ log log m). This result is due to [8].
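To see why this choice of k balances the two worst-case costs, note that log k = Θ(log log m) when k = Θ(log m/ log log m), so that

dunion: O(k) = O(log m/ log log m), and
find: O(log m/ log k) = O(log m/ log log m);

hence no single operation costs more than O(log m/ log log m) time.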

In [2], the authors ask whether one can combine the best features of results (i) (or (iii)) and (ii). Except possibly for a few extreme cases, this is achieved in [2, Theorem 11]. The result presented there has some technical restrictions (e.g. the algorithm needs to know the value of m in advance), but the authors claim that these are not essential.

We now come to lower bounds for this problem, which are proved either in the pointer machine model or in the cell probe model. In general the cell probe model is more powerful (since there are no restrictions on the way that memory is accessed), but the lower bounds are proven assuming that all memory locations can store only numbers of O(log m) bits. By contrast, the pointer machine model does not place any restrictions on the size of the numbers that can be stored at each node. In the pointer machine model, however, the algorithm is required to maintain the name of each set in distinct nodes, and an operation such as find(x) must traverse the graph of nodes from the node for x to the node that contains the name of the set that x is in. Lower bounds for the pointer machine model sometimes make the following separability assumption [40]:

At any time during the computation, the contents of the memory can be partitioned into collections of records such that each collection corresponds to a currently existing set, and no record in one collection contains a pointer to a record in another collection.

We now give some of the known lower bounds:

THEOREM 33.7

(i) Any algorithm that processes a sequence of m − 1 dunion and f find operations must take Ω(m + f α(f + m, m)) steps on either the pointer machine model or the cell probe model with a word size of O(log m) bits.

(ii) Any algorithm that processes a sequence of m − 1 dunion and f find operations must take Ω(log m/ log log m) steps for some operation, on either the separable pointer machine model or the cell probe model with a word size of O(log m) bits.

The lower bound (i) for the pointer machine model is due to [28]; this generalizes an earlier lower bound for separable pointer machine algorithms due to [40] (see also the result due to [4]). The lower bound (ii) is due to [8], while both cell probe lower bounds are due to [15]. Additional lower bounds may be found in [2].

We refer the reader to [18] for variations of the classical union-find problem, including union-find with backtracking and persistent union-find (for the latter, see also [12, 13]). Another important variant, which has a similar complexity to that of the union-find problem, is the split-find problem. Here we are given an ordered universe U and start with F containing a single set which contains all of U. We perform an intermixed series of operations split(S, x), for some x ∈ S, which removes from S all elements > x and places them in a new set T, and find(x), which as before returns the name of the set containing x (a naive sketch of this interface is given below). Upper and lower bounds of the form given in Theorem 33.6(i) or 33.7(i) have been proven by [16, 27, 28]. Finally, we mention the very interesting union-copy problem studied in [23]. In addition to the classical union-find operations, they support an additional operation copy, which takes one set as its argument and adds an exact copy of it to F. Although it is obviously no longer true that all sets in F are disjoint, they require that two sets that are to be united must be disjoint. In addition, they want the data structure to support dual operations, in the sense that the roles of sets and elements are reversed. For example, the operation element-union(x, y) takes two elements x and y that do not both belong to any set in F, and turns them into a single element that is a member of all the sets to which x or y previously belonged. Duals of copy and find are also supported. They show applications of this data structure to the (geometric) dynamic segment tree data structure.
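The following is a naive Python sketch of the split-find interface (class and method names are ours); it pins down the semantics only and makes no attempt to achieve the bounds cited above.

import bisect

class SplitFind:
    """Naive sketch of the split-find interface over an ordered universe U."""

    def __init__(self, universe):
        # Initially F contains a single set (named 0) holding all of U.
        elems = sorted(universe)
        self.members = {0: elems}             # set name -> sorted list of elements
        self.name_of = {x: 0 for x in elems}  # element -> name of its set
        self.next_name = 1

    def find(self, x):
        # Return the name of the set currently containing x.
        return self.name_of[x]

    def split(self, s, x):
        # split(S, x), x in S: move all elements of S greater than x into a new set T.
        elems = self.members[s]
        i = bisect.bisect_right(elems, x)
        t, self.next_name = self.next_name, self.next_name + 1
        self.members[t] = elems[i:]
        self.members[s] = elems[:i]
        for y in self.members[t]:
            self.name_of[y] = t
        return t  # name of the new set T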
