Data Structures for Sets: Equality Testing

Equality Testing

We now consider representations that only support the base repertoire (33.1) and the equal operation (33.4). Clearly we would like solutions that are better (at least in some respects) than those given by Theorem 33.1, and we begin by considering deterministic data structures. We first discuss one due to [38], which is based on binary tries (cf. Section 33.2).

The operations applied to our data structure are divided into epochs. At the start of an epoch the data structure is rebuilt ‘from scratch’. Without loss of generality we can consider U to consist of all the elements that were present in F at the start of an epoch, plus all the elements that have appeared as the argument to an insert or delete operation since then (note that U may contain elements no longer present in F). We can then label elements in U with values from {0, ..., |U| − 1}, using ⌈log m⌉-bit integers, where m = |U|. Whenever we insert an element that is not in U, we give it the next unused integer. If Σ_{A∈F} |A| = n0 at the start of an epoch, we start the next epoch after n0/2 updates (inserts or deletes).

Rebuilding from scratch involves resetting U to be only those elements that are currently present in F, relabeling all elements with integers from {0, ..., |U| − 1}, and recreating the data structure with these labels, as well as any auxiliary data structures. It will be easy to see that the amortized cost of rebuilding is negligible, and that |U| = O(n) at all times.
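To make the epoch bookkeeping concrete, the following is a minimal sketch (ours, not code from [38]) of the relabeling scheme just described; the class name LabelSpace and its interface are illustrative choices, and the representation of F itself is left abstract.

    # Illustrative sketch of the epoch/relabeling bookkeeping described above.
    class LabelSpace:
        def __init__(self, family):
            self.rebuild(family)

        def rebuild(self, family):
            """Start a new epoch: U becomes exactly the elements currently in F."""
            self.family = [set(s) for s in family]
            universe = set().union(*self.family) if self.family else set()
            # Labels are the integers 0 .. |U|-1, i.e. ceil(log m)-bit values.
            self.label = {x: i for i, x in enumerate(universe)}
            self.n0 = sum(len(s) for s in self.family)   # total size at epoch start
            self.updates = 0

        def label_of(self, x):
            """Return x's label, assigning the next unused integer if x is new to U."""
            if x not in self.label:
                self.label[x] = len(self.label)
            return self.label[x]

        def note_update(self):
            """Record an insert/delete; rebuild from scratch after n0/2 updates."""
            self.updates += 1
            if self.updates >= max(1, self.n0 // 2):
                self.rebuild(self.family)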

Each set is represented as a binary trie, with shared subtrees as in the previous section. Updates are also done non-destructively. A key difference is the method used to avoid creating nodes that already exist. Nodes are numbered serially in order of creation. If node p points to node v, we say that p is one of (possibly many) parents of v. Each node v maintains a set parents(v) of all its parents. Each parent p ∈ parents(v) is assigned a key equal to (node-number(w), b), where w is the other node (besides v) pointed to by p, and b equals 0 or 1 depending on whether v is a left child of p or not. When doing an update, before creating a new node with left child x and right child y, we search the set parents(x) for the key (node-number(y), 0). If such a key is found, we return the matching parent, which is precisely a node with left child x and right child y. Otherwise, we create a new node p with pointers to x and y, set parents(p) to empty, insert p into parents(x) and parents(y), and return p. The issue is how to represent the sets parents(v) (cf. the dictionary in the previous section).
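The node-reuse step can be sketched as follows. This is our own illustration rather than code from [38]; plain Python dictionaries stand in for the splay-tree dictionaries discussed next.

    # Illustrative sketch of non-destructive node creation with parents dictionaries.
    class Node:
        _next_number = 0

        def __init__(self, left, right):
            self.left, self.right = left, right
            self.number = Node._next_number          # nodes numbered in creation order
            Node._next_number += 1
            self.parents = {}                        # (other child's number, bit) -> parent

    def make_node(x, y):
        """Return a node with left child x and right child y, reusing one if it exists."""
        p = x.parents.get((y.number, 0))             # bit 0: x would be the left child
        if p is not None:
            return p
        p = Node(x, y)
        x.parents[(y.number, 0)] = p                 # x is p's left child
        y.parents[(x.number, 1)] = p                 # y is p's right child
        return p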

Each set parents(v) is represented by a binary search tree, and a variation of the splay algorithm is used to perform searches. The splay algorithm is especially appropriate as it can be used to exploit patterns in the sequence of accesses. Although a detailed consideration of splay trees is beyond the scope of this chapter, the necessary facts are, roughly, that (a) insertion of a key that is larger than the maximum key currently in the tree (a passive insertion) has constant amortized cost, while an insertion that is not passive (an active insertion) has logarithmic amortized cost, and (b) if the frequency of searches for an item i is 0 < αi < 1 then the amortized cost of all searches for item i is essentially O(1 + log(1/αi)).

This is summarized in the lemma below:

LEMMA 33.1 Consider a sequence of insertions and searches performed on an (initially empty) binary search tree using splays. Then:

[The displayed statement of the lemma is omitted in the source; it formalizes the amortized bounds on passive insertions, active insertions, and searches described above.]

REMARK 33.2 Insertions into an initially non-empty tree are easily handled: pretend to start with an empty tree and initialize the tree with a sequence of passive insertions. By the above lemma, the additional cost is linear in the size of the initial tree.
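Although, as noted above, splay trees are beyond the scope of this chapter, a minimal splay-based dictionary sufficient for the parents(v) sets might look as follows. This is our own sketch (using the standard top-down variant of splaying, with distinct keys assumed), not the chapter's code; keys are the (node-number, bit) pairs described above.

    # Minimal top-down splay-tree dictionary for the parents(v) sets.
    class _SplayNode:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None

    def _splay(t, key):
        """Top-down splay: bring the node with `key` (or a neighbour) to the root."""
        if t is None:
            return None
        header = _SplayNode(None, None)       # header's key is never compared
        l = r = header
        while True:
            if key < t.key:
                if t.left is None:
                    break
                if key < t.left.key:          # rotate right
                    y = t.left
                    t.left = y.right
                    y.right = t
                    t = y
                    if t.left is None:
                        break
                r.left = t                    # link right
                r = t
                t = t.left
            elif key > t.key:
                if t.right is None:
                    break
                if key > t.right.key:         # rotate left
                    y = t.right
                    t.right = y.left
                    y.left = t
                    t = y
                    if t.right is None:
                        break
                l.right = t                   # link left
                l = t
                t = t.right
            else:
                break
        l.right = t.left                      # reassemble the three pieces
        r.left = t.right
        t.left = header.right
        t.right = header.left
        return t

    class SplayDictionary:
        def __init__(self):
            self.root = None

        def search(self, key):
            """Return the value stored under `key`, or None if absent."""
            self.root = _splay(self.root, key)
            if self.root is not None and self.root.key == key:
                return self.root.value
            return None

        def insert(self, key, value):
            """Insert a new (distinct) key; passive if `key` exceeds all keys present."""
            node = _SplayNode(key, value)
            if self.root is None:
                self.root = node
                return
            self.root = _splay(self.root, key)
            if key < self.root.key:
                node.left, node.right = self.root.left, self.root
                self.root.left = None
            else:
                node.right, node.left = self.root.right, self.root
                self.root.right = None
            self.root = node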

We now argue that although an update (insert or delete) may result in Θ(log m) node creations, and hence as many insertions into the parents dictionaries, most of these insertions are passive. Specifically, let x(y, z) denote the creation of a node x with children y and z. Then an update may result in a sequence of node creations y1(x0, x1), y2(y1, x2), ..., yl(yl−1, xl), where y1, ..., yl are ‘new’ nodes and x0, ..., xl are nodes that existed previously in the data structure. Note that because the yi's are new nodes, they have higher node-numbers than the xi's. As a result, of the 2l insertions into parents dictionaries caused by this update, all but two are passive: for i ≥ 2, the insertion of yi into parents(yi−1) goes into an empty dictionary, and the insertion of yi into parents(xi) uses a key containing node-number(yi−1), which exceeds every key already present. The two active insertions are those of y1, into parents(x1) with key (node-number(x0), b) and into parents(x0) with key (node-number(x1), 1 − b).

We now need to bound the time to perform searches on the parents dictionaries. For this, we consider a single epoch, and let G = (V, E) be the directed acyclic graph formed by the nodes at the end of the epoch. The sources (nodes of zero in-degree) are the roots of (current and past) sets and the sinks (nodes of zero out-degree) are the elements of U. Note that all nodes have out-degree at most 2.

An update to sets in F results in searches in the parents dictionaries of nodes that lie along a path π from a source to a sink in G. It should be noted that the path is traversed in reverse, from a sink to a source. Note that each time a search traverses an edge (v, w), where w is a parent of v, the key that is searched for in parents(v) is the same. Thus we can identify the traversal of an edge in G with a search for a particular key in a particular dictionary. Let a_e denote the number of times an edge e ∈ E is traversed, and note that we can delete from G all edges with a_e = 0. Letting A_v = Σ_{(w,v)∈E} a_{(w,v)}, the cost of searches in parents(v) for any vertex v ∈ V, denoted cost(v), is given by Σ_{(w,v)∈E} a_{(w,v)} log(A_v / a_{(w,v)}), by Lemma 33.1. Let V_I denote the set of nodes that are neither sinks nor sources, and note that for any v ∈ V_I, Σ_{(w,v)∈E} a_{(w,v)} = Σ_{(v,w)∈E} a_{(v,w)}. Thus, we have:

[The displayed calculation is omitted in the source; it sums cost(v) over all nodes to bound the total search cost during an epoch.]

THEOREM 33.2 There is a data structure for a family of sets that supports insert, delete and member in O(log n) amortized time, equal in O(1) worst-case time, and uses O(n log n) words of space, where n is the current size of the family of sets.

We now describe the partition tree data structure for this problem [24, 44]. Since the partition tree is closely related to the above data structure, results based on the partition tree are not significantly different from those of Theorem 33.2. We describe a slightly simplified version of the partition tree by assuming that U is fixed. The partition tree is a full binary tree T with |U| leaves, i.e. a tree where the leaves are all at depth ⌊log |U|⌋ or ⌈log |U|⌉, and all nodes have either two or zero children. At the leaves of this tree we place the elements of U (in any order). For each node v of T, we let D(v) denote the elements of U that are at leaves descended from v. A set A ∈ F is stored at a node v if D(v) ∩ A ≠ ∅. Furthermore, all sets that are stored at a node v are grouped into equivalence classes, where the equivalence relation is given by A ≡ B ⟺ (A ∩ D(v) = B ∩ D(v)). Clearly, two sets are equal iff they belong to the same equivalence class at the root of the tree, and so this representation supports constant-time equality testing. Note that if nv is the number of sets stored at a node v of T, then Σ_v nv = O(n log |U|). This is because each set containing x appears once at each node on the path leading from x to the root (and hence O(log |U|) times in all), and Σ_{x∈U} |{A ∈ F | x ∈ A}| = n.

The key issue, of course, is how to update the partition tree when executing insert(x, A) (delete(x, A)). We traverse T from x to the root, storing (or removing) A at all the nodes along the path. We now show how to maintain the equivalence classes at these nodes.

At the leaf, there is only one equivalence class, consisting of all sets that contain x. We merely need to add (delete) A to (from) this equivalence class. In general, however, at each node we need to determine whether adding/deleting x to/from A causes A to move into a new equivalence class of its own or into an existing equivalence class. This can be done as follows. Suppose γ is a (non-empty) equivalence class at a (non-leaf) node u in T and suppose that v, w are u's children. A little thought shows that there must be (non-empty) equivalence classes α, β at v, w respectively such that every set in γ belongs to α at v and to β at w; thus γ is determined by the pair (α, β). A global dictionary stores the name of γ under the key (α, β). Inductively assume that following an operation insert(x, A) (or delete(x, A)) we have updated the equivalence class of A at all ancestors of x up to and including a node v, and suppose that A is in the equivalence class α at v. Next we determine the equivalence class β of A at v's sibling (this would not have been changed by the update). We then look up the dictionary with the key (α, β); if we find an equivalence class γ stored with this key then A belongs to γ at v's parent u, otherwise we create a new equivalence class in u and update the dictionary.
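The following is a simplified sketch of the partition tree with this update rule. It is our own illustration, not code from [24, 44]: it assumes a fixed U whose size is a power of two, keeps a separate (α, β) dictionary per node rather than one global dictionary, and uses None as the class of a set that does not intersect D(v).

    # Simplified illustration of a partition tree over a fixed universe.
    class PartitionTree:
        def __init__(self, universe):
            self.universe = list(universe)
            n = len(self.universe)
            assert n > 0 and n & (n - 1) == 0, "sketch assumes |U| is a power of two"
            self.n = n
            # Heap layout: node 1 is the root, node i has children 2i and 2i+1,
            # and the leaves occupy positions n .. 2n-1.
            self.leaf = {x: n + i for i, x in enumerate(self.universe)}
            self.class_of = [dict() for _ in range(2 * n)]       # set id -> class at node
            self.pair_to_class = [dict() for _ in range(2 * n)]  # (alpha, beta) -> class

        def _update(self, x, set_id, present):
            """Recompute set_id's class on the path from x's leaf to the root."""
            node = self.leaf[x]
            # At a leaf there is a single non-trivial class: the sets containing x.
            self.class_of[node][set_id] = "in" if present else None
            node //= 2
            while node >= 1:
                alpha = self.class_of[2 * node].get(set_id)       # class at left child
                beta = self.class_of[2 * node + 1].get(set_id)    # class at right child
                if alpha is None and beta is None:
                    gamma = None                                  # set no longer meets D(node)
                else:
                    table = self.pair_to_class[node]
                    if (alpha, beta) not in table:                # new class at this node
                        table[(alpha, beta)] = len(table)
                    gamma = table[(alpha, beta)]
                self.class_of[node][set_id] = gamma
                node //= 2

        def insert(self, x, set_id):
            self._update(x, set_id, True)

        def delete(self, x, set_id):
            self._update(x, set_id, False)

        def equal(self, a, b):
            """Two sets are equal iff they share an equivalence class at the root."""
            return self.class_of[1].get(a) == self.class_of[1].get(b)

    # Usage, mirroring the example of Remark 33.3: A = {a, c, d} = C, D = {c, d}.
    pt = PartitionTree(["a", "b", "c", "d"])
    for x in ("a", "c", "d"):
        pt.insert(x, "A"); pt.insert(x, "C")
    pt.insert("c", "D"); pt.insert("d", "D")
    assert pt.equal("A", "C") and not pt.equal("A", "D")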

Lam and Lee [24] asked whether a solution could be found that performed all operations in good single-operation worst-case time. The main obstacles to achieving this are the amortization in the splay tree and the periodic rebuilding of the partition tree. Again dividing the operation sequence into epochs, Lam and Lee noted that at the end of an epoch, the partition tree could be copied and rebuilt incrementally whilst allowing the old partition tree to continue to answer queries for a while (this is an implementation of the global rebuilding approach of Overmars [26]). By ‘naming’ the equivalence classes using integers from an appropriate range, they note that the dictionary may be implemented as a two-dimensional array of size O(n^2 log n) words, which supports dictionary operations in O(1) worst-case time. Alternatively, using standard balanced binary search trees, Willard's q-fast tries [42], or the more complex data structure of Beame and Fich [5] gives a worst-case complexity of O(log n), O(√log n) and O((log log n)^2) for the dictionary lookup, respectively. Using these data structures for the dictionary we get:

THEOREM 33.3 There is a data structure for a family of sets that supports equal in O(1) worst-case time, and insert and delete in either (i) O(log n (log log n)^2) worst-case time

[The remaining alternatives, given in a display in the source, are omitted; they correspond to the other dictionary implementations listed above.]
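As an aside, the array-based dictionary mentioned before Theorem 33.3 is easy to picture. The sketch below is ours: because equivalence classes are named with integers from a bounded range, a pair of class names indexes a table directly, giving O(1) worst-case lookups at the cost of quadratic space.

    # Sketch of the two-dimensional-array dictionary over integer class names.
    class PairArrayDictionary:
        def __init__(self, max_classes):
            self.table = [[None] * max_classes for _ in range(max_classes)]

        def lookup(self, alpha, beta):
            return self.table[alpha][beta]    # None means "no class stored yet"

        def store(self, alpha, beta, gamma):
            self.table[alpha][beta] = gamma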

REMARK 33.3 The reader may have noticed the similarities between the lookups in partition trees and the lookup needed to avoid creating existing nodes in the solutions of Theorems 33.1 and 33.2; indeed the examples in Figure 33.2 should make it clear that, at least in the case that |U| is a power of 2, there is a mapping from nodes in the DAG of Theorem 33.2 to partitions in the partition tree. The figure illustrates the following example: U = {a, b, c, d}, F = {A, B, C, D}, A = {a, c, d} = C, B = {a, b, d}, D = {c, d}, and an assumed labeling function that labels a, b, c and d with 00, 01, 10 and 11 respectively. The partitions in (ii) shown circled with a dashed/dotted line correspond to the nodes circled with a dashed/dotted line in (i).
