Data Structures for Sets: Introduction and Simple Randomized Set Representations

Introduction

Sets are a fundamental concept in computer science: algorithms and data structures for maintaining them were among the earliest developments in the study of data structures.

Our focus will be on problems that involve maintaining a family F of sets, where all sets are drawn from a universe U of elements. We do not assume that there is a particular natural order among the elements of U, and in particular we do not focus on data structuring problems where the operations explicitly assume such an order (e.g. priority queues). A base repertoire of operations is as follows:

[Operation list omitted: the base repertoire (33.1), which includes creating a new empty set and inserting or deleting a given element x from a named set S.]

The following operation is fundamental for data structuring problems involving sets in general, but plays only an auxiliary role in this chapter:

member(x, S): determine whether the element x belongs to the set S.

If we take only insert, delete and member, we get the dictionary problem, covered in Part III.

The base repertoire plus member is essentially no more difficult, as it represents the problem of maintaining a collection of independent dictionaries over a common universe. In this chapter, we focus on adding operations to this base repertoire that take two or more sets as arguments. For example, we could consider the standard set-theoretic operations on two sets A and B:

[Operation list omitted: the standard set-theoretic operations, such as the union A ∪ B, the intersection A ∩ B and the difference A \ B.]

A data structure may support only an enumerative form of these operations, whereby the result of A op B is, in essence, some kind of enumeration of the set A op B. This result may not be in the same representation as A or B, and so one may not be able to operate on it further (e.g. ((A op B) op C) may not involve just two executions of op). The complexity of the algorithms will generally be measured in terms of the following parameters:

[Parameter list omitted: it includes n = Σ_{S∈F} |S|, the total size of all the sets in F, and m = |U|, the size of the universe.]

33.1.1 Models of Computation

The problems that we consider in this chapter have been studied on a variety of different computation models. The primary models for proving upper bounds are the pointer machine model and the random-access machine (RAM) model. The pointer machine [22, 35, 40] postulates a storage consisting of an unbounded collection of registers (or records) connected by pointers. Each register can contain an arbitrary amount of additional information, and no arithmetic is allowed to compute the address of a register. The processor has a constant number of (data and pointer) registers that it can manipulate directly, but all other temporary results must be held in the storage records. In particular, this means that to answer a query, the processor must either start from a pointer p into the data structure provided by the ‘user’ or from one of the constant number of pointers it itself holds, and explore the data structure by following pointers starting from p.

The RAM model we use is a standard variant of the original RAM model [1], the word RAM model [17]. Briefly, the word RAM model uses the unit-cost criterion and a word size of Θ(log n) bits, where n = Σ_{S∈F} |S|. Clearly the word size should be at least log n + O(1) bits; otherwise one could not even address the amount of memory required to store F. Nevertheless, there are instances where the solutions that result from the use of this model could be viewed as “cheating”. For example, we could have n = 2^{Θ(|U|)}, in which case the word size would be Θ(|U|) bits, which would allow most set operations to be done in O(1) time by bitwise operations on a single word. The solutions that we discuss, however, do not exploit this fact. In the related cell probe model, the storage of the computer is modeled by cells numbered 0, 1, 2, …, each of which can store a number of O(log n) bits. The running time of an algorithm in this model is measured as just the number of words (cells) accessed during an operation. All other computations are free.
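
To make the remark about very large word sizes concrete, here is a small Python sketch (the function and variable names are our own, purely illustrative): when the universe is small enough that a set fits in one machine word, the standard set operations reduce to single bitwise instructions.

```python
# Sketch (illustrative names): subsets of a small universe {0, ..., w-1}
# packed into a single integer, one bit per element.

def to_word(elements):
    """Encode a subset of {0, ..., w-1} as a bitmask."""
    word = 0
    for x in elements:
        word |= 1 << x
    return word

A = to_word({1, 3, 5})
B = to_word({3, 4})

union        = A | B    # A union B
intersection = A & B    # A intersect B
difference   = A & ~B   # A \ B

print(bin(union), bin(intersection), bin(difference))
```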

The arithmetic model, used primarily for proving lower bounds, was proposed by Fredman [14] and Yao [43]. We now give a somewhat simplified description of this model which conveys the essential aspects; the interested reader is referred to [25, section 7.2.3] for further details. In many cases of interest, it is useful to assume that the data structure operates on values from a set M, which is a monoid. This means that M = (M, +, 0), where + is an associative and commutative operator under which M is closed, and 0 is the identity element for +. The data structure is modeled as a collection of variables v_0, v_1, …, each of which can hold a value from M and initially contains 0. After receiving the input to each operation, the algorithm executes a sequence of instructions of the form v_i ← INPUT, v_i ← v_j + v_k or OUTPUT v_i. The algorithm must be correct for all choices of M; thus, in particular, it cannot assume that the operator + is invertible.

The cost of an algorithm in processing a sequence of operations is the total number of such instructions executed by it. The restriction that M is a monoid (rather than, say, a group) is partly for ease of proving lower bounds. However, known algorithms in this framework do not gain significant asymptotic speedups by exploiting stronger models (e.g. by using the fact that + is invertible in groups).
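
The following Python sketch is our own illustration of the arithmetic model (the instruction encoding and the function run are hypothetical): variables hold monoid values, the only instructions read an input, add two variables or output a variable, and the cost is simply the number of instructions executed.

```python
# Sketch of the arithmetic model: variables v[0], v[1], ... hold monoid values,
# initially the identity.  The only instructions are
#   ("input", i)       v[i] <- next input value
#   ("add", i, j, k)   v[i] <- v[j] + v[k]
#   ("output", i)      report v[i]
# The cost of a program is the number of instructions executed.

def run(program, inputs, plus, zero):
    v = {}                          # variables, defaulting to the identity
    it = iter(inputs)
    outputs, cost = [], 0
    for instr in program:
        cost += 1
        if instr[0] == "input":
            v[instr[1]] = next(it)
        elif instr[0] == "add":
            _, i, j, k = instr
            v[i] = plus(v.get(j, zero), v.get(k, zero))
        elif instr[0] == "output":
            outputs.append(v.get(instr[1], zero))
    return outputs, cost

# Example: the monoid (N, +, 0); the program outputs the sum of two inputs.
prog = [("input", 0), ("input", 1), ("add", 2, 0, 1), ("output", 2)]
print(run(prog, [3, 4], plus=lambda a, b: a + b, zero=0))   # ([7], 4)
```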

Simple Randomized Set Representations

In this section we cover a simple, but general-purpose set representation due to [33, 34]. In addition to the base repertoire (33.1), we wish to support:

[Operation list omitted: equality testing of two sets and the constructive set-theoretic operations covered by Theorem 33.1.]

This representation touches upon a topic of interest in its own right, that of the unique representation of sets, which can be defined as follows. We consider a class of representations that suffice to represent all subsets of a given universe U. However, we require that each set have a unique representation within that class. In particular, the representation should be independent of the sequence of operations that created the set (for example, a red-black tree representation is not unique). Unique representations have the desirable property that it is then possible to ensure that all instances of the same set within a family (including sets that are created as intermediate results in a computation) are represented by the same object within the data structure. This allows constant-time equality tests of two sets: just check whether they are the same object! The difficulty, of course, is to combine unique representations with rapid updates.

This definition also applies to randomized algorithms, which may access a source of random bits while executing a query or update, and the uniqueness requirement here means that the choice of representation for a set depends solely upon the sequence of random bits that are output by the source. (If, as in practice, one uses a pseudo-random generator, then the representation for a given set depends only upon the seed used by the pseudo-random number generator.)

The Hash Trie

We first weaken the notion of a unique representation, and speak about the unique representation of sets from a labeled universe. Here, we assume that the elements of the universe U are labeled with (not necessarily unique) b-bit strings, and for x ∈ U we denote its label by ℓ(x). In addition, we also have the notion of a labeled set A_y, where y is a sequence of at most b bits. Any labeled set A_y satisfies the property that for all x ∈ A_y, y is a prefix of ℓ(x).

Given a set from a labeled universe, one can of course use the well-known binary trie [21] to represent it. For the sake of precision we now define the binary trie, when used to represent a (labeled) set S_y:

[Definition omitted: the recursive definition of the binary trie representing a labeled set S_y.]

Without exploiting the unique representation property, we would get the representation in Figure 33.1(i); by storing subtrees uniquely we get the representation in Figure 33.1(ii). Updates now need to be done with a little care: one cannot, for example, add a new child to a node, as the subtree in which the node lies may be shared among a number of sets. The solution is to use nondestructive updates, namely, implementing an update by making a copy of the path from the leaf to be added or deleted up to the name of the set (cf. persistent data structures). Figure 33.1(iii) shows the state of the data structure after executing the operations insert(a, Y) and insert(b, X), which we now explain.

First, let us consider the operation insert(a, Y). We need to insert a as a sibling of c in the trie for Y. Since the path from c up to the root of Y could potentially be shared by other sets, we do not modify the node that is the parent of c; instead we make a copy of the entire path from c to the node that is the root of the representation of Y, as shown in the figure. The pointer for Y is then appropriately modified. The nodes that were previously part of the representation of Y (shown in dotted outline) are, in this example, not reachable as part of any set and may be cleared as part of a garbage-collection phase (they are not explicitly freed during the insertion). Now coming to insert(b, X), we proceed as before, and do not insert b directly as a sibling of (c, f) under X. However, before creating a new node with the leaves b and (c, f) as children, we check the data structure to see if such a node already exists, and we find one (the node representing the set Z_01). Therefore, we avoid creating this node. Continuing, we discover that all the nodes that we would have tried to create as part of this insertion already exist, and therefore conclude that the sets X and Z are now equal (and hence their tries should be represented by the same node).

[Figure 33.1 omitted: (i) tries without shared subtrees, (ii) the same family with subtrees stored uniquely, (iii) the structure after insert(a, Y) and insert(b, X).]

To support the reuse of existing nodes, a dictionary data structure that stores all nodes currently in the data structure is maintained. A node x with left child y and right child z is stored in this dictionary with the key (y, z) (either of y or z could be nil). Each insertion requires Θ(b) lookups or insertions into this dictionary. Avoiding this overhead is important, but in a randomized setting the dictionary can be implemented using dynamic perfect hashing, giving O(1) overhead. This idea is also used in practical implementations of functional programming languages such as LISP, and is referred to as hash consing in that context.
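
To make path copying and node reuse concrete, here is a simplified Python sketch of a hash-consed trie; the class and function names are our own, and for brevity each leaf holds a single element rather than a collision list.

```python
# Sketch (our own simplification): a hash-consed binary trie over b-bit labels.
# Nodes are immutable; a global table keyed by the children guarantees that
# equal subtrees are represented by the same object.

B = 4                      # label length in bits (small, for illustration)
_node_table = {}           # (left, right) or ('leaf', element) -> canonical node

class Node:
    __slots__ = ("left", "right", "element")
    def __init__(self, left=None, right=None, element=None):
        self.left, self.right, self.element = left, right, element

def make_leaf(element):
    key = ("leaf", element)
    if key not in _node_table:
        _node_table[key] = Node(element=element)
    return _node_table[key]

def make_node(left, right):
    key = (id(left), id(right))        # reuse an existing node if one matches
    if key not in _node_table:
        _node_table[key] = Node(left, right)
    return _node_table[key]

def insert(root, label, element, depth=0):
    """Nondestructive insert: copies the root-to-leaf path, reusing subtrees."""
    if depth == B:
        return make_leaf(element)
    bit = (label >> (B - 1 - depth)) & 1
    left = root.left if root else None
    right = root.right if root else None
    if bit == 0:
        left = insert(left, label, element, depth + 1)
    else:
        right = insert(right, label, element, depth + 1)
    return make_node(left, right)

# Two sets built in different insertion orders end up as the same object.
X = insert(insert(None, 0b0101, "c"), 0b0110, "f")
Y = insert(insert(None, 0b0110, "f"), 0b0101, "c")
print(X is Y)              # True: equality testing is a pointer comparison
```

Because equal sets are represented by the same object, the final identity test stands in for equality testing; a fuller implementation would also handle label collisions at the leaves, implement the node table with dynamic perfect hashing, and garbage-collect unreachable nodes, as described above.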

In a randomized setting, the other problem, that of obtaining suitable labels, can also be solved easily. We simply let ℓ be a function chosen at random from a universal class of hash functions (cf. Chapter 9).

By choosing ℓ : U → {1, …, n^3} (for example) we can ensure that the number of collisions, i.e. pairs of distinct stored elements x, y with ℓ(x) = ℓ(y), is O(1) with a probability that tends to 1 as n → ∞. This means that we can essentially ignore the possibility that there are more than a constant number of elements at the leaf of any trie. Clearly, labels are O(log n) bits long.
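
As an illustration, a label function of this kind could be drawn from the classical Carter–Wegman family; the sketch below is our own, the parameter choices are illustrative, and elements are assumed to be integers.

```python
import random

# Sketch: draw a random label function from a (Carter-Wegman style) universal
# family.  Mapping n stored elements into a range of size about n**3 keeps the
# expected number of colliding pairs at roughly n*(n-1)/(2*n**3) = O(1/n).
# Assumes elements are integers smaller than the prime p.

def make_label_function(n, p=(1 << 61) - 1):
    a = random.randrange(1, p)          # random nonzero multiplier
    b = random.randrange(0, p)          # random offset
    r = n ** 3                          # label range {0, ..., n**3 - 1}
    return lambda x: ((a * x + b) % p) % r

ell = make_label_function(n=1000)
print(ell(42), ell(43))                 # two (very likely distinct) labels
```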

Note that we can ensure that both the lengths of the labels and the number of collisions stay under control as n changes, by simply rehashing whenever the value of n doubles or halves since the last rehash. We now analyze some parameters of this data structure. Clearly, equality testing takes O(1) time. Each insertion or deletion takes time and space proportional to the depth of the tries, which is O(log n). Both insertions and deletions may cause some nodes to become ‘garbage’ (i.e. unreachable from any set); these need to be compacted periodically, for example when a rehash occurs. It is easy to verify that the amortized cost of rehashing and garbage collection is negligible. This gives parts (i) and (ii) of the following theorem; for part (iii) we refer the reader to [33, Chapter 8]:

THEOREM 33.1 The hash trie representation described above:

(i) supports insert and delete in O(log n) amortized expected time and equality testing in O(1) worst-case time;

(ii) uses O(n log n) space, and

(iii) given two sets A and B, supports the constructive version of A op B for op ∈ {∪, ∩, \, Δ} in O(|S_2| (1 + log(|S_3|/|S_2|))) amortized expected time, where S_2 and S_3 are the middle and largest of the sets A \ B, B \ A and A ∩ B.

REMARK 33.1 Note that a fairly naive bound for an operation A op B, for op ∈ {∪, ∩, \, Δ}, is O(|A| log |B|), obtained by storing each set in the family in a binary-tree-based dictionary. Thus, Theorem 33.1 is not, in the worst case, a lot better. On the other hand, if we have two large sets S and T that differ in only a small number k of elements, then the expected running time of the above algorithm is O(k log max{|S|, |T|}).

Some Remarks on Unique Representations

There are interesting issues regarding the unique representation problem, in particular for unlabeled universes. We first mention here that both the randomized search tree and the skip list (cf. Chapter 13) satisfy the unique representation property described above, even for unlabeled universes, and support the dictionary operations (insert, delete and member) in O(log n) expected time. In each case, the algorithm described there should be modified so that, instead of choosing a random height for a node in a skip list or a random priority for a node in a randomized search tree, we derive the height or priority of an element by applying to it a fixed function from U to [0, 1] that behaves ‘randomly’ (we refer the reader to [36] for a precise theoretical statement).
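
The following Python sketch (our own, simplified) illustrates this idea for randomized search trees: by deriving each node's priority from a fixed hash of its key rather than from a fresh coin flip, the shape of the tree depends only on the set of keys stored, not on the order of insertions.

```python
import hashlib

# Sketch: a treap whose priorities are a fixed 'random-looking' function of the
# key, so the tree shape depends only on the set of keys stored.

def priority(key):
    # Stands in for a function U -> [0, 1] that behaves 'randomly'.
    return hashlib.sha256(str(key).encode()).digest()

class TreapNode:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    return x

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y

def insert(root, key):
    """Standard treap insert, with priorities determined by the key."""
    if root is None:
        return TreapNode(key)
    if key == root.key:
        return root
    if key < root.key:
        root.left = insert(root.left, key)
        if priority(root.left.key) > priority(root.key):
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key)
        if priority(root.right.key) > priority(root.key):
            root = rotate_left(root)
    return root

def shape(t):
    return None if t is None else (t.key, shape(t.left), shape(t.right))

A = None
for k in [5, 1, 9, 3]:
    A = insert(A, k)
B = None
for k in [3, 9, 5, 1]:
    B = insert(B, k)
print(shape(A) == shape(B))   # True: same set of keys, same tree shape
```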

By representing each set in a family using either one of these representations we get alternatives to parts (i) and (ii) of Theorem 33.1. However, one may ask about deterministic unique representations that support rapid updates. The reader is directed to [3, 38] and the references therein for pointers to the literature on this fascinating area. We only note here that [3] show a Θ(n^{1/3}) time bound for supporting the above dictionary operations, provided that the class of representations is restricted to “graph-like” representations (including, of course, trees). This shows an exponential separation between randomized uniquely represented dictionaries and deterministic (non-unique) dictionaries on the one hand, and deterministic uniquely represented dictionaries on the other. This result is not entirely conclusive, however: by labeling elements in U with integers from {0, …, m − 1}, as observed in [38, Section 2.3], the above trie-based approach gives uniquely represented dictionaries that support all operations in O(log m) time, where m = |U|. Thus, there is a “dichotomy” depending upon whether we are interested in bounds that depend on m alone or on n alone (cf. Chapter 39).

It is not known how the complexity of unique representation depends on the relationship between m and n. Indeed, the gap in knowledge is quite large, as the Ω(n^{1/3}) bound applies when n ≪ log |U|, while the O(log m) algorithm is optimal only when n ∼ |U|^c for some constant c > 0.
