Succinct Representation of Data Structures: Succinct Dictionaries

Succinct Dictionaries

The (static) dictionary problem is to store a subset S of size n so that membership queries can be answered efficiently. In our case, the universe is taken to be the set [m]. This problem has been widely studied, and various solutions have been proposed that support membership queries in constant time. As we have seen in the last section, a bitvector is a simple way of representing a set from a given universe, but this requires m bits of space. Since there are (m choose n) sets of size n from a universe of size m, one would require only B = ⌈lg (m choose n)⌉ bits (which is approximately n(lg m − lg n + lg e) when n = o(m)) to store a canonical representation of any such set. Thus a bitvector is quite wasteful of space when the set is sparse. A sorted array is another simple representation, but it requires Θ(lg n) time to answer queries.
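For concreteness, the bound B can be computed directly. The following short computation (an illustration only; the helper name info_bound_bits is ours) compares it against the m bits of a plain bitvector for a sparse set:

```python
import math

def info_bound_bits(m: int, n: int) -> int:
    """B = ceil(lg C(m, n)): the minimum number of bits needed to
    distinguish all n-element subsets of a universe of size m."""
    return math.ceil(math.log2(math.comb(m, n)))

# A sparse set: n = 1000 elements out of m = 2^20. The bound is far
# below the m bits of a bitvector, and close to n(lg m - lg n + lg e).
m, n = 1 << 20, 1000
print(info_bound_bits(m, n))
print(round(n * (math.log2(m) - math.log2(n) + math.log2(math.e))))
```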

A fusion tree (see Chapter 39) also takes linear space and supports membership queries in Θ(lg n/ lg lg n) time. In this section, we consider representations of sets whose space complexity is close to the information theoretic minimum and support queries in constant time. (As all the structures outlined below support membership queries in worst case constant time, we do not mention the query complexity explicitly.)

Fredman, Komlós and Szemerédi [19] gave the first linear space structure for the static dictionary problem. This takes n lg m + O(n lg n + lg lg m) bits of space. The lower-order term was later improved by Schmidt and Siegel [55] to O(n + lg lg m). This structure uses a universe reduction function followed by a two-level hash function to hash the given subset one-to-one onto the set [n], and stores the elements of the subset in a hash table (in the order determined by the hash function). The hash table takes n lg m bits, and a clever encoding of the hash function takes O(n + lg lg m) bits of space. We refer to this as the FKS hashing scheme. Note that the space required for this structure is Θ(n lg n) bits more than the optimal bound of B bits.
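The two-level scheme can be sketched as follows. This is a much-simplified rendering (helper names are ours, Python lists stand in for packed tables, and the succinct encoding of the hash functions is omitted entirely), but it shows the essential mechanics: a first-level hash into n buckets chosen so that the squared bucket sizes sum to O(n), then a collision-free second-level table of size s² for each bucket of size s:

```python
import random

def is_prime(q):
    if q < 2:
        return False
    return all(q % d for d in range(2, int(q ** 0.5) + 1))

def next_prime(q):
    while not is_prime(q):
        q += 1
    return q

def build_fks(S, m, rng=random.Random(0)):
    """Two-level FKS-style dictionary for set S over universe [m]."""
    n = max(len(S), 1)
    p = next_prime(m)
    # First level: retry k until sum of squared bucket sizes is O(n).
    while True:
        k = rng.randrange(1, p)
        buckets = [[] for _ in range(n)]
        for x in S:
            buckets[(k * x) % p % n].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    # Second level: per-bucket collision-free table of size |b|^2.
    tables = []
    for b in buckets:
        size = max(len(b) ** 2, 1)
        while True:
            ki = rng.randrange(1, p)
            t = [None] * size
            ok = True
            for x in b:
                j = (ki * x) % p % size
                if t[j] is not None:
                    ok = False
                    break
                t[j] = x
            if ok:
                break
        tables.append((ki, t))
    return p, k, n, tables

def fks_member(D, x):
    p, k, n, tables = D
    ki, t = tables[(k * x) % p % n]
    return t[(ki * x) % p % len(t)] == x

D = build_fks({3, 17, 42, 99}, 128)
print(fks_member(D, 17), fks_member(D, 5))   # True False
```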

Brodnik and Munro [5] gave a static dictionary representation that takes B + o(B) bits of space. It uses two different solutions depending on the relative values of n and m. When the set is relatively sparse (namely, when n ≤ m/(lg m)^(lg lg m)), it partitions the elements into buckets based on the first lg n − lg lg m bits of their binary representations, and stores explicit pointers to refer to the representations of individual buckets. Each bucket is represented by storing all the elements that fall into it in a perfect hash table for that bucket. Otherwise, when the set is dense, it uses two levels of bucketing (at each level splitting the universe into a number of equal-range buckets, depending only on the universe size), after which the range of these buckets reduces to Θ(lg n). These small buckets are stored (almost) optimally by storing pointers into a precomputed table that contains all possible small buckets. In either case the space occupancy can be shown to be B + o(B) bits.

Pagh [46] observed that each bucket of the hash table may be resolved with respect to the part of the universe hashing to that bucket. Thus, one can save space by compressing the hash table, storing only the quotient information rather than the element itself. From the FKS hash function, one can obtain a quotienting function that takes lg(m/n) + O(1) bits for each element. Using this idea one can obtain a dictionary structure that takes n lg(m/n) + O(n + lg lg m) bits of space, which is only Θ(n) bits more than the information-theoretic lower bound (apart from the O(lg lg m) term). Pagh has also given a dictionary structure that takes only B + o(n) + O(lg lg m) bits of space.
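The quotienting idea itself is easy to demonstrate. The sketch below uses the trivial bucket function x mod num_buckets in place of the FKS-derived function, and Python sets in place of packed lg(m/n)-bit fields, so it illustrates only the information saving: the bucket index together with the quotient determines x, so only the quotient needs to be stored.

```python
def build_quotient_dict(S, num_buckets):
    """Bucket by r = x mod num_buckets and store only the quotient
    q = x // num_buckets; (r, q) together reconstruct x, so each stored
    value needs only about lg(m/n) bits instead of lg m."""
    buckets = [set() for _ in range(num_buckets)]
    for x in S:
        buckets[x % num_buckets].add(x // num_buckets)
    return buckets

def quotient_member(buckets, x):
    nb = len(buckets)
    return (x // nb) in buckets[x % nb]

b = build_quotient_dict({3, 17, 42}, 8)
print(quotient_member(b, 17), quotient_member(b, 9))   # True False
```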

Indexable Dictionary

One useful feature of the sorted array representation of a set is that, given an index i, the i-th smallest element in the set can be retrieved in constant time. Furthermore, when we locate an element in the array, we immediately know its rank (the number of elements in the set that are less than the given element). On the other hand, hashing-based schemes support membership in constant time, but typically do not maintain the ordering information. In this section we look at a structure that combines the good features of both these approaches. An indexable dictionary is a structure representing a set S of size n from the universe [m] that supports the following queries in constant time:

rank(x, S): given x ∈ [m], return −1 if x ∉ S, and |{y ∈ S | y < x}| otherwise; and
select(i, S): given i ∈ {1, ..., n}, return the i-th smallest element in S.
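A sorted array realizes this interface directly, though with O(lg n) rather than O(1) time for rank. The sketch below (class name ours) is only a reference implementation of the semantics, returning −1 when x is not in S:

```python
import bisect

class IndexableDictionary:
    """Reference implementation of the rank/select semantics over a
    sorted array (O(lg n) rank here; succinct structures achieve O(1))."""
    def __init__(self, S):
        self.a = sorted(S)
    def rank(self, x):
        # Number of elements of S smaller than x, or -1 if x not in S.
        i = bisect.bisect_left(self.a, x)
        return i if i < len(self.a) and self.a[i] == x else -1
    def select(self, i):
        # i-th smallest element of S, for 1 <= i <= n.
        return self.a[i - 1]

d = IndexableDictionary({5, 2, 9})
print(d.rank(5), d.rank(4), d.select(3))   # 1 -1 9
```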

Here the rank operation is supported only for the elements present in the set S. Ajtai [1] showed that the more general problem of supporting rank for every element in the universe has a query lower bound of Ω(lg lg n), even if the space used is polynomial in n. With this in mind, we consider structures that handle both S and its complement in the next section.

A dictionary that supports the rank operation [52], as well as an indexable dictionary, is very useful in representing trees [50] (see Section 37.4.3).

Elias [15] considered the indexable dictionary problem and gave a representation that takes n lg m − n lg n + O(n) bits and supports the queries in O(1) time, though he only considered the average case time complexity of the queries. Raman et al. [50] have given an indexable dictionary structure that takes B + o(n) + O(lg lg m) bits. The main idea here is, again, to partition the elements into buckets based on their most significant bits, as in the static dictionary structure of Brodnik and Munro [5]. The difference is that instead of storing explicit pointers to the bucket representations, they store the bucket sizes using a succinct representation that supports partial sum queries (see Section 37.8) in constant time. This not only saves a significant amount of space, but also provides the extra functionality needed for supporting rank and select.
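The bucketing-plus-partial-sums idea can be sketched as follows. Here plain arrays and binary search stand in for the constant-time succinct partial sum structures of [50] (and the class name is ours), so only the mechanics are illustrated: select locates its bucket through the prefix sums of the bucket sizes, and rank adds a within-bucket position to the prefix sum:

```python
import bisect

class BucketedDictionary:
    """Bucket elements by their high bits; keep prefix sums of bucket
    sizes to drive rank and select."""
    def __init__(self, S, low_bits):
        self.low = low_bits
        groups = {}
        for x in sorted(S):
            groups.setdefault(x >> low_bits, []).append(x & ((1 << low_bits) - 1))
        self.keys = sorted(groups)              # high parts of nonempty buckets
        self.buckets = [groups[k] for k in self.keys]
        self.prefix = [0]                       # prefix[j] = elements before bucket j
        for b in self.buckets:
            self.prefix.append(self.prefix[-1] + len(b))
    def rank(self, x):
        hi, lo = x >> self.low, x & ((1 << self.low) - 1)
        j = bisect.bisect_left(self.keys, hi)
        if j == len(self.keys) or self.keys[j] != hi:
            return -1
        p = bisect.bisect_left(self.buckets[j], lo)
        if p == len(self.buckets[j]) or self.buckets[j][p] != lo:
            return -1
        return self.prefix[j] + p
    def select(self, i):                        # 1-based
        j = bisect.bisect_left(self.prefix, i) - 1
        return (self.keys[j] << self.low) | self.buckets[j][i - 1 - self.prefix[j]]

bd = BucketedDictionary({3, 5, 12, 300}, 4)
print(bd.rank(5), bd.select(4))   # 1 300
```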

Using similar ideas, one can also represent multisets and collections of sets using almost optimal space. See [50] for details.

Fully Indexable Dictionary

Given a set S ⊆ [m], a fully indexable dictionary (FID) of S is a representation that supports rank and select operations on both S and its complement S̄ = [m] \ S in constant time [50].

It is easy to see that the bitvector representation of a set, with auxiliary structures to support rank and select on both the bits as mentioned in Section 37.2, is an FID. But this requires m + o(m) bits, where m is the size of the universe. Here we look at an FID representation that takes B + o(m) bits of space. Note that when the set is reasonably sparse (namely, when n = m/ω(lg m)), we have B = o(m), and hence this improves on the space complexity of the bitvector representation.
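The bitvector machinery alluded to above can be sketched with one level of precomputed block counts. The real Section 37.2 structures add a second level and table lookup so that both the overhead and the query time stay within the stated bounds; the class name here is ours and the within-block scan is a stand-in for that machinery:

```python
class BitvectorRank:
    """Rank on a bitvector via precomputed per-block 1-counts: rank(i)
    is one stored prefix count plus a scan of at most one block."""
    def __init__(self, bits, block=32):
        self.bits = bits
        self.block = block
        self.sums = [0]                      # sums[j] = ones in blocks 0..j-1
        for i in range(0, len(bits), block):
            self.sums.append(self.sums[-1] + sum(bits[i:i + block]))
    def rank1(self, i):
        """Number of 1s among bits[0..i-1]."""
        j = i // self.block
        return self.sums[j] + sum(self.bits[j * self.block:i])
    def rank0(self, i):
        """Number of 0s among bits[0..i-1] (rank on the complement)."""
        return i - self.rank1(i)

bv = BitvectorRank([1, 0, 1, 1, 0, 1, 0, 1, 1, 0], block=4)
print(bv.rank1(5), bv.rank0(5))   # 3 2
```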

Let S ⊆ [m] be a given set of size n. Divide [m] into blocks of consecutive elements, with block size u = ⌈(lg m)/2⌉. Let Si be the subset of S that falls into the i-th block. Each of the Si's is represented by storing an index into a table that contains the characteristic bitvectors of all possible subsets of a particular size from the universe [u]. As a consequence, the space occupied by these representations, together with all the precomputed tables, can be shown to be B + o(m) bits. To enable fast access to the representations of these subsets, we store the partial sums of the sizes of the subsets, and also the partial sums of the lengths of the representations of these subsets, which take O(m lg lg m/ lg m) bits. This can be used to support rank in O(1) time.
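The space accounting for this block decomposition can be checked numerically: each block stores its popcount plus an index into the (shared, precomputed) table of all blocks with that popcount, so the index fields sum to roughly B while the popcount fields contribute the lower-order overhead. A small illustration (helper name ours; it only counts bits, it does not build the tables):

```python
import math

def block_encoding_bits(bits, u):
    """Bits used when each u-bit block stores its popcount c (a field of
    ceil(lg(u+1)) bits) plus an index among the C(u, c) blocks with that
    popcount (ceil(lg C(u, c)) bits)."""
    total = 0
    for i in range(0, len(bits), u):
        block = bits[i:i + u]
        c = sum(block)
        total += math.ceil(math.log2(len(block) + 1))            # popcount field
        total += math.ceil(math.log2(math.comb(len(block), c)))  # index field
    return total

# A sparse set over [m]: the encoding is well below the m bits of the
# raw bitvector, with the index fields close to B = ceil(lg C(m, n)).
m, u = 4096, 6
S = set(range(0, m, 64))                     # n = 64 evenly spaced elements
bits = [1 if i in S else 0 for i in range(m)]
B = math.ceil(math.log2(math.comb(m, len(S))))
print(block_encoding_bits(bits, u), B, m)
```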

To support select, we first store the positions of every (lg² m)-th element explicitly in an array, which takes O(m/ lg m) bits. Call the part of the universe that lies between two successive elements in this array a segment. If the size of a segment is more than lg⁴ m, then we explicitly store all the lg² m elements of S that belong to this segment in sorted order. This takes lg³ m bits for every such ‘sparse’ segment, and hence at most m/ lg m bits over all the sparse segments. Dense segments are handled by constructing a complete tree with branching factor lg m, and so constant height, whose leaves are the blocks that constitute this segment, and storing some additional information to navigate this tree efficiently (see the searchable partial sum structure in Section 37.8).
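The sampling idea behind select can be sketched as follows; the linear scan between consecutive samples stands in for the sparse/dense segment machinery that makes the real structure constant-time, and the class name is ours:

```python
class SampledSelect:
    """Select by position sampling: store every t-th element of S
    explicitly, then advance from the preceding sample to the answer."""
    def __init__(self, bits, t):
        self.bits = bits
        self.t = t
        positions = [p for p, b in enumerate(bits) if b]
        self.sample = positions[::t]          # positions of elements 1, t+1, 2t+1, ...
    def select(self, i):                      # 1-based
        j = (i - 1) // self.t
        pos = self.sample[j]                  # position of element j*t + 1
        seen = j * self.t + 1
        while seen < i:                       # scan forward to the i-th element
            pos += 1
            if self.bits[pos]:
                seen += 1
        return pos

s = SampledSelect([0, 1, 1, 0, 1, 0, 0, 1], 2)
print(s.select(3), s.select(4))   # 4 7
```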

To support rank and select on S̄, first observe that an implicit representation of a set over a given universe is also an implicit representation of its complement. Thus, we need not store the implicit representations of S̄ again. Except for this, we repeat the above construction with the Si's replaced by the S̄i's (the complements of the Si's within their blocks).

The overall space requirement of the structure is B + O(m lg lg m/ lg m) bits, and rank and select are supported on both S and S¯ in O(1) time. See [50] for details.

Dynamic Dictionary

We have looked at several succinct structures for static dictionaries. We now briefly consider the dynamic dictionary problem where one can add and delete elements from the set while supporting the membership queries.

Model: The model of memory allocation is very important in dynamic data structures. One widely used model [4, 45, 50] is to assume the existence of a ‘system’ memory manager that would allocate and free memory in variable-sized chunks. In this model, the space complexity of a structure is counted as the total size of all the blocks allocated for that structure, and hence this approach does not account for the space wastage due to external fragmentation.

Fundamentally, memory is most easily viewed as a large array. If we are to use the storage, we must manage it. Therefore a simple view is to count all the fragmentation we may cause and count the memory usage as the difference between the addresses of the first and last locations used by the structure. While more complex scenarios may be more realistic in certain cases, we take this simple address difference model as our focus. The methods we discuss are equivalent under either model up to constant factors.

A balanced tree can be used to support all the dynamic dictionary operations in O(lg n) time using n lg m + O(n lg n) bits, where n is the current size of the set. Using the ideas of the FKS dictionary, Dietzfelbinger et al. [14] gave a dynamic dictionary structure that supports membership in O(1) time and updates (insert/delete) in O(1) expected amortized time. This structure takes O(n lg m) bits of space. There have been several improvements, lowering the space complexity close to the information-theoretic minimum, culminating in a structure that takes B + o(B) bits with the same query complexity as above. See [5, 18, 46, 47, 51] and the references therein.
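A minimal dynamic dictionary in this spirit (multiply-mod hashing with chaining and table doubling, giving O(1) expected membership and O(1) expected amortized updates) can be sketched as follows; it makes no attempt at the succinct space bounds discussed above, and the class name is ours:

```python
import random

class DynamicDictionary:
    """Chained hashing with table doubling: membership is O(1) expected,
    inserts/deletes are O(1) expected amortized (rebuilds amortize out)."""
    def __init__(self):
        self.size = 8
        self.n = 0
        self.p = (1 << 61) - 1                 # Mersenne prime for multiply-mod hashing
        self.a = random.randrange(1, self.p)
        self.table = [[] for _ in range(self.size)]
    def _bucket(self, x):
        return (self.a * x % self.p) % self.size
    def member(self, x):
        return x in self.table[self._bucket(x)]
    def insert(self, x):
        if self.member(x):
            return
        self.table[self._bucket(x)].append(x)
        self.n += 1
        if self.n > self.size:                 # load factor > 1: double and rehash
            self._rebuild(self.size * 2)
    def delete(self, x):
        b = self.table[self._bucket(x)]
        if x in b:
            b.remove(x)
            self.n -= 1
    def _rebuild(self, new_size):
        old = [x for b in self.table for x in b]
        self.size = new_size
        self.table = [[] for _ in range(new_size)]
        for x in old:
            self.table[self._bucket(x)].append(x)

d = DynamicDictionary()
for x in range(100):
    d.insert(x)
d.delete(57)
print(d.member(57), d.member(58))   # False True
```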

All these structures also support associating satellite information with the elements, so that whenever an element is found to be in the set, we can also retrieve the satellite information associated with it in constant time.
