Data Structures for Sets: Extremal Sets and Subset Testing

Extremal Sets and Subset Testing

This section deals with testing sets in F for the subset containment relationship. We first survey a static version of this problem, and then consider a dynamisation.

Static Extremal Sets

Here we assume that we are given a family F of sets from a common universe as input. A set S is maximal in F if there is no set T ∈ F such that S ⊂ T. A set S is minimal in F if there is no set T ∈ F such that T ⊂ S. The extremal sets problem is that of finding all the maximal (or all the minimal) sets in F. A closely related problem is the computation of the complete subset graph, which represents all edges in the partial order induced among members of F by the subset relation. Specifically, the complete subset graph is a directed graph whose vertices are the members of F, with an edge from x to y iff x ⊂ y. This is a problem that arises in a number of practical contexts; for example, it can be used to maximally simplify formulae in restricted conjunctive normal form [29].
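
To make the definitions concrete, here is a deliberately naive, roughly quadratic sketch in Python (the frozenset representation and the function names are ours, not part of the algorithms cited below):

    def maximal_sets(family):
        """Return the sets with no strict superset in the family."""
        return [S for S in family if not any(S < T for T in family)]

    def complete_subset_graph(family):
        """Return the edges (x, y) of the subset graph: pairs with x a strict subset of y."""
        return [(x, y) for x in family for y in family if x < y]

    F = [frozenset({1, 2, 3}), frozenset({1, 2}), frozenset({3})]
    print(maximal_sets(F))           # [frozenset({1, 2, 3})]
    print(complete_subset_graph(F))  # edges {1,2} -> {1,2,3} and {3} -> {1,2,3}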

This problem has been considered in a number of papers including [29–32, 46]. We now summarize the results in these papers. As before, we let k = |F| denote the size of the family, and n = Σ_{S∈F} |S| be the total cardinality of all sets. A number of straightforward approaches can be found with a worst-case complexity of O(n²). It can be shown that the subset graph has Ω(n²/(log n)²) edges [46], which gives a lower bound for any algorithm that explicitly lists the complete subset graph. An aim, therefore, is to bridge the gap between these bounds; all results below are in the word RAM model.

Yellin and Jutla [46] gave an algorithm that computes the subset graph using O(n²/log n) dictionary operations. Using hashing, all dictionary operations take O(1) expected time, and we get an O(n²/log n) expected running time. Using Willard's q-fast tries [42], or the more complex data structure of Beame and Fich [5], gives running times of O(n²/√log n) or O(n²/√(log n · log log n)) respectively. Pritchard [30] gave a simple algorithm that ran in O(n²/log n) time. In [31] he re-analyzed an earlier algorithm [29] to show that this algorithm, too, ran in O(n²/log n) time (the algorithm of [29, 31] uses RAM operations less extensively and is a good candidate for porting to the pointer machine model). Finally, Pritchard [32] gave an algorithm that uses the bit-parallelism of the word RAM extensively to achieve an O(n² log log n/(log n)²) running time.
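
The flavour of bit-parallelism is easy to convey: pack each set into a bit vector, so that a single word operation processes Θ(w) universe elements at once, where w is the word length. The sketch below (Python integers standing in for machine words; all names are ours) shows only the word-level subset test, not Pritchard's actual algorithm:

    def to_bits(S):
        """Pack a set of small non-negative integers into one big integer."""
        v = 0
        for x in S:
            v |= 1 << x   # element x occupies bit position x
        return v

    def is_subset(a_bits, b_bits):
        """A is a subset of B iff A has no bit set outside B."""
        return a_bits & ~b_bits == 0

    A = to_bits({1, 5, 9})
    B = to_bits({1, 2, 5, 9, 40})
    print(is_subset(A, B), is_subset(B, A))  # True False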

All of the above are static problems; it is very natural to ask about the complexity of this problem when updates may change the sets in F. Again, if one wishes explicitly to maintain the entire subset graph, it is easy to see that Ω(n) edges may change as the result of a single update. Take U = {1, ..., u} and F = {A, B2, ..., Bu}, where A = U and Bi = {1, i} for i = 2, ..., u. The sum of the sizes of the sets in F is 3u − 2, but deleting the element 1 from A removes all u − 1 edges from the complete subset graph, as the code below illustrates. It is not very hard to come up with algorithms that achieve an O(n) running time per update (see e.g. [46]). To obtain a better complexity, therefore, we must consider algorithms that do not explicitly store this graph. One way of doing this is to consider the dynamic subset testing problem, defined as the problem of supporting the base repertoire (33.1) and the subset operation (33.5). Since the query subset(A, B) tests if there is an edge between A and B in the complete subset graph, this problem seems to be an appropriate dynamisation of the extremal sets problem, and it is dealt with in the next section.
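
A quick check of this example; the construction is exactly the one just described, and the code is our own illustration:

    u = 6
    A = set(range(1, u + 1))                  # A = U = {1, ..., u}
    Bs = [{1, i} for i in range(2, u + 1)]    # Bi = {1, i}
    print(sum(B < A for B in Bs))             # u - 1 = 5 edges Bi -> A
    A.discard(1)                              # a single update ...
    print(sum(B < A for B in Bs))             # ... and 0 edges remain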

Dynamic Set Intersections and Subset Testing

Rather than consider the dynamic subset testing problem directly, we consider a related problem, the dynamic set intersection problem. In addition to the base repertoire (33.1) of actions above, we consider an enumerative version of the following:

intersect(A, B): Report the intersection of sets A and B. (33.6)

Other variations could include returning the size of the intersection, or retrieving some values associated with the elements in the intersection. A unifying way to study these problems is as follows: we are given a set M of information items that will be associated with elements of U, and a function I : U → M that associates values from M with keys in U. We assume that M = (M, +, 0) is a monoid. The query intersect(A, B) then returns Σ_{x ∈ A∩B} I(x), where the sum is taken with respect to +. It is easy to cast the intersection problem and its variants in this framework. The basic problem defined above is obtained by letting M = (2^U, ∪, ∅) and I(x) = {x} for all x, and the problem where one merely has to report the size of the intersection is obtained by setting M = (ℕ, +, 0) and I(x) = 1 for all x, where + here is ordinary (arithmetic) addition.
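
Here is a minimal sketch of the monoid framework just described, with the two instantiations from the text; the Monoid class and all identifier names are our own framing, and the intersection is computed naively rather than by the data structures discussed below:

    from functools import reduce

    class Monoid:
        """A monoid (M, +, 0) together with the map I from keys to M."""
        def __init__(self, plus, zero, I):
            self.plus, self.zero, self.I = plus, zero, I

        def intersect(self, A, B):
            # the query intersect(A, B): the monoid sum of I(x) over x in A ∩ B
            return reduce(self.plus, (self.I(x) for x in A & B), self.zero)

    # M = (2^U, ∪, ∅) with I(x) = {x}: reports the intersection itself.
    reporting = Monoid(lambda s, t: s | t, frozenset(), lambda x: frozenset({x}))
    # M = (ℕ, +, 0) with I(x) = 1: reports the size of the intersection.
    counting = Monoid(lambda a, b: a + b, 0, lambda x: 1)

    A, B = {1, 2, 3, 5}, {2, 3, 7}
    print(reporting.intersect(A, B))  # frozenset({2, 3})
    print(counting.intersect(A, B))   # 2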

Note that dynamic subset testing, defined in the previous section, easily reduces to the intersection problem: A ⊆ B iff |A| = |A ∩ B|.
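
Continuing the sketch above, the reduction is a one-liner using the counting instantiation:

    def subset(A, B):
        # A ⊆ B iff |A| = |A ∩ B|
        return len(A) == counting.intersect(A, B)

    print(subset({2, 3}, {1, 2, 3}), subset({2, 4}, {1, 2, 3}))  # True False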

We now survey the results in this area. It is useful to introduce the notation Õ(f(n)) = ∪_{c=0}^∞ O(f(n) · log^c n), in which polylogarithmic factors are ignored (a similar convention, written Ω̃, is used for the Ω notation, with inverse polylogarithmic factors being ignored).

For this problem, Yellin [45] gave an algorithm that processes a sequence of n insertions and deletions and q intersect operations in Õ(n · n^{1/k} + q · n^{1−1/k}) time, for any fixed k. The intuition behind this result is as follows. Take t = n^{1−1/k} and say that a set S ∈ F is large if |S| ≥ t and small otherwise. All sets are represented as dictionaries (allowing membership testing in logarithmic time). Intersections between two small sets, or between a small set and a large one, are handled by testing each element of one set for membership in the other. Clearly this takes Õ(t) time, and insertion into and deletion from a small set takes Õ(1) time. For every pair of large sets S, T, the algorithm keeps track of I(S ∩ T) explicitly, by storing the elements of S ∩ T in a balanced binary search tree, and storing at any internal node x the monoid sums of all the elements under x. Since there are at most n/t = n^{1/k} large sets, an insertion into, or a deletion from, a large set requires updating at most n^{1/k} of the pairwise intersections with other sets, and takes Õ(n^{1/k}) time. This proves the time bound, modulo some details such as dealing with changes in the value of t caused by changes in n and so on.
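
The bookkeeping is easier to see in code, so here is a much-simplified, illustrative sketch of the large/small dichotomy for the counting monoid (fixed threshold t, no deletions, no rebalancing as n changes, and cached intersections recomputed from scratch rather than updated in Õ(1) time as in the real structure); the class and all names are ours, and it is not meant to achieve the stated bounds:

    class IntersectionFamily:
        """Toy version of the large/small scheme, counting monoid only."""

        def __init__(self, t):
            self.t = t           # size threshold separating small from large sets
            self.sets = {}       # name -> set of elements
            self.pairs = {}      # frozenset({a, b}) -> |A ∩ B|, cached for large pairs

        def _large(self, a):
            return len(self.sets.get(a, ())) >= self.t

        def insert(self, a, x):
            self.sets.setdefault(a, set()).add(x)
            if self._large(a):
                # refresh the cached intersections with every other large set;
                # the real structure updates each affected pair in O~(1) instead
                for b in self.sets:
                    if b != a and self._large(b):
                        self.pairs[frozenset((a, b))] = len(self.sets[a] & self.sets[b])

        def intersect_size(self, a, b):
            if self._large(a) and self._large(b):
                # large-large: read the answer off the cached pair information
                return self.pairs.get(frozenset((a, b)), 0)
            S, T = self.sets[a], self.sets[b]
            small, other = (S, T) if len(S) <= len(T) else (T, S)
            return sum(1 for x in small if x in other)  # O~(t) membership tests

    fam = IntersectionFamily(t=2)
    for x in (1, 2, 3):
        fam.insert('A', x)
    for x in (2, 3, 4):
        fam.insert('B', x)
    print(fam.intersect_size('A', 'B'))  # 2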

Dietz et al. [11] noticed that if one were to process a sequence of n updates and q queries, where n and q are known in advance, then the overall cost of Yellin's algorithm for processing the sequence is minimized by taking n^{1/k} = min{√q, n}, giving an overall time bound of Õ(q + n√q). They gave an algorithm that achieves this bound even when n and q are not known in advance.
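
To see where this choice comes from, balance the two terms of Yellin's bound (a back-of-the-envelope calculation, suppressing polylogarithmic factors):

    n · n^{1/k} = q · n^{1−1/k}  ⟺  n^{2/k} = q  ⟺  n^{1/k} = √q,

which gives Õ(n√q + q · n/√q) = Õ(n√q) when √q ≤ n; when √q > n, taking n^{1/k} = n (so that every set counts as "large") gives Õ(n² + q) = Õ(q). Together these yield the Õ(q + n√q) bound.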

We now show that this bound is essentially tight in the arithmetic model, by giving a lower bound of Ω̃(q + n√q). For simplicity, consider the problem where the elements in the intersection are to be reported, i.e., take M = (2^U, ∪, ∅) and I(x) = {x}. Starting from an empty family F, we build it up by insertions so that the (sums of the) sizes of the pairwise intersections of sets in F are large, and query all possible pairwise intersections of the sets. If the answers to all the queries had to be obtained by adding (unioning) together singleton sets, the lower bound would follow immediately. Unfortunately, this is too simplistic: subsets obtained as temporary values during the computation of one answer may be re-used to answer another query. To get around this, we note that a subset that is used to compute the answers to several intersection queries must lie in the common intersection of all the sets involved, and we construct F so that the common intersection of any sufficiently large (yet still small) number of sets in F is small. This means that no large subset can be reused very often.

[Figure: the lower-bound construction, a family F = {S1, ..., Sk} of k = Θ(√q) sets, each of size at most m = Θ(n/√q), chosen so that (a) every pairwise intersection has size Ω(m), and (b) the common intersection of any more than ℓ of the sets has size less than ℓ, for a suitable polylogarithmic ℓ.]

The existence of such a family is easily shown by a probabilistic argument (a collection of random sets of the appropriate size suffices). We now represent F in the data structure by executing the appropriate insertions. Note that this requires at most km = Θ(n) update operations, since |Si| ≤ m for all i. We then query the pairwise intersections of all the sets; there are k(k−1)/2 = Θ(q) queries in all.
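
The probabilistic argument is easy to illustrate empirically; in this toy experiment (all parameters are our own choices, far smaller than in the proof, where ℓ is polylogarithmic and m, k are tied to n and q), k random m-subsets of a universe of size 2m typically have every pairwise intersection of size about m/2, while the common intersection of any ℓ of them is only about m/2^{ℓ−1}:

    import random
    from itertools import combinations

    m, k, ell = 200, 10, 4
    U = range(2 * m)
    family = [frozenset(random.sample(U, m)) for _ in range(k)]

    # (a) all pairwise intersections are large (expected size m/2 = 100 here)
    print(min(len(S & T) for S, T in combinations(family, 2)))
    # (b) ell-wise common intersections are small (expected size m/2**(ell-1) = 25)
    print(max(len(frozenset.intersection(*c)) for c in combinations(family, ell)))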

Firstly, note that the sizes of all the output sets sum to Ω(mk²), by (a) above. The output to each query is conceptually obtained by a binary union tree in which each internal node combines the answers from its children; the external nodes represent singleton sets. Each node can be labeled with a set in the obvious way. Consider the entire forest; we wish to count the number of distinct nodes in the forest (that is, nodes labeled with distinct sets). Since each distinct set corresponds to at least one instruction that is not counted elsewhere, counting the number of distinct sets is a lower bound on the number of instructions executed. We consider only nodes that correspond to sets of size ℓ or larger. The number of such nodes is Ω(mk²/ℓ): in each union tree the minimal nodes of size at least ℓ cover the output set, and each of them has size less than 2ℓ. Furthermore, no such set can be used to answer more than ℓ(ℓ−1)/2 different queries, by (b). It follows that there are Ω(mk²/ℓ³) = Ω̃(n√q) distinct sets, giving us a lower bound of this magnitude (a lower bound of Ω(q) is trivial, and dominates whenever q = Ω̃(n√q)).
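
Putting the parameters together (k = Θ(√q) and m = Θ(n/√q), so that mk² = Θ(n√q), with ℓ polylogarithmic):

    #distinct sets ≥ (mk²/ℓ) / (ℓ(ℓ−1)/2) = Ω(mk²/ℓ³) = Ω(n√q/ℓ³) = Ω̃(n√q).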

Dietz et al. [11] also considered the relatively high memory usage of the above algorithms. For example, if q = Θ(n) then both Yellin's algorithm and Dietz et al.'s algorithm use Ω(n^{3/2}) space. They therefore studied the complexity of the above problem when the algorithm is restricted to use s memory locations, where n ≤ s ≤ n², and gave an algorithm that processes n operations (queries and updates) in Õ(n²/s^{1/3}) time. Both the upper bound and the lower bound are more complex than those for the unlimited-memory case. We summarize with the main theorems of Dietz et al.:

THEOREM 33.4 For the problem of maintaining a family of sets under create, insert, delete and intersect, we have the following results:

(i) Any algorithm requires Ω̃(q + n√q) time to process a sequence of n updates and q queries in the arithmetic model of computation.

(ii) Any intermixed sequence of n updates and q queries can be processed in Õ(n√q + q) time using O(min{n√q, n²}) space.

(iii) There is a sequence of O(n) updates and queries that requires Ω̃(n²/s^{1/3}) operations in the arithmetic model, if the algorithm is restricted to using s memory locations.

(iv) Any intermixed sequence of O(n) updates and queries can be performed in Õ(n²/s^{1/3}) time and O(s) space.

As mentioned above, the arithmetic model is a relatively weak one (making it easy to prove lower bounds), but all the algorithms that realise the upper bounds fit into this model. It would be interesting to prove similar lower bounds for stronger models, or to design algorithms that improve upon the above performance in stronger models. Note also that the lower bounds are for the intersection problem, not for the problem that we originally considered, that of subset testing. Since a subset testing query returns a boolean answer, it does not fit into the arithmetic framework.
