Data Structures for Sets:Partition Maintenance Algorithms
Partition Maintenance Algorithms
A partition P of a given universe U = {1, 2,... , m} is a collection of #P disjoint subsets (parts), P (1),P (2),... ,P (#P ), of U such that ∪iPi = U . A case of special interest is that of bipartitions, which is a partition with two parts. A partition P is an equivalence relation on U , which we denote as ≡P . Given two partitions P and Q, the induced partition of P and Q is the partition that represents the equivalence relation x ≡ y ⇔ ((x ≡P y) ∨ (x ≡Q y))
(in words, two elements belong to the same part of the induced partition iff they belong to the same part in both P and Q). The problem we consider is the following. Given a collection F of partitions (initially empty), to support the following operations:
We assume here that each new partition P is given by an array A[1..m] containing integers from 0 to #P − 1. Other operations include asking if two elements belong to the same partition of the global induced partition, reporting the set of elements that belong to the same part of the global induced partition, or, for any two elements, reporting the number of partitions in which these two elements are in different parts.
As noted by Bender et al. [7] and Calinescu [9], this problem has a number of applications. The general issue is supporting a classification system that attempts to categorize objects based on a number of tests, where each test realizes a partition of the objects (a bipartition, for instance, could model the outcome of a boolean test). The above data structure would be useful in the pre-processing phase of such a classification system, where, for example, an algorithm may repeatedly insert and delete partitions (tests) until it finds a small set of tests that distinguish all items of the universe. Examples in optical character recognition (OCR) and VLSI are given by the above authors.
We now discuss some solutions to these problems. The model used is the word RAM model with word size Θ(log n), where we take k = |F | and n = km. We first consider the case of general partitions, and then that of bipartitions. A simple data structure for general partitions was given by Bender et al. [7], and resembles the partition tree of Section 33.3.
A key fact they use is (apply radix sorting on triples of the form (P [i], Q[i], i)):
PROPOSITION 33.1 Given two partitions P and Q of U = {1,... , m} we can calculate the induced partition of P and Q in O(m) time.
This immediately implies that we can maintain the induced partition of F under inserts alone in O(m) time and also support report in O(m) time—we simply maintain the global induced partition and update it with each insert. deletes are not so easy, however.
We place the partitions at the leaves of a full binary tree with k leaves. At each internal node v of this binary tree we store the partition induced by all the partitions descended from v. Clearly, the root contains the global induced partition. Inserting a partition P involves replacing a leaf containing a partition Q by an internal node that has P and Q as children, and updating the partitions stored at this internal node and all its ancestors. Deleting a partition involves deleting the appropriate leaf, and maintaining fullness by possibly swapping a ‘rightmost’ leaf with the deleted leaf. In either case, we update partitions from at most two leaves up to the root. This gives a data structure that supports insert and delete in O(m log k) time and report in O(m) time. The space used is O(mn) words of memory.
The time for insert can be improved to amortised O(m), while leaving the time complexity of all other operations the same. The idea is to group all partitions into disjoint groups of t = plog kl each, leaving perhaps one incomplete group of size smaller than t. For each group we store its induced partition, and also store the induced partitions of all groups, except the incomplete group, at the leaves of a tree T (which may now have O(k/ log k) leaves) as before. In addition, we explicitly store the global induced partition G.
When performing insert(P ) we add P to the incomplete group and in O(m) time, update the global partition G. If the incomplete group reaches size t, in addition, we calculate the group’s induced partition in O(mt) = O(m log k) time and insert this induced partition into T , also taking O(m log k) time, and start a new empty incomplete group. Deleting a partition P is done as follows. We delete P from its group. If P is in the incomplete group we recompute the G in O(mt) time. If P ’s group is stored in T , we recompute the new partition induced by P ’s group in O(mt) time and update T in O(m log k) time (if P is the last remaining partition in its group we delete the corresponding leaf, as before). We then recompute G in O(mt) = O(m log k) time as well. Note that the amortised cost of an insertion is now O(m), since we spend O(m log k) time every log k insertions. Finally, Bender et al. note that a signature-based scheme gives a Monte-Carlo method that performs all operations in O(m) time, but has a small chance of outputting an incorrect result (the algorithm runs correctly with probability 1 − O(m−c ) on all inputs).
We now consider the important special case of bi-partitions, and give a sketch of Ca- linescu’s [9] algorithm for solving this problem in optimal amortised time. Again letting k = |F |, one can associate a k-bit binary string σx with each x ∈ U = {1,... , m}, which specifies in which part of each of the k partitions x lies. Let π denote a permutation such that σπ−1 (i) ≤ σπ−1 (i+1), for i = 1,... ,m − 1; i.e., π represents a sorted order on the (multi- set) of strings {σx | x ∈ U }. Furthermore, let lcpi denote the most significant position where σπ−1 (i) and σπ−1 (i+1) differ, for i = 1,... ,m − 1 (lcpi = k +1 if σπ−1 (i) = σπ−1 (i+1)).
We can now clearly support report in O(m) time, as elements in the same part of the global induced partition will be consecutive in sorted order, and the lcp values allow us to deter- mine the boundaries of parts of the global induced partition without inspecting the strings. An insert of a partition is also easily supported, as only local re-arrangements of elements lying within the same part of the previous global induced partition are required.
Deleting a partition is a little harder, and requires a few observations. Suppose that we delete the partition that gives the t-th most significant bit of σ1,..., σm. Suppose that for two indices i, j, j > i, lcpi < t and lcpj < t, but all lcp’s in between are at least t. Then we can conclude that the strings in positions 1 to i − 1 of the sorted order continue to appear in (possibly a different order in) positions 1 to i − 1 after the deletion of this partition, and likewise the strings in positions j + 1 to m also do not ‘move’ except internally. Let Σ denote the strings that appear in positions i through j in sorted order, i.e., Σ = {σl | l ∈ {π−1(i), π−1(i + 1),..., π−1(j)}}. We now show how to sort Σ, and repeated application of this procedure suffices to re-sort the array. Note that all the strings in Σ that have 0 in the t-th position maintain their relative order after the deletion, and likewise those strings with 1 in the t-th position. Thus, re-sorting Σ is simply a matter of merging these two sets of strings. At a high level, the merging procedure proceeds like the (standard, trivial) algorithm. However, a naive approach would require Θ(k) time per comparison, which is excessive. Instead, we note that at each step of the merging, the next candidate can either be determined in O(1) time (when the relative order of the two candidates is implicit from the lcp data) or a number, c, of comparisons need to be made. However, if c comparisons are made, then there is at least one lcp value in the new array that is c more than its counterpart in the old array. Since lcp values are bounded by O(k), no string will be involved in more than O(k) comparisons during its lifetime, and the cost of these comparisons can be charged to the insertions of the partitions. This intuition can be formalized by a potential function argument to show that the amortised cost of a deletion is indeed O(n), thus giving an optimal (amortised) algorithm for this problem.
Comments
Post a Comment