Persistent Data Structures:Making Specific Data Structures More Efficient
Making Specific Data Structures More Efficient
The purely functional deques of Kaplan and Tarjan [23], the confluently persistent deques of Kaplan, Okasaki, and Tarjan [21], the purely functional heaps of Brodal and Okasaki [6], and the purely functional finger search trees of Kaplan and Tarjan [22], are all based on a simple and useful mechanism called redundant counters, which to the best of our knowledge first appeared in lecture notes by Clancy and Knuth [9]. In this section we describe what redundant counters are, and demonstrate how they are used in simple persistent deques data structure.
A persistent implementation of deques support the following operations:
ql = push(x, q): Inserts an element x to the beginning of the deque q returning a new deque ql in which x is the first element followed by the elements of q.
(x, ql) = pop(q): Returns a pair where x is the first element of q and ql is a deque containing all elements of q but x.
ql = Inject(x, q): Inserts an element x to the end of the deque q returning a new deque ql in which x is the last element preceded by the elements of q.
(x, ql) = eject(q): Returns a pair where x is the last element of q and ql is a deque containing all elements of q but x.
A stack supports only push and pop, a queue supports only push and eject. Catenable deques also support the operation q = catenate(q1, q2): Returns a queue q containing all the elements of q1 followed by the elements of q2.
Although queues, and in particular catenable queues, are not trivial to make persistent, stacks are easy. The regular representation of a stack by a singly linked list of nodes, each containing an element, ordered from first to last, is in fact purely functional. To push an element onto a stack, we create a new node containing the new element and a pointer to the node containing the previously first element on the stack. To pop a stack, we retrieve the first element and a pointer to the node containing the previously second element.
Direct ways to make queues persistent simulate queues by stacks. One stack holds ele- ments from the beginning of the queue and the other holds elements from its end. If we are interested in fully persistence this simulation should be real time and its details are not trivial. For a detailed discussion see Kaplan and Tarjan [23] and the references there.
Kaplan and Tarjan [23] described a new way to do a simulation of a deque with stacks. They suggest to represent a deque by a recursive structure that is built from bounded-size deques called buffers. Buffers are of two kinds: prefixes and suffixes. A non-empty deque q over a set A is represented by an ordered triple consisting of a prefix , pref ix(q), of elements of A, a child deque, child(q), whose elements are ordered pairs of elements of A, and a suffix , suff ix(q), of elements of A. The order of elements within q is the one consistent with the orders of all of its component parts. The child deque child(q), if non-empty, is represented in the same way. Thus the structure is recursive and unwinds linearly. We define the descendants {childi( q)} of deque d in the standard way, namely child0(q) = q and childi+1(q) = child(childi(q)) for i ≥ 0 if childi(q) is non-empty.
Observe that the elements of q are just elements of A, the elements of child(q) are pairs of elements of A, the elements of child(child(q)) are pairs of pairs of elements of A, and so on. One can think of each element of childi(q) as being a complete binary tree of depth i, with elements of A at its 2i leaves. One can also think of the entire structure representing q as a stack (of q and its descendants), each element of which is prefix-suffix pair. All the elements of q are stored in the prefixes and suffixes at the various levels of this structure, grouped into binary trees of the appropriate depths: level i contains the prefix and suffix of childi(q). See Figure 31.3.
FIGURE 31.3: Representation of a deque of elements over A. Each circle denotes a deque and each rectangle denotes a buffer. Squares correspond to elements from A which we denote by numbers and letters. Each buffer contains 0, 1, or 2 elements. Three versions are shown V1, V2, and V3. Version V2 was obtained from V1 by injecting the element f . Version V3 obtained from version V2 by injecting the element g. The latter inject triggered two recursive injects into the child and grandchild deques of V2. Note that identical binary trees and elements are represented only once but we draw them multiple times to avoid cluttering of the figure.
Because of the pairing, we can bring two elements up to level i by doing one pop or eject at level i + 1. Similarly, we can move two elements down from level i by doing one push or inject at level i + 1. This two-for-one payoff leads to real-time performance.
Assume that each prefix or suffix is allowed to hold 0, 1, or 2 elements, from the beginning or end of the queue, respectively. We can implement ql = push(x, q) as follows. If the prefix of q contains 0 or 1 elements we allocate a new node to represent ql make its child deque and its suffix identical to the child and suffix of q, respectively. The prefix of ql is a newly allocated prefix containing x and the element in the prefix of q, if the prefix of q contained one element. We return a pointer the new node which represents ql. For an example consider version V2 shown in Figure 31.3 that was obtained from version V1 by a case of inject symmetric to the case of the push just described.
The hard case of the push is when the prefix of q already contains two elements. In this case we make a pair containing these two elements and push this pair recursively into child(q). Then we allocate a new node to represent ql, make its suffix identical to the suffix of q, make the deque returned by the recursive push to child(q) the child of ql, and make the prefix of ql be a newly allocated prefix containing x. For an example consider version V3 shown in Figure 31.3 that was obtained from version V2 by a recursive case of inject symmetric to the recursive case of the push just described. The implementations of pop and eject is symmetric.
This implementation is clearly purely functional and therefore fully persistent. However the time and space bounds per operation are O(log n). The same bounds as one gets by using search trees to represent the deques with the path copying technique. These logarithmic time bounds are by far off from the ephemeral O(1) time and space bounds.
Notice that there is a clear correspondence between this data structure and binary coun- ters. If we think of a buffer with two elements as the digit 1, and of any other buffer as the digit 0, then the implementation of push(q) is similar to adding one to the binary number defined by the prefixes of the queues childi(q). It follows that if we are only interested in partially persistent deques then this implementation has O(1) amortized time bound per operation (see the discussion of binary counters in the next section). To make this simu- lation efficient in a fully persistent setting and even in the worst case, Kaplan and Tarjan suggested to use redundant counters.
Redundant Binary Counters
To simplify the presentation we describe redundant binary counters, but the ideas carry over to any basis. Consider first the regular binary representation of an integer i. To obtain from this representation the representation of i + 1 we first flip the rightmost digit. If we flipped a 1 to 0 then we repeat the process on the next digit to the left. Obviously, this process can be long for some integers. But it is straightforward to show that if we carry out a sequence of such increments starting from zero then on average only a constant number of digits change per increment.∗∗ Redundant binary representations (or counters as we will call them) address the problem of how to represent i so we can obtain a representation of i + 1 while changing only a constant number of digits in the worst case.
A redundant binary representation, d, of a non-negative integer x is a sequence of digits We call d regular if, between any two digits equal to 2, there is a 0, and there is a 0 between the rightmost 2 and the least significant digit. Notice that the traditional binary representation of each integer (which does not use the digit 2) is regular . In the sequel when we refer to a regular representation we mean a regular redundant binary representation, unless we explicitly state otherwise.
Suppose we have a regular representation of i. We can obtain a regular representation of i + 1 as follows. First we increment the rightmost digit. Note that since the representation of i is regular, its rightmost digit is either 0 or 1. So after the increment the rightmost digit is either 1 or 2 and we still have a redundant binary representation for i + 1. Our concern is that this representation of i + 1 may not be regular. However, since the representation of i we started out with was regular the only violation to regularity that we may have in the representation of i + 1 is lacking a 0 between the rightmost 2 and the least significant digit. It is easy to check that between any two digits equal to 2, there still is a 0, by the regularity of i.
We can change the representation of i + 1 to a representation which is guaranteed to be regular by a simple fix operation. A fix operation on a digit di = 2 increments di+1 by 1 and sets di to 0, producing a new regular representation dl representing the same number as d.†† If after incrementing the rightmost digit we perform a fix on the rightmost 2 then we switch to another representation of i + 1 that must be regular. We omit the proof here which is straightforward.
It is clear that the increment together with the fix that may follow change at most three digits. Therefore redundant binary representations allow to perform an increment while changing constantly many digits. However notice that in any application of this numbering system we will also need a representation that allows to get to the digits which we need to fix efficiently. We show one such representation in the next section.
These redundant representations can be extended so we can also decrement it while changing only a constant number of digits, or even more generally so that we can increment or decrement any digit (add or subtract 2i) while changing a constant number of other digits. These additional properties of the counters were exploited by other applications (see e.g. [22, 24]).
Kaplan and Tarjan use this redundant binary system to improve the deque implementation we described above as follows. We allow each of the prefixes and suffixes to contain between 0 and 5 elements. We label each buffer, and each deque, by one of the digits 0, 1, and 2. We label a buffer 0 if it has two or three elements, we label it 1 if it has one or four elements, and we label it 2 if it has zero or five elements. Observe that we can add one element to or delete one element from a buffer labeled 0 or 1 without violating its size constraint: A buffer labeled 0 may change its label to 1, and a buffer labeled 1 may change its label to 2. (In fact a 1 can also be changed to 0 but this may not violate regularity.) The label of a deque is the larger among the labels of its buffers, unless its child and one of its buffers are empty, in which case the label of the deque is identical to the label of its nonempty buffer. This coloring of the deques maps each deque to a redundant binary representation. The least significant digit of this representation is the digit of q, the next significant digit is the digit of child(q), and, in general, the ith significant digit is the digit corresponding to childi(q) if the latter is not empty. We impose an additional constraint on the deques and require that the redundant binary representation of any top-level deque is regular.
A regular top-level deque is labeled 0 or 1 which implies that both its prefix and its suffix are labeled 0 or 1. This means that any deque operation can be performed by operating on the appropriate top-level buffer. Suppose that the operation is either a push or a pop, the case of inject and eject is symmetric. We can construct the resulting queue ql by setting child(ql) = child(q) and suff ix(ql) = suff ix(q). The prefix of ql is a newly allocated buffer that contains the elements in pref ix(q) together with the new element in case of push or without the first element in case of pop. Clearly all these manipulations take constant time. The label of ql, however, may be one larger than the label of q. So the redundant binary representation corresponding to ql is either the same as the redundant binary representation of q in which case it is regular, or it is obtained from the redundant binary representation of q by incrementing the least significant digit. (The least significant digit can also decrease in which case regularity is also preserved.) This corresponds to the first step in the increment procedure for redundant regular representations described in the previous section.
To make the redundant binary representation of ql regular we may have to apply a fix operation. Let i be the minimum such that childi(ql) is labeled 2. If for all j < i, childj (ql) is labeled 1 then the fix has to change the label of childi(ql) to 0 and increment the label of childi+1(ql).
Fortunately, we have an appropriate interpretation for such a fix. Assume childi+1(ql) have a non-empty child. (We omit the discussion of the case where childi+1(ql) have an empty child which is similar.) We know that the label of childi+1(ql) is either 0 or 1 so neither of its buffers is empty or full. If the prefix of childi(ql) has at least four elements we eject 2 of these elements and push them as a single pair to the prefix of childi+1(ql). If the prefix of childi(ql) has at most one element we pop a pair from the prefix of childi+1(ql) and inject the components of the pair into the prefix of childi(ql). This makes the prefix of childi(ql) containing either two or three elements. Similarly by popping a pair from or pushing a pair to the suffix of childi(ql), and injecting a pair to or ejecting a pair from the suffix of childi+1(ql) we make the suffix of childi(ql) containing two or three elements. As a result the label of childi(ql) and its two buffers becomes 0 while possibly increasing the label of one or both buffers of childi+1(ql) and thereby the label of childi+1(ql) as well.
There is one missing piece for this simulation to work efficiently. The topmost deque labeled 2 may be arbitrarily deep in the recursive structure of ql, since it can be separated from the top level by many deques labeled 1. To implement the fix efficiently we have to be able to find this deque fast and change it in a purely functional way by copying the deques that change without having to copy all their ancestors deques.
For this reason we do not represent a deque in the obvious way, as a stack of prefix-suffix pairs. Instead, we break this stack up into substacks. There is one substack for the top- level deque and one for each descendant deque labeled 0 or 2 not at the top level. Each substack consists of a top-level, or a deque labeled 0, or a deque labeled 2 and all consecutive proper descendant deques labeled 1. We represent the entire deque by a stack of substacks of prefix-suffix pairs using this partition into substacks. This can be realized with four pointers per each node representing a deque at some level. Two of the pointers are to the prefix and suffix of the deque. One pointer is to the node for the child deque if this deque is non-empty and labeled 1. One pointer is to the node of the nearest proper descendant deque not labeled 1, if such a deque exists and q itself is not labeled 1 or top-level. See Figure 4.2.
A single deque operation will require access to at most the top three substacks, and to at most the top two elements in any such substack. The label changes caused by a deque operation produce only minor changes to the stack partition into substacks, changes that can be made in constant time. In particular, changing the label of the top-level deque does not affect the partition into substacks. Changing the label of the topmost deque which is labeled 2 to 0 and the label of its child from 1 to 2 splits one substack into its first element, now a new substack, and the rest. This is just a substack pop operation. Changing the label of the topmost deque which is labeled 2 to 0 and the label of its child from 0 to 1 merges a singleton substack with the substack under it. This is just a substack push operation.
To add catenation, Kaplan and Tarjan had to change the definition of the data structure and allow deques to be stored as components of elements of recursive deques. The redundant
FIGURE 31.4: Pointer representation of stack of substacks structure. Each circle corre- sponds to a deque and it is marked by its label. Each buffer is a rectangle which is marked by its label. Triangles denote complete binary trees of elements whose depths depend on the level. This particular queue is represented by a stack of three substacks.
binary numbering system, however, still plays a key role. To represent a catenable deque, Kaplan and Tarjan use noncatenable deques as the basic building blocks. They define a triple over a set A recursively as a prefix of elements of A, a possibly empty deque of triples over A, and a suffix of elements of A, where each prefix or suffix is a noncatenable deque. Then, they represent a catenable deque of elements from A by either one or two triples over A. The underlying skeleton of this structure is a binary tree (or two binary trees) of triples.
The redundant binary number system is extended so that it can distribute work along these trees by adding an extra digit.
Kaplan, Okasaki, and Tarjan [21] simplified these data structures at the expense of making the time bounds amortized rather than worst case and using assignment, thus obtaining a confluently persistent data structure which is not purely functional. The key idea underlying their data structure is to relax the rigid constraint of maintaining regularity. Instead, we “improve” the representation of a deque q with full or empty prefix when we try to push or pop an element from it. Similarly, with full or empty suffix. This improvement in the representation of q is visible to all deques that contain q as a subdeque at some level and prevents from pushing into deques with full prefixes or popping from deques with empty prefixes from happening too often.
More specifically, assume that we push into a deque q with full prefix. First, we eject two elements from this prefix, make a pair containing them, and push the pair recursively into child(q). Let the result of the recursive push be childl(q). Then we change the representation of q so that it has a new prefix which contains all the elements in the prefix of q but the two which we ejected, and its child deque is childl(q). The suffix of q does not change. Finally we perform the push into q by creating a new queue ql that has the same suffix and child deque as q, but has a new prefix that contains the elements in the prefix of q together with the new element. A careful but simple analysis shows that each operation in this implementation takes O(1) amortized time. By extending this idea, Kaplan, Okasaki, and Tarjan managed to construct catenable deques using only constant size buffers as the basic building blocks.
Comments
Post a Comment