Persistent Data Structures: Introduction and Algorithmic Applications of Persistent Data Structures

Introduction

Think of the initial configuration of a data structure as version zero, and of every subsequent update operation as generating a new version of the data structure. Then a data structure is called persistent if it supports access to all versions, and it is called ephemeral otherwise. The data structure is partially persistent if all versions can be accessed but only the newest version can be modified. The structure is fully persistent if every version can be both accessed and modified. The data structure is confluently persistent if it is fully persistent and has an update operation which combines more than one version. Let the version graph be a directed graph where each node corresponds to a version and there is an edge from a node V1 to a node V2 if and only if V2 was created by an update operation applied to V1. For a partially persistent data structure the version graph is a path; for a fully persistent data structure the version graph is a tree; and for a confluently persistent data structure the version graph is a directed acyclic graph (DAG).

A notion related to persistence is that of purely functional data structures.

(See Chapter 40 by Okasaki in this handbook.) A purely functional data structure is a data structure that can be implemented without using an assignment operation at all (say, using just the functions car, cdr, and cons of pure Lisp). Such a data structure is automatically persistent. The converse, however, is not true: there are persistent data structures which do perform assignments.
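
To make this concrete, here is a minimal sketch (ours, not from the chapter) of such a purely functional list in Python, mimicking cons, car, and cdr; since no cell is ever mutated, every earlier version survives every update:

```python
def cons(head, tail):
    """Build a new immutable cell; the tail is shared, never copied."""
    return (head, tail)

def car(cell):
    return cell[0]

def cdr(cell):
    return cell[1]

v0 = cons(3, cons(2, cons(1, None)))  # version 0: the list [3, 2, 1]
v1 = cons(4, v0)                      # version 1: the list [4, 3, 2, 1]
# Version 0 is still fully accessible: persistence came for free.
assert car(v0) == 3 and car(v1) == 4 and cdr(v1) is v0
```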

Since the seminal paper of Driscoll, Sarnak, Sleator, and Tarjan (DSST) [18], there has been considerable development of persistent data structures over the past fifteen years. Persistent data structures have important applications in various areas such as functional programming, computational geometry, and other algorithmic application areas.

The research on persistent data structures splits into two main tracks. The first track is designing general transformations that make any ephemeral data structure persistent while introducing low overhead in space and time. The second track is making specific data structures, such as lists and search trees, persistent. The seminal work of DSST mainly addresses the question of finding a general transformation to make any data structure persistent. In addition, DSST address the special case of making search trees persistent in particular. For search trees they obtain a result which is better than what one gets by simply applying their general transformation to, say, red-black trees.

There is a naive scheme to make any data structure persistent. This scheme performs the operations exactly as they would have been performed in an ephemeral setting, but before each update operation it makes new copies of all input versions. Then it performs the update on the new copies. This scheme is obviously inefficient, as it takes time and space at least linear in the size of the input versions.
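
A sketch of the naive scheme, with illustrative names and assuming each update is given as a procedure acting on the ephemeral structure, might look as follows:

```python
import copy

class NaivePersistent:
    """Wrap an ephemeral structure; deep-copy a version before each update."""

    def __init__(self, initial):
        self.versions = [initial]          # version 0

    def update(self, version, op, *args):
        """Apply ephemeral update `op` to a full copy of `version`."""
        new = copy.deepcopy(self.versions[version])   # linear-time copy
        op(new, *args)
        self.versions.append(new)
        return len(self.versions) - 1      # index of the new version

    def access(self, version):
        return self.versions[version]

# Example: a fully persistent list via full copying.
p = NaivePersistent([1, 2, 3])
v1 = p.update(0, list.append, 4)      # version 1: [1, 2, 3, 4]
assert p.access(0) == [1, 2, 3]       # version 0 is untouched
```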

When designing an efficient general transformation to make a data structure persistent, DSST start with the so-called fat node method. In this method each field in the data structure is allowed to store more than one value, and each value is tagged by the version which assigned it to the field. This method is easy to apply when we are interested only in a partially persistent data structure. But when the target is a fully persistent data structure, the lack of a linear order on the versions already makes navigation in a naive implementation of the fat node data structure inefficient. DSST limit the overhead by linearizing the version tree using a data structure of Dietz and Sleator, so that one can determine quickly whether one version precedes another in this linear order.
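
The following sketch illustrates a fat field for partial persistence, under the simplifying assumption that versions are numbered 0, 1, 2, ... (so the version order is linear and binary search suffices; for full persistence one would first linearize the version tree as described above):

```python
import bisect

class FatField:
    """One field of a fat node: its full history of (version, value) pairs."""

    def __init__(self, version, value):
        self.versions = [version]   # version stamps, kept sorted
        self.values = [value]

    def write(self, version, value):
        # Partial persistence: updates only happen in the newest version,
        # so appending keeps the stamps sorted.
        self.versions.append(version)
        self.values.append(value)

    def read(self, version):
        # The value written by the latest version <= the queried one.
        i = bisect.bisect_right(self.versions, version) - 1
        return self.values[i]

f = FatField(0, "a")
f.write(3, "b")
assert f.read(2) == "a" and f.read(5) == "b"
```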

Even when implemented carefully, the fat node method has logarithmic (in the number of versions) time overhead to access or modify a field of a particular node in a particular version. To reduce this overhead DSST described two other methods to make data structures persistent. The simpler one is the node copying method, which is suited to obtaining partially persistent data structures. For obtaining fully persistent data structures they suggest the node splitting method. These methods simulate the fat node method using nodes of constant size. They show that if nodes are large enough (but still of constant size) then the amount of overhead is constant per access or update of a field in the ephemeral data structure.
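
As a much-simplified illustration (ours; the real method is more general), here is node copying for a partially persistent singly linked list in which each node tolerates one extra (version, next) pair; an overflowing node is copied and the copy is spliced in, possibly cascading to its predecessor. Real node copying stores reverse pointers so predecessors are found in O(1) time; for brevity we pass the access path in explicitly:

```python
class Node:
    def __init__(self, value, nxt):
        self.value, self.next, self.extra = value, nxt, None

def get_next(node, version):
    """The next pointer of `node` as seen by `version`."""
    if node.extra is not None and node.extra[0] <= version:
        return node.extra[1]
    return node.next

def set_next(path, new_next, version):
    """Update next of path[-1] in `version`; return the new version's root."""
    for node in reversed(path):
        if node.extra is None:
            node.extra = (version, new_next)
            return path[0]                 # root survives unchanged
        # Node is full: copy it with its newest field values and cascade
        # the pointer change to the predecessor.
        new_next = Node(node.value, new_next)
    return new_next                        # the root itself was copied

n2 = Node(2, None)
n1 = Node(1, n2)                                 # version 0: 1 -> 2
roots = [n1]
roots.append(set_next([n1], Node(3, None), 1))   # version 1: 1 -> 3
roots.append(set_next([n1], Node(4, None), 2))   # version 2 copies node 1
assert get_next(roots[0], 0) is n2 and roots[2] is not n1
```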

These general techniques suggested by DSST have some limitations. First, all these methods, including even the fat node method, fail to work when the data structure has an update operation which combines more than one version and confluent persistence is desired. Furthermore, the node splitting and node copying methods apply only to pointer-based data structures (no arrays) in which each node is of constant size. Since the simulation has to add reverse pointers to the data structure, the methods also require nodes to be of bounded indegree. Last, the node copying and the node splitting techniques have O(1) amortized overhead per update or access of a field in the ephemeral data structure. DSST left open the question of how to make this overhead O(1) in the worst case.

These limitations of the transformations of DSST were addressed by subsequent work. Dietz and Raman [13] and Brodal [5] addressed the question of bounding the worst-case overhead of an access or an update of a field. For partial persistence, Brodal gives a way to implement node copying such that the overhead is O(1) in the worst case. For full persistence, the question of whether there is a transformation with O(1) worst-case overhead is still unresolved.

The question of making data structures that use arrays persistent with less than logarithmic overhead per step has been addressed by Dietz [12]. Dietz shows how to augment the fat node method with a data structure of van Emde Boas, Kaas, and Zijlstra [33, 34] to obtain an efficient fully persistent implementation of an array. With this implementation, if we denote by m the number of updates, then each access takes O(log log m) time, an update takes O(log log m) expected amortized time, and the space is linear in m. Since we can model the memory of a RAM by an array, this transformation of Dietz can make any data structure persistent with slowdown doubly logarithmic in the number of updates to memory.
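
The flavor of this construction can be seen in the following partially persistent array sketch (names are illustrative), where each cell keeps its whole history and a read does an O(log m) binary search over version stamps; Dietz's fully persistent structure replaces this search with a van Emde Boas predecessor query over linearized version stamps to reach the O(log log m) bounds quoted above:

```python
import bisect

class PersistentArray:
    """Partially persistent array: every cell keeps its full history."""

    def __init__(self, size):
        self.versions = [[0] for _ in range(size)]   # stamps per cell
        self.values = [[None] for _ in range(size)]  # values per cell
        self.latest = 0

    def write(self, index, value):
        self.latest += 1
        self.versions[index].append(self.latest)
        self.values[index].append(value)
        return self.latest                 # the new version's number

    def read(self, index, version):
        # Predecessor search over the cell's version stamps.
        i = bisect.bisect_right(self.versions[index], version) - 1
        return self.values[index][i]

a = PersistentArray(4)
v1 = a.write(2, "x")
v2 = a.write(2, "y")
assert a.read(2, v1) == "x" and a.read(2, v2) == "y" and a.read(2, 0) is None
```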

The question of how to make a data structure with an operation that combines versions confluently persistent has been recently addressed by Fiat and Kaplan [19]. Fiat and Kaplan point out the fundamental difference between fully persistent and confluently persistent data structures. Consider the naive scheme described above and assume that each update operation creates a constant number of new nodes. Then, as long as no update operation combines more than one version, the size of any version created by the naive scheme is linear in the number of versions. However, when updates combine versions, the size of a single version can be exponential in the number of versions. This happens in the simple case where we update a linked list by concatenating it to itself n times. If the initial list is of size one then the final list after n concatenations is of size 2^n.
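
The blow-up is easy to reproduce: represent each catenation as a constant-size node pointing to its two (possibly identical) operands, so the version DAG stays small while the represented list grows exponentially. A toy demonstration:

```python
class Cat:
    """A catenation node representing left ++ right; operands are shared."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.size = left.size + right.size

class Leaf:
    size = 1

v = Leaf()
for _ in range(30):
    v = Cat(v, v)        # confluent update: combines a version with itself

print(v.size)            # 2**30 elements, from only 30 constant-size updates
```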

Fiat and Kaplan prove by a simple information-theoretic argument that for any general reduction to make a data structure confluently persistent there is a DAG of versions which cannot be represented using only constant space per assignment. Specifically, Fiat and Kaplan define the effective depth of the DAG, which is the logarithm of the maximum number of different paths from the root of the DAG to any particular vertex. They show that the number of bits that may be required per assignment is at least as large as the effective depth of the DAG. Fiat and Kaplan also give several methods to make a data structure confluently persistent. The simplest method has time and space overhead proportional to the depth of the DAG. Another method has overhead proportional to the effective depth of the DAG and degenerates to the fat node method when the DAG is a tree. The last method reduces the time overhead to be polylogarithmic in either the depth of the DAG or the effective depth of the DAG, at the cost of using randomization and somewhat more space.

The work on making specific data structures persistent started even prior to the work of DSST. Dobkin and Munro [16] considered a persistent data structure for computing the rank of an object in an ordered set of elements subject to insertions and deletions. Overmars [29] improved the time bounds of Dobkin and Munro and further reduced the storage for the case where we only want to determine whether an element is in the current set or not. Chazelle [8] considered finding the predecessor of a new element in the set. As we already mentioned, DSST suggest two different ways to make search trees persistent. The more efficient of their methods has an O(log n) worst-case time bound and an O(1) worst-case space bound per update.

A considerable amount of work has been devoted to the question of how to make concatenable double ended queues (deques) confluently persistent. Without catenation, one can make deques fully persistent either by the general techniques of DSST or via real-time simulation of the deque using stacks (see [23] and the references there). Once catenation is added, the problem of making stacks or deques persistent becomes much harder, and the methods mentioned above fail. A straightforward use of balanced trees gives a representation of persistent catenable deques in which an operation on a deque or deques of total size n takes O(log n) time. Driscoll, Sleator, and Tarjan [17] combined a tree representation with several additional ideas to obtain an implementation of persistent catenable stacks in which the kth operation takes O(log log k) time. Buchsbaum and Tarjan [7] used a recursive decomposition of trees to obtain two implementations of persistent catenable deques. The first has a time bound of 2^{O(log* k)} and the second a time bound of O(log* k) for the kth operation, where log* k is the iterated logarithm, defined by log^(1) k = log_2 k, log^(i) k = log log^(i-1) k for i > 1, and log* k = min{i | log^(i) k ≤ 1}.

Finally, Kaplan and Tarjan [23] gave a real-time, purely functional (and hence confluently persistent) implementation of deques with catenation in which each operation takes O(1) time in the worst case. A related structure which is simpler, but not purely functional and with only an amortized constant time bound on each operation, has been given by Kaplan, Okasaki, and Tarjan [21]. A key ingredient in the results of Kaplan and Tarjan and the result of Kaplan, Okasaki, and Tarjan is an algorithmic technique related to the redundant digital representations devised to avoid carry propagation in binary counting [9]. For the case where removing elements from one side of the deque is disallowed, Okasaki [28] suggested another confluently persistent implementation with an O(1) time bound for every operation. This technique is related to the path reversal technique used in some union-find data structures [32].
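
To give the flavor of the redundant counting technique, here is a toy counter (ours, following the general idea rather than the papers' exact machinery) with digits {0, 1, 2} and the invariant that between any two 2s lies a 0; an increment first fixes the lowest 2 and then adds 1 to the lowest digit, so no carry ever cascades:

```python
def increment(digits):
    """Add 1 to a redundant binary number, digits in {0, 1, 2}, least
    significant digit first. Only O(1) digits change per increment; a
    real-time version would keep pointers to the 2s instead of scanning
    for the lowest one, as we do here for simplicity."""
    for i, d in enumerate(digits):
        if d == 2:                  # fix the lowest 2: 2 -> 0, carry 1
            digits[i] = 0
            if i + 1 == len(digits):
                digits.append(1)
            else:
                digits[i + 1] += 1
            break
    digits[0] += 1                  # the invariant keeps this <= 2

def value(digits):
    return sum(d << i for i, d in enumerate(digits))

ds = [0]
for n in range(1, 1000):
    increment(ds)
    assert value(ds) == n and max(ds) <= 2
```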

Search trees also support catenation and split operations [31], and therefore a confluently persistent implementation of search trees is natural to ask for. Search trees can be made persistent, and even confluently persistent, using the path copying technique [18]. In path copying you copy every node that changes while updating the search tree, together with its ancestors. Since an update to a search tree affects only a single path, this technique results in copying at most one path, and thereby costs logarithmic time and space per update. Making finger search trees confluently persistent is more of a challenge, as we want to prevent the update operation from propagating up along the leftmost and rightmost spines of the tree; this allows an update at distance d from the beginning or end of the list to take O(log d) time. Kaplan and Tarjan [22] used the redundant counting technique to make finger search trees confluently persistent. Using the same technique they also managed to reduce the time (and space) overhead of catenation to O(log log n), where n is the number of elements in the larger tree.
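
A minimal sketch of path copying on an (unbalanced, for brevity) binary search tree; each insertion copies only the nodes on its search path and shares every untouched subtree with the previous version:

```python
class BST:
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(root, key):
    """Return the root of a new version; the old version is untouched."""
    if root is None:
        return BST(key)
    if key < root.key:
        return BST(root.key, insert(root.left, key), root.right)
    return BST(root.key, root.left, insert(root.right, key))

v0 = None
for k in [5, 2, 8]:
    v0 = insert(v0, k)
v1 = insert(v0, 7)       # v0 still represents exactly {2, 5, 8}
assert v1.right.left.key == 7 and v0.right.left is None
```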

The structure of the rest of this chapter is as follows. Section 31.2 describes a few algorithms that use persistent data structures to achieve their best time or space bounds. Section 31.3 surveys the general methods to make data structures persistent. Section 31.4 highlights the ideas underlying persistent concatenable deques. We conclude in Section 31.5.

Algorithmic Applications of Persistent Data Structures

The basic concept of persistence is general and may arise in any context where one maintains a record of history for backup and recovery, or for any other purpose. However, the most remarkable consequences of persistent data structures are specific algorithms that achieve their best time or space complexities by using a persistent data structure. Most such algorithms solve geometric problems, but there are also examples from other fields. In this section we describe a few of these algorithms.

The most famous geometric application is the algorithm for planar point location by Sarnak and Tarjan [30] that triggered the development of the whole area. In the planar point location problem we are given a subdivision of the Euclidean plane into polygons by n line segments that intersect only at their endpoints. The goal is to preprocess these line segments and build a data structure such that, given a query point, we can efficiently determine which polygon contains it. As is common for this kind of computational geometry problem, we measure a solution by three parameters: the space occupied by the data structure, the preprocessing time (the time it takes to build the data structure), and the query time.

Sarnak and Tarjan suggested the following solution (which builds upon previous ideas of Dobkin and Lipton [15] and Cole [10]). We partition the plane into vertical slabs by drawing a vertical line through each vertex (intersection of line segments) in the planar subdivision. Notice that the line segments of the subdivision intersecting a slab are totally ordered. Now it is possible to answer a query by two binary searches. One binary search locates the slab that contains the query point, and another binary search locates the segment preceding the query point within the slab. If we associate with each segment within a slab the polygon just above it, then we have located the answer to the query. If we represent the slabs by a binary search tree sorted from left to right, and the segments within each slab by a binary search tree sorted from bottom to top, we can answer a query in O(log n) time. However, if we build a separate search tree for each slab then the worst-case space requirement is Ω(n^2), as Ω(n) line segments may each intersect Ω(n) slabs.

The key observation is that the sets of line segments intersecting adjacent slabs are similar. If we have the set of one particular slab, we can obtain the set of the slab to its right by deleting the segments that end at the boundary between these slabs, and inserting the segments that start at that boundary. As we sweep all the slabs from left to right we get that in total there are n deletions and n insertions; one deletion and one insertion for every line segment. This observation reduces the planar point location problem to the problem of maintaining partially persistent search trees. Sarnak and Tarjan [30] suggested a simple implementation of partially persistent search trees in which each update takes O(log n) amortized time and consumes O(1) amortized space. Using these search trees they obtained a data structure for planar point location that requires O(n) space, takes O(n log n) time to build, and can answer each query in O(log n) time.
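
The following self-contained toy (ours, with illustrative names) shows the sweep. For brevity, segments are reduced to numeric keys, whereas real code would compare segments by their height within the current slab, and would use a balanced persistent tree with O(1) amortized space per update rather than this unbalanced path-copying one:

```python
import bisect

class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def insert(t, k):
    if t is None:
        return Node(k)
    if k < t.key:
        return Node(t.key, insert(t.left, k), t.right)
    return Node(t.key, t.left, insert(t.right, k))

def delete(t, k):
    if t is None:
        return None
    if k < t.key:
        return Node(t.key, delete(t.left, k), t.right)
    if k > t.key:
        return Node(t.key, t.left, delete(t.right, k))
    if t.left is None:
        return t.right
    if t.right is None:
        return t.left
    m = t.right
    while m.left is not None:          # successor of t
        m = m.left
    return Node(m.key, t.left, delete(t.right, m.key))

def floor(t, q):
    """Largest key <= q in version t, or None."""
    best = None
    while t is not None:
        if t.key <= q:
            best, t = t.key, t.right
        else:
            t = t.left
    return best

# Sweep: one persistent version per slab, versions sharing structure.
boundaries = [0, 1, 2, 3]
starting = {0: [10.0], 1: [20.0], 2: [15.0], 3: []}
ending   = {0: [],     1: [],     2: [20.0], 3: [10.0]}
root, slab_version = None, []
for x in boundaries:
    for s in ending[x]:
        root = delete(root, s)
    for s in starting[x]:
        root = insert(root, s)
    slab_version.append(root)

# Query point (qx, qy): binary search for the slab, then for the
# segment just below qy in that slab's version.
qx, qy = 2.5, 17.0
i = bisect.bisect_right(boundaries, qx) - 1
print(floor(slab_version[i], qy))      # -> 15.0
```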

The algorithm of Sarnak and Tarjan for planar point location in fact suggests a general technique for transforming a 2-dimensional geometric search problem into a persistent data structure problem. Indeed, several applications of this technique have emerged since Sarnak and Tarjan published their work [3]. As another example, consider the problem of 3-sided range searching in the plane. In this problem we preprocess a set S of n points in the plane so that, given a triple (a, b, c) with a ≤ b, we can efficiently report all points (x, y) ∈ S such that a ≤ x ≤ b and y ≤ c. The priority search tree of McCreight [26] yields a solution to this problem with O(n) space, O(n log n) preprocessing time, and O(log n) time per query.

Using persistent data structures, Boroujerdi and Moret [3] suggest the following alternative. Let y1 ≤ y2 ≤ ··· ≤ yn be the y-coordinates of the points in S in sorted order. For each i, 1 ≤ i ≤ n, we build a search tree containing the i points (x, y) ∈ S with y ≤ yi, and associate that tree with yi. Given this collection of search trees, we can answer a query (a, b, c) in O(log n) time by two binary searches. One search uses the y-coordinate of the query to find the largest i such that yi ≤ c. Then we use the search tree associated with yi to find all points (x, y) in it with a ≤ x ≤ b. If we use partially persistent search trees then we can build the trees using n insertions, so the space requirement is O(n) and the preprocessing time is O(n log n).

This technique of transforming a 2-dimensional geometric search problem into a persistent data structure problem requires only a partially persistent data structure. This is because we only need to modify the last version while doing the sweep. Applications of fully persistent data structures are less common. However, a few interesting ones do exist.

One such algorithm that uses a fully persistent data structure is the algorithm of Alstrup et al. for the binary dispatching problem [1]. In object oriented languages there is a hierarchy of classes (types) and method names are overloaded (i.e., a method may have different implementations for different types of its arguments). At run time, when a method is invoked, the most specific implementation which is appropriate for the arguments has to be activated. This is a critical component of execution performance in object oriented languages. Here is a more formal specification of the problem.

We model the class hierarchy by a tree T with n nodes, each representing a class. A class A which is a descendant of B is more specific than B; we denote this relation by A ≤ B, or by A < B if we know that A ≠ B. In addition we have m different implementations of methods, where each such implementation is specified by a name, a number of arguments, and the type of each argument. We shall assume that m > n; if that is not the case, we can map nodes that do not participate in any method to their closest ancestor that does participate, in O(n) time. A method invocation is a query of the form s(A1, ..., Ad) where s is a method name that has d arguments with types A1, ..., Ad, respectively. An implementation s(B1, ..., Bd) is applicable for s(A1, ..., Ad) if Ai ≤ Bi for every 1 ≤ i ≤ d.

The most specific method which is applicable for s(A1, ..., Ad) is the method s(B1, ..., Bd) such that Ai ≤ Bi for 1 ≤ i ≤ d, and for any other implementation s(C1, ..., Cd) which is applicable for s(A1, ..., Ad) we have Bi ≤ Ci for 1 ≤ i ≤ d. Note that for d > 1 this may be ambiguous, i.e., we might have two applicable methods s(B1, ..., Bd) and s(C1, ..., Cd) where, for some i and j, Bi ≠ Ci, Bj ≠ Cj, Bi ≤ Ci, and Cj ≤ Bj. The dispatching problem is to find for each invocation the most specific applicable method if it exists. If it does not exist, or in case of ambiguity, “no applicable method” or “ambiguity” has to be reported, respectively. In the binary dispatching problem d = 2, i.e., we assume that all implementations and invocations have two arguments.

Alstrup et al. describe a data structure for the binary dispatching problem that uses O(m) space, O(m (log log m)^2) preprocessing time, and O(log m) query time. They obtain this data structure by reducing the problem to what they call the bridge color problem. In the bridge color problem the input consists of two trees T1 and T2 with edges, called bridges, connecting vertices in T1 to vertices in T2. Each bridge is colored by a subset of colors from C. The goal is to construct a data structure which supports queries of the following form: given a triple (v1, v2, c) where v1 ∈ T1, v2 ∈ T2, and c ∈ C, find the bridge (w1, w2) such that 1. v1 ≤ w1 in T1, v2 ≤ w2 in T2, and c is one of the colors associated with (w1, w2), and

2. there is no other such bridge (w′1, w′2) with v2 ≤ w′2 < w2 or v1 ≤ w′1 < w1.

If there is no bridge satisfying the first condition the query just returns nothing, and if there is a bridge satisfying the first condition but not the second we report “ambiguity”. We reduce the binary dispatching problem to the bridge color problem by taking T1 and T2 to be copies of the class hierarchy T of the dispatching problem. The set of colors is the set of different method names. (Recall that each method name may have many implementations for different pairs of types.) We make a bridge (v1, v2) between v1 ∈ T1 and v2 ∈ T2 whenever there is an implementation of some method for the classes v1 and v2. We color the bridge by all names of methods which have an implementation specific to the pair of types (v1, v2). It is now easy to see that when we invoke a method s(A1, A2), the most specific implementation of s to activate corresponds to the bridge colored s connecting an ancestor of A1 in T1 to an ancestor of A2 in T2 which also satisfies Condition (2) above.

In a way somewhat similar to the reduction from a static two-dimensional problem to a dynamic one-dimensional problem in the plane sweep technique above, Alstrup et al. reduce the static bridge color problem to a similar dynamic problem on a single tree which they call the tree color problem. In the tree color problem we are given a tree T and a set of colors C. At any time each vertex of T has a set of colors associated with it. We want a data structure which supports the updates color(v, c), which adds the color c to the set associated with v, and uncolor(v, c), which deletes the color c from the set associated with v. The query we support is: given a vertex v and a color c, find the closest ancestor of v that has color c.
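
A deliberately naive sketch fixing the semantics of these operations (queries walk up the tree in O(depth) time; the actual structure of Alstrup et al. [1] is far more efficient). We take an ancestor of v to include v itself, which is the convention the reduction below needs:

```python
class TreeColor:
    """Naive tree color structure: per-vertex color sets, ancestor walk."""

    def __init__(self, parent):
        self.parent = parent               # parent[v] per vertex; root: None
        self.colors = [set() for _ in parent]

    def color(self, v, c):
        self.colors[v].add(c)

    def uncolor(self, v, c):
        self.colors[v].discard(c)

    def query(self, v, c):
        """Closest ancestor of v (including v) colored c, or None."""
        while v is not None:
            if c in self.colors[v]:
                return v
            v = self.parent[v]
        return None

#         0
#        / \
#       1   2
t = TreeColor([None, 0, 0])
t.color(0, "red")
t.color(2, "blue")
assert t.query(1, "red") == 0 and t.query(1, "blue") is None
```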

The reduction from the bridge color problem to the tree color problem is as follows. With each node v ∈ T1 we associate an instance L_v of the tree color problem where the underlying tree is T2 and the set of colors C is the same as for the bridge color problem. The label of a node w ∈ T2 in L_v contains the color c if w is an endpoint of a bridge with color c whose endpoint in T1 is an ancestor of v. For each pair (w, c), where w ∈ T2 and c is a color associated with w in L_v, we also keep the closest ancestor v′ of v in T1 such that there is a bridge (v′, w) colored c. We can use a large (sparse) array indexed by pairs (w, c) to map each such pair to its associated vertex. We denote this additional data structure associated with v by a_v. Similarly, for each vertex u ∈ T2 we define an instance L_u of the tree color problem in which the underlying tree is T1, together with the associated array a_u.

We can answer a query (v1, v2, c) to the bridge color data structure as follows. We query the data structure L_{v1} with v2 to see if there is an ancestor of v2 colored c in the coloring of T2 defined by L_{v1}. If so, we use the array a_{v1} to find the bridge (w1, w2) colored c with v1 ≤ w1 and v2 ≤ w2 where w1 is as close as possible to v1. Similarly, we use the data structures L_{v2} and a_{v2} to find the bridge (w1, w2) colored c with v1 ≤ w1 and v2 ≤ w2 where w2 is as close as possible to v2, if it exists. Finally, if both bridges are identical then we have the answer to the query (v1, v2, c) to the bridge color data structure. Otherwise, either there is no such bridge or there is an ambiguity (when the two bridges are different).

The problem with this reduction is its large space requirement if we represent each data structure L_v and a_v, for v ∈ T1 ∪ T2, independently. The crucial observation, though, is that these data structures are strongly related. Thus, if we use a dynamic data structure for the tree color problem, we can obtain the data structure corresponding to a node from the data structure corresponding to its parent using a small number of modifications. Specifically, suppose we have generated the data structures L_v and a_v for some v ∈ T1, and let w be a child of v in T1. We can construct L_w by traversing all bridges with one endpoint at w: for each such bridge (w, u) colored c, we perform color(u, c) and update the entry of (u, c) in a_v to contain w.

So if we use fully persistent arrays and a fully persistent data structure for the tree color problem, we can construct all the data structures mentioned above while doing only O(m) updates to these persistent data structures. Alstrup et al. [1] describe a data structure for the tree color problem where each update takes O(log log m) expected time and each query takes O(log m / log log m) time. The space is linear in the sum of the sizes of the color sets of the vertices. To make it persistent without consuming too much space, Alstrup et al. [1] show how to modify the data structure so that each update makes O(1) memory modifications in the worst case (while using somewhat more space). Then, by applying the technique of Dietz [12] (see also Section 31.3.3) to this data structure, we can make it fully persistent. The time bounds for updates and queries increase by a factor of O(log log m), and the total space is O(|C|m). Similarly, we can make the associated arrays a_v fully persistent.

The resulting solution to the binary dispatching problem takes O(m (log log m)^2) time to construct, requires O(|C|m) space, and supports a query in O(log m) time. Since the number of memory modifications while constructing the data structure is only O(m), Alstrup et al. also suggest that the space can be further reduced to O(m) by maintaining the entire memory as a dynamic perfect hashing data structure.

Fully persistent lists proved useful in reducing the space requirements of a few three-dimensional geometric algorithms based on the sweep line technique, in which the items on the sweep line have secondary lists associated with them. Kitsios and Tsakalidis [25] considered hidden line elimination and hidden surface removal. The input is a collection of (non-intersecting) polygons in three dimensions. The hidden line problem asks for the parts of the edges of the polygons that are visible from a given viewing position. The hidden surface removal problem asks for the parts of the polygons that are visible from the viewing position.

An algorithm of Nurmi [27] solves these problems by projecting all polygons into a collection of possibly intersecting polygons in the plane and then sweeping this plane, stopping at every vertex of a projected polygon or crossing point of a pair of projected edges. When the sweep stops at such a point, the visibility status of its incident edges is determined. The algorithm maintains a balanced binary tree which stores the edges cut by the sweep line in sorted order along the sweep line. With each such edge it also maintains another balanced binary tree over the faces that cover the interval between the edge and its successor edge on the sweep line. These faces are ordered in increasing depth order along the line of sight.

An active edge is visible if the topmost face in its list is different from the topmost face in the list of its predecessor. If n is the number of vertices of the input polygons and I is the number of intersections of edges on the projection plane, then the sweep line stops at n + I points. Looking more carefully at the updates one has to perform during the sweep, we observe that a constant number of update operations on balanced binary search trees has to be performed non-destructively at each point. Thus, using fully persistent balanced search trees, one can implement the algorithm in O((n + I) log n) time and O(n + I) space. Kitsios and Tsakalidis also show that by rebuilding the data structure from scratch every O(n) updates, the space requirement can be reduced to O(n) while retaining the same asymptotic running time.

A similar technique has been used by Bozanis et al. [4] to reduce the space requirement of an algorithm of Gupta et al. [20] for the rectangular enclosure reporting problem. In this problem the input is a set S of n rectangles in the plane whose sides are parallel to the axes.

The algorithm has to report all pairs (R, R′) of rectangles where R, R′ ∈ S and R encloses R′. The algorithm uses the equivalence between the rectangle enclosure reporting problem and the 4-dimensional dominance problem. In the 4-dimensional dominance problem the input is a set P of n points in four-dimensional space. A point p = (p1, p2, p3, p4) dominates p′ = (p′1, p′2, p′3, p′4) if and only if pi ≥ p′i for i = 1, 2, 3, 4. We ask for an algorithm to report all dominating pairs of points (p, p′), where p, p′ ∈ P and p dominates p′. The algorithm of Gupta et al. first sorts the points by all coordinates and translates the coordinates to ranks, so that the points lie in U^4 where U = {0, 1, 2, ..., n}. It then divides the set into two equal halves R and B according to the fourth coordinate (R contains the points with smaller fourth coordinate). By recursing on B and on R, it finds all dominating pairs (p, p′) where p and p′ are either both in B or both in R. Finally, it finds all dominating pairs (r, b) where r ∈ R and b ∈ B by iterating a plane sweeping algorithm on the three-dimensional projections of the points in R and B. During the sweep, for each point in B, a list of the points it dominates in R is maintained. The size of these lists may potentially be as large as the output size, which in turn may be quadratic. Bozanis et al. suggest reducing the space by making these lists fully persistent and periodically rebuilding them.
