General Techniques for Making Data Structures Persistent
We start in Section 31.3.1 by describing the fat node simulation. This simulation yields fully persistent data structures with optimal space expansion, but with a time slowdown logarithmic in the number of versions. Section 31.3.2 describes the node copying and node splitting methods, which reduce the time slowdown to a constant while increasing the space expansion only by a constant factor. In Section 31.3.3 we address the question of making arrays persistent. Finally, in Section 31.3.4 we describe a simulation that makes data structures confluently persistent.
The Fat Node Method

DSST first considered the fat node method. The fat node method works by allowing a field in a node of the data structure to contain a list of values. In a partially persistent setting we associate the field value x with version number i if x was assigned to the field in the update operation that created version i. We keep this list of values sorted by increasing version number in a search tree. In this method simulating an assignment takes O(1) space, and O(1) time if we maintain a pointer to the end of the list. An access step takes O(log m) time, where m is the number of versions.
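To make the mechanics concrete, here is a minimal Python sketch of a single fat-node field in the partially persistent setting, with a sorted array and binary search standing in for the search tree; the class and method names are ours, purely illustrative:

```python
import bisect

class FatField:
    """One fat-node field for partial persistence (a sketch, not DSST's code).
    Assignments arrive with non-decreasing version numbers, so the history
    stays sorted and appending costs O(1)."""

    def __init__(self):
        self.versions = []   # version stamps, in increasing order
        self.values = []     # values[k] was assigned in versions[k]

    def assign(self, version, value):
        # O(1) time and space: in partial persistence the new version
        # is always the largest so far.
        self.versions.append(version)
        self.values.append(value)

    def read(self, version):
        # O(log m): the value in `version` is the one stamped with the
        # largest version number not exceeding `version`.
        k = bisect.bisect_right(self.versions, version) - 1
        return self.values[k] if k >= 0 else None

f = FatField()
f.assign(1, 'a'); f.assign(3, 'b')
assert f.read(2) == 'a' and f.read(3) == 'b'
```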
The difficulty with making the fat node method work in a fully persistent setting is the lack of total order on the versions. To eliminate this difficulty, DSST impose a total order on the versions consistent with the partial order defined by the version tree. They call this total order the version list. When a version i is created it is inserted into the version list immediately after its parent (in the version tree). This implies that the version list defines a preorder on the version tree where for any version i, the descendants of i in the version tree occur consecutively in the version list, starting with i.
The version list is maintained in a data structure that, given two versions x and y, allows us to determine efficiently whether x precedes y. Such a data structure has been suggested by Dietz and Sleator [11].
(See also a simpler related data structure by [2].)
The main idea underlying these data structures is to assign an integer label to each version so that the labels increase monotonically as we go along the list. Some difficulty arises since, in order to use integers from a polynomial range, we occasionally have to relabel some versions. For an efficient implementation we need to control the amount of relabeling being done. We call a data structure that maintains a linear order subject to the operations insert(x, y), which inserts x after y, and order(x, y), which returns “yes” if x precedes y, an Order Maintenance (OM) data structure.
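The following Python sketch shows the OM interface. It is deliberately naive: it relabels everything when it runs out of gaps and finds positions by linear search, where Dietz and Sleator relabel only locally with amortized O(1) cost. It illustrates the interface, not their algorithm:

```python
class OrderMaintenance:
    """Naive OM-list sketch (illustrative names): each version gets an
    integer label; insert(x, y) places x right after y; order(x, y)
    compares labels."""

    GAP = 1 << 16

    def __init__(self, first):
        self.seq = [first]                    # versions in list order
        self.label = {first: self.GAP}

    def insert(self, x, y):
        i = self.seq.index(y)                 # O(m) here; O(1) in real OM structures
        self.seq.insert(i + 1, x)
        lo = self.label[y]
        hi = self.label[self.seq[i + 2]] if i + 2 < len(self.seq) else lo + 2 * self.GAP
        if hi - lo < 2:                       # no free integer between: relabel all
            for j, v in enumerate(self.seq):
                self.label[v] = (j + 1) * self.GAP
        else:
            self.label[x] = (lo + hi) // 2

    def order(self, x, y):
        return self.label[x] < self.label[y]  # "does x precede y?"

om = OrderMaintenance('v0')
om.insert('v1', 'v0')       # version list: v0, v1
om.insert('v2', 'v0')       # v2 enters right after its parent v0: v0, v2, v1
assert om.order('v2', 'v1') and not om.order('v1', 'v2')
```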
As in the partial persistence case we keep a list of version-value pairs in each field. This list contains a pair for each value assigned to the field in any version. These pairs are ordered according to the total order imposed on the versions as described above. We maintain these lists such that the value corresponding to field f in version i is the value associated with the largest version in the list of f that is not larger than i. We can find this version by carrying out a binary search on the list associated with the field using the OM data structure to do comparisons.
To maintain this invariant, the simulation of an update in the fully persistent setting differs slightly from the partially persistent case. Assume we assign a value x to field f in an update that creates version i. (Assume for simplicity that this is the only assignment to f during this update.) First we add the pair (i, x) to the list of pairs associated with field f. Let i′ be the version following i in the version list (i.e., in the total order of all versions), and let i″ be the version following i in the list associated with f. (If i has no successor i′ in the version list we are done.) If i″ does not exist, or if i″ follows i′ in the version list, then the addition of the pair (i, x) to the list of pairs associated with f changes the value of f to x in all versions from i′ up to the version preceding i″ (or to the end of the version list if there is no i″). Since these versions are not descendants of i, their value of f should not change. To fix that, we add another pair (i′, y) to the list associated with f, where y is the value of f before the assignment of x to f. The overhead of the fat node method in a fully persistent setting is O(log m) time and O(1) space per assignment, and O(log m) time per access step, where m is the number of versions.
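A runnable sketch of this assignment rule follows, with a plain Python list standing in for the OM structure (a version's position in the list gives the total order) and linear scans in place of the balanced search trees, so all costs here are inflated; every name is ours:

```python
def fp_assign(vlist, pairs, i, x):
    """Fully persistent fat-node assignment (sketch). `vlist` is the version
    list as a plain Python list; `pairs` holds this field's (version, value)
    pairs, kept sorted by position in `vlist`."""
    pos = {v: k for k, v in enumerate(vlist)}
    old = None                               # value of f at i before the update:
    for v, val in pairs:                     # last pair with position <= pos[i]
        if pos[v] <= pos[i]:
            old = val
    pairs.append((i, x))
    pairs.sort(key=lambda p: pos[p[0]])
    if pos[i] + 1 == len(vlist):
        return                               # no successor i' in the version list
    i_next = vlist[pos[i] + 1]               # i' in the text
    later = [v for v, _ in pairs if pos[v] > pos[i]]   # i'' = later[0], if any
    if not later or pos[later[0]] > pos[i_next]:
        pairs.append((i_next, old))          # re-assert the old value at i'
        pairs.sort(key=lambda p: pos[p[0]])

# usage: v1 and v2 are both children of v0; v2, created last, was inserted
# into the version list right after its parent v0
vlist = ['v0', 'v2', 'v1']
pairs = [('v0', 10)]                 # f was assigned 10 when v0 was created
fp_assign(vlist, pairs, 'v2', 99)    # assign 99 to f in version v2
assert pairs == [('v0', 10), ('v2', 99), ('v1', 10)]   # v1 still reads 10
```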
Next, DSST suggested two methods to reduce the logarithmic time overhead of the fat node method. The simpler one, called node copying, yields a partially persistent data structure. To obtain a fully persistent data structure, DSST suggested the node splitting method.

Node Copying and Node Splitting
The node copying and node splitting methods simulate the fat node method using nodes of constant size. Here we assume that the data structure is a pointer-based data structure in which each node contains a constant number of fields. For reasons that will become clear shortly, we also assume that the nodes have constant bounded in-degree, i.e., the number of pointer fields containing the address of any particular node is bounded by a constant.
In the node copying method we allow nodes in the persistent data structure to hold only a fixed number of field values. When we run out of space in a node, we create a new copy of the node, containing only the newest value of each field. Let d be the number of pointer fields in an ephemeral node and let p be the maximum in-degree of an ephemeral node. Each persistent node contains d fields corresponding to the fields in the ephemeral node, p predecessor fields, e extra fields, where e is a sufficiently large constant that we specify later, and one field for a copy pointer.
All persistent nodes that correspond to the same ephemeral node are linked together in a singly linked list using the copy pointer. Each field in a persistent node has a version stamp associated with it. As we go along the chain of persistent nodes corresponding to one ephemeral node, the version stamps of the fields in one node are no smaller than the version stamps of the fields in the preceding nodes. The last persistent node in the chain is called live. This is the persistent node representing the ephemeral node in the most recent version, which we can still update. In each live node we maintain predecessor pointers: if x is a live node and node z points to x, then we maintain in x a pointer to z.
We update field f in node v, while simulating the update operation creating version i, as follows. Let x be the live persistent node corresponding to v in the data structure. If there is an empty extra field in x then we assign the new value to this extra field, stamp it with version i, and mark it as a value associated with the original field f. If f is a pointer field which now points to a node z, we update the corresponding predecessor pointer in z to point to x. In case all extra fields in x are used (and none of them is stamped with version i) we copy x as follows.
We create a new persistent node y, make the copy pointer of x point to y, store in each original field in y the most recent value assigned to it, and stamp these values with version stamp i. In particular, field f in node y stores its new value marked with version i. For each pointer field in y we also update the corresponding predecessor pointer to point to y rather than to x.
Then we have to update each field pointing to x in version i − 1 to point to y in version i. We follow, in turn, each predecessor pointer in x. Let z be a node pointed to by such a predecessor pointer. We identify the field pointing to x in z and update its value in version i to be y. We also update a predecessor pointer in y to point to z. If the old value of the pointer to x in z is not stamped with version i (in particular this means that z has not been copied), then we try to use an extra field to store the new version-value pair. If there is no free extra field in z, we copy z as above, and then update the field that points to x to point to y in the new copy of z. This sequence of node copyings may cascade, but since each node is copied at most once, the simulation of the update step must terminate. In version i, y is the live node corresponding to v.
A simple analysis shows that if we use at least as many extra fields as predecessor fields at each node (i.e. e ≥ p) then the amortized number of nodes that are copied due to a single update is constant. Intuitively, each time we copy a node we gain e empty extra fields in the live version that “pay” for the assignments that had to be made to redirect pointers to the new copy.
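The following Python sketch is a toy rendering of this method, with illustrative names throughout: each PNode holds original field values, E extra slots, a copy pointer, and predecessor pairs; set_field uses an extra slot when it can, and otherwise copies the node and redirects pointers through the predecessor pairs. (The real method also overwrites a value already stamped with the current version, which we elide, as we do the predecessor bookkeeping for pointer values newly written by the caller.)

```python
class PNode:
    """Persistent node for the node-copying method (sketch)."""
    E = 2                            # extra slots; the analysis wants E >= p

    def __init__(self, fields):
        self.fields = dict(fields)   # field name -> (version, value)
        self.extra = []              # [(version, field, value)], newest last
        self.copy = None             # next node in this ephemeral node's copy chain
        self.preds = []              # (node, field) pairs whose pointers target us

def live(node):
    while node.copy is not None:     # the last node in the copy chain is live
        node = node.copy
    return node

def newest(x, f):
    val = x.fields[f][1]
    for _, g, v in x.extra:          # extras are newer than the original values
        if g == f:
            val = v
    return val

def set_field(node, field, value, version):
    """Assign in the update creating `version`; returns the node now
    holding the value."""
    x = live(node)
    if len(x.extra) < PNode.E:       # room left: use an extra slot
        x.extra.append((version, field, value))
        return x
    # node full: copy it, keeping only the newest value of every field
    y = PNode({f: (version, newest(x, f)) for f in x.fields})
    y.fields[field] = (version, value)
    x.copy = y
    for z, g in x.preds:             # redirect every pointer to x; may cascade
        holder = set_field(z, g, y, version)
        y.preds.append((holder, g))
    return y
```

Here E plays the role of e: the amortized argument above needs E ≥ p, so that the fresh extra slots gained by each copy pay for the pointer redirections the copy triggers.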
A similar simulation, called the node splitting method, makes a data structure fully persistent with O(1) amortized overhead in time and space. The details, however, are somewhat more involved, so we only sketch the main ideas. Here, since we need predecessor pointers for every version, it is convenient to think of the predecessor pointers as part of the ephemeral data structure, and to apply the simulation to the so-called augmented ephemeral data structure.
We represent each fat node by a list of persistent nodes, each of constant size, with twice as many extra pointers as original fields in the corresponding node of the augmented ephemeral data structure. The values in the fields of the persistent nodes are ordered by the version list. Thus each persistent node x is associated with an interval of versions in the version list, called the valid interval of x, and it stores all values of its fields that fall within this interval. The first among these values is stored in an original field and the following ones occupy extra fields.
The key idea underlying this simulation is to keep the pointers in the persistent structure consistent, so that when we traverse a pointer valid in version i we arrive at a persistent node whose valid interval contains version i. More precisely, a value c of a pointer field must indicate a persistent node whose valid interval contains the valid interval of c.
We simulate an update step to field f, while creating version i from version p(i), as follows. If there is already a persistent node x containing f stamped with version i, then we merely change the value of f in x. Otherwise, let x be the persistent node whose valid interval contains version i, and let i+ be the version following i in the version list. Assume first that the node following x does not have version stamp i+. We create two new persistent nodes x′ and x″, and insert them into the list of persistent nodes of x, such that x′ follows x and x″ follows x′. We give node x′ version stamp i and fill all its original fields with their values at version i. The extra fields in x′ are left empty. We give x″ version stamp i+ and fill its original fields with their values at version i+. We move from the extra fields of x all values with version stamps following i+ in the version list to x″. If the node following x in its list already has version stamp i+, then x″ is not needed.
After this first stage of the update step, values of pointer fields previously indicating x may be inconsistent. The simulation then continues to restore consistency. We locate all nodes containing inconsistent values and insert them into a set S. Then we pull out one node at a time from S and fix its values. To fix a value we may have to replace it with two or more values, each valid in a subinterval of the valid interval of the original value. This increases the number of values that have to be stored at the node, so we may have to split the node. This splitting may cause more values to become inconsistent, so node splitting and consistency fixing cascade until consistency is completely restored. The analysis is based on the fact that each node splitting produces a node with sufficiently many empty extra fields. For further details see [18].
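A much-simplified, runnable Python sketch of only the first stage follows (extra fields are ignored, so every value lives in an "original" field, and the consistency-restoration stage is omitted; all names are ours):

```python
def split_update(fatnode, vlist, i, field, value):
    """First stage of a node-splitting update (sketch). `fatnode` is a list
    of (stamp, fields) pairs sorted by position in the version list `vlist`;
    each pair is a constant-size persistent node valid from its stamp up to
    the next node's stamp."""
    pos = {v: k for k, v in enumerate(vlist)}
    # x: the persistent node whose valid interval contains version i
    k = max(j for j, (s, _) in enumerate(fatnode) if pos[s] <= pos[i])
    stamp, fields = fatnode[k]
    if stamp == i:                           # node stamped i exists: update in place
        fields[field] = value
        return
    i_next = vlist[pos[i] + 1] if pos[i] + 1 < len(vlist) else None
    new_fields = dict(fields)
    new_fields[field] = value
    new_nodes = [(i, new_fields)]            # x': all field values at version i
    nxt_stamp = fatnode[k + 1][0] if k + 1 < len(fatnode) else None
    if i_next is not None and nxt_stamp != i_next:
        new_nodes.append((i_next, dict(fields)))   # x'': restores old values at i+
    fatnode[k + 1:k + 1] = new_nodes

# usage: fat node valid from v0; update field 'x' in version v1 (v2 follows v1)
vlist = ['v0', 'v1', 'v2']
fatnode = [('v0', {'x': 10})]
split_update(fatnode, vlist, 'v1', 'x', 99)
assert fatnode == [('v0', {'x': 10}), ('v1', {'x': 99}), ('v2', {'x': 10})]
```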
Making Arrays Persistent

Dietz [12] describes a general technique for making arrays persistent. In his method it takes O(log log m) time to access the array and O(log log m) expected amortized time to change the content of an entry, where m is the total number of updates. The space is linear in m. We denote the size of the array by n and assume that n < m.
Dietz essentially suggests thinking of the array as one big fat node with n fields. The list of version-value pairs describing the assignments to each entry of the array is represented in a data structure of van Emde Boas et al. [33, 34]. This data structure is made to consume space linear in the number of items using dynamic perfect hashing [14]. Each version is encoded in this data structure by its label in the associated Order Maintenance (OM) data structure. (See Section 31.3.1.)
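Here is a small sketch of the per-entry history lists, with sorted Python lists and binary search standing in for the van Emde Boas structures (so lookups cost O(log m) here instead of O(log log m)), and with the simplifying assumption that version labels are fixed integers that never change:

```python
import bisect

class PersistentArray:
    """Persistent array sketch (illustrative names, not Dietz's code)."""

    def __init__(self, n):
        # per entry: (sorted version labels, parallel values); label 0 = initial
        self.hist = [([0], [None]) for _ in range(n)]

    def write(self, index, version_label, value):
        labels, values = self.hist[index]
        k = bisect.bisect_right(labels, version_label)
        labels.insert(k, version_label)
        values.insert(k, value)

    def read(self, index, version_label):
        labels, values = self.hist[index]
        # value at the largest label not exceeding version_label
        k = bisect.bisect_right(labels, version_label) - 1
        return values[k]

pa = PersistentArray(4)
pa.write(2, 5, 'a')                  # in the version labeled 5, entry 2 becomes 'a'
assert pa.read(2, 7) == 'a' and pa.read(2, 4) is None
```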
A problem arises with the solution above since we do not refer to the labels solely via order queries on pairs of versions. Therefore, when the OM data structure changes the label of a version, the old label has to be deleted from the corresponding van Emde Boas data structure and the new label inserted instead. Recall that each of the known OM data structures consists of two levels. The versions are partitioned into sublists of size O(log m). Each sublist gets a label and each version within a sublist gets a label; the final label of a version is the concatenation of these two labels. This data structure supports an insertion in O(1) time. However, an insertion may change the labels of a constant number of sublists and thereby implicitly change the labels of O(log m) versions. Reinserting all these labels into the van Emde Boas structures containing them may take Ω(log m log log m) time.

Dietz suggests solving this problem by bucketizing the van Emde Boas data structures. Consider a list of versions stored in such a data structure. We split the list into buckets of size O(log m). We maintain the versions in each bucket in a regular balanced search tree, and we maintain the smallest version of each bucket in a van Emde Boas data structure. This way we need to delete and reinsert the label of a version in the van Emde Boas data structure only when the minimum of a bucket gets relabeled.
Although there are now only O(m/log m) elements in the van Emde Boas data structures, it could still be the case that we relabel these particular elements too often. This can happen if the sublists that get split in the OM data structure contain a particularly large number of buckets' minima. To prevent that from happening, we modify the OM data structure slightly, as follows.
We associate a potential with each version, which equals 1 if the version is currently not the minimum of its bucket in its van Emde Boas data structure, and equals log log m if it is such a minimum. Notice that since there are only O(m/log m) buckets' minima, the total potential assigned to all versions throughout the process is O(m). We partition the versions into sublists according to their potentials, so that the sum of the potentials of the elements in each sublist is O(log m). We assign labels to the sublists, and within each sublist, as in the original OM data structure. When we have to split a sublist, the work associated with the split, including the required updates on the associated van Emde Boas data structures, is proportional to the increase in the potential of this sublist since it was last split.
Since we can model the memory of a Random Access Machine (RAM) as a large array, this technique of Dietz is in fact general enough to make any data structure on a RAM persistent with double logarithmic overhead per access or update to memory.
Making Data Structures Confluently Persistent
Finding a general simulation to make a pointer-based data structure confluently persistent is a considerably harder task. In a fully persistent setting we can construct any version by carrying out a particular sequence of updates ephemerally. This seemingly innocent fact is already problematic in a confluently persistent setting, where an update may apply to two versions and one has to produce both versions to perform the update. Note that these two versions may originate from the same ancestral version, so we need some form of persistence even to produce a single version. In particular, methods that achieve persistence typically create versions that share nodes. Semantically, however, when an update applies to versions that share nodes, we would like the result to be as if the update were performed on two completely independent copies of the input versions.
In a fully persistent setting, if each operation takes time polynomial in the number of versions then the size of each version is also polynomial in the number of versions. This breaks down in a confluently persistent setting, where even when each operation takes constant time the size of a single version can be exponential in the number of versions. Recall the example of the linked list mentioned in Section 31.1: it is initialized to contain a single node and then concatenated with itself n times, so the size of the last version is 2^n. It follows that any polynomial simulation of a data structure to make it confluently persistent must in some cases represent versions in a compressed form.
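The arithmetic is worth making explicit; under the naive scheme each self-concatenation doubles the version size, while the number of versions grows by only one:

```python
length = 1                      # version 0: a single node
for n in range(1, 6):           # n self-concatenations
    length *= 2                 # concat(L, L) doubles the logical length
    print(n, length)            # version n has 2**n nodes: 2, 4, 8, 16, 32
```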
Consider the naive scheme to make a data structure persistent, which copies the input versions before each update. This method is polynomial in a fully persistent setting when we know that each update operation allocates a number of new nodes that is polynomial in the number of versions. This is not true in a confluently persistent setting, as the linked list example given above shows. Thus there is no easy polynomial method to obtain confluent persistence at all.
What precisely causes this difficulty in obtaining a confluently persistent simulation? Let us first assume a fully persistent setting and the naive scheme mentioned above. Consider a single node x created during the update that constructed version v. Node x exists in version v, and copies of it may also exist in descendant versions of v. Notice, however, that each version derived from v contains only a single node which is either x or a copy of it. In contrast, in a confluently persistent setting a descendant version of v may contain more than a single copy of x. For example, consider the linked list being concatenated to itself as described above. Let x be the node allocated when creating the first version. Then after one concatenation we obtain a version which contains two copies of x, after two concatenations we obtain a version containing four copies of x, and in version n we have 2^n copies of x.
Returning to the fat node method, we can observe that it identifies a node in a specific version using a pointer to a fat node and a version number. This works since in each version there is only one copy of any node, and it thus breaks down in the confluently persistent setting. There we need more than a version number and the address of a fat node to identify a particular node in a particular version.
To address this identification problem Fiat and Kaplan [19] used the notion of pedigree. To define pedigree we need the following notation. We denote the version DAG by D, and the version corresponding to vertex v ∈ D by Dv . Consider the naive scheme defined above.
Let w be some node in the data structure Dv . We say that node w in version v was derived from node y in version u if version u was one of the versions on which the update producing v had been performed, and furthermore node w ∈ Dv was formed by a (possibly empty) set of assignments to a copy of node y ∈ Du.
Let w be a node in some version Du, where Du is produced by the naive scheme. We associate a pedigree with w, denoted p(w). The pedigree p(w) is a path p = (v0, v1, ..., vk = u) in the version DAG such that there exist nodes w0, w1, ..., wk−1, wk = w, where wi is a node of Dvi, w0 was allocated in v0, and wi is derived from wi−1 for 1 ≤ i ≤ k. We also call w0 the seminal node of w, and denote it by s(w). Note that p(w) and s(w) uniquely identify w among all nodes of the naive scheme.
As an example, consider Figure 31.1. We see that version v4 has three nodes (the 1st, 3rd, and 5th nodes of the linked list) with the same seminal node w′. The pedigree of the 1st node in Dv4 is (v0, v1, v3, v4). The pedigree of the 2nd node in Dv4 is also (v0, v1, v3, v4), but its seminal node is w0. Similarly, we can see that the pedigrees of the 3rd and the 5th nodes of Dv4 are (v0, v2, v3, v4) and (v0, v2, v4), respectively.
The basic simulation of Fiat and Kaplan is called the full path method and works as follows. The data structure consists of a collection of fat nodes. Each fat node corresponds to an explicit allocation of a node by an update operation, or in other words, to some seminal node of the naive scheme. For example, the update operations of Figure 31.1 perform 3 allocations (3 seminal nodes), labeled w0, w′, and w″, so our data structure will have 3 fat nodes, f(w0), f(w′), and f(w″). The full path method represents a node w of the naive scheme by a pointer to the fat node representing s(w), together with the pedigree p(w).
Thus a single fat node f represents all nodes sharing the same seminal node. We denote this set of nodes by N(f). Note that N(f) may contain nodes that co-exist within the same version as well as nodes that exist in different versions. A fat node contains the same fields as the corresponding seminal node. Each of these fields, however, rather than storing a single value as in the original node, stores a dynamic table of field values. The simulation will be able to find the correct value in a node w ∈ N(f) using p(w). To specify the representation of a set of values we need the following definition of an assignment pedigree. Let p = (v0, ..., vk = u) be the pedigree of a node w ∈ Du, and let wk = w, wk−1, ..., w1, where wi ∈ Dvi, be the sequence of nodes such that wi is derived from wi−1; this sequence exists by the definition of a node's pedigree. Let A be a field in w and let j be the maximum index such that there has been an assignment to field A in wj during the update that created vj. We define the assignment pedigree of field A in node w, denoted by p(A, w), to be the pedigree of wj, i.e., p(A, w) = (v0, v1, ..., vj).
In the example of Figure 31.1, the nodes contain one pointer field (named next) and one data field (named x). The assignment pedigree of x in the 1st node of Dv4 is simply (v0), the assignment pedigree of x in the 2nd node of Dv4 is likewise (v0), and the assignment pedigree of x in the 3rd node of Dv4 is (v0, v2, v3). Pointer fields also have assignment pedigrees. The assignment pedigree of the pointer field in the 1st node of Dv4 is (v0, v1), the assignment pedigree of the pointer field in the 2nd node of Dv4 is (v0, v1, v3), the assignment pedigree of the pointer field in the 3rd node of Dv4 is (v0, v2), and finally, the assignment pedigree of the pointer field in the 4th node of Dv4 is (v2, v3, v4).
We call the set {p(A, w) | w ∈ N(f)} the set of all assignment pedigrees for field A in a fat node f, and denote it by P(A, f). The table that represents field A in fat node f contains an entry for each assignment pedigree in P(A, f). The value of a table entry, indexed by an assignment pedigree p = (v0, v1, ..., vj), depends on the type of the field, as follows. If A is a data field, then the value stored is the value assigned to A in the node wj ∈ Dvj whose pedigree is p. If A is a pointer field, then let w be the node pointed to by field A after the assignment to A in wj. We store the pedigree of w and the address of the fat node that represents the seminal node of w.
An access pointer to a node w in version v is represented by a pointer to the fat node representing the seminal node of w and the pedigree of w.
In Figure 31.2 we give the fat nodes of the persistent data structure given in Figure 31.1.
For example, the field next has three assignments in nodes of N(f(w′)), so there are three assignment pedigrees in P(next, f(w′)); all three entries appear in the table for next in the fat node f(w′) (Figure 31.2). Similarly, we give the table for field x in f(w′), as well as the tables for both fields in the fat nodes f(w0) and f(w″).
When we traverse the data structure, we are pointing to some fat node f and hold a pedigree q of some node w whose seminal node corresponds to f, and we would like to retrieve the value of field A in node w from the table representing field A in f. We do that as follows. First we identify the assignment pedigree p(A, w) of field A in node w: this is the longest pedigree that is a prefix of q and has an entry in this table. In case A is a data field, the value we are after is simply the value associated with p(A, w). However, if A is a pointer field, then the value stored with p(A, w) may not be the value of A in w: that value identifies a node in the version where the assignment occurred, whereas we are interested in the node that this field points to in the version of w.
Let q = (q0, ..., qk) and let p(A, w) = (q0, q1, ..., qj). Let the value stored with p(A, w) be (t, g), where t is the pedigree of the target node in Dqj and g is the fat node representing the seminal node of this target node. The nodes identified by the pedigrees p(A, w) and t were copied in versions qj+1, ..., qk without any assignment made to field A in the nodes derived from the node whose pedigree is p(A, w). Thus the pedigree of the target node of field A of node w in Dqk is t ∘ (qj+1, ..., qk), where ∘ denotes concatenation.
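A runnable sketch of this retrieval rule, with pedigrees as Python tuples and each field's table as a dictionary keyed by assignment pedigrees (all names are ours; a real implementation uses the trie representation described below):

```python
def get_value(tables, fat, field, q, is_pointer=False):
    """Field retrieval in the full path method (sketch). `tables[fat][field]`
    maps assignment pedigrees (tuples of versions) to values; for a pointer
    field the stored value is a pair (t, target_fat). `q` is the pedigree of
    the node we stand on; some prefix of q is assumed to be in the table
    (the allocating version always contributes one)."""
    table = tables[fat][field]
    # longest prefix of q in the table = the assignment pedigree p(A, w)
    plen = max(n for n in range(1, len(q) + 1) if q[:n] in table)
    value = table[q[:plen]]
    if not is_pointer:
        return value
    t, target_fat = value            # target's pedigree at assignment time
    # the target was copied, untouched, through the remaining versions of q,
    # so its pedigree in our version is t concatenated with that suffix
    return target_fat, t + q[plen:]

# usage (synthetic): field 'next' of fat node F was assigned in pedigree
# ('v0','v1'), pointing to the node with pedigree ('v0',) of fat node G
tables = {'F': {'next': {('v0', 'v1'): (('v0',), 'G')}}}
assert get_value(tables, 'F', 'next', ('v0', 'v1', 'v3', 'v4'),
                 is_pointer=True) == ('G', ('v0', 'v3', 'v4'))
```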
It follows that we need representations for pedigrees, and for the tables representing field values, that support an efficient implementation of the following operations.
1. Given a pedigree q, find the longest prefix of q stored in a table.
2. Given a pedigree q, replace a prefix of q with another pedigree p.
3. Add a pedigree, with a corresponding value, to a table representing some field (this is needed to facilitate updates).
In their simplest simulation, Fiat and Kaplan suggested representing pedigrees as linked lists of version numbers, and representing the tables of field values as tries. Each assignment pedigree contained in a table is represented by a path in the corresponding trie, and the last node of the path stores the associated value. Nodes in the trie can have large degree, so for efficiency the children of each trie node are represented by a splay tree.
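A sketch of such a pedigree trie, with plain dictionaries standing in for the splay trees at each node (class and method names are ours):

```python
class PedigreeTrie:
    """Pedigree table as a trie keyed by version numbers (sketch)."""

    def __init__(self):
        self.children = {}            # version -> child PedigreeTrie
        self.value = None             # meaningful only if has_value is True
        self.has_value = False        # True iff some pedigree ends here

    def add(self, pedigree, value):
        node = self
        for v in pedigree:
            node = node.children.setdefault(v, PedigreeTrie())
        node.value, node.has_value = value, True

    def longest_prefix(self, q):
        """Return (prefix_length, value) for the longest stored prefix of q."""
        node, best = self, (0, None)
        for depth, v in enumerate(q, 1):
            node = node.children.get(v)
            if node is None:
                break
            if node.has_value:
                best = (depth, node.value)
        return best

trie = PedigreeTrie()
trie.add(('v0',), 10); trie.add(('v0', 'v2', 'v3'), 99)
assert trie.longest_prefix(('v0', 'v2', 'v4')) == (1, 10)
```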
Let U be the total number of assignments the simulation performs and consider the update creating version v. Then with this implementation each assignment performed during this update requires O(d(v)) words of size O(log U ) bits, and takes O(d(v) + log U ) time, where d(v) is the depth of v in the DAG. Field retrieval also takes O(d(v) + log U ) time.
The second method suggested by Fiat and Kaplan is the compressed path method. The essence of the compressed path method is a particular partition of the DAG into disjoint trees, defined such that every path enters and leaves any specific tree at most once. The compressed path method encodes a path in the DAG as a sequence of pairs of versions, each pair containing the version where the path enters a tree T and the version where it leaves T. The length of each such representation is O(e(D)), where e(D) denotes the effective depth of the DAG (roughly, the logarithm of the maximum number of distinct paths from the root to any version).
Each value of a field in a fat node is now associated with the compressed representation of the path of the node in N(f) in which the corresponding assignment occurred. A key property of these compressed path representations is that they allow easy implementation of the operations we need to perform on pedigrees, such as replacing a prefix of a pedigree with another pedigree when traversing a pointer. With the compressed path method each assignment requires up to O(e(D)) words, each of O(log U) bits. Searching or updating the trie representing all values of a field in a fat node requires O(e(D) + log U) time. For the case where the DAG is a tree, this method degenerates to the fat node simulation of [18].
Fiat and Kaplan also suggested how to use randomization to speed up their two basic methods, at the expense of a (slightly) larger space expansion and a polynomially small error probability. The basic idea is to encode each path (or compressed path) in the DAG by an integer: we assign to each version a random integer, and the encoding of a path p is simply the sum of the integers that correspond to the versions on p. Each value of a field in a fat node is now associated with the integer encoding the path of the node in N(f) in which the corresponding assignment occurred. To index the values of each field we use a hash table storing all the integers corresponding to these values.
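A sketch of the randomized encoding (names ours). The additive form is the point: a lookup becomes one hash probe instead of a trie descent, and replacing a prefix of a path only requires subtracting and adding sums of version integers, which the balanced-tree representation described next supports efficiently:

```python
import random

class PathEncoder:
    """Encode a path as the sum of random per-version integers (sketch).
    Distinct paths collide only with polynomially small probability."""

    def __init__(self, bits=64):
        self.word = {}                # version -> random integer
        self.bits = bits

    def register(self, version):
        self.word[version] = random.getrandbits(self.bits)

    def encode(self, path):
        return sum(self.word[v] for v in path)

enc = PathEncoder()
for v in ['v0', 'v1', 'v2']:
    enc.register(v)
# a field's table maps encoder output -> value, so lookup is one hash probe
table = {enc.encode(('v0', 'v1')): 'value assigned on path v0 -> v1'}
assert table[enc.encode(('v0', 'v1'))].startswith('value')
```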
To deal with values of pointer fields we have to combine this encoding with a representation of paths in the DAG (or compressed paths) as balanced search trees, whose leaves (in left to right order) contain the random integers associated with the vertices along the path (or compressed path). This representation allows us to perform certain operations on these paths in logarithmic (or poly-logarithmic) time whereas the same operations required linear time using the simpler representation of paths in the non-randomized methods.