String Searching:The Position Heap
We now give a data structure that gives much simpler algorithms, at a cost of slightly in- creasing the worst-case time required for a query. The algorithms can easily be programmed by undergraduate data-structures students.
The data structure is a trie, and has one node for each position in the text. The data structures and algorithms can be modified to give the same bounds for construction and searching, but this undermines the principal advantages, which are simplicity and low mem- ory requirements.
The data structure is closely related to trees that are used for storing hash keys in [3].
Building the Position Heap
Let a string be represented by a trie if it is the label of a path from the root in the trie.
For analyzing the position heap us adopt the convention of indexing the characters of t in descending order, so t = tntn−1...t1. In this case, we let t[i : j] denote titi−1...tj .
The algorithm for constructing the position heap can be described informally as follows.
The positions of t are visited from right to left as a trie is built. At each position i, a new substring z is added to the set of words represented by the trie. To do this, the longest prefix t[i : j] of t[i : 1] that is already represented in the trie is found by indexing into the trie from the root, using the leading letters of t[i : 1], until one can advance no further. A leaf child of the last node of this path is added, and the edge to it is labeled ti+1.
The procedure, PHBuild, is given in Table 30.1. Figure 30.10 gives an illustration.
Querying the Position Heap
Table 30.2 gives a procedure, PHFind, to find all starting positions of p in t, and Figure 30.11 gives an illustration. The worst-case running time of O(|p|2 + k) to find the k occurrences of p is worse than the O(|p| + k) bound for the suffix tree or DAWG.
TABLE 30.1 Constructing the position heap for a string t = titi−1...t1.
PHBuild(t, i) If i = 1 return a single tree node labeled 1 Else Recursively construct the position heap Hl for the suffix t[i − 1, 1].
Let tl = t[i : k] be the maximal prefix of t that is the label of a path originating at the root in the tree. Let u be the last node of this path.
Add a child of u to the tree on edge labeled tk−1, and give it label i.
FIGURE 30.10: Construction of the position heap with PHBuild (Table 30.1). The solid edges reflect the recursively-constructed position heap for the suffix t[9 : 1] of t. To get the position heap for t[10 : 1], we use the heap to find the largest prefix bb of t[10 : 1] that is the label of a path in the tree, and add a new child at this spot to record the next larger prefix bba.
TABLE 30.2 Find all places in a text t where a substring p occurs, given the position heap H for t.
LEMMA 30.9 PHFind returns all positions where p occurs in t.
FIGURE 30.11: Searching for occurrences of a string in the text t in O(|p|2 + k) time with PHFind (Table 30.2).
How the search is conducted depends on whether the search string is the path label of a node in the position heap. One case is illustrated by search string aba, which is the path label of position 11. The only places where aba may match t are at positions given by ancestors and descendants of t. The descendants {11, 15} do not need to be verified, but the proper ancestors {1, 3, 6} must be verified against t at those positions. Of these, only 3 and 6 are matches. The algorithm returns {3, 6, 11, 15}. The other case is illustrated by baba, which is not the path label of a node. Indexing on it yields position 7 and path label bab /= baba. Only the ancestors {2, 5, 7} are candidates, and they must be verified against t. Of these, only 7 is a match. The algorithm returns {7}. Since the ancestor positions occur on the search path, there are O(|p|) of them, and each takes O(|p|) time to verify each of them against t. Descendants can be more numerous, but take O(1) time apiece to retrieve and return, since they do not need to be verified.
Proof Let p = p1p2...pm and let t = tntn−1...t1. Suppose that i is a position in t where p does not occur. Then i /∈ S2. Any node u with position label i has a path label that is a prefix of t[i : 1]. Since p is not a prefix of this string, it is not a prefix of the path label of u, so i /∈ S3. We conclude that i is not among the positions returned by PHFind.
Next, let h be the position of an occurrence of p. Let x = p[1 : j] be the maximal prefix of p that is represented in the position heap of t[h − 1 : 1], where j = 0 if x = λ. If x /= p, then PHBuild created a node with position label h and path label xpj+1 . This is a prefix of p, so h ∈ S1, and, since p occurs at position h, h ∈ S2. If x = p, let y = t[h : k] be the largest prefix of t[h : 1] that is active in t[h − 1 : 1]. Then PHBuild created a node with position label h and path label ytk−1, and h ∈ S3. In either case, h is returned as a position where P occurs. ♦
Time Bounds
LEMMA 30.10 PHFind takes O(|p|2 +k) worst-case time time to return the k occurrences of p in t.
Proof The members of S3 can be retrieved in O(1) time apiece using a depth-first traversal of the subtree rooted at the last node on path P l. Since all nodes of S1 occur on a path whose label is a prefix of p, there are at most m +1 members of S1. Checking them against t to see which are members of S2 takes O(|p|) time apiece, for a total of O(|p|2) time in the worst case. ♦
This time bound overstates what can be expected in practice, since, in most cases, the string is known to match on a prefix, but there is no reason to expect that it will be similar to the position that it is supposed to match in the region beyond this prefix. A good heuristic is to match the string from the end, rather than from the beginning, since the string has a prefix that is already known to match at the position. Checking to see whether a string matches at a given position will usually require examining one or two characters, discovering a mismatch, and rejecting the string.
LEMMA 30.11 PHBuild takes O(|t|h(tR )) time.
Proof If P = (v0, v1, ..., vk ) be a path from the root v0 in the position heap, let P1 = (v0, v1, ..., vLk/2J ), and let P2 = (vLk/2J+1 , vLk/2J+2 , ..., vk ) be the remainder of the path. Let i be the position label of vk , and let hl(i) denote the length of the maximum prefix x of t[i : 1] that occurs at least |x| times in t. The path label y of P1 has an occurrence at the positions labels of each of its descendants, including those on P2, of which there are at least |y|. Therefore, Therefore, |y| = O(hl(i)). The time spent by the algorithm at position I of t is proportional to the length of P , which is O(|y|). Therefore, the time spent by the algorithm adding the node for position i is O(hl(i)), hence the time to build the whole heap
As with the O(|t|h(t)) algorithm for building the DAWG, this time bound is a practical one in most settings, since h(t) is relatively insensitive to long repeated strings or localized areas of the string with many repetitions. Only strings where most areas of the string are repeated many times elsewhere have high values of h(t), and h(t) can be expected to behave like an O(log n) function in most settings.
Improvements to the Time Bounds
In this section, we have given an algorithm for constructing the position heap to O(|t|). We also sketch an approach for finding the occurrences of a string p in t to O(|p| + k) using position heaps. Each of these have tradeoff costs, such as having greater space requirements and being harder to understand.
The position heap has a dual, which we may call the dual heap (see Figure 30.12). They have the same node set: a node has path label x in the heap iff its path label in the dual is the reversal xR of x. We will refer to the position heap as the primal heap when we wish to contrast it to the dual.
It is tempting to think that the dual is just the position heap of the reversal tR of t, but this is not the case. As in the primal heap, the rightmost positions of t are near the root of the dual, but in the primal heap of tR , the leftmost positions of t are near the root. In the primal heap of tR the heap order is inverted, which affects the shape of the tree. Neither the primal nor the dual heap of t is necessarily isomorphic to the primal or dual heap of tR.
FIGURE 30.12: The position heap and the dual heap of the string abbabbb. The node set of both heaps is the same, but the label of the path leading to a node in the dual heap is the reverse of the label of the path leading to it in the position heap.
For PHBuild, the bottleneck is finding the node u whose path label is tl = titi+1...tk . The dual heap allows us to carry out this step more efficiently. We get an O(|t|) time bound for constructing the position heap by simultaneously constructing the position heap and its dual. It is also necessary to label each node v with its depth dv during construction, in addition to its position label, pv . This gives a compact representation of the path label of v if v is not the root: it is t[pv : pv − dv + 1].
During construction, the primal edges are directed from child to parent, while the dual edges are directed from parent to child. The modified procedure, FastPHBuild, is given in Table 30.3.
LEMMA 30.12 FastPHBuild is correct.
TABLE 30.3 Construct the position heap H and its dual D for a string t[i : 1]. Return (H, D, y), where y is a pointer to the node with position label i.
It follows that updating the primal heap to reflect t[i : 1] requires adding a new child y labeled ti−d2 to u in the primal heap. Since w’s path label is the longest proper suffix of y’s path label, w must be the parent of y in the dual. Since its depth is one greater than w’s, dy = d + 1. ♦
LEMMA 30.13 FastPHBuild takes O(|t|) time.
Proof The inductive step takes O(1) time, except for the cost of searching upward from v to find vl. Let k be the distance from vl to v and let kl = k − 1. The cost of searching upward is O(k). The depth of the new node y is dvt + 2, so it is dv − k + 2 ≤ dv + 1. Since v is the node added just before y, the depth of each successive node added increases by at most one and decreases by Θ(k). The total increases are O(|t|), so the sum of k’s over all recursive calls is bounded by this, hence also O(|t|). ♦
On tests we have run on several-megabyte texts, FastPHBuild is noticeably faster than PHBuild. This advantage must be weighed against the fact that the algorithm is slightly more difficult to understand, and uses more memory during construction, to store the dual edges.
By contrast, the algorithm we describe next for finding the positions of p in t in O(|p| + k) time is unlikely to compete in practice with PHFind, since the worst case bound of O(|p|2 + k) for PHFind overstates the typical case. However, it is interesting from a theoretical standpoint.
Let # be a character that is not in Σ. Let t#t denote the concatenation of two copies of t with the special character # in between. To obtain the time bound for PHFind, we may build the position heap of t#t in O(|t|) time using FastPHBuild. Index the positions from |t| to −|t| in descending order. This gives 0 as the position of the # character (see Figure 30.13).
To find the starting positions of p in t, it suffices to find only its positive starting positions in t#t. Suppose that there is a path labeled p that has at most one node with a positive position number. Finding the last node v of the path takes O(|p|) time, and all k positive
FIGURE 30.13: Finding occurrences of p in t in O(|p| + k) time, using a position heap. Because of the extra memory requirements and the good expected performance of the O(|p|2 + k) approach, the algorithm is of theoretical interest only. The trick is to build the position heap of t#t, indexing so that positions in the second occurrence are indexed with negative numbers. To find the occurrences of p in t, it suffices to return only its positive positions in t#t. Indexing into the heap is organized so that positive positions are descendants of nodes that are indexed to. Negative occurrences, which are ancestors, do not need to be verified against the text, eliminating the Θ(|p 2) step of the simpler algorithm.
starting positions are descendants. We can retrieve them in O(k) time. Since we are not required to find negative position numbers where p occurs, we do not have the Θ(|p| ) cost of finding which ancestors of v are actual matches. This gives an O(|p| + k) bound in this case.
Otherwise, the problem can be solved by chopping p into segments {x1, x2, ..., xk } such that each xi is the label of a path from the root in the heap that has exactly one node vi with a positive position number, namely, the last node of the path. Every positive position of xi is matched by a negative position, which must correspond to an ancestor of vi. Since there are at most |xi | ancestors of vi, vi has at most |xi| (positive) descendants, which can be retrieved in O(|xi |) time.
To see that this implies an O(|p|) time bound to return all occurrences of p in t, the reader should first note that a family F of k sets X1, X2, ..., Xk of integers are represented with sorted lists, it takes O(|X1| + |X2| + ...|Xk |) time to find their intersection. The key to this insight is that when two sets in F are merged, replacing them with their intersection, the sum of cardinalities of sets in F drops by an amount proportional to the time to perform the intersection. Therefore, the bound for all intersections is proportional to the sum of cardinalities of the initial lists. The problem of finding the occurrences of p reduces to this one as follows. Let Xi denote the positive positions of segment xi of p. Shift these positions to the left by |x1x2...xi−1 | to find the candidate positions they imply for the left endpoint of p in t. Intersecting the sets of candidates gives the locations of p in t.
To find the substrings {x1, x2, ..., xk } of p, index from the root of the position heap on the leading characters of p until a positive node is reached. The label of this path is x1, and recursing on the remaining suffix of p gives {x2, x3, ..., xk−1 }. It doesn’t give xk , since an attempt to produce xk in this way it may run out of characters of p before a node with a positive position number is reached. Instead, find xk by indexing from the right end of p using the dual heap until a positive position number is reached. Therefore, {x1, x2, ..., xk−1} represent disjoint intervals p, while xk−1 and xk can represent overlapping intervals of p. The sum of their lengths is therefore O(|p|), giving an O(|p|) bound to find all occurrences of p in t in this case.
Comments
Post a Comment