String Searching:The Compact DAWG
By Theorem 30.1 and Lemma 30.3, we cannot assume that the DAWG requires linear space if the nodes are explicitly labeled with their position sets. The algorithm for building the DAWG in linear time does not label the nodes with their position sets. However, without the labels, it is not possible to use the DAWG to find the k locations where a substring p occurs in t in O(|p| + k) time.
One remedy for this problem is to label a node with a position i if it represents the smallest position set that contains i as a member. The total number of these labels is n.
We can reverse the directions of the suffix pointers that are installed during the DAWG construction algorithm, yielding a tree on the position sets. If a node represents a set X of positions, the members of X can be returned in O(|X|) time by traversing the subtree rooted at X, assembling a list of these labels. (This tree is isomorphic to the suffix tree of the reverse of the text, but there is no need to adopt the common practice of labeling each of its edges with a string.)
Another alternative, which has considerable advantage in space requirements over the suffix tree, is to “compact” the DAWG, yielding a smaller data structure that still supports a query about the positions of a substring O(|p| + k) time. The algorithm for compacting it runs in O(|t|) time.
If x is a substring of t, let α(x) denote the longest string y such every ending position of x is also an ending position of yx. That is, y is the maximal string that precedes every occurrence of x in t. Note that α(x) may be the null string. Similarly, let β(x) denote the longest string z such that every starting position of x is a starting position of xz. This is the longest string that follows every occurrence of x.
For instance, if t = aabcabcaac and x = b, then α(x) = a and β(x) = ca.
LEMMA 30.6
1. For x and y in a right-equivalence class, α(x)x = α(y)y is the longest string in the class.
2. For x and y in a right-equivalence class, β(x) = β(y).
Let a substring x of t be prime if α(x) and β(x) are both the empty string. For any substring x of t, α(x)xβ(x) is prime; this is the prime implicant of x. If x is prime, it is its own prime implicant.
DEFINITION 30.3 The compact DAWG of a text t is defined as follows. The nodes are the prime substrings of t. If x is a prime substring, then for each a ∈ Σ such that xa is a substring of t, let y = α(xa) and z = aβ(xa). There is an edge labeled z from x to the prime implicant yxz of xa.
If a right-equivalence class contains a prime substring x, then x is the longest member of the class. Stretching the terminology slightly, let us call a class prime if it contains a prime substring. If C is a right-equivalence class in t, we may define β(C) = β(x) such that x ∈ C.
By Part 2 of Lemma 30.6, β(C) is uniquely defined. We may define C’s prime implicant to be the right-equivalence class D that contains xβ(x) for x ∈ C. D is also uniquely defined and contains the prime implicant of the members of C.
The nodes of the DAWG may therefore be partitioned into groups that have the same prime implicant. This is illustrated in Figure 30.8.
LEMMA 30.7 A right-equivalence class is non-prime if and only if it has exactly one outgoing edge in the DAWG.
We now describe how to obtain the compact DAWG from the DAWG in linear time. For ease of presentation, we describe how to carry it out in four depth-first traversals of the DAWG. However, in practice, only two depth-first traversals are required, since the operations of the first three traversals can be carried out during a single depth-first traversal.
In the first depth-first traversal, we may label each class with a single position from its set of ending positions. This is done in postorder: when retreating from a node, copy its label from the label of any of its successors, which have already been labeled, and subtract
1 from it.
By Lemma 30.7, the prime implicant of a class is the class itself if it is prime; otherwise, it is the unique successor that is prime. Let the distance to its prime implicant be the length of this unique path.
In postorder during the second traversal, we may label each node with a pointer to its prime implicant and label this pointer with the distance to the prime implicant. If the class C is a sink or has more than one outgoing edge, this is just a pointer from C to itself with distance label 0. Otherwise, C has a unique successor D, which is already labeled with a pointer to D’s prime implicant A with distance label i. Label C with a pointer to A with distance label i + 1.
In the third traversal, we install the compact DAWG edges. If we label the edges explicitly with their string labels, we will exceed linear time and storage. Instead, we may take advantage of the fact that the label of every edge is a substring of t. We label each edge with the length of its label. (See Figure 30.9.) When retreating from a prime node B during the traversal, for each DAWG edge (BC) out of B, let D be C’s prime implicant, let i be the distance of D from C. Install a compact DAWG edge from B to D that has length label i + 1.
FIGURE 30.9: Representing the edge labels of the compact DAWG. (Compare to Fig-ure 30.2.) Each edge label is a substring of t with end positions at the end position labels of the principal nodes. The label of the edge can therefore be represented implicitly, by labeling each node with one member of its position set, and labeling each edge with the length of its label. For instance, the edge labeled 3 from the source to the node labeled “5” is labeled with the substring of length 3 that ends at position 5, hence, the one occupying positions 3, 4, and 5 of the text. Since the text can be randomly accessed, the text can be used to look up the label of the edge. This ensures that the edge labels take O(|t|) space, since they take O(1) for each node and edge.
On the final traversal, we may remove the DAWG nodes, DAWG edges, and the prime implication pointers.
Using the Compact DAWG to Find the Locations of a String in the Text
Let v be a node of the compact DAWG, and let x be the corresponding prime implicant. Let the total length of a path from v to the sink be the sum of the length labels of the edges on the path. Observe that there is a path of total length i from v to the sink iff x has an ending position at n − i + 1.
LEMMA 30.8 Let x be a prime substring of t, and let k be the number of occurrences of x in t. Given x’s node in the compact DAWG of t, it takes O(k) time to retrieve the ending positions of x in t.
Proof Recursively explore all paths out of the node, and whenever the sink is encountered, subtract the total length of the current path from n + 1 and return it.
The running time follows from the following observations: One position is returned for each leaf of the recursion tree; the sets of positions returned by recursive calls are disjoint; and every internal node of the recursion tree has at least two children since every node of the compact DAWG has at least two outgoing edges. ♦
The representation of Figure 30.9 is therefore just as powerful as that of of Figure 30.2: the edge labels are implied by accessing t using the numbers on edges and nodes, while the position labels of the vertices can be retrieved in linear time by the algorithm of Lemma 30.8.
The representation of Figure 30.9 now gives an O(|p| + k) algorithm for finding the k occurrences of a substring p in a text t. One must index into the compact DAWG from the source, matching letters of p with letters of the implicit word labels of the compact edges.
If a letter of p cannot be matched, then p does not occur as a subword of t. Otherwise, p is the concatenation of a set of word labels on a path, followed by part of the word label of a final edge (u, v). This takes O(|p|) time. Let i be the number of remaining unmatched letters of the word label of (u, v). The k ending positions of p are given by subtracting i from the k ending positions of v, which can be retrieved in O(k) time using the algorithm of Lemma 30.8.
For instance, using the compact DAWG of Figure 30.2 to find the locations where abc occurs, we match a to the label a of an edge out of the source to the node with position set {1, 2, 5, 8, 9}, then bc to the word label of the edge to the node labeled {5, 8}. Though the node is labeled with the position set in the picture, this position set is not available in the linear-space data structure. Instead, we find two paths of length 2 and 5 from this node to the sink, and subtracting 2 and 5 from n = 10 yields the position set {5, 8}. Then, since one letter in the word label bca remains unmatched, we subtract 1 from each member of {5, 8} to obtain {4, 7}, which is the desired answer.
Variations and Applications
In [1], a variation of the compact DAWG is given for a collection {t1, t2, ..., tk } of texts, and can be used to find the k occurrences of a string p in the texts in O(|p| + k) time.
That paper also gives a symmetric version of the compact DAWG. By the symmetry in the definition of the prime subwords of t, the set of prime subwords of the reversal of t are given by reversing the set of prime subwords of t. The compact DAWG of t and of the reversal of t therefore have essentially the same set of nodes; only the edges are affected by the reversal. The symmetric version has a single set of nodes and two sets of edges, one corresponding to the edges of the compact DAWG of t and one corresponding to the edges of the reversal of t. The utility of this structure as a tool for exploring the subword structure of t is described in the paper.
Another variant occurs when t is a cyclic ordering of characters, rather than a linear one. A string p has an occurrence anywhere where it matches the subword contained in an interval on this cycle. A variant of the DAWG, compact DAWG, and compact symmetric DAWG for retrieving occurrences of subwords for t in this case is given in [1]. The paper gives algorithms that have time bounds analogous to those given here.
Variations of landscapes, such as that of Figure 30.6 are explored in [2].
They give a graphical display of the structure of repetitions in a text. The suffix tree can be used to find the longest common substring of two texts t1 and t2 efficiently. The paper gives O(|t|h(t)) algorithms that use the DAWG to generate the landscape of t (see Definition 30.2), which can be used to help identify functional units in a genomic sequence. One variation of the landscape explored in the paper inputs two texts t1 and t2, and gives a graphical display of the number of occurrences of every substring of t1 in t2, which has obvious applications to the study of evolutionary relationships among organisms.
Mehta and Sahni give a generalization of the compact DAWG and the compact symmetric DAWG to circular sequences is given in [6], and give techniques for analyzing and displaying the structure of strings using the compact symmetric DAWG in [7, 8].
Comments
Post a Comment