String Searching:The DAWG

The DAWG

LEMMA 30.1 Let x and y be two strings such that EndP ositions(x)EndP ositions(y) /= . One of x and y must be a suffix of the other, and either EndP ositions(x) = EndP ositions(y), EndP ositions(x) EndP ositions(y) or EndP ositions(y) EndP ositions(x).

Proof If x and y have a common ending position i, then the two occurrences coincide in a way that forces one to be a suffix of the other. Suppose without loss of generality that y is a suffix of x. Then every occurrence of x contains an occurrence of y inside of it that ends at the same position, so Endpositions(x) Endpositions(y).

For instance, in the string aabcabcaac, the string ca has ending positions {5, 8}, while the string aabca has ending positions {5}, and ca is a suffix of aabca.

Let x’s right-equivalence class in t be the set {y|EndP ositions(y) = EndP ositions(x)}.

The only infinite class is degenerate class of strings with the empty set as ending positions, namely those elements of Σthat are not substrings of t.

The right-equivalence classes on t are a partition of Σ: each member of Σis in one and only one right-equivalence class. By Lemma 30.1, whenever two strings are in the same nondegenerate right-equivalence class, then one of them is a suffix of the other. It is easily seen that if y is the shortest string in the class and x is the longest, then the class consists of the suffixes of x whose length is at least |y|.

For instance, in Figure 30.1, the class of strings with end positions {5, 8} consists of y = ca, x = abca, and since bca is a longer suffix of x than y is.

LEMMA 30.2 A text t of length n has at most 2n right-equivalence classes.

Proof The degenerate class is one right equivalence class. All others have nonempty ending positions, and we must show that there are at most 2n 1 of them. The set V = {0, 1, 2, ..., n} is the set of ending positions of the empty string. If X and Y are sets of ending positions of two right-equivalence classes, then X Y , Y X, or Y X = , by Lemma 30.1. Therefore, the transitive reduction (Hasse diagram) of the subset relation on the nonempty position sets is a tree rooted at V . For any i such that {i} is not a leaf, we can add {i} as a child of the lowest set that contains i as a member. The leaves are now a partition of {1, 2, ..., n} so it has at most n leaves. Since each node of the tree has at least two children, there are at most 2n 1 nodes.

DEFINITION 30.1 The DAWG is defined as follows. The states of the DAWG are the nondegenerate right-equivalence classes that t induces on its substrings. For each a Σ and x Σsuch that xa is a substring of t, there is an edge labeled a from x’s right-equivalence class to xa’s right-equivalence class.

Figure 30.1 depicts the DAWG by labeling each right-equivalence class with its set of ending positions. The set of words in a class is just the set of path labels of paths leading from the source to a class. For instance, the right-equivalence class represented by the node labeled {5, 8} is {ca, bca, abca}.

It would be natural to include the infinite degenerate class of strings that do not occur in t. This would ensure that every state had an outgoing edge for every letter of Σ. However, it is convenient to omit this state when representing the DAWG: for each a Σ, there is an edge from the degenerate class to itself, and this does not need to be represented explicitly. An edge labeled a from a nondegenerate class to the degenerate class is implied by the absence of an edge out of the state labeled a in the representation.

For each node X and each a Σ, there is at most one transition out of X that is labeled a. Therefore, the DAWG is a deterministic finite automaton. Any word p such that EndP ositions(p) /= spells out the labels of a path to the state corresponding to EndP ositions(p). Therefore, all states of the DAWG are reachable from the start state.

The DAWG cannot have a directed cycle, as this would allow an infinite set of words to spell out a path, and the set of subwords of t is finite. Therefore, it can be represented by a directed acyclic graph.

A state is a sink if it has no outgoing edges. A sink must be the right-equivalence class containing position n, so there is exactly one sink.

THEOREM 30.1 The DAWG for a text of length n has at most 2n 1 nodes and 3n 3edges.

Proof The number of nodes follows from Lemma 30.2. There is a single sink, namely, the one that has position set {|t|}, this represents the equivalence class containing those suffixes of t that have a unique occurrence in t. Let T be a directed spanning tree of the DAWG rooted at the start state. T has one fewer edges than the number of states, hence 2n 2 edges. For every e /∈ T , let P1(e) denote the path in T from the start state to the tail of e, let P2(e) denote an arbitrary path in the DAWG from the head of e to the sink, and let P (e) denote the concatenation of (P1(e), e, P2(e)). Since P (e) ends at the sink, the labels of its edges yield a suffix of t. For e1, e2 /∈ T with e1 /= e2, P (e1) /= P (e2), since they differ in their first edge that is not in T . One suffix is given by the labels of the path in T to the sink. Each of the remaining n 1 suffixes is the sequence of labels of P (e) for at most one edge e /∈ T , so there are at most n 1 edges not in T .

The total number of edges of the DAWG is bounded by 2n 2 tree edges and n 1 non-tree edges.

To determine whether a string p occurs as a substring of t, one may begin at the start state and either find the path that spells out the letters of p, thereby accepting p, or else reject p if there is no such path. This requires finding, at each node x, the transition labeled a leaving x, where a is the next letter of p. If |Σ| = O(1), this takes O(1) time, so it takes O(|p|) time to determine whether p is a subword of t. Note that, in contrast to naïve approaches to this problem, this time bound is independent of the length of t.

If the nodes of the DAWG are explicitly labeled with the corresponding end positions, as in Figure 30.1, then it is easy to find the positions where a substring occurs:

it is the label of the state reached on the substring. However, doing this is infeasible if one wishes to build the DAWG in O(|t|) time and use O(|t|) storage, since the sum of cardinalities of the position sets can be greater than this. For this problem, it is preferable to use the compact DAWG that is described below.

For the problem of finding the number of occurrences of a substring in t, it suffices to label each node with the number of positions in its position set. This may be done in postorder in a depth-first search, starting at the start node, and applying the following rule: the label of a node v is the sum of labels of its out-neighbors, which have already been labeled by the time one must label v. Handling v takes time proportional to the number of edges originating at v, which we have already shown is O(|t|).

A Simple Algorithm for Constructing the DAWG

DEFINITION 30.2 If x is a substring of t, let us say that x’s redundancy in t in t is the number of ending (or beginning) positions it has in t. If i is a position in t, let h(i) be the longest substring x of t with an ending position at i whose redundancy is at least as great as its length, |x|. Let h(t) be the average of h(i) over all i, namely (),|t| h(i))/|t|.

Clearly, h(t) is a measure of how redundant t is; the greater the value of h(t), the less information it can contain.

In this section, we given an O(|t|h(t)) algorithm for constructing the DAWG of a string t. This is quadratic in the worst case, which is illustrated by the string t = an, consisting of n copies of one letter. However, we claim that the algorithm is a practical one for most applications, where h(t) is rarely large even when t has a long repeated substring. In most applications, h(t) can be expected to behave like an O(log|t|) function.

The algorithm explicitly labels the nodes of the DAWG with their ending positions, as illustrated in Figure 30.1. Each set of ending positions is represented with a list, where the positions appear in ascending order. It begins by creating a start node, and then iteratively processes an unprocessed node by creating its neighbors. To identify an unprocessed node, it is convenient to keep a list of the unprocessed nodes, insert a node in this list, and remove a node from the front of the list when it is time to process a new node.

image

Figure 30.4 gives an illustration. For the correctness, it is easy to see by induction on k that every substring w of the text that has length k leads to a node whose position set is the ending positions of w.

LEMMA 30.3 The sum of cardinalities of the position sets of the nodes of the DAWG is O(|t|h(t)).

Proof For a position i, let N (i) be the number of ending position sets in which position i appears. By Lemma 30.1, position sets that contain i form a chain {X1, X2, ..., XN (i)}, where for each i from 1 to N (i)1, |Xi| > |Xi+1|, and a string with Xi as its ending positions must be shorter than one with Xi+1 as its ending positions. Therefore, |XLN (i)/2J|N (i)/2, and any string with this set as its ending position set must have length at least l(N (i)/2J1.

This is a string whose set of ending positions is at least as large as its length, so N (i) = O(h(t)), The sum of cardinalities of the position sets is given by ),|t| N (i), since each appearance of i in a position set contributes 1 to the sum, and this sum is O(|t|h(T )).

It is easy to keep the classes as sorted linked lists. When a class X is partitioned into smaller classes, these fall naturally into smaller sorted lists in time linear in the size of

image

FIGURE 30.4: Illustration of the first three iterations of Algorithm 30.2 on aabcabcaac. Unprocessed nodes are drawn with dashed outlines. The algorithm initially creates a start state with position set {0, 1, ..., n} (left figure). To process the start node, it creates a copy of this position set, and adds 1 to each element, yielding {1, 2, ..., n + 1}. It discards n + 1, yielding {1, 2, ..., n}. It partitions this into the set {1, 2, 5, 8, 9} of positions that contain a, the set {3, 6} of positions that contain b, and the set {4, 7, 10} of positions that contain c, creates a node for each, and installs edges labeled with the corresponding letters to the new nodes (middle figure). To process the node v labeled {1, 2, 5, 8, 9}, it adds 1 to each element of this set to obtain {2, 3, 6, 9, 10}, and partitions them into {2, 9}, {3, 6}, and {10}. Of these, {2, 9} and {10} are new position sets, so a new node is created for each. It then installs edges from v to the nodes with these three position sets.

X. A variety of data structures, such as tries, are suitable for looking up whether the sorted representation of a class W already occurs as the position set of a node. The time is therefore linear in the sum of cardinalities of the position sets, which is O(|t|h(t)) by Lemma 30.3.

Constructing the DAWG in Linear Time

The linear-time algorithm given in [1] to construct the DAWG works incrementally by induction on the length of the string. The DAWG of a string of length 0 (the null string) is just a single start node. For k = 0 to n 1, it iteratively performs an induction step that modifies the DAWG of t[1 : k] to obtain the DAWG of t[1 : k + 1].

To gain insight into how the induction step must be performed, consider Figure 30.5. An occurrence of a substring of t can be specified by giving its ending position and its length. For each occurrence of a substring, it gives the number of times the substring occurs up to that point in the text, indexed by length and position. For instance, the string that has length 3 and ends at position 5 is bca. The entry in row 3, column 5 indicates that there is one occurrence of it up through position 5 of the text. There is another at position 8, and the entry at row 3 column 8 indicates that it his two occurrences up through position 8.

The lower figure, which we may call the incremental landscape, gives a simplified representation of the table, by giving an entry only if it differs from the entry immediately above it. Let L[i, j] denote the entry in row i, column k of the incremental landscape. Some of these entries are blank; the implicit value of such an entry is the value of the first non-blank entry above it.

image

FIGURE 30.5: Displaying the number of occurrences of substrings in a text. In the upper figure, the entry in row i column j corresponds to the substring of length j that ends at position i in the text t, and gives the number of occurrences of the substring at position i or before. That is, it gives the number of occurrences of the substring in t[1 : i]. Row 0 is included to reflect occurrences of the null substring, which has occurrences at {0, 1, ..., n}.

image

FIGURE 30.6: The incremental landscape is a simplification of the table of Figure 30.5, where an entry is displayed only if it differs from the entry above it. The entries in column i are right-equivalence classes of t[1 : i]. These are right-equivalence classes that may be affected during the induction step, when the DAWG of t[1 : i 1] is modified to obtain the DAWG of t[1 : i]. Equivalence classes of t[i : 1] that are not right-equivalence classes in t[1 : i] are circled; these correspond to nodes of the DAWG that must be created during the induction step. Edges of the DAWG of t[1 : i] from classes in column i 1 are depicted as arrows. (The distinction between solid and dashed arrows is used in the proof of Theorem 30.4.)

Column k has one entry for each right-equivalence class of t[1 : k] that has k as an ending position. For instance, in column 8, we see the following:

1. L[0, 8]: A right-equivalence class for the suffix of t[1 : 8] of length 0, namely, the empty string, which has 9 occurrences ({0, 1, ..., 8}) in t[1 : 8].

2. L[1, 8]: A right-equivalence class for the suffix of t[1 : 8] of length 1, namely, the suffix a, which has four occurrences ({1, 2, 5, 8}) in t[1 : 8].

3. L[4, 8]: A right-equivalence class for suffixes of t[1 : 8] of lengths 2 through 4, namely, {ca, bca, abca}, which have two occurrences ({5, 8}) in t[1 : 8]. The longest of these, abca, is given by the non-blank entry at L[4, 8], and membership of the others in the class is given implicitly by the blank entries immediately below it.

4. L[8, 8]: A right-equivalence class for suffixes of t[1 : k] of lengths 5 through 8, namely, {cabca, bcabca, abcabca, abcabca} that have one occurrence in t[1 : 8].

We may therefore treat non-blank entries in the incremental landscape as nodes of the DAWG. Let the height of a node denote the length of the longest substring in its right- equivalence class; this is the height (row number) where it appears in the incremental landscape.

When modifying the DAWG of t[1 : k] to obtain the DAWG of t[1 : k + 1], all new nodes that must be added to the DAWG appear in column k + 1. However, not every node in column k + 1 is a new node, as some of the entries reflect nodes that were created earlier. For instance, consider Figure 30.7, which shows the incremental step from t[1 : 6] to t[1 :

7]. One of the nodes, which represents the class {cabc, bcabc, abcabc, aabcabc} of substrings of t[1 : 7] that are not substrings of t[1 : 6]. It is the top circled node of column 7 in Figure 30.6.

Another represents the class Z2 = {c, bc, abc}. This appears in L[3, 7]. To see why this is new, look at the previous occurrence of its longest substring, abc, which is represented by L[3, 4], which is blank. Therefore, in the DAWG of t[1 : 6], it is part of a right-equivalence Z, which appears at L[4, 4], and which contains a longer word, aabc. Since {c, bc, abc} are suffixes of t[1 : 7] and aabc is not, they cease to be right-equivalent in t[1 : 7]. Therefore, Z must be split into two right-equivalence classes, Z2 = {c, bc, abc} and Z1 = Z Z2 = {aabc}. Let us call this operation a split.

Let us say that a node is new in column k if it is created when the DAWG of t[1 : k] is modified to obtain the DAWG of t[1 : k + 1]. In Figure 30.6, a node in a column is circled if it is new in that column. In general, a node is new in column k iff it is the top node of the column or the previous occurrence of its longest member corresponds to a blank space in the incremental landscape.

An important point is that only the top two nodes of a column can be new:

LEMMA 30.4 If a new node is the result of a split, only one node lies above it in its column.

Proof Let a be the character that causes the split, and let xa be the largest string in Z2, and let bxa be the smallest string in Z1 = Z Z2. Since bxa previously had the same set of ending positions as xa and now it does not, it must be that xa occurs as a suffix of tk , but bxa does not. Let cxa be the smallest string in the next higher class C in column k + 1. On all previous occurrences of xa, it was inside bxa, so the first occurrence of cxa is at position k + 1. The frequency of the strings in C must be 1, so C is the top class of the column.

image

FIGURE 30.7: Modifying the DAWG of t[1 : 6] = aabcab to obtain the DAWG of t[1 : 7] = aabcabc. New nodes are shown with bold outlines. The sixth column of the incremental landscape, from top to bottom, consists of the nodes {6}, {3, 6}, and the start node. The seventh column, from top to bottom, consists of {7}, {4, 7}, and the start node. The node {4, 7} is split from {4}; of the strings {c, bc, abc, aabc} that end at node {4}, only {c, bc, abc} also occur at position 7, so these must now be handled by a different node from the one that handles aabc. All edges from nodes in the previous column to {4} are redirected to the new node {4, 7}.

The foregoing shows how nodes must be added to the DAWG in the inductive step. In order to understand how the edges of the DAWG must be modified, note that every edge directed into a node in column k + 1 comes from a node in column k. These edges are given by the following rule:

LEMMA 30.5 In the DAWG of t[1 : k + 1], a node of height i in column k has an edge labeled tk+1 to the lowest node of column k + 1 that has height greater than i.

These edges are drawn in as solid and dashed arrows in Figure 30.6.

According to the figure, when the DAWG of t[1 : 7] is obtained from t[1 : 6], the new top node in the column must have an incoming edge from the top node of column 6, which is labeled {6} in Figure 30.7. The second new node in column 7, which is labeled {4, 7} in the figure, must have edges from the nodes at L[0, 6] and L[2, 6], which are the source and the node labeled {3, 6}. These are obtained by diverting edges into Z.

The induction step consists of implicitly marching in parallel down columns k and k + 1, creating the required nodes in column tk+1 and installing the appropriate edges from right- equivalence classes in column k to right-equivalence classes of column k + 1, as well as the appropriate outgoing edges and parent pointers on each right-equivalence class in column k + 1 that is new in tk+1. The new nodes in column k + 1 are the one with frequency one, and possibly one other, Z2, that results from a split. By Lemma 30.4, this requires marching down through at most two nodes of column k + 1, but possibly through many nodes of column k.

To facilitate marching down a column k efficiently, the algorithm needs a suffix pointer Suffix(x) on each node x of column k to the next lower node in column k. If y = y1y2y3...yj is the shortest string in the x’s right-equivalence class, then Suffix(x) points to the right- equivalence class that contains the longest proper suffix y2y3...yj of y. The suffix pointer for each node is uniquely defined, so the algorithm ensures that suffix pointers are available on nodes of column k by keeping suffix pointers current on all nodes of the DAWG.

The induction step is given in Algorithm 30.3. The algorithm does not build the incremental landscape. However, we may identify the nodes by where they would go in the incremental landscape. The meanings of the variables can be summarized as follows. Topk is the top node of column k, and Topk+1 is the top node of column k + 1. Mid denotes the highest node of column k that already has an outgoing labeled with the (k + 1)th letter. The variable curNode is a looping variable that travels down through nodes of column k, becoming undefined if it travels past the bottom node of the column.

Algorithm 30.3 Update(Topk ): Update the DAWG of t[1 : k] to obtain the DAWG of t[1 : k + 1].

image

THEOREM 30.4 It takes O(|t|) time to build the DAWG of a text t of length n.

Proof No node is ever discarded once it is created, and the final DAWG has O(|t|) nodes.

Therefore, the cost of creating nodes is O(|t|). Once an edge is created it remains in the DAWG, though it might be diverted in calls to Split. No edge is ever discarded and the final DAWG has O(|t|) edges, so the cost of creating edges is O(|t|).

It remains to bound the cost of diverting edges in calls to Split. Let an edge that appears in the incremental landscape be solid if it goes from a node of height i to one of height i + 1, and dashed otherwise. (See Figure 30.6.) We may partition the edges in the landscape into terminating paths, each of which starts in row 0, and contains zero or more solid edges, and either followed by a dashed edge or ending in the last column. At most one terminating path begins in any column, and every dashed edge terminates a path. Thus, there are at most n dashed edges.

When Z2 is created in Split, at most one of the edges diverted into it is solid. The cost of diverting this edge is subsumed in the cost of creating Z2. The cost of diverting other edges is O(|t|) over all calls to Split, since each of them is one of the at most n dashed edges that appear in the incremental landscape.

Comments

Popular posts from this blog

Data Structure Visualization:Introduction and Value of Data Structure Rendering

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Drawing Trees:HV-Layout