Suffix Trees and Suffix Arrays:Applications

Applications

In this section, we present algorithms for several string problems using suﬃx trees and suﬃx arrays. While the same run-time bounds can be achieved for many interesting applications with either a suﬃx tree or a suﬃx array, there are others which involve a space vs. time trade oﬀ. Even in cases where the same run-time bound can be achieved, it is often easier to design the algorithm ﬁrst for a suﬃx tree, and then think if the implementation can be done using a suﬃx array. For this reason, we largely concentrate on suﬃx trees. The reader interested in reading more on applications of suﬃx arrays is referred to [1, 2].

Pattern Matching

Given a pattern P and a text T , the pattern matching problem is to ﬁnd all occurrences of P in T . Let |P | = m and |T | = n. Typically, n >> m. Moreover, T remains ﬁxed in many applications and the query is repeated for many diﬀerent patterns. For example, T could be a text document and P could represent a word search. Or, T could be an entire database of DNA sequences and P denote a substring of a query sequence for homology (similarity) search. Thus, it is beneﬁcial to preprocess the text T so that queries can be answered as eﬃciently as possible.

Pattern Matching using Suﬃx Trees

The pattern matching problem can be solved in optimal O(m + k) time using ST (T ), where k is the number of occurrences of P in T . Suppose P occurs in T starting from position i. Then, P is a preﬁx of suf fi in T . It follows that P matches the path from root to leaf labeled i in ST . This property results in the following simple algorithm: Start from the root of ST and follow the path matching characters in P , until P is completely matched or a mismatch occurs. If P is not fully matched, it does not occur in T . Otherwise, each leaf in the subtree below the matching position gives an occurrence of P . The positions can be enumerated by traversing the subtree in time proportional to the size of the subtree. As the number of leaves in the subtree is k, this takes O(k) time. If only one occurrence is of interest, the suﬃx tree can be preprocessed in O(n) time such that each internal node contains the label of one of the leaves in its subtree. Thus, the problem of whether P occurs in T or the problem of ﬁnding one occurrence can be answered in O(m) time.

Pattern Matching using Suﬃx Arrays

Consider the problem of pattern matching when the suﬃx array of the text, SA(T ), is available. As before, we need to ﬁnd all the suﬃxes that have P as a preﬁx. As SA is a lexicographically sorted order of the suﬃxes of T , all such suﬃxes will appear in consecutive positions in it. The sorted order in SA allows easy identiﬁcation of these suﬃxes using binary search. Using a binary search, ﬁnd the smallest index i in SA such that suf fSA[i] contains P as a preﬁx, or determine that no such suﬃx is present. If no suﬃx is found, P does not occur in T . Otherwise, ﬁnd the largest index j(≥ i) such that suf fSA[j] contains P as a preﬁx. All the elements in the range SA[i..j] give the starting positions of the occurrences of P in T .

A binary search in SA takes O(log n) comparisons. In each comparison, P is compared with a suﬃx to determine their lexicographic order. This requires comparing at most |P | = m characters. Thus, the run-time of this algorithm is O(m log n). Note that while this run-time is inferior to the run-time using suﬃx trees, the space required by this algorithm is only n words for SA apart from the space required to store the string. Note that the Lcp array is not required. Assuming 4 bytes per suﬃx array entry and one byte per character in the string, the total space required is only 5n bytes.

The run-time can be improved to O(m + log n), by using slightly more space and keeping track of appropriate lcp information. Consider an iteration of the binary search. Let SA[L..R] denote the range in the suﬃx array where the binary search is focused. To be- gin with, L = 1 and R = n. At the beginning of an iteration, the pattern P is known to be lexicographically greater than or equal to suf fSA[L] and lexicographically smaller than or equal to suf fSA[R]. Let M = I L+R l. During the iteration, a lexicographic com-

parison between P and suf fSA[M ] is made. Depending on the result, the search range is narrowed to either SA[L..M ] or SA[M..R]. Assume that l = |lcp(P, suf fSA[L])| and r = |lcp(P, suf fSA[R])| are known at the beginning of the iteration. Also, assume that |lcp(suf fSA[L], suf fSA[M ])| and |lcp(suf fSA[M ], suf fSA[R])| are known. From these val- ues, we wish to determine |lcp(P, suf fSA[M ])| for use in next iteration, and consequently determine the relative lexicographic order between P and suf fSA[M ]. As SA is a lexico- graphically sorted array, P and suf fSA[M ] must agree on at least min(l, r) characters. If l and r are equal, then comparison between P and suf fSA[M ] is done by starting from the (l + 1)th character. If l and r are unequal, consider the case when l > r.

Similarly, the case when r > l can be handled such that comparisons between P and suf fSA[M ], if at all needed, start from (r + 1)th character. To start the execution of the algorithm, lcp(P, suf fSA[1]) and lcp(P, suf fSA[n]) are computed directly using at most 2|P | character comparisons. This ensures |lcp(P, suf fSA[L])| and |lcp(P, suf fSA[R])| are known at the beginning of the ﬁrst iteration. This property is maintained for each iteration as L or R is shifted to M but |lcp(P, suf fSA[M ])| is computed. For now, assume that the required |lcp(suf fSA[L], suf fSA[M ])| and |lcp(suf fSA[R], suf fSA[M ])| values are available.

LEMMA 29.4 The total number of character comparisons made by the algorithm is O(m + log n).

Proof The algorithm makes at most 2m comparisons in determining the longest common preﬁxes between P and suf fSA[1] and between P and suf fSA[n]. Classify the comparisons made in each iteration to determine the longest common preﬁx between P and suf fSA[M ] into successful and failed comparisons. A comparison is considered successful if it contributes the longest common preﬁx. There is at most one failed comparison per iteration, for a total of at most log n such comparisons over all iterations. As for successful comparisons, note that the comparisons start with (max(l, r) + 1)th character of P , and each successful comparison increases the value of max(l, r) for next iteration. Thus, each character of P is involved only once in a successful comparison. The total number of character comparisons is at most 3m + log n = O(m + log n).

It remains to be described how the |lcp(suf fSA[L], suf fSA[M ])| and |lcp(suf fSA[R], suf fSA[M ])|

values required in each iteration are computed. Suppose the Lcp array of T is known. For

The lcp of two suﬃxes can be computed in time proportional to the distance between them in the suﬃx array. In order to ﬁnd the lcp values required by the algorithm in constant time, consider the binary tree corresponding to all possible search intervals used by any execution of the binary search algorithm. The root of the tree denotes the interval [1..n].

If [i..j] (j − i ≥ 2) is the interval at an internal node of the tree, its left child is given by [i..I i+j l] and its right child is given by [I i+j l..j]. The execution of the binary search tree algorithm can be visualized as traversing a path in the binary tree from root to a leaf. If lcp value for each interval in the tree is precomputed and recorded, any required lcp value during the execution of the algorithm can be retrieved in constant time. The leaf level in the binary tree consists of intervals of the type [i..i + 1]. The lcp values for these n − 1

intervals is already given by the Lcp array. The lcp value corresponding to an interval at an internal node is given by the smaller of the lcp values at the children. Using a bottom-up traversal, the lcp values can be computed in O(n) time. In addition to the Lcp array, n − 2 additional lcp values are required to be stored. Assuming approximately 1 byte per lcp value, the algorithm requires approximately 2n bytes of additional space. As usual, lcp values larger than or equal to 255, if any, are stored in an exceptions list and the size of such list should be very small in practical applications.

Thus, pattern matching can be solved in O(m log n) time using 5n bytes of space, or in O(m + log n) time using 7n bytes of space. Abouelhoda et al. [2] reduce this time further to O(m) time by mimicking the suﬃx tree algorithm on a suﬃx array with some auxiliary information. Using clever implementation techniques, the space is reduced to approximately 6n bytes. An interesting feature of their algorithm is that it can be used in other applications based on a top-down traversal of suﬃx tree.

Longest Common Substrings

Consider the problem of ﬁnding a longest substring common to two given strings s1 of size m and s2 of size n. To solve this problem, ﬁrst construct the GST of strings s1 and s2. A longest substring common to s1 and s2 will be the path-label of an internal node with the greatest string depth in the suﬃx tree which has leaves labeled with suﬃxes from both the strings. Using a traversal of the GST , record the string-depth of each node, and mark each node if it has suﬃxes from both the strings. Find the largest string-depth of any marked node. Each marked internal node at that depth gives a longest common substring. The total run-time of this algorithm is O(m + n).

The problem can also be solved by using the suﬃx tree of one of the strings and suﬃx links. Without loss of generality, suppose the suﬃx tree of s2 is given. For each position i in s1, we ﬁnd the largest substring of s1 starting at that position that is also a substring of s2. For position 1, this is directly computed by matching suf f1 of s1 starting from the root of the suﬃx tree until no more matches are possible. To determine the longest substring match from position 2, simply walk up to the ﬁrst internal node, follow the suﬃx link, and walk down as done in McCreight’s algorithm. A similar proof shows that this algorithm runs in O(m + n) time.

Now consider solving the longest common substring problem using the GSA and Lcp array for strings s1 and s2. First, consider a one string variant of this problem − that of computing the longest repeat in a string. This is given by the string depth of the deepest internal node in the corresponding suﬃx tree. All children of such a node must be leaves.

Any consecutive pair of such leaves have the longest repeat as their longest common preﬁx.

Thus, each largest value in the Lcp array reveals a longest repeat in the string. The number of occurrences of a repeat is one more than the number of consecutive occurrences of the corresponding largest value in the Lcp array. Thus, all distinct longest repeats, and the number and positions of their occurrences can be determined by a linear scan of the Lcp array.

To solve the longest common substring problem, let v denote an internal node with the greatest string depth that contains a suﬃx from each of the strings. Because such a pair of suﬃxes need not be consecutive in the suﬃx array, it might appear that one has to look at nonconsecutive entries in the Lcp array. However, the subtree of any internal node that is a child of v can only consist of suﬃxes from one of the strings. Thus, there will be two consecutive suﬃxes in the subtree under v, one from each string. Therefore, it is enough to look at consecutive entries in the GSA. In a linear scan of the GSA and Lcp arrays, ﬁnd the largest lcp value that corresponds to two consecutive suﬃxes, one from each string. This gives the length of a longest common substring. The starting positions of the suﬃxes reveals the positions in the strings where the longest common substring occurs. The algorithm runs in O(m + n) time.

Text Compression

Compression of text data is useful for data transmission and for compact storage. A simple, not necessarily optimal, data compression method is the Ziv-Lempel compression [24, 25]. In this method, the text to be compressed is considered a large string, and a compact representation is obtained by identifying repeats in the string. A simple algorithm following this strategy is as follows: Let T denote the text to be compressed and let |T | = n. At some stage during the execution of the compression algorithm, suppose that the string T [1..i − 1] is already compressed. The compression is extended by ﬁnding the length li of a largest preﬁx of suf fi that is a substring of T [1..i − 1]. Two cases arise:

The algorithm is initiated by setting T [1] to be the compressed representation of T [1..1], and continuing the iterations until the entire string is compressed. For example, executing the above algorithm on the string mississippi yields the compressed string mis(3, 1)(2, 3)(2, 1)p (9, 1)(2, 1). The decompression method for such a compressed string is immediate.

Suﬃx trees can be used to carry out the compression in O(n) time [20]. They can be used in obtaining li, the length of the longest preﬁx of suf fi that is a substring of the portion of the string already seen, T [1..i − 1]. If j is the starting position of such a substring, then T [j..j + li − 1] = T [i..i + li − 1] and i ≥ j + li. It follows that |lcp(suf fj , suf fi)|≥ li. Let v = lca(i, j), where i and j are leaves corresponding to suf fi and suf fj , respectively. It follows that T [i..i + li − 1] is a preﬁx of path-label(v). Consider the unique path from the root of ST (T ) that matches T [i..i + li − 1]. Node v is an internal node in the subtree below, and hence j is a leaf in the subtree below. Thus, li is the largest number of characters along the path T [i..n] such that ∃ leaf j in the subtree below with j + li ≤ i. Note that any j in the subtree below that satisﬁes the property j + li ≤ i is acceptable. If such a j exists, the smallest leaf number in the subtree below certainly satisﬁes this property, and hence can be chosen as the starting position j.

This strategy results in the following algorithm for ﬁnding li: First, build the suﬃx tree of T . Using an appropriate linear time tree traversal method, record the string depth of each node and mark each internal node with the smallest leaf label in its subtree. Let min(v) denote the smallest leaf label under internal node v. To ﬁnd li, walk along the path T [i..n] to identify two consecutive internal nodes u and v such that min(u) + string-depth(u) < I and min(v) + string-depth(v) ≥ i. If min(v) + string-depth(u) > i, then set li = string depth(u) and set the starting position to be min(u). Otherwise, set li = i − min(v) and set the starting position to be min(v).

To obtain O(n) run-time, it is enough to ﬁnd li in O(li ) time as the next li characters of the string are compressed into an O(1) space representation of an already seen substring.

Therefore, it is enough to traverse the path matching T [i..n] using individual character comparisons. However, as the path is guaranteed to exist, it can be traversed in O(1) time per edge, irrespective of the length of the edge label.

String Containment

Given a set of strings S = {s1, s2,..., sk } of total length N , the string containment problem is to identify each string that is a substring of some other string. An example application could be that the strings represent DNA sequence fragments, and we wish to remove redun- dancy. This problem can be easily solved using suﬃx trees in O(N ) time. First, construct the GST (S) in O(N ) time. To ﬁnd if a string si is contained in another, locate the leaf labeled (si, 1). If the label of the edge connecting the leaf to its parent is labeled with the string $, si is contained in another string. Otherwise, it is not. This can be determined in O(1) time per string.

Suﬃx-Preﬁx Overlaps

Suppose we are given a set of strings S = {s1, s2,... , sk } of total length N . The suﬃx- preﬁx overlap problem is to identify, for each pair of strings (si, sj ), the longest suﬃx of si that is a preﬁx of sj . Suﬃx-preﬁx overlaps are useful in algorithms for ﬁnding the shortest common superstring of a given set of strings. They are also useful in applications such as genome assembly where signiﬁcant suﬃx-preﬁx overlaps between pairs of fragments are used to assemble fragments into much larger sequences.

The suﬃx-preﬁx overlap problem can be solved using GST (S) in optimal O(N +k2) time.

Consider the longest suﬃx α of si that is a preﬁx of sj . In GST (S), α is an initial part of the path from the root to leaf labeled (j, 1) that culminates in an internal node. A leaf that corresponds to a suﬃx from si should be a child of the internal node, with the edge label $. Moreover, it must be the deepest internal node on the path from root to leaf (j, 1)

that has a suﬃx from si attached in this way. The length of the corresponding suﬃx-preﬁx overlap is given by the string depth of the internal node.

Let M be a k × k output matrix such that M [i, j] should contain the length of the longest suﬃx of si that overlaps a preﬁx of sj . The matrix is computed using a depth ﬁrst search (DFS) traversal of GST (S). The GST is preprocessed to record the string depth of every node. During the DFS traversal, k stacks A1, A2,... , Ak are maintained, one for each string. The top of the stack Ai contains the string depth of the deepest node along the current DFS path that is connected with edge label $ to a leaf corresponding to a suﬃx from si. If no such node exists, the top of the stack contains zero. Each stack Ai is initialized by pushing zero onto an empty stack, and is maintained during the DFS as follows: When the DFS traversal visits a node v from its parent, check to see if v is attached to a leaf with edge label $. If so, for each i such that string si contributes a suﬃx labeling the leaf, push string-depth(v) on to stack Ai. The string depth of the current node can be easily maintained during the DFS traversal. When the DFS traversal leaves the node v to return back to its parent, again identify each i that has the above property and pop the top element from the corresponding stack Ai.

The output matrix M is built one column at a time. When the DFS traversal reaches a leaf labeled (j, 1), the top of stack Ai contains the longest suﬃx of si that matches a preﬁx of sj . Thus, column j of matrix M is obtained by setting M [i, j] to the top element of stack Si. To analyze the run-time of the algorithm, note that each push (similarly, pop) operation on a stack corresponds to a distinct suﬃx of one of the input strings. Thus, the total number of push and pop operations is bounded by O(N ). The matrix M is ﬁlled in O(1) time per element, taking O(k2) time. Hence, all suﬃx-preﬁx overlaps can be identiﬁed in optimal O(N + k2) time.

Search This Blog

algorithms

Suffix Trees and Suffix Arrays:Applications

Comments

Post a Comment

Popular posts from this blog

0/1 Knapsack Problem Memory function.

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Warshall’s Algorithm -to find TRANSITIVE CLOSURE