Suffix Trees and Suffix Arrays:Applications

Applications

In this section, we present algorithms for several string problems using suffix trees and suffix arrays. While the same run-time bounds can be achieved for many interesting applications with either a suffix tree or a suffix array, there are others which involve a space vs. time trade off. Even in cases where the same run-time bound can be achieved, it is often easier to design the algorithm first for a suffix tree, and then think if the implementation can be done using a suffix array. For this reason, we largely concentrate on suffix trees. The reader interested in reading more on applications of suffix arrays is referred to [1, 2].

Pattern Matching

Given a pattern P and a text T , the pattern matching problem is to find all occurrences of P in T . Let |P | = m and |T | = n. Typically, n >> m. Moreover, T remains fixed in many applications and the query is repeated for many different patterns. For example, T could be a text document and P could represent a word search. Or, T could be an entire database of DNA sequences and P denote a substring of a query sequence for homology (similarity) search. Thus, it is beneficial to preprocess the text T so that queries can be answered as efficiently as possible.

Pattern Matching using Suffix Trees

The pattern matching problem can be solved in optimal O(m + k) time using ST (T ), where k is the number of occurrences of P in T . Suppose P occurs in T starting from position i. Then, P is a prefix of suf fi in T . It follows that P matches the path from root to leaf labeled i in ST . This property results in the following simple algorithm: Start from the root of ST and follow the path matching characters in P , until P is completely matched or a mismatch occurs. If P is not fully matched, it does not occur in T . Otherwise, each leaf in the subtree below the matching position gives an occurrence of P . The positions can be enumerated by traversing the subtree in time proportional to the size of the subtree. As the number of leaves in the subtree is k, this takes O(k) time. If only one occurrence is of interest, the suffix tree can be preprocessed in O(n) time such that each internal node contains the label of one of the leaves in its subtree. Thus, the problem of whether P occurs in T or the problem of finding one occurrence can be answered in O(m) time.

Pattern Matching using Suffix Arrays

Consider the problem of pattern matching when the suffix array of the text, SA(T ), is available. As before, we need to find all the suffixes that have P as a prefix. As SA is a lexicographically sorted order of the suffixes of T , all such suffixes will appear in consecutive positions in it. The sorted order in SA allows easy identification of these suffixes using binary search. Using a binary search, find the smallest index i in SA such that suf fSA[i] contains P as a prefix, or determine that no such suffix is present. If no suffix is found, P does not occur in T . Otherwise, find the largest index j(i) such that suf fSA[j] contains P as a prefix. All the elements in the range SA[i..j] give the starting positions of the occurrences of P in T .

A binary search in SA takes O(log n) comparisons. In each comparison, P is compared with a suffix to determine their lexicographic order. This requires comparing at most |P | = m characters. Thus, the run-time of this algorithm is O(m log n). Note that while this run-time is inferior to the run-time using suffix trees, the space required by this algorithm is only n words for SA apart from the space required to store the string. Note that the Lcp array is not required. Assuming 4 bytes per suffix array entry and one byte per character in the string, the total space required is only 5n bytes.

The run-time can be improved to O(m + log n), by using slightly more space and keeping track of appropriate lcp information. Consider an iteration of the binary search. Let SA[L..R] denote the range in the suffix array where the binary search is focused. To be- gin with, L = 1 and R = n. At the beginning of an iteration, the pattern P is known to be lexicographically greater than or equal to suf fSA[L] and lexicographically smaller than or equal to suf fSA[R]. Let M = I L+R l. During the iteration, a lexicographic com-

parison between P and suf fSA[M ] is made. Depending on the result, the search range is narrowed to either SA[L..M ] or SA[M..R]. Assume that l = |lcp(P, suf fSA[L])| and r = |lcp(P, suf fSA[R])| are known at the beginning of the iteration. Also, assume that |lcp(suf fSA[L], suf fSA[M ])| and |lcp(suf fSA[M ], suf fSA[R])| are known. From these val- ues, we wish to determine |lcp(P, suf fSA[M ])| for use in next iteration, and consequently determine the relative lexicographic order between P and suf fSA[M ]. As SA is a lexico- graphically sorted array, P and suf fSA[M ] must agree on at least min(l, r) characters. If l and r are equal, then comparison between P and suf fSA[M ] is done by starting from the (l + 1)th character. If l and r are unequal, consider the case when l > r.

image

Similarly, the case when r > l can be handled such that comparisons between P and suf fSA[M ], if at all needed, start from (r + 1)th character. To start the execution of the algorithm, lcp(P, suf fSA[1]) and lcp(P, suf fSA[n]) are computed directly using at most 2|P | character comparisons. This ensures |lcp(P, suf fSA[L])| and |lcp(P, suf fSA[R])| are known at the beginning of the first iteration. This property is maintained for each iteration as L or R is shifted to M but |lcp(P, suf fSA[M ])| is computed. For now, assume that the required |lcp(suf fSA[L], suf fSA[M ])| and |lcp(suf fSA[R], suf fSA[M ])| values are available.

LEMMA 29.4 The total number of character comparisons made by the algorithm is O(m + log n).

Proof The algorithm makes at most 2m comparisons in determining the longest common prefixes between P and suf fSA[1] and between P and suf fSA[n]. Classify the comparisons made in each iteration to determine the longest common prefix between P and suf fSA[M ] into successful and failed comparisons. A comparison is considered successful if it contributes the longest common prefix. There is at most one failed comparison per iteration, for a total of at most log n such comparisons over all iterations. As for successful comparisons, note that the comparisons start with (max(l, r) + 1)th character of P , and each successful comparison increases the value of max(l, r) for next iteration. Thus, each character of P is involved only once in a successful comparison. The total number of character comparisons is at most 3m + log n = O(m + log n).

It remains to be described how the |lcp(suf fSA[L], suf fSA[M ])| and |lcp(suf fSA[R], suf fSA[M ])|

values required in each iteration are computed. Suppose the Lcp array of T is known. For

image

The lcp of two suffixes can be computed in time proportional to the distance between them in the suffix array. In order to find the lcp values required by the algorithm in constant time, consider the binary tree corresponding to all possible search intervals used by any execution of the binary search algorithm. The root of the tree denotes the interval [1..n].

If [i..j] (j i 2) is the interval at an internal node of the tree, its left child is given by [i..I i+j l] and its right child is given by [I i+j l..j]. The execution of the binary search tree algorithm can be visualized as traversing a path in the binary tree from root to a leaf. If lcp value for each interval in the tree is precomputed and recorded, any required lcp value during the execution of the algorithm can be retrieved in constant time. The leaf level in the binary tree consists of intervals of the type [i..i + 1]. The lcp values for these n 1

intervals is already given by the Lcp array. The lcp value corresponding to an interval at an internal node is given by the smaller of the lcp values at the children. Using a bottom-up traversal, the lcp values can be computed in O(n) time. In addition to the Lcp array, n 2 additional lcp values are required to be stored. Assuming approximately 1 byte per lcp value, the algorithm requires approximately 2n bytes of additional space. As usual, lcp values larger than or equal to 255, if any, are stored in an exceptions list and the size of such list should be very small in practical applications.

Thus, pattern matching can be solved in O(m log n) time using 5n bytes of space, or in O(m + log n) time using 7n bytes of space. Abouelhoda et al. [2] reduce this time further to O(m) time by mimicking the suffix tree algorithm on a suffix array with some auxiliary information. Using clever implementation techniques, the space is reduced to approximately 6n bytes. An interesting feature of their algorithm is that it can be used in other applications based on a top-down traversal of suffix tree.

Longest Common Substrings

Consider the problem of finding a longest substring common to two given strings s1 of size m and s2 of size n. To solve this problem, first construct the GST of strings s1 and s2. A longest substring common to s1 and s2 will be the path-label of an internal node with the greatest string depth in the suffix tree which has leaves labeled with suffixes from both the strings. Using a traversal of the GST , record the string-depth of each node, and mark each node if it has suffixes from both the strings. Find the largest string-depth of any marked node. Each marked internal node at that depth gives a longest common substring. The total run-time of this algorithm is O(m + n).

The problem can also be solved by using the suffix tree of one of the strings and suffix links. Without loss of generality, suppose the suffix tree of s2 is given. For each position i in s1, we find the largest substring of s1 starting at that position that is also a substring of s2. For position 1, this is directly computed by matching suf f1 of s1 starting from the root of the suffix tree until no more matches are possible. To determine the longest substring match from position 2, simply walk up to the first internal node, follow the suffix link, and walk down as done in McCreight’s algorithm. A similar proof shows that this algorithm runs in O(m + n) time.

Now consider solving the longest common substring problem using the GSA and Lcp array for strings s1 and s2. First, consider a one string variant of this problem that of  computing the longest repeat in a string. This is given by the string depth of the deepest internal node in the corresponding suffix tree. All children of such a node must be leaves.

Any consecutive pair of such leaves have the longest repeat as their longest common prefix.

Thus, each largest value in the Lcp array reveals a longest repeat in the string. The number of occurrences of a repeat is one more than the number of consecutive occurrences of the corresponding largest value in the Lcp array. Thus, all distinct longest repeats, and the number and positions of their occurrences can be determined by a linear scan of the Lcp array.

To solve the longest common substring problem, let v denote an internal node with the greatest string depth that contains a suffix from each of the strings. Because such a pair of suffixes need not be consecutive in the suffix array, it might appear that one has to look at nonconsecutive entries in the Lcp array. However, the subtree of any internal node that is a child of v can only consist of suffixes from one of the strings. Thus, there will be two consecutive suffixes in the subtree under v, one from each string. Therefore, it is enough to look at consecutive entries in the GSA. In a linear scan of the GSA and Lcp arrays, find the largest lcp value that corresponds to two consecutive suffixes, one from each string. This gives the length of a longest common substring. The starting positions of the suffixes reveals the positions in the strings where the longest common substring occurs. The algorithm runs in O(m + n) time.

Text Compression

Compression of text data is useful for data transmission and for compact storage. A simple, not necessarily optimal, data compression method is the Ziv-Lempel compression [24, 25]. In this method, the text to be compressed is considered a large string, and a compact representation is obtained by identifying repeats in the string. A simple algorithm following this strategy is as follows: Let T denote the text to be compressed and let |T | = n. At some stage during the execution of the compression algorithm, suppose that the string T [1..i 1] is already compressed. The compression is extended by finding the length li of a largest prefix of suf fi that is a substring of T [1..i 1]. Two cases arise:

image

The algorithm is initiated by setting T [1] to be the compressed representation of T [1..1], and continuing the iterations until the entire string is compressed. For example, executing the above algorithm on the string mississippi yields the compressed string mis(3, 1)(2, 3)(2, 1)p (9, 1)(2, 1). The decompression method for such a compressed string is immediate.

Suffix trees can be used to carry out the compression in O(n) time [20]. They can be used in obtaining li, the length of the longest prefix of suf fi that is a substring of the portion of the string already seen, T [1..i 1]. If j is the starting position of such a substring, then T [j..j + li 1] = T [i..i + li 1] and i j + li. It follows that |lcp(suf fj , suf fi)|li. Let v = lca(i, j), where i and j are leaves corresponding to suf fi and suf fj , respectively. It follows that T [i..i + li 1] is a prefix of path-label(v). Consider the unique path from the root of ST (T ) that matches T [i..i + li 1]. Node v is an internal node in the subtree below, and hence j is a leaf in the subtree below. Thus, li is the largest number of characters along the path T [i..n] such that leaf j in the subtree below with j + li i. Note that any j in the subtree below that satisfies the property j + li i is acceptable. If such a j exists, the smallest leaf number in the subtree below certainly satisfies this property, and hence can be chosen as the starting position j.

This strategy results in the following algorithm for finding li: First, build the suffix tree of T . Using an appropriate linear time tree traversal method, record the string depth of each node and mark each internal node with the smallest leaf label in its subtree. Let min(v) denote the smallest leaf label under internal node v. To find li, walk along the path T [i..n] to identify two consecutive internal nodes u and v such that min(u) + string-depth(u) < I and min(v) + string-depth(v) i. If min(v) + string-depth(u) > i, then set li = string depth(u) and set the starting position to be min(u). Otherwise, set li = i min(v) and set the starting position to be min(v).

To obtain O(n) run-time, it is enough to find li in O(li ) time as the next li characters of the string are compressed into an O(1) space representation of an already seen substring.

Therefore, it is enough to traverse the path matching T [i..n] using individual character comparisons. However, as the path is guaranteed to exist, it can be traversed in O(1) time per edge, irrespective of the length of the edge label.

String Containment

Given a set of strings S = {s1, s2,..., sk } of total length N , the string containment problem is to identify each string that is a substring of some other string. An example application could be that the strings represent DNA sequence fragments, and we wish to remove redun- dancy. This problem can be easily solved using suffix trees in O(N ) time. First, construct the GST (S) in O(N ) time. To find if a string si is contained in another, locate the leaf labeled (si, 1). If the label of the edge connecting the leaf to its parent is labeled with the string $, si is contained in another string. Otherwise, it is not. This can be determined in O(1) time per string.

Sux-Prefix Overlaps

Suppose we are given a set of strings S = {s1, s2,... , sk } of total length N . The suffix- prefix overlap problem is to identify, for each pair of strings (si, sj ), the longest suffix of si that is a prefix of sj . Suffix-prefix overlaps are useful in algorithms for finding the shortest common superstring of a given set of strings. They are also useful in applications such as genome assembly where significant suffix-prefix overlaps between pairs of fragments are used to assemble fragments into much larger sequences.

The suffix-prefix overlap problem can be solved using GST (S) in optimal O(N +k2) time.

Consider the longest suffix α of si that is a prefix of sj . In GST (S), α is an initial part of the path from the root to leaf labeled (j, 1) that culminates in an internal node. A leaf that corresponds to a suffix from si should be a child of the internal node, with the edge label $. Moreover, it must be the deepest internal node on the path from root to leaf (j, 1)

that has a suffix from si attached in this way. The length of the corresponding suffix-prefix overlap is given by the string depth of the internal node.

Let M be a k × k output matrix such that M [i, j] should contain the length of the longest suffix of si that overlaps a prefix of sj . The matrix is computed using a depth first search (DFS) traversal of GST (S). The GST is preprocessed to record the string depth of every node. During the DFS traversal, k stacks A1, A2,... , Ak are maintained, one for each string. The top of the stack Ai contains the string depth of the deepest node along the current DFS path that is connected with edge label $ to a leaf corresponding to a suffix from si. If no such node exists, the top of the stack contains zero. Each stack Ai is initialized by pushing zero onto an empty stack, and is maintained during the DFS as follows: When the DFS traversal visits a node v from its parent, check to see if v is attached to a leaf with edge label $. If so, for each i such that string si contributes a suffix labeling the leaf, push string-depth(v) on to stack Ai. The string depth of the current node can be easily maintained during the DFS traversal. When the DFS traversal leaves the node v to return back to its parent, again identify each i that has the above property and pop the top element from the corresponding stack Ai.

The output matrix M is built one column at a time. When the DFS traversal reaches a leaf labeled (j, 1), the top of stack Ai contains the longest suffix of si that matches a prefix of sj . Thus, column j of matrix M is obtained by setting M [i, j] to the top element of stack Si. To analyze the run-time of the algorithm, note that each push (similarly, pop) operation on a stack corresponds to a distinct suffix of one of the input strings. Thus, the total number of push and pop operations is bounded by O(N ). The matrix M is filled in O(1) time per element, taking O(k2) time. Hence, all suffix-prefix overlaps can be identified in optimal O(N + k2) time.

Comments

Popular posts from this blog

Data Structure Visualization:Introduction and Value of Data Structure Rendering

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Drawing Trees:HV-Layout