Suffix Trees and Suffix Arrays:Advanced Applications

Advanced Applications

Suﬃx Links from Lowest Common Ancestors

Suppose we are given a suﬃx tree and it is required to establish suﬃx links for each internal node. This may become necessary if the suﬃx tree creation algorithm does not construct suﬃx links but they are needed for an application of interest. For example, the suﬃx tree may be constructed via suﬃx arrays, completely avoiding the construction and use of suﬃx links for building the tree. The links can be easily established if the tree is preprocessed for lca queries.

Mark each internal node v of the suﬃx tree with a pair of leaves (i, j) such that leaves labeled i and j are in the subtrees of diﬀerent children of v. The marking can be done in linear time by a bottom-up traversal of the tree. To ﬁnd the suﬃx link from an internal node v (other than the root) marked with (i, j), note that v = lca(i, j) and lcp(suf fi, suf fj ) = path-label(v). Let path-label(v) = cα, where c is the ﬁrst character and α is a string. To establish a suﬃx link from v, node u with path label α is needed. As lcp(suf fi+1, suf fj+1) = α, node u is given by lca(i + 1,j + 1), which can be determined in constant time. Thus, all suﬃx links can be determined in O(n) time. This method trivially extends to the case of a generalized suﬃx tree.

Approximate Pattern Matching

The simpler version of approximate pattern matching problem is as follows: Given a pattern P (|P | = m) and a text T (|T | = n), ﬁnd all substrings of length |P | in T that match P with at most k mismatches. To solve this problem, ﬁrst construct the GST of P and T .

Preprocess the GST to record the string-depth of each node, and to answer lca queries in constant time. For each position i in T , we will determine if T [i..i + m − 1] matches P with at most k mismatches. First, use an lca query lca((P, 1), (T, i)) to ﬁnd the largest substring from position i of T that matches a substring from position 1 and P . Suppose the length of this longest exact match is l. Thus, P [1..l] = T [i..i + l − 1], and P [l + 1] j= T [i + l]. Count this as a mismatch and continue by ﬁnding lca((P, l + 2), (T, i + l + 1)). This procedure is continued until either the end of P is reached or the number of mismatches crosses k. As each lca query takes constant time, the entire procedures takes O(k) time. This is repeated for each position i in T for a total run-time of O(kn).

Now, consider the more general problem of ﬁnding the substrings of T that can be derived from P by using at most k character insertions, deletions or substitutions. To solve this problem, we proceed as before by determining the possibility of such a match for every starting position i in T . Let l = string-depth(lca((P, 1), (T, i))). At this stage, we consider three possibilities:

After each lca computation, we have three possibilities corresponding to substitution, insertion and deletion, respectively. All possibilities are enumerated to ﬁnd if there is a sequence of k or less operations that will transform P into a substring starting from position i in T . This takes O(3k ) time. Repeating this algorithm for each position i in T takes O(3k n) time.

The above algorithm always uses the longest exact match possible from a given pair of positions in P and T before considering the possibility of an insertion or deletion. To prove the correctness of this algorithm, we show that if there is an approximate match of P starting from position i in T that does not use such a longest exact match, then there exists another approximate match that uses only longest exact matches. Consider an approximate match that does not use longest exact matches. Consider the leftmost position j in P and the corresponding position i + k in T where the longest exact match is violated. i.e., P [j] = T [i + k] but this is not used as part of an exact match. Instead, an insertion or deletion is used. Suppose that an exact match of length r is used after the insertion or deletion. We can come up with a corresponding approximate match where the longest match is used and the insertion/deletion is taken after that. This will either keep the number of insertions/deletions the same or reduce the count. If the value of k is small, the above algorithms provide a quick and easy way to solve the approximate pattern matching problem. For sophisticated algorithms with better run-times, see [4, 21].

Maximal Palindromes

A string is called a palindrome if it reads the same forwards or backwards. A substring s[i..j] of a string s is called a maximal palindrome of s, if s[i..j] is a palindrome and s[i − 1] j= s[j + 1] (unless i = 1 or j = n). The maximal palindrome problem is to ﬁnd all maximal palindromes of a string s.

For a palindrome of odd length, say 2k + 1, deﬁne the center of the palindrome to be the (k + 1)th character. For a palindrome of even length, say 2k, deﬁne the center to be the position between characters k and k + 1 of the palindrome. In either case, the palindrome is said to be of radius k. Starting from the center, a palindrome is a string that reads the same in both directions. Observe that each maximal palindrome in a string must have a distinct center. As the number of possible centers for a string of length n is 2n − 1, the total number of maximal palindromes of a string is 2n − 1. All such palindromes can be identiﬁed in linear time using the following algorithm.

Let sr denote the reverse of string s. Construct a GST of the strings s and sr and preprocess the GST to record string depths of internal nodes and for answering lca queries.

Now, consider a character s[i] in the string. The maximal odd length palindrome centered at s[i] is given by the length of the longest common preﬁx between suf fi+1 of s and suf fn−i+2 of sr . This is easily computed as the string-depth of lca((s, i + 1), (sr,n − i + 2)) in constant time. Similarly, the maximal even length palindrome centered between s[i] and s[i + 1] is given by the length of the longest common preﬁx between suf fi+1 of s and suf fn−i+1 of sr . This is computed as the string-depth of lca((s, i + 1), (sr ,n − i + 1)) in constant time.

These and many other applications involving strings can be solved eﬃciently using suﬃx trees and suﬃx arrays. A comprehensive treatise of suﬃx trees, suﬃx arrays and string algorithms can be found in the textbooks by Gusﬁeld [12], and Crochemore and Rytter [6].

Search This Blog

algorithms

Suffix Trees and Suffix Arrays:Advanced Applications

Comments

Post a Comment

Popular posts from this blog

0/1 Knapsack Problem Memory function.

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Drawing Trees:HV-Layout