Suffix Trees and Suffix Arrays:Advanced Applications

Advanced Applications

Suffix Links from Lowest Common Ancestors

Suppose we are given a suffix tree and it is required to establish suffix links for each internal node. This may become necessary if the suffix tree creation algorithm does not construct suffix links but they are needed for an application of interest. For example, the suffix tree may be constructed via suffix arrays, completely avoiding the construction and use of suffix links for building the tree. The links can be easily established if the tree is preprocessed for lca queries.

Mark each internal node v of the suffix tree with a pair of leaves (i, j) such that leaves labeled i and j are in the subtrees of different children of v. The marking can be done in linear time by a bottom-up traversal of the tree. To find the suffix link from an internal node v (other than the root) marked with (i, j), note that v = lca(i, j) and lcp(suf fi, suf fj ) = path-label(v). Let path-label(v) = , where c is the first character and α is a string. To establish a suffix link from v, node u with path label α is needed. As lcp(suf fi+1, suf fj+1) = α, node u is given by lca(i + 1,j + 1), which can be determined in constant time. Thus, all suffix links can be determined in O(n) time. This method trivially extends to the case of a generalized suffix tree.

Approximate Pattern Matching

The simpler version of approximate pattern matching problem is as follows: Given a pattern P (|P | = m) and a text T (|T | = n), find all substrings of length |P | in T that match P with at most k mismatches. To solve this problem, first construct the GST of P and T .

Preprocess the GST to record the string-depth of each node, and to answer lca queries in constant time. For each position i in T , we will determine if T [i..i + m 1] matches P with at most k mismatches. First, use an lca query lca((P, 1), (T, i)) to find the largest substring from position i of T that matches a substring from position 1 and P . Suppose the length of this longest exact match is l. Thus, P [1..l] = T [i..i + l 1], and P [l + 1] j= T [i + l]. Count this as a mismatch and continue by finding lca((P, l + 2), (T, i + l + 1)). This procedure is continued until either the end of P is reached or the number of mismatches crosses k. As each lca query takes constant time, the entire procedures takes O(k) time. This is repeated for each position i in T for a total run-time of O(kn).

Now, consider the more general problem of finding the substrings of T that can be derived from P by using at most k character insertions, deletions or substitutions. To solve this problem, we proceed as before by determining the possibility of such a match for every starting position i in T . Let l = string-depth(lca((P, 1), (T, i))). At this stage, we consider three possibilities:

imageAfter each lca computation, we have three possibilities corresponding to substitution, insertion and deletion, respectively. All possibilities are enumerated to find if there is a sequence of k or less operations that will transform P into a substring starting from position i in T . This takes O(3k ) time. Repeating this algorithm for each position i in T takes O(3k n) time.

The above algorithm always uses the longest exact match possible from a given pair of positions in P and T before considering the possibility of an insertion or deletion. To prove the correctness of this algorithm, we show that if there is an approximate match of P starting from position i in T that does not use such a longest exact match, then there exists another approximate match that uses only longest exact matches. Consider an approximate match that does not use longest exact matches. Consider the leftmost position j in P and the corresponding position i + k in T where the longest exact match is violated. i.e., P [j] = T [i + k] but this is not used as part of an exact match. Instead, an insertion or deletion is used. Suppose that an exact match of length r is used after the insertion or deletion. We can come up with a corresponding approximate match where the longest match is used and the insertion/deletion is taken after that. This will either keep the number of insertions/deletions the same or reduce the count. If the value of k is small, the above algorithms provide a quick and easy way to solve the approximate pattern matching problem. For sophisticated algorithms with better run-times, see [4, 21].

Maximal Palindromes

A string is called a palindrome if it reads the same forwards or backwards. A substring s[i..j] of a string s is called a maximal palindrome of s, if s[i..j] is a palindrome and s[i 1] j= s[j + 1] (unless i = 1 or j = n). The maximal palindrome problem is to find all maximal palindromes of a string s.

For a palindrome of odd length, say 2k + 1, define the center of the palindrome to be the (k + 1)th character. For a palindrome of even length, say 2k, define the center to be the position between characters k and k + 1 of the palindrome. In either case, the palindrome is said to be of radius k. Starting from the center, a palindrome is a string that reads the same in both directions. Observe that each maximal palindrome in a string must have a distinct center. As the number of possible centers for a string of length n is 2n 1, the total number of maximal palindromes of a string is 2n 1. All such palindromes can be identified in linear time using the following algorithm.

Let sr denote the reverse of string s. Construct a GST of the strings s and sr and preprocess the GST to record string depths of internal nodes and for answering lca queries.

Now, consider a character s[i] in the string. The maximal odd length palindrome centered at s[i] is given by the length of the longest common prefix between suf fi+1 of s and suf fn−i+2 of sr . This is easily computed as the string-depth of lca((s, i + 1), (sr,n i + 2)) in constant time. Similarly, the maximal even length palindrome centered between s[i] and s[i + 1] is given by the length of the longest common prefix between suf fi+1 of s and suf fn−i+1 of sr . This is computed as the string-depth of lca((s, i + 1), (sr ,n i + 1)) in constant time.

These and many other applications involving strings can be solved efficiently using suffix trees and suffix arrays. A comprehensive treatise of suffix trees, suffix arrays and string algorithms can be found in the textbooks by Gusfield [12], and Crochemore and Rytter [6].

Comments

Popular posts from this blog

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Data Structure Visualization:Introduction and Value of Data Structure Rendering

0/1 Knapsack Problem Memory function.