Suffix Trees and Suffix Arrays:Basic Definitions and Properties.
Basic Definitions and Properties
Suffix trees and suffix arrays are versatile data structures fundamental to string processing applications. Let sf denote a string over the alphabet Σ. Let $ ∈/ Σ be a unique termination character, and s = sf$ be the string resulting from appending $ to sf. We use the following notation: |s| denotes the size of s, s[i] denotes the ith character of s, and s[i..j] denotes the substring s[i]s[i + 1] ... s[j]. Let suf fi = s[i]s[i + 1] ... s[|s|] be the suffix of s starting at ith position.
The suffix tree of s, denoted ST (s) or simply ST , is a compacted trie (See Chapter 28) of all suffixes of string s. Let |s| = n. It has the following properties:
1. The tree has n leaves, labeled 1 ... n, one corresponding to each suffix of s.
2. Each internal node has at least 2 children.
3. Each edge in the tree is labeled with a substring of s.
4. The concatenation of edge labels from the root to the leaf labeled i is suf fi.
5. The labels of the edges connecting a node with its children start with different characters.
The paths from root to the suffixes labeled i and j coincide up to their longest common prefix, at which point they bifurcate. If a suffix of the string is a prefix of another longer suffix, the shorter suffix must end in an internal node instead of a leaf, as desired. It is to avoid this possibility that the unique termination character is added to the end of the string. Keeping this in mind, we use the notation ST (sf) to denote the suffix tree of the string obtained by appending $ to sf.
As each internal node has at least 2 children, an n-leaf suffix tree has at most n − 1 internal nodes. Because of property (5), the maximum number of children per node is bounded by |Σ| + 1. Except for the edge labels, the size of the tree is O(n). In order to allow a linear space representation of the tree, each edge label is represented by a pair of integers denoting the starting and ending positions, respectively, of the substring describing the edge label. If the edge label corresponds to a repeat substring, the indices corresponding to any occurrence of the substring may be used. The suffix tree of the string mississippi is shown in Figure 29.1. For convenience of understanding, we show the actual edge labels.
The suffix array of s = sf$, denoted SA(s) or simply SA, is a lexicographically sorted array of all suffixes of s. Each suffix is represented by its starting position in s. SA[i] = j iff Suf fj is the ith lexicographically smallest suffix of s. The suffix array is often used in conjunction with an array termed Lcp array, containing the lengths of the longest common prefixes between every consecutive pair of suffixes in SA. We use lcp(α, β) to denote the longest common prefix between strings α and β. We also use the term lcp as an abbreviation for the term longest common prefix. Lcp[i] contains the length of the lcp between suf fSA[i] and suf fSA[i+1], i.e., Lcp[i] = lcp(suf fSA[i], suf fSA[i+1]). As with suffix trees, we use the notation SA(sf) to denote the suffix array of the string obtained by appending $ to sf. The suffix and Lcp arrays of the string mississippi are shown in Figure 29.1.
Let v be a node in the suffix tree. Let path-label(v) denote the concatenation of edge labels along the path from root to node v. Let string-depth(v) denote the length of path- label(v). To differentiate this with the usual notion of depth, we use the term tree-depth of a node to denote the number of edges on the path from root to the node. Note that the length of the longest common prefix between two suffixes is the string depth of the lowest common ancestor of the leaf nodes corresponding to the suffixes. A repeat substring of string S is right-maximal if there are two occurrences of the substring that are succeeded by different characters in the string. The path label of each internal node in the suffix tree corresponds to a right-maximal repeat substring and vice versa.
Let v be an internal node in the suffix tree with path-label cα where c is a character and α is a (possibly empty) string. Therefore, cα is a right-maximal repeat, which also implies that α is also a right maximal repeat. Let u be the internal node with path label α. A pointer from node v to node u is called a suffix link; we denote this by SL(v) = u. Each suffix suf fi in the subtree of v shares the common prefix cα. The corresponding suffix suf fi+1 with prefix α will be present in the subtree of u. The concatenation of edge labels along the path from v to leaf labeled i, and along the path from u to leaf labeled i + 1 will be the same. Similarly, each internal node in the subtree of v will have a corresponding internal node in the subtree of u. In this sense, the entire subtree under v is contained in the subtree under u.
Every internal node in the suffix tree other than the root has a suffix link from it. Let v be an internal node with SL(v) = u. Let vf be an ancestor of v other than the root and let uf = SL(vf). As path-label(vf) is a prefix of path-label(v), path-label(uf) is also a prefix of path-label(u). Thus, uf is an ancestor of u. Each proper ancestor of v except the root will have a suffix link to a distinct proper ancestor of u. It follows that tree-depth(u) ≥ tree- depth(v) − 1.
Suffix trees and suffix arrays can be generalized to multiple strings. The generalized suffix tree of a set of strings S = {s1, s2,... , sk }, denoted GST (S) or simply GST , is a compacted trie of all suffixes of each string in S. We assume that the unique termination character $ is appended to the end of each string. A leaf label now consists of a pair of integers (i, j), where i denotes the suffix is from string si and j denotes the starting position of the suffix in si. Similarly, an edge label in a GST is a substring of one of the strings. An edge label is represented by a triplet of integers (i, j, l), where i denotes the string number, and j and l denote the starting and ending positions of the substring in si. For convenience of understanding, we will continue to show the actual edge labels. Note that two strings may have identical suffixes. This is compensated by allowing leaves in the tree to have multiple labels. If a leaf is multiply labeled, each suffix should come from a different string. If N is the total number of characters (including the $ in each string) of all strings in S, the GST has at most N leaf nodes and takes up O(N ) space. The generalized suffix array of S, denoted GSA(S) or simply GSA, is a lexicographically sorted array of all suffixes of each string in S. Each suffix is represented by an integer pair (i, j) denoting suffix starting from position j in si. If suffixes from different strings are identical, they occupy consecutive positions in the GSA. For convenience, we make an exception for the suffix $ by listing it only once, though it occurs in each string. The GST and GSA of strings apple and maple are shown in Figure 29.2.
Suffix trees and suffix arrays can be constructed in time linear to the size of the input. Suffix trees are very useful in solving a plethora of string problems in optimal run-time bounds. Moreover, in many cases, the algorithms are very simple to design and understand. For example, consider the classic pattern matching problem of determining if a pattern P occurs in text T over a constant sized alphabet. Note that P occurs starting from position i in T iff P is a prefix of suf fi in T . Thus, whether P occurs in T or not can be determined by checking if P matches an initial part of a path from root to a leaf in ST (T ). Traversing
from the root matching characters in P , this can be determined in O(|P |) time, independent of the size of T . As another application, consider the problem of finding a longest common substring of a pair of strings. Once the GST of the two strings is constructed, all that is needed is to identify an internal node with the largest string depth that contains at least one leaf from each string. These and many other applications are explored in great detail in subsequent sections. Suffix arrays are of interest because they require much less space than suffix trees, and can be used to solve many of the same problems. We first concentrate on linear time construction algorithms for suffix trees and suffix arrays. The reader interested in applications can safely skip to Section 29.3.
Comments
Post a Comment