Suffix Trees and Suffix Arrays:Basic Definitions and Properties.

Basic Deﬁnitions and Properties

Suﬃx trees and suﬃx arrays are versatile data structures fundamental to string processing applications. Let sf denote a string over the alphabet Σ. Let $ ∈/ Σ be a unique termination character, and s = sf$ be the string resulting from appending $ to sf. We use the following notation: |s| denotes the size of s, s[i] denotes the ith character of s, and s[i..j] denotes the substring s[i]s[i + 1] ... s[j]. Let suf fi = s[i]s[i + 1] ... s[|s|] be the suﬃx of s starting at ith position.

The suﬃx tree of s, denoted ST (s) or simply ST , is a compacted trie (See Chapter 28) of all suﬃxes of string s. Let |s| = n. It has the following properties:

1. The tree has n leaves, labeled 1 ... n, one corresponding to each suﬃx of s.

2. Each internal node has at least 2 children.

3. Each edge in the tree is labeled with a substring of s.

4. The concatenation of edge labels from the root to the leaf labeled i is suf fi.

5. The labels of the edges connecting a node with its children start with diﬀerent characters.

The paths from root to the suﬃxes labeled i and j coincide up to their longest common preﬁx, at which point they bifurcate. If a suﬃx of the string is a preﬁx of another longer suﬃx, the shorter suﬃx must end in an internal node instead of a leaf, as desired. It is to avoid this possibility that the unique termination character is added to the end of the string. Keeping this in mind, we use the notation ST (sf) to denote the suﬃx tree of the string obtained by appending $ to sf.

As each internal node has at least 2 children, an n-leaf suﬃx tree has at most n − 1 internal nodes. Because of property (5), the maximum number of children per node is bounded by |Σ| + 1. Except for the edge labels, the size of the tree is O(n). In order to allow a linear space representation of the tree, each edge label is represented by a pair of integers denoting the starting and ending positions, respectively, of the substring describing the edge label. If the edge label corresponds to a repeat substring, the indices corresponding to any occurrence of the substring may be used. The suﬃx tree of the string mississippi is shown in Figure 29.1. For convenience of understanding, we show the actual edge labels.

The suﬃx array of s = sf$, denoted SA(s) or simply SA, is a lexicographically sorted array of all suﬃxes of s. Each suﬃx is represented by its starting position in s. SA[i] = j iﬀ Suf fj is the ith lexicographically smallest suﬃx of s. The suﬃx array is often used in conjunction with an array termed Lcp array, containing the lengths of the longest common preﬁxes between every consecutive pair of suﬃxes in SA. We use lcp(α, β) to denote the longest common preﬁx between strings α and β. We also use the term lcp as an abbreviation for the term longest common preﬁx. Lcp[i] contains the length of the lcp between suf fSA[i] and suf fSA[i+1], i.e., Lcp[i] = lcp(suf fSA[i], suf fSA[i+1]). As with suﬃx trees, we use the notation SA(sf) to denote the suﬃx array of the string obtained by appending $ to sf. The suﬃx and Lcp arrays of the string mississippi are shown in Figure 29.1.

Let v be a node in the suﬃx tree. Let path-label(v) denote the concatenation of edge labels along the path from root to node v. Let string-depth(v) denote the length of path- label(v). To diﬀerentiate this with the usual notion of depth, we use the term tree-depth of a node to denote the number of edges on the path from root to the node. Note that the length of the longest common preﬁx between two suﬃxes is the string depth of the lowest common ancestor of the leaf nodes corresponding to the suﬃxes. A repeat substring of string S is right-maximal if there are two occurrences of the substring that are succeeded by diﬀerent characters in the string. The path label of each internal node in the suﬃx tree corresponds to a right-maximal repeat substring and vice versa.

Let v be an internal node in the suﬃx tree with path-label cα where c is a character and α is a (possibly empty) string. Therefore, cα is a right-maximal repeat, which also implies that α is also a right maximal repeat. Let u be the internal node with path label α. A pointer from node v to node u is called a suﬃx link; we denote this by SL(v) = u. Each suﬃx suf fi in the subtree of v shares the common preﬁx cα. The corresponding suﬃx suf fi+1 with preﬁx α will be present in the subtree of u. The concatenation of edge labels along the path from v to leaf labeled i, and along the path from u to leaf labeled i + 1 will be the same. Similarly, each internal node in the subtree of v will have a corresponding internal node in the subtree of u. In this sense, the entire subtree under v is contained in the subtree under u.

Every internal node in the suﬃx tree other than the root has a suﬃx link from it. Let v be an internal node with SL(v) = u. Let vf be an ancestor of v other than the root and let uf = SL(vf). As path-label(vf) is a preﬁx of path-label(v), path-label(uf) is also a preﬁx of path-label(u). Thus, uf is an ancestor of u. Each proper ancestor of v except the root will have a suﬃx link to a distinct proper ancestor of u. It follows that tree-depth(u) ≥ tree- depth(v) − 1.

Suﬃx trees and suﬃx arrays can be generalized to multiple strings. The generalized suﬃx tree of a set of strings S = {s1, s2,... , sk }, denoted GST (S) or simply GST , is a compacted trie of all suﬃxes of each string in S. We assume that the unique termination character $ is appended to the end of each string. A leaf label now consists of a pair of integers (i, j), where i denotes the suﬃx is from string si and j denotes the starting position of the suﬃx in si. Similarly, an edge label in a GST is a substring of one of the strings. An edge label is represented by a triplet of integers (i, j, l), where i denotes the string number, and j and l denote the starting and ending positions of the substring in si. For convenience of understanding, we will continue to show the actual edge labels. Note that two strings may have identical suﬃxes. This is compensated by allowing leaves in the tree to have multiple labels. If a leaf is multiply labeled, each suﬃx should come from a diﬀerent string. If N is the total number of characters (including the $ in each string) of all strings in S, the GST has at most N leaf nodes and takes up O(N ) space. The generalized suﬃx array of S, denoted GSA(S) or simply GSA, is a lexicographically sorted array of all suﬃxes of each string in S. Each suﬃx is represented by an integer pair (i, j) denoting suﬃx starting from position j in si. If suﬃxes from diﬀerent strings are identical, they occupy consecutive positions in the GSA. For convenience, we make an exception for the suﬃx $ by listing it only once, though it occurs in each string. The GST and GSA of strings apple and maple are shown in Figure 29.2.

Suﬃx trees and suﬃx arrays can be constructed in time linear to the size of the input. Suﬃx trees are very useful in solving a plethora of string problems in optimal run-time bounds. Moreover, in many cases, the algorithms are very simple to design and understand. For example, consider the classic pattern matching problem of determining if a pattern P occurs in text T over a constant sized alphabet. Note that P occurs starting from position i in T iﬀ P is a preﬁx of suf fi in T . Thus, whether P occurs in T or not can be determined by checking if P matches an initial part of a path from root to a leaf in ST (T ). Traversing

from the root matching characters in P , this can be determined in O(|P |) time, independent of the size of T . As another application, consider the problem of ﬁnding a longest common substring of a pair of strings. Once the GST of the two strings is constructed, all that is needed is to identify an internal node with the largest string depth that contains at least one leaf from each string. These and many other applications are explored in great detail in subsequent sections. Suﬃx arrays are of interest because they require much less space than suﬃx trees, and can be used to solve many of the same problems. We ﬁrst concentrate on linear time construction algorithms for suﬃx trees and suﬃx arrays. The reader interested in applications can safely skip to Section 29.3.

Search This Blog

algorithms

Suffix Trees and Suffix Arrays:Basic Definitions and Properties.

Comments

Post a Comment

Popular posts from this blog

0/1 Knapsack Problem Memory function.

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Drawing Trees:HV-Layout