Suffix Trees and Suffix Arrays:Linear Time Construction Algorithm
Linear Time Construction Algorithms
In this section, we explore linear time construction algorithms for suffix trees and suffix arrays. We also show how suffix trees and suffix arrays can be derived from each other in linear time. In suffix tree and suffix array construction algorithms, three different types of alphabets are considered: a constant or fixed size alphabet (|Σ| = O(1)), integer alphabet (Σ = {1, 2,... , n}), and arbitrary alphabet. Suffix trees and suffix arrays can be constructed in linear time for both constant size and integer alphabets. The constant alphabet size case covers many interesting application areas, such as English text, or DNA or protein sequences in molecular biology. The integer alphabet case is interesting because a string of length n can have at most n distinct characters. Furthermore, some algorithms use a recursive technique that would generate and require operating on strings over integer alphabet, even when applied to strings over a fixed alphabet.
Suffix Trees vs. Suffix Arrays
We first show that the suffix array and Lcp array of a string can be obtained from its suffix tree in linear time. Define lexicographic ordering of the children of a node to be the order based on the first character of the edge labels connecting the node to its children. Define lexicographic depth first search to be a depth first search of the tree where the children of each node are visited in lexicographic order. The order in which the leaves of a suffix tree are visited in a lexicographic depth first search gives the suffix array of the corresponding string. In order to obtain lcp information, the string-depth of the current node during the search is remembered. This can be easily updated in O(1) time per edge as the search progresses. The length of the lcp between two consecutive suffixes is given by the smallest string-depth of a node visited between the two suffixes.
Given the suffix array and the Lcp array of a string s (|s$| = n), its suffix tree can be constructed in O(n) time. This is done by starting with a partial suffix tree for the lexicographically smallest suffix, and repeatedly inserting subsequent suffixes in the suffix array into the tree until the suffix tree is complete. Let Ti denote the compacted trie of the first i suffixes in lexicographic order. The first tree T1 consists of a single leaf labeled SA[1] connected to the root with an edge labeled suf fSA[1] = $.
To insert SA[i + 1] into Ti, start with the most recently inserted leaf SA[i] and walk up (|suf fSA[i]|− |lcp(suf fSA[i], suf fSA[i+1])|) = ((n − SA[i]+ 1) − Lcp[i]) characters along the path to the root. This walk can be done in O(1) time per edge by calculating the lengths of the respective edge labels. If the walk does not end at an internal node, create an internal node. Create a new leaf labeled SA[i + 1] and connect it to this internal node with an edge. Set the label on this edge to s[SA[i + 1] + Lcp[i]..n]. This creates the tree Ti+1. The procedure works because suf fSA[i+1] shares a longer prefix with suf fSA[i] than any other suffix inserted so far. To see that the entire algorithm runs in O(n) time, note that inserting a new suffix into Ti requires walking up the rightmost path in Ti. Each edge that is traversed ceases to be on the rightmost path in Ti+1, and thus is never traversed again. An edge in an intermediate tree Ti corresponds to a path in the suffix tree ST . When a new internal node is created along an edge in an intermediate tree, the edge is split into two edges, and the edge below the newly created internal node corresponds to an edge in the suffix tree. Once again, this edge ceases to be on the rightmost path and is never traversed again. The cost of creating an edge in an intermediate tree can be charged to the lowest edge on the corresponding path in the suffix tree. As each edge is charged once for creating and once for traversing, the total run-time of this procedure is O(n).
Finally, the Lcp array itself can be constructed from the suffix array and the string in linear time [14]. Let R be an array of size n such that R[i] contains the position in SA of suf fi. R can be constructed by a linear scan of SA in O(n) time. The Lcp array is computed in n iterations. In iteration i of the algorithm, the longest common prefix between suf fi and its respective right neighbor in the suffix array is computed. The array R facilitates locating an arbitrary suffix suf fi and finding its right neighbor in the suffix array in constant time. Initially, the length of the longest common prefix between suf f1 and its suffix array neighbor is computed directly and recorded. Let suf fj be the right neighbor of suf fi in SA. Let l be the length of the longest common prefix between them.
Suppose l ≥ 1. As suf fj is lexicographically greater than suf fi and s[i] = s[j], suf fj+1 is lexicographically greater than suf fi+1. The length of the longest common prefix between them is l − 1. It follows that the length of the longest common prefix between suf fi+1 and its right neighbor in the suffix array is ≥ l − 1. To determine its correct length, the comparisons need only start from the lth characters of the suffixes.
To prove that the run time of the above algorithm is linear, charge a comparison between the rth character in suffix suf fi and the corresponding character in its right neighbor suffix in SA to the position in the string of the rth character of suf fi, i.e., i + r − 1.
A comparison made in an iteration is termed successful if the characters compared are identical, contributing to the longest common prefix being computed. Because there is one failed comparison in each iteration, the total number of failed comparisons is O(n). As for successful comparisons, each position in the string is charged only once for a successful comparison. Thus, the total number of comparisons over all iterations is linear in n.
In light of the above discussion, a suffix tree and a suffix array can be constructed from each other in linear time. Thus, a linear time construction algorithm for one can be used to construct the other in linear time. In the following subsections, we explore such algorithms. Each algorithm is interesting in its own right, and exploits interesting properties that could be useful in designing algorithms using suffix trees and suffix arrays.
Linear Time Construction of Suffix Trees
Let s be a string of length n including the termination character $. Suffix tree construction algorithms start with an empty tree and iteratively insert suffixes while maintaining the property that each intermediate tree represents a compacted trie of the suffixes inserted so far. When all the suffixes are inserted, the resulting tree will be the suffix tree. Suffix links are typically used to speedup the insertion of suffixes. While the algorithms are identified by the names of their respective inventors, the exposition presented does not necessarily follow the original algorithms and we take the liberty to comprehensively present the material in a way we feel contributes to ease of understanding.
Mc Creight’s Algorithm
McCreight’s algorithm inserts suffixes in the order suf f1, suf f2,..., suf fn. Let Ti denote the compacted trie after suf fi is inserted. T1 is the tree consisting of a single leaf labeled 1 that is connected to the root by an edge with label s[1..n]. In iteration i of the algorithm, suf fi is inserted into tree Ti−1 to form tree Ti. An easy way to do this is by starting from the root and following the unique path matching characters in suf fi one by one until no more matches are possible. If the traversal does not end at an internal node, create an internal node there. Then, attach a leaf labeled i to this internal node and use the unmatched portion of suf fi for the edge label. The run-time for inserting suf fi is proportional to |suf fi| = n − i + 1. The total run-time of the algorithm is Σn (n − i + 1) = O(n2).
In order to achieve an O(n) run-time, suffix links are used to significantly speedup the insertion of a new suffix. Suffix links are useful in the following way − Suppose we are inserting suf fi in Ti−1 and let v be an internal node in Ti−1 on the path from root to leaf labeled (i − 1). Then, path-label(v) = cα is a prefix of suf fi−1. Since v is an internal node, there must be another suffix suf fj (j < i − 1) that also has cα as prefix. Because suf fj+1 is previously inserted, there is already a path from the root in Ti−1 labeled α. To insert suf fi faster, if the end of path labeled α is quickly found, comparison of characters in suf fi can start beyond the prefix α. This is where suffix links will be useful. The algorithm must also construct suffix links prior to using them.
LEMMA 29.1 Let v be an internal node in ST (s) that is created in iteration i − 1. Let path-label(v) = cα, where c is a character and α is a (possibly empty) string. Then, either there exists an internal node u with path-label(u) = α or it will be created in iteration i.
Proof As v is created when inserting suf fi−1 in Ti−2, there exists another suffix suf fj (j < i − 1) such that lcp(suf fi−1, suf fj ) = cα. It follows that lcp(suf fi, suf fj+1) = α.
The tree Ti already contains suf fj+1. When suf fi is inserted during iteration i, internal node u with path-label α is created if it does not already exist.
The above lemma establishes that the suffix link of a newly created internal node can be established in the next iteration.
The following procedure is used when inserting suf fi in Ti−1. Let v be the internal node to which suf fi−1 is attached as a leaf. If v is the root, insert suf fi using character comparisons starting with the first character of suf fi. Otherwise, let path-label(v) = cα. If v has a suffix link from it, follow it to internal node u with path-label α. This allows skipping the comparison of the first |α| characters of suf fi. If v is newly created in iteration i−1, it would not have a suffix link yet. In that case, walk up to parent vf of v. Let β denote the label of the edge connecting vf and v. Let uf = SL(vf) unless vf is the root, in which case let uf be the root itself. It follows that path-label(uf) is a prefix of suf fi. Furthermore, it is guaranteed that there is a path below uf that matches the next |β| characters of suf fi. Traverse |β| characters along this path and either find an internal node u or insert an internal node u if one does not already exist. In either case, set SL(v) = u. Continue by starting character comparisons skipping the first |α| characters of suf fi.
The above procedure requires two different types of traversals − one in which it is known that there exists a path below that matches the next |β| characters of suf fi (type I), and the other in which it is unknown how many subsequent characters of suf fi match a path below (type II). In the latter case, the comparison must proceed character by character until a mismatch occurs. In the former case, however, the traversal can be done by spending only O(1) time per edge irrespective of the length of the edge label. At an internal node during such a traversal, the decision of which edge to follow next is made by comparing the next character of suf fi with the first characters of the edge labels connecting the node to its children. However, once the edge is selected, the entire label or the remaining length of β must match, whichever is shorter. Thus, the traversal can be done in constant time per edge, and if the traversal stops within an edge label, the stopping position can also be determined in constant time.
The insertion procedure during iteration i can now be described as follows: Start with the internal node v to which suf fi−1 is attached as a leaf. If v has a suffix link, follow it and perform a type II traversal to insert suf fi. Otherwise, walk up to v’s parent, take the suffix link from it unless it is the root, and perform a type I traversal to either find or create the node u which will be linked from v by a suffix link. Continue with a type II traversal below u to insert suf fi.
LEMMA 29.2 The total time spent in type I traversals over all iterations is O(n).
Proof A type I traversal is performed by walking down along a path from root to a leaf in O(1) time per edge. Each iteration consists of walking up at most one edge, following a suffix link, and then performing downward traversals (either type II or both type I and type II). Recall that if SL(v) = u, then tree-depth(u) ≥ tree-depth(v) − 1. Thus, following a suffix link may reduce the depth in the tree by at most one. It follows that the operations that may cause moving to a higher level in the tree cause a decrease in depth of at most 2 per iteration. As both type I and type II traversals increase the depth in the tree and there are at most n levels in ST , the total number of edges traversed by type I traversals over all the iterations is bounded by 3n.
LEMMA 29.3 The total time spent in type II traversals over all iterations is O(n).
Proof In a type II traversal, a suffix of the string suf fi is matched along a path in Ti−1until there is a mismatch. When a mismatch occurs, an internal node is created if there does not exist one already. Then, the remaining part of suf fi becomes the edge label connecting leaf labeled i to the internal node. Charge each successful comparison of a character in suf fi to the corresponding character in the original string s. Note that a character that is charged with a successful comparison is never charged again as part of a type II traversal. Thus, the total time spent in type II traversals is O(n).
The above lemmas prove that the total run-time of McCreight’s algorithm is O(n). McCreight’s algorithm is suitable for constant sized alphabets. The dependence of the run-time and space for storing suffix trees on the size of the alphabet |Σ| is as follows: A simple way to allocate space for internal nodes in a suffix tree is to allocate |Σ| + 1 pointers for children,one for each distinct character with which an edge label may begin. With this approach, the edge label beginning with a given character, or whether an edge label exists with a given character, can be determined in O(log |Σ|) time. However, as all |Σ| + 1 pointers are kept irrespective of how many children actually exist, the total space is O(|Σ|n). If the tree is stored such that each internal node points only to its leftmost child and each node also points to its next sibling, if any, the space can be reduced to O(n), irrespective of |Σ|. With this, searching for a child connected by an edge label with the appropriate character takes O(|Σ|) time. Thus, McCreight’s algorithm can be run in O(n log |Σ|) time using O(n|Σ|) space, or in O(n|Σ|) time using O(n) space.
Generalized Suffix Trees
McCreight’s algorithm can be easily adapted to build the generalized suffix tree for a set S = {s1, s2,... , sk } of strings of total length N in O(N ) time. A simple way to do this is to construct the string S = s1$1s2$2 ... sk $k, where each $i is a unique string termination character that does not occur in any string in S. Using McCreight’s algorithm, ST (S) can be computed in O(N ) time. This differs from GST (S) in the following way: Consider a suffix suf fj of string si in GST (S). The corresponding suffix in ST (S) is si[j..|si |]$isi+1$i+1 ... sk $k. Let v be the last internal node on the path from root to leaf representing this suffix in ST (S). As each $i is unique and path-label(v) must be a common prefix of at least two suffixes in S, path-label(v) must be a prefix of si[j..|si|]. Thus, by simply shortening the edge label below v to terminate at the end of the string si and attaching a common termination character $ to it, the corresponding suffix in GST (S) can be generated in O(1) time. Additionally, all suffixes in ST (S) that start with some $i should be removed and replaced by a single suffix $ in GST (S). Note that the suffixes to be removed are all directly connected to the root in ST (S), allowing easy O(1) time removal per suffix. Thus, GST (S) can be derived from ST (S) in O(N ) time.
Instead of first constructing ST (S) and shortening edge labels of edges connecting to leaves to construct GST (S), the process can be integrated into the tree construction itself to directly compute GST (S). When inserting the suffix of a string, directly set the edge label connecting to the newly created leaf to terminate at the end of the string, appended by $. As each suffix that begins with $i in ST (S) is directly attached to the root, execution of McCreight’s algorithm on S will always result in a downward traversal starting from the root when a suffix starting from the first character of a string is being inserted. Thus, we can simply start with an empty tree, insert all the suffixes of one string using McCreight’s algorithm, insert all the suffixes of the next string, and continue this procedure until all strings are inserted. To insert the first suffix of a string, start by matching the unique path in the current tree that matches with a prefix of the string until no more matches are possible, and insert the suffix by branching at this point. To insert the remaining suffixes, continue as described in constructing the tree for one string.
This procedure immediately gives an algorithm to maintain the generalized suffix tree of a set of strings in the presence of insertions and deletions of strings. Insertion of a string is the same as executing McCreight’s algorithm on the current tree, and takes time proportional to the length of the string being inserted. To delete a string, we must locate the leaves corresponding to all the suffixes of the string. By mimicking the process of inserting the string in GST using McCreight’s algorithm, all the corresponding leaf nodes can be reached in time linear in the size of the string to be deleted. To delete a suffix, examine the corresponding leaf. If it is multiply labeled, it is enough to remove the label corresponding to the suffix. It it has only one label, the leaf and edge leading to it must be deleted. If the parent of the leaf is left with only one child after deletion, the parent and its two incident edges are deleted by connecting the surviving child directly to its grandparent with an edge labeled with the concatenation of the labels of the two edges deleted. As the adjustment at each leaf takes O(1) time, the string can be deleted in time proportional to its length.
Suffix trees were invented by Weiner [23], who also presented the first linear time algorithm to construct them for a constant sized alphabet. McCreight’s algorithm is a more space-economical linear time construction algorithm [19]. A linear time on-line construction algorithm for suffix trees is invented by Ukkonen [22]. In fact, our presentation of Mc- Creight’s algorithm also draws from ideas developed by Ukkonen. A unified view of these three suffix tree construction algorithms is studied by Giegerich and Kurtz [10]. Farach [7] presented the first linear time algorithm for strings over integer alphabets. The algorithm recursively constructs suffix trees for all odd and all even suffixes, respectively, and uses a clever strategy for merging them. The complexity of suffix tree construction algorithms for various types of alphabets is explored in [8].
Linear Time Construction of Suffix Arrays
Suffix arrays were proposed by Manber and Myers [18] as a space-efficient alternative to suffix trees. While suffix arrays can be deduced from suffix trees, which immediately implies any of the linear time suffix tree construction algorithms can be used for suffix arrays, it would not achieve the purpose of economy of space. Until recently, the fastest known direct construction algorithms for suffix arrays all required O(n log n) time, leaving a frustrating gap between asymptotically faster construction algorithms for suffix trees, and asymptot- ically slower construction algorithms for suffix arrays, despite the fact that suffix trees contain all the information in suffix arrays. This gap is successfully closed by a number of researchers in 2003, including K¨ar¨akkanen and Sanders [13], Kim et al. [15], and Ko and Aluru [16]. All three algorithms work for the case of integer alphabet. Given the simplicity and/or space efficiency of some of these algorithms, it is now preferable to construct suffix trees via the construction of suffix arrays.
K¨ara¨kkanen and Sanders’ Algorithm
K¨ar¨akkanen and Sanders’ algorithm is the simplest and most elegant algorithm to date to construct suffix arrays, and by implication suffix trees, in linear time. The algorithm also works for the case of an integer alphabet. Let s be a string of length n over the alphabet
Σ = {1, 2,... , n}. For convenience, assume n is a multiple of three and s[n+1] = s[n+2] = 0.
The algorithm has the following steps:
To execute step (1), first perform a radix sort of the 2 n triples (s[i], s[i + 1], s[i + 2]) for each i mod 3 j= 0 and associate with each distinct triple its rank ∈ {1, 2,... , 2 n} in sorted order.
If all triples are distinct, the suffixes are already sorted. Otherwise, let suf fi denote the string obtained by taking suf fi and replacing each consecutive triplet with its corresponding rank. Create a new string sf by concatenating suf f f with suf f f . Note that all suf f f with
i mod 3 = 1 (i mod 3 = 2, respectively) are suffixes of suf f f (suf f f , respectively). A lexicographic comparison of two suffixes in sf never crosses the boundary between suf f f and suf f f because the corresponding suffixes in the original string can be lexicographically distinguished. Thus, sorting sf recursively gives the sorted order of suf fi with i mod 3 j= 0.
Step (2) can be accomplished by performing a radix sort on tuples (s[i], rank(suf fi+1)) for all i mod 3 = 0, where rank(suf fi+1) denotes the rank of suf fi+1 in sorted order obtained in step (1).
Merging of the sorted arrays created in steps (1) and (2) is done in linear time, aided by the fact that the lexicographic order of a pair of suffixes, one from each array, can be determined in constant time. To compare suf fi (i mod 3 = 1) with suf fj (i mod 3 = 0), compare s[i] with s[j]. If they are unequal, the answer is clear. If they are identical, the ranks of suf fi+1 and suf fj+1 in the sorted order obtained in step (1) determines the answer. To compare suf fi (i mod 3 = 2) with suf fj (i mod 3 = 0), compare the first two characters of the two suffixes. If they are both identical, the ranks of suf fi+2 and suf fj+2 in the sorted order obtained in step (1) determines the answer.
The run-time of this algorithm is given by the recurrence T (n) = T (
l) + O(n), which results in O(n) run-time. Note that the 2 : 1 split is designed to make the merging step easy. A 1 : 1 split does not allow easy merging because when comparing two suffixes for merging, no matter how many characters are compared, the remaining suffixes will not fall in the same sorted array, where ranking determines the result without need for further comparisons. Kim et al.’s linear time suffix array construction algorithm is based on a 1 : 1 split, and the merging phase is handled in a clever way so as to run in linear time. This is much like Farach’s algorithm for constructing suffix trees [7] by constructing suffix trees for even and odd positions separately and merging them. Both the above linear time suffix array construction algorithms partition the suffixes based on their starting positions in the string.
A completely different way of partitioning suffixes based on the lexicographic ordering of a suffix with its right neighboring suffix in the string is used by Ko and Aluru to derive a linear time algorithm [16]. This reduces solving a problem of size n to that of solving a problem of size no more than I n l, while eliminating the complex merging step. The algorithm can be made to run in only 2n words plus 1.25n bits for strings over constant alphabet. Algorithmically, Ka¨r¨akkanen and Sanders’ algorithm is akin to mergesort and Ko and Aluru’s algorithm is akin to quicksort. Algorithms for constructing suffix arrays in external memory are investigated by Crauser and Ferragina [5].
It may be more space efficient to construct a suffix tree by first constructing the corresponding suffix array, deriving the Lcp array from it, and using both to construct the suffix tree. For example, while all direct linear time suffix tree construction algorithms depend on constructing and using suffix links, these are completely avoided in the indirect approach. Furthermore, the resulting algorithms have an alphabet independent run-time of O(n) while using only the O(n) space representation of suffix trees. This should be contrasted with the O(|Σ|n) run-time of either McCreight’s or Ukkonen’s algorithms.
Space Issues
Suffix trees and suffix arrays are space efficient in an asymptotic sense because the memory required grows linearly with input size. However, the actual space usage is of significant concern, especially for very large strings. For example, the human genome can be represented as a large string over the alphabet Σ = {A, C, G, T } of length over 3 × 109. Because of linear dependence of space on the length of the string, the exact space requirement is easily characterized by specifying it in terms of the number of bytes per character. Depend- ing on the number of bytes per character required, a data structure for the human genome may fit in main memory, may need a moderate sized disk, or might need a large amount of secondary storage. This has significant influence on the run-time of an application as access to secondary storage is considerably slower. It may also become impossible to run an application for large data sizes unless careful attention is paid to space efficiency.
Consider a naive implementation of suffix trees. For a string of length n, the tree has n leaves, at most n − 1 internal nodes, and at most 2n − 2 edges. For simplicity, count the space required for each integer or a pointer to be one word, equal to 4 bytes on most current computers. For each leaf node, we may store a pointer to its parent, and store the starting index of the suffix represented by the leaf, for 2n words of storage. Storage for each internal node may consist of 4 pointers, one each for parent, leftmost child, right sibling and suffix link, respectively. This will require approximately 4n words of storage. Each edge label consists of a pair of integers, for a total of at most 4n words of storage. Putting this all together, a naive implementation of suffix trees takes 10n words or 40n bytes of storage.
Several techniques can be used to considerably reduce the naive space requirement of 40 bytes per character. Many applications of interest do not need to use suffix links. Similarly, a pointer to the parent may not be required for applications that use traversals down from the root. Even otherwise, note that a depth first search traversal of the suffix tree starting from the root can be conducted even in the absence of parent links, and this can be utilized in applications where a bottom-up traversal is needed. Another technique is to store the internal nodes of the tree in an array in the order of their first occurrence in a depth first search traversal. With this, the leftmost child of an internal node is found right next to it in the array, which removes the need to store a child pointer. Instead of storing the starting and ending positions of a substring corresponding to an edge label, an edge label can be stored with the starting position and length of the substring. The advantage of doing so is that the length of the edge label is likely to be small. Hence, one byte can be used to store edge labels with lengths < 255 and the number 255 can be used to denote edge labels with length at least 255. The actual values of such labels can be stored in an exceptions list, which is expected to be fairly small. Using several such techniques, the space required per character can be roughly cut in half to about 20 bytes [17].
A suffix array can be stored in just one word per character, or 4 bytes. Most applications using suffix arrays also need the Lcp array. Similar to the technique employed in storing edge labels on suffix trees, the entries in Lcp array can also be stored using one byte, with exceptions handled using an ordered exceptions list. Provided most of the lcp values fit in a byte, we only need 5 bytes per character, significantly smaller than what is required for suffix trees. Further space reduction can be achieved by the use of compressed suffix trees and suffix arrays and other data structures [9, 11]. However, this often comes at the expense of increased run-time complexity.
Comments
Post a Comment