Suffix Trees and Suffix Arrays:Lowest Common Ancestors

Lowest Common Ancestors

Consider a string s and two of its suffixes suf fi and suf fj . The longest common prefix of the two suffixes is given by the path label of their lowest common ancestor. If the string-depth of each node is recorded in it, the length of the longest common prefix can be retrieved from the lowest common ancestor. Thus, an algorithm to find the lowest common ancestors quickly can be used to determine longest common prefixes without a single character comparison. In this section, we describe how to preprocess the suffix tree in linear time and be able to answer lowest common ancestor queries in constant time [3].

Bender and Farach’s lca algorithm

Let T be a tree of n nodes. Without loss of generality, assume the nodes are numbered 1 ... n. Let lca(i, j) denote the lowest common ancestor of nodes i and j. Bender and Farach’s algorithm performs a linear time preprocessing of the tree and can answer lca queries in constant time.

Let E be an Euler tour of the tree obtained by listing the nodes visited in a depth first search of T starting from the root. Let L be an array of level numbers such that L[i] contains the tree-depth of the node E[i]. Both E and L contain 2n 1 elements and can be constructed by a depth first search of T in linear time. Let R be an array of size n such that R[i] contains the index of the first occurrence of node i in E. Let RM QA(i, j) denote the position of an occurrence of the smallest element in array A between indices i and j (inclusive). For nodes i and j, their lowest common ancestor is the node at the smallest tree-depth that is visited between an occurrence of i and an occurrence of j in the Euler tour. It follows that

image

Thus, the problem of answering lca queries transforms into answering range minimum queries in arrays. Without loss of generality, we henceforth restrict our attention to answering range minimum queries in an array A of size n.

image

located in constant time. This will allow determination of RM QA(i, j) in constant time. To avoid a direct computation of k, the largest power of 2 that is smaller than or equal to each integer in the range [1..n] can be precomputed and stored in O(n) time. Putting all of this together, range minimum queries can be answered with O(n log n) preprocessing time and O(1) query time.

The preprocessing time is reduced to O(n) as follows: Divide the array A into 2n blocks of size 1 log n each. Preprocess each block such that for every pair (i, j) that falls within a block, RM QA(i, j) can be answered directly. Form an array B of size 2n that contains the minimum element from each of the blocks in A, in the order of the blocks in A, and record the locations of the minimum in each block in another array C. An arbitrary query RM QA(i, j) where i and j do not fall in the same block is answered as follows: Directly find the location of the minimum in the range from i to the end of the block containing it, and also in the range from the beginning of the block containing j to index j. All that remains is to find the location of the minimum in the range of blocks completely contained between i and j. This is done by the corresponding range minimum query in B and using C to find the location in A of the resulting smallest element. To answer range queries in B, B is preprocessed as outlined before. Because the size of B is only image reprocessing image  time and space.

It remains to be described how each of the blocks in A is preprocessed to answer range minimum queries that fall within a block. For each pair (i, j) of indices that fall in a block, the corresponding range minimum query is precomputed and stored. This requires computing O(log2 n) values per block and can be done in O(log2 n) time per block. The total run-time over all blocks is 2n × O(log2 n) = O(n log n), which is unacceptable. The run-time can be reduced for the special case where the array A contains level numbers of  nodes visited in an Euler Tour, by exploiting its special properties. Note that the level numbers of consecutive entries differ by +1 or 1. Consider the 2n blocks of size 1 log n.

Normalize each block by subtracting the first element of the block from each element of the block. This does not affect the range minimum query. As the first element of each block is 0 and any other element differs from the previous one by +1 or 1, the number of distinct blocks is 2 2 log n−1 = 1 n. Direct preprocessing of the distinct blocks takes 1 n × O(log2 n) = O(n) time. The mapping of each block to its corresponding distinct normalized block can be done in time proportional to the length of the block, taking O(n) time over all blocks.

Putting it all together, a tree T of n nodes can be preprocessed in O(n) time such that lca queries for any two nodes can be answered in constant time. We are interested in an application of this general algorithm to suffix trees. Consider a suffix tree for a string of length n. After linear time preprocessing, lca queries on the tree can be answered in constant time. For a given pair of suffixes in the string, the string-depth of their lowest common ancestor gives the length of their longest common prefix. Thus, the longest common prefix can be determined in constant time, without resorting to a single character comparison! This feature is exploited in many suffix tree algorithms.

Comments

Popular posts from this blog

Data Structure Visualization:Introduction and Value of Data Structure Rendering

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Drawing Trees:HV-Layout