Suffix Trees and Suffix Arrays: Lowest Common Ancestors

By احمد جاد الله فرحات - April 28, 2015

Lowest Common Ancestors

Consider a string s and two of its suffixes $suff_i$ and $suff_j$. The longest common prefix of the two suffixes is given by the path label of their lowest common ancestor in the suffix tree. If the string-depth of each node is recorded in it, the length of the longest common prefix can be read off the lowest common ancestor. Thus, an algorithm that finds lowest common ancestors quickly can be used to determine longest common prefixes without a single character comparison. In this section, we describe how to preprocess the suffix tree in linear time so that lowest common ancestor queries can be answered in constant time [3].

Bender and Farach’s lca algorithm

Let T be a tree of n nodes. Without loss of generality, assume the nodes are numbered 1 ... n. Let lca(i, j) denote the lowest common ancestor of nodes i and j. Bender and Farach’s algorithm performs a linear time preprocessing of the tree and can answer lca queries in constant time.

Let E be an Euler tour of the tree obtained by listing the nodes visited in a depth first search of T starting from the root. Let L be an array of level numbers such that L[i] contains the tree-depth of the node E[i]. Both E and L contain 2n − 1 elements and can be constructed by a depth first search of T in linear time. Let R be an array of size n such that R[i] contains the index of the first occurrence of node i in E. Let $RMQ_A(i, j)$ denote the position of an occurrence of the smallest element in array A between indices i and j (inclusive). For nodes i and j, their lowest common ancestor is the node at the smallest tree-depth that is visited between an occurrence of i and an occurrence of j in the Euler tour. It follows that

$$lca(i, j) = E[RMQ_L(R[i], R[j])]$$

where the two indices are swapped if R[i] > R[j].

Thus, the problem of answering lca queries transforms into answering range minimum queries in arrays. Without loss of generality, we henceforth restrict our attention to answering range minimum queries in an array A of size n.
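The reduction can be tried out on a small tree. The following is a minimal sketch, assuming the tree is given as a dictionary mapping each node to a list of its children (an input format chosen here for illustration, not taken from the original). The names `euler_tour` and `lca_naive_rmq` are hypothetical, and the range minimum is found by a plain linear scan, so only the reduction itself is shown, not the constant time query.

```python
def euler_tour(children, root):
    """Return (E, L, R) for a rooted tree given as {node: [child, ...]}."""
    E, L, R = [root], [0], {root: 0}
    # Iterative DFS; each stack entry is (node, depth, iterator over its children).
    stack = [(root, 0, iter(children.get(root, [])))]
    while stack:
        node, depth, it = stack[-1]
        child = next(it, None)
        if child is None:
            stack.pop()
            if stack:
                # Stepping back up to the parent counts as another visit.
                parent, parent_depth, _ = stack[-1]
                E.append(parent)
                L.append(parent_depth)
        else:
            R[child] = len(E)          # first occurrence of child in E
            E.append(child)
            L.append(depth + 1)
            stack.append((child, depth + 1, iter(children.get(child, []))))
    return E, L, R


def lca_naive_rmq(E, L, R, i, j):
    """lca(i, j) = E[RMQ_L(R[i], R[j])], with the RMQ done by a linear scan."""
    lo, hi = sorted((R[i], R[j]))
    pos = min(range(lo, hi + 1), key=L.__getitem__)
    return E[pos]


# Example tree: 1 has children 2 and 3; 2 has children 4 and 5; 3 has child 6.
tree = {1: [2, 3], 2: [4, 5], 3: [6]}
E, L, R = euler_tour(tree, 1)
print(E)                               # [1, 2, 4, 2, 5, 2, 1, 3, 6, 3, 1]
print(lca_naive_rmq(E, L, R, 4, 5))    # 2
print(lca_naive_rmq(E, L, R, 4, 6))    # 1
```

With n = 6 nodes the tour has 2n − 1 = 11 entries, as stated in the text.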

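To answer range minimum queries in A, precompute, for every index i and every power of two $2^k \le n$ such that the range $A[i..i + 2^k - 1]$ lies within the array, the position of a smallest element in that range. The entries for $2^k$ are obtained from those for $2^{k-1}$ by comparing the minima of the two halves of the range, so all O(n log n) entries can be computed in O(n log n) time. Given a query $RMQ_A(i, j)$, let k be the largest integer such that $2^k \le j - i + 1$, and split [i..j] into the two overlapping ranges $[i..i + 2^k - 1]$ and $[j - 2^k + 1..j]$. Using the precomputed entries, a smallest element in each of these ranges can be located in constant time. This allows $RMQ_A(i, j)$ to be determined in constant time. To avoid a direct computation of k, the largest power of 2 that is smaller than or equal to each integer in the range [1..n] can be precomputed and stored in O(n) time. Putting all of this together, range minimum queries can be answered with O(n log n) preprocessing time and O(1) query time.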
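A scheme with O(n log n) preprocessing and O(1) queries is commonly implemented as a table of minima over ranges whose lengths are powers of two, often called a sparse table. The sketch below is one such implementation under that assumption. The names `build_sparse_table`, `rmq` and `floor_log` are made up for this example, and indices are 0-based, unlike the 1-based numbering in the text.

```python
def build_sparse_table(A):
    """Precompute positions of minima over all power-of-two ranges of A."""
    n = len(A)
    # floor_log[m] = floor(log2(m)) for 1 <= m <= n, filled in O(n) time,
    # so that a query never has to compute a logarithm.
    floor_log = [0] * (n + 1)
    for m in range(2, n + 1):
        floor_log[m] = floor_log[m // 2] + 1
    # table[k][i] = position of a minimum of A[i .. i + 2**k - 1].
    table = [list(range(n))]
    k = 1
    while (1 << k) <= n:
        prev, half = table[k - 1], 1 << (k - 1)
        row = []
        for i in range(n - (1 << k) + 1):
            left, right = prev[i], prev[i + half]
            row.append(left if A[left] <= A[right] else right)
        table.append(row)
        k += 1
    return table, floor_log


def rmq(A, table, floor_log, i, j):
    """Position of a minimum of A[i..j] (inclusive, i <= j) in O(1) time."""
    k = floor_log[j - i + 1]
    left = table[k][i]                    # covers A[i .. i + 2**k - 1]
    right = table[k][j - (1 << k) + 1]    # covers A[j - 2**k + 1 .. j]
    return left if A[left] <= A[right] else right


A = [3, 1, 4, 1, 5, 9, 2, 6]
table, floor_log = build_sparse_table(A)
print(rmq(A, table, floor_log, 2, 6))     # 3, since A[3] = 1 is the minimum of A[2..6]
```

The two ranges looked up in `rmq` may overlap. That is harmless for minima, because counting an element twice does not change the result.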

The preprocessing time is reduced to O(n) as follows: Divide the array A into $\frac{2n}{\log n}$ blocks of size $\frac{1}{2}\log n$ each. Preprocess each block such that for every pair (i, j) that falls within a block, $RMQ_A(i, j)$ can be answered directly. Form an array B of size $\frac{2n}{\log n}$ that contains the minimum element from each of the blocks in A, in the order of the blocks in A, and record the location of the minimum in each block in another array C. An arbitrary query $RMQ_A(i, j)$ where i and j do not fall in the same block is answered as follows: Directly find the location of the minimum in the range from i to the end of the block containing it, and also in the range from the beginning of the block containing j to index j. All that remains is to find the location of the minimum in the range of blocks completely contained between i and j. This is done by the corresponding range minimum query in B, using C to find the location in A of the resulting smallest element. To answer range queries in B, B is preprocessed as outlined before. Because the size of B is only $O(\frac{n}{\log n})$, preprocessing B takes $O(\frac{n}{\log n} \times \log\frac{n}{\log n}) = O(n)$ time and space.
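The three-part query described above (tail of the first block, whole blocks in between, head of the last block) can be sketched as follows. The class name `BlockRMQ` and its helper methods are hypothetical. In-block queries are answered here by a short scan, as a stand-in for the precomputed in-block lookup discussed next in the text, and the floor of the logarithm is taken with `int.bit_length` rather than a precomputed table. Indices are 0-based.

```python
import math


class BlockRMQ:
    def __init__(self, A):
        self.A = A
        n = len(A)
        # Block size (1/2) * log2(n), but never smaller than 1.
        self.b = max(1, int(math.log2(n) / 2)) if n > 1 else 1
        b = self.b
        # C[t] = position in A of a minimum of block t; B[t] = that minimum.
        self.C = [min(range(s, min(s + b, n)), key=A.__getitem__)
                  for s in range(0, n, b)]
        self.B = [A[p] for p in self.C]
        # Power-of-two table over B:
        # sp[k][t] = index into B of a minimum of B[t .. t + 2**k - 1].
        m = len(self.B)
        self.sp = [list(range(m))]
        k = 1
        while (1 << k) <= m:
            prev, half = self.sp[-1], 1 << (k - 1)
            self.sp.append([
                prev[t] if self.B[prev[t]] <= self.B[prev[t + half]]
                else prev[t + half]
                for t in range(m - (1 << k) + 1)
            ])
            k += 1

    def _scan(self, i, j):
        # Stand-in for the constant time in-block lookup: scans at most b items.
        return min(range(i, j + 1), key=self.A.__getitem__)

    def _whole_blocks(self, lo, hi):
        # Minimum over blocks lo..hi (inclusive), mapped back into A through C.
        k = (hi - lo + 1).bit_length() - 1
        x, y = self.sp[k][lo], self.sp[k][hi - (1 << k) + 1]
        return self.C[x if self.B[x] <= self.B[y] else y]

    def query(self, i, j):
        """Position of a minimum of A[i..j] (inclusive, i <= j)."""
        b = self.b
        bi, bj = i // b, j // b
        if bi == bj:
            return self._scan(i, j)
        candidates = [self._scan(i, (bi + 1) * b - 1),   # tail of i's block
                      self._scan(bj * b, j)]             # head of j's block
        if bi + 1 <= bj - 1:
            candidates.append(self._whole_blocks(bi + 1, bj - 1))
        return min(candidates, key=self.A.__getitem__)
```

For example, `BlockRMQ([5, 2, 8, 6, 3, 7, 9, 1, 4]).query(2, 6)` returns 4, the position of the value 3.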

It remains to be described how each of the blocks in A is preprocessed to answer range minimum queries that fall within a block. For each pair (i, j) of indices that fall in a block, the corresponding range minimum query is precomputed and stored. This requires computing $O(\log^2 n)$ values per block and can be done in $O(\log^2 n)$ time per block. The total run-time over all blocks is $\frac{2n}{\log n} \times O(\log^2 n) = O(n \log n)$, which is unacceptable. The run-time can be reduced for the special case where the array A contains level numbers of nodes visited in an Euler tour, by exploiting its special properties. Note that the level numbers of consecutive entries differ by +1 or −1. Consider the $\frac{2n}{\log n}$ blocks of size $\frac{1}{2}\log n$.

Normalize each block by subtracting the first element of the block from each element of the block. This does not change the position returned by a range minimum query. As the first element of each normalized block is 0 and any other element differs from the previous one by +1 or −1, the number of distinct blocks is $2^{\frac{1}{2}\log n - 1} = \frac{1}{2}\sqrt{n}$. Direct preprocessing of the distinct blocks takes $\frac{1}{2}\sqrt{n} \times O(\log^2 n) = O(n)$ time. The mapping of each block to its corresponding distinct normalized block can be done in time proportional to the length of the block, taking O(n) time over all blocks.
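One way to realize the mapping from blocks to distinct normalized blocks is to encode the sequence of +1/−1 steps of a block as the bits of an integer, so that blocks with the same shape share one lookup table. The sketch below assumes this encoding. The function names are hypothetical, and the block size is passed in explicitly rather than derived from n.

```python
def block_signature(block):
    """Encode a +1/-1 block as an integer: bit t is set iff block[t+1] - block[t] == +1."""
    sig = 0
    for t in range(len(block) - 1):
        step = block[t + 1] - block[t]
        assert step in (1, -1), "consecutive entries must differ by +1 or -1"
        if step == 1:
            sig |= 1 << t
    return sig


def in_block_table(sig, size):
    """table[i][j] = offset of a minimum among positions i..j of the normalized block."""
    values = [0]                                   # normalized block starts at 0
    for t in range(size - 1):
        values.append(values[-1] + (1 if (sig >> t) & 1 else -1))
    table = [[0] * size for _ in range(size)]
    for i in range(size):
        best = i
        for j in range(i, size):
            if values[j] < values[best]:
                best = j
            table[i][j] = best
    return table


def preprocess_blocks(L, size):
    """Return one key per block and a table per distinct (signature, length) key."""
    keys, tables = [], {}
    for start in range(0, len(L), size):
        block = L[start:start + size]
        key = (block_signature(block), len(block))  # last block may be shorter
        if key not in tables:
            tables[key] = in_block_table(*key)
        keys.append(key)
    return keys, tables


def in_block_query(keys, tables, size, i, j):
    """Position of a minimum of L[i..j], where i <= j lie in the same block."""
    start = (i // size) * size
    return start + tables[keys[i // size]][i - start][j - start]


L = [0, 1, 2, 1, 2, 1, 0, 1, 2, 1, 0]              # level array of a small Euler tour
keys, tables = preprocess_blocks(L, 3)
print(len(keys), len(tables))                      # 4 3
print(in_block_query(keys, tables, 3, 3, 5))       # 3
```

Here four blocks need only three tables, because the first and third blocks have the same shape. A block of length b has b − 1 steps, so at most $2^{b-1}$ distinct tables can ever be built, which is the count used in the text.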

Putting it all together, a tree T of n nodes can be preprocessed in O(n) time such that lca queries for any two nodes can be answered in constant time. We are interested in an application of this general algorithm to suffix trees. Consider a suffix tree for a string of length n. After linear time preprocessing, lca queries on the tree can be answered in constant time. For a given pair of suffixes in the string, the string-depth of their lowest common ancestor gives the length of their longest common prefix. Thus, the longest common prefix can be determined in constant time, without resorting to a single character comparison! This feature is exploited in many suffix tree algorithms.
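The link between lowest common ancestors and longest common prefixes can be checked on a toy example. The sketch below makes two simplifying assumptions that are not part of the method described above: it builds an uncompressed suffix trie (quadratic size) instead of a suffix tree, so tree-depth and string-depth coincide, and it finds the lca by walking parent pointers instead of using the constant time scheme. All function names are hypothetical, and the string is assumed to end in a unique terminator such as `$`.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a trie; return (parent, depth, leaf)."""
    children, parent, depth = [{}], [-1], [0]      # node 0 is the root
    leaf = []                                      # leaf[i] = node where suffix i ends
    for start in range(len(s)):
        node = 0
        for ch in s[start:]:
            nxt = children[node].get(ch)
            if nxt is None:
                nxt = len(children)
                children.append({})
                parent.append(node)
                depth.append(depth[node] + 1)
                children[node][ch] = nxt
            node = nxt
        leaf.append(node)
    return parent, depth, leaf


def naive_lca(parent, depth, u, v):
    """Walk both nodes up to equal depth, then up together until they meet."""
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u


def lcp_via_lca(parent, depth, leaf, i, j):
    """Length of the longest common prefix of suffixes i and j (0-based)."""
    return depth[naive_lca(parent, depth, leaf[i], leaf[j])]


s = "banana$"
parent, depth, leaf = build_suffix_trie(s)
print(lcp_via_lca(parent, depth, leaf, 1, 3))      # 3: "anana$" and "ana$" share "ana"
print(lcp_via_lca(parent, depth, leaf, 2, 4))      # 2: "nana$" and "na$" share "na"
```

Once the trie is built, the query itself never compares characters of s. It only follows parent pointers and reads a depth.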
