Succinct Representation of Data Structures:Succinct Structures for Indexing

Succinct Structures for Indexing

A text index is a data structure storing a text (a string or a set of strings) and supporting string matching queries: given a pattern P , find all the occurrences of P in the text. Two well-known and widely used index structures are the suffix trees and suffix arrays. In this section we briefly describe some succinct data structures for these two.

A suffix tree for a text is the compressed digital trie of all the suffixes of the text [39, 58]. A suffix tree for a text of length n has n leaves and at most n 1 internal nodes. The space bound is a consequence of skipping nodes with only one child, hence there are precisely n 1 internal nodes if we use a binary trie. Each leaf points to the position in the text of the corresponding suffix it represents uniquely. The edges are labeled by substrings of the text, which are usually represented by storing a position in the text where the substring starts and its length. Thus, a standard representation of a suffix tree for a text of length n takes O(n lg n) bits. Searching for an occurrence of a pattern of length m using a suffix tree takes O(m) time.

The suffix array of a text is an array storing pointers to the suffixes of the text in their lexicographic order. Thus, a suffix array for a text of length n takes nllg nl bits. Note that the leaf labels of a suffix tree written from left to right form the suffix array, if the children of each node are arranged in lexicographic order of their edge labels. Searching for an occurrence of a pattern of length m using a suffix array takes O(m + lg n) time.

We now briefly sketch the ideas involved in representing a suffix tree (and hence also a suffix array) using O(n) bits. We first convert the trie into binary by using a fixed length encoding of the characters of the alphabet. We then store the parenthesis representation of the underlying tree structure (see Section 37.4.2).

The edge labels of a suffix tree can be omitted, as this can be determined by finding the longest common prefix of the leftmost and rightmost leaves of the parent node (of the edge). The parenthesis representation of an ordinal tree can be augmented with o(n)-bit additional structure to support finding the leftmost and rightmost leaves of a given node in constant time. Thus, one can use this tree representation to store the tree structure of a suffix tree, and store the leaf pointers (suffix array) explicitly. This gives a suffix tree representation that takes n lg n + O(n) bits of space and supports indexing queries in optimal time. See [44] for details.

The above structure uses n llg nl bits to represent the pointers to the text or the suffix array. Grossi and Vitter [26] obtained a suffix array structure that takes O(n) bits and supports finding the i-th element in the suffix array (lookup queries) in O(lgE n) time, for any fixed E > 0. Using this structure they also obtained a suffix tree representation that takes O(n) bits of space and supports finding all the s occurrences of a given pattern of length m in O(m + s lgE n) time. The structure given by Rao [53] generalizes the suffix array structure of Grossi and Vitter, which takes O(nt(lg n)1/(t+1) ) bits and supports lookup in O(t) time, for any parameter 1 t lg lg n. Using this structure, one can get an index structure that takes o(n lg n) bits and supports finding all the s occurrences of a given pattern of length m in O(m + s + lgE n) time.

Ferragina and Manzini [16] presented an opportunistic data structure taking O(nHk (n))+ o(n) bits of space, where Hk (n) denotes the k-th order entropy of the given text of length n. This supports finding all the occurrences of a pattern of length m in O((m + s) lgE n) time, where s is the number of occurrences of the pattern. They also presented its practical performance [17].

Sadakane [54] gave a data structure that takes O(n · (1 + H0) + O(|Σ| lg |Σ|)) bits for a text of length n over an alphabet Σ, where H0 lg |Σ| is the order-0 entropy for the text.

This supports finding all the s occurrences of a given pattern P in O(|P | lg n + s lgE n) time, and decompress a portion of the text of length l in O(l + lgE n) time, for any fixed E> 0. Grossi et al. [27] gave another index structure that takes nHk(n)+ O(n lg lg n lg |Σ|/ lg n) bits for a text of length n over an alphabet Σ. Finding an occurrence of a pattern of length m using this structure takes O(m lg |Σ| + polylog(n)) time. This is also shown to perform well, in terms of space as well as query times, in practice [28]

Comments

Popular posts from this blog

0/1 Knapsack Problem Memory function.

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Warshall’s Algorithm -to find TRANSITIVE CLOSURE