Succinct Representation of Data Structures:Succinct Structures for Indexing

By احمد جاد الله فرحات - May 07, 2015

Succinct Structures for Indexing

A text index is a data structure storing a text (a string or a set of strings) and supporting string matching queries: given a pattern P , ﬁnd all the occurrences of P in the text. Two well-known and widely used index structures are the suﬃx trees and suﬃx arrays. In this section we brieﬂy describe some succinct data structures for these two.

A suﬃx tree for a text is the compressed digital trie of all the suﬃxes of the text [39, 58]. A suﬃx tree for a text of length n has n leaves and at most n − 1 internal nodes. The space bound is a consequence of skipping nodes with only one child, hence there are precisely n − 1 internal nodes if we use a binary trie. Each leaf points to the position in the text of the corresponding suﬃx it represents uniquely. The edges are labeled by substrings of the text, which are usually represented by storing a position in the text where the substring starts and its length. Thus, a standard representation of a suﬃx tree for a text of length n takes O(n lg n) bits. Searching for an occurrence of a pattern of length m using a suﬃx tree takes O(m) time.

The suﬃx array of a text is an array storing pointers to the suﬃxes of the text in their lexicographic order. Thus, a suﬃx array for a text of length n takes nllg nl bits. Note that the leaf labels of a suﬃx tree written from left to right form the suﬃx array, if the children of each node are arranged in lexicographic order of their edge labels. Searching for an occurrence of a pattern of length m using a suﬃx array takes O(m + lg n) time.

We now brieﬂy sketch the ideas involved in representing a suﬃx tree (and hence also a suﬃx array) using O(n) bits. We ﬁrst convert the trie into binary by using a ﬁxed length encoding of the characters of the alphabet. We then store the parenthesis representation of the underlying tree structure (see Section 37.4.2).

The edge labels of a suﬃx tree can be omitted, as this can be determined by ﬁnding the longest common preﬁx of the leftmost and rightmost leaves of the parent node (of the edge). The parenthesis representation of an ordinal tree can be augmented with o(n)-bit additional structure to support ﬁnding the leftmost and rightmost leaves of a given node in constant time. Thus, one can use this tree representation to store the tree structure of a suﬃx tree, and store the leaf pointers (suﬃx array) explicitly. This gives a suﬃx tree representation that takes n lg n + O(n) bits of space and supports indexing queries in optimal time. See [44] for details.

The above structure uses n llg nl bits to represent the pointers to the text or the suﬃx array. Grossi and Vitter [26] obtained a suﬃx array structure that takes O(n) bits and supports ﬁnding the i-th element in the suﬃx array (lookup queries) in O(lgE n) time, for any ﬁxed E > 0. Using this structure they also obtained a suﬃx tree representation that takes O(n) bits of space and supports ﬁnding all the s occurrences of a given pattern of length m in O(m + s lgE n) time. The structure given by Rao [53] generalizes the suﬃx array structure of Grossi and Vitter, which takes O(nt(lg n)1/(t+1) ) bits and supports lookup in O(t) time, for any parameter 1 ≤ t ≤ lg lg n. Using this structure, one can get an index structure that takes o(n lg n) bits and supports ﬁnding all the s occurrences of a given pattern of length m in O(m + s + lgE n) time.

Ferragina and Manzini [16] presented an opportunistic data structure taking O(nHk (n))+ o(n) bits of space, where Hk (n) denotes the k-th order entropy of the given text of length n. This supports ﬁnding all the occurrences of a pattern of length m in O((m + s) lgE n) time, where s is the number of occurrences of the pattern. They also presented its practical performance [17].

Sadakane [54] gave a data structure that takes O(n · (1 + H0) + O(|Σ| lg |Σ|)) bits for a text of length n over an alphabet Σ, where H0 ≤ lg |Σ| is the order-0 entropy for the text.

This supports ﬁnding all the s occurrences of a given pattern P in O(|P | lg n + s lgE n) time, and decompress a portion of the text of length l in O(l + lgE n) time, for any ﬁxed E> 0. Grossi et al. [27] gave another index structure that takes nHk(n)+ O(n lg lg n lg |Σ|/ lg n) bits for a text of length n over an alphabet Σ. Finding an occurrence of a pattern of length m using this structure takes O(m lg |Σ| + polylog(n)) time. This is also shown to perform well, in terms of space as well as query times, in practice [28]

Search This Blog

algorithms

Succinct Representation of Data Structures:Succinct Structures for Indexing

Comments

Post a Comment

Popular posts from this blog

Warshall’s Algorithm -to find TRANSITIVE CLOSURE

0/1 Knapsack Problem Memory function.

Collision Detection:Penetration Depth Computation