Posts

Showing posts from April, 2015

String Searching:The Position Heap

Image
The Position Heap We now give a data structure that gives much simpler algorithms, at a cost of slightly in- creasing the worst-case time required for a query. The algorithms can easily be programmed by undergraduate data-structures students. The data structure is a trie, and has one node for each position in the text. The data structures and algorithms can be modified to give the same bounds for construction and searching, but this undermines the principal advantages, which are simplicity and low mem- ory requirements. The data structure is closely related to trees that are used for storing hash keys in [3]. Building the Position Heap Let a string be represented by a trie if it is the label of a path from the root in the trie. For analyzing the position heap us adopt the convention of indexing the characters of t in descending order, so t = t n t n − 1 ... t 1. In this case, we let t [ i : j ] denote t i t i − 1 ... t j . The algorithm for constructing the position heap c

String Searching:The Compact DAWG

Image
The Compact DAWG By Theorem 30.1 and Lemma 30.3, we cannot assume that the DAWG requires linear space if the nodes are explicitly labeled with their position sets. The algorithm for building the DAWG in linear time does not label the nodes with their position sets. However, without the labels, it is not possible to use the DAWG to find the k locations where a substring p occurs in t in O ( | p | + k ) time. One remedy for this problem is to label a node with a position i if it represents the smallest position set that contains i as a member. The total number of these labels is n . We can reverse the directions of the suffix pointers that are installed during the DAWG construction algorithm, yielding a tree on the position sets. If a node represents a set X of positions, the members of X can be returned in O ( | X | ) time by traversing the subtree rooted at X , assembling a list of these labels. (This tree is isomorphic to the suffix tree of the reverse of the text, but there is no ne

String Searching:The DAWG

Image
The DAWG L EMMA 30.1 Let x and y be two strings such that E ndP ositions ( x ) ∩ E ndP ositions ( y ) / = ∅ . One of x and y must be a suffix of the other, and either E ndP ositions ( x ) = E ndP ositions ( y ), E ndP ositions ( x ) ⊂ E ndP ositions ( y ) or E ndP ositions ( y ) ⊂ E ndP ositions ( x ). Pr o o f If x and y have a common ending position i , then the two occurrences coincide in a way that forces one to be a suffix of the other. Suppose without loss of generality that y is a suffix of x . Then every occurrence of x contains an occurrence of y inside of it that ends at the same position, so E ndpositions ( x ) ⊆ E ndpositions ( y ). ♦ For instance, in the string aabcabcaac , the string ca has ending positions { 5 , 8 } , while the string aabca has ending positions { 5 } , and ca is a suffix of aabca . Let x ’s right-equivalence class in t be the set { y | E ndP ositions ( y ) = E ndP ositions ( x ) } . The only infinite class is degenerate class of strings with the empty s

String Searching:Introduction and Preliminaries

Image
Introduction Searching for occurrences of a substring in a text is a common operation familiar to anyone who uses a text editor, word processor, or web browser. It is also the case that algorithms for analyzing textual databases can generate a large number of searches. If a text, such as a portion of the genome of an organism, is to be searched repeatedly, it is sometimes the case that it pays to preprocess the text to create a data structure that facilitates the searches. The suffix tree [5] and suffix array [4] discussed in Chapter 29 are examples. In this chapter, we give some alternatives to these data structures that have advantages over them in some circumstances, depending on what type of searches or analysis of the text are desired, the amount of memory available, and the amount of effort to be invested in an implementation. In particular, we focus on the problem of finding the locations of all occurrences of a string x in a text t , where the letters of t are drawn from a fixed

Suffix Trees and Suffix Arrays:Advanced Applications

Image
Advanced Applications Su ffix Links from Lowest Common Ancestors Suppose we are given a suffix tree and it is required to establish suffix links for each internal node. This may become necessary if the suffix tree creation algorithm does not construct suffix links but they are needed for an application of interest. For example, the suffix tree may be constructed via suffix arrays, completely avoiding the construction and use of suffix links for building the tree. The links can be easily established if the tree is preprocessed for l ca queries. Mark each internal node v of the suffix tree with a pair of leaves ( i, j ) such that leaves labeled i and j are in the subtrees of different children of v . The marking can be done in linear time by a bottom-up traversal of the tree. To find the suffix link from an internal node v (other than the root) marked with ( i, j ), note that v = l ca ( i, j ) and l cp ( suf f i , suf f j ) = path - l abel ( v ). Let path - l abel ( v ) = cα , where c is the first charac