Online Dictionary Structures:Introduction and Trie Implementations

Introduction

Given an initially empty set S, the dictionary problem consists of executing on-line any sequence of operations of the form S.membership(s), S.insert(s) and S.delete(s), where each element s is an object (or point in one dimensional space). Each object can be stored in a single word, and it takes O(1) time to save and/or retrieve it. The set may be represented by using arrays (sorted or unsorted), linked lists (sorted or unsorted), hash tables, binary search trees, AVL-trees, B-trees, 2-3 trees, weighted balanced trees, or balanced binary search trees (i.e., 2-3-4 trees,symmetric B-trees, half balanced trees or red-black trees). The worst case time complexity for performing each of these operations is O(log n), where n is the maximum number of elements in a set, when using AVL-trees, B-trees (ﬁxed order), 2-3 trees, weighted balanced trees, or balanced binary search trees.

See Chapters 2, 3, 9, 10 and 15 for details on these structures.

The insertion or deletion of elements in these structures requires a set of operations to preserve certain properties of the resulting trees. For binary search trees, these operations are called rotations, and for m-way search trees they are called splitting or combining nodes. The balanced binary search trees are the only trees that require a constant number of rotations for both the insert and delete operations [13, 15].

In this chapter we discuss several algorithms [4, 5] for the generalized dictionary problem when the data is multidimensional, rather than one dimensional. Each data element consists of d ordered components which we call ordered d-tuple, or simply d-tuple. Each component contains a value which can be stored in a memory word and which can be compared against another value to determine whether the values are identical, the ﬁrst value is larger than the second one, or the ﬁrst value is smaller than the second one. The comparison operation takes a constant amount of time. We show that the multidimensional dictionary operations can be implemented to take O(d + log n), where n is the number of d-tuples in the set. We also show that other common operations can also be executed within the same time complexity bounds.

Let us now discuss one of the applications of multidimensional dictionaries. We are given a set of n points in multidimensional space and an integer D, and the problem is to ﬁnd the least number of orthogonal hypersquares (or d-boxes) of size D to cover all the points, i.e., each of the points must be in at least one of the d-boxes. This covering problem arises in image processing, and in locating emergency facilities so that all users are within a reasonable distance of one of the facilities [3]. This covering problem has been shown to be NP-hard and several polynomial time approximation schemes for its solution have been developed [3]. The simplest approximation algorithm deﬁnes d-boxes along a multidimensional grid with grid d-boxes of length D. The approximation algorithm takes every d-dimensional point and by using simple arithmetic operations, including the ﬂoor function, ﬁnds its appropriate (grid) d-box. The d-box, which is characterized by d integer components, is inserted into a multidimensional dictionary and the operation is repeated for each d-dimensional point. Then one just visits all the d-tuples in the set and the d-boxes they represent are part of the solution generated by this simple approximation algorithm. Note that when a d-tuple is inserted into a multidimensional dictionary that contains it, the d-tuple will not modify the dictionary because dictionaries store sets, i.e., multiple copies of the d-tuples are not allowed. Other application of multidimensional dictionaries are when accessing multi-attribute data by value. These applications include the management of geometrical objects and the solution of geometry search problems.

Given the d-tuple s in set S, one may access in constant time the ith element in the d-tuple by using the function s.x(i). In other words, the d-tuple s is simply (s.x(1), s.x(2),... , s.x(d)) In this chapter we examine the methods given in [4, 5] to represent the data set and their algorithms to perform on-line any sequence of the multidimensional dictionary operations. The most eﬃcient of the implementations performs each of the three dictionary operations in O(d + log n) time, where n is the number of d-tuples, and d is the number of dimen- sions. Each of the insert and delete operations requires no more than a constant number of rotations. The best of the algorithms requires dn words to represent the d-tuples, plus O(n) additional space is required to keep additional pointers and data. Because we are using balanced binary search trees, we can also perform other operations eﬃciently. For example, ﬁnd the (lexicographically) smallest or largest d-tuple (O(log n) time), print in lexicographic order (O(dn) time), and concatenation (O(d + log n) time). By modifying slightly the representation and introducing additional information, one can also ﬁnd the (lexicographically) kth smallest or largest d-tuple in (O(log n) time). In Section 24.5 we show that the structure given in [5] may also be used to implement the split operation in O(d + log n) time, and that the approach can also be used in other balanced binary search trees, like AVL, weight balanced, etc.

Trie Implementations

It is interesting to note that two decades ago balanced structures were written oﬀ for this type of applications. As noted in [12], “balanced tree schemes based on key comparisons (e.g., AVL-trees, B-trees, etc.) lose some of their usefulness in this more general context”. At that time the approach was to use tries in conjunction with balanced tree schemes to represent multidimensional data.

Given a set of strings over an alphabet Σ, the tree of all their preﬁxes is called a trie (Chapter 28). In Figure 24.1 we depict the trie for the set of strings over the English alphabet T = {akam, aklm, cmgi, cmos, cors, corv, nort, novl, novn}. In this case all the elements in the trie have the same number of symbols, but in general a trie may contain string with diﬀerent number of symbols. Each node x in a trie is the destination of all the strings with the same preﬁx (say px) and node x consists of a set of element-pointer pairs. The ﬁrst component of the pair is an alphabet element and the second one is a pointer to a subtrie that contains all the strings with preﬁx px followed by the alphabet element. In Figure 24.1 the pairs for a trie node are represented by the edges emanating from the lower portion of the circle that represents the node. Note that no two element-pointer pairs for a trie node have the same ﬁrst component or the same second component. The set of element-pointer pairs in a trie node may be represented in diﬀerent ways. For example one may store the element-ponter pairs for each internal node as:

1. An array of m pointers, where m is the number of elements in the alphabet Σ. In this case one needs to deﬁne a function to translate each element in the alphabet to an integer in the range [0,m − 1]. The function can normally be implemented to take constant time.

2. A sorted linked list with all the symbols and corresponding pointers of the branches emanating from the node (Sussenguth [14]).

3. A binary search tree with all the symbols and corresponding pointers of the branches emanating from the node. (Clampett [2])

We shall refer to the resulting structures as trie-array, trie-list, and trie bst, respectively.

For multidimensional dictionaries deﬁned over the set of integers [0, m), the trie method treats a point in d-space as a string with d elements deﬁned over the alphabet Σ = {0, 1,... ,m − 1} (see Figure 24.1). For the trie-array representation the element-pointer pairs in a trie node are represented by an m-element vector of pointers. The ith pointer corresponds to the ith alphabet symbol, and a null pointer is used when the corresponding alphabet element is not part of any element-pointer pair. The space required to represent n d-tuples in this structure is about dnm pointers, but the total amount of information can be represented in O(dn log m) bits. The insert and delete operation takes O(md) time, because either of these operations may add or delete d − 1 trie nodes and each one has an m-component array of pointers. On the other hand, the membership operation takes only O(d) time. This is the fastest possible implementation for the membership operation. The trie-array implementation is not possible when m is large because there will be too much space wasted. The trie-list representation mentioned above is much better for this scenario. In this case the list of element-pointer pairs is stored in a sorted linked list. The list is sorted with respect to the alphabet elements, i.e., the ﬁrst component in the element-pointer pairs. In the trie-bst representation, the element-pointers are stored in a binary search tree. The ordering is with respect to the alphabet elements. In the former case, we use dn memory locations for the d-tuples, plus 2dn pointers, and in the latter case one needs 3dn pointers. The time complexity for insert, delete and membership in both of these representations is O(d + n). It is important to note that there are two types of pointers, the trie pointers and the linked list or binary search tree pointers.

In practice one may use hybrid structures in which some nodes in the trie are trie- arrays and others are trie-lists, and depending on the number of element-pointer pairs one transforms from one representation to the other. For example, if the number of element- pointer pairs becomes more than some bound bu one uses trie-array nodes, but if it is less then bl then one uses the trie-list. If it is some number between these two bounds, then either representation is ﬁne to use. By using appropriate values for bl and bu in this approach we can avoid “trashing” which means that one spends most of the time changing back and forth from one representations to the other.

Bentley and Saxe [1] used a modiﬁed version of the trie-bst implementation. In this case the binary search tree is replaced by a completely balanced binary tree. I.e., each binary search tree or subtree for trie node x with k element-pointer pairs has as root an element-pointer pair (y, ptr) such that the median of the ordered strings (with preﬁx equal to the preﬁx of the trie node x (denoted by px) plus any of the k alphabet elements in the k element-pointer pairs) has as preﬁx px followed by y. For example, the root of the trie in Figure 24.1 has the pair with alphabet element c, and the root of the subtrie for the preﬁx no is the pair with alphabet element v. This balanced structure is the best possible when the trie does not change or it changes very slowly like in multikey sorting or restricted searching [9, 10]. Since the overall structure is rigid, it can be shown that rebalancing after inserting or deleting an element can be very expensive and therefore not appropriate in a dynamic environment.

Research into using diﬀerent balanced strategies for the trie-bst structures has lead into structures that use ﬁxed order B-trees [7], weight balanced trees [12], AVL trees [16], and balanced binary search trees [17]. It has been shown that insert, delete and membership for multidimensional dictionaries for all of the above structures takes O(d + log n) time in the worst case, except for weight balanced trees in which case it is an expected time complexity bound. The above procedures are complex and require balancing criteria in addition to the obvious ones. Also, the number of rotations needed for these insert and delete operations is not bounded by a constant. Vaishnavi [17] used balanced binary search trees that require only a constant number of rotations, however since one may encounter d trees, the total number of rotations are no longer bounded by a constant. He left it as an open problem to design a structure such that multidimensional dictionaries can be implemented so that only a constant number of rotations are performed after each insert and delete operation.

Search This Blog

algorithms