Online Dictionary Structures:Binary Search Tree Implementations

Binary Search Tree Implementations

As pointed out by Gonzalez [4, 5], using a balanced binary search tree (without a trie) and storing each tuple at each node leads to membership (and also insert and delete) algorithms that take O(d log n) time, where n is the number of elements in the tree, because one needs to compare the element being searched with the d-tuples at each node. One can go further and claim that for some problem instances it actually requires Θ(d log n) time. Gonzalez [5] also points out that simple shortcuts to the search process do not work. For example if we reach a node in the search such that the first k components are identical, one may be tempted to conclude that in the subtree rooted at that node one needs to search only from position k + 1 to d in the d-tuples. This is false because the k-component prefixes of all the d-tuples in a subtree may vary considerable in a binary search tree. One can easily show that even the membership operation cannot be implemented this way. However, this variation is more predictable when comparing against the smallest or largest d-tuple in a subtree. This is a key idea exploited in [4].

Manber and Myers [11] developed an efficient algorithm that given an N symbol text it finds all the occurrences of any input word q. The scenario is that the text is static, but there will be many word searches. Their approach is to preprocess the text and generate a structure where the searching can be performed efficiently. In their preprocessing stage they construct a sorted list with all the N suffixes of the text. Locating all the occurrences of a string q reduces to performing two binary search operations in the list of suffixes, the first for the first suffix that contains as prefix q and the second search is for the last suffix that contains as prefix q. Both searches are similar, so lets discuss the first one. This operation is similar to the membership operation discussed in this chapter. Manber and Myers [11] binary search process begins by letting L (R) be the largest (smallest) string in the in the list. Then l (r), the index of the first element where L (R) differ from q is computed. Note that if two strings are identical the index of the first component where they differ is set to the length of string plus 1. We use this convention throughout this chapter. The middle entry M in the list is located and then they compute the index of the first component where M and q differ. If this value is computed the obvious way, then the procedure will not be efficient. To do this efficiently they compute it with l, r as well as index of the first component where M and L, and M and R differ. These last two values are precomputed in the preprocessing stage. This indirect computation may take O(|q|) time; however, overall the phases of the computation the process takes at most O(|q|) time. The advantage of this approach is that it requires only O(N ) space, and the preprocessing can be done in O(N ) expected time [11]. The disadvantage is that it is not a dynamic. Updating the text requires expensive recomputations in the precomputed data, i.e., one needs to find the first component where many pairs in the list differ in order to carry out efficiently the binary search process. For their application [11] the text is static. The time required to do the search for q is O(|q| + log N ) time. This approach results in a structure that is similar to the fully balanced tree strategy in [1].

Gonzalez [4] solved Vaishnavi’s open problem by designing a binary tree data structure where the multidimensional dictionary operations can be performed in O(d + log n) time while performing only a constant number of rotations. To achieve these goals he represents the set of d-tuples S in a balanced binary search tree that contains additional information at each node. This additional information is similar to the one in the lists of Manber and Myers [11], but it can be recomputed efficiently as we insert and delete elements. The disadvantage is that the membership operation is more complex. But all the procedures developed in [4] are simpler than the ones given in [7, 12, 16, 17]. One just needs to add a few instructions in addition to the normal code required to manipulate balanced binary search trees [15]. The additional information in [4] includes for every node v in the tree the index of the first element where v and the smallest (largest) d-tuple in the subtree rooted at v differ as well as a pointer to this d-tuple. As mentioned above, testing for membership is more complex. At each iteration in the search process [4] we are at the tree node t and we know that if q is in the tree then it is in the subtree rooted at t. The algorithm knows either the index of a component where q and the smallest d-tuple in the subtree t differ, or the index of a component where q and the largest d-tuple in the subtree t differ. Then the algorithm determines that q is in node t, or it advances to the left or right subtrees of node t. In either case it maintains the above invariant. It is important to point out the invariant is: “the index of a component where q and the smallest d-tuple in the subtree differ” rather than the “the first index of a ...”. It does not seem possible to find “the first index ...” in this structure efficiently with the information stored in the tree. This is not a big problem when q is in the tree since it will be found quickly, but if it is not in the tree then in order to avoid reporting that it is in the tree one must perform an additional verification step at the end that takes O(d) time. Gonzalez [4] calls this search strategy “assume, verify and conquer” (AVC). I.e., in order to avoid multiple expensive verification steps one assumes that some prefixes of strings match. The outcome of the search depends on whether or not these assumptions were valid. This can be determined by performing one simple verification step that takes O(d) time. The elimination of multiple verifications is very important because in the worst case there are Ω(log n) verifications, and each one could take Ω(d) time. The difference between this approach and the one in [11] is that Manber and Myers compute the first element where M and q differ, where as Gonzalez [4] computes an element where M and q differ. As we said before, in Gonzalez [4] structure one cannot compute efficiently the first element where M and q differ.

Gonzalez [5] modified the structure in [4] to one that follows the search process in [11]. The new structure, which we discuss in Section 24.4, is in general faster to update because for every node t one keeps the index of the first component where the d-tuple stored at node t and the smallest (largest) d-tuple greater (smaller) than all the d-tuples in the subtree rooted at t (if any) differ, rather than the one between node t and the smallest (largest) d-tuple in its subtree rooted at t as in [4]. In this structure only several nodes have to be modified when inserting a node or deleting a leaf node, but in the structure in [4] one may need to update O(log n) nodes. Deleting a non-leaf node from the tree requires more work in this structure than in [4], but membership testing is simpler in the structure in [5]. To summarize, the membership algorithm in [5] mimics the search procedure in [11], but follows the update approach developed in [4]. Gonzalez [5] established that the dictionary operations can be implemented to take O(d + log n) time while performing only a constant number of rotations for both insert and delete. Other operations which can be performed efficiently in this multidimensional balanced binary search trees are: find the (lexicographically) smallest (largest) d-tuple (O(log n) time), print in lexicographic order (O(dn) time), and concatenation (O(d + log n) time). Finding the (lexicographically) kth smallest (largest) d-tuple can also be implemented efficiently (O(log n) time) by adding to each node the number of nodes in its left subtree. The asymptotic time complexity bound for the procedures in [5] is identical to the ones in [4], but the procedures in [5] are simpler. To distinguish this new type of balanced binary search trees from the classic ones and the ones in [4], we refer to these trees as multidimensional balanced binary search trees. In this article we follow the notation in [5].

In Section 24.5 we show that the rotation operation in [5] can be implemented to take only constant time by making some rather simple operations that were first discussed in [6]. The implication of this, as pointed out in [6], is that the split operation can also be implemented to take O(d + log n). Also, the efficient implementation of the rotation operation allows us to use the technique in [5] on many other binary search trees, like AVL, weight balanced, etc., since performing O(log n) rotations does not limit the applicability of the techniques in [5]. These observations were first reported in [6], where they present a similar approach, but in a more general setting.

Comments

Popular posts from this blog

Data Structure Visualization:Introduction and Value of Data Structure Rendering

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Drawing Trees:HV-Layout