Online Dictionary Structures:Binary Search Tree Implementations

Binary Search Tree Implementations

As pointed out by Gonzalez [4, 5], using a balanced binary search tree (without a trie) and storing each tuple at each node leads to membership (and also insert and delete) algorithms that take O(d log n) time, where n is the number of elements in the tree, because one needs to compare the element being searched with the d-tuples at each node. One can go further and claim that for some problem instances it actually requires Θ(d log n) time. Gonzalez [5] also points out that simple shortcuts to the search process do not work. For example if we reach a node in the search such that the ﬁrst k components are identical, one may be tempted to conclude that in the subtree rooted at that node one needs to search only from position k + 1 to d in the d-tuples. This is false because the k-component preﬁxes of all the d-tuples in a subtree may vary considerable in a binary search tree. One can easily show that even the membership operation cannot be implemented this way. However, this variation is more predictable when comparing against the smallest or largest d-tuple in a subtree. This is a key idea exploited in [4].

Manber and Myers [11] developed an eﬃcient algorithm that given an N symbol text it ﬁnds all the occurrences of any input word q. The scenario is that the text is static, but there will be many word searches. Their approach is to preprocess the text and generate a structure where the searching can be performed eﬃciently. In their preprocessing stage they construct a sorted list with all the N suﬃxes of the text. Locating all the occurrences of a string q reduces to performing two binary search operations in the list of suﬃxes, the ﬁrst for the ﬁrst suﬃx that contains as preﬁx q and the second search is for the last suﬃx that contains as preﬁx q. Both searches are similar, so lets discuss the ﬁrst one. This operation is similar to the membership operation discussed in this chapter. Manber and Myers [11] binary search process begins by letting L (R) be the largest (smallest) string in the in the list. Then l (r), the index of the ﬁrst element where L (R) diﬀer from q is computed. Note that if two strings are identical the index of the ﬁrst component where they diﬀer is set to the length of string plus 1. We use this convention throughout this chapter. The middle entry M in the list is located and then they compute the index of the ﬁrst component where M and q diﬀer. If this value is computed the obvious way, then the procedure will not be eﬃcient. To do this eﬃciently they compute it with l, r as well as index of the ﬁrst component where M and L, and M and R diﬀer. These last two values are precomputed in the preprocessing stage. This indirect computation may take O(|q|) time; however, overall the phases of the computation the process takes at most O(|q|) time. The advantage of this approach is that it requires only O(N ) space, and the preprocessing can be done in O(N ) expected time [11]. The disadvantage is that it is not a dynamic. Updating the text requires expensive recomputations in the precomputed data, i.e., one needs to ﬁnd the ﬁrst component where many pairs in the list diﬀer in order to carry out eﬃciently the binary search process. For their application [11] the text is static. The time required to do the search for q is O(|q| + log N ) time. This approach results in a structure that is similar to the fully balanced tree strategy in [1].

Gonzalez [4] solved Vaishnavi’s open problem by designing a binary tree data structure where the multidimensional dictionary operations can be performed in O(d + log n) time while performing only a constant number of rotations. To achieve these goals he represents the set of d-tuples S in a balanced binary search tree that contains additional information at each node. This additional information is similar to the one in the lists of Manber and Myers [11], but it can be recomputed eﬃciently as we insert and delete elements. The disadvantage is that the membership operation is more complex. But all the procedures developed in [4] are simpler than the ones given in [7, 12, 16, 17]. One just needs to add a few instructions in addition to the normal code required to manipulate balanced binary search trees [15]. The additional information in [4] includes for every node v in the tree the index of the ﬁrst element where v and the smallest (largest) d-tuple in the subtree rooted at v diﬀer as well as a pointer to this d-tuple. As mentioned above, testing for membership is more complex. At each iteration in the search process [4] we are at the tree node t and we know that if q is in the tree then it is in the subtree rooted at t. The algorithm knows either the index of a component where q and the smallest d-tuple in the subtree t diﬀer, or the index of a component where q and the largest d-tuple in the subtree t diﬀer. Then the algorithm determines that q is in node t, or it advances to the left or right subtrees of node t. In either case it maintains the above invariant. It is important to point out the invariant is: “the index of a component where q and the smallest d-tuple in the subtree diﬀer” rather than the “the ﬁrst index of a ...”. It does not seem possible to ﬁnd “the ﬁrst index ...” in this structure eﬃciently with the information stored in the tree. This is not a big problem when q is in the tree since it will be found quickly, but if it is not in the tree then in order to avoid reporting that it is in the tree one must perform an additional veriﬁcation step at the end that takes O(d) time. Gonzalez [4] calls this search strategy “assume, verify and conquer” (AVC). I.e., in order to avoid multiple expensive veriﬁcation steps one assumes that some preﬁxes of strings match. The outcome of the search depends on whether or not these assumptions were valid. This can be determined by performing one simple veriﬁcation step that takes O(d) time. The elimination of multiple veriﬁcations is very important because in the worst case there are Ω(log n) veriﬁcations, and each one could take Ω(d) time. The diﬀerence between this approach and the one in [11] is that Manber and Myers compute the ﬁrst element where M and q diﬀer, where as Gonzalez [4] computes an element where M and q diﬀer. As we said before, in Gonzalez [4] structure one cannot compute eﬃciently the ﬁrst element where M and q diﬀer.

Gonzalez [5] modiﬁed the structure in [4] to one that follows the search process in [11]. The new structure, which we discuss in Section 24.4, is in general faster to update because for every node t one keeps the index of the ﬁrst component where the d-tuple stored at node t and the smallest (largest) d-tuple greater (smaller) than all the d-tuples in the subtree rooted at t (if any) diﬀer, rather than the one between node t and the smallest (largest) d-tuple in its subtree rooted at t as in [4]. In this structure only several nodes have to be modiﬁed when inserting a node or deleting a leaf node, but in the structure in [4] one may need to update O(log n) nodes. Deleting a non-leaf node from the tree requires more work in this structure than in [4], but membership testing is simpler in the structure in [5]. To summarize, the membership algorithm in [5] mimics the search procedure in [11], but follows the update approach developed in [4]. Gonzalez [5] established that the dictionary operations can be implemented to take O(d + log n) time while performing only a constant number of rotations for both insert and delete. Other operations which can be performed eﬃciently in this multidimensional balanced binary search trees are: ﬁnd the (lexicographically) smallest (largest) d-tuple (O(log n) time), print in lexicographic order (O(dn) time), and concatenation (O(d + log n) time). Finding the (lexicographically) kth smallest (largest) d-tuple can also be implemented eﬃciently (O(log n) time) by adding to each node the number of nodes in its left subtree. The asymptotic time complexity bound for the procedures in [5] is identical to the ones in [4], but the procedures in [5] are simpler. To distinguish this new type of balanced binary search trees from the classic ones and the ones in [4], we refer to these trees as multidimensional balanced binary search trees. In this article we follow the notation in [5].

In Section 24.5 we show that the rotation operation in [5] can be implemented to take only constant time by making some rather simple operations that were ﬁrst discussed in [6]. The implication of this, as pointed out in [6], is that the split operation can also be implemented to take O(d + log n). Also, the eﬃcient implementation of the rotation operation allows us to use the technique in [5] on many other binary search trees, like AVL, weight balanced, etc., since performing O(log n) rotations does not limit the applicability of the techniques in [5]. These observations were ﬁrst reported in [6], where they present a similar approach, but in a more general setting.

Search This Blog

algorithms

Online Dictionary Structures:Binary Search Tree Implementations

Binary Search Tree Implementations

Comments

Post a Comment

Popular posts from this blog

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Data Structure Visualization:Introduction and Value of Data Structure Rendering

0/1 Knapsack Problem Memory function.