Balanced Binary Search Trees: Low Height Schemes

Low Height Schemes

Most rebalancing schemes reproduce the result of AVL-trees [1] in the sense that they guarantee a height of c · log(n) for some constant c > 1, while doing updates in O(log n) time. Since the height determines the worst-case complexity of almost all operations, it may be reasonable to ask exactly how close to the best possible height ⌈log(n + 1)⌉ a tree can be maintained during updates. Presumably, the answer depends on the amount of rebalancing work we are willing to do, so more generally the question is: given a function f, what is the best possible height maintainable with O(f(n)) rebalancing work per update?

This question is of practical interest: in situations where many more searches than updates are performed, lowering the height by a factor of (say) two will improve overall performance, even if it is obtained at the cost of a larger update time. It is also of theoretical interest, since we are asking about the inherent complexity of maintaining a given height in binary search trees. In this section, we review the existing answers to the question.

Already in 1976, Maurer et al. [55] proposed the k-neighbor trees, which guarantee a height of c · log(n), where c can be chosen arbitrarily close to one. These are unary-binary trees, with all leaves having the same depth and with the requirement that between any two unary nodes on the same level, at least k − 1 binary nodes appear. They may be viewed as a type of (1, 2)-trees where the rebalancing operations exchange children, not only with neighboring nodes (as in standard (a, b)-tree or B-tree rebalancing), but with nodes a horizontal distance k away. Since at each level, at most one out of k nodes is unary, the number of nodes increases by a factor of at least (2(k − 1) + 1)/k = 2 − 1/k for each level. This implies a height bound of log_{2−1/k}(n) = log(n)/log(2 − 1/k). By first order approximation, log(1 + x) = Θ(x) and 1/(1 + x) = 1 − Θ(x) for x close to zero, so 1/log(2 − 1/k) = 1/(1 + log(1 − 1/(2k))) = 1 + Θ(1/k). Hence, k-neighbor trees maintain a height of (1 + Θ(1/k)) log n in time O(k log n) per update.
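To get a feel for the constants, the bound log(n)/log(2 − 1/k) can be evaluated directly. The following quick check (illustrative only; the values of n and k are chosen arbitrarily, not taken from [55]) shows the factor 1/log(2 − 1/k) approaching one as k grows:

    import math

    # Height bound of k-neighbor trees: log(n) / log(2 - 1/k).
    # The ratio c = 1 / log2(2 - 1/k) to the optimal height log(n)
    # is 1 + Theta(1/k), independent of n.
    n = 10**6
    for k in (2, 4, 8, 16, 32):
        c = 1 / math.log2(2 - 1 / k)
        print(f"k = {k:2d}:  c = {c:.4f},  height bound ~ {c * math.log2(n):.1f}")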

Another proposal [8] generalizes the red-black method of implementing (2, 4)-trees as binary trees, and uses it to implement (a, b)-trees as binary trees for a = 2^k and b = 2^{k+1}. Each (a, b)-tree node is implemented as a binary tree of perfect balance. If the underlying (a, b)-tree has t levels, the binary tree has height at most t(k + 1) and has at least (2^k)^t = 2^{kt} nodes. Hence, log n ≥ tk, so the height is at most ((k + 1)/k) log n = (1 + 1/k) log n. As in red-black trees, a node splitting or fusion in the (a, b)-tree corresponds to a constant amount of recoloring. These operations may propagate along the search path, while the remaining rebalancing necessary takes place at a constant number of (a, b)-tree nodes. In the binary formulation, these operations involve rebuilding subtrees of size Θ(2^k) to perfect balance. Hence, the rebalancing cost is O(log(n)/k + 2^k) per update.
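Rebuilding a subtree of size Θ(2^k) to perfect balance is the basic operation here, and partial rebuilding reappears in several schemes below. A minimal Python sketch, assuming a plain pointer-based node type (the names Node, flatten, and build_perfect are ours for illustration, not from [8]):

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def flatten(v, acc):
        # Collect the keys of the subtree rooted at v in sorted (in-order) order.
        if v is not None:
            flatten(v.left, acc)
            acc.append(v.key)
            flatten(v.right, acc)
        return acc

    def build_perfect(keys, lo=0, hi=None):
        # Build a perfectly balanced subtree from sorted keys:
        # the middle key becomes the root, recursively.
        if hi is None:
            hi = len(keys)
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        return Node(keys[mid],
                    build_perfect(keys, lo, mid),
                    build_perfect(keys, mid + 1, hi))

    def rebuild(v):
        # Partial rebuilding: O(|v|) time for a subtree of size |v|.
        return build_perfect(flatten(v, []))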

Choosing k = ⌊log log n⌋ gives a tree with height bound log n + log(n)/log log(n) and update time O(log n). Note that the constant for the leading term of the height bound is now one. To accommodate a non-constant k, the entire tree is rebuilt when ⌊log log n⌋ changes. Amortized, this is O(1) work, which can be made a worst case bound by using incremental rebuilding [68].

Returning to k-neighbor trees, we may use the method of non-constant k also there. One possibility is k = Θ(log n), which implies a height bound as low as log n + O(1), maintained with O(log^2 n) rebalancing work per update. This height is O(1) from the best possible. A similar result can be achieved using the general balanced trees described in Section 10.6: In the proof of complexity in that section, the main point is that the cost |v| of a rebuilding at a node v can be attributed to at least (2^{−1/c} − 1/2)|v| updates, implying that each update is attributed at most 1/(2^{−1/c} − 1/2) cost at each of the at most O(log n) nodes on the search path. The rebalancing cost is therefore O((1/(2^{−1/c} − 1/2)) log n) for maintaining height c · log n. Choosing c = 1 + 1/log n gives a height bound of log n + O(1), maintained in O(log^2 n) amortized rebalancing work per update, since 2^{−1/(1+1/log n)} − 1/2 can be shown to be Θ(1/log n) using the first order approximations 1/(1 + x) = 1 − Θ(x) and 2^x = 1 + Θ(x) for x close to zero.
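As a sanity check of this accounting, the per-node charge 1/(2^{−1/c} − 1/2) can be evaluated for c = 1 + 1/log n; it grows like Θ(log n), giving O(log^2 n) over the O(log n) ancestors. A small illustrative computation (ours, not from the cited sources):

    import math

    def charge(c):
        # Each rebuild of cost |v| is paid for by at least
        # (2**(-1/c) - 1/2) * |v| updates, so one update is charged
        # at most 1 / (2**(-1/c) - 1/2) at each ancestor.
        return 1 / (2 ** (-1 / c) - 0.5)

    for i in (10, 20, 30):          # n = 2**i, log n = i
        c = 1 + 1 / i
        print(f"log n = {i:2d}:  per-ancestor charge ~ {charge(c):6.1f}")
    # The charge comes out at roughly 3 * log n, i.e., Theta(log n).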

We note that a binary tree with a height bound of log n + O(1) can in a natural way be embedded in an array of length O(n): Consider a tree T with a height bound of log n + k for an integer k, and consider n ranging over the interval [2^i, 2^{i+1}) for an integer i. For n in this interval, the height of T never exceeds i + k, so we can think of T as embedded in a virtual binary tree T′ with i + k completely full levels. Numbering the nodes in T′ by an in-order traversal and using these numbers as indexes in an array A of size 2^{i+k} − 1 gives an embedding of T into A. The keys of T will appear in sorted order in A, but empty array entries may exist between keys. An insertion into T which violates the height bound corresponds to an insertion into the sorted array A at a non-empty position. If T is maintained by the algorithm based on general balanced trees, rebalancing due to the insertion consists of rebuilding some subtree of T to perfect balance, which in A corresponds to an even redistribution of the elements in some consecutive segment of the array. In particular, the redistribution ensures an empty position at the insertion point.
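A sketch of the embedding, reusing the Node class from the earlier sketch: each node of T is assigned the in-order number of the position it occupies in the virtual complete tree T′, which is exactly the midpoint of its index range (the names and structure are ours, for illustration):

    def embed(root, levels):
        # Embed a tree of height <= levels into an array of size
        # 2**levels - 1, using in-order positions in the virtual
        # complete tree T' with `levels` full levels.
        A = [None] * (2 ** levels - 1)

        def place(v, lo, hi):
            if v is None:
                return
            mid = (lo + hi) // 2   # in-order index of this position in T'
            A[mid] = v.key
            place(v.left, lo, mid)
            place(v.right, mid + 1, hi)

        place(root, 0, len(A))
        return A                   # keys appear in sorted order, with gaps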

In short, the tree rebalancing algorithm can be used as a maintenance algorithm for a sorted array of keys supporting insertions and deletions in amortized O(log^2 n) time. The requirement is that the array is never filled to more than some fixed fraction of its capacity (the fraction is 1/2^{k−1} in the example above). Such an amortized O(log^2 n) solution, phrased directly as a maintenance algorithm for sorted arrays, first appeared in [38]. By the converse of the embedding just described, [38] implies a rebalancing algorithm for low height trees with bounds as above. This algorithm is similar, but not identical, to the one arising from general balanced trees (the criteria for when to rebuild/redistribute are similar, but differ in the details). A solution to the sorted array maintenance problem with worst case O(log^2 n) update time was given in [78]. Lower bounds for the problem appear in [28, 29], with one of the bounds stating that for algorithms using even redistribution of the elements in some consecutive segment of the array, O(log^2 n) time is best possible when the array is filled up to some constant fraction of its capacity.
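For concreteness, the array operation corresponding to a subtree rebuild is an even redistribution of the keys in a consecutive segment. A sketch (ours; the actual rules in [38] for choosing the segment differ in the details):

    def redistribute(A, lo, hi):
        # Spread the keys stored in A[lo:hi] evenly over the segment,
        # leaving the empty slots (None) as evenly spaced as possible.
        # Assumes the segment is at most full: #keys <= hi - lo.
        keys = [x for x in A[lo:hi] if x is not None]
        m, width = len(keys), hi - lo
        if m == 0:
            return
        A[lo:hi] = [None] * width
        for idx, key in enumerate(keys):
            A[lo + (idx * width) // m] = key   # target positions strictly increase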

We note that the correspondence between the tree formulation and the array formulation only holds when partial rebuilding is used to rebalance the tree, since only then is the cost of the redistribution the same in the two versions. In contrast, a rotation in the tree will shift entire subtrees up and down at constant cost, which in the array version entails cost proportional to the size of the subtrees. Thus, for pointer-based implementations of trees, the above Ω(log^2 n) lower bound does not hold, and better complexities can be hoped for.

Indeed, for trees, the rebalancing cost can be reduced further. One method is by applying the idea of bucketing: the subtrees on the lowest Θ(log K) levels of the tree are changed into buckets holding Θ(K) keys. This size bound is maintained by treating the buckets as (a, b)-tree nodes, i.e., by bucket splitting, fusion, and sharing. Updates in the top tree only happen when a bucket is split or fused, which only happens once per Θ(K) updates in the bucket. Hence, the amortized update time for the top tree drops by a factor of K. The buckets themselves can be implemented as well-balanced binary trees; using the schemes above based on k-neighbor trees or general balanced trees for both the top tree and the buckets, we arrive at a height bound of log n + O(1), maintained with O(log^2 log n) amortized rebalancing work. Applying the idea recursively inside the buckets improves the time even further.
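Plugging in a bucket capacity of K = Θ(log^2 n) makes the claimed bound concrete: the top tree is updated once per Θ(K) updates, so its O(log^2 n) amortized cost drops to O(1), while rebalancing inside a bucket of size K costs O(log^2 K) = O(log^2 log n). An illustrative calculation (our choice of parameters, constants suppressed):

    import math

    i = 30                      # n = 2**30, log n = 30
    K = i * i                   # bucket capacity Theta(log^2 n)
    top = i * i / K             # amortized top-tree work per update: O(1)
    bucket = math.log2(K) ** 2  # work inside a bucket: Theta(log^2 log n)
    print(f"K = {K}: top ~ {top:.1f}, bucket ~ {bucket:.1f}")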

This line of rebalancing schemes was developed in [3, 4, 9, 10, 42, 43], ending in a scheme [10] maintaining height ⌈log(n + 1)⌉ + 1 with O(1) amortized rebalancing work per update.

This rather positive result is in contrast to an observation made in [42] about the cost of maintaining exactly optimal height ⌈log(n + 1)⌉: When n = 2^i − 1 for an integer i, there is only one possible tree of height ⌈log(n + 1)⌉, namely a tree of i completely full levels. By the ordering of keys in a search tree, the keys of even rank are in the lowest level, and the keys of odd rank are in the remaining levels (where the rank of a key k is defined as the number of keys in the tree that are smaller than k). Inserting a new smallest key and removing the largest key leads to a tree of the same size, but where all elements previously of odd rank now have even rank, and vice versa. If optimal height is maintained, all keys previously in the lowest level must now reside in the remaining levels, and vice versa; in other words, the entire tree must be rebuilt. Since the process can be repeated, we obtain a lower bound of Ω(n) per update, even with respect to amortized complexity. Thus, we have the intriguing situation that a height bound of ⌈log(n + 1)⌉ has amortized complexity Θ(n) per update, while raising the height bound a trifle to ⌈log(n + 1)⌉ + 1 reduces the complexity to Θ(1).
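The parity argument can be traced by hand for n = 7 = 2^3 − 1, where the unique tree of optimal height has three full levels. A small illustrative check:

    # The unique full tree on keys 1..7:
    #          4              internal keys: (0-based) ranks 1, 3, 5
    #        2   6
    #       1 3 5 7           leaf keys: ranks 0, 2, 4, 6
    keys = list(range(1, 8))
    leaves_before = set(keys[0::2])       # keys of even rank: {1, 3, 5, 7}

    # Insert a new smallest key 0 and delete the largest key 7:
    keys = [0] + keys[:-1]                # same size, all ranks shifted by one
    leaves_after = set(keys[0::2])        # keys of even rank: {0, 2, 4, 6}

    # No key may remain in the leaf level, so the entire tree is rebuilt:
    print(leaves_before & leaves_after)   # set()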

Actually, the papers [3, 4, 9, 10, 42, 43] consider a more detailed height bound of the form ⌈log(n + 1) + ε⌉, where ε is any real number greater than zero. For ε less than one, this expression is optimal for the first integers n above 2^i − 1 for any i, and optimal plus one for the last integers before 2^{i+1} − 1. In other words, the smaller the ε, the closer to the next power of two is the height guaranteed to be optimal. Considering tangents to the graph of the logarithm function, it is easily seen that ε is proportional to the fraction of integers n for which the height is non-optimal.
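This proportionality is easy to verify empirically: within [2^i, 2^{i+1}) the non-optimal fraction works out to 2(1 − 2^{−ε}) ≈ (2 ln 2)ε. A quick check (our computation; i and ε chosen arbitrarily):

    import math

    i, eps = 12, 0.05
    lo, hi = 2 ** i, 2 ** (i + 1)
    nonopt = sum(1 for n in range(lo, hi)
                 if math.ceil(math.log2(n + 1) + eps) > math.ceil(math.log2(n + 1)))
    print(nonopt / (hi - lo), 2 * (1 - 2 ** -eps))   # both ~ 0.068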

Hence, an even more detailed formulation of the question about height bound versus rebalancing work is the following: Given a function f, what is the smallest possible ε such that the height bound ⌈log(n + 1) + ε⌉ is maintainable with O(f(n)) rebalancing work per update?

In the case of amortized complexity, the answer is known. In [30], a lower bound is given, stating that no algorithm using o(f(n)) amortized rebuilding work per update can guarantee a height of ⌈log(n + 1) + 1/f(n)⌉ for all n. The lower bound is proved by mapping trees to arrays and exploiting a fundamental lemma on density from [28]. In [31], a balancing scheme was given which maintains height ⌈log(n + 1) + 1/f(n)⌉ in amortized O(f(n)) time per update, thereby matching the lower bound. The basic idea of the balancing scheme is similar to k-neighbor trees, but a more intricate distribution of unary nodes is used. Combined, these results show that for amortized complexity, the answer to the question above is

ε(n) = Θ(1/f(n)).

We may view this expression as describing the inherent amortized complexity of rebalancing a binary search tree, seen as a function of the height bound maintained. Using the observation above that for any i, ⌈log(n + 1) + ε⌉ is equal to ⌈log(n + 1)⌉ for n from 2^i − 1 to (1 − Θ(ε))2^{i+1}, the result may alternatively be viewed as the cost of maintaining optimal height when n approaches the next power of two: for n = (1 − ε)2^{i+1}, the cost is Θ(1/ε).

A graph depicting this cost appears in Figure 10.6.

This result holds for the fully dynamic case, where one may keep the size at (1 − ε)2^{i+1} by alternating between insertions and deletions. In the semi-dynamic case where only insertions take place, the amortized cost is smaller; essentially, it is the integral of the function in Figure 10.6, which gives Θ(n log n) for n insertions, or Θ(log n) per insertion.

[Figure 10.6: graph of the amortized rebalancing cost for maintaining optimal height, as a function of n.]

More concretely, we may divide the insertions causing n to grow from 2^i to 2^{i+1} into i segments, where segment one is the first 2^{i−1} insertions, segment two is the next 2^{i−2} insertions, and so forth. In segment j, we employ the rebalancing scheme from [31] with f(n) = Θ(2^j), which will keep optimal height in that segment. The total cost of insertions is O(2^i) inside each of the i segments, for a combined cost of O(i · 2^i), which is O(log n) amortized per insertion. By the same reasoning, the lower bound from [30] implies that this is best possible for maintaining optimal height in the semi-dynamic case.
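The summation behind the O(log n) amortized bound can be checked directly (unit constants assumed):

    i = 20
    # Segment j consists of 2**(i-j) insertions, each costing Theta(2**j),
    # so every segment costs 2**i and the i segments cost i * 2**i in total.
    total = sum(2 ** (i - j) * 2 ** j for j in range(1, i + 1))
    print(total == i * 2 ** i, total / 2 ** i)   # True, i: Theta(log n) per insertion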

Considering worst case complexity for the fully dynamic case, the amortized lower bound stated above of course still applies. The best existing upper bound is height ⌈log(n + 1) + min{1/√f(n), log(n)/f(n)}⌉, maintained in O(f(n)) worst case time, by a combination of results in [4] and [30]. For the semi-dynamic case, a worst case cost of Θ(n) can be enforced when n reaches a power of two, as can be seen by the argument above on odd and even ranks of nodes in a completely full tree.
