B Trees:The B-tree

The B-tree

The problem which the B-tree aims to solve is: given a large collection of objects, each having a key and an value, design a disk-based index structure which efficiently supports query and update.

Here the query that is of interest is the exact-match query: given a key k, locate the value of the object with key=k. The update can be either an insertion or a deletion. That is, insert a new object into the index, or delete from the index an object with a given key.

B-tree Definition

A B-tree is a tree structure where every node corresponds to a disk page and which satisfies the following properties:

• A node (leaf or index) x has a value x.num as the number of objects stored in x. It also stores the list of x.num objects in increasing key order. The key and value of the ith object (1 i x.num) are represented as x.key[i] and x.value[i], respectively.

• Every leaf node has the same depth.

• An index node x stores, besides x.num objects, x.num+1 child pointers. Here each child pointer is a page ID of the corresponding child node. The ith child pointer is denoted as x.child[i]. It corresponds to a key range (x.key[i 1], x.key[i]). This means that in the ith sub-tree, any object key must be larger than x.key[i 1] and smaller than x.key[i]. For instance, in the sub-tree referenced by x.child[1], the object keys are smaller than x.key[1]. In the sub-tree referenced by x.child[2], the objects keys are between x.key[1] and x.key[2], and so on.

• Every node except the root node has to be at least half full. That is, suppose an index node can hold up to 2B child pointers (besides, of course, 2B-1 objects), then any index node except the root must have at least B child pointers. A leaf node can hold more objects, since no child pointer needs to be stored. However, for simplicity we assume a leaf node holds between B and 2B objects.

• If the root node is an index node, it must have at least two children.

A special case of the B-tree is when B = 2. Here every index node must have 2 or 3 or 4 child pointers. This special case is called the 2-3-4 tree.

image

In the figure, every index node contains between 2 and 4 child pointers, and every leaf node contains between 2 and 4 objects. The root node A is an index node. Currently it has one object with key=25 and two child pointers. In the left sub-tree, every object has key<25. In the right sub-tree, every object has key>25. Every leaf node (D through I) are located at the same depth: their distance to A is 2. Currently, there are two pages which are full: an index node B and a leaf node D.

B-tree Query

To find the value of an object with key=k, we call the Query algorithm given below. The parameters are the tree root pageID and the search key k. The algorithm works as follows. It follows (at most) a single path from root to leaf. At each index node along the path, there can be at most one sub-tree whose key range contains k. A recursive call on that sub-tree is performed (step 2c). Eventually, we reach a leaf node (step 3a). If there exists an object in the node with key=k, the algorithm returns the value of the object. Otherwise, the object does not exist in the tree and NU LL is returned. Since the index nodes of the B-tree also stores objects, it is possible that the object with key=k is found in an index node. In this case, the algorithm returns the object value without going down to the next level (step 2a).

Algorithm Query(pageID, k)

Input: pageID of a B-tree node, a key k to be searched.

Output: value of the object with key= k; NU LL if non-exist.

image

As an example, Figure 15.2 shows how to perform a search query for k = 13.

At node A, we should follow the left sub-tree since k < 25. At node B, we should follow the third sub-tree since 10 < k < 16. Now we reach a leaf node F . An object with key=13 is found in the node.

image

If the query wants to search for k = 12, we still examine the three nodes A, B, F . This time, no object with key=12 is found in F , and thus the algorithm returns NU LL. If the search key is 10 instead, the algorithm only examines node A and B. Since in node B such an object is found, the algorithm stops there.

Notice that in the Query algorithm, only Disk Read function is called. The other three functions, e.g. Disk Write are not needed as the algorithm does not modify the B-tree.

Since the query algorithm examines a single path from root to leaf, the complexity of the algorithm in number of I/Os is O(logB n), where n is the number of objects.

B-tree Insertion

To insert a new object with key k and value v into the index, we call the Insert algorithm given below.

Algorithm Insert(root, k, v)

Input: root pageID of a B-tree, the key k and the value v of a new object.

Prerequisite: The object does not exist in the tree.

Action: Insert the new object into the B-tree.

image

Basically, the algorithm makes sure that root page is not currently full, and then it calls the Insert Not Full function to insert the object into the tree. If the root page x is full, the algorithm will split it into two nodes y and z, and node x will be promoted to a higher level, thus increasing the height of the tree.

This scenario is illustrated in Figure 15.3. Node x is a full root page.

It contains three objects and four child pointers. If we try to insert some record into the tree, the root node is split into two nodes y and z. Originally, x contains x.num = 3 objects. The left object (key=6) is moved to a new node y. The right object (key=16) is moved to a new node z.

image

The middle object (key=10) remains in x. Correspondingly, the child pointers D, E, F , G are also moved. Now, x contains only one object (key=10). We make it as the new root, and make y and z be the two children of it.

To insert an object into a sub-tree rooted by a non-full node x, the following algorithm Insert Not Full is used.

Algorithm InsertNotFull(x, k, v)

Input: an in-memory page x of a B-tree, the key k and the value v of a new object.

Prerequisite: page x is not full.

Action: Insert the new object into the sub-tree rooted by x.

image

Algorithm Insert Not Full examines a single path from root to leaf, and eventually insert the object into some leaf page. At each level, the algorithm follows the child pointer whose key range contains the key of the new object (step 2a). If no node along the path is full, the algorithm recursively calls itself on each of these nodes (step 2d) till the leaf level, where the object is inserted into the leaf node (step 1).

Consider the other case when some node w along the path is full (step 2c). The node is first split into two (w and y). The right half of the objects from w are moved to y, while the middle object is pushed into the parent node. After the split, the key range of either w or y, but not both, contains the key of the new object. A recursive call is performed on the correct node.

As an example, consider inserting an object with key=14 into the B-tree of Figure 15.2. The result is shown in Figure 15.4. The child pointers that are followed are thick. When we examine the root node A, we follow the child pointer to B. Since B is full, we first split it into two, by moving the right half of the objects (only one object in our case, with key=16) into a new node Bll. The child pointers to F and G are moved as well. Further, the previous middle object in B (key=10) is moved to the parent node A. A new child pointer to Bll is also generated in A. Now, since the key of the new object is 14, which is bigger than 10, we recursively call the algorithm on Bll. At this node, since 14 < 16, we recursively call the algorithm on node F . Since F is a leaf node, the algorithm finishes by inserting the new object into F . The accessed disk pages are shown as shadowed.

image

B-tree Deletion

This section describes the Delete algorithm which is used to delete an object with key=k from the B-tree. It is a recursive algorithm. It takes (besides k) as parameter a tree node, and it will perform deletion in the sub-tree rooted by that node.

We know that there is a single path from the root node to the node x that contains k.

The Delete algorithm examines this path. Along the path, at each level when we examine node x, we first make sure that x has at least one more element than half full (except the case when x is the root). The reasoning behind this is that in order to delete an element from the sub-tree rooted by x, the number of element stored in x can be reduced at most by one. If x has one more element than half full (minimum occupancy), it can be guaranteed that x will not underflow. We distinguish three cases:

1. x is a leaf node;

2. x is an index node which contains an object with key=k;

3. x is an index node which does not contain an object with key=k.

We first describe the Delete algorithm and then discuss the three cases in more detail.

Algorithm Delete(x, k)

Input: an in-memory node x of a B-tree, the key k to be deleted.

Prerequisite: an object with key=k exists in the sub-tree rooted by x.

Action: Delete the object from the sub-tree rooted by x.

image

3. else

(a) If the child y that precedes k in x has at least one more object than minimally required, find the predecessor kl of k in the sub-tree rooted by y, recursively delete kl from the sub-tree and replace k with kl in x.

(b) Otherwise, y is exactly half full. We check the child z that immediately follows k in x. If z has at least one more object than minimally required, find the successor kl of k in the sub-tree rooted by z, recursively delete kl from the sub-tree and replace k with kl in x.

(c) Otherwise, both y and z are half full. Merge them into one node and push k down to the new node as well. Recursively delete k from this new node.

4. end if

Along the search path from the root to the node containing the object to be deleted, for each node x we encounter, there are three cases. The simplest scenario is when x is a leaf node (step 1 of the algorithm). In this case, the object is deleted from the node and the algorithm returns. Note that there is no need to handle underflow. The reason is: if the leaf node is root, there is only one node in the tree and it is fine if it has only a few objects; otherwise, the previous recursive step has already guaranteed that x has at least one more object than minimally required.

Steps 2 and 3 of the algorithm correspond to two different cases of dealing with an index node.

For step 2, the index node x does not contain the object with key=k. Thus there exists a child node y whose key range contains k. After we read the child node into memory (step 2b), we will recursively call the Delete algorithm on the sub-tree rooted by y (step 2d).

However, before we do that, step 2(c) of the algorithm makes sure that y contains at least one more object than half full.

Suppose we want to delete 5 from the B-tree shown in Figure 15.2. When we are examining the root node A, we see that child node B should be followed next. Since B has two more objects than half full, the recursion goes to node B. In turn, since D has two more objects than minimum occupancy, the recursion goes to node D, where the object can be removed. Let’s examine another example. Still from the B+-tree shown in Figure 15.2, suppose we want to delete 33. The algorithm finds that the child node y = C is half full. One more object needs to be incorporated into node C before a recursive call on C is performed. There are two sub-cases. The first sub-case is when one immediate sibling z of node y has at least one more object than minimally required. This case corresponds to step 2(c)i of the algorithm. To handle this case, we drag one object down from x to y, and we push one object from the sibling node up to x. As an example, the deletion of object 33 is shown in Figure 15.5.

image

FIGURE 15.5: Illustration of step 2(c)i of the Delete algorithm. Deleting an object with key=33 from the B-tree of Figure 15.2. At node A, we examine the right child. Since node C only had one object before, a new object was added to it in the following way: the object with key=25 is moved from A to C, and the object with key=16 is moved from B to A. Also, the child pointer pointing to G is moved from B to C.

Another sub-case is when all immediate siblings of y are exactly half full. In this case, we merge y with one sibling. In our 2-3-4-tree example, an index node which is half full contains one object. If we merge two such nodes together, we also drag an object from the parent node of them down to the merged node. The node will then contain three objects, which is full but does not overflow.

For instance, suppose we want to delete object 31 from Figure 15.5. When we are examining node x = C, we see that we need to recursively delete in the child node y = H. Now, both immediate siblings of H are exactly half full. So we need to merge H with a sibling, say G. Besides moving the remaining object 28 from H to G, we also should drag object 25 from the parent node C to G. The figure is omitted for this case.

The third case is that node x is an index node which contains the object to be deleted. Step 3 of algorithm Delete corresponds to this scenario. We cannot simply delete the object from x, because we also need to decrement the number of child pointers by one. In Figure 15.5, suppose we want to delete object with key=25, which is stored in index node C. We cannot simply remove the object, since C would have one object but three child pointers left. Now, if child node G immediately to the left of key 25 had three or more objects, the algorithm would execute step 3(a) and move the last object from G into C to fill in the space of the deleted object. Step 3(b) is a symmetric step which shows that we can move an object from the right sub-tree.

image

FIGURE 15.6: Illustration of step 3(c) of the Delete algorithm. Deleting an object with key=25 from the B-tree of Figure 15.5. At node A, we examine the right child. We see that node C contains the object with key=25. We cannot move an object up from a child node of C, since both children G and H (around key 25) are exactly half full. The algorithm merges these two nodes into one, by moving objects 28 and 31 from H to G and then deleting H. Node C loses an object (key=25) and a child pointer (to H).

However, in our case, both child nodes G and H are half full and thus cannot contribute an object. Step 3(c) of the algorithm corresponds to this case. As shown in Figure 15.6, the two nodes are merged into one.

Comments

Popular posts from this blog

Data Structure Visualization:Introduction and Value of Data Structure Rendering

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Drawing Trees:HV-Layout