R-trees:Introduction and Basic Concepts.

Introduction

Spatial database management systems must be able to store and process large amounts of disk-resident spatial data. Multidimensional data support is needed in many fields including geographic information systems (GIS), computer aided design (CAD), and medical, multi- media, and scientific databases. Spatial data operations that need to be supported include spatial joins and various types of queries such as intersection, containment, topological and proximity queries. The challenge, and primary performance objective, for applications dealing with disk-resident data is to minimize the number of disk retrievals, or I/Os, needed to answer a query. Main memory data structures are designed to reduce computation time rather than I/O, and hence are not directly applicable to a disk based environment.

Just as the B-tree [13] (Chapter 15) and its variants were proposed to optimize I/O while processing single dimensional disk resident data, the original R-tree [23] and later variants have been proposed to index disk resident multidimensional data efficiently.

R-trees are very versatile, in the sense that they can be used with little or no modification to tackle many different problems related to spatial data. Asymptotically better data structures exist for specific instances of these problems, but the solution of choice is different for different types of inputs or queries. Thus, one of the main advantages of R-trees is that the same data structure can be used to handle different problems on arbitrary types of multidimensional data.

In this chapter we describe the original R-tree proposal and some of its variants. We analyze the efficiency of various operations and examine various performance models. We also describe R-tree based algorithms for additional operations, such as proximity queries and spatial joins. There are many more applications, variants and issues related to R-trees than we can possibly cover in one chapter. Some of the ones we do not cover include parallel and distributed R-trees [9, 27, 46, 55], variants for high dimensional [5, 30, 67] and for spatiotemporal data [31, 48, 49, 53, 54, 62, 63], and concurrency [33, 41]. See Chapter 47 for more on concurrent data structures. Other data structures with similar functionality, such as range, quad and k-d trees, are covered in other chapters of Part IV of this handbook. Some of these (see Chapter 27) are specifically designed for disk-resident data.

Basic Concepts

R-trees were first introduced in [23]. An R-tree is a hierarchical data structure derived from the B+-tree and originally designed to perform intersection queries efficiently. The tree stores a collection of d-dimensional points or rectangles which can change in time via insertions and deletions. Other object types, such as polygons or polyhedra, can be handled by storing their minimum bounding rectangles (MBRs) and performing additional tests to eliminate false hits. A false hit happens when the query object intersects the MBR of a data object but does not intersect the object itself. In the sequel we talk about rectangles only, with the understanding that a point is simply a degenerate rectangle. We use the terms MBR and bounding box, interchangeably. In our context, the d-dimensional rectangles are “upright”, i.e., each rectangle is the Cartesian product of d one-dimensional intervals:

[l1, h1] × ... × [ld, hd]. Thus, 2d values are used to specify a rectangle.

Each node of the tree stores a maximum of B entries. With the exception of the root, each node also stores a minimum of b B/2 entries. This constraint guarantees a space utilization of at least b/B. Each entry E consists of a rectangle r and a pointer pr . As with B+-trees all input values are stored at the leaves. Thus, at the leaf level, r is the bounding box of an actual object pointed to by pr . At internal nodes, r is the bounding box of all rectangles stored in the subtree pointed to by pr .

A downward path in the tree corresponds to a sequence of nested rectangles. All leaf nodes occur at the same level (i.e., have the same depth), even after arbitrary sequences of updates. This guarantees that the height of the tree is O(logb n), where n is the number of input rectangles. Notice that MBRs at the same level may overlap, even if the input rectangles are disjoint.

Figure 21.1 illustrates an R-tree with 3 levels (the root is at level 0) and a maximum of B = 4 rectangles per node. The 64 small dark data rectangles are grouped into 16 leaf level nodes, numbered 1 to 16. The bounding box of the set of rectangles stored at the same node is one of the rectangles stored at the parent of the node. In our example, the MBRs of leaf level nodes 1 through 4 are placed in node 17, in level 1. The root node contains the MBRs of the four level 1 nodes: 17, 18, 19, and 20.

Intersection queries

To perform an intersection query Q, all rectangles that intersect the query region must be retrieved and examined (regardless of whether they are stored in an internal node or a leaf node). This retrieval is accomplished by using a simple recursive procedure that starts at the root node and which may follow multiple paths down the tree. In the worst case, all nodes may need to be retrieved, even though some of the input data need not be reported. A node is processed by first identifying all the rectangles stored at that node which intersect Q. If the node is an internal node, the subtrees corresponding to the identified rectangles are searched recursively. Otherwise, the node is a leaf node and the retrieved rectangles (or the data objects themselves) are simply reported.

For illustration, consider the query Q in the example of Figure 21.1. After examining the root node, we determine that nodes 19 and 20 of level 1 must be searched. The search then proceeds with each of these nodes. Since the query region does not intersect any of the MBRs stored in node 19, this sub-query is terminated. While processing the other sub- query, it is determined that Q intersects the MBR corresponding to node 13 and this node

image

is retrieved. Upon checking the rectangles in node 13, the one data rectangle intersected by Q is returned.

Other type of queries, such as arbitrarily shaped queries (e.g., point or polygonal queries) or retrieving all rectangles contained or containing Q, can be handled using a straightforward modification of the above procedure.

Updating the tree

Many applications require support for update operations, such as insertion and deletion of rectangles. A tree that can change over time via such operations is said to be dynamic.

New rectangles can be inserted using a procedure similar to that used to insert a new key in a B+-tree. In other words, the new rectangle r is first added to a leaf node v and, if the node overflows, a split is performed that requires updating one rectangle and inserting another one in the parent of v. This procedure continues until either a node with fewer than B entries is found or the root is split, in which case a new node is created and the height of the tree grows by one. Independent of whether a split is performed or not, the bounding rectangles of all ancestors of r may need to be updated.

One important difference between R-trees and B+-trees is that, in our case, there is no incorrect leaf node to which the new rectangle can be added. Choosing a leaf node impacts performance, not correctness, and performance depends on our ability to cluster rectangles so as to minimize the expected number of rectangles intersected by a query of a given size.

In practice this means that clustering should attempt to minimize the areas and perimeters of the resulting MBRs. Models for predicting the expected number of nodes that need to be visited while performing an intersection query are discussed in Section 21.5.

The problem of partitioning a set of rectangles into buckets with capacity B > 2 such that the expected number of rectangles intersected by a random query is minimized is NP-hard [43]. Hence, it is unlikely we will ever know how to build optimal R-trees efficiently. As a result, many heuristic algorithms for building the tree have been proposed. It is worth noting that, in practice, some of the proposed heuristics result in well tuned R-trees and near optimal I/O for 2-dimensional data. We start with Guttman’s original heuristics, which are of two types: leaf selection and node splitting.

To identify a leaf for insertion, Guttman proposes proceeding down the tree, always choosing the rectangle in the current node of the path whose area would increase by the smallest amount were we to insert the new rectangle in the subtree that corresponds to that rectangle. The reasoning behind this approach is that rectangles with small areas are less likely to be chosen for further exploration during a search procedure.

When a node overflows, a split is performed. Ideally, when this happens, one would like to partition a set S of B + 1 rectangles into two sets S1 and S2 such that the sum of their areas is minimized and each set contains at least b entries. Guttman proposes three different strategies, only the first of which is guaranteed to yield an optimal partition. The first strategy is a brute force algorithm that chooses the best split by checking all candidate partitions of the overflowed set S. This strategy is not practical, as the number of candidates is exponential in the node capacity, which can easily exceed 50 or so rectangles.

The second strategy, quadratic split, starts by selecting the two rectangles r1, r2 S which maximize the quantity area(r!) area(r1) area(r2), where r! is the MBR of r1 r2. These two rectangles act as seeds which are assigned to different sides of the partition, i.e., one is assigned to S1 and the other to S2. The remaining entries of S are then assigned to the set (S1 or S2) whose MBR area increases the least when including the new rectangle in that set. The entries are not considered in arbitrary order. Rather, the next entry to be allocated is the one with the strongest preference for a group, i.e., the entry r that maximizes |A1 A2|, where Ai = area(MBR(Si {r})) area(MBR(Si)). This heuristic runs in quadratic time and attempts to assign a priority to the unallocated entries according to the performance penalty that the wrong assignment could cause. If at any time during the allocation procedure the size of the smaller set plus the number of unallocated entries is equal to b, then all remaining entries are allocated to that set.

The third and final strategy, linear split, also assigns entries to the group whose MBR area increases the least, but differs from quadratic split in the method used to pick the seeds and in the order in which the remaining entries are allocated. The seeds are the two rectangles r1 and r2 whose separation is largest along (at least) one of the dimensions. We elaborate on this. Let lj (r) and hj (r) denote the low and high endpoints, respectively, of the j-th interval of r. The width of S along dimension j is simply wj = maxr {hj (r)}minr {lj (r)}.

The normalized separation of S along dimension j is sj = (maxr {lj (r)}minr {hj (r)})/wj .

The seeds r1 and r2 are the two rectangles that yield the largest normalized separation considering all dimensions j. Once the seeds are chosen, the remaining entries are allocated to one set or the other in random order. This last heuristic runs in linear time.

A different linear split algorithm is described in [1]. Other efforts [2, 19] include polynomial time algorithms to partition a set of rectangles so as to minimize the sum of areas of the two resulting bounding rectangles.

In order to delete a rectangle r we first find the node v containing r and remove the entry, adjusting the bounding rectangles in all ancestors, as necessary. If the node occupancy goes below b the tree needs to be readjusted so as to keep the height in O(logb n). There are different ways in which one can readjust the tree. One possibility is to redistribute the remaining entries of v among the siblings of v, in a manner similar to how underflowed nodes are treated in some B-tree algorithms. Instead, Guttman suggests reinserting the remaining entries of v, as this helps the global structure of the tree by considering non- sibling nodes during reinsertion. Of course, this procedure needs to be applied recursively, as internal nodes may underflow as well. Finally, if after deletion the root has exactly one child, the tree height shrinks by one and the only child becomes the new root.

Comments

Popular posts from this blog

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Data Structure Visualization:Introduction and Value of Data Structure Rendering

0/1 Knapsack Problem Memory function.