R-trees:Advanced Operations

Advanced Operations

Even though R-trees were originally designed to perform intersections queries, it did not take long before R-trees were being used for other types of operations, such as nearest neighbor, both simple [11, 25, 50] and constrained to a range [18], reverse nearest neighbor [32, 61, 68], regular and bichromatic closest pairs [14, 24, 59], incremental nearest neighbors [25], topological queries [44], spatial joins [7, 8, 21, 22, 37–40, 47] and distance joins[24, 60]. Some of the proposals work on an ordinary R-tree while others require that the tree be modified or augmented in various ways. Because of space limitations we concentrate our attention on two problems: nearest neighbors and spatial joins.

Nearest Neighbor Queries

We discuss the problem of finding the input object closest to an arbitrary query point Q. We assume the standard Euclidean metric, i.e., if a = (a1,... , ad) and b = (b1,..., bd) are arbitrary points then dist(a, b) = (),d (ai bi)2)1/2. For a general object O, such as a rectangle or polygon, we define dist(Q, O) = minpO dist(Q, p).

image

The first proposal to solve this problem using R-trees is from [50]. Roussopoulos et al define two bounding functions of Q and an arbitrary rectangle r:

mindist(Q, r) = dist(Q, r), the distance from Q to the closest point in r. minmaxdist(Q, r) = minf maxpf (dist(Q, p), where f ranges over all (d 1)- dimensional facets of r.

Notice that mindist(Q, r) = 0 if Q is inside r and that for any object or rectangle s that is a descendant of r, min dist(Q, r) dist(Q, s) minmaxdist(Q, r). This last fact follows from the fact that each of the facets of r must share a point with at least one input object, but this object can be as far as possible within an incident face. Thus, the bounding functions serve as optimistic and pessimistic estimates of the distance from Q to the nearest object inside r.

The following properties of the bounding functions readily follow:

P1 For any object or MBR r and MBR s, if mindist(Q, r) > minmaxdist(Q, s) then r cannot be or contain the nearest neighbor of Q.

P2 For any MBR r and object s, if mindist(Q, r) > dist(Q, s) then r cannot contain the nearest neighbor of Q.

The authors describe a branch-and-bound algorithm that performs a depth-first traversal of the tree and keeps track of the best distance so far. The two properties above are used to identify and prune branches that are guaranteed not to contain the answer. For each call the algorithm keeps a list of active nodes, i.e., nodes that tentatively need to be explored in search of a better estimate. No node of the list is explored until the subtrees corresponding to nodes appearing earlier in the active list have been processed or pruned. Thus, a sorting policy to determine the order in which rectangles stored in a node are processed is also required. In practice one would like to examine first those nodes that lower the best distance estimate as quickly as possible, but this order is difficult to determine a priori. Two criteria considered include sorting by min dist and by min max dist. Experimental results suggest that sorting by min dist results in slightly better performance. The algorithm is summarized in Figure 21.7. Global variable best Distance stores the estimate of the distance to the nearest neighbor. The initial call uses the query and root of the tree as arguments.

image

A simple modification to the above algorithm allows [50] to report the k > 1 nearest neighbors of Q. All is needed is to keep track of the best k distances encountered so far and to perform the pruning with respect to the k-th best.

Cheung and Fu [11] show that pruning based on P1 is not necessary and do away with computing minmaxdist altogether. They simplify the algorithm by pruning with P2 exclusively, and by rearranging the code so that the pruning step occurs before, and not after, each recursive call.

Hjaltason and Samet [24] also describe an algorithm that avoids using P1. Furthermore, unlike [50] which keeps a local list of active entries for each recursive call, their algorithm uses a global priority queue of active entries sorted by the optimistic distance from Q to that entry. This modification minimizes the number of R-tree nodes retrieved and results in an incremental algorithm, i.e., one that reports answers in increasing order of distance, a desirable characteristic when the value of k is not known a priori.

Spatial Joins

We consider the problem of calculating the intersection of two sets R = {r1, r2,... , rn} and S = {s1, s2,..., sm} of spatial objects. The spatial join, R i>< S, is the set of all pairs (ri, sj ) such that ri sj j= . The join of two spatial data sets is a useful and common operation. Consider, for example, the rivers and roads of a region, both represented using line segments. The spatial join of the river and road data sets yields the set of likely bridge locations. If the subset of river segments whose level is above a given threshold has been previously selected, the same spatial join now computes likely locations for road flooding.

Methods to compute spatial joins without R-trees exist but are not covered in this chapter. We consider the case where R-trees have been constructed for one or both data sets and hence can be used to facilitate the join computation. Unless otherwise stated, we will assume that the ri’s and sj ’s refer to the MBRs enclosing the actual data objects.

In [8], Brinkhoff et al proposed the “canonical” spatial join algorithm based on R-trees. A similar join algorithm was proposed at the same time by Gunther [21]. Gunther’s algorithm is applicable for general trees and includes R-trees as a special case. In Figure 21.8 we paraphrase the algorithm of [8]. Let v1 and v2 be nodes of R-trees T1 and T2, respectively. Let Eij , be an entry of vi of the form (rij , pij ), where rij and pij denote the MBR and child pointer, respectively. To join data sets R and S, indexed by R-trees T1 and T2, invoke the SpatialJoin algorithm in Figure 21.8, passing as arguments the roots of T1 and T2.

image

In [8] the authors generalize the algorithm for the case where the two R-trees have different heights. They also improve upon the basic algorithm by reducing CPU computation and disk accesses. The CPU time is improved by reducing the number of rectangle intersection checks. One method to accomplish this is to first calculate the subset of entries within nodes v1 and v2 that intersect MBR(v1) MBR(v2), and then do full intersection tests among this subset. A second approach is to sort the entries and then use a sweep-line algorithm.

In addition, the paper considers reduction of page I/O by ordering the accesses to pages so as to improve the buffer hit ratio. This is done by accessing the data in sweep-line order or Z-order.

In [7] Brinkhoff et al suggest the following 3-step approach for joining complex polygonal data sets: (1) use the R-trees and MBRs of the data sets to reduce the set to a list of potential hits; (2) use a lightweight approximation algorithm to further reduce the set, leaving in some “false positives”; (3) conduct the exact intersection tests on the remaining polygons using techniques from computational geometry (see Chapter 2 of [16], for example).

image

Often only one of the data sets to be joined will be indexed by an R-tree. This is the case when the DBA decides that maintaining an index is too expensive, when the data set being joined is the result of a series of non-spatial attribute selections, or for multi-step joins where intermediate results are not indexed.

To join two data sets when only one is indexed by an R-tree, Lo and Ravi shankar [37] propose the idea of seeded tree join. The main idea is to use the existing R-tree to “seed” the creation of a new index, called the seeded tree, for the non-indexed data set, and then perform a join using the method of [8].

Consider the example shown in Figure 21.9 (similar to the example in [37]). Assume that the existing tree has four entries in the root node. The four rectangles, R1 to R4, in the left hand side represent the bounding rectangles of these entries. The dark filled squares do not belong to the existing R-tree but, rather, correspond to some of the data items in the non-indexed data set. Assuming a node capacity of four (i.e., B = 4) and using the normal insertion or loading algorithms which minimize area and perimeter, the dark rectangles would likely be grouped as shown by the dark MBRs in Figure 21.9b. A spatial join would then require that each node of each tree be joined with two nodes from the other tree, for a total of eight pairs. On the other hand, if the second R-tree was structured as show in Figure 21.9c, then each node would be joined with only one node from the other tree, for a total of four pairs. Hence, when performing a spatial join, it might be better to structure the top levels of the tree in a fashion that is sub-optimal for general window queries.

The general algorithm for seeded tree creation is to copy the top few levels of the existing tree and use them as the top levels for the new tree. These top levels are called the seed levels. The entries at the lowest seed level are called “slots”. Non-indexed data items are then inserted into the seeded tree by inserting them into an R-tree that is built under the appropriate slot. In [37] the authors experimentally compare three join algorithms: (1) R-tree join, where an R-tree is fist built on the non-indexed data set and then joined; (2) brute force, where the non-indexed data set is read sequentially and a region query is run against the existing R-tree for each data item in the non-indexed data set; (3) seeded tree join, where a seeded tree is built and then joined. The authors consider several variants of the seeded tree creation algorithm and compare the performance. Experimental studies show that the seeded tree method significantly reduces I/O. Note that if the entire seeded tree does not fit in memory significant I/O can occur during the building process. The authors propose to minimize this I/O by buffering runs for the slots and then building the tree for each slot during a second pass.

Another approach [22] for the one-index case is to sort the non-indexed data using the MBR lower endpoints for one of the dimensions, sort the leaf level MBRs from the existing R-tree on the same dimension, and finally join the two sorted data sets using a sweep-line algorithm. The authors analytically demonstrate that as long as the buffer size is sufficient to merge-sort efficiently, their algorithm results in less I/O than creating any type of index followed by a join. Experimental results also show a significant I/O reduction.

In [47] a relational hash-based join approach is used. Although mostly hash-based, the method does need to default to R-trees in some cases. A sampled subset of R is partitioned into N buckets, R1 ... RN , for some N . Each non-sampled object from R is then added to the bucket that requires the least enlargement. Set S is then partitioned in N corresponding buckets, S1 ... SN by testing each object oS of S and placing a copy of it into Si, for each bucket Ri such that oS Ri j= . If an object in S does not intersect any of the Ri buckets the object is discarded. The bucket pairs (Ri, Si) are then read into memory and pairwise joined. If a bucket pair is too large to fit in memory, an R-tree index is built for one of the two buckets and an index-loop join is used.

In [38, 40] a method that combines the seeded tree and the hash-join approach is proposed. The proposed method, slot index spatial join, chooses a level from the existing tree and uses the MBRs of the entries as the buckets for the hashing. The chosen level is determined by the number of entries at that level and the number of buffer pages available for the bucket joining algorithm. Since R-trees have a wide fan out, the optimal number of buckets to use usually falls between level sizes. The paper considers several heuristics for combining MBRs from the next level down to tune the number of buckets. Overall performance is shown to be better than all previously existing methods in all but a few cases.

A method to join multiple spatial relations together is discussed in [38, 39]. The authors propose a multi-way join algorithm called synchronous traversal and develop cost models to use for query optimization. They consider using only pairwise joins, only the synchronous traversal method, and combinations of the two approaches. They show that, in general, a combination of the two results in the best performance.

The concept of spatial join has been further generalized to distance joins, where a distance based ordering of a subset of the Cartesian product of the two sets is returned. In [24, 60], distance join algorithms using R-trees are proposed.

Comments

Popular posts from this blog

Data Structure Visualization:Introduction and Value of Data Structure Rendering

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Drawing Trees:HV-Layout