Geographic Information Systems:Spatial Join

Spatial Join

In order to compute map overlays, a GIS provides an operator called spatial join that allows flexible combinations of multiple inputs according to a spatial predicate. The spatial join computes a subset of the Cartesian product of the inputs and therefore, it is closely related to the join operator of a database management system (DBMS). A (binary) join on two sets of spatial objects, say R and S, according to a binary spatial predicate P is given by

image

The join is called a spatial join if the binary predicate P refers to spatial attributes of the objects. Among the most important ones are the following:

• Intersection predicate: P(r, s) = (r s j= ).

• Distance predicate: Let DF be a distance function and E > 0. Then, PDF (r, s) = (DF (r, s) < E).

In the following we assume that R and S consist of N two-dimensional spatial objects and that the join returns a total of T pairs of objects. N is assumed being sufficiently large such that the problem cannot be solved entirely in main memory of size M . Therefore, we are particularly interested in external memory algorithms. Due to today’s large main memory, the reader might question this assumption. However, remember that a GIS is a resource- intensive multi-user system where multiple complex queries are running concurrently. Thus, the memory assigned to a specific join algorithm might be substantially lower than the total physically available. Given a block size of B, we use the following notations: n = N/B, m = M/B and t = T /B. Without loss of generality, we assume that B is a divisor of N , M and T .

A naive approach to processing a spatial join is simply to perform a so-called nested-loop algorithm, which checks the spatial predicate for each pair of objects. The nested-loop algorithm is generally applicable, but requires O(N 2) time and O(n2) I/O operations. For special predicates, however, we are able to design new algorithms that provide substantial improvements in runtime compared to the naive approach. Among those is the intersection predicate, which is also the most frequently used predicate in GIS. We therefore will restrict our discussion to the intersection predicate to which the term spatial join will refer to by default. We postpone the discussion of other predicates to the end of the section.

The processing of spatial joins follows the general paradigm of multi-step query processing [55] that consists at least of the following two processing steps. In the filter step, a spatial join is performed using conservative approximations of the spatial objects like their minimum bounding rectangles (MBRs). In the refinement step, the candidates of the filter step are checked against the join predicate using their exact geometry. In this section, we limit our discussion on the filter step that utilizes the MBR of the spatial objects. The reader is referred to [6, 31] for a detailed discussion of the refinement step and intermediate processing steps that are additionally introduced.

If main memory is large enough to keep R and S entirely in memory, the filter step is (almost) equivalent to the rectangle intersection problem, one of the elementary problems in computational geometry. The problem can be solved in O(N log N + T ) runtime where T denotes the size of the output. This can be accomplished by using either a sweep-line algorithm [72] or a divide-and-conquer algorithm [25]. However, the disadvantage of these algorithms is their high overhead that results in a high constant factor. For real-life spatial data, it is generally advisable, see [1] for example, to employ an algorithm that is not optimal in the worst-case.

External Algorithms

In this subsection, we assume that the spatial relations are larger than main memory. According to the availability of indices on both relations, on one of the relations, or on none of the relations, we obtain three different classes of spatial join algorithms for the filter step.

Index on both spatial relations

In [5, 7], spatial join algorithms were presented where each of the relations is indexed by an R-tree. Starting at the roots, this algorithm synchronously traverses the trees node by node and joins all pairs of overlapping MBRs. If the nodes are leaves, the associated spatial objects are further examined in the refinement step. Otherwise, the algorithm is called recursively for each qualifying pair of entries. As pointed out in [20], this algorithm is not limited to R-trees, but can be applied to a broad class of index structures. Important to both, I/O and CPU cost, is the traversal strategy. In [7], a depth-first strategy was proposed where the qualifying pairs of nodes are visited according to a spatial ordering. If large buffers are available, a breadth-first traversal strategy was shown to be superior [30] in an experimental comparison. Though experimental studies have shown that these algorithms provide fast runtime in practice, the worst-case complexity of the algorithm is O(n2). The general problem of computing an optimal schedule for reading pages from disk is shown [48] to be NP-hard for spatial joins. For specific situations where two arbitrary rectangles from R and S do not share boundaries, the optimal solution can be computed in linear time. This is generally satisfied for bounding boxes of the leaves of R-trees.

Index on one spatial relation

Next, we assume that only one of the relations, say R, is equipped with a spatial index. We distinguish between the following three approaches. The first approach is to issue a range query against the index on R for each MBR of S. By using worst-case optimal spatial index structures, this already results in algorithms with subquadratic runtime. When a page buffer is available, it is also beneficial to sort the MBRs of S according to a criterion that preserves locality, e.g. Hilbert-value of the center of the MBRs. Then, two consecutive queries will access the same pages with high probability and therefore, the overall number of disk accesses decreases. The second approach as proposed in [40] first creates a spatial index on S, the relation without spatial index. The basic idea for the creation of the index is to use the upper levels of the available index on R as a skeleton. Thereafter, one of the algorithms is applied that requires an index on both of the relations. In [43] an improvement of the algorithm is presented that makes better use of the available main memory. A third approach is presented in [2, 22] where the spatial index is used for sorting the data according to one of the minimum boundaries of the rectangles. The sorted sequence then serves as input to an external plane-sweep algorithm.

Index on none of the inputs

There are two early proposals on spatial join processing that require no index. The first one [23] suggests using an external version of a computational geometry algorithm. This is basically achieved by employing an external segment tree. The other [54] can be viewed as a sweep-line algorithm where the ordering is derived from z-order. Though the algorithm was originally not designed for rectangles, this can be accomplished in a straightforward manner [15, 37]. Like any other sweep-line algorithm, R and S are first sorted, where the z-order serves as criterion in the following way. A rectangle receives the z-value of the smallest cell that still covers the entire rectangle. Let x be a z-value and b(x) = (x0, x1,..., xk ) its binary representation where k = l(x) denotes the level of the corresponding cell. Then, x <z y, if b(x) <lexi b(y) where <lexi denotes the lexicographical order on strings. Note that if x is a prefix of y, x will precede y in lexicographical order. After sorting the input, the processing continues while maintaining a stack for each input. The stacks satisfy the following invariant: The z-value of each element (except the last) is a prefix of its successor. The algorithm simply takes the next element from the sorted inputs, say x from input R. Before x is pushed to the stack, all elements from both stacks are removed that are not prefix of b(x). Thereafter, the entire stack of S is checked for overlap. The worst-case of the join occurs when each of the rectangles belongs to the cell that represents the entire data space. Then, the join runs like a nested loop and requires O(n2) I/O operations. This can practically be improved by introducing redundancy [15, 57]. However, the worst-case bound remains the same.

Other methods like the Partition Based Spatial-Merge Join (PBSM) [61] and the Spatial Hash-Join [41] are based on the principles of divide-and-conquer where spatial relations are partitioned into buckets that fit into main memory and the join is computed for each pair of corresponding buckets. Let us discuss PBSM in more detail. PBSM performs in four phases. In the first phase, the number of partitions p is computed such that the join for each pair of partitions is likely to be processed in main memory. Then, a grid is introduced with g cells, g p and each of the cells is then associated with a partition. A rectangle is then assigned to a partition if it intersects with at least one of its cells. In the second phase, pairs of partitions have to be processed that still contain too much data. This usually requires repartitioning of the data into smaller partitions. In the third phase, the join is processed in main memory for every pair of related partitions. The fourth phase consists of sorting, in order to get rid of duplicates in the response set. This however can be replaced by applying an inexpensive check of the result whenever it is produced [14]. Overall, the worst-case is O(n2) I/O operations for PBSM.

Spatial Hash-Joins [41] differ from PBSM in the following way. One of the relations is first partitioned using a spatial index structure like an R-tree which uniquely assigns rectangles to leaves, where each of them corresponds to a partition. The other relation is then partitioned using the index of the first relation. Each rectangle has to be stored in all partitions where an overlap with the MBR of the partition exists. Overall, this guarantees the avoidance of duplicate results.

At the end of the nineties, [1] proposed the first spatial join algorithm that meets the lower I/O bound O(n logm n + t). The method is an external sweep-line algorithm that is also related to the divide-and-conquer algorithm of [23]. Rather than partitioning the problem recursively into two, it is proposed to partition the input recursively into k = m strips of almost equal size until the problem is small enough for being solved in memory. This results in a recursion tree of height O(logm n). At each level O(m) sorted lists of size Θ(B) are maintained where simultaneously an interval join is performed. Since each of the interval joins runs in O(n! + t!) (n! and t! denotes the input size and result size of the join, respectively) and at most O(N ) intervals are maintained on each level of the recursion, it follows that at most O(n) accesses are required for each level.

Instead of using the optimal algorithm, [1] employs a plane-sweep algorithm in their experiments where the sweep-line is organized by a dynamic interval tree. This is justified by the observation that the sweep-line is small as it holds only a small fraction of the entire spatial relations. If the algorithm really runs out of memory, it is recommended invoking the optimal algorithm for the entire problem. A different overflow strategy has been presented in [34] where multiple iterations over the spatial relations might be necessary. The advantage of this strategy is that each answer is computed exactly once.

Advanced Issues

There are various extensions of the processing of spatial joins which will be discussed in the following. We first discuss the processing of spatial joins on multiple inputs. Next we discuss the processing of distance joins. Eventually, we conclude the section with a brief summary of requirements on the implementation of algorithms within a system.

The problem of joining more than two spatial relations according to a set of binary spatial predicates has been addressed in [42, 59]. Such a join can be represented by a connected graph where the nodes correspond to the spatial relations and edges to binary join predicates. A first approach, called pairwise join method (PJM), is to decompose the multi-way join into an operator tree of binary joins. The graphs with cycles therefore need to be transformed into a tree by ignoring the edges with lowest selectivity. Depending on the availability of spatial indexes, it is then advantageous to use from each class of spatial join algorithms the most efficient one. A different approach generalizes synchronous traversal to multiple inputs. The most important problem is not related to I/O, but to the processing cost of checking the join predicates. In [42] different strategies are examined for an efficient processing of the join predicates. Results of an experimental study revealed that neither PJM nor synchronous traversal performs best in all situations. Therefore, an algorithm is presented for computing a hybrid of these approaches by using dynamic programming.

In addition to the intersection predicate, there are many other spatial predicates that are of practical relevance for spatial joins in a GIS, see [60] for a survey.

Among those, the distance predicate has received most attention. The distance join of R and S computes all pairs within a given distance. This problem has been even more extended in the following directions. First, pairs should be reported in an increasing order of the distances of the spatial objects. Second, only a fixed number of pairs should be reported. Third, answers should be produced on demand one at a time (without any limitations on the total number of answers). This problem has been addressed in [29, 71] where R-trees are assumed to be available on the spatial relations. The synchronized traversal is then controlled by two priority queues, where one maintains pairs of nodes according to their minimum distance and the other is primarily used for pruning irrelevant pairs of entries. In [71], it was recognized that there are many overlapping nodes which are not distinguishable in the priority queues.

In order to break ties, a secondary ordering has been introduced that assigns a high priority to such pairs that are likely to contribute to the final result.

The design of algorithms for processing spatial joins largely depends on the specific system requirements. Similar to DBMSs, a complex query in a GIS is translated into an operator tree where nodes may correspond to spatial joins or other operators. There are two important problems that arise in this setting. First, an operator for processing spatial joins should have the ability to produce an estimation of the processing cost before the actual processing starts. Therefore, cost formulas are required that are inexpensive to compute, depend on only a few parameters of the input and produce sufficiently accurate estimations. This problem has recently attracted research attention [73]. Second, a demand-driven implementation of operators is generally required [19] where answers are lazily produced. This allows a pipelined processing of chains of operators where answers can continuously be delivered to the user without materializing them in advance. Therefore, join algorithms should be non-blocking, i.e., first answers should be produced without having consumed the entire input.

Comments

Popular posts from this blog

Data Structure Visualization:Introduction and Value of Data Structure Rendering

Collision Detection:Penetration Depth Computation

Concurrent Data Structures:Linked Lists