Geographic Information Systems:Space Filling Curves: Order in Many Dimensions
Space Filling Curves: Order in Many Dimensions
As explained above, our interest in space filling curves (SFCs) comes from two sources. The first one is the fact that we aim at exploiting the typical database support mechanisms that a conventional database management system (DBMS) offers, such as support for transactions and recovery. On the data structures level, this support is automatic if we resort to a data structure that is inherently supported by a DBMS. These days, this is the case for a number of one dimensional data structures, such as those of the B-tree family (see Chapter 15). In addition, spatial database management systems support one of a small number of multidimensional data structures, such as those of the R-tree family (see Chapter 21) or of a grid based structure. The second reason for our interest in space filling curves is the general curiosity in the gap between one dimension and many: In what way and to what degree can we bridge this gap for the sake of supporting spatial operations of the kind described above? In our setting, the gap between one dimension and many goes back to the lack of a linear order in many dimensions that is useful for all purposes. A space filling curve tries to overcome this problem at least to some degree, by defining an artificial linear order on a regular grid that is as useful as possible for the intended purpose. Limiting ourselves in this section to two dimensional space, we define a space filling curve more formally as a bijective mapping p from an index pair of a grid cell to its number in the linear order:
For the sake of simplicity, we limit ourselves to numbers N = 2n for some positive integer n.
Our main attention is on the choice of the linear order in such a way that range queries are efficient. In the GIS setting, it is not worst case efficiency that counts, but efficiency in expectation. It is, unfortunately, hard to say what queries can be expected. Therefore, a number of criteria are conceivable according to which the quality of a space filling curve should be measured. Before entering the discussion about these criteria, let us present some of the most prominent space filling curves that have been investigated for GIS data structures.
Recursively Defined Space Filling Curves
Two space filling curves have been investigated most closely for the purpose of producing a linear ordering suitable for data structures for GIS, the z-curve, also called Peano curve or Morton encoding, and the Hilbert curve. We depict them in Figs. 55.1 and 55.2. They can both be defined recursively with a simple refinement rule: To obtain a 2n+1 × 2n+1 grid from a 2n × 2n grid, replace each cell by the elementary pattern of four cells as in Fig. 55.1, with the appropriate rotation for the Hilbert curve, as indicated in Fig. 55.2 for the four by four grid (i.e., for n = 2). In the same way, a slightly less popular space filling curve can be defined, the Gray code curve (see Fig. 55.3).
For all recursively defined space filling curves, there is an obvious way to compute the mapping p (and also its inverse), namely just along the recursive definition. Any such
computation therefore takes time proportional to the logarithm of the grid size. Without going to great lengths in explaining how this computation is carried out in detail (see the beautiful book [65] for this and other mathematical aspects of space filling curves), let us just mention that there is a particularly nice way of viewing it for the z-curve: Here, p simply alternately interleaves the bits of both of its arguments, when these are expressed in binary. This may have made the z-curve popular among geographers at an early stage, even though our subsequent discussion will reveal that it is not necessarily the best choice.
Range Queries for Space Filling Curve Data Structures
A space filling curve defines a linear order that is used to store the geographical data objects. We distinguish between two extreme storage paradigms that can be found in spatial data structures, namely a partition of the data space according to the objects present in the data set (such as in R-trees, or in B-trees for a single dimension), or a partition of the space only, regardless of the objects (such as in regular cell partitions). The latter makes sense if the distribution of objects is somewhat uniform (see the spatial data structures chapters in this Handbook), and in this case achieves considerable efficiency. Naturally, a multitude of data structures that operate in between both extremes have been proposed, such as the grid file (see the spatial data structures chapters in this Handbook). As to the objects, let us limit ourselves to points in the plane, so that we can focus on a single partition of the plane into grid cells and need not worry about objects that cross cell boundaries (these are dealt with elsewhere in this Handbook, see e.g. the chapter on R-trees, or the survey chapter). For simplicity, let us restrict our attention to partitions of the data space into a 2n × 2n grid, for some integer n. Partitions into other numbers of grid cells (i.e., not powers of four) can be achieved dynamically for some data structures (see e.g. z-Hashing, [32]), but are not always easy to obtain; we will ignore that aspect for now.
Corresponding to both extreme data structuring principles, there are two ways to associate data with external storage blocks for space filling curves. The simplest way is to identify a grid cell with a disk block. This is the usual setting. Grid cell numbers correspond to physical disk block addresses, and a range query translates into the need to access the corresponding set of disk blocks. In the example of Fig. 55.4, for instance, the query range shown in bold results in the need to read disk blocks corresponding to cells with numbers 2-3, 6, 8-12, 14, that is, cell numbers come in four consecutive sequences. In the best case, the cells to be read have consecutive numbers, and hence a single disk seek operation will suffice, followed by a number of successive block read operations. In the worst case, the sequence of disk cell numbers to be read breaks into a large number of consecutive pieces. It is one concern in the choice of a space filling curve to bound this number, or some measure related to it (see the next subsection).
The association of a grid cell with exactly one disk block gives a data structure based on a space filling curve very little flexibility to adapt to a not so uniform data set. A different and more adaptive way of associating disk blocks with grid cells is based on a partition of the data space into cells so that for each cell, the objects in the cell fit on a disk block (as before), but a disk block stores the union of the sets of objects in a consecutive number of cells. This method has the advantage that relatively sparsely populated cells can go together in one disk block, but has the potential disadvantage that disk block maintenance becomes more complex. Not too complex, though, because a disk block can simply be maintained as a node of a B-tree (or, more specifically, a leaf, depending on the B-tree variety), with the content of a cell limiting the granularity of the data in the node. Under this assumption, the adaptation of the data structure to a dynamically changing population of data objects simply translates to split and merge operations of B-tree nodes. In this setting, the measure of efficiency for range queries may well be different from the above: One might be interested in running a range query on the B-tree representation, and ignoring (skipping) the contents of retrieved cells that do not contribute to the result. Hence, for a range query to be efficient, the consecutive single piece of the space filling curve that includes all cells of the query range should not run outside the query range for too long, i.e., for too high a number of cells. We will discuss the effect of a requirement of this type on the design of a space filling curve in more detail below.
Obviously, there are many other ways in which a space filling curve can be the basis for a spatial data structure, but we will refrain from a more detailed discussion here and limit ourselves to the two extremes described so far.
Are All Space Filling Curves Created Equal?
Consider a space filling curve that visits the cells of a grid by visiting next an orthogonal neighbor of the cell just visited. The Hilbert curve is of this orthogonal neighbor type, but the z-curve is not. Now consider a square query region of some fixed (but arbitrary) size,
say k by k grid cells, and study the number of consecutive pieces of the space filling curve that this query region defines (see Fig. 55.5 for a range query region of 3 by 3 grid cells that defines 3 consecutive pieces of a Hilbert curve). We are interested in the average number of curve pieces over all locations of the query range.
For a fixed location of the query range, the number of curve pieces that the range defines is half the number of orthogonal neighbor links that cross the range boundary (ignoring the starting or ending cell of the entire curve). The reason is that when we start at the first cell of the entire curve and follow the curve, it is an orthogonal neighbor on the curve that leads into the query range and another one that leads out, repeatedly until the query range is exhausted. To obtain the average number of curve pieces per query range, we sum the number of boundary crossing orthogonal neighbors over all query range locations (and then divide that number by twice the number of locations, but this step is of no relevance in our argument).
This sum, however, amounts to the same as summing up for every orthogonal neighbor link the number of range query locations in which it is a crossing link. We content ourselves for the sake of simplicity with an approximate average, by ignoring cells close to the boundary of the data space, an assumption that will introduce only a small inaccuracy whenever k is much less than N (and this is the only interesting case). Hence, each orthogonal link will be a crossing link for 2k query range positions, namely k positions for the query range on each of both sides. The interesting observation now is that this summation disregards the position of the orthogonal link completely (apart from the fact that we ignore an area of size O(kN ) along the boundary of the universe). Hence, for any square range query, the average number of pieces of a space filling curve in the query range is the same across all space filling curves of the orthogonal neighbor type.
In this sense, therefore, all space filling curves of the orthogonal neighbor type have the same quality. This includes snake like curves such as row-major zigzag snakes and spiral snakes. Orthogonal neighbors are not a painful limitation here: If we allow for other than orthogonal neighbors, this measure of quality can become only worse, because a non-orthogonal neighbor is a crossing link for more query range locations. In rough terms, this indicates that the choice of space filling curve may not really matter that much. Nevertheless, it also shows that the above measure of quality is not perfectly chosen, since it captures the situation only for square query ranges, and that is not fully adequate for GISs.
Many Curve Pieces for a Query Range
The properties that recursively definable space filling curves entail for spatial data structure design have been studied from a theoretical perspective [3, 8]. The performance of square range queries has been studied [3] on the basis that range queries in a GIS can have any size, but are most likely to not deviate from a square shape by too much. Based on the external storage access model in which disk seek and latency time by far dominates the time to read a block, they allow a range query to skip blocks for efficiency’s sake. Skipping a block amounts to reading it in a sequence of blocks, without making use of the information thus obtained.
In the example of Fig. 55.4, skipping the blocks of cells 4, 5, 7, and 13 leads to a single sequence of the consecutive blocks 2-14. This can obviously be preferable to just reading consecutive sequences of relevant cell blocks only and therefore performing more disk seek operations, provided that the number of skipped blocks is not too large.
In an attempt to quantify one against the other, consider a range query algorithm that is allowed to read a linear number of additional cells [3], as compared with those in the query range. It turns out that for square range query regions of arbitrary size and the permission to read at most a linear number of extra cells, each recursive space filling curve needs at least three disk seek operations in the worst case. While no recursive space filling curve
needs more than four disk seeks in the worst case, none of the very popular recursive space filling curves, including the z-curve, the Hilbert curve, and the Gray code curve, can cope with less than four. One can define a recursive space filling curve that needs only three disk seeks in the worst case [3], and hence the lower and upper bounds match. This result is only a basis for data structure design, though, because its quality criterion is still too far from GIS reality to guide the design of practical space filling curves.
Along the same lines, one can refine the cost measure and account explicitly for disk seek cost [8]: For a sufficiently short sequence of cells, it is cheaper to skip them, but as soon as the sequence length exceeds a constant (that is defined by the disk seek cost, relative to the cost of reading a block), it is cheaper to stop reading and perform a disk seek. A refinement of the simple observation from above leads to performance formulas expressed in the numbers of links of the various types that the space filling curve uses, taking the relative disk seek cost into consideration [8]. It turns out that the local behavior of space filling curves can be modeled well by random walks, again perhaps an indication that at least locally, the choice of a particular space filling curve is not crucial for the performance.
One Curve Piece for a Query Range
The second approach described above reads a single consecutive piece of the space filling curve to respond to a range query, from the lowest numbered cell in the range to the highest numbered. In general, this implies that a high number of irrelevant cells will be read. As we explained above, such an approach can be attractive, because it allows immediately to make use of well established data structures such as the B-tree, including all access algorithms. We need to make sure, however, that the inefficiency that results from the extra accessed cells remains tolerable. Let us therefore now calculate this inefficiency approximately. To this end, we change our perspective and calculate for any consecutive sequence of cells along the space filling curve its shape in two-dimensional space. The shape itself may be quite complicated, but for our purpose a simple rectangular bounding box of all grid cells in question will suffice.
The example in Fig. 55.6 shows a set of 3 and a set of 4 consecutive cells along the curve, together with their bounding boxes (shaded), residing in a corner of a Hilbert curve. Now, a simple measure of quality suggests itself: The fewer useless cells we get in the bounding box, the better. A quick thought reveals that this measure is all too crude, because it does not take into consideration that square ranges are more important than skinny ranges. This bias can be taken into account by placing a piece of the space filling curve into its smallest enclosing square (see the dotted outline of a 3 by 3 square in Fig. 55.6 for a bounding square of the 3 cells), and by taking the occupancy (i.e., percentage of relevant cells) of that square as the measure of quality. The question we now face is to bound the occupancy for a given space filling curve, across all possible pieces of the curve,
and further, to find a best possible curve with respect to this bound. For simplicity’s sake, let us again limit ourselves to curves with orthogonal links.
Let us first argue that the lower bound on the occupancy, for any space filling curve, cannot be higher than one third. This is easy to see by contradiction. Assume to this end that there is a space filling curve that guarantees an occupancy of more than one third. In particular, this implies that there cannot be two vertical links immediately after each other, nor can there be two consecutive horizontal links (the reason is that the three cells defining these two consecutive links define a smallest enclosing square of size 3 by 3 cells, with only 3 of these 9 cells being useful, and hence with an occupancy of only one third). But this finishes the argument, because no space filling curve can always alternate between vertical and horizontal links (the reason is that a corner of the space that is traversed by the curve makes two consecutive links of the same type unavoidable, see Fig. 55.7).
On the positive side, the Hilbert curve guarantees an occupancy of one third. The reason is that the Hilbert curve gives this guarantee for the 4 by 4 grid, and that this property is preserved in the recursive refinement.
This leads to the conclusion that in terms of the worst case occupancy guarantee, the Hilbert curve is a best possible basis for a spatial data structure.
Comments
Post a Comment