Geometric and Spatial Data Structures in External Memory: Spatial Data Structures and Range Search
Spatial Data Structures and Range Search
In this section we consider online EM data structures for storing and querying spatial data. A fundamental database primitive in spatial databases and geographic information systems (GIS) is range search, which includes dictionary lookup as a special case. An orthogonal range query, for a given d-dimensional rectangle, returns all the points in the interior of the rectangle. In this section we use range searching (especially for the orthogonal 2-D case when d = 2) as the canonical query operation on spatial data. Other types of spatial queries include point location, ray shooting, nearest neighbor, and intersection queries, but for brevity we restrict our attention primarily to range searching.
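To fix notation, the query itself is simple to state: a brute-force scan answers it in one pass over the data, at a prohibitive Θ(n) I/O cost in external memory. The sketch below (illustrative only; the data structures in this section exist precisely to avoid this scan) makes the query semantics concrete.

```python
def orthogonal_range_query(points, lo, hi):
    """Brute-force orthogonal range query: report every point whose
    coordinate lies inside [lo[i], hi[i]] in each dimension i."""
    return [p for p in points
            if all(lo[i] <= p[i] <= hi[i] for i in range(len(lo)))]

pts = [(1, 2), (3, 4), (5, 0)]
inside = orthogonal_range_query(pts, (0, 0), (3, 4))  # -> [(1, 2), (3, 4)]
```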
There are two types of spatial data structures: data-driven and space-driven. R-trees and kd-trees are data-driven since they are based upon a partitioning of the data items themselves, whereas space-driven methods such as quad trees and grid files are organized by a partitioning of the embedding space, akin to order-preserving hash functions. In this section we discuss primarily data-driven data structures.
Multidimensional range search is a fundamental primitive in several online geometric applications, and it provides indexing support for constraint and object-oriented data models.
(See [74] for background.) We have already discussed multidimensional range searching in a batched setting in Section 27.2. In this section we concentrate on data structures for the online case.
For many types of range searching problems, it is very difficult to develop theoretically optimal algorithms and data structures. Many open problems remain. The goal for online data structures is typically to achieve the three optimality Criteria O1–O3 of Section 27.1.2. We explain in Section 27.4.6 that under a fairly general computational model for general orthogonal queries, as pictured in Figure 27.5(d), it is impossible to satisfy Criteria O1 and O2 simultaneously. At least Ω(n(log n)/log(logB N + 1)) blocks of disk space must be used to achieve a query bound of O((logB N)^c + z) I/Os per query, for any constant c [113].
Three natural questions arise:
• What sort of performance can be achieved when using only a linear amount of disk space? In Sections 27.4.1 and 27.4.2, we discuss some of the linear-space data structures used extensively in practice. None of them come close to satisfying Criteria O1 and O3 for range search in the worst case, but in typical-case scenarios they often perform well. We devote Section 27.4.2 to R-trees and their variants, which are the most popular general-purpose spatial structures developed to date.
• Since the lower bound applies only to general 2-D rectangular queries, are there any data structures that meet Criteria O1–O3 for the important special cases of 2-D range searching pictured in Figures 27.5(a), 27.5(b), and 27.5(c)? Fortunately the answer is yes. We show in Sections 27.4.3 and 27.4.4 how to use a “bootstrapping” paradigm to achieve optimal search and update performance.
• Can we meet Criteria O1 and O2 for general four-sided range searching if the disk space allowance is increased to O(n(log n)/ log(logB N + 1)) blocks? Yes again! In Section 27.4.5 we show how to adapt the optimal structure for three-sided searching in order to handle general four-sided searching in optimal search cost. The update cost, however, is not known to be optimal.
In Section 27.5 we discuss other scenarios of range search dealing with three dimensions and nonorthogonal queries. We discuss the lower bounds for 2-D range searching in Section 27.4.6.
Linear-Space Spatial Structures
Grossi and Italiano [62] construct an elegant multidimensional version of the B-tree called the cross tree. Using linear space, it combines the data-driven partitioning of weight-balanced B-trees (cf. Section 27.3.2) at the upper levels of the tree with the space-driven partitioning of methods such as quad trees at the lower levels of the tree.
FIGURE 27.5: Different types of 2-D orthogonal range queries: (a) diagonal corner two-sided 2-D query (equivalent to a stabbing query, cf. Section 27.4.3), (b) two-sided 2-D query, (c) three-sided 2-D query, and (d) general four-sided 2-D query.
For d > 1, d-dimensional orthogonal range queries can be done in O(n^(1−1/d) + z) I/Os, and inserts and deletes take O(logB N) I/Os. The O-tree of Kanth and Singh [75] provides similar bounds.
Cross trees also support the dynamic operations of cut and concatenate in O(n^(1−1/d)) I/Os.
In some restricted models for linear-space data structures, the 2-D range search query performance of cross trees and O-trees can be considered to be optimal, although it is much larger than the logarithmic bound of Criterion O1.
One way to get multidimensional EM data structures is to augment known internal memory structures, such as quad trees and kd-trees, with block-access capabilities. Examples include kd-B-trees [102], buddy trees [110], and hB-trees [54, 86]. Grid files [67, 80, 90] are a flattened data structure for storing the cells of a two-dimensional grid in disk blocks. Another technique is to “linearize” the multidimensional space by imposing a total ordering on it (a so-called space-filling curve), and then the total order is used to organize the points into a B-tree [57, 71, 95]. Linearization can also be used to represent nonpoint data, in which the data items are partitioned into one or more multidimensional rectangular regions [1, 94]. All the methods described in this paragraph use linear space, and they work well in certain situations; however, their worst-case range query performance is no better than that of cross trees, and for some methods, such as grid files, queries can require Θ(n) I/Os, even if there are no points satisfying the query. We refer the reader to [8, 56, 92] for a broad survey of these and other interesting methods. Space-filling curves arise again in connection with R-trees, which we describe next.
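As an illustration of the linearization idea, the sketch below computes a Z-order (Morton) key, one of the simplest space-filling curves; sorting points by this key imposes the total order on which a B-tree can then be built. The function name and the choice of curve are ours, used here only to make the technique concrete.

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of x and y to form a Z-order (Morton) key;
    points sorted by this key follow the Z-order space-filling curve."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
    return key

points = [(5, 1), (0, 0), (3, 3), (1, 2)]
points.sort(key=lambda p: z_order_key(*p))    # order for B-tree insertion
```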
The R-tree of Guttman [64] and its many variants are a practical multidimensional generalization of the B-tree for storing a variety of geometric objects, such as points, segments, polygons, and polyhedra, using linear disk space. Internal nodes have degree Θ(B) (except possibly the root), and leaves store Θ(B) items. Each node in the tree has associated with it a bounding box (or bounding polygon) of all the items in its subtree. A big difference between R-trees and B-trees is that in R-trees the bounding boxes of sibling nodes are allowed to overlap. If an R-tree is being used for point location, for example, a point may lie within the bounding box of several children of the current node in the search. In that case the search must proceed to all such children.
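The overlap-driven search can be sketched as follows. This is a simplified in-memory stand-in (class and field names are ours) that captures the key point: every child whose bounding box meets the query must be visited.

```python
class RTreeNode:
    def __init__(self, mbr, children=None, items=None):
        self.mbr = mbr            # bounding box ((xlo, ylo), (xhi, yhi))
        self.children = children  # internal node: list of RTreeNode
        self.items = items        # leaf: list of (mbr, object) pairs

def intersects(a, b):
    (axl, ayl), (axh, ayh) = a
    (bxl, byl), (bxh, byh) = b
    return axl <= bxh and bxl <= axh and ayl <= byh and byl <= ayh

def rtree_query(node, q):
    """Report objects whose boxes intersect q. Because sibling bounding
    boxes may overlap, the search descends into every intersecting child."""
    if node.items is not None:                    # leaf
        return [obj for mbr, obj in node.items if intersects(mbr, q)]
    out = []
    for child in node.children:
        if intersects(child.mbr, q):
            out += rtree_query(child, q)
    return out
```

A query that stabs a region covered by several overlapping sibling boxes pays one subtree visit per box, which is exactly why worst-case bounds are poor.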
In the dynamic setting, there are several popular heuristics for where to insert new items into an R-tree and how to rebalance it; see Chapter 21 and [8, 56, 61] for a survey.
The R*-tree variant of Beckmann et al. [27] seems to give the best overall query performance. To insert an item, we start at the root and recursively insert the item into the subtree whose bounding box would expand the least in order to accommodate the item. In case of a tie (e.g., if the item already fits inside the bounding boxes of two or more subtrees), we choose the subtree with the smallest resulting bounding box. In the normal R-tree algorithm, if a leaf node gets too many items or if an internal node gets too many children, we split it, as in B-trees. Instead, in the R*-tree algorithm, we remove a certain percentage of the items from the overflowing node and reinsert them into the tree. The items we choose to reinsert are the ones whose centroids are furthest from the center of the node’s bounding box. This forced reinsertion tends to improve global organization and reduce query time.
If the node still overflows after the forced reinsertion, we split it. The splitting heuristics try to partition the items into nodes so as to minimize intuitive measures such as coverage, overlap, or perimeter. During deletion, in both the normal R-tree and R*-tree algorithms, if a leaf node has too few items or if an internal node has too few children, we delete the node and reinsert all its items back into the tree by forced reinsertion.
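The least-enlargement rule for choosing the insertion subtree can be sketched like this. It is a simplification of the R*-tree heuristic (which additionally weighs overlap and perimeter at the leaf level); the helper names are ours.

```python
def area(mbr):
    (xl, yl), (xh, yh) = mbr
    return (xh - xl) * (yh - yl)

def grow(mbr, item):
    """Smallest bounding box covering both mbr and the item's box."""
    (xl, yl), (xh, yh) = mbr
    (ixl, iyl), (ixh, iyh) = item
    return ((min(xl, ixl), min(yl, iyl)), (max(xh, ixh), max(yh, iyh)))

def choose_subtree(child_mbrs, item):
    """Pick the index of the child box needing the least area enlargement
    to cover the item; ties broken by the smaller resulting box."""
    def cost(mbr):
        g = grow(mbr, item)
        return (area(g) - area(mbr), area(g))
    return min(range(len(child_mbrs)), key=lambda i: cost(child_mbrs[i]))
```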
The rebalancing heuristics perform well in many practical scenarios, especially in low
dimensions, but they result in poor worst-case query bounds. An interesting open problem is whether nontrivial query bounds can be proven for the “typical-case” behavior of R-trees for problems such as range searching and point location. Similar questions apply to the methods discussed in Section 27.4.1. New R-tree partitioning methods by de Berg et al. [46] and Agarwal et al. [7] provide some provable bounds on overlap and query performance.
In the static setting, in which there are no updates, constructing the R*-tree by repeated insertions, one by one, is extremely slow. A faster alternative to the dynamic R-tree construction algorithms mentioned above is to bulk-load the R-tree in a bottom-up fashion [1, 70, 94]. Such methods use some heuristic for grouping the items into leaf nodes of the R-tree, and then recursively build the nonleaf nodes from bottom to top. As an example, in the so-called Hilbert R-tree of Kamel and Faloutsos [70], each item is labeled with the position of its centroid on the Peano-Hilbert space-filling curve, and a B+-tree is built upon the totally ordered labels in a bottom-up manner. Bulk loading a Hilbert R-tree is therefore easy to do once the centroid points are presorted. These static construction methods are very different in spirit from the dynamic insertion methods: The dynamic methods explicitly try to reduce the coverage, overlap, or perimeter of the bounding boxes of the R-tree nodes, and as a result, they usually achieve good query performance. The static construction methods do not consider the bounding box information at all. Instead, the hope is that the improved storage utilization (up to 100%) of these packing methods compensates for a higher degree of node overlap. A dynamic insertion method related to [70] was presented in [71]. The quality of the Hilbert R-tree in terms of query performance is generally not as good as that of an R*-tree, especially for higher-dimensional data [30, 72]. In order to get the best of both worlds—the query performance of R*-trees and the bulk construction efficiency of Hilbert R-trees—Arge et al. [11] and van den Bercken et al. [118] independently devised fast bulk loading methods based upon buffer trees that do top-down construction in O(n logm n) I/Os, which matches the performance of the bottom-up methods within a constant factor.
The former method is especially efficient and supports dynamic batched updates and queries. In Figure 27.6 and Table 27.1, we report on some experiments that test the construction, update, and query performance of various R-tree methods. The experimental data came from TIGER/line data sets from four U.S. states [116]; the implementations were done using the TPIE system.
Figure 27.6 compares the construction cost for building R-trees and the resulting query performance in terms of I/Os for the naive sequential method for construction into R*-trees (labeled “naive”) and the newly developed buffer R*-tree method [11] (labeled “buffer”). An R-tree was constructed on the TIGER road data for each state and for each of four possible buffer sizes. The four buffer sizes were capable of storing 0, 600, 1,250, and 5,000 rectangles,
respectively; buffer size 0 corresponds to the naive method, and the larger buffers correspond to the buffer method. The query performance of each resulting R-tree was measured by posing rectangle intersection queries using rectangles taken from TIGER hydrographic data. The results, depicted in Figure 27.6, show that buffer R*-trees, even with relatively small buffers, achieve a tremendous speedup in number of I/Os for construction without any worsening in query performance, compared with the naive method. The CPU costs of the two methods are comparable. The storage utilization of buffer R*-trees tends to be in the 90% range, as opposed to roughly 70% for the naive method.
Bottom-up methods can build R-trees even more quickly and more compactly, but they generally do not support bulk dynamic operations, which is a big advantage of the buffer tree approach. Kamel et al. [72] develop a way to do bulk updates with Hilbert R-trees, but at a cost in terms of query performance. Table 27.1 compares dynamic update methods for the naive method, for buffer R-trees, and for Hilbert R-trees [72] (labeled “Hilbert”). A single R-tree was built for each of the four U.S. states, containing 50% of the road data objects for that state. Using each of the three algorithms, the remaining 50% of the objects were inserted into the R-tree, and the construction time was measured. Query performance was then tested as before. The results in Table 27.1 indicate that the buffer R*-tree and the Hilbert R-tree achieve a similar degree of packing, but the buffer R*-tree provides better update and query performance.
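The Hilbert-key packing used by these bottom-up methods can be sketched as follows. The xy-to-index recurrence is the standard one for the Peano-Hilbert curve over a power-of-two grid; the function names, the grid resolution, and the leaf-packing policy are ours.

```python
def hilbert_index(n, x, y):
    """Position of grid point (x, y) along the Peano-Hilbert curve over
    an n-by-n grid, n a power of two (standard xy-to-d recurrence)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                          # rotate the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def bulk_load_leaves(items, B, grid=1 << 10):
    """Sort items (centroids) by Hilbert key, then pack every B
    consecutive items into one leaf (up to 100% storage utilization)."""
    items = sorted(items, key=lambda it: hilbert_index(grid, *it))
    return [items[i:i + B] for i in range(0, len(items), B)]
```

The nonleaf levels are then built the same way, one level at a time, on the bounding boxes of the level below.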
Bootstrapping for 2-D Diagonal Corner and Stabbing Queries
An obvious paradigm for developing an efficient dynamic EM data structure, given an existing data structure that works well when the problem fits into internal memory, is to “externalize” the internal memory data structure. If the internal memory data structure uses a binary tree, then a multiway tree such as a B-tree must be used instead. However, when searching a B-tree, it can be difficult to report the outputs in an output-sensitive manner. For example, in certain searching applications, each of the Θ(B) subtrees of a given node in a B-tree may contribute one item to the query output, and as a result each subtree may need to be explored (costing several I/Os) just to report a single output item. Fortunately, we can sometimes achieve output-sensitive reporting by augmenting the data structure with a set of filtering substructures, each of which is a data structure for a smaller version of the same problem. We refer to this approach, which we explain shortly in more detail, as the bootstrapping paradigm. Each substructure typically needs to store only O(B²) items and to answer queries in O(logB B² + Z′/B) = O(⌈Z′/B⌉) I/Os, where Z′ is the number of items reported. A substructure can even be static if it can be constructed in O(B) I/Os, since we can keep updates in a separate buffer and do a global rebuilding in O(B) I/Os whenever there are Θ(B) updates. Such a rebuilding costs O(1) I/Os (amortized) per update. We can often remove the amortization and make it worst-case using the weight-balanced B-trees of Section 27.3.2 as the underlying B-tree structure.
Arge and Vitter [17] first uncovered the bootstrapping paradigm while designing an optimal dynamic EM data structure for diagonal corner two-sided 2-D queries (see Figure 27.5(a)) that meets all three design criteria for online data structures listed in Section 27.1.2. Diagonal corner two-sided queries are equivalent to stabbing queries, which have the following form: “Given a set of one-dimensional intervals, report all the intervals ‘stabbed’ by the query value x.” (That is, report all intervals that contain x.) A diagonal corner query x on a set of 2-D points {(a1, b1), (a2, b2), . . . } is equivalent to a stabbing query x on the set of closed intervals {[a1, b1], [a2, b2], . . . }.
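The equivalence is direct, as this small sketch shows: the interval [a, b] contains x exactly when the point (a, b) satisfies a ≤ x and b ≥ x, i.e., when it lies above and to the left of the diagonal corner (x, x).

```python
def stab(intervals, x):
    """Stabbing query: all closed intervals [a, b] that contain x."""
    return [(a, b) for (a, b) in intervals if a <= x <= b]

def diagonal_corner(points, x):
    """Diagonal corner query on the points (a, b): report those with
    a <= x and b >= x, i.e. above-left of the corner (x, x)."""
    return [(a, b) for (a, b) in points if a <= x and b >= x]
```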
The EM data structure for stabbing queries is a multiway version of the well-known interval tree data structure [51, 52] for internal memory, which supports stabbing queries in O(log N + Z) CPU time and updates in O(log N) CPU time and uses O(N) space. We can externalize it by using a weight-balanced B-tree as the underlying base tree, where the nodes have degree Θ(√B). Each node in the base tree corresponds in a natural way to a one-dimensional range of x-values; its Θ(√B) children correspond to subranges called slabs, and the Θ((√B)²) = Θ(B) contiguous sets of slabs are called multislabs, as in Section 27.2 for a similar batched problem. Each input interval is stored in the lowest node v in the base tree whose range completely contains the interval. The interval is decomposed by v’s Θ(√B) slabs into at most three pieces: the middle piece that completely spans one or more slabs of v, the left end piece that partially protrudes into a slab of v, and the right end piece that partially protrudes into another slab of v, as shown in Figure 27.7.
The three pieces are stored in substructures of v. In the example in Figure 27.7, the middle piece is stored in a list associated with the multislab it spans (corresponding to the contiguous range of slabs 3–5), the left end piece is stored in a one-dimensional list for slab 2 ordered by left endpoint, and the right end piece is stored in a one-dimensional list for slab 6 ordered by right endpoint.
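The decomposition into end pieces and a middle piece can be sketched as follows, assuming for simplicity that interval endpoints do not coincide exactly with slab boundaries and that the interval crosses at least one boundary (as it must at the node that stores it).

```python
import bisect

def decompose(interval, boundaries):
    """Split [l, r] against the sorted slab boundaries of the node
    storing it: a left end piece protruding into one slab, a middle
    piece spanning whole slabs (None if absent), and a right end piece."""
    l, r = interval
    i = bisect.bisect_right(boundaries, l) - 1   # slab containing l
    j = bisect.bisect_right(boundaries, r) - 1   # slab containing r
    left = (l, boundaries[i + 1])                # protrudes into slab i
    right = (boundaries[j], r)                   # protrudes into slab j
    middle = (boundaries[i + 1], boundaries[j]) if j > i + 1 else None
    return left, middle, right
```

The middle piece goes into the multislab list for slabs i+1 through j−1; the end pieces go into the per-slab lists ordered by left and right endpoint, respectively.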
Given a query value x, the intervals stabbed by x reside in the substructures of the nodes of the base tree along the search path from the root to the leaf for x. For each such
FIGURE 27.7: Internal node v of the EM interval tree, for B = 64 with √B = 8 slabs. Node v is the lowest node in the tree completely containing the indicated interval.
The middle piece of the interval is stored in the multislab list corresponding to slabs 3–5.
(The multislab lists are not pictured.) The left and right end pieces of the interval are stored in the left-ordered list of slab 2 and the right-ordered list of slab 6, respectively.
node v, we consider each of v’s multislabs that contains x and report all the intervals in the multislab list. We also walk sequentially through the right-ordered list and left-ordered list for the slab of v that contains x, reporting intervals in an output-sensitive way.
The big problem with this approach is that we have to spend at least one I/O per multislab containing x, regardless of how many intervals are in the multislab lists. For example, there may be Θ(B) such multislab lists, with each list containing only a few stabbed intervals (or worse yet, none at all). The resulting query performance will be highly nonoptimal. The solution, according to the bootstrapping paradigm, is to use a substructure in each node consisting of an optimal static data structure for a smaller version of the same problem; a good choice is the corner data structure developed by Kanellakis et al. [74]. The corner substructure in this case is used to store all the intervals from the “sparse” multislab lists, namely those that contain fewer than B intervals, and thus the substructure contains only O(B²) intervals. When visiting node v, we access only v’s nonsparse multislab lists, each of which contributes Z′ ≥ B intervals to the output, at an output-sensitive cost of O(Z′/B) I/Os, for some Z′. The remaining Z′′ stabbed intervals stored in v can be found by a single query to v’s corner substructure, at a cost of O(logB B² + Z′′/B) = O(⌈Z′′/B⌉) I/Os. Since there are O(logB N) nodes along the search path in the base tree, the total collection of Z stabbed intervals is reported in O(logB N + z) I/Os, which is optimal. Using a weight-balanced B-tree as the underlying base tree allows the static substructures to be rebuilt in worst-case optimal I/O bounds.
Stabbing queries are important because, when combined with one-dimensional range queries, they provide a solution to dynamic interval management, in which one-dimensional intervals can be inserted and deleted, and intersection queries can be performed. These operations support indexing of one-dimensional constraints in constraint databases. Other applications of stabbing queries arise in graphics and GIS. For example, Chiang and Silva [40] apply the EM interval tree structure to extract at query time the boundary components of the isosurface (or contour) of a surface. A data structure for a related problem, which in addition has optimal output complexity, appears in [5]. The above bootstrapping approach also yields dynamic EM segment trees with optimal query and update bound and O(n logB N )-block space usage.
FIGURE 27.8: Internal node v of the EM priority search tree, with slabs (children) w1, w2, ..., w5. The Y-sets of each child, which are stored collectively in v’s child cache, are indicated by the bold points. (a) The three-sided query is completely contained in the x-range of w2. The relevant (bold) points are reported from v’s child cache, and the query is recursively answered in w2. (b) The three-sided query spans several slabs. The relevant (bold) points are reported from v’s child cache, and the query is recursively answered in w2, w3, and w5. The query is not extended to w4 in this case because not all of its Y-set Y (w4) (stored in v’s child cache) satisfies the query, and as a result, none of the points stored in w4’s subtree can satisfy the query.
Bootstrapping for Three-Sided Orthogonal 2-D Range Search
Arge et al. [15] provide another example of the bootstrapping paradigm by developing an optimal dynamic EM data structure for three-sided orthogonal 2-D range searching (see Figure 27.5(c)) that meets all three design Criteria O1–O3.
In internal memory, the optimal structure is the priority search tree [88], which answers three-sided range queries in O(log N + Z) CPU time, does updates in O(log N) CPU time, and uses O(N) space. The EM structure of Arge et al. [15] is an externalization of the priority search tree, using a weight-balanced B-tree as the underlying base tree. Each node in the base tree corresponds to a one-dimensional range of x-values, and its Θ(B) children correspond to subranges consisting of vertical slabs. Each node v contains a small substructure called a child cache that supports three-sided queries. Its child cache stores the “Y-set” Y (w) for each of the Θ(B) children w of v. The Y-set Y (w) for child w consists of the highest Θ(B) points in w’s slab that are not already stored in the child cache of some ancestor of v. There are thus a total of Θ(B²) points stored in v’s child cache.
We can answer a three-sided query of the form [x1, x2] × [y1, +∞) by visiting a set of nodes in the base tree, starting with the root. For each visited node v, we pose the query [x1, x2] × [y1, +∞) to v’s child cache and output the results. The following rules are used to determine which of v’s children to visit: We visit v’s child w if either
1. w is along the leftmost search path for x1 or the rightmost search path for x2 in the base tree, or
2. the entire Y-set Y (w) is reported when v’s child cache is queried.
(See Figure 27.8.) There are O(logB N) nodes w that are visited because of rule 1. When rule 1 is not satisfied, rule 2 provides an effective filtering mechanism to guarantee output-sensitive reporting: The I/O cost for initially accessing a child node w can be charged to the Θ(B) points of Y (w) reported from v’s child cache; conversely, if not all of Y (w) is reported, then the points stored in w’s subtree will be too low to satisfy the query, and there is no need to visit w. (See Figure 27.8(b).) Provided that each child cache can be queried in O(1) I/Os plus the output-sensitive cost to output the points satisfying the query, the resulting overall query time is O(logB N + z), as desired.
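The two visiting rules can be sketched as a filter over v’s child caches. Here each child is given as an (xlo, xhi, Y-set) triple, one per slab; this flat representation is ours, chosen only to make the filtering decision explicit.

```python
def children_to_visit(children, x1, x2, y1):
    """Apply rules 1 and 2 for the query [x1, x2] x [y1, +inf):
    children is one (xlo, xhi, yset) triple per slab, with yset the
    child's cached Y-set. Returns indices of children whose subtrees
    must be searched further."""
    visit = []
    for i, (xlo, xhi, yset) in enumerate(children):
        if xhi < x1 or xlo > x2:
            continue                                        # outside query
        on_path = xlo <= x1 <= xhi or xlo <= x2 <= xhi      # rule 1
        reported = [(x, y) for (x, y) in yset
                    if x1 <= x <= x2 and y >= y1]
        if on_path or len(reported) == len(yset):           # rule 2
            visit.append(i)
    return visit
```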
All that remains is to show how to query a child cache in a constant number of I/Os, plus the output-sensitive cost. Arge et al. [15] provide an elegant and optimal static data structure for three-sided range search, which can be used in the EM priority search tree described above to implement the child caches of size O(B²). The static structure is a persistent B-tree optimized for batched construction. When used for O(B²) points, it occupies O(B) blocks, can be built in O(B) I/Os, and supports three-sided queries in O(⌈Z′/B⌉) I/Os per query, where Z′ is the number of points reported. The static structure is so simple that it may be useful in practice on its own.
Both the three-sided structure developed by Arge et al. [15] and the structure for two-sided diagonal queries discussed in Section 27.4.3 satisfy Criteria O1–O3 of Section 27.1.2.
So in a sense, the three-sided query structure subsumes the diagonal two-sided structure, since three-sided queries are more general. However, the diagonal two-sided structure may prove to be faster in practice, because in each of its corner substructures, the data accessed during a query are always in contiguous blocks, whereas the static substructures used for three-sided search do not guarantee block contiguity. Empirical work is ongoing to evaluate the performance of these data structures.
On a historical note, earlier work on two-sided and three-sided queries was done by Ramaswamy and Subramanian [99] using the notion of path caching; their structure met Criterion O1 but had higher storage overheads and amortized and/or nonoptimal update bounds. Subramanian and Ramaswamy [113] subsequently developed the p-range tree data structure for three-sided queries, with optimal linear disk space and nearly optimal query and amortized update bounds.
General Orthogonal 2-D Range Search
The dynamic data structure for three-sided range searching can be generalized using the filtering technique of Chazelle [35] to handle general four-sided queries with optimal I/O query bound O(logB N + z) and optimal disk space usage O(n(log n)/ log(logB N + 1)) [15]. The update bound becomes O((logB N )(log n)/log(logB N + 1)), which may not be optimal.
The outer level of the structure is a balanced (logB N + 1)-way 1-D search tree with Θ(n) leaves, oriented, say, along the x-dimension. It therefore has about (log n)/ log(logB N + 1) levels. At each level of the tree, each input point is stored in four substructures (described below) that are associated with the particular tree node at that level that spans the x-value of the point. The space and update bounds quoted above follow from the fact that the substructures use linear space and can be updated in O(logB N ) I/Os.
To search for the points in a four-sided query rectangle [x1, x2] × [y1, y2], we decompose the four-sided query in the following natural way into two three-sided queries, a stabbing query, and logB N − 1 list traversals: We find the lowest node v in the tree whose x-range contains [x1, x2]. If v is a leaf, we can answer the query in a single I/O. Otherwise we query the substructures stored in those children of v whose x-ranges intersect [x1, x2]. Let 2 ≤ k ≤ logB N + 1 be the number of such children. The range query when restricted to the leftmost such child of v is a three-sided query of the form [x1, +∞) × [y1, y2], and when restricted to the rightmost such child of v, the range query is a three-sided query of the form (−∞, x2] × [y1, y2]. Two of the substructures at each node are devoted to three-sided queries of these types; using the linear-sized data structures of Arge et al. [15] in Section 27.4.4, each such query can be done in O(logB N + z) I/Os.
For the k − 2 intermediate children of v, their x-ranges are completely contained inside the x-range of the query rectangle, and thus we need only do k − 2 list traversals in y-order and retrieve the points whose y-values are in the range [y1, y2]. If we store the points in each node in y-order (in the third type of substructure), the Z′ output points from a node can be found in O(⌈Z′/B⌉) I/Os, once a starting point in the linear list is found. We can find all k − 2 starting points via a single query to a stabbing query substructure S associated with v. (This structure is the fourth type of substructure.) For each two y-consecutive points (ai, bi) and (ai+1, bi+1) associated with a child of v, we store the y-interval [bi, bi+1] in S. Note that S contains intervals contributed by each of the logB N + 1 children of v. By a single stabbing query with query value y1, we can thus identify the k − 2 starting points in only O(logB N) I/Os [17], as described in Section 27.4.3. (We actually get starting points for all the children of v, not just the k − 2 ones of interest, but we can discard the starting points we don’t need.) The total number of I/Os to answer the range query is thus O(logB N + z), which is optimal.
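The classification of v’s children can be sketched as follows. Child x-ranges are given as (lo, hi) pairs; the representation, and the assumption that at least two children intersect the query (the case 2 ≤ k above), are ours.

```python
def decompose_four_sided(child_ranges, x1, x2):
    """At the node whose x-range contains [x1, x2], classify the children
    intersecting the query x-range: the leftmost gets the three-sided
    query [x1, +inf) x [y1, y2], the rightmost gets (-inf, x2] x [y1, y2],
    and each child in between needs only a y-ordered list traversal."""
    hit = [i for i, (lo, hi) in enumerate(child_ranges)
           if hi >= x1 and lo <= x2]
    return hit[0], hit[1:-1], hit[-1]   # (left 3-sided, traversals, right 3-sided)
```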
Lower Bounds for Orthogonal Range Search
We mentioned in Section 27.4 that Subramanian and Ramaswamy [113] prove that no EM data structure for 2-D range searching can achieve design Criterion O1 of Section 27.1.2
using less than O(n(log n)/log(logB N + 1)) disk blocks, even if we relax the criterion and allow O((logB N)^c + z) I/Os per query, for any constant c. The result holds for an EM version of the pointer machine model, based upon the approach of Chazelle [36] for the internal memory model.
Hellerstein et al. [65] consider a generalization of the layout-based lower bound argument of Kanellakis et al. [74] for studying the tradeoff between disk space usage and query performance. They develop a model for indexability, in which an “efficient” data structure is expected to contain the Z output points to a query compactly within O(⌈Z/B⌉) = O(⌈z⌉) blocks. One shortcoming of the model is that it considers only data layout and ignores the search component of queries, and thus it rules out the important filtering paradigm discussed earlier in Section 27.4. For example, it is reasonable for any query to perform at least logB N I/Os, so if the output size Z is at most B, a data structure may still be able to satisfy Criterion O1 even if the output is contained within O(logB N) blocks rather than O(z) = O(1) blocks. Arge et al. [15] modify the model to rederive the same nonlinear space lower bound O(n(log n)/log(logB N + 1)) of Subramanian and Ramaswamy [113] for 2-D range searching by considering only output sizes Z larger than (logB N)^c B, for which the number of blocks allowed to hold the outputs is Z/B = O((logB N)^c + z). This approach ignores the complexity of how to find the relevant blocks, but as mentioned in Section 27.4.5, the authors separately provide an optimal 2-D range search data structure that uses the same amount of disk space and does queries in the optimal O(logB N + z) I/Os. Thus, despite its shortcomings, the indexability model is elegant and can provide much insight into the complexity of blocking data in external memory. Further results in this model appear in [79, 108].
One intuition from the indexability model is that less disk space is needed to efficiently answer 2-D queries when the queries have bounded aspect ratio (i.e., when the ratio of the longest side length to the shortest side length of the query rectangle is bounded). An interesting question is whether R-trees and the linear-space structures of Sections 27.4.1 and 27.4.2 can be shown to perform provably well for such queries. Another interesting scenario is where the queries correspond to snapshots of the continuous movement of a sliding rectangle.
When the data structure is restricted to contain only a single copy of each point, Kanth and Singh [75] show for a restricted class of index-based trees that d-dimensional range queries in the worst case require Ω(n^(1−1/d) + z) I/Os, and they provide a data structure with a matching bound. Another approach to achieve the same bound is the cross tree data structure [62] mentioned in Section 27.4.1, which in addition supports the operations of cut and concatenate.