Geometric and Spatial Data Structures in External Memory: Introduction and EM Algorithms for Batched Geometric Problems
Introduction
The Input/Output communication (or simply I/O) between the fast internal memory and the slow external memory (such as disk) can be a bottleneck when processing massive amounts of data, as is the case in many spatial and geometric applications [124]. Problems involving massive amounts of geometric data are ubiquitous in spatial databases [82, 106, 107], geographic information systems (GIS) [82, 106, 119], constraint logic programming [73, 74], object-oriented databases [130], statistics, virtual reality systems, and computer graphics [55]. NASA’s Earth Observing System project, the core part of the Earth Science Enterprise (formerly Mission to Planet Earth), produces petabytes (10^15 bytes) of raster data per year [53]. Microsoft’s TerraServer online database of satellite images is over one terabyte in size [115]. A major challenge is to develop mechanisms for processing the data, or else much of the data will be useless.
One promising approach for efficient I/O is to design algorithms and data structures that bypass the virtual memory system and explicitly manage their own I/O. We refer to such algorithms and data structures as external memory (or EM ) algorithms and data structures. (The terms out-of-core algorithms and I/O algorithms are also sometimes used.)
We concentrate in this chapter on the design and analysis of EM data structures for batched and online problems involving geometric and spatial data. Luckily, many problems on geometric objects can be reduced to a small core of problems, such as computing intersections, convex hulls, multidimensional search, range search, stabbing queries, point location, and nearest neighbor search. In this chapter we discuss useful paradigms for solving these problems in external memory.
The three primary measures of performance of an algorithm or data structure are the number of I/O operations performed, the amount of disk space used, and the internal (parallel) computation time. For reasons of brevity we shall focus in this chapter on only the first two measures. Most of the algorithms we mention run in optimal CPU time, at least for the single-processor case.
We can capture the main properties of magnetic disks and multiple disk systems by the commonly used parallel disk model (PDM) introduced by Vitter and Shriver [125]. Data is transferred in large units of blocks of size B so as to amortize the latency of moving the read-write head and waiting for the disk to spin into position. Storage systems such as RAID use multiple disks to get more bandwidth [38, 66]. The principal parameters of PDM are the following:
N = problem input data size (in terms of items),
M = size of internal memory (in terms of items),
B = size of disk block (in terms of items), and
D = # independent disks,
where M < N and 1 ≤ DB ≤ M .
Queries are naturally associated with online computations, but they can also be done in batched mode. For example, in the batched orthogonal 2-D range searching problem discussed in Section 27.2, we are given a set of N points in the plane and a set of Q queries in the form of rectangles, and the problem is to report the points lying in each of the Q query rectangles. In both the batched and online settings, the number of items reported in response to each query may vary. We thus define two more performance parameters:
Q = number of input queries (for a batched problem), and
Z = query output size (in terms of items).
If the problem does not involve queries, we set Q = 0.
It is convenient to refer to some of the above PDM parameters in units of disk blocks rather than in units of data items; the resulting formulas are often simplified. We define the lowercase notation

n = N/B,   m = M/B,   z = Z/B,   q = Q/B

to be the problem input size, internal memory size, query output size, and number of queries, respectively, in units of disk blocks.
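As a concrete illustration, the following minimal Python sketch performs this conversion from item counts to block counts (the function name and the rounding conventions are our own, not part of the model's definition):

import math

def to_blocks(N, M, Z, Q, B):
    """Convert PDM item counts to block counts (n = N/B, m = M/B, z = Z/B, q = Q/B).
    Rounding up for n, z, and q reflects that a partially filled block still costs one I/O."""
    n = math.ceil(N / B)    # input size in blocks
    m = M // B              # internal memory capacity in blocks
    z = math.ceil(Z / B)    # query output size in blocks
    q = math.ceil(Q / B)    # number of queries in blocks
    return n, m, z, q

# Example: one billion items, a one-million-item memory, and 1000-item blocks.
print(to_blocks(N=10**9, M=10**6, Z=50_000, Q=2_000, B=10**3))
# -> (1000000, 1000, 50, 2)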
For simplicity, we restrict our attention in this chapter to the single-disk case D = 1. The batched algorithms we discuss can generally be sped up by using multiple disks in an optimal
manner using the load balancing techniques discussed in [124]. Online data structures that use a single disk can generally be transformed automatically by the technique of disk striping to make optimal use of multiple disks [124].
Programs that perform well in terms of PDM will generally perform well when implemented on real systems [124]. More complex and precise models have been formulated [22, 103, 111]. Hierarchical (multilevel) memory models are discussed in [124] and its references.
Cache-oblivious models are discussed in Chapter 34.
Many of the algorithm design techniques we discuss in this chapter, which exploit data locality so as to minimize I/O communication steps, form the basis for algorithms in the other models.
Design Criteria for External Memory Data Structures
The data structures we examine in this chapter are used in batched algorithms and in online settings. In batched problems, no preprocessing is done, and the entire file of data items must be processed, often by streaming the data through the internal memory in one or more passes. Queries are done in a batched manner during the processing. The goal is generally twofold:

B1. to solve the problem in O((n + q) logm n + z) I/Os, and
B2. to use only a linear amount of disk space, that is, O(n + q) disk blocks.
Most nontrivial problems require the same number of I/Os as does sorting. In particular, criterion B1 is related to the I/O complexity of sorting N items in the PDM model, which is O(n logm n) [124].
Online data structures support query operations in a continuous manner. When the data items do not change and the data structure can be preprocessed before any queries are made, the data structure is known as static. When the data structure supports insertions and deletions of items, intermixed with the queries, the data structure is called dynamic. The primary theoretical challenges in the design and analysis of online EM data structures are threefold:

O1. to answer queries in O(logB N + z) I/Os,
O2. to use only a linear amount of disk space, that is, O(n) disk blocks, and
O3. to support dynamic updates in O(logB N) I/Os.
Criteria O1–O3 correspond to the natural lower bounds for online search in the comparison model. The three criteria are problem-dependent, and for some problems they cannot be met. For dictionary queries, we can do better using hashing, achieving O(1) I/Os per query on the average.
Criterion O1 combines together the O(logB N) I/O cost for the search with the O(⌈z⌉) I/O cost for reporting the output. When one cost is much larger than the other, the query algorithm has the extra freedom to follow a filtering paradigm [35], in which both the search component and the output reporting are allowed to use the larger number of I/Os. For example, when the output size Z is large, the search component can afford to be somewhat sloppy as long as it doesn’t use more than O(z) I/Os; when Z is small, the Z output items do not have to reside compactly in only O(⌈z⌉) blocks. Filtering is an important design paradigm in online EM data structures.
For many of the batched and online problems we consider, there is a data structure (such as a scanline structure or binary search tree) for the internal memory version of the problem that can answer each query in O(log N +Z) CPU time, but if we use the same data structure naively in an external memory setting (using virtual memory to handle page management),
a query may require Ω(log N + Z) I/Os, which is excessive.
The goal for online algorithms is to build locality directly into the data structure and explicitly manage I/O so that the log N and Z terms in the I/O bounds of the naive approach are replaced by logB N and z, respectively. The relative speedup in I/O performance, namely (log N + Z)/(logB N + z), is at least (log N )/ logB N = log B, which is significant in practice, and it can be as much as Z/z = B for large Z.
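A back-of-the-envelope calculation with assumed values of N, B, and Z makes the magnitude of this speedup concrete (the numbers below are purely illustrative):

import math

def io_speedup(N, B, Z):
    """Ratio (log N + Z) / (logB N + z) discussed above, with z = ceil(Z/B)."""
    z = max(1, math.ceil(Z / B))
    naive = math.log2(N) + Z        # naive use of an internal-memory structure
    em = math.log(N, B) + z         # locality-aware EM structure
    return naive / em

N, B = 2**30, 2**12                 # about 10^9 items, 4096 items per block
print(io_speedup(N, B, Z=16))       # ~13: roughly log B = 12 when Z is small
print(io_speedup(N, B, Z=2**20))    # ~4056: approaches B = 4096 when Z is large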
For batched problems, the I/O performance can be improved further, since the answers to the queries do not need to be provided immediately but can be reported as a group at the end of the computation. For the batched problems we consider in this chapter, the Q = qB queries collectively use O(q logm n + z) I/Os, which is about B times less than a naive approach.
Overview of Chapter
In the next section, we discuss batched versions of geometric search problems. One of the primary methods used for batched geometric problems is distribution sweeping, which uses a data structure reminiscent of the distribution paradigm in external sorting. Other useful batched techniques include persistent B-trees, batched filtering, external fractional cascading, external marriage-before-conquest, and batched incremental construction.
The most popular EM online data structure is the B-tree structure, which provides excellent performance for dictionary operations and one-dimensional range searching. We give several variants and applications of B-trees in Section 27.3. We look at several aspects of multidimensional range search in Section 27.4 and related problems such as stabbing queries and point location. Data structures for other variants and related problems such as nearest neighbor search are discussed in Section 27.5. Dynamic and kinetic data structures are discussed in Section 27.6. A more comprehensive survey of external memory data structures appears in [124].
EM Algorithms for Batched Geometric Problems
Advances in recent years have allowed us to solve a variety of batched geometric problems optimally, meeting both optimality Criteria B1 and B2 of Section 27.1.2. These problems include
1. Computing the pairwise intersections of N segments in the plane and their trapezoidal decomposition,
2. Finding all intersections between N nonintersecting red line segments and N nonintersecting blue line segments in the plane,
3. Answering Q orthogonal 2-D range queries on N points in the plane (i.e., finding all the points within the Q query rectangles),
4. Constructing the 2-D and 3-D convex hull of N points,
5. Computing the Voronoi diagram and triangulation of N points in the plane,
6. Performing Q point location queries in a planar subdivision of size N,
7. Finding all nearest neighbors for a set of N points in the plane,
8. Finding the pairwise intersections of N orthogonal rectangles in the plane,
9. Computing the measure of the union of N orthogonal rectangles in the plane,
10. Computing the visibility of N segments in the plane from a point, and
11. Performing Q ray-shooting queries in 2-D Constructive Solid Geometry (CSG) models of size N .
Goodrich et al. [59], Zhu [131], Arge et al. [18], Arge et al. [14], and Crauser et al. [44, 45] develop EM algorithms for these problems using the following EM paradigms for batched problems:
Distribution sweeping, a generalization of the sorting distribution paradigm [124] for “externalizing” plane sweep algorithms.
Persistent B-trees, an offline method for constructing an optimal-space persistent version of the B-tree data structure (see Section 27.3.1),
yielding a factor of B improvement over the generic persistence techniques of Driscoll et al. [49].
Batched filtering, a general method for performing simultaneous EM searches in data structures that can be modeled as planar layered directed acyclic graphs; it is useful for 3-D convex hulls and batched point location. Multisearch on parallel computers is considered in [48].
External fractional cascading, an EM analogue to fractional cascading on a segment tree, in which the degree of the segment tree is O(m^α) for some constant 0 < α ≤ 1. Batched queries can be performed efficiently using batched filtering; online queries can be supported efficiently by adapting the parallel algorithms of Tamassia and Vitter [114] to the I/O setting.
External marriage-before-conquest, an EM analogue to the technique of Kirkpatrick and Seidel [76] for performing output-sensitive convex hull constructions.
Batched incremental construction, a localized version of the randomized incremental construction paradigm of Clarkson and Shor [42], in which the updates to a simple dynamic data structure are done in a random order, with the goal of fast overall performance on the average. The data structure itself may have bad worst-case performance, but the randomization of the update order makes worst-case behavior unlikely. The key to gaining the factor of B I/O speedup in the EM version is to batch together the incremental modifications.
For illustrative purposes, we focus in the remainder of this section primarily on the distribution sweep paradigm [59], which is a combination of the distribution paradigm for sorting [124] and the well-known sweeping paradigm from computational geometry [47, 98]. As an example, let us consider how to achieve optimality Criteria B1 and B2 for computing the pairwise intersections of N orthogonal segments in the plane, making use of the following recursive distribution sweep: At each level of recursion, the region under consideration is
FIGURE 27.2: Distribution sweep used for finding intersections among N orthogonal segments. The vertical segments currently stored in the slabs are indicated in bold (namely s1, s2, ..., s9). Segments s5 and s9 are not active, but have not yet been deleted from the slabs. The sweep line has just advanced to a new horizontal segment that completely spans slabs 2 and 3, so slabs 2 and 3 are scanned and all the active vertical segments in slabs 2 and 3 (namely s2, s3, s4, s6, s7) are reported as intersecting the horizontal segment. In the process of scanning slab 3, segment s5 is discovered to be no longer active and can be deleted from slab 3. The end portions of the horizontal segment that “stick out” into slabs 1 and 4 are handled by the lower levels of recursion, where the intersection with s8 is eventually discovered.
partitioned into Θ(m) vertical slabs, each containing Θ(N/m) of the segments’ endpoints. We sweep a horizontal line from top to bottom to process the N segments. When the sweep line encounters a vertical segment, we insert the segment into the appropriate slab. When the sweep line encounters a horizontal segment h, as pictured in Figure 27.2, we report h’s intersections with all the “active” vertical segments in the slabs that are spanned completely by h. (A vertical segment is “active” if it intersects the current sweep line; vertical segments that are found to be no longer active are deleted from the slabs.) The remaining two end portions of h (which “stick out” past a slab boundary) are passed recursively to the next level, along with the vertical segments. The downward sweep then proceeds. After the initial sorting (to order the segments with respect to the y-dimension), the sweep at each of the O(logm n) levels of recursion requires O(n) I/Os, yielding the desired bound of criterion B1, namely O((n + q) logm n + z). Some timing experiments on distribution sweeping appear in [39].
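To make the recursion concrete, the following in-memory Python sketch follows this description (the fan-out of 4, the base-case size, and all function names are illustrative simplifications; a genuine EM implementation would stream blocks of size B, keep one output buffer per slab, and write reported intersections to disk rather than printing them):

from bisect import bisect_right

# Vertical segment:   (x, y_bottom, y_top)
# Horizontal segment: (x_left, x_right, y)

FANOUT = 4      # stands in for the Theta(m) branching factor
BASE_CASE = 8   # recurse until the subproblem is this small ("fits in memory")

def report(h, v):
    print(f"horizontal {h} intersects vertical {v}")

def distribution_sweep(verticals, horizontals):
    xs = sorted({v[0] for v in verticals} |
                {x for h in horizontals for x in (h[0], h[1])})
    if len(verticals) + len(horizontals) <= BASE_CASE or len(xs) < FANOUT:
        # Base case: brute-force check of the in-memory subproblem.
        for h in horizontals:
            xl, xr, y = h
            for v in verticals:
                if xl <= v[0] <= xr and v[1] <= y <= v[2]:
                    report(h, v)
        return

    # Partition into FANOUT slabs containing roughly equal numbers of endpoints.
    bounds = [xs[i * len(xs) // FANOUT] for i in range(1, FANOUT)]
    slab_of = lambda x: bisect_right(bounds, x)

    active = [[] for _ in range(FANOUT)]   # active verticals per slab
    sub_v = [[] for _ in range(FANOUT)]    # verticals passed to each subproblem
    sub_h = [[] for _ in range(FANOUT)]    # horizontal end portions passed down

    # Sweep from top to bottom; at equal y, insert verticals before horizontals.
    events = [(-v[2], 0, v) for v in verticals] + \
             [(-h[2], 1, h) for h in horizontals]
    for _, kind, seg in sorted(events):
        if kind == 0:                      # top endpoint of a vertical segment
            s = slab_of(seg[0])
            active[s].append(seg)
            sub_v[s].append(seg)
        else:                              # a horizontal segment
            xl, xr, y = seg
            lo, hi = slab_of(xl), slab_of(xr)
            for s in range(lo + 1, hi):    # slabs completely spanned by seg
                still_active = []
                for v in active[s]:
                    if v[1] <= y:          # v still crosses the sweep line
                        report(seg, v)
                        still_active.append(v)
                active[s] = still_active   # lazy deletion of dead segments
            # End portions that only partially cover a slab go down one level.
            if lo == hi:
                sub_h[lo].append(seg)
            else:
                sub_h[lo].append((xl, bounds[lo], y))
                sub_h[hi].append((bounds[hi - 1], xr, y))

    for s in range(FANOUT):
        distribution_sweep(sub_v[s], sub_h[s])

The lazy deletion of vertical segments that are discovered to be inactive during a slab scan mirrors the treatment of segment s5 in Figure 27.2.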
Arge et al. [14] develop a unified approach to distribution sweep in higher dimensions.
A central operation in spatial databases is spatial join. A common preprocessing step is to find the pairwise intersections of the bounding boxes of the objects involved in the spatial join. The problem of intersecting orthogonal rectangles can be solved by combining the previous sweep line algorithm for orthogonal segments with one for range searching. Arge et al. [14] take a more unified approach using distribution sweep, which is extendible to higher dimensions: The active objects that are stored in the data structure in this case are rectangles, not vertical segments. The authors choose the branching factor to be Θ(√m ).
Each rectangle is associated with the largest contiguous range of vertical slabs that it spans. Each of the possible Θ((√m)²) = Θ(m) contiguous ranges of slabs is called a multislab. The reason why the authors choose the branching factor to be Θ(√m) rather than Θ(m) is so that the number of multislabs is Θ(m), and thus there is room in internal memory for a buffer for each multislab. The height of the tree remains O(logm n).
The algorithm proceeds by sweeping a horizontal line from top to bottom to process the N rectangles. When the sweep line first encounters a rectangle R, we consider the multislab lists for all the multislabs that R intersects. We report all the active rectangles in those multislab lists, since they are guaranteed to intersect R. (Rectangles no longer active are discarded from the lists.) We then extract the left and right end portions of R that partially “stick out” past slab boundaries, and we pass them down to process in the next lower level of recursion. We insert the remaining portion of R, which spans complete slabs, into the list for the appropriate multislab. The downward sweep then continues. After the initial sorting preprocessing, each of the O(logm n) sweeps (one per level of recursion) takes O(n) I/Os, yielding the desired bound O((n + q) logm n + z).
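The following Python sketch shows the bookkeeping for a single level of this sweep (the branching factor, the names, and the omission of the recursion on the end portions are simplifications; an EM implementation would additionally maintain a block-sized buffer for each multislab list):

from bisect import bisect_right
from collections import defaultdict

# Rectangle: (x_left, x_right, y_bottom, y_top)

K = 4  # stands in for the Theta(sqrt(m)) branching factor; Theta(K^2) multislabs

def sweep_one_level(rects, bounds):
    """bounds: the K-1 slab boundaries in increasing x order."""
    slab_of = lambda x: bisect_right(bounds, x)
    multislab = defaultdict(list)   # (a, b) -> active rectangles spanning slabs a..b
    passed_down = []                # pieces for the next level of recursion (not run here)

    for r in sorted(rects, key=lambda t: -t[3]):    # sweep by decreasing top edge
        xl, xr, yb, yt = r
        lo, hi = slab_of(xl), slab_of(xr)

        # Report every stored rectangle in a multislab that r touches: its spanned
        # portion overlaps r in x, and both rectangles cross the sweep line y = yt.
        for (a, b), stored in multislab.items():
            if a <= hi and lo <= b:
                alive = [s for s in stored if s[2] <= yt]   # lazy deletion
                for s in alive:
                    print(f"{r} intersects {s}")
                multislab[(a, b)] = alive

        if lo + 1 <= hi - 1:
            # The middle part of r completely spans slabs lo+1 .. hi-1.
            multislab[(lo + 1, hi - 1)].append(r)
        # The end portions in slabs lo and hi (or all of r, if it spans no slab
        # completely) would be clipped and passed to the next recursion level.
        passed_down.append(r)

    return passed_down

# Example with K = 4 slabs split at x = 2, 4, 6:
sweep_one_level([(0, 7, 1, 5), (1, 6.5, 2, 4)], bounds=[2, 4, 6])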
The resulting algorithm, called Scalable Sweeping-Based Spatial Join (SSSJ) [13, 14], outperforms other techniques for rectangle intersection. It was tested against two other sweep line algorithms: the Partition-Based Spatial-Merge (QPBSM) used in Paradise [97] and a faster version called MPBSM that uses an improved dynamic data structure for intervals [13]. The TPIE (Transparent Parallel I/O Environment) system [11, 117, 122] served as the common implementation platform. The algorithms were tested on several data sets. The timing results for the two data sets in Figures 27.3(a) and 27.3(b) are given in Figures 27.3(c) and 27.3(d), respectively. The first data set is the worst case for sweep line algorithms; a large fraction of the line segments in the file are active (i.e., they intersect the current sweep line). The second data set is a best case for sweep line algorithms, but the two PBSM algorithms have the disadvantage of making extra copies of the rectangles. In both cases, SSSJ shows considerable improvement over the PBSM-based methods. In other experiments done on more typical data, such as TIGER/line road data sets [116], SSSJ and MPBSM perform about 30% faster than does QPBSM. The conclusion we draw is that SSSJ is as fast as other known methods on typical data, but unlike other methods, it scales well even for worst-case data. If the rectangles are already stored in an index structure, such as the R-tree index structure we consider in Section 27.4.2, hybrid methods that combine distribution sweep with inorder traversal often perform best [12].
For the problem of finding all intersections among N line segments, Arge et al. [18] give an efficient algorithm based upon distribution sort, but the output component of the I/O bound is slightly nonoptimal: z logm n rather than z. Crauser et al. [44, 45] attain the optimal I/O bound of criterion B1, namely O((n + q) logm n + z), by constructing the trapezoidal decomposition for the intersecting segments using an incremental randomized construction. For I/O efficiency, they do the incremental updates in a series of batches, in which the batch size is geometrically increasing by a factor of m.
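The batching schedule itself is simple to state. The sketch below shows the geometric growth of the rounds; the routine apply_batch stands in for the structure-specific bulk update, and the starting batch size is illustrative:

import random

def batched_incremental(items, m, apply_batch):
    """Randomized incremental construction with geometrically growing batches.
    apply_batch is the structure-specific routine that folds one batch of
    updates into the current decomposition in a single I/O-efficient pass."""
    random.shuffle(items)          # random insertion order, as in Clarkson-Shor
    i, batch_size = 0, m           # illustrative starting size
    while i < len(items):
        batch = items[i:i + batch_size]
        apply_batch(batch)
        i += len(batch)
        batch_size *= m            # each round is m times larger than the last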
Online issues also arise in the analysis of batched EM algorithms: In practice, batched algorithms must adapt in a robust and online way when the memory allocation changes, and online techniques can play an important role. Some initial work has been done on memory-adaptive EM algorithms in a competitive framework [23].
FIGURE 27.3: Comparison of Scalable Sweeping-Based Spatial Join (SSSJ) with the original PBSM (QPBSM) and a new variant (MPBSM). In this variant of the problem, each data set contains N/2 red rectangles (designated by solid sides) and N/2 blue rectangles (designated by dashed sides), and the goal is to find all intersections between red rectangles and blue rectangles. In each data set shown, the number of intersections is O(N): (a) Data set 1 consists of tall, skinny (vertically aligned) rectangles; (b) Data set 2 consists of short, wide (horizontally aligned) rectangles; (c) Running times on data set 1; (d) Running times on data set 2.