Geographic Information Systems:Geographic Information Systems: What They Are All About
Geographic Information Systems: What They Are All About
Geographic Information Systems (GISs) serve the purpose of maintaining, analyzing and visualizing spatial data that represent geographic objects, such as mountains, lakes, houses, roads, tunnels. For spatial data, geometric (spatial) attributes play a key role, representing e.g. points, lines, and regions in the plane or volumes in 3-dimensional space. They model geographical features of the real world, such as geodesic measurement points, boundary lines between adjacent pieces of land (in a cadastral database), lakes or recreational park regions (in a tourist information system). In three dimensions, spatial data describe tunnels, underground pipe systems in cities, mountain ranges, or quarries. In addition, spatial data in a GIS possess non-geometric, so-called thematic attributes, such as the time when a geodesic measurement was taken, the name of the owner of a piece of land in a cadastral database, the usage history of a park.
This chapter aims to highlight some of the data structures and algorithms aspects of GISs that define challenging research problems, and some that show interesting solutions. More background information and deeper technical expositions can be found in books such as [38, 64, 66, 75].
Geometric Objects
Our technical exposition will be limited to geometric objects with a vector representation. Here, a point is described by its coordinates in the Euclidean plane with a Cartesian co- ordinate system. We deliberately ignore the geographic reality that the earth is (almost) spherical, to keep things simple. A line segment is defined by its two end points. A polygonal line is a sequence of line segments with coinciding endpoints along the way. A (simple) polygonal region is described by its corner points, in clockwise (or counterclockwise) order around its interior. In contrast, in a raster representation of a region, each point in the region, discretized at some resolution, has an individual representation, just like a pixel in a raster image. Satellites take raster images at an amazing rate, and hence raster data abound in GISs, challenging current storage technology with terabytes of incoming data per day. Nevertheless, we are not concerned with raster images in this chapter, even though some of the techniques that we describe have implications for raster data [58]. The reason for our choice is the different flavor that operations with raster images have, as compared with vector data, requiring a chapter of their own.
Topological versus Metric Data
For some purposes, not even metric information is needed; it is sufficient to model the topology of a spatial dataset. How many states share a border with Iowa? is an example of a question of this topological type. In this chapter, however, we will not specifically study the implications that this limitation has. There is a risk of confusing the limitation to topological aspects only with the explicit representation of topology in the data structure. Here, the term explicit refers to the fact that a topological relationship need not be computed with substantial effort. As an example, assume that a partition of the plane into polygons is stored so that each polygon individually is a separate clockwise sequence of points around its interior. In this case, it is not easy to find the polygons that are neighbors of a given polygon, that is, share some boundary. If, however, the partition is stored so that each edge of the polygon explicitly references both adjacent polygons (just like the doubly connected edge list in computational geometry [62]), then a simple traversal around the given polygon will reveal its neighbors. It will always be an important design decision for a data structure which representation to pick.
Geometric Operations
Given this range of applications and geometric objects, it is no surprise that a GIS is expected to support a large variety of operations. We will discuss a few of them now, and then proceed to explain in detail how to perform two fundamental types of operations in the remainder of the chapter, spatial searching and spatial join. Spatial searching refers to rather elementary geometric operations without which no GIS can function. Here are a few examples, always working on a collection of geometric objects, such as points, lines, polygonal lines, or polygons. A nearest neighbor query for a query point asks for an object in the collection that is closest to the query point, among all objects in the collection. A distance query for a query point and a certain distance asks for all objects in the collection that are within the given distance from the query point. A range query (or window query) for a query range asks for all objects in the collection that intersect the given orthogonal window. A ray shooting query for a point and a direction asks for the object in the collection that is hit first if the ray starts at the given point and shoots in the given direction. A point- in-polygon query for a query point asks for all polygons in the collection in which the query point lies. These five types of queries are illustrations only; many more query types are just as relevant. For a more extensive discussion on geometric operations, see the chapters on geometric data structures in this Handbook. In particular, it is well understood that great care must be taken in geometric computations to avoid numeric problems, because already tiny numerical inaccuracies can have catastrophic effects on computational results. Practically all books on geometric computation devote some attention to this problem [13, 49, 62], and geometric software libraries such as CGAL [11] take care of the problem by offering exact geometric computation.
Naturally, we can only hope to respond to queries of this nature quickly, if we devise and make use of appropriate data structures. An extra complication arises here due to the fact that GISs maintain data sets too large for internal memory. Data maintenance and analysis operations can therefore be efficient only if they take external memory properties into account, as discussed also in other chapters in this Handbook. We will limit ourselves here to external storage media with direct access to storage blocks, such as disks (for raster data, we would need to include tertiary storage media such as tapes). A block access to a random block on disk takes time to move the read-write-head to the proper position (the latency), and then to read or write the data in the block (the transfer). With today’s disks, where block sizes are on the order of several kBytes, latency is a few milliseconds, and transfer time is less. Therefore, it pays to read a number of blocks in consecution, because they require the head movement only once, and in this way amortize its cost over more than one block. We will discuss in detail how to make use of this cost savings possibility.
All operations on an external memory geometric data structure follow the general filter- refinement pattern [54] that first, all relevant blocks are read from disk. This step is a first (potentially rough) filter that makes a superset of the relevant set of geometric objects available for processing in main memory. In a second step, a refinement identifies the exact set of relevant objects. Even though complicated geometric operators can make this refinement step quite time consuming, in this chapter we limit our considerations to the filter step. Because queries are the dominant operations in GISs by far, we do not explicitly discuss updates (see the chapters on external memory spatial data structures in this Handbook for more information).
Applications of Geographic Information
Before we go into technical detail, let us mention a few of the applications that make GISs a challenging research area up until today, with more fascinating problems to expect than what we can solve.
Map Overlay
Maps are the most well-known visualizations of geographical data. In its simplest form, a map is a partition of the plane into simple polygons. Each polygon may represent, for instance, an area with a specific thematic attribute value. For the attribute land use, polygons can stand for forest, savanna, lake areas in a simplistic example, whereas for the attribute state, polygons represent Arizona, New Mexico, Texas. In a GIS, each separable aspect of the data (such as the planar polygonal partitions just mentioned) is said to define a layer. This makes it easy to think about certain analysis and viewing operations, by just superimposing (overlaying) layers or, more generally, by applying Boolean operations on sets of layers. In our example, an overlay of a land use map with a state map defines a new map, where each new polygon is (some part of) the intersection of two given polygons, one from each layer. In map overlay in general, a Boolean combination of all involved thematic attributes together defines polygons of the resulting map, and one resulting attribute value in our example are the savannas of Texas. Map overlay has been studied in many different contexts, ranging from the special case of convex polygons in the partition and an internal memory plane-sweep computation [50] to the general case that we will describe in the context of spatial join processing later in this chapter.
Map Labeling
Map visualization is an entire field of its own (traditionally called cartography), with the general task to layout a map in such a way that it provides just the information that is desired, no more and no less; one might simply say, the map looks right. What that means in general is hard to say. For maps with texts that label cities, rivers, and the like, looking right implies that the labels are in proper position and size, that they do not run into each other or into important geometric features, and that it is obvious to which geometric object a label refers. Many simply stated problems in map labeling turn out to be NP-hard to solve exactly, and as a whole, map labeling is an active research area with a variety of unresolved questions (see [47] for a tutorial introduction).
Cartographic Generalization
If cartographers believe that automatically labeled maps will never look really good, they believe even more that another aspect that plays a role in map visualization will always need human intervention, namely map generalization. Generalization of a map is the process of reducing the complexity and contents of a map by discarding less important information and retaining the more essential characteristics. This is most prominently used in producing a map at a low resolution, given a map at a high resolution. Generalization ensures that the reader of the produced low resolution map is not overwhelmed with all the details from the high resolution map, displayed in small size in a densely filled area. Generalization is viewed to belong to the art of map making, with a whole body of rules of its own that can guide the artist [9, 46]. Nevertheless, computational solutions of some subproblem help a lot, such as the simplification of a high resolution polygonal line to a polygonal line with fewer corner points that does not deviate too much from the given line. For line simplification, old algorithmic ideas [16] have seen efficient implementations [28] recently. Maps on demand, with a selected viewing window, to be shown on a screen with given resolution, imply the choice of a corresponding scale and therefore need the support of data structures that allow the retrieval up to a desired degree of detail [4]. Apart from the simplest aspects, automatic map generalization and access support are open problems.
Road Maps
Maps have been used for ages to plan trips. Hence, we want networks of roads, railways, and the like to be represented in a GIS, in addition to the data described above. This fits naturally with the geometric objects that are present in a GIS in any case, such as polygonal lines. A point where polygonal lines meet (a node) can then represent an intersection of roads (edges), with a choice which road to take as we move along. The specialty in storing roads comes from the fact that we want to be able to find paths between nodes efficiently, for instance in car navigation systems, while we are driving. The fact that not all roads are equally important can be expressed by weights on the edges. Because a shortest path
Geographic Information Systems 55-5
computation is carried out as a breadth first search on the edge weighted graph, in one way or another (e.g. bidirectional), it makes sense to partition the graph into pages so as to minimize the weight of edges that cross the cuts induced by the partition. Whenever we want to maintain road data together with other thematic data, such as land use data, it also makes sense to store all the data in one structure, instead of using an extra structure for the road network. It may come as no surprise that for some data structures, partitioning the graph and partitioning the other thematic aspects go together very well (compromising a little on both sides), while for others this is not easily the case. The compromise in partitioning the graph does almost no harm, because it is NP-complete to find the optimum partition, and hence a suboptimal solution of some sort is all we can get anyway. Even though this type of heuristic approaches for maintaining road networks in GIS are useful [69], it is by no means clear whether this is the best that can be achieved.
Spatiotemporal Data
Just like for many other database applications, a time component brings a new dimension to spatial data (even in the mathematical sense of the word, if you wish). How did the forest areas in New Mexico develop over the last 20 years? Questions like this one demonstrate that for environmental information systems, a specific branch of GISs, keeping track of developments over time is a must. Spatiotemporal database research is concerned with all problems that the combination of space with time raises, from models and languages, all the way through data structures and query algorithms, to architectures and implementations of systems [36]. In this chapter, we refrain from the temptation to discuss spatiotemporal data structures in detail; see Chapter 22 for an introduction into this lively field.
Data Mining
The development of spatial data over time is interesting not only for explicit queries, but also for data mining. Here, one tries to find relevant patterns in the data, without knowing beforehand the character of the pattern (for an introduction to the field of data mining, see [27]).
Let us briefly look at a historic example for spatial data mining: A London epidemiologist identified a water pump as the centroid of the locations of cholera cases, and after the water pump was shut down, the cholera subsided. This and other examples are described in [68]. If we want to find patterns in quite some generality, we need a large data store that keeps track of data extracted from different data sources over time, a so- called data warehouse. It remains as an important, challenging open problem to efficiently run a spatial data warehouse and mine the spatial data. The spatial nature of the data seems to add the extra complexity that comes from the high autocorrelation present in typical spatial data sets, with the effect that most knowledge discovery techniques today perform poorly. This omnipresent tendency for data to cluster in space has been stated nicely [74]: Everything is related to everything else but nearby things are more related than distant things. For a survey on the state of the art and the challenges of spatial data
Comments
Post a Comment