Data Structures in Web Information Retrieval: Introduction and Inverted Indices

Introduction

Current search engines process thousands of queries per second over collections of billions of web pages with sub-second average response times. There are two reasons for this astonishing performance: massive parallelism and a simple yet efficient data structure called the inverted index.

In this chapter we will describe inverted indices. The parallelism deployed by search engines is quite straightforward: given a collection of documents and a user query, the goal of information retrieval is to find all documents that are relevant to the query and return them in decreasing order of relevance. Since on the Web there are often thousands of matches for a given user query, Web search engines usually return just the top 10 results and retrieve more results only upon request. This can easily be parallelized over m machines:

Distribute the documents equally over m − 1 machines, let each of these machines find its best up to 10 documents, and return them to the remaining machine, which then merges the lists to determine the overall top 10.
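The merge step on the result-collecting machine can be sketched as follows. This is an illustrative sketch, not the chapter's own code; the scoring scheme and the (score, doc_id) list shape are assumptions for the example.

```python
import heapq

def merge_top_k(per_machine_results, k=10):
    """Merge the (score, doc_id) lists returned by the m - 1 machines.

    Each list is assumed to be sorted by decreasing relevance score;
    the merging machine only needs the overall top k.
    """
    # heapq.merge traverses the sorted lists in parallel; scores are
    # negated so that "largest score first" becomes "smallest first".
    merged = heapq.merge(*per_machine_results, key=lambda pair: -pair[0])
    return [doc for _score, doc in list(merged)[:k]]

# Three machines, each returning its local top results:
machine_a = [(0.9, "d1"), (0.5, "d4")]
machine_b = [(0.8, "d2"), (0.7, "d3")]
machine_c = [(0.6, "d5")]
print(merge_top_k([machine_a, machine_b, machine_c], k=3))
# → ['d1', 'd2', 'd3']
```

Since every machine contributes at most 10 entries, the merge touches at most 10·(m − 1) pairs regardless of collection size.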

Since users are only presented with the top 10 results, they are usually annoyed if these results contain duplicates or near-duplicates. Thus, it is crucial for a search engine to detect near-duplicate web pages. In Section 50.4 we will describe a technique for doing so based on fingerprints, which we will introduce in Section 50.3.

Inverted Indices

Given a user query consisting of terms t1,..., tn, a search engine has to find all documents relevant to the query. Usually Web search engines make the following simplifying assumption: A document is relevant only if it contains all query terms. To find all these documents the search engine uses the inverted (file) index data structure. It consists of two parts:

• A data structure containing every word t that appears in at least one document together with a pointer pt for each word and potentially a count ft of the number of documents containing the word; and

• an inverted list for every term t that consists of all the documents containing t and is pointed to by pt.

One popular implementation is to store the sorted list of words in a search tree whose leaves are connected by a linked list. This imposes an order on the words. The inverted lists are implemented as one large (virtual) array such that t's inverted list consists of all documents between the position in the array pointed to by pt and by pt′, where t′ is the term following t in the sorted list. In the Web scenario this array does not fit onto a single machine and thus needs to be distributed over multiple machines, either in RAM or on disk. Another implementation is to store the list of words in any search tree or hash table together with the pointers pt. A special terminator document id (like 0) is appended to each inverted list, and the lists can be stored either in one large array or in multiple arrays.

To determine all documents containing the query terms t1, ..., tn (called a Boolean AND) the intersection of their inverted lists is computed by traversing the lists in parallel. If one of the query terms is negated (Boolean NOT), i.e., the user does not want this term to be contained in the result, the negation of the corresponding list is used. To speed up the computation when a very frequent term is combined with a very infrequent term, the documents in each inverted list are stored in sorted order and a one-level or two-level search tree is stored for each inverted list. The one-level search tree consists of every f-th entry in the inverted list together with a pointer to the location of this entry in the inverted list, where f is a constant like 100. At query time, entries in the inverted list of the frequent term can be skipped quickly to get to the next document that contains the infrequent term. The space requirement is only 1/f of the space requirement of the inverted lists.
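The two traversal strategies above can be sketched in Python. This is an illustrative sketch, not the chapter's own code: plain parallel traversal for lists of similar length, and a binary search standing in for the one-level search tree when a rare term meets a frequent one.

```python
import bisect

def intersect(list_a, list_b):
    """Boolean AND of two sorted inverted lists by parallel traversal."""
    result, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result

def intersect_with_skips(short_list, long_list):
    """AND of an infrequent term's list with a frequent term's list.

    Instead of stepping through long_list one entry at a time, binary
    search (playing the role of the one-level search tree) jumps
    directly to the next candidate, skipping runs of entries.
    """
    result, pos = [], 0
    for doc in short_list:
        pos = bisect.bisect_left(long_list, doc, pos)
        if pos < len(long_list) and long_list[pos] == doc:
            result.append(doc)
    return result
```

For lists of lengths a and b with a ≪ b, the skip-based variant does O(a log b) work instead of O(a + b).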

Index Compression

Of course, data compression is crucial when indexing billions of documents. For that it is usually assumed that the documents are numbered consecutively from 1 to N and that the documents are stored in the inverted list by increasing document id. A simple but powerful technique is called delta-encoding: instead of storing the actual document ids in the inverted list, the document id of the first document is stored together with the difference between the i-th and the (i + 1)-st document containing the term. When traversing the inverted list, the delta values seen so far are summed up with the first document id to obtain the document id of the current document. Note that delta-encoding does not interfere with the one-level or two-level search tree stored on top of the inverted list. The advantage of delta-encoding is that it turns large integers (document ids) into mostly small integers (the delta values), depending of course on the value of ft. A variable-length encoding scheme for the delta values, like Golomb codes, then leads to considerable space savings. We sketch Golomb codes in the next paragraph. See [13] for more details on variable-length compression schemes.
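Delta-encoding and its inverse are a few lines each; the following is an illustrative sketch (assuming a non-empty list of sorted document ids), not the chapter's own code.

```python
def delta_encode(doc_ids):
    """Store the first document id, then the gaps between consecutive ids.

    doc_ids must be a non-empty, strictly increasing list.
    """
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(deltas):
    """Running sum over the deltas restores the original document ids."""
    doc_ids, total = [], 0
    for d in deltas:
        total += d
        doc_ids.append(total)
    return doc_ids

print(delta_encode([5, 8, 9, 20]))  # → [5, 3, 1, 11]
```

The large ids 8, 9, 20 become the small gaps 3, 1, 11; it is these gaps that the Golomb codes below compress.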

There is a trade-off between the space used for the index and the time penalty encountered at query time for decompressing the index. If the index is read from disk, the time to read the appropriate part of the index dominates the decompression time and thus more compression can be used. If the index is in RAM, then less compression should be used, except if sophisticated compression is the only way to make the index fit in RAM.

Let n be the number of unique words in all the documents and let f be the number of words in all the documents, including repetitions. Given an integer x it takes at least log x bits to represent x in binary. However, the algorithm reading the index needs to know how many bits to read. There are two possibilities: (1) represent x in binary by log N bits with the top log N − log x bits set to 0; or (2) represent x in unary by x 1's followed by a 0. Usually the binary representation is chosen, but the unary representation is more space efficient when x < log N. Golomb codes combine both approaches as follows. Choose b ≈ 0.69 · (N · n)/f. The Golomb code for an integer x ≥ 1 consists of two parts:

• A unary encoding of q followed by a 0, where q = ⌊(x − 1)/b⌋; and

• a binary encoding of the remainder r = x − q·b − 1. If b is a power of 2, this requires log b bits.
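The two parts of the code can be made concrete in a short sketch (illustrative only, not the chapter's code; it assumes b is a power of 2, the case mentioned above, so that the remainder always fits in exactly log b bits).

```python
def golomb_encode(x, b):
    """Golomb code of an integer x >= 1 with parameter b (a power of 2)."""
    q = (x - 1) // b            # quotient: unary part
    r = x - q * b - 1           # remainder: binary part, 0 <= r < b
    unary = "1" * q + "0"
    width = b.bit_length() - 1  # log2(b) bits for the remainder
    return unary + format(r, "0{}b".format(width))

def golomb_decode(bits, b):
    """Invert golomb_encode: count the unary 1s, then read log2(b) bits."""
    q = 0
    while bits[q] == "1":
        q += 1
    width = b.bit_length() - 1
    r = int(bits[q + 1 : q + 1 + width], 2)
    return q * b + r + 1

print(golomb_encode(9, 4))  # → '11000': q = 2 in unary, r = 0 in 2 bits
```

With b = 4, the value 9 costs 5 bits instead of the log N bits a fixed-width binary field would use for every gap.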

Note that (N · n)/f is the average distance between entries in the inverted index, i.e., the average value after the delta-encoding. The value of b was chosen to be roughly 69% of that average. Every entry of value b or less requires 1 bit for the unary part and log b bits for the binary part. This is a considerable space saving over the binary representation since log b + 1 < log N. As long as (x − 1)/b < log N − log b − 1, there is a space saving over using the binary encoding. So only entries that are roughly log N times larger than the average require more space than under the binary encoding.

If the frequency ft of a term t is known, then the inverted list of t can be compressed even better using bt ≈ 0.69 · N/ft instead of b. This implies that for frequent terms bt is smaller than b, and thus fewer bits are used for small integers, which occur frequently in the delta-encoding of frequent terms.

Index Granularity

Another crucial issue is the level of granularity used in the inverted index. So far we have only discussed storing document ids. However, to handle some query operators, like quotes, that require the query terms to be next to each other in the document, the exact position of each term within the document is needed. There are two possible ways to handle this:

(1) One way is to consider all documents to be concatenated into one large document and then store the positions in this large document instead of the document ids. Additionally one special inverted list is constructed that stores the position of the first word of each document. At query time an implicit Boolean AND operation is performed with this special inverted list to determine to which document a given position belongs. Note that the above compression techniques continue to work. This approach is very space efficient but incurs a run-time overhead at query time for traversing the special inverted list and performing the Boolean AND operation.
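The mapping from a position in the concatenated text back to a document can be done with a binary search over the special inverted list, which plays the role of the implicit Boolean AND described above. A minimal sketch (illustrative, not the chapter's code):

```python
import bisect

def doc_for_position(doc_starts, pos):
    """Map a position in the concatenated text back to a document.

    doc_starts is the special inverted list holding the position of the
    first word of each document, in sorted order. Returns the 0-based
    index of the document containing pos.
    """
    return bisect.bisect_right(doc_starts, pos) - 1

doc_starts = [0, 100, 250]         # documents start at these positions
print(doc_for_position(doc_starts, 120))  # → 1 (second document)
```

This is where the run-time overhead mentioned above comes from: every matching position costs an extra lookup in the special list.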

(2) Another solution is to store (document id, position)-pairs in the inverted list and to delta-encode the document ids and also the positions within the same document. This approach uses more space, but does not incur the run-time overhead.
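The second approach can be sketched as follows (an illustrative sketch, not the chapter's code; the dictionary-shaped input is an assumption for the example): document ids are delta-encoded across the list, and positions are delta-encoded within each document.

```python
def encode_postings(postings):
    """Delta-encode a positional inverted list for one term.

    postings maps each document id to the sorted list of positions at
    which the term occurs in that document. Returns a list of
    (doc_gap, [position_gaps]) pairs in increasing document-id order.
    """
    encoded, prev_doc = [], 0
    for doc_id in sorted(postings):
        positions = postings[doc_id]
        pos_gaps = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
        encoded.append((doc_id - prev_doc, pos_gaps))
        prev_doc = doc_id
    return encoded

# Term occurs at positions 3 and 10 in document 7 and position 1 in document 12:
print(encode_postings({7: [3, 10], 12: [1]}))  # → [(7, [3, 7]), (5, [1])]
```

Each gap is again a small integer, so the Golomb codes from the previous section apply unchanged to both components.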

Another data structure that a search engine could use is the suffix tree (see Chapter 29). Suffix trees require more space and are more complicated, but allow more powerful queries, such as searching for arbitrary substrings.
