Data Structures in Web Information Retrieval: Finding Near-Duplicate Documents and Conclusions

Finding Near-Duplicate Documents

Users of search engines strongly dislike getting near-duplicate results on the same results page. What counts as a near-duplicate depends, of course, on the user. However, most users would agree that pages with the same content except for different ads, a different layout, or a different header or footer are near-duplicates. By “near-duplicate” we mean these kinds of “syntactic” duplicates; semantic duplication is even harder to detect.

Unfortunately, there are a variety of circumstances that create near-duplicate pages. The main reasons are (1) local copies of public-domain pages (for example, the Perl manual or the preamble of the US Constitution) or of various databases, and (2) multiple visits of the same page by the crawler (the part of the search engine that retrieves web pages to store in the index) without detection of the replication. The latter can happen when there are small changes to both the URL and the content of the page. Crawlers usually fingerprint the URL as well as the content of a page and can thus easily detect exact duplicates.

There are various approaches to detect near-duplicate pages. One approach that works very well in the Web setting was given by Broder [4] and was validated on the Web [6]. The basic idea is to associate a set of fingerprints with each document such that the similarity of two pages is proportional to the size of the intersection of their respective sets.

We first describe how to compute the fingerprints and then show how they can be used to efficiently detect near-duplicates. The set of fingerprints for a document is computed using the shingling technique introduced by Brin et al. [2]. Let a token be either a letter, a word, a line, or a sentence in a document. Each document then consists of a sequence of tokens. We call a contiguous sequence of q tokens a q-shingle of the document. We want to compute fingerprints for all q-shingles in a document. Using Rabin’s fingerprints this can be done efficiently: we start with the first shingle and then use a sliding-window approach, computing the fingerprint of the shingle currently in the window.
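For illustration, here is a minimal Python sketch of this step. It assumes word tokens, q = 4, and a simple polynomial rolling hash standing in for true Rabin fingerprints; the names and parameters are ours, not part of the original scheme.

```python
import hashlib

Q = 4                  # shingle length in tokens (an assumed value)
BASE = 1_000_003       # multiplier of the polynomial rolling hash
MOD = (1 << 61) - 1    # fingerprint values live in [0, MOD)

def _token_hash(token: str) -> int:
    # Stable 64-bit token hash (Python's built-in hash() is salted per process).
    return int.from_bytes(hashlib.blake2b(token.encode(), digest_size=8).digest(), "big")

def shingle_fingerprints(text: str) -> set[int]:
    """Fingerprints of all q-shingles of a document, computed with a sliding window."""
    tokens = [_token_hash(t) for t in text.split()]   # word tokens, hashed once
    if len(tokens) < Q:
        return set()
    high = pow(BASE, Q - 1, MOD)          # weight of the token leaving the window
    fp = 0
    for t in tokens[:Q]:                  # fingerprint of the first shingle
        fp = (fp * BASE + t) % MOD
    fps = {fp}
    for i in range(Q, len(tokens)):       # slide the window one token at a time
        fp = ((fp - tokens[i - Q] * high) * BASE + tokens[i]) % MOD
        fps.add(fp)
    return fps
```

Each window update costs constant time, so all shingle fingerprints of a document are computed in time linear in its length.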

Let SD be the set of fingerprints generated for document D. The idea is simply to define the similarity or resemblance r(A, B) of documents A and B as

r(A, B) = |SA ∩ SB| / |SA ∪ SB|.

Experiments have indicated that a resemblance value close to 1 captures well the informal notion of “syntactic” near-duplication that we discussed above.
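For concreteness, computing the resemblance of two fingerprint sets directly is a one-liner (a minimal sketch; the function name is ours):

```python
def resemblance(sa: set[int], sb: set[int]) -> float:
    """r(A, B) = |SA ∩ SB| / |SA ∪ SB| over the fingerprint sets."""
    if not sa and not sb:
        return 1.0      # convention for two empty documents
    return len(sa & sb) / len(sa | sb)
```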

Storing the whole set SD would take a lot of space. It suffices, however, to keep a fixed number of fingerprints from SD for each document D. This subset is called a sketch. As we show below, the sketches can be computed in time linear in the size of D and are stored for each document. The resemblance of two documents can be approximated using their sketches instead of the full sets of fingerprints, so that the time needed is only linear in the size of the sketches.

Sketches are determined as follows: Recall that each fingerprint requires k bits. Thus, SD ⊆ {1, ..., n}, where n = 2^k. Let π be a permutation chosen uniformly at random from the set of all permutations of [n], and let X = min(min{π(SA)}, min{π(SB)}).

The crucial observation is that if min{π(SA)} = min{π(SB)}, then X = min{π(SA)} must belong to π(SA) ∩ π(SB). If min{π(SA)} ≠ min{π(SB)}, then X belongs to either π(SA) \ π(SB) or to π(SB) \ π(SA), and thus X does not belong to π(SA) ∩ π(SB). It follows that min{π(SA)} = min{π(SB)} if and only if X belongs to π(SA) ∩ π(SB). Since π was chosen uniformly at random, the probability of the latter is

Pr[min{π(SA)} = min{π(SB)}] = |SA ∩ SB| / |SA ∪ SB| = r(A, B).

We choose p independent random permutations. For each document the sketch consists of the permuted fingerprint values min{π1(SD)}, min{π2(SD)}, ..., min{πp(SD)}. The resemblance of two documents is then estimated from the size of the intersection of their sketches, whose expected value is proportional to the resemblance.

In practice, π cannot be chosen uniformly at random, which led to the study of min-wise independent permutations [5].
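The following sketch follows the construction above but, in that spirit, replaces each truly random permutation by a random linear map over a prime-sized universe; the sketch size p = 64 and the seeding are our assumptions.

```python
import random

P = 64                   # number of permutations, i.e. the sketch size (assumed)
PRIME = (1 << 61) - 1    # prime modulus covering the fingerprint universe

# Approximate each permutation pi_i by the map x -> (a_i * x + b_i) mod PRIME.
_rng = random.Random(0)
_PERMS = [(_rng.randrange(1, PRIME), _rng.randrange(PRIME)) for _ in range(P)]

def sketch(fps: set[int]) -> list[int]:
    """Keep min{pi_i(SD)} for i = 1, ..., p (assumes a non-empty fingerprint set)."""
    return [min((a * f + b) % PRIME for f in fps) for a, b in _PERMS]

def estimated_resemblance(s1: list[int], s2: list[int]) -> float:
    """The fraction of agreeing sketch positions estimates r(A, B)."""
    return sum(x == y for x, y in zip(s1, s2)) / len(s1)
```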

To detect the near-duplicates in a set of documents we build, for each shingle in a sketch, the list of documents to which this shingle belongs. Using these lists we generate, for each shingle and each pair of documents containing this shingle, a (document id, document id) pair. Finally we simply sort these pairs to determine the resemblance of each pair of documents. The running time is proportional to the number of (document id, document id) pairs. If each shingle belongs to only a constant number of documents, it is linear in the number of unique shingles.
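A minimal sketch of this batch step, reusing the sketch representation above: with independent permutations, two documents “share a shingle” when they agree at a sketch position, counting the shared entries plays the role of the sort, and the 0.9 threshold is an assumption.

```python
from collections import Counter, defaultdict

def near_duplicate_pairs(sketches: dict[str, list[int]], threshold: float = 0.9):
    """Document pairs whose estimated resemblance reaches the threshold."""
    if not sketches:
        return []
    # For each (sketch position, fingerprint value), the documents sharing it.
    postings = defaultdict(list)
    for doc, sk in sketches.items():
        for pos, val in enumerate(sk):
            postings[(pos, val)].append(doc)

    # Every shared entry contributes one (document id, document id) pair.
    matches = Counter()
    for docs in postings.values():
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                matches[(docs[i], docs[j])] += 1

    p = len(next(iter(sketches.values())))
    return [(a, b, m / p) for (a, b), m in matches.items() if m / p >= threshold]
```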

In a web search engine it often suffices to remove near-duplicates at query time. To do this, the sketch is stored with each document. The search engine determines the top 50 or so results and then performs a pairwise near-duplicate detection with some resemblance threshold, removing the near-duplicate results, to determine the top 10 results.
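A sketch of such a query-time filter, again with an assumed 0.9 threshold and the estimated_resemblance helper from above:

```python
def dedupe_top_results(ranked_docs: list[str], sketches: dict[str, list[int]],
                       threshold: float = 0.9, wanted: int = 10) -> list[str]:
    """Walk the results in rank order, dropping near-duplicates of already kept ones."""
    kept: list[str] = []
    for doc in ranked_docs:                     # e.g. the top 50 candidates
        if all(estimated_resemblance(sketches[doc], sketches[k]) < threshold
               for k in kept):
            kept.append(doc)
        if len(kept) == wanted:                 # stop once the top 10 are filled
            break
    return kept
```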

The near-duplicate detection algorithm described above assumes that the similarity of two documents depends on the size of the overlap of their fingerprint sets; this corresponds to the Jaccard coefficient of similarity used in information retrieval. A more standard approach in information retrieval is to assign a weight vector of terms to each document and to define the similarity of two documents as the cosine of their (potentially normalized) term vectors. Charikar [7] gives an efficient near-duplicate detection algorithm based on this measure. He also presents a near-duplicate detection algorithm for another metric, the Earth Mover Distance.
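To give an intuition for the cosine-based measure: one can assign each document a short bit signature in which every bit is the sign of the term vector’s projection onto a random direction; the fraction of agreeing bits then estimates the angle between two vectors, and hence their cosine. The sketch below illustrates only this random-hyperplane idea; it is not Charikar’s exact construction, and the bit width and seeding are assumptions.

```python
import math
import random

B = 64   # signature length in bits (assumed)

def _component(term: str, i: int) -> float:
    # Deterministic pseudo-random Gaussian coordinate of hyperplane i in the
    # dimension of `term`, so every document sees the same hyperplanes.
    return random.Random(f"{i}:{term}").gauss(0.0, 1.0)

def cosine_signature(weights: dict[str, float]) -> int:
    """One sign bit per random hyperplane for a sparse term-weight vector."""
    sig = 0
    for i in range(B):
        dot = sum(w * _component(t, i) for t, w in weights.items())
        if dot >= 0:
            sig |= 1 << i
    return sig

def estimated_cosine(sig_a: int, sig_b: int) -> float:
    """The fraction of differing bits estimates the angle, hence the cosine."""
    hamming = bin(sig_a ^ sig_b).count("1")
    return math.cos(math.pi * hamming / B)
```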

Other similarity detection mechanisms are given in [2, 9–11]. Many near-duplicate web pages are created because a whole web host is duplicated. Thus, a large percentage of the near-duplicate web pages can be detected if near-duplicate web hosts can be found. Techniques for this problem are presented in [1, 8].

Conclusions

In this chapter we presented the dominant data structure for web search engines, the inverted index. We also described fingerprints, which are useful in multiple places in a search engine, and document sketches for near-duplicate detection. Another useful data structure for web information retrieval is the adjacency list: all web search engines claim to perform some type of analysis of the hyperlink structure, and for this they need to store the list of incoming links for each document, a use of the classic adjacency-list representation.

Web search engines also sometimes analyze the log of user queries issued at the search engine. With hundreds of millions of queries per day these logs are huge and can only be processed with one-pass data stream algorithms.
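As one example of what such a one-pass algorithm looks like, the Misra–Gries frequent-items summary finds the heavy hitters of a query stream in a single pass with bounded memory; the capacity k below is an assumed parameter.

```python
def frequent_queries(stream, k: int = 1000) -> dict[str, int]:
    """One pass over the query stream, keeping at most k - 1 counters.

    Every query issued more than N/k times (N = stream length) survives,
    although its counter only lower-bounds its true frequency.
    """
    counters: dict[str, int] = {}
    for query in stream:
        if query in counters:
            counters[query] += 1
        elif len(counters) < k - 1:
            counters[query] = 1
        else:
            # Decrement every counter and drop those that reach zero.
            for q in list(counters):
                counters[q] -= 1
                if counters[q] == 0:
                    del counters[q]
    return counters
```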
