Data Structures in Web Information Retrieval:Fingerprints
Fingerprints
Fingerprints are short strings that represent larger strings and have the following properties:
• If the fingerprints of two strings are different then the strings are guaranteed to be different.
• If the fingerprints of two strings are identical then there is only a small probability that the strings are different.
If two different strings have the same fingerprint, it is called a collision. Thus, fingerprints are a type of hash functions where the hash table is populated only very sparsely in exchange for a very low collision probability.
Fingerprints are very useful. For example, search engines can use them to quickly check whether a URL that is contained on a web page is already stored in the index. We will describe how they are useful for finding near-duplicate documents in the next section. One fingerprinting scheme is due to Rabin [12] and is equivalent to cyclic redundancy checks. We follow here the presentation by Broder [3]. Let s be a binary string and let st = 1s, i.e. st equals s prefixed with a 1. The string st = (s1, s2,... , sm) induces a polynomial S(t) over Z2 as follows:
S(t) = s1tm−1 + s2tm−2 + ··· + sm.
Let P (t) be an irreducible polynomial of degree k over Z2. The fingerprint of s is defined to be the polynomial
f (s) = S(t) mod P (t)
over Z2, which can be simply represented as a string of its coefficients. As shown in [3] the probability that two distinct strings of length m from a set of n strings have the same fingerprint is less than nm2/2k. Thus, given n and m the probability of collision can be reduced to the required value simply by increasing k, i.e. the length of a fingerprint.
Rabin’s fingerprints have the property that given two overlapping strings s = (s1, s2,..., sm) and t = (t1, t2,... , tm) such that ti+1 = si for all 1 ≤ i < m then the fingerprint of t can be computed from the fingerprint of s very efficiently: If
Hence f (t) can be computed from f (s) as follows: Assume f (s) is stored in a shift register and Q(t) is stored in another register. Shift left with tm as input and if r1 = 1 perform a bit-wise EX-OR operation with Q(t). 32-bit word computer.
See [3] for a software implementation for a typical Other ways of computing fingerprints are cryptographic checksums like Sha-1 or MD5.
Comments
Post a Comment