Hash Tables: Introduction and Hash Tables for Integer Keys

Introduction

A set abstract data type (set ADT) maintains a set S under the following three operations:

1. Insert(x): Add the key x to the set.

2. Delete(x): Remove the key x from the set.

3. Search(x): Determine if x is contained in the set, and if so, return a pointer to x.
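To make this interface concrete, the following minimal Python sketch shows the contract that every hash table in this chapter implements; the class and method names are illustrative only, not taken from the chapter.

    # A minimal sketch of the set ADT interface; concrete hash tables below
    # provide implementations of these three operations.
    class SetADT:
        def insert(self, x):
            """Add the key x to the set."""
            raise NotImplementedError

        def delete(self, x):
            """Remove the key x from the set."""
            raise NotImplementedError

        def search(self, x):
            """Return x (a reference to the stored key) if present, else None."""
            raise NotImplementedError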

One of the most practical and widely used methods of implementing the set ADT is with hash tables.

Note that the three set ADT operations can easily be implemented to run in O(log n) time per operation using balanced binary search trees (see Chapter 10). If we assume that the input data are integers in the set U = {0, ..., u − 1} then they can even be implemented to run in sub-logarithmic time using data structures for integer searching (Chapter 39). However, these data structures actually do more than the three basic operations we require. In particular, if we search for an element x that is not present in S then these data structures can report the smallest item in S that is larger than x (the successor of x) and/or the largest item in S that is smaller than x (the predecessor of x).

Hash tables do away with this extra functionality of finding predecessors and successors and only perform exact searches. If we search for an element x in a hash table and x is not present, then the only information we obtain is that x ∉ S. By dropping this extra functionality hash tables can give better performance bounds. Indeed, any reasonable hash table implementation performs each of the three set ADT operations in O(1) expected time.

The main idea behind all hash table implementations discussed in this chapter is to store a set of n = |S| elements in an array (the hash table) A of length m ≥ n. In doing this, we require a function that maps any element x to an array location. This function is called a hash function h and the value h(x) is called the hash value of x. That is, the element x gets stored at the array location A[h(x)]. The occupancy of a hash table is the ratio α = n/m of stored elements to the length of A.

The study of hash tables follows two very different lines. Many implementations of hash tables are based on the integer universe assumption: all elements stored in the hash table come from the universe U = {0, ..., u − 1}. In this case, the goal is to design a hash function h : U → {0, ..., m − 1} so that for each i ∈ {0, ..., m − 1}, the number of elements x ∈ S such that h(x) = i is as small as possible. Ideally, the hash function h would be such that each element of S is mapped to a unique value in {0, ..., m − 1}. Most of the hash functions designed under the integer universe assumption are number-theoretic constructions. Several of these are described in Section 9.2.

Historically, the integer universe assumption seems to have been justified by the fact that any data item in a computer is represented as a sequence of bits that can be interpreted as a binary number. However, many complicated data items require a large (or variable) number of bits to represent, and this makes u, the size of the universe, very large. In many applications u is much larger than the largest integer that can fit into a single word of computer memory. In this case, the computations performed in number-theoretic hash functions become inefficient.

This motivates the second major line of research into hash tables. This research is based on the random probing assumption: each element x that is inserted into a hash table is a black box that comes with an infinite random probe sequence x0, x1, x2, ..., where each xi is independently and uniformly distributed in {0, ..., m − 1}. Hash table implementations based on the random probing assumption are described in Section 9.3.

Both the integer universe assumption and the random probing assumption have their place in practice. When there is an easily computed mapping of data elements onto machine-word-sized integers, hash tables for integer universes are the method of choice. When such a mapping is not so easy to compute (variable-length strings are an example) it might be better to use the bits of the input items to build a good pseudorandom sequence and use this sequence as the probe sequence for some random probing data structure.

To guarantee good performance, many hash table implementations require that the occupancy α be a constant strictly less than 1. Since the number of elements in a hash table changes over time, this requires that the array A be resized periodically. This is easily done, without increasing the amortized cost of hash table operations, by choosing three constants 0 < α1 < α2 < α3 < 1 so that, whenever n/m is not in the interval (α1, α3), the array A is resized so that its size is n/α2.

A simple amortization argument (Chapter 1) shows that the amortized cost of this resizing is O(1) per update (Insert/Delete) operation.
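The following small Python sketch illustrates this resizing rule; the thresholds and the rebuild(new_size) helper are illustrative assumptions, standing in for whatever rehash-everything routine a concrete table provides.

    # Occupancy-driven resizing: rebuild whenever n/m leaves (ALPHA1, ALPHA3),
    # choosing the new size so the occupancy becomes ALPHA2.
    ALPHA1, ALPHA2, ALPHA3 = 0.25, 0.5, 0.75   # 0 < a1 < a2 < a3 < 1

    def maybe_resize(table):
        occupancy = table.n / table.m
        if occupancy <= ALPHA1 or occupancy >= ALPHA3:
            # rebuild() is assumed to allocate a fresh array of the given size
            # and rehash all n stored elements into it (O(n) time).
            table.rebuild(new_size=max(1, int(table.n / ALPHA2)))

The usual amortization argument applies because, after a rebuild, Θ(n) updates must occur before the occupancy can leave (α1, α3) again.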

Hash Tables for Integer Keys

In this section we consider hash tables under the integer universe assumption, in which the key values x come from the universe U = {0, ..., u − 1}. A hash function h is a function whose domain is U and whose range is the set {0, ..., m − 1}, m ≤ u. A hash function h is said to be a perfect hash function for a set S ⊆ U if, for every x ∈ S, h(x) is unique. A perfect hash function h for S is minimal if m = |S|, i.e., h is a bijection between S and {0, ..., m − 1}. Obviously a minimal perfect hash function for S is desirable since it allows us to store all the elements of S in a single array of length n. Unfortunately, perfect hash functions are rare, even for m much larger than n. If each element of S is mapped independently and uniformly to a random element of {0, ..., m − 1} then the birthday paradox (see, for example, Feller [27]) states that, if m is much less than n² then there will almost surely exist two elements of S that have the same hash value.
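For intuition, the collision probability under uniform random hashing can be computed directly. The short Python snippet below is a standard back-of-the-envelope calculation (not from the chapter); the numbers in the example are illustrative.

    # Probability that n keys, hashed independently and uniformly into m slots,
    # produce no collision: product over j of (1 - j/m) for j = 0, ..., n-1.
    def prob_no_collision(n, m):
        p = 1.0
        for j in range(n):
            p *= 1.0 - j / m
        return p

    # Example: even with m = 1_000_000 slots, n = 2000 keys collide with
    # probability about 0.86, since n^2 is four times m.
    if __name__ == "__main__":
        print(1 - prob_no_collision(2000, 1_000_000))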

We begin our discussion with two commonly used hashing schemes that are heuristic in nature. That is, we cannot make any non-trivial statements about the performance of these schemes when storing an arbitrary set S. We then discuss several schemes that have provably good performance.

Hashing by Division

In hashing by division, we use the hash function

h(x) = x mod m .

To use this hash function in a data structure, we maintain an array A[0], ..., A[m − 1] where each element of this array is a pointer to the head of a linked list (Chapter 2). The linked list Li pointed to by the array element A[i] contains all the elements x such that h(x) = i. This technique of maintaining an array of lists is called hashing with chaining.

In such a hash table, inserting an element x takes O(1) time; we compute i = h(x) and append (or prepend) x to the list Li. However, searching for and/or deleting an element x is not so easy. We have to compute i = h(x) and then traverse the list Li until we either find x or reach the end of the list. The cost of this is proportional to the length of Li. Obviously, if our set S consists of the elements 0, m, 2m, 3m, ..., nm then all elements are stored in the list L0 and searches and deletions take linear time.
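A minimal Python sketch of hashing by division with chaining follows; the class name and default table size are illustrative, and the resizing discussed in the introduction is omitted to keep the sketch short.

    # Hashing by division with chaining: bucket i holds the list of all keys x
    # with x mod m == i.  No resizing is performed in this sketch.
    class ChainedHashTable:
        def __init__(self, m=13):            # m ideally a prime (see below)
            self.m = m
            self.lists = [[] for _ in range(m)]

        def _h(self, x):
            return x % self.m

        def insert(self, x):
            self.lists[self._h(x)].append(x)     # O(1)

        def search(self, x):
            bucket = self.lists[self._h(x)]
            return x if x in bucket else None    # O(length of the list L_i)

        def delete(self, x):
            bucket = self.lists[self._h(x)]
            if x in bucket:
                bucket.remove(x)                 # O(length of the list L_i)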

However, one hopes that such pathological cases do not occur in practice. For example, if the elements of S are uniformly and independently distributed in U and u is a multiple of m then the expected size of any list Li is only n/m. In this case, searches and deletions take O(1 + α) expected time. To help avoid pathological cases, the choice of m is important. In particular, choosing m to be a power of 2 is usually avoided since, in a binary computer, taking the remainder modulo a power of 2 simply keeps the low-order bits and discards the high-order bits, so the hash value ignores much of the key. Taking m to be a prime not too close to a power of 2 is recommended [37].

Hashing by Multiplication

The implementation of a hash table using hashing by multiplication is exactly the same as that of hashing by division except that the hash function

h(x) = ⌊m (xA mod 1)⌋ = ⌊m (xA − ⌊xA⌋)⌋

is used. Here A is a real-valued constant whose choice we discuss below. The advantage of the multiplication method is that the value of m is not critical. We can take m to be a power of 2, which makes it convenient for use on binary computers.

Although any value of A gives a hash function, some values of A are better than others. (Setting A = 0 is clearly not a good idea.)

Knuth [37] suggests using the golden ratio for A, i.e., setting

A = (√5 − 1)/2 = 0.6180339887...

This choice of A is motivated by a theorem, first conjectured by Oderfeld and later proven by Swierczkowski [59]. This theorem states that the sequence mA mod m, 2mA mod m, 3mA mod m, ..., nmA mod m partitions the interval (0, m) into n + 1 intervals having at most three distinct lengths. Furthermore, the next element (n + 1)mA mod m in the sequence always falls in one of the largest intervals. Of course, no matter what value of A we select, the pigeonhole principle implies that if u ≥ nm then there always exists some hash value i and some S ⊆ U of size n such that h(x) = i for all x ∈ S. In other words, we can always find a set S all of whose elements get stored in the same list Li. Thus, the worst case of hashing by multiplication is as bad as that of hashing by division.
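In a word-level implementation the multiplication method is usually realized with integer arithmetic (often called Fibonacci hashing when the golden-ratio constant is used). The Python sketch below is one such realization under illustrative assumptions: 32-bit keys, m = 2^k, and the constant ⌊2^32 · A⌋ for the value of A suggested above.

    # Multiplicative hashing with 32-bit words and m = 2**k.
    # 2654435769 = floor(2**32 * (sqrt(5) - 1) / 2), the golden-ratio constant.
    A32 = 2654435769

    def hash_multiply(x, k):
        """Return a hash value in {0, ..., 2**k - 1} for a 32-bit integer x."""
        # (x * A32) mod 2**32 is the fractional part of x*A scaled by 2**32;
        # shifting right by (32 - k) keeps its top k bits, i.e. floor(m * (xA mod 1)).
        return ((x * A32) & 0xFFFFFFFF) >> (32 - k)

Because only a multiplication, a mask, and a shift are needed, this is typically cheaper than the division required by x mod m when m is a prime.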

Universal Hashing

The argument used at the end of the previous section applies equally well to any hash function h. That is, if the table size m is much smaller than the universe size u then for any hash function there is some large (of size at least ⌈u/m⌉) subset of U all of whose elements have the same hash value. To get around this difficulty we need a collection of hash functions from which we can choose one that works well for S. Even better would be a collection of hash functions such that, for any given S, most of the hash functions work well for S. Then we could simply pick one of the functions at random and have a good chance of it working well.

Let H be a collection of hash functions, i.e., functions from U to {0, ..., m − 1}. We say that H is universal if, for each pair of distinct values x, y ∈ U, the number of h ∈ H such that h(x) = h(y) is at most |H|/m. Consider any S ⊆ U of size n and suppose we choose a random hash function h from a universal collection of hash functions. Consider some value x ∈ U. The probability that any particular key y ∈ S, y ≠ x, has the same hash value as x is at most 1/m. Therefore, the expected number of keys in S, not equal to x, that have the same hash value as x is only

∑_{y ∈ S, y ≠ x} Pr[h(y) = h(x)] ≤ n/m = α .

Therefore, if we store S in a hash table using the hash function h then the expected time to search for, or delete, x is O(1 + α).

From the preceding discussion, it seems that a universal collection of hash functions from which we could quickly select one at random would be very handy indeed. With such a collection at our disposal we get an implementation of the set ADT that has O(1) insertion time and O(1) expected search and deletion time.

Carter and Wegman [8] describe three different collections of universal hash functions. If the universe size u is a prime number then

H = { hk1,k2 : k1 ∈ {1, ..., u − 1}, k2 ∈ {0, ..., u − 1} },  where hk1,k2(x) = ((k1x + k2) mod u) mod m,

is a collection of universal hash functions. Clearly, choosing a function uniformly at random from H can be done easily by choosing two random values k1 ∈ {1, ..., u − 1} and k2 ∈ {0, ..., u − 1}. Thus, we have an implementation of the set ADT with O(1) expected time per operation.
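A small Python sketch of this construction follows; the particular prime u and the example keys are illustrative assumptions (in practice u is any prime at least as large as the largest key).

    import random

    # One random member of the Carter-Wegman family
    # h_{k1,k2}(x) = ((k1*x + k2) mod u) mod m, with u prime.
    def random_universal_hash(u, m):
        k1 = random.randrange(1, u)   # k1 in {1, ..., u-1}
        k2 = random.randrange(0, u)   # k2 in {0, ..., u-1}
        return lambda x: ((k1 * x + k2) % u) % m

    # Example usage: a chained table of size m = 101 for keys smaller than u.
    u, m = 2**31 - 1, 101             # 2**31 - 1 is prime (a Mersenne prime)
    h = random_universal_hash(u, m)
    table = [[] for _ in range(m)]
    for key in (17, 42, 123456, 99991):
        table[h(key)].append(key)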

Static Perfect Hashing

The result of Carter and Wegman on universal hashing is very strong, and from a practical point of view, it is probably the strongest result most people will ever need. The only thing that could be improved about their result is to make it deterministic, so that the running times of all operations are O(1) worst-case. Unfortunately, this is not possible, as shown by Dietzfelbinger et al. [23].

Since there is no hope of getting O(1) worst-case time for all three set ADT operations, the next best thing would be to have searches that take O(1) worst-case time. In this section we describe the method of Fredman, Komlós and Szemerédi [28]. This is a static data structure that takes as input a set S ⊆ U and builds a data structure of size O(n) that can test if an element x is in S in O(1) worst-case time. Like the universal hash functions from the previous section, this method also requires that u be a prime number. This scheme uses hash functions of the form

hk,m(x) = (kx mod u) mod m .

Let Bk,m(S, i) be the number of elements x ∈ S such that hk,m(x) = i, i.e., the number of elements of S that have hash value i when using the hash function hk,m. The function Bk,m gives complete information about the distribution of hash values of S. The main lemma

used by Fredman et al. is that, if we choose k ∈ U uniformly at random, then

E[ ∑_{i=0}^{m−1} Bk,m(S, i)(Bk,m(S, i) − 1)/2 ] < n²/m .    (9.1)

There are two important special cases of this result.

In the sparse case we take m = n²/α for some constant 0 < α < 1. In this case, the expectation in (9.1) is less than α. Therefore, by Markov's inequality, the probability that this sum is greater than or equal to 1 is at most α. But, since this sum is a non-negative integer, with probability at least 1 − α it must be equal to 0. In other words, with probability at least 1 − α, Bk,m(S, i) ≤ 1 for all 0 ≤ i ≤ m − 1, i.e., the hash function hk,m is perfect for S. Of course this implies that we can find a perfect hash function very quickly by trying a small number of random elements k ∈ U and testing if they result in perfect hash functions. (The expected number of elements that we will have to try is only 1/(1 − α).) Thus, if we are willing to use quadratic space then we can perform searches in O(1) worst-case time.
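The following Python sketch implements this search for a perfect hash function of the form hk,m(x) = (kx mod u) mod m with m = ⌈n²/α⌉; the function name and default α are illustrative. By the argument above, each random k succeeds with probability at least 1 − α.

    import math
    import random

    def find_perfect_hash(S, u, alpha=0.5):
        """Find (k, m) so that x -> (k*x % u) % m is perfect for S; u must be prime."""
        S = set(S)
        n = len(S)
        m = max(1, math.ceil(n * n / alpha))      # sparse case: m = n^2 / alpha
        while True:
            k = random.randrange(1, u)
            values = {(k * x % u) % m for x in S}
            if len(values) == n:                  # no two keys of S collide
                return k, m
        # Expected number of attempts is at most 1 / (1 - alpha).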

In the dense case we assume that m is close to n and discover that, for many values of k, the hash values are distributed fairly evenly among the set {0, ..., m − 1}. More precisely, if we use a table of size m = n, then

E[ ∑_{i=0}^{n−1} Bk,n(S, i)² ] < 3n .

By Markov’s inequality this means that

Pr[ ∑_{i=0}^{n−1} Bk,n(S, i)² ≤ 3n/α ] ≥ 1 − α .    (9.2)

Again, we can quickly find a value of k satisfying (9.2) by testing a few randomly chosen values of k.

These two properties are enough to build a two-level data structure that uses linear space and executes searches in worst-case constant time. We call the following data structure the FKS-α data structure, after its inventors Fredman, Komlós and Szemerédi. At the top level, the data structure consists of an array A[0], ..., A[m − 1] where m = n. The elements of this array are pointers to other arrays A0, ..., Am−1, respectively. To decide what will be stored in these other arrays, we build a hash function hk,m that satisfies the conditions of (9.2). This gives us the top-level hash function hk,m(x) = (kx mod u) mod m. Each element x ∈ S gets stored in the array pointed to by A[hk,m(x)].

What remains is to describe how we use the arrays A0, ..., Am−1. Let Si denote the set of elements x ∈ S such that hk,m(x) = i. The elements of Si will be stored in Ai.

The size of Si is ni = Bk,m(S, i). To store the elements of Si we set the size of Ai to mi = ni²/α = Bk,n(S, i)²/α. Observe that, by (9.2), all the Ai's take up a total space of O(n), i.e., ∑_{i=0}^{m−1} mi = O(n). Furthermore, by trying a few randomly selected integers we can quickly find a value ki such that the hash function hki,mi is perfect for Si. Therefore, we store the element x ∈ Si at position Ai[hki,mi(x)] and x is the unique element stored at that location. With this scheme we can search for any value x ∈ U by computing two hash values i = hk,m(x) and j = hki,mi(x) and checking if x is stored in Ai[j].

Building the array A and computing the values of n0, ..., nm−1 takes O(n) expected time since for a given value k we can easily do this in O(n) time and the expected number of values of k that we must try before finding one that satisfies (9.2) is O(1). Similarly, building each subarray Ai takes O(ni²) expected time, resulting in an overall expected running time of O(n). Thus, for any constant 0 < α < 1, an FKS-α data structure can be constructed in O(n) expected time and this data structure can execute a search for any x ∈ U in O(1) worst-case time.
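A compact Python sketch of the two-level construction and its constant-time search follows. It assumes, as the chapter does, that u is prime and larger than every key; the function names, the default α, and the example keys are illustrative.

    import math
    import random

    def build_fks(S, u, alpha=0.5):
        """Build a static FKS-alpha structure for S, a set of keys in {0,...,u-1}, u prime."""
        S = list(set(S))
        n = len(S)
        m = max(1, n)
        # Top level: try random k until the squared bucket sizes satisfy (9.2).
        while True:
            k = random.randrange(1, u)
            buckets = [[] for _ in range(m)]
            for x in S:
                buckets[(k * x % u) % m].append(x)
            if sum(len(b) ** 2 for b in buckets) <= 3 * n / alpha:
                break
        # Bottom level: for each bucket S_i, try random ki until h_{ki,mi} is perfect.
        subtables = []
        for b in buckets:
            ni = len(b)
            mi = max(1, math.ceil(ni * ni / alpha))
            while True:
                ki = random.randrange(1, u)
                A = [None] * mi
                ok = True
                for x in b:
                    j = (ki * x % u) % mi
                    if A[j] is not None:      # collision: this ki is not perfect
                        ok = False
                        break
                    A[j] = x
                if ok:
                    break
            subtables.append((ki, mi, A))
        return (k, m, subtables)

    def fks_search(T, x, u):
        """Return True iff x is stored in T; two hash evaluations, O(1) worst case."""
        k, m, subtables = T
        ki, mi, A = subtables[(k * x % u) % m]
        return A[(ki * x % u) % mi] == x

    # Example usage (2**31 - 1 is prime):
    # T = build_fks({3, 14, 159, 2653}, u=2**31 - 1)
    # fks_search(T, 14, 2**31 - 1)   -> True
    # fks_search(T, 15, 2**31 - 1)   -> False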

Dynamic Perfect Hashing

The FKS-α data structure is nice in that it allows for searches in O(1) time, in the worst case. Unfortunately, it is only static; it does not support insertions or deletions of elements. In this section we describe a result of Dietzfelbinger et al. [23] that shows how the FKS-α data structure can be made dynamic with some judicious use of partial rebuilding (Chapter 10). The main idea behind the scheme is simple: be lazy at both the upper and lower levels of the FKS-α data structure. That is, rebuild parts of the data structure only when things go wrong. At the top level, we relax the condition that the size m of the upper array A is exactly n and allow A to have size anywhere between n and 2n. Similarly, at the lower level we allow the array Ai to have a size mi anywhere between ni²/α and 2ni²/α.

Periodically, we will perform a global rebuilding operation in which we remove all n elements from the hash table. Some elements which have previously been marked as deleted will be discarded, thereby reducing the value of n. We put the remaining elements in a list, and recompute a whole new FKS-(α/2) data structure for the elements in the list. This data structure is identical to the standard FKS-(α/2) data structure except that, at the top level we use an array of size m = 2n.

Searching in this data structure is exactly the same as for the static data structure. To search for an element x we compute i = hk,m(x) and j = hki,mi(x) and look for x at location Ai[j]. Thus, searches take O(1) worst-case time.

Deleting in this data structure is done in the laziest manner possible. To delete an element we only search for it and then mark it as deleted. We will use the convention that this type of deletion does not change the value of n since it does not change the number of elements actually stored in the data structure. While doing this, we also keep track of the number of elements that are marked as deleted. When this number exceeds n/2 we perform a global rebuilding operation. The global rebuilding operation takes O(n) expected time, but only occurs during one out of every n/2 deletions. Therefore, the amortized cost of this operation is O(1) per deletion.
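The bookkeeping for lazy deletion is simple. The Python sketch below is a toy stand-in (a plain set replaces the real two-level structure, and the names are illustrative); it only demonstrates the tombstone counting and the rebuild trigger, where the real structure would rebuild a fresh FKS-(α/2) structure.

    # Lazy deletion: elements are only marked; they are physically removed
    # during a global rebuild, triggered once tombstones exceed n/2.
    class LazyDeletionTable:
        def __init__(self):
            self.stored = set()     # elements physically present (n = len(self.stored))
            self.deleted = set()    # elements marked as deleted (tombstones)

        def insert(self, x):
            self.stored.add(x)
            self.deleted.discard(x)

        def search(self, x):
            return x in self.stored and x not in self.deleted

        def delete(self, x):
            if self.search(x):
                self.deleted.add(x)                      # mark only; n is unchanged
                if len(self.deleted) > len(self.stored) / 2:
                    self._global_rebuild()               # O(n), once per n/2 deletions

        def _global_rebuild(self):
            # Discard the marked elements and rebuild from the survivors.
            self.stored -= self.deleted
            self.deleted.clear()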

The most complicated part of the data structure is the insertion algorithm and its analysis. To insert a key x we know, because of how the search algorithm works, that we must ultimately store x at location Ai[j] where i = hk,m(x) and j = hki,mi(x). However, several things can go wrong during the insertion of x:

1. The value of n increases by 1, so it may be that n now exceeds m. In this case we perform a global rebuilding operation and we are done.

2. We compute i = hk,m(x) and discover that ∑_{i=0}^{m−1} ni² > 3n/α. In this case, the hash function hk,m used at the top level is no longer any good since it is producing an overall hash table that is too large. In this case we perform a global rebuilding operation and we are done.

3. We compute i = hk,m(x) and discover that, since the value of ni just increased by one, ni²/α > mi. In this case, the array Ai is too small to guarantee that we can quickly find a perfect hash function. To handle this, we copy the elements of Ai into a list L and allocate a new array Ai with the new size mi = 2ni²/α. We then find a new value ki such that hki,mi is a perfect hash function for the elements of L and we are done.

4. The array location Ai[j] is already occupied by some other element y. In this case, we know that Ai is large enough to hold all the elements (otherwise we would already be done after Case 3), but the value ki being used in the hash function hki,mi is the wrong one since it does not give a perfect hash function for Si. Therefore, we simply try new values for ki until we find a value that yields a perfect hash function and we are done.

If none of the preceding 4 cases occurs then we can simply place x at location Ai[j] and we are done.
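The decision logic of an insertion can be summarized compactly. The Python sketch below is a simplified dispatcher over quantities the data structure is assumed to track (the function and parameter names are illustrative); it only reports which of the cases above applies.

    def insert_case(n, m, alpha, ni, mi, sum_ni_sq, slot_occupied):
        """Which repair action does inserting x trigger?  Here n and ni already
        count x, and sum_ni_sq is the sum of ni**2 over all top-level buckets."""
        if n > m:
            return "Case 1: global rebuild (n now exceeds m)"
        if sum_ni_sq > 3 * n / alpha:
            return "Case 2: global rebuild (top-level hash function no longer good)"
        if ni * ni / alpha > mi:
            return "Case 3: grow A_i to size 2*ni^2/alpha and re-pick k_i"
        if slot_occupied:
            return "Case 4: re-pick k_i until h_{k_i,m_i} is perfect for S_i"
        return "no repair needed: store x at A_i[j]"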

Handling Case 1 takes O(n) expected time since it involves a global rebuild of the entire data structure. However, Case 1 only happens during one out of every Θ(n) insertions, so the amortized cost of all occurrences of Case 1 is only O(1) per insertion.

Handling Case 2 also takes O(n) expected time. The question is: How often does Case 2 occur? To answer this question, consider the phase that occurs between two consecutive occurrences of Case 1. During this phase, the data structure holds at most m distinct elements. Call this set of elements S. With probability at least 1 − α the hash function hk,m selected at the beginning of the phase satisfies (9.2), so that Case 2 never occurs during the phase. Similarly, the probability that Case 2 occurs exactly once during the phase is at most α(1 − α). In general, the probability that Case 2 occurs exactly i times during a phase is at most α^i(1 − α). Thus, the expected cost of handling all occurrences of Case 2

during the entire phase is at most

∑_{i=0}^{∞} α^i(1 − α) · O(in) = O(n) · ∑_{i=0}^{∞} iα^i(1 − α) = O(n) .

But since a phase involves Θ(n) insertions this means that the amortized expected cost of handling Case 2 is O(1) per insertion.

Next we analyze the total cost of handling Case 3. Define a subphase as the period of time between two global rebuilding operations, triggered either by a deletion, by Case 1, or by Case 2. We will show that the total cost of handling all occurrences of Case 3 during a subphase is O(n), and since a subphase takes Θ(n) time anyway this does not contribute to the cost of a subphase by more than a constant factor. When Case 3 occurs at the array Ai it takes O(mi) time. However, each time Case 3 is handled, mi increases by a constant factor, so the total cost of handling Case 3 for Ai is dominated by the value of mi at the end of the subphase. But we maintain the invariant that ∑_{i=0}^{m−1} mi = O(n) during the entire subphase. Thus, handling all occurrences of Case 3 during a subphase only requires O(n) time.

Finally, we consider the cost of handling Case 4. For a particular array Ai, consider the subsubphase between two consecutive events in which either Case 3 causes Ai to be rebuilt or a global rebuilding operation takes place. During this subsubphase the number of distinct elements that occupy Ai is at most √(αmi). Therefore, with probability at least 1 − α any randomly chosen value of ki ∈ U is a perfect hash function for this set. Just as in the analysis of Case 2, this implies that the expected cost of handling all occurrences of Case 4 at Ai during a subsubphase is only O(mi). Since a subsubphase ends with a rebuilding of all of Ai or a global rebuilding, at a cost of Ω(mi), all the occurrences of Case 4 during a subsubphase do not contribute to the expected cost of the subsubphase by more than a constant factor.

To summarize, we have shown that the expected cost of handling all occurrences of Case 4 is only a constant factor times the cost of handling all occurrences of Case 3. The cost of handling all occurrences of Case 3 is no more than a constant factor times the expected cost of all global rebuilds. The cost of handling all the global rebuilds that occur as a result of Case 2 is no more than a constant factor times the cost of handling all global rebuilds that occur as a consequence of Case 1. And finally, the cost of all global rebuilds that occur as a result of Case 1 or of deletions is O(n) for a sequence of n update operations. Therefore, the total expected cost of n update operations is O(n).
