How to pick a hash function, part 2
In part 1 we discussed universal hashing and introduced a classic family of universal hash functions, the Carter-Wegman hash function for integers and its generalizations for bigger data structures.
While that works fine, there is a simpler way that is equally good. We don’t need to deal with prime numbers and modulo, we can make do with just multiplications, additions and bit shifts, and get an equally good universal hash function family!
Hashing integers
We want a function that hashes a w-bit integer into m bits ().
Philipp Wölfel1 defined the following function in his Ph.D. thesis:
Pick two random w-bit integers a and b, with odd a. Then use:
In other words: compute , ignore overflow, and take the top m bits. Can’t get much simpler than this!
Incredibly this is a universal hash function family with optimal collision rate. For any pair of different x and y, the probability of collision is .
In C:
unsigned hash(unsigned x, int m) {
return (a * x + b) >> (w - m);
}
Hashing bigger data
Now suppose we have a bigger data structure consisting of n w-bit words, .
Pick random 2w-bit numbers and use:
There is an extra optimization possible - we can replace some multiplications by additions by taking input numbers in pairs. Suppose n is even.
Both of these hash functions are universal and guarantee collision probability of at most for any given two inputs.
Proof for h1
Wölfel1 has a rather complicated proof in his thesis of the collision probability, but we can see this in a simpler way.
First we need a little lemma:
Lemma: If r is an odd number, then is a 1-1 correspondence between w-bit numbers. It also maps odd numbers to odd numbers.
Proof: If , then and, since r is odd, , and so . Thus f is a 1-1 function. Also it clearly maps odd numbers to odd numbers. QED.
Take two different numbers x and y. We want to bound hash collision probability for these two inputs. Let k be the smallest bit position in which x and y differ. Therefore: , where r is an odd number.
Let . The hash function is the top m bits of . Also define D as follows:
By the lemma, since a is random odd and r is odd, D is a random odd integer shifted left by k bits.
Also since b does not appear in this formula for D, the random variable is independent of D. This is why we needed this +b term in the hash function.
Let’s split D into the top m bits and the rest: .
If , then and L = 0. Hashes differ by H and are therefore always different with 100% certainty in this case.
If then H is a uniformly random m-bit random variable independent of . Exactly one value of H makes the hashes equal, so collision probability is exactly .
Proof for h2
Suppose we have two different inputs: , .
Since they are different at some position, we may as well assume without loss of generality that . Again let k bit the first bit where they differ: .
Define . The hash function is the top m bits of g(x).
By the lemma, D is a random integer shifted left by k bits. It is also independent of E (because E doesn’t depend on ) and of g(y) (because g(y) has the independent +b term).
Split D into the top m bits and the rest: . We know that , and therefore H is a uniformly random number, independent of E, g(y) and L.
There is exactly one value of H that will make the top m bits of g(x) and g(y) match, therefore collision probability is always exactly .
Proof for h3
The proof is very similar as in the previous case.
And the same proof works. We just incorporated the extra (non-random) term into E, which doesn’t change anything that follows.
References
-
Wölfel, Philipp. Über die Komplexität der Multiplikation in eingeschränkten Branchingprogrammmodellen. Diss. Universität Dortmund, 2004. ↩ ↩2