How to pick a hash function, part 2
In part 1 we discussed universal hashing and introduced a classic family of universal hash functions, the Carter-Wegman hash function for integers and its generalizations for bigger data structures.
While that works fine, there is a simpler way that is equally good. We don’t need to deal with prime numbers and modular arithmetic; we can make do with just multiplications, additions and bit shifts, and get an equally good universal hash function family!
Hashing integers
We want a function that hashes a w-bit integer into m bits ($m \le w$).
Philipp Wölfel^{1} defined the following function in his Ph.D. thesis:
Pick two random w-bit integers a and b, with a odd. Then use:
$h_1(x) = (ax + b) \bmod 2^w \text{ div } 2^{w-m}$

In other words: compute $ax+b$, ignore overflow, and take the top m bits. Can’t get much simpler than this!
Incredibly, this is a universal hash function family with an optimal collision rate. For any pair of different x and y, the probability of collision is $\Pr(h_1(x) = h_1(y)) \le 2^{-m}$.
In C:
unsigned hash(unsigned x, int m) {
    /* a, b, w as above: a and b are random w-bit constants, a odd */
    return (a * x + b) >> (w - m);
}
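Here is the same function as a self-contained sketch for w = 32, with example constants standing in for the random choices (these particular values are just for illustration; in practice you would draw fresh random ones):

```c
#include <stdint.h>

/* Example constants; in practice draw fresh random 32-bit values, a odd. */
static const uint32_t A = 0x9E3779B1u;  /* must be odd */
static const uint32_t B = 0x85EBCA77u;

/* h1 for w = 32: the top m bits of (A*x + B) mod 2^32.
   Unsigned overflow wraps, which is exactly the "ignore overflow" step. */
uint32_t hash1(uint32_t x, int m) {
    return (A * x + B) >> (32 - m);
}
```

For example, `hash1(0, 8)` is just the top byte of B, here `0x85`.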
Hashing bigger data
Now suppose we have a bigger data structure consisting of n w-bit words, $(x_1, \ldots, x_n)$.
Pick random 2w-bit numbers $a_1, \ldots, a_n, b$ and use:
$h_2(x_1,\ldots,x_n) = (a_1x_1 + \ldots + a_nx_n + b) \bmod 2^{2w} \text{ div } 2^{2w-m}$

There is an extra optimization possible: we can replace some multiplications by additions by taking the input words in pairs. Suppose n is even.
$h_3(x_1,\ldots,x_n) = ((x_1 + a_2)(x_2 + a_1) + \ldots + (x_{n-1} + a_n)(x_n + a_{n-1}) + b) \bmod 2^{2w} \text{ div } 2^{2w-m}$

Both of these hash functions are universal and guarantee a collision probability of at most $2^{-m}$ for any given two inputs.
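A sketch of both functions for w = 32, doing the arithmetic in 2w = 64 bits (the arrays and names here are illustrative; the $a_i$ and b stand in for random 64-bit values):

```c
#include <stdint.h>
#include <stddef.h>

/* h2: hash n 32-bit words into m bits.
   a[0..n-1] and b should be random 64-bit values in practice. */
uint32_t hash2(const uint32_t *x, size_t n,
               const uint64_t *a, uint64_t b, int m) {
    uint64_t sum = b;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * (uint64_t)x[i];   /* wraps mod 2^64 */
    return (uint32_t)(sum >> (64 - m)); /* top m bits */
}

/* h3: same guarantee, but one multiplication per PAIR of words; n even. */
uint32_t hash3(const uint32_t *x, size_t n,
               const uint64_t *a, uint64_t b, int m) {
    uint64_t sum = b;
    for (size_t i = 0; i < n; i += 2)
        sum += (x[i] + a[i + 1]) * (x[i + 1] + a[i]); /* mod 2^64 */
    return (uint32_t)(sum >> (64 - m));
}
```

Note that h3 halves the number of multiplications compared to h2, at the cost of one extra addition per word.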
Proof for h_{1}
Wölfel^{1} gives a rather complicated proof of the collision probability in his thesis, but we can see it in a simpler way.
First we need a little lemma:
Lemma: If r is an odd number, then $f(x) = rx \bmod 2^w$ is a 1-1 correspondence between w-bit numbers. It also maps odd numbers to odd numbers.
Proof: If $rx \equiv ry \pmod {2^w}$, then $2^w \mid r(x-y)$ and, since r is odd, $2^w \mid (x-y)$, and so $x \equiv y \pmod{2^w}$. Thus f is a 1-1 function. Also it clearly maps odd numbers to odd numbers. QED.
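The lemma can also be checked exhaustively for a small word size; this sketch verifies it for w = 8:

```c
#include <stdint.h>
#include <stdbool.h>

/* Check that x -> r*x mod 2^8 is a 1-1 map on 8-bit numbers
   that sends odd numbers to odd numbers. */
bool lemma_holds(uint8_t r) {
    bool seen[256] = {false};
    for (int x = 0; x < 256; x++) {
        uint8_t y = (uint8_t)(r * x);          /* r*x mod 2^8 */
        if (seen[y]) return false;             /* a repeat: not 1-1 */
        seen[y] = true;
        if ((x & 1) && !(y & 1)) return false; /* odd mapped to even */
    }
    return true;
}
```

`lemma_holds` returns true for every odd r; for an even r it fails, e.g. both 0 and 128 map to 0.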
Take two different numbers x and y. We want to bound the hash collision probability for these two inputs. Let k be the smallest bit position in which x and y differ. Therefore: $x-y = r2^k$, where r is an odd number.
Let $g(x) = (ax+b) \bmod 2^w$. The hash function $h_1(x)$ is the top m bits of $g(x)$. Also define D as follows:
$D \equiv g(x) - g(y) \equiv (ax + b) - (ay+b) \equiv a(x-y) \equiv ar2^k \pmod{2^w}$

By the lemma, since a is a random odd number and r is odd, D is a random odd integer shifted left by k bits.
Also, since b does not appear in this formula for D, the random variable $g(y) = (ay+b)\bmod 2^w$ is independent of D. This is why we needed the +b term in the hash function.
Let’s split D into the top m bits and the rest: $D = H 2^{w-m} + L$.
$g(x) \equiv g(y) + D \equiv g(y) + H 2^{w-m} + L \pmod{2^w}$

If $k \ge w-m$, then $L = 0$ and $H \neq 0$. The hashes then differ by exactly H, so in this case they are always different.
If $k \lt w-m$, then H is a uniformly random m-bit value independent of $g(y) + L$. Exactly one value of H makes the hashes equal, so the collision probability is exactly $2^{-m}$.
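For a toy word size, both cases of the proof can be confirmed by brute force over every (a, b) pair; this sketch uses w = 8 and m = 4, so the predicted collision probability is $2^{-4} = 1/16$ when $k < 4$, and 0 when $k \ge 4$:

```c
#include <stdint.h>

/* Count h1-collisions of x and y over all odd a and all b, for w = 8, m = 4.
   There are 128 * 256 = 32768 (a, b) pairs in total. */
int count_collisions(uint8_t x, uint8_t y) {
    int collisions = 0;
    for (int a = 1; a < 256; a += 2)             /* every odd a */
        for (int b = 0; b < 256; b++) {
            int hx = (uint8_t)(a * x + b) >> 4;  /* top 4 bits */
            int hy = (uint8_t)(a * y + b) >> 4;
            if (hx == hy) collisions++;
        }
    return collisions;
}
```

Per the proof, `count_collisions(1, 0)` (where k = 0 < w-m) should come out to exactly 32768/16 = 2048, while `count_collisions(16, 0)` (k = 4 = w-m) should come out to 0.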
Proof for h_{2}
Suppose we have two different inputs: $(x_1,\ldots,x_n)$, $(y_1,\ldots,y_n)$.
Since they are different at some position, we may as well assume without loss of generality that $x_1\ne y_1$. Again let k be the smallest bit position where they differ: $x_1-y_1 = r2^k$.
Define $g(x) = (a_1x_1 + \ldots + a_nx_n + b)\bmod 2^{2w}$. The hash function $h_2(x)$ is the top m bits of g(x).
$g(x) - g(y) \equiv a_1(x_1-y_1) + (a_2(x_2-y_2) + \ldots + a_n(x_n-y_n)) \equiv a_1r2^k + E \equiv D + E \pmod{2^{2w}}$

By the lemma (multiplication by the odd r is a bijection), $a_1r$ is a uniformly random 2w-bit number, so D is a random integer shifted left by k bits. It is also independent of E (because E doesn’t depend on $a_1$) and of g(y) (because g(y) has the independent +b term).
Split D into the top m bits and the rest: $D = H2^{2w-m} + L$. We know that $k \lt w \le 2w-m$, and therefore H is a uniformly random m-bit number, independent of E, g(y) and L.
$g(x) \equiv g(y) + D + E \equiv g(y) + H 2^{2w-m} + L + E \pmod{2^{2w}}$

There is exactly one value of H that will make the top m bits of g(x) and g(y) match, therefore the collision probability is always exactly $2^{-m}$.
Proof for h_{3}
The proof is very similar to the previous case.
$g(x) - g(y) \equiv (x_1+a_2)(x_2+a_1) - (y_1+a_2)(y_2+a_1) + (\ldots) \\ \equiv a_1(x_1-y_1) + a_2(x_2-y_2) + (x_1x_2 - y_1y_2) + (\ldots) \equiv a_1r2^k + E \pmod{2^{2w}}$

And the same proof works. We just incorporated the extra (non-random) term $(x_1x_2 - y_1y_2)$ into E, which doesn’t change anything that follows.
References

Wölfel, Philipp. Über die Komplexität der Multiplikation in eingeschränkten Branchingprogrammmodellen. Diss. Universität Dortmund, 2004.