In part 1 we discussed universal hashing and introduced a classic family of universal hash functions, the Carter-Wegman hash function for integers and its generalizations for bigger data structures.

While that works fine, there is a simpler way that is equally good. We don’t need to deal with prime numbers and modulo, we can make do with just multiplications, additions and bit shifts, and get an equally good universal hash function family!

Hashing integers

We want a function that hashes a w-bit integer into m bits (mwm \le w).

Philipp Wölfel1 defined the following function in his Ph.D. thesis:

Pick two random w-bit integers a and b, with odd a. Then use:

h1(x)=(ax+b)mod2w div 2wm\displaystyle h_1(x) = (ax + b) \bmod 2^w \text{ div } 2^{w-m}

In other words: compute ax+bax+b, ignore overflow, and take the top m bits. Can’t get much simpler than this!

Incredibly this is a universal hash function family with optimal collision rate. For any pair of different x and y, the probability of collision is Pr(h1(x)=h1(y))2m\Pr(h_1(x) = h_1(y)) \le 2^{-m} .

In C:

unsigned hash(unsigned x, int m) {
  return (a * x + b) >> (w - m);

Hashing bigger data

Now suppose we have a bigger data structure consisting of n w-bit words, (x1,,xn)(x_1, \ldots, x_n).

Pick random 2w-bit numbers a1,,an,ba_1, \ldots, a_n, b and use:

h2(x1,,xn)=(a1x1++anxn+b)mod22w div 22wm\displaystyle h_2(x_1,\ldots,x_n) = (a_1x_1 + \ldots + a_nx_n + b) \bmod 2^{2w} \text{ div } 2^{2w-m}

There is an extra optimization possible - we can replace some multiplications by additions by taking input numbers in pairs. Suppose n is even.

h3(x1,,xn)=((x1+a2)(x2+a1)++(xn1+an)(xn+an1)+b)mod22w div 22wm\displaystyle h_3(x_1,\ldots,x_n) = ((x_1 + a_2)(x_2 + a_1) + \ldots + (x_{n-1} + a_n)(x_n + a_{n-1}) + b) \bmod 2^{2w} \text{ div } 2^{2w-m}

Both of these hash functions are universal and guarantee collision probability of at most 2m2^{-m} for any given two inputs.

Proof for h1

Wölfel1 has a rather complicated proof in his thesis of the collision probability, but we can see this in a simpler way.

First we need a little lemma:

Lemma: If r is an odd number, then f(x)=rxmod2wf(x) = rx \bmod 2^w is a 1-1 correspondence between w-bit numbers. It also maps odd numbers to odd numbers.

Proof: If rxry(mod2w)rx \equiv ry \pmod {2^w}, then 2wr(xy)2^w | r(x-y) and, since r is odd, 2w(xy)2^w | (x-y), and so xy(mod2w)x \equiv y \pmod{2^w}. Thus f is a 1-1 function. Also it clearly maps odd numbers to odd numbers. QED.

Take two different numbers x and y. We want to bound hash collision probability for these two inputs. Let k be the smallest bit position in which x and y differ. Therefore: xy=r2kx-y = r2^k, where r is an odd number.

Let g(x)=(ax+b)mod2wg(x) = (ax+b) \bmod 2^w. The hash function h1(x)h_1(x) is the top m bits of g(x)g(x). Also define D as follows:

Dg(x)g(y)(ax+b)(ay+b)a(xy)ar2k(mod2w)\displaystyle D \equiv g(x) - g(y) \equiv (ax + b) - (ay+b) \equiv a(x-y) \equiv ar2^k \pmod{2^w}

By the lemma, since a is random odd and r is odd, D is a random odd integer shifted left by k bits.

Also since b does not appear in this formula for D, the random variable g(y)=(ay+b)mod2wg(y) = (ay+b)\bmod 2^w is independent of D. This is why we needed this +b term in the hash function.

Let’s split D into the top m bits and the rest: D=H2wm+LD = H 2^{w-m} + L.

g(x)g(y)+Dg(y)+H2wm+L\displaystyle g(x) \equiv g(y) + D \equiv g(y) + H 2^{w-m} + L

If kwmk \ge w-m, then H0H \neq 0 and L = 0. Hashes differ by H and are therefore always different with 100% certainty in this case.

If k<wmk \lt w-m then H is a uniformly random m-bit random variable independent of g(y)+Lg(y) + L. Exactly one value of H makes the hashes equal, so collision probability is exactly 2m2^{-m}.

Proof for h2

Suppose we have two different inputs: (x1,,xn)(x_1,\ldots,x_n), (y1,,yn)(y_1,\ldots,y_n).

Since they are different at some position, we may as well assume without loss of generality that x1y1x_1\ne y_1. Again let k bit the first bit where they differ: x1y1=r2kx_1-y_1 = r2^k.

Define g(x)=(a1x1++anxn+b)mod22wg(x) = (a_1x_1 + \ldots + a_nx_n + b)\bmod 2^{2w}. The hash function h2(x)h_2(x) is the top m bits of g(x).

g(x)g(y)a1(x1y1)+(a2(x2y2)++an(xnyn))a1r2k+ED+E(mod22w)\displaystyle g(x) - g(y) \equiv a_1(x_1-y_1) + (a_2(x_2-y_2) + \ldots + a_n(x_n-y_n)) \equiv a_1r2^k + E \equiv D + E \pmod{2^{2w}}

By the lemma, D is a random integer shifted left by k bits. It is also independent of E (because E doesn’t depend on a1a_1) and of g(y) (because g(y) has the independent +b term).

Split D into the top m bits and the rest: D=H22wm+LD = H2^{2w-m} + L. We know that k<w2wmk \lt w \le {2w-m}, and therefore H is a uniformly random number, independent of E, g(y) and L.

g(x)g(y)+D+Eg(y)+H2wm+L+E\displaystyle g(x) \equiv g(y) + D + E \equiv g(y) + H 2^{w-m} + L + E

There is exactly one value of H that will make the top m bits of g(x) and g(y) match, therefore collision probability is always exactly 2m2^{-m}.

Proof for h3

The proof is very similar as in the previous case.

g(x)g(y)(x1+a2)(x2+a1)(y1+a2)(y2+a1)+()a1(x1y1)+a2(x2y2)+(x1x2y1y2)+()a1r2k+E\displaystyle g(x) - g(y) \equiv (x_1+a_2)(x_2+a_1) - (y_1+a_2)(y_2+a_1) + (\ldots) \equiv a_1(x_1-y_1) + a_2(x_2-y_2) + (x_1x_2 - y_1y_2) + (\ldots) \equiv a_1r2^k + E

And the same proof works. We just incorporated the extra (non-random) term (x1x2y1y2)(x_1x_2 - y_1y_2) into E, which doesn’t change anything that follows.