Sometimes we want to build a static dictionary data structure. This is when we have a set of data that doesn’t have to change dynamically. When it changes, we can just rebuild the dictionary from scratch. So we first build the data structure, and then perform lookups without changing it any more.

Example scenarios where we might want to do this:

  • Spell checker. We have a set of words that doesn’t change, and want to perform lookups to see if a word is spelled correctly.
  • Search engine. We periodically crawl a lot of web pages and build an index containing all the words, with links to the pages. Once the index is built, we don’t change it until the next crawl.

What we want is:

  1. Linear build time in expectation. So for $n$ elements of size $S$ words each, we want to be able to build the data structure in $O(nS)$ expected time. If elements are of constant size, this is $O(n)$ time.
  2. Guaranteed quick access time to all elements. We want $O(S)$ time per access. If the element size is constant, this is $O(1)$ time.
  3. Not much memory overhead. We are aiming for memory use of $nS + o(n)$. Note that $nS$ is just the space required to store the elements. So this means that for large $n$ the fraction of extra memory required for bookkeeping is small compared to the actual data.

As usual, we are using the “word RAM” computational model. What we mean by a “word” is a piece of memory that can store a pointer or a number in the range of $0$ to $n$. So a word has at least $\log n$ bits. Note that if we have $n$ distinct elements to store in the hash table, they necessarily have at least $\log n$ bits each, or else they couldn’t be distinct.

For a while it was an open problem whether such a data structure (or even just properties 1 and 2) is even possible. In 1984, Fredman, Komlós, and Szemerédi [1] described a data structure that does this.

One could try to just use a regular hash table with universal hashing. It sort of works, but it fails properties 2 and 3. We get constant access time, but only in expectation. No matter how many hash functions we try, there will almost surely be some buckets that have more than a few entries in them. In fact it can be shown that if we use a completely random hash function to store $n$ elements in $n$ buckets, then with high probability the largest bucket will have $\Theta(\log n / \log\log n)$ entries. For universal hashing it could be a lot worse. A few customers could get angry that their queries are always slow! We don’t want that. We would also have significant memory overhead to store all the buckets and pointers.
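
To get a feel for the “largest bucket” claim, here is a tiny simulation sketch of my own (not part of the construction): it throws $n$ keys into $n$ buckets uniformly at random, with Python’s random generator standing in for a truly random hash function, and reports the fullest bucket.

```python
import random
from collections import Counter

def max_bucket_load(n, seed=0):
    """Throw n keys into n buckets uniformly at random; return the fullest bucket's size."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(n) for _ in range(n))
    return max(counts.values())

for n in (10**3, 10**4, 10**5, 10**6):
    # The maximum load creeps up slowly, roughly like log n / log log n.
    print(n, max_bucket_load(n))
```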

Guaranteeing worst case access time

Start by defining a family of universal hash functions such that the probability of a collision when hashing into $m$ buckets is bounded by $c/m$ for some constant $c$. Normally we can make $c \approx 1$.

Now let’s create $n$ buckets and randomly select a hash function $H$ that maps elements to those buckets. The expected number of collisions, i.e. the number of pairs of data elements that hash to the same bucket, is bounded by:

$$\sum_{0 \le i < j < n} \Pr(H(x_i) = H(x_j)) \le \binom{n}{2} \frac{c}{n} < \frac{n^2}{2} \cdot \frac{c}{n} = \frac{cn}{2}$$

The probability that the number of collisions exceeds twice the expectation, i.e. $cn$, is at most 50% (otherwise the expected value would be larger). If that happens, we just try again and pick a different $H$. It will take on average 2 tries to get a valid $H$.

Thus we have at most $cn$ collisions.

Let $b_i$ be the number of elements in bucket $i$. Then the number of collisions can also be expressed as:

$$\sum_{i=0}^{n-1} \binom{b_i}{2} \le cn$$
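
To make this first-level step concrete, here is a minimal Python sketch. It assumes the keys are nonnegative integers and uses the classic multiply-add family $h(x) = ((ax + b) \bmod p) \bmod m$ with a large prime $p$, which has collision probability $O(1/m)$; the names `random_hash`, `pick_first_level` and the constant `C` are my own, not from the paper.

```python
import random

P = (1 << 61) - 1   # a Mersenne prime; assumes keys are nonnegative integers below P
C = 2               # stand-in for the constant c of the universal family

def random_hash(m, rng):
    """Draw h(x) = ((a*x + b) mod P) mod m from a universal family."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def pick_first_level(keys, rng):
    """Redraw H until the number of colliding pairs is at most C*n."""
    n = len(keys)
    while True:
        H = random_hash(n, rng)
        buckets = [[] for _ in range(n)]
        for x in keys:
            buckets[H(x)].append(x)
        collisions = sum(len(b) * (len(b) - 1) // 2 for b in buckets)
        if collisions <= C * n:   # succeeds with probability >= 1/2 per attempt
            return H, buckets
```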

Now comes the main trick: in each bucket let’s make another, second-level mini hash table!

[Figure: simple static hashing]

We size a given secondary table based on its number of elements. Make it $\max\left(2c\binom{b_i}{2}, 1\right) = \max(cb_i(b_i-1), 1)$ (note that this is quadratic in the number of elements). Select a random hash function $h_i$. We want no collisions in the second level hash table.

The expected number of collisions in a given second level hash table is bounded by:

$$\binom{b_i}{2} \frac{c}{\max\left(2c\binom{b_i}{2},1\right)} \le \frac{1}{2}$$

So with probability at least 50% we don’t get any collision at all! In case we get a collision, just try a different $h_i$. After an average of 2 attempts we will get zero collisions in that table.

So we don’t need any secondary collision resolution method. Just store pointers to elements directly in mini-buckets.
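
Continuing the same sketch (reusing `random_hash` and `C` from above), the second level just redraws each bucket’s hash function until its quadratic-size mini table is collision-free:

```python
def build_second_level(buckets, rng):
    """For each bucket, build a mini table of size ~ C*b*(b-1) with no collisions."""
    tables = []
    for bucket in buckets:
        b = len(bucket)
        size = max(C * b * (b - 1), 1)     # quadratic in the bucket's size
        while True:                        # expected ~2 attempts per bucket
            h = random_hash(size, rng)
            slots = [None] * size
            ok = True
            for x in bucket:
                j = h(x)
                if slots[j] is not None:   # collision: redraw h and start over
                    ok = False
                    break
                slots[j] = x
            if ok:
                tables.append((h, slots))
                break
    return tables
```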

A lookup is now constant time (see the sketch after these steps):

  1. Calculate $i = H(x)$
  2. Calculate $j = h_i(x)$
  3. The element, if it exists, will be in bucket $i$, entry $j$.
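
Putting the pieces of the sketch together (the helpers `pick_first_level` and `build_second_level` are the hypothetical ones from above), a lookup really is just two hash evaluations and one comparison:

```python
def build(keys, seed=0):
    """Build the two-level structure: a first-level hash plus one mini table per bucket."""
    rng = random.Random(seed)
    H, buckets = pick_first_level(keys, rng)
    return H, build_second_level(buckets, rng)

def lookup(structure, x):
    H, tables = structure
    h, slots = tables[H(x)]        # step 1: find the bucket
    return slots[h(x)] == x        # steps 2 and 3: a single probe into its mini table

fks = build([3, 1, 4, 15, 92, 65, 35, 89, 79, 32])
print(lookup(fks, 92), lookup(fks, 7))   # True False
```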

Total space used for second level hash tables is:

$$\sum_{i=0}^{n-1} \max\left(2c\binom{b_i}{2}, 1\right) \le 2c\sum_{i=0}^{n-1}\binom{b_i}{2} + n \le 2c \cdot cn + n = (2c^2+1)n = O(n)$$

Memory overhead is therefore $\Theta(n)$. It is not negligible, especially if the actual data elements are small.

Compression

Now we are aiming to reduce the memory overhead to $o(n)$, i.e. something that is relatively small for large $n$.

First, let’s increase the number of main buckets by a factor of $z$ (to be determined later), so we have $zn$ buckets. Seems like this will only make it worse, but bear with me. We will pack them really tightly!

The expected number of collisions is now smaller:

$$\sum_{i=0}^{zn-1} \binom{b_i}{2} \le \binom{n}{2} \frac{c}{zn} < \frac{n^2}{2} \cdot \frac{c}{zn} = \frac{cn}{2z}$$

Again, we make sure that we don’t get more than twice that, i.e. $cn/z$. If we do, just try a different hash function $H$.

If $b_i = 1$, we are not going to have a second level hash table. Instead, we will store a pointer to the element directly in the main bucket.

For $b_i \ge 2$, we make a second level hash table of size $2c\binom{b_i}{2} = cb_i(b_i-1)$. This again means that we can easily get zero collisions in each such table.
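
In the notation of the earlier sketch (with `C` again standing in for $c$), the per-bucket second-level space rule of the compressed variant is just:

```python
def secondary_table_size(b):
    """Second-level words used by a bucket holding b elements in the compressed variant."""
    if b <= 1:
        return 0                   # no mini table: a lone element is pointed to from the main bucket
    return C * b * (b - 1)         # b >= 2: quadratic-size mini table, as before
```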

The total size of all second level hash tables (and thus also the number of such tables) is bounded by:

$$\sum_{i=0}^{zn-1} 2c\binom{b_i}{2} \le 2c \cdot \frac{cn}{z} = 2c^2\frac{n}{z} = O\left(\frac{n}{z}\right)$$

[Figure: compressed table 1]

Now let’s put all the data elements into a dedicated array, with the elements from single-element buckets first, in the same order as their buckets appear in the main table. Also put all the secondary-table hash functions and pointers into a separate array, again in main-bucket order. In the main bucket table we then only need to store what kind of entry it is (none, single element, or secondary table) and an index into the appropriate array.

[Figure: compressed table 2]

Finally, we group the main buckets into groups of size $g$. For each group we store just the index of the first single element in the group (if any) and the index of the first secondary table in the group (if any). Each individual bucket then stores only an offset relative to those. The above bucket table now looks like this:

[Figure: compressed table 3]

There are $zn/g$ groups. In each group, we need 2 words for the first element index and the first table index, for a total of $2zn/g$ words.

We also have $zn$ individual buckets. Each bucket stores its type (there are 3 types, so 2 bits) and an offset. Those offsets are small, between $0$ and $g$, so each bucket needs only $2 + \log g$ bits. We can pack the buckets into $O(zn \log g / \log n)$ words, because we know we can fit $\log n$ bits in a word.
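
To see that the packed buckets really fit in $O(zn\log g/\log n)$ words, here is a sketch of the packing (the field layout and names are my own illustration, not prescribed by the paper): each descriptor is a type in $\{0,1,2\}$ plus an offset below $g$, so it needs only $2 + \lceil\log_2 g\rceil$ bits, and several of them are concatenated into each machine word.

```python
import math

def pack_buckets(descriptors, g, word_bits=64):
    """Pack (type, offset) pairs, with type in {0,1,2} and 0 <= offset < g,
    into a list of word_bits-wide integers."""
    bits = 2 + max(1, math.ceil(math.log2(g)))   # bits per bucket descriptor
    per_word = word_bits // bits                 # descriptors packed into one word
    words = []
    for start in range(0, len(descriptors), per_word):
        w = 0
        for k, (typ, off) in enumerate(descriptors[start:start + per_word]):
            w |= ((typ << (bits - 2)) | off) << (k * bits)
        words.append(w)
    # total: ~ len(descriptors) * bits / word_bits words, i.e. O(zn log g / log n)
    return words

print(pack_buckets([(1, 0), (2, 3), (0, 0), (1, 5)], g=8))
```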

The total space usage for all the elements, all the secondary tables, and all the buckets is therefore:

$$nS + O\left(\frac{n}{z} + \frac{zn}{g} + \frac{zn\log g}{\log n}\right)$$

Now we’ll pick $g$ and $z$ to make this as small as possible. Start with the group size $g$. We want the last two terms to be approximately equal, which means $g\log g \approx \log n$, or $g = \Theta(\log n / \log\log n)$.
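
To double-check that choice of $g$, plug it into $g\log g$:

$$g = \frac{\log n}{\log\log n} \;\Rightarrow\; g\log g = \frac{\log n}{\log\log n}\bigl(\log\log n - \log\log\log n\bigr) = \Theta(\log n),$$

so the terms $zn/g$ and $zn\log g/\log n$ are indeed within a constant factor of each other.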

This makes the memory usage:

$$nS + O\left(\frac{n}{z} + \frac{zn\log\log n}{\log n}\right)$$

Finally, optimize the number of buckets $zn$. Again we want the two terms to be approximately equal, which gives $z = \Theta(\sqrt{\log n / \log\log n})$.
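
Spelling out that balancing step:

$$\frac{n}{z} = \frac{zn\log\log n}{\log n} \;\iff\; z^2 = \frac{\log n}{\log\log n}, \qquad\text{and then}\qquad \frac{n}{z} = n\sqrt{\frac{\log\log n}{\log n}}.$$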

The final memory usage then is:

$$nS + O\left(n \sqrt{\frac{\log\log n}{\log n}}\right) = nS + o(n)$$

Asymptotically smaller than linear overhead! Just as we wanted.

References

  1. Fredman, Michael L., János Komlós, and Endre Szemerédi. “Storing a sparse table with O(1) worst case access time.” Journal of the ACM (JACM) 31.3 (1984): 538-544.