Very fast hash table lookup in C (e.g. by MPH)

I need a very fast hash table in C (or C++). Conditions are like this:

There exist N known keys which shall map to an object (with some state)
There exist more unknown keys which do not map to anything

Because all keys which map to an object are known at startup (but not at compile-time), it's okay if building the hashtable is expensive. However, it's required that lookup is (very) fast.

I thought about using cmph (for a perfect minimal perfect hash function). The hash table would be built with the N keys and at runtime I would do a query like this:

const cmph_uint32 id = cmph_search(hash, &key, sizeof(key));

if (id >= N) {
   return; // object not found
}

const MyState *state = &states[id];
if (state->key != key) {
  return; // object not found
}

// object found

By storing the actual key in the state, it should be possible to detect if we have an invalid collision. However, I'm not sure if calling cmph_search with a "unknown" key is undefined behavior (e.g. weird memory access or something).

Maybe someone has a better idea? Or maybe someone knows if calling cmph_search with a unknown key is fine?

Solution

Hard to tell just by looking at the (pretty much non-existent) documentation of CMPH. Digging into the source code seems simpler. The implementation of the internal hash function used by CMPH can be found in the hash() function, which ends up calling __jenkins_hash_vector(). This hash function was originally designed by Robert J. Jenkins Jr. in 1997 and can be found here. As far as the function is concerned, nothing weird happens with the key used by the function, so this hash function can be safely used even for invalid (non present) keys.

The cmph_search() function calls the correct *_search() function based on the algorithm you configured (CHD, BDZ, BMZ, and so on). Then hash() is called and the resulting values are used in different ways depending on the algorithm.

For simpler algorithms such as BMZ, BMZ8 and FCH I can see that the hash is simply used to index an internal array (mphf->data->g). All the accesses are performed modulo the size of this array (mphf->data->n) so this looks fine. Just from this, I would say if you are using these algorithms you are safe. For more complex algorithms like BDZ it's a bit harder to understand what is really going on and where/how the calculated hashes are actually used for.

Taking a look at the tests implemented in the library source, (for example at this one), we can see that the author uses a logic similar to yours to detect whether a key is a duplicate or unknown:

    /* ... */

    cmph_uint32 siz = cmph_size(mphf);
    hashtable = (cmph_uint8*)malloc(siz*sizeof(cmph_uint8));
    memset(hashtable, 0, (size_t)siz);
    //check all keys
    for (i = 0; i < source->nkeys; ++i)
    {
        cmph_uint32 h;
        char *buf;
        cmph_uint32 buflen = 0;
        source->read(source->data, &buf, &buflen);
        h = cmph_search(mphf, buf, buflen);
        if (!(h < siz))
        {
            fprintf(stderr, "Unknown key %*s in the input.\n", buflen, buf);
            ret = 1;
        } else if(hashtable[h])
        {
            fprintf(stderr, "Duplicated or unknown key %*s in the input\n", buflen, buf);
            ret = 1;
        } else hashtable[h] = 1;

        if (verbosity)
        {
            printf("%s -> %u\n", buf, h);
        }
        source->dispose(source->data, buf, buflen);
    }

    /* ... */

The only thing that's missing from the above code is storing and comparing the keys like you are doing in your example. At the end of the day, it looks to me like calling cmph_search() with an unknown key is fine. Understanding whether the key is unknown or not given the resulting hash is then the job of the library user.