I use a hash table as the main data structure for a project in C. I use this implementation of SipHash as the hash function. I use chained lists to handle collisions. The index for a given element to be inserted or searched in the hash table is computed as follows:
size_t genindex(const char *key, const size_t ht_len, const uint8_t *k) {
uint64_t hash = siphash((const uint8_t *)key, strlen(key), k);
size_t index = (size_t)(hash & (uint64_t) (ht_len - 1));
return index;
}
This approach creates a lot of collisions as about 60% of my table is unused. The data I store in the hash table is relatively small (about 60 elements). I could use another collision handling method, but this requires a lot more work and I will consider it if I fail to find a solution. I think I should modify the index resize method.
Are there other method to resize the hash to fit it to the hash table size ht_len
than using a modulo operation ?
Note: the table size is fixed at its initialization. It is generated as the closest prime number to the data size and its chosen between 2^i and 2^(i+1), the powers of two lower and greater than the data size.
As others have commented, using a hash table seems overly complicated for the purpose.
However, the problem with the insufficient spreading is probably due to this line:
size_t index = (size_t)(hash & (uint64_t) (ht_len - 1));
which is supposed to calculate "modulo ht_len
". However, using &
for modulo only works if ht_len
is a power of two.
Doing &
with ht_len-1
will generally spread the result over a set of indices of 2^n values where "n" is the number of ones in the binary representation of ht_len-1
. Or put in another way: all indices, i
, not fulfilling i & (ht_len-1) == i
will be excluded.
For example, consider the values 0, ..., 15. If ht_len
is 8, we AND with 8-1=7 and get:
0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7
(i.e. the results are spread over the 8 possible remainders).
If ht_len
is 7, we AND with 7-1=6 and get:
0, 0, 2, 2, 4, 4, 6, 6, 0, 0, 2, 2, 4, 4, 6, 6
(i.e. only 4 results are possible).
Doing actual modulo 7
would give:
0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1
So in short, the mentioned line should be changed to calculate modulo correctly:
size_t index = (size_t)(hash % (uint64_t) (ht_len));