Recently, I attended an interview and faced a good question regarding hash collisions.
Question : Given a list of strings, print out the anagrams together.
Example :
i/p : {act, god, animal, dog, cat}
o/p : act, cat, dog, god
I want to create hashmap and put the word as key and value as list of anagrams
To avoid collision, I want to generate unique hash code for anagrams instead of sorting and using the sorted word as key.
I am looking for hash algorithm which take care of collision other than using chaining. I want algorithm to generate same hash code for both act and cat... so that it will add next word to the value list
Can anyone suggest a good algorithm ?
Hashing with the sorted string is pretty nice, i'd have done that probably, but it could indeed be slow and cumbersome. Here's another thought, not sure if it works - pick a set of prime numbers, as small as you like, the same size as your character set, and build a fast mapping function from your chars to that. Then for a given word, map each character into the matching prime, and multiply. finally, hash using the result.
This is very similar to what Heuster suggested, only with less collisions (actually, I believe there will be no false collisions, given the uniqueness of the prime decomposition of any number).
simple e.g. -
int primes[] = {2, 3, 5, 7, ...} // can be auto generated with a simple code
inline int prime_map(char c) {
// check c is in legal char set bounds
return primes[c - first_char];
}
...
char* word = get_next_word();
char* ptr = word;
int key = 1;
while (*ptr != NULL) {
key *= prime_map(*ptr);
ptr++;
}
hash[key].add_to_list(word);
[edit]
A few words about the uniqueness - any integer number has a single breakdown to multiplications of primes, so given an integer key in the hash you can actually reconstruct all possible strings that would hash to it, and only these words. Just break into primes, p1^n1 * p2^n2 * ... and convert each prime to the matching char. the char for p1 would appear n1 times, and so on. You can't get any new prime you didn't explicitly used, being prime means you can't get it by any multiplication of other primes.
This brings another possible improvement - if you can construct the string, you just need to mark the permutations you saw when populating the hash. since the permutations can be ordered by lexicographic order, you can replace each one with a number. This would save the space of storing the actual strings in the hash, but would require more computations so it's not necessarily a good design choice. Still, it makes a nice complication of the original question for interviews :)