Let's say I have:
n = 14
n is the result of the following sums of integers:
[5, 2, 7] -> 5 + 2 + 7 = 14 = n
[3, 4, 5, 2] -> 3 + 4 + 5 + 2 = 14 = n
[1, 13] -> 1 + 13 = 14 = n
[13, 1] -> 13 + 1 = 14 = n
[4, 3, 5, 2] -> 4 + 3 + 5 + 2 = 14 = n
...
I would need a hash function h so that:
h([5, 2, 7]) = h([3, 4, 5, 2]) = h([1, 13]) = h([13, 1]) = h([4, 3, 5, 2]) = h(...)
I.e. the order of the integer terms doesn't matter, and as long as their integer sum is the same, their hash should also be the same.
I need to do this without computing the sum n, because the terms as well as n can be very large and easily overflow (they don't fit in the bits of an int), which is why I am asking this question.
Do you know of, or have any insight into, how I can implement such a hash function? Given a list/sequence of integers, this hash function must return the same hash whenever the sum of the integers is the same, but without computing the sum.
Thank you for your attention.
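To make the overflow concern concrete, here is a minimal sketch in JavaScript, where numbers are doubles and integers are only exact up to Number.MAX_SAFE_INTEGER. The helper name naiveSum is hypothetical and only illustrates a plain running sum; it is not part of any proposed solution.

```javascript
// Hypothetical helper: a plain running sum with no overflow protection.
function naiveSum(seq) {
  let total = 0;
  for (const value of seq) {
    total += value;
  }
  return total;
}

const big = Number.MAX_SAFE_INTEGER; // 2^53 - 1

// The true sums differ by 1, but both round to 2^53 as doubles.
const a = naiveSum([big, 1]);
const b = naiveSum([big, 2]);

console.log(a === b);                 // true
console.log(Number.isSafeInteger(a)); // false
```

Two sequences with different sums become indistinguishable once the total leaves the safe range, which is the situation a hash built on the plain sum would run into.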
EDIT: I elaborated on @derpirscher's answer and modified his function a bit further, as I had collisions on multiples of BIG_PRIME (this example is in JavaScript):
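function hash(seq) {
  const BIG_PRIME = 999999999989;
  // Two values at or below this threshold can be added without
  // leaving the safe integer range.
  const MAX_SAFE_INTEGER_DIV_2_FLOOR = Math.floor(Number.MAX_SAFE_INTEGER / 2);
  let h = 0;
  // `let` keeps the loop counter from leaking into the global scope.
  for (let i = 0; i < seq.length; i++) {
    let value = seq[i];
    // Reduce the running hash only once it exceeds the threshold.
    if (h > MAX_SAFE_INTEGER_DIV_2_FLOOR) {
      h = h % BIG_PRIME;
    }
    // Reduce the current term only if it exceeds the threshold.
    if (value > MAX_SAFE_INTEGER_DIV_2_FLOOR) {
      value = value % BIG_PRIME;
    }
    h += value;
  }
  return h;
}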
My question now would be: what do you think about this function? Are there some edge cases I didn't take into account?
Thank you.
EDIT 2:
Using the above function, hash([1, 2]) and hash([4504 * BIG_PRIME + 1, 4504 * BIG_PRIME + 2]) will collide, as mentioned by @derpirscher.
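That collision can be checked directly. The sketch below repeats the function from the first edit (with the loop counter declared via let) and evaluates both calls; the variable names small and large are only for illustration.

```javascript
function hash(seq) {
  const BIG_PRIME = 999999999989;
  const MAX_SAFE_INTEGER_DIV_2_FLOOR = Math.floor(Number.MAX_SAFE_INTEGER / 2);
  let h = 0;
  for (let i = 0; i < seq.length; i++) {
    let value = seq[i];
    if (h > MAX_SAFE_INTEGER_DIV_2_FLOOR) {
      h = h % BIG_PRIME;
    }
    if (value > MAX_SAFE_INTEGER_DIV_2_FLOOR) {
      value = value % BIG_PRIME;
    }
    h += value;
  }
  return h;
}

const BIG_PRIME = 999999999989;

// Both large terms exceed the threshold, so each is reduced to its
// remainder (1 and 2) before being added.
const small = hash([1, 2]);
const large = hash([4504 * BIG_PRIME + 1, 4504 * BIG_PRIME + 2]);

console.log(small, large, small === large); // 3 3 true
```

The two sequences have different sums, yet both hash to 3 because the large terms differ from 1 and 2 by an exact multiple of BIG_PRIME.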
Here is another modified version of the above function, which applies the modulo % BIG_PRIME to only one of the two terms, and only when both h and value are greater than MAX_SAFE_INTEGER_DIV_2_FLOOR:
function hash(seq) {
  const BIG_PRIME = 999999999989;
  const MAX_SAFE_INTEGER_DIV_2_FLOOR = Math.floor(Number.MAX_SAFE_INTEGER / 2);
  let h = 0;
  for (let i = 0; i < seq.length; i++) {
    let value = seq[i];
    // Reduce only when both the running hash and the term exceed the threshold.
    if (
      h > MAX_SAFE_INTEGER_DIV_2_FLOOR &&
      value > MAX_SAFE_INTEGER_DIV_2_FLOOR
    ) {
      if (h > MAX_SAFE_INTEGER_DIV_2_FLOOR) {
        h = h % BIG_PRIME;
      } else if (value > MAX_SAFE_INTEGER_DIV_2_FLOOR) {
        // Unreachable as written: the outer condition already
        // guarantees h > MAX_SAFE_INTEGER_DIV_2_FLOOR.
        value = value % BIG_PRIME;
      }
    }
    h += value;
  }
  return h;
}
I think this version lowers the number of collisions a bit further.
What do you think? Thank you.
EDIT 3:
Even though I tried to elaborate on @derpirscher's answer, his implementation of hash is the correct one and the one to use.
Use his version if you need such a hash function.
You could calculate the sum modulo some big prime. If you want to stay within the range of int, you need to know what the maximum integer is in the language you are using. Then select a BIG_PRIME that's just below maxint / 2.
Assuming an int to be 4 bytes, maxint = 2147483647, thus the biggest prime < maxint/2 would be 1073741789:
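int hash(int[] seq) {
  const int BIG_PRIME = 1073741789; // biggest prime below maxint / 2
  int h = 0;
  for (int i = 0; i < seq.Length; i++) {
    // Reduce the term before adding it, then reduce the sum again,
    // so h always stays below BIG_PRIME.
    h = (h + seq[i] % BIG_PRIME) % BIG_PRIME;
  }
  return h;
}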
Since both summands are always below maxint/2 at every step, you won't get any overflows.
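For comparison with the JavaScript attempts in the question, the same idea might look roughly like this. This is only a sketch under assumptions: inputs are non-negative safe integers, and BIG_PRIME is the value used in the question's edit, which is far below Number.MAX_SAFE_INTEGER / 2.

```javascript
// Sketch of the running-sum-modulo approach for JavaScript numbers.
// Assumes every element is a non-negative safe integer.
function hash(seq) {
  const BIG_PRIME = 999999999989; // value taken from the question's edit
  let h = 0;
  for (const value of seq) {
    // Both summands are below BIG_PRIME, so the sum stays exact.
    h = (h + (value % BIG_PRIME)) % BIG_PRIME;
  }
  return h;
}

const examples = [
  [5, 2, 7],
  [3, 4, 5, 2],
  [1, 13],
  [13, 1],
  [4, 3, 5, 2],
];

// Every sequence from the question sums to 14, so each hashes to 14.
const hashes = examples.map(hash);
console.log(hashes); // [ 14, 14, 14, 14, 14 ]
```

Because the reduction is applied unconditionally at every step, the result depends only on the sum modulo BIG_PRIME and not on the order of the terms.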
Edit
From a mathematical point of view, the following property which may be important for your use case holds:
(a + b + c + ...) % N == (a % N + b % N + c % N + ...) % N
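A small check of this identity, sketched in JavaScript with BigInt so that the left-hand side can be computed exactly even though the plain sum exceeds the 4-byte int range. The sample terms are arbitrary values chosen for illustration.

```javascript
const N = 1073741789n;

// Arbitrary sample terms; their plain sum does not fit in a 4-byte int.
const terms = [2147483647n, 2147483647n, 2000000000n];

// Left-hand side: reduce the exact sum once at the end.
const lhs = terms.reduce((acc, t) => acc + t, 0n) % N;

// Right-hand side: reduce every term first, then reduce the sum.
const rhs = terms.reduce((acc, t) => acc + (t % N), 0n) % N;

console.log(lhs === rhs); // true
```

This is why the hash can reduce each term and each partial sum as it goes without ever forming the full sum.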
But yeah, of course, as with every hash function, you will have collisions. You can't have a hash function without collisions, because the size of the domain of the hash function (i.e. the number of possible input values) is generally much bigger than the size of the codomain (i.e. the number of possible output values).
For your example, the size of the domain is (in principle) infinite, as you can have any count of numbers from 1 to 2000000000 in your sequence. But your codomain is just ~2000000000 elements (i.e. the range of int).