Tags: python, postgresql, hash, hashlib, signed-integer

Which string hashing algorithm produces 32-bit or 64-bit signed integers?


I want to hash strings of variable length (6-60 characters long) to 32-bit signed integers in order to save disk space in PostgreSQL.

I don't want to encrypt any data; the hashing function just needs to be reproducible and callable from Python. The problem is that I can only find algorithms that produce unsigned integers (like CityHash), whose values range up to 2^32 - 1 and therefore exceed the signed 32-bit maximum of 2^31 - 1.

This is what I have thus far:

import math
from cityhash import CityHash32

string_ = "ALPDAKQKWTGDR"
hashed_string = CityHash32(string_)
print(hashed_string, len(str(hashed_string)))
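max_ = int(math.pow(2, 31) - 1)  # 2147483647, the largest signed 32-bit integer
print(hashed_string > max_)  # True when the hash does not fit a signed 32-bit column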

Solution

  • Ryan answered the question in the comments. Simply subtract 2147483648 (= 2^31) from the hash result.

    CityHash32(string_) - 2**31  # integer arithmetic; math.pow(2, 31) would turn the result into a float
    

    or

    CityHash64(string_) - 2**63  # math.pow(2, 63) is a float and would round away the low bits of a 64-bit hash
    

    Ryan also mentioned that using SHA-512 and truncating the result to the desired number of digits will lead to fewer collisions than the method above.
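For reference, here is a minimal sketch of both ideas using only the standard library. The helper names `to_signed` and `sha512_signed` are made up for illustration, and taking the first bytes of the digest in big-endian order is just one possible way to truncate; this is not necessarily what Ryan had in mind. The results fall within the ranges of PostgreSQL's `integer` (32-bit) and `bigint` (64-bit) types.

```python
import hashlib


def to_signed(unsigned_value: int, bits: int = 32) -> int:
    """Shift an unsigned `bits`-wide integer into the signed range.

    This is the subtraction described above: subtract 2**(bits - 1).
    """
    if not 0 <= unsigned_value < 2**bits:
        raise ValueError(f"value does not fit in {bits} unsigned bits")
    return unsigned_value - 2 ** (bits - 1)


def sha512_signed(text: str, bits: int = 32) -> int:
    """Hash `text` with SHA-512 and keep the first `bits // 8` bytes.

    The truncated bytes are read directly as a signed (two's complement)
    integer, so no offset subtraction is needed. Assumption: UTF-8 encoding
    and big-endian byte order; any fixed choice works if used consistently.
    """
    digest = hashlib.sha512(text.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], byteorder="big", signed=True)


h32 = sha512_signed("ALPDAKQKWTGDR", 32)  # fits a PostgreSQL integer column
h64 = sha512_signed("ALPDAKQKWTGDR", 64)  # fits a PostgreSQL bigint column
print(h32, h64)
```

`to_signed` can wrap any unsigned hash, for example `to_signed(CityHash32(string_))`. Either way, a 32-bit result leaves only about 4.3 billion distinct values, so collisions become likely once the table holds more than a few tens of thousands of distinct strings.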