Tags: python, hash, set, sympy, hash-collision

Multiple elements in set with the same hash


I was reading through the SymPy 1.13.0 release notes when an entry caught my attention (emphasis mine):

The hash function for Floats and expressions involving Float now respects the hash invariant that if a == b then hash(a) == hash(b). This ensures that it is possible to use such expressions in a guaranteed deterministic way in Python's core set and dict data structures. It is however not recommended to mix up sympy Basic objects with non-sympy number types such as core Python's float or int in the same set/dict.

I mixed up Python's numbers with SymPy's numbers in a set, and I can't explain what's going on (sympy 1.13.0).

I thought the elements of a set must all have different hashes. For example, in the following code block there are four strings: a, b, c1, c2. They are distinct objects in memory, but c1 is equal to c2, so they have the same hash. Hence, set([a, b, c1, c2]) contains only three elements, as expected:

a, b = "a", "b"
c1 = "This is a test"
c2 = "This is a test"
t = [a, b, c1, c2]
print("ids:", ", ".join([str(id(e)) for e in t]))
print("hashes:", ", ".join([str(hash(e)) for e in t]))
s = set(t)
print(s)

# ids: 11791568, 11792624, 135187170594160, 135187170598448
# hashes: -8635426522778860779, 3774980000107278733, 3487163462126586929, 3487163462126586929
# {'This is a test', 'a', 'b'}

In the following example there are three different objects in memory: n1, n2, n3. They all generate the same hash, so I expected the resulting set to contain only one element. Instead, it contains two elements, apparently sharing the same hash.

from sympy import Float
n1 = 2.0
n2 = Float(2.0)
n3 = Float(2.0, precision=80)
t = [n1, n2, n3]
print("ids:", ", ".join([str(id(e)) for e in t]))
print("hashes:", ", ".join([str(hash(e)) for e in t]))
s = set(t)
print(s)

# ids: 135913432385776, 135912654307472, 135912654307632
# hashes: 2, 2, 2
# {2.0, 2.0000000000000000000000}

What is going on?


Solution

  • To put it another way: it is possible for elements with the same hash to be in the same set, or in the same dictionary as keys - after comparing hashes, Python also performs an equality check (==, using the __eq__ method) to ensure that the object found is indeed the one being searched for.

    When two (or more) elements with the same hash end up in the same set, that is called a "hash collision" - internally, dicts and sets simply compare the colliding elements linearly with __eq__ when searching.
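
    For instance, here is a minimal sketch of that mechanism (the class name LoggedEq is made up for illustration): every instance hashes to the same value, so the set has to fall back on __eq__ to decide whether two instances are actually the same element:

    class LoggedEq:
        def __init__(self, value):
            self.value = value

        def __hash__(self):
            return 0  # deliberately force a hash collision for every instance

        def __eq__(self, other):
            # called by the set/dict internals to resolve collisions
            print(f"__eq__ called: {self.value} vs {other.value}")
            return self.value == other.value

    s = {LoggedEq(1), LoggedEq(2)}
    print(len(s))            # 2 - identical hashes, but not equal, so both stay
    print(LoggedEq(1) in s)  # True - located by hash, confirmed via __eq__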

    Here is a quick timing example with a custom class deliberately created to cause hash collisions:

    
    In [12]: class Clash(int):
        ...:     def __hash__(self):
        ...:         return 0
        ...:
    
    In [13]: class NoClash(int):
        ...:     pass
        ...:
    
    
    # For inputs 14-15: I first tried filling a set with "Clash" elements using a larger range, but I worried it would take too long, so I aborted it - with 10_000 elements it runs in under one second:
    
    In [16]: %time clash_set = {Clash(i) for i in range(10_001)}
    CPU times: user 826 ms, sys: 93 μs, total: 826 ms
    Wall time: 826 ms
    
    In [17]: %time noclash_set = {NoClash(i) for i in range(10_001)}
    CPU times: user 3.45 ms, sys: 0 ns, total: 3.45 ms
    Wall time: 3.46 ms
    
    
    In [18]: %time Clash(10_000) in clash_set
    CPU times: user 411 μs, sys: 7 μs, total: 418 μs
    Wall time: 423 μs
    Out[18]: True
    
    In [19]: %time 10_000 in noclash_set
    CPU times: user 12 μs, sys: 1 μs, total: 13 μs
    Wall time: 18.4 μs
    Out[19]: True
    
    
    In [20]: %time NoClash(10_000) in noclash_set
    CPU times: user 3 μs, sys: 0 ns, total: 3 μs
    Wall time: 4.29 μs
    Out[20]: True
    
    
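    Note the scaling: building clash_set takes hundreds of milliseconds because every insertion has to walk the ever-growing chain of same-hash entries, calling __eq__ on each, so construction is roughly quadratic. A single membership test on clash_set is likewise about two orders of magnitude slower than on noclash_set, where the hash leads (almost) directly to the right slot.
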

    One final note: since sets and dicts don't have 2**64 slots (and usually not even 2**32 - the table size is kept roughly proportional to the amount of data stored), the actual slot index of an item is derived from its hash; in CPython, the hash is masked with the table size, which is kept a power of two. Say an ordinary set starts out with just a handful of slots (8 in current CPython): while populating it, there will be many more collisions than the hashes alone would cause, since different hashes can map to the same slot. Past a certain fill threshold, the internal heuristics allocate a larger table and redistribute the entries across the new slots, keeping lookups efficient as the data grows.
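
    A tiny sketch of that slot computation (table_size is a stand-in here; a real table grows dynamically and the probing strategy is more involved):

    table_size = 8          # CPython sets start small; sizes stay powers of two
    mask = table_size - 1   # masking is equivalent to hash(x) % table_size here

    for x in ("a", "b", "This is a test", 2.0):
        print(f"{x!r}: hash={hash(x)}, slot={hash(x) & mask}")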