python pandas hash operating-system hashlib

Pandas DataFrame Hash Values Differ Between Unix and Windows


I've noticed that hash values created from Pandas DataFrames change depending on whether the snippet below is executed on Unix or Windows.

import pandas as pd
import numpy as np
import hashlib

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                          columns=['a', 'b', 'c'])

hashvalue_new = hashlib.md5(df.values.flatten().data).hexdigest()
print(hashvalue_new)

The above code prints d0ecb84da86002807de1635ede730f0a on Windows machines and 586962852295d584ec08e7214393f8b2 on Unix machines. Can someone more knowledgeable (or smarter) than me explain to me why this is happening and suggest a way to create a consistent hash value across platforms? I'm running Python 3.8.5 and pandas 1.2.5.


Solution

  • EDIT January 16th

    Why is this happening?

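    After some further digging, I'm now fairly certain I understand the issue. The hashes differ because the raw bytes being hashed differ: when no dtype is given, NumPy (in versions before 2.0, as used with pandas 1.2.5) stores Python integers using the platform's C long type. On 64-bit Windows a C long is 32 bits wide, while on 64-bit Linux/Unix it is 64 bits wide, so the same nine values occupy 36 bytes on one platform and 72 bytes on the other. This has been discussed elsewhere on Stack Overflow in much more detail.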
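The effect can be reproduced on a single machine by spelling out both integer widths explicitly. This is a minimal sketch, not part of the original answer; the names arr32, arr64, md5_32 and md5_64 are chosen here for illustration only.

```python
import hashlib

import numpy as np

values = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Spell out both integer widths explicitly so the comparison runs on any OS.
arr32 = np.array(values, dtype=np.int32)
arr64 = np.array(values, dtype=np.int64)

# Same logical values, but a different number of raw bytes is fed to the hash.
print(arr32.nbytes, arr64.nbytes)  # 36 72

md5_32 = hashlib.md5(arr32.tobytes()).hexdigest()
md5_64 = hashlib.md5(arr64.tobytes()).hexdigest()
print(md5_32 == md5_64)  # False

# The dtype NumPy picks when none is given depends on platform and NumPy version.
print(np.array(values).dtype)
```

The last line is a quick way to see which of the two cases a given machine falls into.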

    How to create consistent hash values across platforms?

    Solution 1 (Original)

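    I'm able to achieve consistent results by relying on pandas.util.hash_pandas_object, as suggested elsewhere on Stack Overflow. It returns one uint64 hash per row, and the MD5 digest is then computed over those row hashes rather than over the raw array bytes. In full, my solution looks like: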

    import pandas as pd
    import numpy as np
    import hashlib
    
    df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), 
         columns=['a', 'b', 'c'])
    
    hashvalue_new = hashlib.md5(pd.util.hash_pandas_object(df, index=True).values).hexdigest()
    

    Which consistently gives me 9762ced20d27292712e6a2065b6d6226 across operating systems.
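Without a second machine at hand, one way to sanity-check this is to build the same frame with int32 and int64 data and compare the digests. The sketch below does that; frame_digest is a hypothetical helper name introduced here, not a pandas function. It only checks this one frame of small non-negative integers, so other dtypes and values would need their own check.

```python
import hashlib

import numpy as np
import pandas as pd

values = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]


def frame_digest(df):
    """MD5 over the per-row uint64 hashes from pandas (hypothetical helper)."""
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.md5(row_hashes.values).hexdigest()


# Emulate the Windows (int32) and Unix (int64) default dtypes on one machine.
df32 = pd.DataFrame(np.array(values, dtype="int32"), columns=["a", "b", "c"])
df64 = pd.DataFrame(np.array(values, dtype="int64"), columns=["a", "b", "c"])

print(frame_digest(df32) == frame_digest(df64))
```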

    Solution 2

    Alternatively, one can force the underlying NumPy array to be of dtype int64:

    df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
         dtype="int64"), 
         columns=['a', 'b', 'c'])
    
    hashvalue_new = hashlib.md5(df.values.flatten().data).hexdigest()
    

    Which consistently gives me 586962852295d584ec08e7214393f8b2 across operating systems.
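When the DataFrame already exists and its construction cannot be changed, the same idea can be applied at hashing time by casting the values to a fixed dtype first. The following is a rough sketch under that assumption; stable_md5 and its dtype parameter are hypothetical names introduced here. Like Solution 2, it hashes only the values and ignores column names and the index.

```python
import hashlib

import numpy as np
import pandas as pd


def stable_md5(df, dtype="<i8"):
    """Hash a numeric frame's values after casting to a fixed dtype (hypothetical helper).

    "<i8" means little-endian 64-bit integers, so width and byte order are both pinned.
    """
    fixed = df.to_numpy(dtype=dtype)
    # tobytes() emits the elements in C (row-major) order by default.
    return hashlib.md5(fixed.tobytes()).hexdigest()


values = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df32 = pd.DataFrame(np.array(values, dtype="int32"), columns=["a", "b", "c"])
df64 = pd.DataFrame(np.array(values, dtype="int64"), columns=["a", "b", "c"])

print(stable_md5(df32) == stable_md5(df64))  # True
```

Casting at hashing time keeps the fix in one place, at the cost of a copy whenever the frame is not already stored in the target dtype.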