I've noticed that hash values created from pandas DataFrames change depending on whether the snippet below is executed on Unix or Windows:
import pandas as pd
import numpy as np
import hashlib
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
hashvalue_new = hashlib.md5(df.values.flatten().data).hexdigest()
print(hashvalue_new)
The above code prints d0ecb84da86002807de1635ede730f0a on Windows machines and 586962852295d584ec08e7214393f8b2 on Unix machines.
Can someone more knowledgeable (or smarter) than me explain why this is happening and suggest a way to create a consistent hash value across platforms? I'm running Python 3.8.5 and pandas 1.2.5.
After some further digging, I'm now fairly certain I understand the issue. When no dtype is specified, NumPy's default integer dtype is the platform's C long type. On Windows this is a 32-bit integer, while on 64-bit Linux/Unix it is a 64-bit integer, so the same values end up in buffers of different sizes and the raw bytes fed to MD5 differ. This has been discussed elsewhere on Stack Overflow in much more detail.
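To see the effect without switching machines, here is a minimal sketch that builds the same values with explicit 32-bit and 64-bit dtypes and compares the raw buffers. The variable names are only illustrative, and the last line prints whatever default your own platform and NumPy version pick.

```python
import hashlib

import numpy as np

values = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Same numbers, two explicit integer widths.
arr32 = np.array(values, dtype="int32")
arr64 = np.array(values, dtype="int64")

# 9 values * 4 bytes vs. 9 values * 8 bytes.
bytes32 = arr32.tobytes()
bytes64 = arr64.tobytes()
print(len(bytes32), len(bytes64))  # 36 72

# Different input bytes, so the digests differ too.
digest32 = hashlib.md5(bytes32).hexdigest()
digest64 = hashlib.md5(bytes64).hexdigest()
print(digest32 == digest64)  # False

# The dtype chosen when none is given depends on the platform
# (and on the NumPy version), which is the root of the mismatch.
print(np.array(values).dtype)
```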
I'm able to achieve consistent results by relying on pandas.util.hash_pandas_object, as suggested elsewhere here on Stack Overflow. In full, my solution looks like:
import pandas as pd
import numpy as np
import hashlib
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
hashvalue_new = hashlib.md5(pd.util.hash_pandas_object(df, index=True).values).hexdigest()
This consistently gives me 9762ced20d27292712e6a2065b6d6226 across operating systems.
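If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: frame_digest is a name I made up, not a pandas function. hash_pandas_object returns one uint64 hash per row, and those row hashes are what gets fed to MD5.

```python
import hashlib

import numpy as np
import pandas as pd


def frame_digest(df: pd.DataFrame) -> str:
    """Hypothetical helper: MD5 over the per-row hashes of a DataFrame."""
    # One uint64 per row; index=True mixes the index into each row hash.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.md5(row_hashes.to_numpy().tobytes()).hexdigest()


df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['a', 'b', 'c'])

# Hashing the same frame twice gives the same digest.
print(frame_digest(df) == frame_digest(df.copy()))  # True

# Changing a single cell changes the digest.
changed = df.copy()
changed.loc[2, 'c'] = 10
print(frame_digest(df) == frame_digest(changed))  # False
```

Note that the digest depends on row order and on the index, so sort or reset the index first if those should not matter.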
One can force the underlying NumPy array to be of dtype int64:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
dtype="int64"),
columns=['a', 'b', 'c'])
hashvalue_new = hashlib.md5(df.values.flatten().data).hexdigest()
This consistently gives me 586962852295d584ec08e7214393f8b2 across operating systems.
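When the DataFrame already exists and you cannot control how it was created, a variation is to cast at hashing time instead. The sketch below assumes the frame holds only integers that fit into 64 bits; int64_digest is a hypothetical helper name, and "<i8" requests little-endian int64 so the byte order is pinned as well.

```python
import hashlib

import numpy as np
import pandas as pd


def int64_digest(df: pd.DataFrame) -> str:
    """Hypothetical helper: MD5 over the values cast to little-endian int64."""
    # tobytes() always returns the data in C (row-major) order.
    raw = df.to_numpy(dtype="<i8").tobytes()
    return hashlib.md5(raw).hexdigest()


values = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df32 = pd.DataFrame(np.array(values, dtype="int32"), columns=['a', 'b', 'c'])
df64 = pd.DataFrame(np.array(values, dtype="int64"), columns=['a', 'b', 'c'])

# Both frames hash to the same value once the dtype is normalized.
print(int64_digest(df32) == int64_digest(df64))  # True
```

This only covers the values. The index and column names are not part of the digest, and float or object columns would need different handling.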