
Are Pandas hash functions stable over time? (pd.util.hash_pandas_object)


I want to create ID columns using hash functions in my pandas dataframes. The pipeline will be reprocessed over time, and I need to ensure that the hash functions in Pandas are stable across different versions and environments.

I am using a composite key consisting of multiple columns to generate these hashes. I am currently using pd.util.hash_pandas_object for its speed, but I couldn't find information in the documentation regarding its stability over time. Is pd.util.hash_pandas_object stable across different versions of Pandas? If not, could you suggest a fast and stable alternative for hashing composite keys in DataFrames?


Solution

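  • The output has been identical across the pandas versions tested below (1.0.3, 1.4.3 and 2.2.2), so it seems stable in practice up to now, although this is an observation rather than a documented guarantee.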

    Take this example:

    import pandas as pd
    
    print(pd.__version__)
    
    df = pd.DataFrame({'col1': [0, 1, 2], 'col2': ['A', 'B', 'C']})
    pd.util.hash_pandas_object(df)
    

    Output:

    # pandas 1.0.3
    0     3633373482604536162
    1     5258867552551810711
    2    13022556061186435711
    dtype: uint64
    
    # pandas 1.4.3
    0     3633373482604536162
    1     5258867552551810711
    2    13022556061186435711
    dtype: uint64
    
    # pandas 2.2.2
    0     3633373482604536162
    1     5258867552551810711
    2    13022556061186435711
    dtype: uint64
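
Because this is an observation rather than a documented guarantee, one option is to pin the values above in a small regression check and run it whenever pandas is upgraded. A minimal sketch, where the function name hashes_match_reference is hypothetical and the expected values are simply copied from the output above:

```python
import pandas as pd

# Reference hashes copied from the outputs above (pandas 1.0.3, 1.4.3, 2.2.2).
EXPECTED = [3633373482604536162, 5258867552551810711, 13022556061186435711]


def hashes_match_reference() -> bool:
    """Return True if this environment reproduces the reference hashes."""
    df = pd.DataFrame({'col1': [0, 1, 2], 'col2': ['A', 'B', 'C']})
    actual = pd.util.hash_pandas_object(df)
    # tolist() yields plain Python ints, so uint64 values compare safely.
    return actual.tolist() == EXPECTED


if __name__ == '__main__':
    print(pd.__version__, hashes_match_reference())
```

If the check ever returns False after an upgrade, previously stored IDs should not be mixed with newly computed ones.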
    

    Note, however, that the result depends on the dtype of each column:

    # convert_dtypes() (e.g. int64 -> Int64) leaves the hashes unchanged
    pd.util.hash_pandas_object(df.convert_dtypes())
    
    0     3633373482604536162
    1     5258867552551810711
    2    13022556061186435711
    dtype: uint64
    
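    # casting col1 from int to float changes the hashes of rows 1 and 2
    # (only row 0, where the value is 0, still matches)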
    pd.util.hash_pandas_object(df.astype({'col1': float}))
    
    0     3633373482604536162
    1    12198058518291636952
    2     7562945033953410876
    dtype: uint64