Tags: python, sql-server, hash, sha256

Why does CAST(HASHBYTES('SHA2_256', ...) AS BIGINT) in Synapse SQL not match PySpark SHA-256 hash to BIGINT conversion?


I'm trying to replicate Synapse SQL HASHBYTES logic in PySpark. In SQL, we use this line to generate an ETL hash key:

SELECT CAST(HASHBYTES('SHA2_256', CONCAT_WS('|', [col1], [col2], ...)) AS BIGINT) AS EtlHashKey
FROM sample_table;

This hash is used as a unique row identifier. Now, I want to generate the same hash key in PySpark using a DataFrame. Here’s what I’ve tried:

import hashlib
import struct

def compute_hashkey(*values):
    concat = '|'.join(['' if v is None else str(v) for v in values])
    hash_bytes = hashlib.sha256(concat.encode('utf-8')).digest()
    hashkey = struct.unpack('<q', hash_bytes[:8])[0]  # little-endian signed long long
    return hashkey

I verified that the raw SHA-256 hex digest matches SQL's HASHBYTES('SHA2_256', ...) output exactly (apart from the 0x prefix SQL displays).

However, the BIGINT value I get in PySpark using struct.unpack('<q', hash_bytes[:8]) does not match the one returned by SQL.

Sample Input:

col1 | col2
-----|-----
abc  | 123

SQL: EtlHashKey = -7418430994879866706

PySpark: EtlHashKey = -5865807261440166157

What I've ruled out: any difference in the input or in the hash itself; the raw hex digests match byte-for-byte, as the check below shows.
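
For reference, here's a minimal version of that check (it assumes the SQL side hashes the same UTF-8 bytes; SSMS displays the HASHBYTES result as 0x-prefixed hex):

import hashlib

digest = hashlib.sha256('abc|123'.encode('utf-8')).hexdigest().upper()
print('0x' + digest)  # same hex as SELECT HASHBYTES('SHA2_256', 'abc|123')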

My Ask: How exactly does SQL Server / Synapse cast the HASHBYTES output to a BIGINT? Why does the hex hash match while the resulting hash key does not?

What is the correct way in Python to replicate CAST(HASHBYTES('SHA2_256', ...) AS BIGINT) so that the resulting 64-bit integer matches?

Any official documentation or byte interpretation details from SQL Server internals would be helpful.


Solution

  • SHA-256 hashes are 32 bytes long. SQL Server takes the right-most 8 bytes as bigint. But your Python code appears to be taking the first 8 bytes.

    I'm not very proficient in Python, but it looks like you need hash_bytes[-8:] to get the last 8 bytes, and '>q' to read them as big-endian.

    hashkey = struct.unpack('>q', hash_bytes[-8:])[0]
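
    An equivalent, slightly more idiomatic spelling of the same conversion (identical result: last 8 bytes, big-endian, signed):

    hashkey = int.from_bytes(hash_bytes[-8:], byteorder='big', signed=True)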
    

    Note that keeping only 8 of the hash's 32 bytes throws most of it away, so the truncated key is far more likely to collide than the full SHA-256 value.
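
    For completeness, here is a sketch of the corrected function wired into a PySpark DataFrame, since that is where the question wants to use it. The UDF wiring and the NULL-handling caveat are assumptions beyond the answer itself, so verify them against your data:

    import hashlib
    import struct

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import LongType

    def compute_hashkey(*values):
        # Same concatenation as in the question. Caveat: SQL Server's
        # CONCAT_WS skips NULL arguments entirely, while this maps None
        # to '' and keeps the separator; check which behavior you need.
        concat = '|'.join(['' if v is None else str(v) for v in values])
        hash_bytes = hashlib.sha256(concat.encode('utf-8')).digest()
        # CAST(varbinary AS bigint) keeps the LAST 8 bytes, big-endian, signed.
        return struct.unpack('>q', hash_bytes[-8:])[0]

    hashkey_udf = udf(compute_hashkey, LongType())

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("abc", "123")], ["col1", "col2"])
    df.withColumn("EtlHashKey", hashkey_udf("col1", "col2")).show(truncate=False)
    # Per the sample in the question, EtlHashKey should now come out
    # as -7418430994879866706, matching the SQL side.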