I'm trying to replicate Synapse SQL HASHBYTES logic in PySpark. In SQL, we use this statement to generate an ETL hash key:
SELECT CAST(HASHBYTES('SHA2_256', CONCAT_WS('|', [col1], [col2], ...)) AS BIGINT) AS EtlHashKey
FROM sample_table;
This hash is used as a unique row identifier. Now, I want to generate the same hash key in PySpark using a DataFrame. Here’s what I’ve tried:
import hashlib
import struct

def compute_hashkey(*values):
    # Concatenate the values with '|', treating None as an empty string
    concat = '|'.join(['' if v is None else str(v) for v in values])
    hash_bytes = hashlib.sha256(concat.encode('utf-8')).digest()
    hashkey = struct.unpack('<q', hash_bytes[:8])[0]  # first 8 bytes, little-endian signed 64-bit
    return hashkey
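For context, I apply this over the DataFrame with a plain Python UDF, roughly like this (the column names are placeholders for the real ones):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

# Wrap the Python function as a UDF that returns a 64-bit integer
compute_hashkey_udf = udf(compute_hashkey, LongType())

df = df.withColumn('EtlHashKey', compute_hashkey_udf(col('col1'), col('col2')))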
I verified that the raw SHA-256 hex hash matches SQL's HASHBYTES('SHA2_256', ...) output exactly (except that SQL displays it with a 0x prefix).
However, the BIGINT value I get in PySpark from struct.unpack('<q', hash_bytes[:8]) does not match the one returned by SQL.
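For reference, this is how I compare the hex values side by side, using the sample row below (SQL just shows the same bytes with a leading 0x):

import hashlib

# Sample-row check: this hex digest matches what SSMS prints for
# HASHBYTES('SHA2_256', 'abc|123'), apart from the leading 0x
hex_digest = hashlib.sha256('abc|123'.encode('utf-8')).hexdigest().upper()
print('0x' + hex_digest)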
Sample Input:
| col1 | col2 |
| --- | --- |
| abc | 123 |
SQL: EtlHashKey = -7418430994879866706
PySpark: EtlHashKey = -5865807261440166157
What I've ruled out: the hash itself. As shown above, the raw hex digest already matches SQL's output, so the difference has to be in how the 32 hash bytes are converted to a 64-bit integer.
My Ask: How exactly does SQL Server / Synapse cast the HASHBYTES output to a BIGINT? Why does the hex hash match while the resulting hash key does not?
What is the correct way in Python to replicate CAST(HASHBYTES('SHA2_256', ...) AS BIGINT) so that the resulting 64-bit integer matches?
Any official documentation or byte interpretation details from SQL Server internals would be helpful.
SHA-256 hashes are 32 bytes long. When SQL Server casts a varbinary value to bigint, it keeps the right-most 8 bytes, but your Python code is taking the first 8 bytes. I'm not very proficient in Python, but it looks like you need hash_bytes[-8:] to get the last 8 bytes, and '>q' to read them as a big-endian signed 64-bit integer:
hashkey = struct.unpack('>q', hash_bytes[-8:])[0]
Note that keeping only 8 of the 32 hash bytes makes collisions more likely than with the full SHA-256 value, so the key isn't as unique as it looks.
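For completeness, here is a sketch of the full corrected function based on the interpretation above (right-most 8 bytes, big-endian, signed); it can be registered as a UDF exactly like the original:

import hashlib
import struct

def compute_hashkey(*values):
    # Same concatenation as before; note that SQL's CONCAT_WS skips NULL values
    # entirely, so mapping None to '' only lines up when no column is NULL
    concat = '|'.join('' if v is None else str(v) for v in values)
    digest = hashlib.sha256(concat.encode('utf-8')).digest()
    # CAST(varbinary AS BIGINT) keeps the right-most 8 bytes,
    # read as a big-endian signed 64-bit integer
    return struct.unpack('>q', digest[-8:])[0]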