Tags: sql, sql-server, hash, hashbytes

Hashing a specific column using HASHBYTES (SHA1) is causing that column to have more distinct rows than the unhashed column


To put it simply, I have a very large database with hundreds of thousands of rows and hundreds of columns.

Some of those columns need to be hashed to save space, among other reasons. However, when I try to hash them like this:

select distinct
columnA + hashbytes('sha1', [Column_in_question]) 
from [dbo].[Tabled_in_question]

I end up with more rows than if I just do this:

select distinct
columnA + [Column_in_question]
from [dbo].[Tabled_in_question]

My best guess is that SELECT DISTINCT is not case sensitive, whereas HASHBYTES is. But I don't really know how to test this or fix it.

Any ideas?


Solution

  • You are right: the difference is the case sensitivity.

    You can check it using:

    select distinct
    convert(VARBINARY(10), [Column_in_question]),
    columnA + hashbytes('sha1', [Column_in_question]) 
    from [dbo].[Tabled_in_question]
    

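    The collation of the column (usually inherited from the database default) is most probably CI (case insensitive), but HASHBYTES works on the raw bytes. As the conversion to VARBINARY shows, values that differ only in case have different bytes, and therefore different hashes.

    Try this to change the collation and comparison rules, so that DISTINCT also treats values that differ only in case as different: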

    select distinct
    columnA + [Column_in_question] collate LATIN1_GENERAL_BIN
    from [dbo].[Tabled_in_question]
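
The effect described above can be reproduced in isolation. The following is a minimal sketch, not taken from the original tables: the table variable `@t`, its column `col`, the sample values, and the `Latin1_General_CI_AS` collation are all assumptions chosen for illustration. The last query shows one possible alternative, normalising case with `UPPER` before hashing, for the situation where the hash should group rows the same way the case-insensitive comparison does.

```sql
-- Hypothetical sample data: @t, col and the values are illustrative only.
-- The CI collation is set explicitly so the result does not depend on
-- the default collation of whichever database this runs in.
DECLARE @t TABLE (
    columnA varchar(10) COLLATE Latin1_General_CI_AS,
    col     varchar(20) COLLATE Latin1_General_CI_AS
);

INSERT INTO @t (columnA, col)
VALUES ('x', 'abc'), ('x', 'ABC'), ('x', 'Abc');

-- Case-insensitive comparison: the three values count as one.
SELECT COUNT(DISTINCT col) AS distinct_text FROM @t;

-- HASHBYTES sees three different byte sequences, hence three hashes.
SELECT COUNT(DISTINCT HASHBYTES('SHA1', col)) AS distinct_hash FROM @t;

-- A binary collation distinguishes case, so it agrees with the hash count.
SELECT COUNT(DISTINCT col COLLATE Latin1_General_BIN) AS distinct_binary FROM @t;

-- Normalising case before hashing makes the hash agree with the CI count.
SELECT COUNT(DISTINCT HASHBYTES('SHA1', UPPER(col))) AS distinct_hash_upper FROM @t;
```

With this sample data the four queries should return 1, 3, 3 and 1. Which of the two fixes fits depends on the goal: the binary collation makes the plain-text DISTINCT as strict as the hash, while upper-casing the input makes the hash as lenient as the case-insensitive comparison.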