mysqlhashbit-manipulationhamming-distancephash

How can I calculate the difference between two hashes in a MySQL query?


I'm attempting to calculate the Hamming distance between an input hash and database-stored hashes. These are perceptual hashes, so the Hamming distance between them are important to me and tell me how similar two different images are (see http://en.wikipedia.org/wiki/Perceptual_hashing, http://jenssegers.com/61/perceptual-image-hashes, http://stackoverflow.com/questions/21037578/). Hashes are 16 hexadecimal characters long, and look like this:

b1d0c44a4eb5b5a9
1f69f25228ed4a31
751a0b19f0c2783f

My database looks like this:

CREATE TABLE `hashes` (
  `id` int(11) NOT NULL,
  `hash` binary(8) NOT NULL
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1;

INSERT INTO `hashes` (`id`, `hash`) VALUES
    (1, 0xb1d0c44a4eb5b5a9),
    (2, 0x1f69f25228ed4a31),
    (3, 0x751a0b19f0c2783f);

Now, I know I can query for a Hamming distance like so:

SELECT BIT_COUNT(0xb1d0c44a4eb5b5a9 ^ 0x751a0b19f0c2783f)

Which will output 38, as expected. However, I can't seem to reference a column name for this comparison. The following does not work as expected.

SELECT BIT_COUNT(hash ^ 0x751a0b19f0c2783f) FROM hashes

Does anyone know how I can calculate a Hamming distance like in my first SELECT query above using the columns in my database? I've tried a myriad of scenarios using hex(), unhex(), conv(), and cast() in different ways. This is in MySQL.

Update My query above appears to work as expected when running in MySQL v8 (thanks to @LukStorms for pointing this out). You can use my fiddle below and change the version in the top left. My question now is: how can I ensure the behavior works in all versions of MySQL?

Fiddle: https://www.db-fiddle.com/f/mpqsUpZ1sv2kmvRwJrK5xL/0


Solution

  • The problem seems to be related to your choice of datatype which is a string type. Using a numeric datatype works in MySQL 5.7 as well as 8.0:

    CREATE TABLE `hashes` (
      `id` int(11) NOT NULL,
      `hash` bigint unsigned NOT NULL
    ) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1;
    
    INSERT INTO `hashes` (`id`, `hash`) VALUES
        (1, 0xb1d0c44a4eb5b5a9),
        (2, 0x1f69f25228ed4a31),
        (3, 0x751a0b19f0c2783f);
    
    SELECT id, HEX(hash), BIT_COUNT(hash ^ 0x751a0b19f0c2783f)
    FROM hashes;
    

    Output:

    id  HEX(hash)           BIT_COUNT(hash ^ 0x751a0b19f0c2783f)
    1   B1D0C44A4EB5B5A9    38
    2   1F69F25228ED4A31    34
    3   751A0B19F0C2783F    0
    

    Demo on dbfiddle

    The difference in treatment between MySQL 5.7 and 8.0 of using a string type can be seen with this query:

    SELECT id, hash, HEX(hash), HEX(hash ^ 0x751a0b19f0c2783f)
    FROM hashes;
    

    MySQL 5.7:

    id  hash                                                        HEX(hash)           HEX(hash ^ 0x751a0b19f0c2783f)
    1   {"type":"Buffer","data":[177,208,196,74,78,181,181,169]}    B1D0C44A4EB5B5A9    751A0B19F0C2783F
    2   {"type":"Buffer","data":[31,105,242,82,40,237,74,49]}       1F69F25228ED4A31    751A0B19F0C2783F
    3   {"type":"Buffer","data":[117,26,11,25,240,194,120,63]}      751A0B19F0C2783F    751A0B19F0C2783F
    

    MySQL 8.0

    id  hash                                                        HEX(hash)           HEX(hash ^ 0x751a0b19f0c2783f)
    1   {"type":"Buffer","data":[177,208,196,74,78,181,181,169]}    B1D0C44A4EB5B5A9    C4CACF53BE77CD96
    2   {"type":"Buffer","data":[31,105,242,82,40,237,74,49]}       1F69F25228ED4A31    6A73F94BD82F320E
    3   {"type":"Buffer","data":[117,26,11,25,240,194,120,63]}      751A0B19F0C2783F    0000000000000000
    

    MySQL 8.0 is performing the XOR correctly, returning a variable, while MySQL 5.7 is returning the value being XOR'ed, indicating that it is treating the BINARY string as 0 in a numeric context.