How can I calculate the difference between two hashes in a MySQL query?
Asked Answered
P

3

6

I'm attempting to calculate the Hamming distance between an input hash and database-stored hashes. These are perceptual hashes, so the Hamming distance between them are important to me and tell me how similar two different images are (see http://en.wikipedia.org/wiki/Perceptual_hashing, http://jenssegers.com/61/perceptual-image-hashes, http://stackoverflow.com/questions/21037578/). Hashes are 16 hexadecimal characters long, and look like this:

b1d0c44a4eb5b5a9
1f69f25228ed4a31
751a0b19f0c2783f

My database looks like this:

CREATE TABLE `hashes` (
  `id` int(11) NOT NULL,
  `hash` binary(8) NOT NULL
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1;

INSERT INTO `hashes` (`id`, `hash`) VALUES
    (1, 0xb1d0c44a4eb5b5a9),
    (2, 0x1f69f25228ed4a31),
    (3, 0x751a0b19f0c2783f);

Now, I know I can query for a Hamming distance like so:

SELECT BIT_COUNT(0xb1d0c44a4eb5b5a9 ^ 0x751a0b19f0c2783f)

Which will output 38, as expected. However, I can't seem to reference a column name for this comparison. The following does not work as expected.

SELECT BIT_COUNT(hash ^ 0x751a0b19f0c2783f) FROM hashes

Does anyone know how I can calculate a Hamming distance like in my first SELECT query above using the columns in my database? I've tried a myriad of scenarios using hex(), unhex(), conv(), and cast() in different ways. This is in MySQL.

Update My query above appears to work as expected when running in MySQL v8 (thanks to @LukStorms for pointing this out). You can use my fiddle below and change the version in the top left. My question now is: how can I ensure the behavior works in all versions of MySQL?

Fiddle: https://www.db-fiddle.com/f/mpqsUpZ1sv2kmvRwJrK5xL/0

Preachy answered 2/2, 2019 at 20:45 Comment(7)
BIT_COUNT does not calculate hamming distance it just counts the bits which are set.Shortstop
@RaymondNijland this is why I perform an XOR as well. XOR -> count set bets is essentially hamming distance if the hashes are binary. this works in my first select queryPreachy
@AndrewMorton These are perceptual hashes, thus the Hamming is very important to me if I'm trying to query for duplicate/similar images. See: en.wikipedia.org/wiki/Perceptual_hashing jenssegers.com/61/perceptual-image-hashes #21038078Preachy
@AndrewMorton very interesting to me as well. I've updated my questionPreachy
@LukStorms If they are equal, the expected Hamming distance is in fact zero (no difference). That said, not all the Hamming distances are correctPreachy
Interesting enough, this db<>fiddle here doesn't seem to have a problem with it. A bug they fixed in MySql 8?Beal
@Beal that is verrrryyy interesting. i wonder if it is a bug? the fiddle I posted to also allows you to switch it to mysql8, where it works as wellPreachy
T
5

The problem seems to be related to your choice of datatype which is a string type. Using a numeric datatype works in MySQL 5.7 as well as 8.0:

CREATE TABLE `hashes` (
  `id` int(11) NOT NULL,
  `hash` bigint unsigned NOT NULL
) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1;

INSERT INTO `hashes` (`id`, `hash`) VALUES
    (1, 0xb1d0c44a4eb5b5a9),
    (2, 0x1f69f25228ed4a31),
    (3, 0x751a0b19f0c2783f);

SELECT id, HEX(hash), BIT_COUNT(hash ^ 0x751a0b19f0c2783f)
FROM hashes;

Output:

id  HEX(hash)           BIT_COUNT(hash ^ 0x751a0b19f0c2783f)
1   B1D0C44A4EB5B5A9    38
2   1F69F25228ED4A31    34
3   751A0B19F0C2783F    0

Demo on dbfiddle

The difference in treatment between MySQL 5.7 and 8.0 of using a string type can be seen with this query:

SELECT id, hash, HEX(hash), HEX(hash ^ 0x751a0b19f0c2783f)
FROM hashes;

MySQL 5.7:

id  hash                                                        HEX(hash)           HEX(hash ^ 0x751a0b19f0c2783f)
1   {"type":"Buffer","data":[177,208,196,74,78,181,181,169]}    B1D0C44A4EB5B5A9    751A0B19F0C2783F
2   {"type":"Buffer","data":[31,105,242,82,40,237,74,49]}       1F69F25228ED4A31    751A0B19F0C2783F
3   {"type":"Buffer","data":[117,26,11,25,240,194,120,63]}      751A0B19F0C2783F    751A0B19F0C2783F

MySQL 8.0

id  hash                                                        HEX(hash)           HEX(hash ^ 0x751a0b19f0c2783f)
1   {"type":"Buffer","data":[177,208,196,74,78,181,181,169]}    B1D0C44A4EB5B5A9    C4CACF53BE77CD96
2   {"type":"Buffer","data":[31,105,242,82,40,237,74,49]}       1F69F25228ED4A31    6A73F94BD82F320E
3   {"type":"Buffer","data":[117,26,11,25,240,194,120,63]}      751A0B19F0C2783F    0000000000000000

MySQL 8.0 is performing the XOR correctly, returning a variable, while MySQL 5.7 is returning the value being XOR'ed, indicating that it is treating the BINARY string as 0 in a numeric context.

Tanatanach answered 2/2, 2019 at 22:52 Comment(9)
interesting. why can't I store numbers & retrieve them as such from binary types? this works in mysql 8Preachy
@Preachy see my update. It shows that MySQL 5.7 is treating the string as 0 in a numeric context, where MySQL 8 is casting it to a numberTanatanach
hm... it looks like bigint is not large enough to store 'b1d0c44a4eb5b5a9', so i guess this was my miscalculation in wanting a binary(8) in the first place.Preachy
since the largest possible value is 0xFFFFFFFFFFFFFFFF, I think I should use a decimal(64,0)Preachy
@Preachy you need to use bigint unsigned as I did in my demoTanatanach
I am, but for some reason it says Numeric value out of range: 1264 Out of range value when I insert w/ UPDATE. see: db-fiddle.com/f/mpqsUpZ1sv2kmvRwJrK5xL/4Preachy
You need to keep the value in hex format i.e. UPDATE `hashes` SET `hash` = 0x1f69f25228ed4a31 WHERE id = 1; db-fiddle.com/f/mpqsUpZ1sv2kmvRwJrK5xL/5Tanatanach
You could use CONV e.g. UPDATE `hashes` SET `hash` = CONV('1f69f25228ed4a31', 16, 10) WHERE id = 1; db-fiddle.com/f/mpqsUpZ1sv2kmvRwJrK5xL/7Tanatanach
No worries. All part of the service :-)Tanatanach
W
2

This is not a number, so it can't used for mathematic calculations:

`hash` binary(8) NOT NULL

Use bigint instead:

`hash` bigint unsigned NOT NULL
Whorl answered 2/2, 2019 at 22:51 Comment(1)
interesting. why can't I store numbers & retrieve them as such from binary types? this works in mysql 8Preachy
B
2

Try this:

SELECT id, HEX(hash), CAST(CONV(HEX(hash),16,10) AS UNSIGNED), BIT_COUNT(CAST(CONV(HEX(hash),16,10) AS UNSIGNED) ^ 0x751a0b19f0c2783f) FROM hashes;
Brubaker answered 3/2, 2019 at 0:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.