Hamming distance on binary strings in SQL
Asked Answered
G

2

27

I have a table in my DB where I store SHA256 hashes in a BINARY(32) column. I'm looking for a way to compute the Hamming distance of the entries in the column to a supplied value, i.e. something like:

SELECT * FROM table 
  ORDER BY HAMMINGDISTANCE(hash, UNHEX(<insert supplied sha256 hash here>)) ASC 
  LIMIT 10

(in case you're wondering, the Hamming distance of strings A and B is defined as BIT_COUNT(A^B), where ^ is the bitwise XOR operator and BIT_COUNT returns the number of 1s in the binary string).

Now, I know that both the ^ operator and BIT_COUNT function only work on INTEGERs and so I'd say that probably the only way to do it would be to break up the binary strings in substrings, cast each binary substring to integer, compute the Hamming distance substring-wise and then add them. The problem with this is that it sounds terribly complicated, not efficient and definitely not elegant. My question therefore is: could you suggest any better way? (please note that I'm on shared hosting and therefore I can't modify the DB server or load libraries)

edit(1): Obviously loading the whole table in PHP and doing the computations there would be possible but I'd rather avoid it because this table will probably grow quite large.

edit(2): The DB server is MySQL 5.1

edit(3): My answer below contains the code that I just described above.

edit(4): I just found out that using 4 BIGINTs to store the hash instead of a BINARY(32) yields massive speed improvements (more than 100 times faster). See the comments to my answer below.

Genro answered 23/1, 2011 at 22:45 Comment(9)
Feel free also to suggest different ways to store the hashes if this could prove useful in finding a better solution.Genro
If you'd store the hash in 8 integers (perhaps in addition to the binary storage), the calculation becomes much easier.Chevy
I'm really curious as for why you would want to calculate the distance :)Angiology
I'm working on a distributed hash table, and that step is needed to find keys on other nodes (i.e. if you don't have a key locally, you forward the query to the nodes whose ID is closer to the key - where closer is measured in terms of the Hamming distance); obviously both IDs and keys are of the same length (256 bit in my case).Genro
What do you mean by "the table will probably grow quite large"? Will a full scan with the adequate function be acceptable, or do you need a way to use some indexes and avoid the full scan?Jointed
I mean that I'd rather avoid fetching the whole table in PHP and then choosing there the 10 entries I need (see the "LIMIT 10" clause in the code above) if I can do it on the SQL server.Genro
i'm trying to do hamming distance on mysql with phasher. i just tried this query but it seems not return nearest record with similar signature. can you explain it?Tweedy
@Tweedy I'm not psychic, so I can't tell what's wrong in your code without looking. In my case it worked pretty well (see my answer below).Genro
See also this related answer: you could expand on it to make all the computations in the DB as functions https://mcmap.net/q/82890/-hamming-distance-similarity-searches-in-a-databaseBernete
G
17

It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only fast way to get decent performance is to split the content of the BINARY column in multiple BIGINT columns, each containing an 8-byte substring of the original data.

In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:

CREATE FUNCTION HAMMINGDISTANCE(
  A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT, 
  B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(A0 ^ B0) +
  BIT_COUNT(A1 ^ B1) +
  BIT_COUNT(A2 ^ B2) +
  BIT_COUNT(A3 ^ B3);

Using this approach, in my testing, is over 100 times faster than using the BINARY approach.


FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):

CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
  );
Genro answered 24/1, 2011 at 14:55 Comment(6)
I just ran some tests: running the query in the original question using the function as defined here on a table with 100000 rows takes somewhere around 2.5s. Since I don't really need the exact answer to my query, I can just sample the table by adding a WHERE RAND() < 0.05 (to sample a random 5% of the table) and this brings the time down to 0.2s. Still, if some SQL guru out there can point out a better way to do it, I'd love to hear it.Genro
Other test: I created a view that trasnforms each BINARY(32) to four BIGINTs. This lowers the time from 2.5s to 0.6s.Genro
OK, I found out that if I actually use a table where I store the hash as 4 BIGINTs, the same query completes in 0.02s. Defintely using BINARY(32) is a Bad Idea(TM).Genro
hi there , i'm trying to do hamming distance with mysql too. how exactly query to search relevance records with hash string?Tweedy
Did you ever try using the BIT column type? dev.mysql.com/doc/refman/5.0/en/bit-type.htmlAbbess
Splitting the binary in ints has also other interesting applications for speeding things up even further when you want to find hamming distance less than a certain value. See https://mcmap.net/q/82890/-hamming-distance-similarity-searches-in-a-databaseBernete
C
1

Interesting question, I've found a way to do this for a binary(3) that might work as well for a binary(32):

drop table if exists BinaryTest;
create table  BinaryTest (hash binary(3));
insert BinaryTest values (0xAAAAAA);

set @supplied = cast(0x888888 as binary);

select  length(replace(concat(
            bin(ascii(substr(hash,1,1)) ^ ascii(substr(@supplied,1,1))),
            bin(ascii(substr(hash,2,1)) ^ ascii(substr(@supplied,2,1))),
            bin(ascii(substr(hash,3,1)) ^ ascii(substr(@supplied,3,1)))
        ),'0',''))
from    BinaryTest;

The replace removes any all zeroes, and the length of remainder is the number of ones. (The conversion to binary omits leading zeroes, so counting the zeroes would not work.)

This prints 6, which matches the number of ones in

0xAAAAAA ^ 0x888888 = 0x222222 = 0b1000100010001000100010
Chevy answered 23/1, 2011 at 23:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.