Hamming Distance optimization for MySQL or PostgreSQL?

I'm trying to improve similar-image search over pHashes stored in a MySQL database. Right now I compare pHashes by computing the Hamming distance like this:

SELECT * FROM images WHERE BIT_COUNT(hash ^ 2028359052535108275) <= 4

Results of this SELECT (MyISAM engine):

  • 20000 rows ; query time < 20ms
  • 100000 rows ; query time ~ 60ms # this was just fine, until it reached 150000 rows
  • 300000 rows ; query time ~ 150ms

So query time increases with the number of rows in the table.


I also tried a solution found on Stack Overflow: Hamming distance on binary strings in SQL

SELECT * FROM images WHERE 
BIT_COUNT(h1 ^ 11110011) + 
BIT_COUNT(h2 ^ 10110100) + 
BIT_COUNT(h3 ^ 11001001) + 
BIT_COUNT(h4 ^ 11010001) + 
BIT_COUNT(h5 ^ 00100011) + 
BIT_COUNT(h6 ^ 00010100) + 
BIT_COUNT(h7 ^ 00011111) + 
BIT_COUNT(h8 ^ 00001111) <= 4

300000 rows ; query time ~ 240ms
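
For reference, the original post does not show the schema for this split-hash variant, but the query above assumes the images table stores the 64-bit hash one byte per column (h1 through h8), something like this sketch (hypothetical, not from the original question):

-- Hypothetical layout for the split-hash approach: one byte of the pHash per column.
CREATE TABLE images (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    h1 TINYINT UNSIGNED NOT NULL,
    h2 TINYINT UNSIGNED NOT NULL,
    h3 TINYINT UNSIGNED NOT NULL,
    h4 TINYINT UNSIGNED NOT NULL,
    h5 TINYINT UNSIGNED NOT NULL,
    h6 TINYINT UNSIGNED NOT NULL,
    h7 TINYINT UNSIGNED NOT NULL,
    h8 TINYINT UNSIGNED NOT NULL
);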


I also changed the database engine to PostgreSQL and translated the MySQL query, without success: 300000 rows ; query time ~ 18s
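
For reference (the translated query itself was not posted), a direct translation on a BIGINT column might look like the sketch below. Note that bit_count() only exists natively since PostgreSQL 14; on older versions a user-defined function or a C extension is needed. Like the MySQL version, this still scans every row.

-- Hedged sketch: requires PostgreSQL 14+ for bit_count(); hash assumed to be BIGINT.
-- # is PostgreSQL's bitwise XOR operator.
SELECT * FROM images
WHERE bit_count((hash # 2028359052535108275)::bit(64)) <= 4;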


Is there any way to optimize the queries above? I mean an optimization whose cost does not depend on the number of rows.

I have limited tools for solving this problem. So far MySQL seemed to be the simplest solution, but I can deploy code on any open-source database engine that will work with Ruby on a dedicated machine. There is a ready-made solution for MSSQL at https://mcmap.net/q/1776881/-hamming-weight-population-count-in-t-sql (not tested). Maybe someone knows how to translate it for MySQL or PostgreSQL.

Please post answers based on code or observations; there are already plenty of purely theoretical discussions of Hamming distance on stackoverflow.com.

Thanks!

Underpin answered 17/2, 2013 at 19:30 Comment(1)
Hey, I'm trying to do a similar image search just like you, but my result is always 0. Can you provide some sample code for searching with the hash string?Puckett

When considering the efficiency of algorithms, computer scientists use the concept of order, denoted O(something), where something is a function of n, the number of things being computed (in this case, rows). So we get, in increasing time:

  • O(1) - independent of the number of items
  • O(log(n)) - increases as the logarithm of the items
  • O(n) - increases in proportion to the number of items (what you have)
  • O(n^2) - increases as the square of the items
  • O(n^3) - etc
  • O(2^n) - increases exponentially
  • O(n!) - increases with the factorial of the number

The last two are effectively uncomputable for any reasonably large n (80+).

Only the most significant term matters, since it dominates for large n; so n^2 and 65*n^2 + 787*n + 4656566 are both O(n^2).

Bear in mind that this is a mathematical construction, and the time an algorithm takes with real software on real hardware using real data may be heavily influenced by other things (e.g. an O(n^2) memory operation may take less time than an O(n) disk operation).

For your problem, you need to run through each row and compute BIT_COUNT(hash ^ 2028359052535108275) <= 4. This is an O(n) operation.

The only way this could be improved is by utilizing an index since a b-tree index retrieval is an O(log(n)) operation.

However, because your column is wrapped in a function, an index on that column cannot be used. You have two possibilities:

  1. This is a SQL Server solution and I don't know if it is portable to MySQL: create a persisted computed column in your table with the formula BIT_COUNT(hash ^ 2028359052535108275) and put an index on it (see the sketch after this list). This will not be suitable if you need to change the bit mask.
  2. Work out a way of doing the bitwise arithmetic without using the BIT_COUNT function.
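
To illustrate possibility 1 in MySQL terms, here is a minimal sketch (not from the original answer), assuming MySQL 5.7+ generated columns and the fixed mask from the question; as noted above, it only helps while the mask stays constant:

-- Hypothetical: precompute the distance to one fixed reference hash as a stored
-- generated column, then index it so the range check can use a b-tree.
ALTER TABLE images
    ADD COLUMN hash_dist INT AS (BIT_COUNT(hash ^ 2028359052535108275)) STORED,
    ADD INDEX idx_hash_dist (hash_dist);

-- The original predicate then becomes an indexable range scan:
SELECT * FROM images WHERE hash_dist <= 4;
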
Coquille answered 19/2, 2013 at 3:27 Comment(3)
Solution 1 cannot be used because, of course, the bit mask needs to change on each request. Solution 2 is too abstract; it seems I have a solution, but I can't share it, because I want to make money :)Cretonne
Writing Postgres extensions can be a solution if you know C well. Working project: github.com/lalinsky/acoustid-server/blob/master/postgresql/…Underpin
FWIW, you can use a tree structure to speed this sort of query up. You use a BK-tree, which gives you O(log(n)) time (albeit with the distance affecting the value of n pretty significantly). In any event, you can reduce a full table scan to < 10% table scan for edit distances of <= 2, in many cases.Eanore

This solution made things a bit faster for me. It builds a derived table for each hash comparison and keeps only the rows still within the Hamming distance threshold, so it never runs BIT_COUNT on a pHash that has already exceeded the threshold. It returns all matches in about 2.25 seconds on 2.6 million records.

It's InnoDB, and I have very few indexes.

If somebody can make it faster, I'd appreciate it.

SELECT *, BIT_COUNT(pHash3 ^ 42597524) + BC2 AS BC3 
FROM ( 
    SELECT *, BIT_COUNT(pHash2 ^ 258741369) + BC1 AS BC2 
    FROM ( 
        SELECT *, BIT_COUNT(pHash1 ^ 5678910) + BC0 AS BC1 
        FROM ( 
            SELECT `Key`, pHash0, pHash1, pHash2, pHash3, BIT_COUNT(pHash0 ^ 1234567) as BC0 
            FROM files 
            WHERE  BIT_COUNT(pHash0 ^ 1234567) <= 3 
        ) AS BCQ0 
        WHERE BIT_COUNT(pHash1 ^ 5678910) + BC0 <= 3 
    ) AS BCQ1 
    WHERE BIT_COUNT(pHash2 ^ 258741369) + BC1 <= 3 
) AS BCQ2
WHERE BIT_COUNT(pHash3 ^ 42597524) + BC2 <= 3

This is the equivalent query, but without the derived tables. Its return time is almost 3 times as long.

SELECT `Key`, pHash0, pHash1, pHash2, pHash3 
FROM Files 
WHERE BIT_COUNT(pHash0 ^ 1234567) + BIT_COUNT(pHash1 ^ 5678910) + BIT_COUNT(pHash2 ^ 258741369) + BIT_COUNT(pHash3 ^ 42597524) <= 3

Keep in mind that the lower the Hamming threshold in the innermost query of the first version, the faster it will run.

Newberry answered 12/8, 2014 at 20:22 Comment(4)
I can't take credit for this but thought I'd point you to the answer here: #35066175Reddick
Thanks, @Brian-F-Leighty, it's actually my own question you pointed to. And yes, the answer has shaved millennia off of my queries.Newberry
Sorry, I should have looked to see if you were on the other question. I just know I'm planning to use the same approach and thought I'd share. Good to know it's working well for you.Reddick
No apologies necessary, and my thanks are in order! I have come a long way with this project, so please feel free to use me as a resource.Newberry

Here are the results of my benchmarks. The pHash is calculated with the imagehash library in Python and stored as two BIGINTs in the database.
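
For illustration (not from the original post), this is how the 16-hex-digit pHash splits into the two stored BIGINT values, using the same CONV/SUBSTRING pattern as the benchmark queries below:

-- Hypothetical: split a 64-bit hex pHash into two 32-bit halves.
SELECT CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10) AS filephash_1,
       CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10) AS filephash_2;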

This test was run on 858,433 images in a MariaDB database that does not use sharding. I found that sharding actually slowed the process down; however, that was with the function method, so results may differ without it or on a larger database.

The table these queries run against is an in-memory-only table. A regular table is kept on disk, and on database startup the id, phash1, and phash2 are copied into the in-memory table. The id is returned so the match can be looked up in the InnoDB table once something is found.
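
A minimal sketch of that setup (table names and column types are assumptions; phashs is the table the queries below run against, and phashs_innodb stands in for the on-disk source table, whose name was not posted):

-- Hypothetical MEMORY-engine copy holding only the columns needed for matching.
CREATE TABLE phashs (
    id          INT UNSIGNED NOT NULL PRIMARY KEY,
    filephash_1 BIGINT NOT NULL,
    filephash_2 BIGINT NOT NULL
) ENGINE = MEMORY;

-- Repopulated on every database startup, since MEMORY tables lose their rows.
INSERT INTO phashs (id, filephash_1, filephash_2)
SELECT id, filephash_1, filephash_2
FROM phashs_innodb;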

Total Images: 858433

Image 1: ece0455d6b8e9470

Function HAMMINGDISTANCE_16:

RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1)
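
For reference, a sketch of what the full function definition might look like; the parameter names, order, and types are assumptions reconstructed from the RETURN expression above and the calls in the queries below (the original definition was not posted):

-- Hypothetical definition: A0/A1 are the stored hash halves, B0/B1 the probe halves.
CREATE FUNCTION HAMMINGDISTANCE_16(A0 BIGINT, A1 BIGINT, B0 BIGINT, B1 BIGINT)
RETURNS INT DETERMINISTIC
RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1);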

Method: HAMMINGDISTANCE_16 Function

Query:

SELECT `id` FROM `phashs` WHERE HAMMINGDISTANCE_16(filephash_1, filephash_2, CONV(SUBSTRING('ece0455d6b8e9470', 1,  8), 16, 10), CONV(SUBSTRING('ece0455d6b8e9470', 9,  8), 16, 10)) <= 3;

Time: 2.1760 seconds

Method: BIT_COUNT

Query:

SELECT `id` FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1,  8), 16, 10)) +  BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9,  8), 16, 10)) <= 3; 

Time: 0.1547 seconds

Method: Multi-Select BIT_COUNT inner is filephash_1

Query:

SELECT `id` FROM (
    SELECT `id`, `filephash_2`,
           BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) as BC0
    FROM `phashs`
    WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) <= 3
) BCQ0
WHERE BC0 + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3;

Time: 0.1878 seconds

Method: Multi-Select BIT_COUNT inner is filephash_2

Query:

SELECT `id` FROM (
    SELECT `id`, `filephash_1`,
           BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) as BC1
    FROM `phashs`
    WHERE BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3
) BCQ1
WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) + BC1 <= 3;

Time: 0.1860 seconds

Image 2: 813ed36913ec8639

Function HAMMINGDISTANCE_16:

RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1)

Method: HAMMINGDISTANCE_16 Function

Query:

SELECT `id` FROM `phashs` WHERE HAMMINGDISTANCE_16(filephash_1, filephash_2, CONV(SUBSTRING('813ed36913ec8639', 1,  8), 16, 10), CONV(SUBSTRING('813ed36913ec8639', 9,  8), 16, 10)) <= 3;

Time: 2.1440 seconds

Method: BIT_COUNT

Query:

SELECT `id` FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1,  8), 16, 10)) +  BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9,  8), 16, 10)) <= 3; 

Time: 0.1588 seconds

Method: Multi-Select BIT_COUNT inner is filephash_1

Query:

SELECT `id` FROM (
    SELECT `id`, `filephash_2`,
           BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) as BC0
    FROM `phashs`
    WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) <= 3
) BCQ0
WHERE BC0 + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3;

Time: 0.1671 seconds

Method: Multi-Select BIT_COUNT inner is filephash_2

Query:

SELECT `id` FROM (
    SELECT `id`, `filephash_1`,
           BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) as BC1
    FROM `phashs`
    WHERE BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3
) BCQ1
WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) + BC1 <= 3;

Time: 0.1686 seconds

Pettifog answered 11/2, 2023 at 22:22 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Chenab
