Here are the results for my benchmarks. Phash is calculated with the imagehash library in Python and stored as two BIGINTs in the database.
This test was ran on 858,433 images in a mariadb database that does not use sharding. I found sharding to actually slow down the process, however that was with the function method so that may be different without it or on a large database.
The table these are running on is an in-memory only table. A local table is kept and upon startup of the database the id, phash1, and phash2 are copied to an in-memory table. The id is returned to match to the innodb table once something is found.
Total Images: 858433
Image 1: ece0455d6b8e9470
Function HAMMINGDISTANCE_16:
RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1)
Method: HAMMINGDISTANCE_16 Function
Query:
SELECT `id` FROM `phashs` WHERE HAMMINGDISTANCE_16(filephash_1, filephash_2, CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10), CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3;
Time: 2.1760 seconds
Method: BIT_COUNT
Query:
SELECT `id` FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3;
Time: 0.1547 seconds
Method: Multi-Select BIT_COUNT inner is filephash_1
Query:
SELECT `id` FROM ( SELECT `id`, `filephash_2`, BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) as BC0 FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) <= 3 ) BCQ0 WHERE BC0 + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3;
Time: 0.1878 seconds
Method: Multi-Select BIT_COUNT inner is filephash_2
Query:
SELECT `id` FROM (SELECT `id`, `filephash_1`, BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) as BC1 FROM `phashs` WHERE BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('ece0455d6b8e9470', 9, 8), 16, 10)) <= 3) BCQ1 WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('ece0455d6b8e9470', 1, 8), 16, 10)) + BC1 <= 3;
Time: 0.1860 seconds
Image 2: 813ed36913ec8639
Function HAMMINGDISTANCE_16:
RETURN BIT_COUNT(A0 ^ B0) + BIT_COUNT(A1 ^ B1)
Method: HAMMINGDISTANCE_16 Function
Query:
SELECT `id` FROM `phashs` WHERE HAMMINGDISTANCE_16(filephash_1, filephash_2, CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10), CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3;
Time: 2.1440 seconds
Method: BIT_COUNT
Query:
SELECT `id` FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3;
Time: 0.1588 seconds
Method: Multi-Select BIT_COUNT inner is filephash_1
Query:
SELECT `id` FROM ( SELECT `id`, `filephash_2`, BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) as BC0 FROM `phashs` WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) <= 3 ) BCQ0 WHERE BC0 + BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3;
Time: 0.1671 seconds
Method: Multi-Select BIT_COUNT inner is filephash_2
Query:
SELECT `id` FROM (SELECT `id`, `filephash_1`, BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) as BC1 FROM `phashs` WHERE BIT_COUNT(filephash_2 ^ CONV(SUBSTRING('813ed36913ec8639', 9, 8), 16, 10)) <= 3) BCQ1 WHERE BIT_COUNT(filephash_1 ^ CONV(SUBSTRING('813ed36913ec8639', 1, 8), 16, 10)) + BC1 <= 3;
Time: 0.1686 seconds