I am trying to find similar hashes (hexadecimal hashes) using Hamming and Levenshtein distance. Let's say two hashes are similar if their Hamming distance (the number of differing bits) is less than 10.
Hash 1= ffffff (base 16)
Hash 2= fffff0 (base 16)
The Hamming distance between the two hashes is 4, so they are similar, because:
Hash 1= 11111111 11111111 11111111 (base 2)
Hash 2= 11111111 11111111 11110000 (base 2)
I have 8 million such hashes. I am wondering what would be a suitable data structure for storing them. I initially tried a trie, but consider the following scenario:
Hash 1 = 0fabde (00001111 10101011 11011110)
Hash 2 = adcbfe (10101101 11001011 11111110)
The Hamming distance is 6, so the two hashes are similar, yet they already differ in the very first bit. So I cannot do a prefix search.
I know that I can use XOR and Integer.bitCount() to get the number of differing bits, but I have one target hash and 8 million hashes to search against, i.e. given a hash, I have to find all the similar hashes among the 8 million that we have in the repository.
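For reference, the brute-force check I have in mind looks roughly like the sketch below. The class and method names (HammingScan, distance, parse, findSimilar) are made up for illustration, and it assumes each hash fits in 64 bits, so it uses Long.bitCount() rather than Integer.bitCount(). It compares the target against every stored hash, which is exactly the full scan I would like to avoid.

```java
import java.util.ArrayList;
import java.util.List;

public class HammingScan {
    // Number of differing bits between two hashes held as longs.
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Parses a hex string of up to 16 digits into a long (assumption: hashes fit in 64 bits).
    static long parse(String hex) {
        return Long.parseUnsignedLong(hex, 16);
    }

    // Linear scan: collects every stored hash whose distance to the target is below the threshold.
    static List<Long> findSimilar(long target, long[] repository, int threshold) {
        List<Long> matches = new ArrayList<>();
        for (long candidate : repository) {
            if (distance(target, candidate) < threshold) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Tiny stand-in for the real repository of 8 million hashes.
        long[] repository = { parse("fffff0"), parse("adcbfe"), parse("000000") };
        for (long h : findSimilar(parse("ffffff"), repository, 10)) {
            System.out.println(Long.toHexString(h));
        }
    }
}
```

Each comparison is cheap, but the scan touches all 8 million entries for every query.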
Is there any way to store the hashes effectively so that my search space is reduced?