Similar image search by pHash distance in Elasticsearch

Similar image search problem

  • Millions of images pHash'ed and stored in Elasticsearch.
  • Format is "11001101...11" (length 64), but can be changed (better not).

Given a subject image's hash "100111..10", we want to find all similar image hashes in the Elasticsearch index within a Hamming distance of 8.

Of course, the query can return images with a distance greater than 8, and a script in Elasticsearch or outside of it can filter the result set. But the total search time must be within 1 second or so.
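
For reference, that post-filtering step boils down to a cheap XOR-and-popcount; a minimal Python sketch (toy hashes, not our real data):

def hamming_distance(hash_a, hash_b):
    # XOR the integer representations of the bit strings and count the differing bits.
    return bin(int(hash_a, 2) ^ int(hash_b, 2)).count("1")

# Toy 8-bit example; our real hashes are 64 characters long.
assert hamming_distance("11001101", "11001010") == 3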

Our current mapping

Each document has a nested images field that contains image hashes:

{
  "images": {
    "type": "nested", 
    "properties": {
      "pHashFingerprint": {"index": "not_analysed", "type": "string"}
    }
  }
}

Our poor solution

Fact: the Elasticsearch fuzzy query supports a maximum Levenshtein distance of only 2.

We used a custom tokenizer to split the 64-bit string into 4 groups of 16 bits and run a 4-group search with four fuzzy queries.

Analyzer:

{
   "analysis": {
      "analyzer": {
         "split4_fingerprint_analyzer": {
            "type": "custom",
            "tokenizer": "split4_fingerprint_tokenizer"
         }
      },
      "tokenizer": {
         "split4_fingerprint_tokenizer": {
            "type": "pattern",
            "group": 0,
            "pattern": "([01]{16})"
         }
      }
   }
}

Then the new field mapping:

"index_analyzer": "split4_fingerprint_analyzer",

Then the query:

{
   "query": {
      "filtered": {
         "query": {
            "nested": {
               "path": "images",
               "query": {
                  "bool": {
                     "minimum_should_match": 2,
                     "should": [
                        {
                           "fuzzy": {
                              "phashFingerprint.split4": {
                                 "value": "0010100100111001",
                                 "fuzziness": 2
                              }
                           }
                        },
                        {
                           "fuzzy": {
                              "phashFingerprint.split4": {
                                 "value": "1010100100111001",
                                 "fuzziness": 2
                              }
                           }
                        },
                        {
                           "fuzzy": {
                              "phashFingerprint.split4": {
                                 "value": "0110100100111001",
                                 "fuzziness": 2
                              }
                           }
                        },
                        {
                           "fuzzy": {
                              "phashFingerprint.split4": {
                                 "value": "1110100100111001",
                                 "fuzziness": 2
                              }
                           }
                        }
                     ]
                  }
               }
            }
         },
         "filter": {}
      }
   }
}
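
For clarity, we split the subject hash into four consecutive 16-bit groups and issue one fuzzy clause per group. A minimal Python sketch of how such a query body could be assembled (the field name is copied from the query above; the hash value is illustrative):

def build_split4_query(subject_hash, minimum_should_match=2):
    """Build the 4-group fuzzy query for a 64-character binary hash string."""
    groups = [subject_hash[i:i + 16] for i in range(0, 64, 16)]
    should = [
        {"fuzzy": {"phashFingerprint.split4": {"value": group, "fuzziness": 2}}}
        for group in groups
    ]
    return {
        "query": {
            "filtered": {
                "query": {
                    "nested": {
                        "path": "images",
                        "query": {
                            "bool": {
                                "minimum_should_match": minimum_should_match,
                                "should": should,
                            }
                        },
                    }
                },
                "filter": {},
            }
        }
    }

query_body = build_split4_query("0010100100111001" * 4)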

Note that we return documents that have matching images rather than the images themselves, but that should not change things much.

The problem is that this query returns hundreds of thousands of results, even after adding other domain-specific filters to reduce the initial set. The script then has too much work recalculating the Hamming distance, so the query can take minutes.

As expected, increasing minimum_should_match to 3 or 4 returns only a subset of the images that must be found, but the resulting set is small and fast. Fewer than 95% of the needed images are returned with minimum_should_match == 3, but we need 100% (or 99.9%), as with minimum_should_match == 2.

We tried similar approaches with n-grams, but with little success, again in the sense of too many results.

Are there any solutions using other data structures or queries?

Edit:

We noticed that there was a bug in our evaluation process, and minimum_should_match == 2 does return 100% of the results. (In hindsight this is guaranteed: with a total Hamming distance of at most 8 spread over four 16-bit groups, at most two groups can differ by more than 2 bits, so at least two groups always stay within fuzziness 2.) However, the post-processing afterwards takes 5 seconds on average. We will see whether the script is worth optimising.

Saxony answered 25/9, 2015 at 15:37 Comment(10)
If B is the number of bits set in each fingerprint (0 <= B <= 64), then you can store B with each document and initially filter out all records where B < (sourceB - 8) or B > (sourceB + 8). That should reduce the fingerprints under consideration by a minimum of 4x given an even distribution.Lemonade
While it is true that the Elasticsearch fuzzy query and most other APIs with a fuzziness param only support a max edit distance of 2, what about the fuzzy_like_this query? Their docs do note that there is an exception for that here. I think that might allow you to avoid the hacky solution you currently have. And of course, you are not getting results within Hamming distance 8 but Levenshtein distance 8, so I am not sure whether you will need to recalculate that.Crustaceous
@PeterDixon-Moses, that is an interesting idea, although that is a very large range to search on this binomial distribution so reduction is very small.Saxony
@eemp, thanks, we tried fuzzy_like_this, but it takes minutes to complete in ES and it's a deprecated function. Good idea though.Saxony
Actually a few million elements isn't that much, even 100 million 64-bit integers (so 8 bytes each) is only 800 MB of RAM and fits easily on GPU. I couldn't find a good reference but I expect CUDA to stream through that dataset in 10s of milliseconds and produce the exact list as an output. In high-dimensional spaces especially fuzzy matching might not benefit much from indexing and datastructures. Even when sorting they could go through 1740 million 32-bit keys / second!Perversion
I like the CUDA idea. Without a domain-specific hashing scheme, really the only way to speed up your computation in Elasticsearch is by sharding your data across multiple cluster-nodes. Whereas on a GPU you can write an optimized Hamming tester that can be parallelized 32-ways without any network or disk IO.Lemonade
Of course, this ^above^ is probably premature optimization, unless you need subsecond response time. (Your use case is recall-focused, so someone is clearly invested in reviewing thousands of hamming=8 matches which would take enough user-time/effort to render search performance less critical.)Lemonade
Re: Your latest update (minimum_should_match returns 100%): That's great!!! If you can find a way to push your Hamming scoring out to each Elasticsearch cluster node, you'll save on I/O and post-processing. Look at function_score queries using script_score. Groovy supports the XOR operator (^) and you can use Java's Integer.bitCount on the XOR output to give you the Hamming weight.Lemonade
I'm looking at the same problem, but this time starting with a hex string. So my script first parses the hex string into a BigInteger and then compares it with the incoming string using @PeterDixon-Moses's suggestion of XOR and then bitCount. It still takes far too long to process. In fact, this is the second scoring script I've tried with ES (I've previously tried many iterations of a euclidean distance one), and I've not managed to get one to scale yet. I'd like to question them as to how they expect these functions to scale elastically, but I can't seem to join the user group!Depravity
Spotify has an open source tool called annoy (github.com/spotify/annoy) that does this out of the box.Heppman

I have simulated and implemented a possible solution which avoids all expensive "fuzzy" queries. Instead, at index time you take N random samples of M bits out of those 64 bits. I guess this is an example of locality-sensitive hashing. For each document (and when querying), sample number x is always taken from the same bit positions, so hashing is consistent across documents.

Queries use term filters in the bool query's should clause with a relatively low minimum_should_match threshold. A lower threshold corresponds to higher "fuzziness". Unfortunately you need to re-index all your images to test this approach.

I think { "term": { "phash.0": true } } queries didn't perform well because on average each filter matches 50% of documents. With 16 bits / sample each sample matches 2^-16 = 0.0015% of documents.

I ran my tests with the following settings:

  • 1024 samples / hash (stored to doc fields "0" - "ff")
  • 16 bits / sample (stored to short type, doc_values = true)
  • 4 shards and 1 million hashes / index, about 17.6 GB of storage (could be minimized by not storing _source and samples, only the original binary hash)
  • minimum_should_match = 150 (out of 1024)
  • Benchmarked with 4 million docs (4 indexes)

You get faster speed and lower disk usage with fewer samples but documents between hamming distances of 8 and 9 aren't as well separated (according to my simulations). 1024 seems to be the maximum number of should clauses.

Tests were run on a single Core i5 3570K, 24 GB RAM, 8 GB for ES, version 1.7.1. Results from 500 queries (see notes below, results are too optimistic):

Mean time: 221.330 ms
Mean docs: 197

Percentiles:
   1st = 140.51ms
   5th = 150.17ms
  25th = 172.29ms
  50th = 207.92ms
  75th = 233.25ms
  95th = 296.27ms
  99th = 533.88ms

I'll test how this scales to 15 million documents but it takes 3 hours to generate and store 1 million documents to each index.

You should test or calculate how low you need to set minimum_should_match to get the desired trade-off between missed matches and incorrect matches; this depends on the distribution of your hashes.
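
One way to do that calculation is by simulation: flip k bits of a random hash and count how many samples still match, then pick a threshold that separates Hamming distance 8 from 9. A rough Python sketch (sample positions mirror the hypothetical indexing sketch above; real hash differences may be more correlated than random flips, so treat this as an estimate):

import random

def mean_surviving_samples(positions, distance, trials=1000, hash_bits=64, seed=0):
    # Average number of samples whose bit positions are untouched by `distance` random flips;
    # an untouched sample is one whose term clause would still match exactly.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        flipped = set(rng.sample(range(hash_bits), distance))
        total += sum(1 for pos in positions if not flipped.intersection(pos))
    return total / trials

rng = random.Random(42)
positions = [rng.sample(range(64), 16) for _ in range(1024)]
print(mean_surviving_samples(positions, 8))   # expected matching clauses at the target distance
print(mean_surviving_samples(positions, 9))   # and just beyond it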

Example query (3 out of 1024 fields shown):

{
  "bool": {
    "should": [
      {
        "filtered": {
          "filter": {
            "term": {
              "0": -12094,
              "_cache": false
            }
          }
        }
      },
      {
        "filtered": {
          "filter": {
            "term": {
              "_cache": false,
              "1": -20275
            }
          }
        }
      },
      {
        "filtered": {
          "filter": {
            "term": {
              "ff": 15724,
              "_cache": false
            }
          }
        }
      }
    ],
    "minimum_should_match": 150
  }
}

Edit: When I started doing further benchmarks I noticed that I had generated hashes for different indexes that were too dissimilar from each other, so searching across them resulted in zero matches. Newly generated documents result in about 150 - 250 matches / index / query and should be more realistic.

New results are shown in the graph below; I had 4 GB of memory for ES and the remaining 20 GB for the OS. Searching 1 - 3 indexes had good performance (median time 0.1 - 0.2 seconds), but searching more than this resulted in lots of disk IO and queries started taking 9 - 11 seconds! This could be circumvented by taking fewer samples of the hash, but then recall and precision rates wouldn't be as good; alternatively you could have a machine with 64 GB of RAM and see how far you get.

Percentiles of query times (in ms) for varying number of indexes searched.

Edit 2: I re-generated the data with _source: false and without storing hash samples (only the raw hash); this reduced storage space by 60% to about 6.7 GB / index (of 1 million docs). This didn't affect query speed on smaller datasets, but when RAM wasn't sufficient and the disk had to be used, queries were about 40% faster.

Percentiles of query times (in ms) for varying number of indexes searched.

Edit 3: I tested fuzzy search with an edit distance of 2 on a set of 30 million documents, and compared that to 256 random samples of the hash to get approximate results. Under these conditions the methods are roughly the same speed, but fuzzy gives exact results and doesn't need the extra disk space. I think this approach is only useful for "very fuzzy" queries like a Hamming distance greater than 3.

Perversion answered 7/10, 2015 at 17:27 Comment(3)
Nice! +1 for optimizing performance (with a tunable performance vs. recall solution no less!) Definitely agree that having 64 binary indexed fields doesn't help reduce the cardinality enough to yield good performance. Though the initial goal was 100% recall at Hamming Distance 8 regardless of performance, OP seems to be willing to sacrifice reasonable recall for speed.Lemonade
You're right that a good LSH projection is the happy medium here, and random samples are the best you could do without analyzing the dataset. But given that each feature bit represents some aspect of the image, it's a safe bet that the distribution of features is not random, and that intelligently grouping some of these features together based on frequency in the dataset could help reduce the cardinality of each filter operation (and reduce the number of filter operations to << 1024 to help performance).Lemonade
At the moment I am re-running these tests with _source: false and storing only the raw 64-bit hash (as string), sampled patterns are indexed but not stored. This reduces disk usage by about 50% and I am interested to see if it helps query performance or not. Also I'll run these with 256 samples and compare performance with built-in fuzziness search with edit distance of 2.Perversion

I also implemented the CUDA approach, with good results even on a laptop GeForce 650M graphics card. Implementation was easy with the Thrust library. I hope the code doesn't have bugs (I didn't test it thoroughly), but that shouldn't affect the benchmark results. At least I called thrust::system::cuda::detail::synchronize() before stopping the high-precision timer.

#include <cstdint>  // for uint32_t / uint64_t (replaces MSVC-specific typedefs)

// Classic SWAR popcount for 32-bit values.
// Maybe there is a simple 64-bit solution out there?
__host__ __device__ inline int hammingWeight(uint32_t v)
{
    v = v - ((v>>1) & 0x55555555);                 // bit counts per 2-bit pair
    v = (v & 0x33333333) + ((v>>2) & 0x33333333);  // bit counts per 4-bit nibble

    return ((v + (v>>4) & 0xF0F0F0F) * 0x1010101) >> 24;  // sum bytes, take the top byte
}

__host__ __device__ inline int hammingDistance(const uint64_t a, const uint64_t b)
{
    const uint64_t delta = a ^ b;
    return hammingWeight(delta & 0xffffffffULL) + hammingWeight(delta >> 32);
}

// Predicate used with thrust::copy_if: keep hashes within _maxDistance of _target.
struct HammingDistanceFilter
{
    const uint64_t _target, _maxDistance;

    HammingDistanceFilter(const uint64_t target, const uint64_t maxDistance) :
            _target(target), _maxDistance(maxDistance) {
    }

    __host__ __device__ bool operator()(const uint64_t hash) {
        return hammingDistance(_target, hash) <= _maxDistance;
    }
};

Linear searching was as easy as

thrust::copy_if(
    hashesGpu.cbegin(), hashesGpu.cend(), matchesGpu.begin(),
    HammingDistanceFilter(target_hash, maxDistance)
)

Searching was 100% accurate and way faster than my Elasticsearch answer: in 50 milliseconds CUDA could stream through 35 million hashes! I'm sure newer desktop cards are even faster than this. We also get very low variance and consistent linear growth of search time as we go through more and more data. Elasticsearch hit bad memory problems on larger queries due to the inflated sampling data.

So here I'm reporting results for "From these N hashes, find those which are within a Hamming distance of 8 from a single hash H". I ran these 500 times and report the percentiles.

Search performance

There is some kernel launch overhead, but once the search space is more than 5 million hashes the search speed is fairly consistent at 700 million hashes / second. Naturally, the upper bound on the number of hashes to be searched is set by the GPU's RAM.

Search performance

Update: I re-ran my tests on a GTX 1060 and it scans about 3800 million hashes per second :)

Perversion answered 27/10, 2015 at 8:28 Comment(4)
Do you have the rest of the code somewhere? Wondering what the performance is of initial loading of the dataset ...Lette
Actually yes, the main file is here: github.com/nikonyrh/stackoverflow-scripts/blob/master/… Naturally the code is a bit ad-hoc but it should be easy to follow. I imagine the most expensive step is loading the hashes from somewhere, be it from a file or a database. But anyway it should be very fast. Actually you could even use CUDA "unified memory" to allocate larger datasets than can fit in your GPU's RAM.Perversion
Well, that is a question of whether thrust::copy_if scales across GPUs. I see no reason why it wouldn't.Perversion
I am trying to see if I can use this for a python webscraping script such that it periodically queries the database of hashes with a query hash and can ultimately get the content inside matchesGpu. I was looking at using pycuda to compile and execute the CUDA code from within python but am unsure if that would require me to forego the use of the thrust API.. Do you have any knowledge on the subject or can provide me with any pointers?Lipstick

I've started on a solution to this myself. So far I've only tested against a data set of around 3.8 million documents, and I intend to push that up to tens of millions now.

My solution so far is this:

Write a native scoring function and register it as a plugin. Then call this when querying to adjust the _score value of documents as they come back.

As a Groovy script, the time taken to run the custom scoring function was extremely unimpressive, but writing it as a native scoring function (as demonstrated in this somewhat aged blog post: http://www.spacevatican.org/2012/5/12/elasticsearch-native-scripts-for-dummies/) was orders of magnitude faster.

My HammingDistanceScript looked something like this:

import java.math.BigInteger;
import java.util.Map;

import org.elasticsearch.index.fielddata.ScriptDocValues;
import org.elasticsearch.script.AbstractFloatSearchScript;

public class HammingDistanceScript extends AbstractFloatSearchScript {

    private String field;
    private String hash;
    private int length;

    public HammingDistanceScript(Map<String, Object> params) {
        super();
        field = (String) params.get("param_field");
        hash = (String) params.get("param_hash");
        if(hash != null){
            length = hash.length() * 4; // 4 bits per hex character
        }
    }

    // Returns (total bits - Hamming distance), so a higher score means more similar.
    private int hammingDistance(String lhs, String rhs){
        return length - new BigInteger(lhs, 16).xor(new BigInteger(rhs, 16)).bitCount();
    }

    @Override
    public float runAsFloat() {
        String fieldValue = ((ScriptDocValues.Strings) doc().get(field)).getValue();
        //Serious arse covering:
        if(hash == null || fieldValue == null || fieldValue.length() != hash.length()){
            return 0.0f;
        }

        return hammingDistance(fieldValue, hash);
    }
}

It's worth mentioning at this point that my hashes are hex-encoded binary strings. So, the same as yours, but hex-encoded to reduce storage size.
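
In case it's useful, converting between the 64-character binary format from the question and this hex encoding is straightforward; a small Python sketch (function names are mine):

def to_hex(binary_hash):
    # '1100...' (64 chars) -> 16-character hex string, preserving leading zeros
    return format(int(binary_hash, 2), "0{}x".format(len(binary_hash) // 4))

def to_binary(hex_hash):
    # 16-character hex string -> '1100...' (64 chars)
    return format(int(hex_hash, 16), "0{}b".format(len(hex_hash) * 4))

assert to_binary(to_hex("1" + "0" * 63)) == "1" + "0" * 63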

Also, I'm expecting a param_field parameter, which identifies which field value I want to do hamming distance against. You don't need to do this, but I'm using the same script against multiple fields, so I do :)

I use it in queries like this:

curl -XPOST 'http://localhost:9200/scf/_search?pretty' -d '{
  "query": {
    "function_score": {     
      "min_score": MY IDEAL MIN SCORE HERE,
      "query":{
       "match_all":{}
      },
      "functions": [
        {
          "script_score": {
            "script": "hamming_distance",
            "lang" : "native",
            "params": {
              "param_hash": "HASH TO COMPARE WITH",
              "param_field":"phash"
            }
          }
        }
      ]
    }
  }
}'

I hope this helps in some way!

Other information that may be useful to you if you go this route:

1. Remember the es-plugin.properties file
This has to be compiled into the root of your jar file (if you stick it in /src/main/resources and then build your jar, it'll go in the right place).

Mine looked like this:

plugin=com.example.elasticsearch.plugins.HammingDistancePlugin
name=hamming_distance
version=0.1.0
jvm=true
classname=com.example.elasticsearch.plugins.HammingDistancePlugin
java.version=1.7
elasticsearch.version=1.7.3

2. Reference your custom NativeScriptFactory impl in elasticsearch.yml
Just like in the aged blog post.

Mine looked like this:

script.native:
    hamming_distance.type: com.example.elasticsearch.plugins.HammingDistanceScriptFactory

If you don't do this, it still shows up on the plugins list (see later) but you'll get errors when you try to use it saying that elasticsearch can't find it.

3. Don't bother using the elasticsearch plugin script to install it
It's just a pain in the ass, and all it seems to do is unpack your stuff - a bit pointless. Instead, just stick it in %ELASTICSEARCH_HOME%/plugins/hamming_distance and restart Elasticsearch.

If all has gone well, you'll see it being loaded on elasticsearch startup:

[2016-02-09 12:02:43,765][INFO ][plugins                  ] [Junta] loaded [mapper-attachments, marvel, knapsack-1.7.2.0-954d066, hamming_distance, euclidean_distance, cloud-aws], sites [marvel, bigdesk]

AND when you call the list of plugins it'll be there:

curl http://localhost:9200/_cat/plugins?v

produces something like:

name        component                version type url
Junta       hamming_distance         0.1.0   j

I'm expecting to be able to test against upwards of tens-of-millions of documents within the next week or so. I'll try and remember to pop back and update this with the results, if it helps.

Depravity answered 9/2, 2016 at 13:58 Comment(4)
I understand that you have O(n) algorithm with a very fast processing speed. This looks like a perfect solution working as a post processing tool to our original fuzzy query method. Maybe you can combine them both? You would only need to process a fraction of those millions of records - speed should be great.Saxony
@Saxony To be clear: do you mean use the fuzziness query on tokenized hashes, to clear up those results that are way out, and then clean up the rest using this scoring script?Depravity
@Saxony PS: At the moment this scoring script returns in ~700ms for ~3.8 million results.Depravity
@Saxony With a hash of 64 bits, I can get results in around 1 second from 31m documents using a combo of fuzziness and native scoring scripts on elasticsearch 2.3.1. With simple scoring script, 15 seconds for the first query, and 5 thereafter (looks like es is caching something)Depravity

The recently proposed FENSHSES method from [1] seems to be the state-of-the-art approach for conducting r-neighbor search in Hamming space on Elasticsearch.

[1] Mu, C., Zhao, J., Yang, G., Yang, B. and Yan, Z., 2019, October. Fast and Exact Nearest Neighbor Search in Hamming Space on Full-Text Search Engines. In International Conference on Similarity Search and Applications (pp. 49-56). Springer, Cham.

Teage answered 26/12, 2019 at 3:44 Comment(0)

Here's an inelegant, but exact (brute force) solution that requires deconstructing your feature hash into individual boolean fields so you can run a query like this:

"query": {
    "bool": {
      "minimum_should_match": -8,
      "should": [
          { "term": { "phash.0": true } },
          { "term": { "phash.1": false } },
          ...
          { "term": { "phash.63": true } }
        ]
    }
}

I'm not sure how this will perform vs. fuzzy_like_this, but the reason the FLT implementation is being deprecated is that it has to visit every term in the index to compute edit-distance.

(whereas here/above, you are leveraging Lucene's underlying inverted-index data-structure and optimized set operations which should work to your advantage given you probably have fairly sparse features)
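
A minimal Python sketch of how the full 64-clause query body could be generated from a hash string (the phash.N field naming follows the example above; this assumes one boolean sub-field per bit in your mapping):

def build_bitwise_query(hash_str, max_distance=8):
    # A negative minimum_should_match of -max_distance means "all optional clauses
    # except at most max_distance must match", i.e. Hamming distance <= max_distance.
    should = [{"term": {"phash.{}".format(i): bit == "1"}} for i, bit in enumerate(hash_str)]
    return {
        "query": {
            "bool": {
                "minimum_should_match": -max_distance,
                "should": should,
            }
        }
    }

query_body = build_bitwise_query("1" + "0" * 63)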

Lemonade answered 29/9, 2015 at 15:18 Comment(8)
You inspired another strategy, which we are going to try out. Instead of creating 64 fields, just add one field which stores all 64 states in one string: 001 010 020 031 040 ... 631 (position + state) which is equivalent to your suggestion, but can differ in performance.Saxony
Sure. If you want to optimize that further, you could just store the set bits as integers (array of values from 1-64) and use negation (the not query) to test for unset bits. That way you can reduce collection size and get away from string-comparison of numeric values.Lemonade
Hey, thank you for the support! We did an integer array: the 43rd bit's value was 42 if set, and +100 (namely 142) if not set. So no negation was necessary, only a simple terms query of 64 values. However, performance was very bad; it took 5-20 s just for the response. Any insight on optimising that?Saxony
Glad this answer inspired a workable solution! No issues with recall I hope? (Getting back 100% of your Hamming-Distance 8 images?)Lemonade
There are a number of ways to explore improving performance. Would you be willing to open up another post with more details around your collection size, and cluster configuration (number of nodes/replicas, type of hardware (memory/cpu), type of storage (SSD vs spinning disk vs NAS), etc...)Lemonade
For the record, we have also tried this exact 64 boolean field solution, and performance is mostly the same. Above 5 seconds for 15 million hashes.Saxony
@Saxony Did you try storing the set bits only in a positional string (eg: 00 03 ... 63) and using the whitespace tokenizer in conjunction with minimum_should_match? Was that what was giving you 5-20s response times?Depravity
@Saxony What about if you store set bits (or unset and go the not route) as individual fields, rather than array, eg: phash_00: 1, phash_03:1 ... phash_63: 1 and in your query looking for only set bits with a minimum_should_match? Does this increase speed as for lots of results they won't even HAVE those set bit fields?Depravity

I have used @ndtreviv's answer as a starting point. Here are my notes for ElasticSearch 2.3.3:

  1. es-plugin.properties file is now called plugin-descriptor.properties

  2. You do not reference NativeScriptFactory in elasticsearch.yml; instead, you create an additional class next to your HammingDistanceScript.


import org.elasticsearch.common.Nullable;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.script.ExecutableScript;
import org.elasticsearch.script.NativeScriptFactory;
import org.elasticsearch.script.ScriptModule;

import java.util.Map;

public class StringMetricsPlugin extends Plugin {
    @Override
    public String name() {
        return "string-metrics";
    }

    @Override
    public  String description() {
        return "";
    }

    public void onModule(ScriptModule module) {
        module.registerScript("hamming-distance", HammingDistanceScriptFactory.class);
    }

    public static class HammingDistanceScriptFactory implements NativeScriptFactory {
        @Override
        public ExecutableScript newScript(@Nullable Map<String, Object> params) {
            return new HammingDistanceScript(params);
        }
        @Override
        public boolean needsScores() {
            return false;
        }
    }
}
  3. Then reference this class in your plugin-descriptor.properties file:

plugin=com.example.elasticsearch.plugins.StringMetricsPlugin
name=string-metrics
version=0.1.0
jvm=true
classname=com.example.elasticsearch.plugins.StringMetricsPlugin
java.version=1.8
elasticsearch.version=2.3.3
  4. You query by supplying the name you used in this line in step 2: module.registerScript("hamming-distance", HammingDistanceScriptFactory.class);

Hope this helps the next poor soul that has to deal with the shitty ES docs.

Neutralism answered 20/6, 2016 at 14:8 Comment(3)
Not yet. But I still consider this an interim solution because it is O(n). Will have to come up with something smarter for production later.Neutralism
Ha! I literally just updated our plugins to work with 2.3.1 and had to go through this. I wish I'd read this first! Ah well. For what it's worth, I've open-sourced the plugins we're using, although I haven't committed the changes for ES 2.x yet. github.com/CameraForensics/elasticsearch-pluginsDepravity
PS: As below, with my native plugin I can get results back on 31m docs within 15 seconds the first time, and 5 secs thereafter. Using a fuzziness filter first to limit the result set before scoring, I'm getting results back in less than a second in some cases. In most, somewhere around 1-2 seconds. That's on a 64bit hash though.Depravity

Here is a 64-bit solution for @NikoNyrh's answer. The Hamming distance can be calculated by just using the XOR operator with CUDA's built-in __popcll function.

struct HammingDistanceFilter
{
    const uint64_t _target, _maxDistance;

    HammingDistanceFilter(const uint64_t target, const uint64_t maxDistance) :
            _target(target), _maxDistance(maxDistance) {
    }

    __device__ bool operator()(const uint64_t hash) {
        return __popcll(_target ^ hash) <= _maxDistance;
    }
};
Timmy answered 8/1, 2018 at 18:42 Comment(0)
