Locality-sensitive hashing - Elasticsearch

Is there a plugin that enables LSH (locality-sensitive hashing) on Elasticsearch? If so, could you point me to it and briefly explain how to use it? Thanks.

Edit: I found out that there is a MinHash plugin for ES. How can I compare documents to one another with it? What would be a good setting for finding duplicates?

Hypso answered 25/9, 2015 at 8:8 Comment(0)
  1. There is an Elasticsearch MinHash plugin. You can use it to compute a MinHash value each time you index a document, and then query documents by that MinHash value later.

    1. Install MinHash plugin:

      $ $ES_HOME/bin/plugin install org.codelibs/elasticsearch-minhash/2.3.1
      
    2. Add a minhash analyzer when creating your index:

      $ curl -XPUT 'localhost:9200/my_index' -d '{
        "index":{
          "analysis":{
            "analyzer":{
              "minhash_analyzer":{
                "type":"custom",
                "tokenizer":"standard",
                "filter":["minhash"]
              }
            }
          }
        }
      }'  
      
    3. Add a minhash_value field to the index mapping:

      $ curl -XPUT "localhost:9200/my_index/my_type/_mapping" -d '{
        "my_type":{
          "properties":{
            "message":{
              "type":"string",
              "copy_to":"minhash_value"
            },
            "minhash_value":{
              "type":"minhash",
              "minhash_analyzer":"minhash_analyzer"
            }
          }
        }
      }'
      
    4. The MinHash value is calculated automatically when you add a document to the index you created with the minhash analyzer.
    5. a. A More Like This query can be used to do a "like" search on the minhash_value field:

      GET /_search
      {
          "query": {
              "more_like_this" : {
                  "fields" : ["minhash_value"],
                  "like" : "KV5rsUfZpcZdVojpG8mHLA==",
                  "min_term_freq" : 1,
                  "max_query_terms" : 12
              }
          }
      }
      

      b. You can also use a fuzzy query, but it only matches values that differ from the query by an edit distance of at most 2:

      GET /_search
      {
          "query": {
             "fuzzy" : { "minhash_value" : "KV5rsUfZpcZdVojpG8mHLA==" }
          }
      } 
      

      You can find more about the fuzzy query in the Elasticsearch documentation.

  2. Or you can compute the hash value outside of Elasticsearch: write code that extracts the hash value, run it every time you index a document, and attach the result to the document. You can then search on the hash value with a More Like This query or a fuzzy query, as described above.
  3. Last but not least, you can write an Elasticsearch plugin yourself, like the one above but using the hashing algorithm that suits you, and follow the same steps.
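
For option 2, here is a minimal sketch of computing a MinHash value outside of Elasticsearch and comparing two such values. It is an illustration under assumptions, not the plugin's algorithm: the function names (`shingles`, `minhash_bits`, `encode`, `estimated_jaccard`), the 3-word shingles, and the BLAKE2b-based hash family are all my own choices, so the output is not byte-compatible with what the MinHash plugin stores. It keeps one bit per hash function and uses 128 hash functions, which gives a 16-byte value that base64-encodes to a 24-character string.

```python
import base64
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """64-bit hash of token; the seed selects one member of the hash family.

    Assumption: salted BLAKE2b stands in for whatever hash family you prefer.
    """
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_bits(text, num_hashes=128, k=3):
    """1-bit MinHash: keep the lowest bit of the minimum hash for each seed."""
    if num_hashes % 8 != 0:
        raise ValueError("num_hashes must be a multiple of 8")
    tokens = shingles(text, k)
    if not tokens:
        raise ValueError("text contains no tokens")
    signature = 0
    for seed in range(num_hashes):
        minimum = min(_hash(seed, token) for token in tokens)
        signature = (signature << 1) | (minimum & 1)
    return signature.to_bytes(num_hashes // 8, "big")


def encode(signature):
    """Base64-encode a signature so it can be stored as a string field."""
    return base64.b64encode(signature).decode("ascii")


def estimated_jaccard(a_b64, b_b64):
    """Estimate Jaccard similarity from two base64 1-bit MinHash values."""
    a, b = base64.b64decode(a_b64), base64.b64decode(b_b64)
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    match_rate = 1 - differing / (len(a) * 8)
    # With one bit per hash, unrelated bits still agree about half the time:
    # match_rate ~ J + (1 - J) / 2, hence J ~ 2 * match_rate - 1.
    return max(0.0, 2 * match_rate - 1)
```

You would call `encode(minhash_bits(text))` before indexing and send the result as an extra field of the document (for example in a plain string/keyword field, which is an assumption on my part). To decide whether two documents are duplicates, fetch both values and compare them with `estimated_jaccard`; the threshold is something you have to tune on your own data.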
Metalepsis answered 21/12, 2016 at 2:39 Comment(5)
Can only use fuzzy queries on keyword and text fields - not on [minhash_value] which is of type [minhash]. - Boche
Even more_like_this queries do not work; they only support text and keyword fields. Is there any workaround for this? - Sparkman
Is there still no solution for this? - Zambrano
There is a "copy_bits_to" option, but I can't get it to show me that field? - Zambrano
How do you create the hash value outside of Elasticsearch? - Alcyone
