Elasticsearch: Levenshtein sorting
Asked Answered
C

2

16

I have a query that works sufficiently, but I want to sort the results of this by using levenshtein between the query param and the field in question.

Right now I'm doing the query in ES and then I do the sorting in my application. Right now I'm testing the script field in sort. This is the script

import  org.elasticsearch.common.logging.*;
ESLogger logger = ESLoggerFactory.getLogger('levenshtein_script');

def str1 = '%s'.split(' ').sort().join(' ');
def str2 = doc['%s'].values.join(' '); //Needed since the field is analyzed. This will change when I reindex the data.
def dist = new int[str1.size() + 1][str2.size() + 1]
(0..str1.size()).each { dist[it][0] = it }
(0..str2.size()).each { dist[0][it] = it }
(1..str1.size()).each { i ->
   (1..str2.size()).each { j ->
       dist[i][j] = [dist[i - 1][j] + 1, dist[i][j - 1] + 1, dist[i - 1][j - 1] + ((str1[i - 1] == str2[j - 1]) ? 0 : 1)].min()
   }
}
def result = dist[str1.size()][str2.size()]
logger.info('Query param: ['+str1+'] | Term: ['+str2+'] | Result: ['+result+']');
return result;

Basically this is a template (check the %s) that I fill in my application like this

sortScript = String.format(EDIT_DISTANCE_GROOVY_FUNC, fullname, FULLNAME_FIELD_NAME);

The problem is this http://code972.com/blog/2015/03/84-elasticsearch-one-tip-a-day-avoid-costly-scripts-at-all-costs. Which is understandable.

My question is, how can I do what I need (sort the results by levenshtein) inside elasticsearch so I can avoid the overhead in my application. Can I use lucene expressions for this? Do you have an example? Is there some other way that I can accomplish this?

I'm using ElasticSearch 1.7.5 as a service. So native plugins should not be the first solution (I don't know even if it's possible, I'll have to check with my provider, but if it's the only viable solution I will do just that).

UPDATE

So it seems a good solution would be to save it in config/scripts folder as it will be compiled once https://www.elastic.co/blog/running-groovy-scripts-without-dynamic-scripting. The script can be indexed instead of saving it https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html . This is much more convenient for my use case. Does this have the same behaviour regarding the compilation of the script? Will it be compiled only once?

Character answered 12/9, 2016 at 3:41 Comment(6)
If the only requirement is to sort the results by Levenshtein distance, you can convert your query into a Fuzzy search. elastic.co/guide/en/elasticsearch/reference/current/…. It uses Damerau-Levenshtein by default, but can be switched to Classic Levenshtein by settingCochise
Out of curiosity, when you say you want to remove the overhead from your application - is it because its slow? or you want to not have additional search logic in your application code?Cochise
Somehow my first comment was posted before I finished - You can set it to classic Levenshtein by setting transpositions to false. More details here: elastic.co/blog/found-fuzzy-searchCochise
@Cochise thnx for the response. It's true about the algorithm, but for the scoring part it uses tf/idf too, in addition to Damerau-Levenshtein. No it's not slow, I just wanted to see if I can remove the additional logic from my application.Character
You could try to use Elastic fuzzy query for matching the documents and maybe use rescore query for sorting the top n hits using a groovy script file.Photobathic
@ArchitSaxena Thank you for your help. I've thought about rescore, but for my use case it is a bit of an overkill.Character
D
3

It's important to note that Groovy is deprecated in Elasticsearch 5.x and it will be removed in Elasticsearch 6.0. You will either want to look at using Painless scripting to replace this functionality or create a native Java script that possibly uses Lucene's LuceneLevenshteinDistance to do this for you.

Your script is also pretty scary in that it adds a number of loops (mostly hidden by Groovy helpers) and potentially large memory allocations into the mix. I have serious doubt's about its performance at scale.

I also noticed the presence of %s in the script, which I assume means that your own code replaces the field name dynamically. You should always use params for this purpose, then use the parameter as a variable in the script. This avoids having to compile a version of a script per field name. (I expect you had to do this to make it file-based)

Does this have the same behaviour regarding the compilation of the script?

Yes, file-based scripts are the most secure (because they require access to the machine itself to install). File-based scripts are compiled, just like inline and index-based scripts.

The downside to file-based scripts is that you need to add them to every node. Not on that, but every node needs the same version of the script. This means that, if you ever choose to update it, then it's better to add a new script and reference it, rather than to replace it.

File-based scripts are picked up every 60 seconds by default.

Will it be compiled only once?

Yes, per node.

Daysidayspring answered 10/1, 2017 at 20:24 Comment(0)
A
1
PUT _scripts/levenshtein
{
  "script": {
    "lang": "painless",
    "source": """
      String a = params.a;
      String b = params.b;

      int[][] dp = new int[a.length() + 1][b.length() + 1];

      for (int i = 0; i <= a.length(); i++) {
        for (int j = 0; j <= b.length(); j++) {
          if (i == 0) {
            dp[i][j] = j;
          } else if (j == 0) {
            dp[i][j] = i;
          } else {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            int deletion = dp[i - 1][j] + 1;
            int insertion = dp[i][j - 1] + 1;
            int substitution = dp[i - 1][j - 1] + cost;
            int tempMin = (int) Math.min(deletion, insertion);
            dp[i][j] = (int) Math.min(tempMin, substitution);
          }
        }
      }

      int distance = dp[a.length()][b.length()];
      int maxLength = (int) Math.max(a.length(), b.length());
      return maxLength > 0 ? 1.0 - ((double) distance / (double) maxLength) : 0.0;
    """
  }
}

Then call it with

GET /my_index/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "id": "levenshtein",
        "params": {
          "a": "kitten",
          "b": "doc['text_field'].value"
        }
      }
    }
  }
}

or use it as a field

GET /my_index/_search
{
  "query": {
        "match_all": {}
      },
    "script_fields": {
      "leven": {
        "script": {
          "id":"levenshtein",
          "params": {
            "a":"kitten xyz",
            "b": "doc['text_field'].value"
          }
        }
      }
    }
}
Anatomize answered 13/6 at 3:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.