ElasticSearch use "best match" of ngram terms instead of "synonym"?
G

2

9

Is it possible to tell ElasticSearch to use "best match" of all grams instead of using grams as synonyms?

By default ElasticSearch uses grams as synonyms and returns poorly matching documents. It is easier to show with an example. Say we have two people in the index:

alice wang
sarah kerry

We search for ali12345:

{
  query: {
    bool: {
      should: {
        match: { name: 'ali12345' }
      }
    }
  }
}

and it will return alice wang.

How is that possible? By default ElasticSearch treats the grams as synonyms, so the document matches even if just one gram matches.

If you inspect the query, you'll see that it treats the grams as synonyms:

...
"explanation": {
  "value": 5.274891,
  "description": "weight(Synonym(name: ali name:li1 name:i12 name:123 name:234 name:345 ) in 0) [PerFieldSimilarity], result of:",
...

I wonder if it's possible to tell it to use a "best match" query instead, to achieve something like:

{
  query: {
    bool: {
      should: [
        { term: { name: 'ali' }},
        { term: { name: 'li1' }},
        { term: { name: 'i12' }},
        { term: { name: '123' }},
        { term: { name: '234' }},
        { term: { name: '345' }},
      ],
      minimum_should_match: '75%'
    }
  }
}
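
For reference, the manual workaround could be sketched roughly as below. The helpers ngrams and bestMatchQuery are hypothetical names, not part of any ElasticSearch client, and the gram splitting only approximates the trigram_analyzer from the index configuration (whole string, fixed length 3).

```javascript
// Hypothetical helper: split a string into overlapping n-grams, approximating
// a keyword tokenizer followed by an ngram filter with min_gram = max_gram = n.
function ngrams(text, n = 3) {
  const grams = [];
  for (let i = 0; i + n <= text.length; i++) {
    grams.push(text.slice(i, i + n));
  }
  return grams;
}

// Hypothetical helper: one term clause per gram, with a minimum share
// of clauses that have to match.
function bestMatchQuery(field, text, minimumShouldMatch = '75%') {
  return {
    query: {
      bool: {
        should: ngrams(text).map((gram) => ({ term: { [field]: gram } })),
        minimum_should_match: minimumShouldMatch
      }
    }
  };
}

console.log(JSON.stringify(bestMatchQuery('name', 'ali12345'), null, 2));
```

The drawback is the one raised in question 1 below: the gram splitting has to be kept in sync with the analyzer by hand.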

Questions:

  1. Of course it's possible to generate this query manually, but then you have to reproduce the ngram parsing and the rest of the analyzer pipeline manually. Can ElasticSearch do it instead?

  2. How would such a query perform for a long string, when there are tens of grams/terms? Would it use smart optimisations like those used when searching for similar documents (see more_like_this), which picks only the terms with the highest tf-idf rather than all of them?
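
For comparison, a more_like_this query against the same field might be sketched as follows. The parameter values are illustrative assumptions, and whether it behaves as a trigram "best match" with this analyzer is not verified here.

```javascript
// Sketch only: values below are assumptions chosen for illustration.
const query = {
  query: {
    more_like_this: {
      fields: ['name'],
      like: 'ali12345',
      min_term_freq: 1,            // keep grams that occur once in the input
      min_doc_freq: 1,             // keep grams that occur in only one document
      max_query_terms: 25,         // upper bound on the number of selected terms
      minimum_should_match: '75%'  // share of selected terms that must match
    }
  }
};

console.log(JSON.stringify(query, null, 2));
```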

P.S.

The index configuration

{
  mappings: {
    object: {
      properties: {
        name: {
          type:     'text',
          analyzer: 'trigram_analyzer'
        }
      }
    }
  },

  settings: {
    analysis: {
      filter: {
        trigram_filter: { type: 'ngram', min_gram: 3, max_gram: 3 }
      },
      analyzer: {
        trigram_analyzer: {
          type:        'custom',
          tokenizer:   'keyword',
          filter:      [ 'trigram_filter' ]
        }
      }
    }
  }
}
Glede answered 9/12, 2017 at 13:17 Comment(8)
What are you trying to do actually? What is not ok with the current match query approach you are using? - Threepiece
@AndreiStefan the default match would find alice wang as a match for the ali12345 query, which is clearly wrong. Also (although I'm not sure about that) the relevance is calculated in a similarly broken way. - Glede
It finds alice wang for ali12345 because of ngrams. If you don't want ali12345 to match, why use ngrams then? - Threepiece
@AndreiStefan there are lots of ways to use ngrams. I want something similar to cosine similarity. - Glede
Have you looked at scripted similarity? elastic.co/guide/en/elasticsearch/reference/current/… I am not familiar with cosine similarity and would have to do a bit of reading about it, for which I don't have time now. Pointing out the scripted similarity here, in case it helps you. - Threepiece
@AndreiStefan thanks for the link, but scripted similarity would be too slow. The workaround in my post actually solves my issue, I just wondered if there's a better option. And as I pointed out, ElasticSearch actually uses a kind of "trigram best match" approach when it searches for similar documents. - Glede
I am facing a similar problem. Any updates on solving this issue? - Protolanguage
Y
2

Perhaps you have already found the reason, but ali12345 matches alice wang because the analyzer used at search time is the same one used at index time, including the ngrams.

Such that:

At index time: for text alice wang, these terms are created [ali, lic, ice, ...]

At search time: for text ali12345, these terms are created [ali, li1, i12, ...]

As we can see, there is a match on the term ali.
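
The overlap can be reproduced with a small sketch. The trigrams helper is hypothetical and only approximates what the trigram_analyzer from the question does (whole string, grams of length 3).

```javascript
// Hypothetical helper: trigrams of the whole string, as a keyword tokenizer
// followed by an ngram filter (min_gram = max_gram = 3) would roughly produce.
function trigrams(text) {
  const grams = [];
  for (let i = 0; i + 3 <= text.length; i++) {
    grams.push(text.slice(i, i + 3));
  }
  return grams;
}

const indexed = new Set(trigrams('alice wang')); // terms created at index time
const searched = trigrams('ali12345');           // terms created at search time
const shared = searched.filter((gram) => indexed.has(gram));

console.log(shared);                              // [ 'ali' ]
console.log(`${shared.length} of ${searched.length} search grams are shared`);
```

Only one of the six search grams is shared, yet that single term is enough for the document to match.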

To avoid this problem, ElasticSearch lets you specify a different analyzer for search time. In the mapping for the field name you can add another property, search_analyzer, which is normally very similar to the main analyzer but without the ngram token filter. This prevents [ali, li1, i12] from being generated during search analysis, resulting in 0 matches for alice wang.
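
A sketch of what that mapping could look like, reusing the structure from the question. Picking the built-in keyword analyzer for search time is an assumption made here: it mirrors trigram_analyzer (keyword tokenizer) minus the ngram filter.

```javascript
// Sketch only: same field as in the question, with a separate search-time analyzer.
const mappings = {
  object: {
    properties: {
      name: {
        type: 'text',
        analyzer: 'trigram_analyzer', // index time: whole string split into trigrams
        search_analyzer: 'keyword'    // search time (assumed choice): no ngram filter
      }
    }
  }
};

console.log(JSON.stringify({ mappings }, null, 2));
```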

Feel free to look into more details and explanations on this page: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html

Yon answered 2/10, 2020 at 17:19 Comment(0)
C
1

I know this question is old, but just in case...

You should be able to use the minimumShouldMatch clause (minimum_should_match in the JSON query DSL) on the trigram query to specify how many trigrams must match for a record to be considered a hit. You could use something like "3<75%", which means "if there are 3 or fewer trigrams, then 100% must match; if there are 4 or more trigrams, then 75% must match".
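
As a rough illustration, the sketch below shows the parameter on a match query and how a spec like "3<75%" translates into a clause count. The requiredMatches helper is hypothetical, handles only the "N<P%" form, and assumes percentages are rounded down. It also assumes the grams end up as separate optional clauses; the Synonym(...) explanation in the question suggests that with this analyzer they may be folded into one clause, in which case there would be nothing for the parameter to count.

```javascript
// match query carrying the conditional minimum_should_match spec
const query = {
  query: {
    match: {
      name: { query: 'ali12345', minimum_should_match: '3<75%' }
    }
  }
};

// Hypothetical helper: number of clauses that must match for an "N<P%" spec.
function requiredMatches(spec, clauseCount) {
  const m = /^(\d+)<(\d+)%$/.exec(spec);
  if (!m) throw new Error(`unsupported spec: ${spec}`);
  const threshold = Number(m[1]);
  const percent = Number(m[2]);
  // At or below the threshold, every clause is required.
  if (clauseCount <= threshold) return clauseCount;
  // Above it, the percentage applies, rounded down (assumption).
  return Math.floor((clauseCount * percent) / 100);
}

console.log(requiredMatches('3<75%', 3)); // 3
console.log(requiredMatches('3<75%', 6)); // 4
```

For the six trigrams of ali12345 this would require four matching grams, so a single shared gram such as ali would no longer be enough.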

Carbide answered 28/10, 2019 at 8:19 Comment(0)
