Is it possible to tell ElasticSearch to use "best match" of all grams instead of using grams as synonyms?
By default, ElasticSearch treats grams as synonyms and returns poorly matching documents. An example shows it best. Let's say we have two people in the index:
alice wang
sarah kerry
We search for ali12345:
{
query: {
bool: {
should: {
match: { name: 'ali12345' }
}
}
}
}
and it will return alice wang.
How is that possible? Because by default ElasticSearch uses grams as synonyms, so the document is matched even if just one gram matches.
If you inspect the query, you'll see that it treats the grams as synonyms:
...
"explanation": {
"value": 5.274891,
"description": "weight(Synonym(name: ali name:li1 name:i12 name:123 name:234 name:345 ) in 0) [PerFieldSimilarity], result of:",
...
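To make the overlap concrete, here is a minimal sketch in plain JavaScript that mimics what trigram_analyzer does (a keyword tokenizer followed by a 3-character ngram filter). The helper names trigrams and sharedGrams are made up for this illustration, and the code only approximates the analyzer on the client side; it does not call ElasticSearch.

```javascript
// Approximates trigram_analyzer: the keyword tokenizer keeps the whole input
// as one token, then the ngram filter (min_gram = max_gram = 3) slices it.
function trigrams(text) {
  const grams = [];
  for (let i = 0; i + 3 <= text.length; i++) {
    grams.push(text.slice(i, i + 3));
  }
  return grams;
}

// Hypothetical helper: grams of the query that also occur in the document.
function sharedGrams(query, doc) {
  const docGrams = new Set(trigrams(doc));
  return trigrams(query).filter((gram) => docGrams.has(gram));
}

const queryGrams = trigrams('ali12345');
const shared = sharedGrams('ali12345', 'alice wang');
console.log(queryGrams); // [ 'ali', 'li1', 'i12', '123', '234', '345' ]
console.log(shared);     // [ 'ali' ]
```

Only one of the six query grams occurs in alice wang, yet that single overlap is enough for the synonym-style query to match.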
I wonder if it's possible to tell it to use a "best match" query instead, to achieve something like:
{
query: {
bool: {
should: [
{ term: { name: 'ali' }},
{ term: { name: 'li1' }},
{ term: { name: 'i12' }},
{ term: { name: '123' }},
{ term: { name: '234' }},
{ term: { name: '345' }},
],
minimum_should_match: '75%'
}
}
}
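Built by hand, such a query could be generated roughly like this. It is only a sketch: buildBestMatchQuery is a hypothetical helper, the gram splitting is re-implemented on the client and merely assumed to match what trigram_analyzer produces, and the field name is passed in as a parameter.

```javascript
// Hypothetical helper: one term clause per trigram, combined with
// minimum_should_match so that most grams have to match.
function buildBestMatchQuery(field, text, minimumShouldMatch = '75%') {
  const grams = new Set(); // de-duplicates repeated grams
  for (let i = 0; i + 3 <= text.length; i++) {
    grams.add(text.slice(i, i + 3));
  }
  return {
    query: {
      bool: {
        should: [...grams].map((gram) => ({ term: { [field]: gram } })),
        minimum_should_match: minimumShouldMatch
      }
    }
  };
}

const body = buildBestMatchQuery('name', 'ali12345');
console.log(JSON.stringify(body, null, 2));
```

With six clauses, 75% works out to 4.5, which ElasticSearch rounds down to 4 required matches, so a document that shares a single gram with the input would no longer be returned.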
Questions:
It's of course possible to generate this query manually, but then you have to apply the ngram parsing and the rest of the analyzer pipeline manually as well. So I wonder whether ElasticSearch could do this itself?
What would the performance of such a query be for a long string, when there are tens of grams/terms? Will it use some smart optimisations, as it does when searching for similar documents (see more_like_this), where it doesn't use all the terms but only the terms with the highest tf-idf?
P.S.
The index configuration
{
mappings: {
object: {
properties: {
name: {
type: 'text',
analyzer: 'trigram_analyzer'
}
}
}
},
settings: {
analysis: {
filter: {
trigram_filter: { type: 'ngram', min_gram: 3, max_gram: 3 }
},
analyzer: {
trigram_analyzer: {
type: 'custom',
tokenizer: 'keyword',
filter: [ 'trigram_filter' ]
}
}
}
}
}
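To double-check what the analyzer emits for a given string, the _analyze API can be called against the index. The sketch below assumes Node 18+ (for the global fetch), a cluster reachable at http://localhost:9200 and an index called people; the URL, the index name and the helper names are placeholders and not part of the setup above.

```javascript
// Builds the request for the _analyze API; baseUrl and index are placeholders.
function buildAnalyzeRequest(baseUrl, index, text) {
  return {
    url: `${baseUrl}/${encodeURIComponent(index)}/_analyze`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ analyzer: 'trigram_analyzer', text })
    }
  };
}

// Returns the emitted tokens together with their positions.
async function analyze(baseUrl, index, text) {
  const { url, options } = buildAnalyzeRequest(baseUrl, index, text);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`_analyze failed with HTTP ${response.status}`);
  }
  const { tokens } = await response.json();
  return tokens.map(({ token, position }) => ({ token, position }));
}

// Example call (needs a running cluster):
// analyze('http://localhost:9200', 'people', 'ali12345').then(console.log);
```

The position of each token is the part worth looking at: as far as I understand, tokens that share a position are the ones a match query groups together as synonyms.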
match query approach you are using? – Threepiece
match would find alice wang as a match for the ali12345 query. Which is clearly wrong. Also (although I'm not sure about that) the relevance calculated in a similar broken way. – Glede