Elasticsearch spell check suggestions even if first letter missed

I create an index like this:

curl --location --request PUT 'http://127.0.0.1:9200/test/' \
--header 'Content-Type: application/json' \
--data-raw '{
    "settings" : {
        "number_of_shards" : 1
    },
    "mappings" : {
        "properties" : {
            "word" : { "type" : "text" }
        }
    }
}'

Then I create a document:

curl --location --request POST 'http://127.0.0.1:9200/test/_doc/' \
--header 'Content-Type: application/json' \
--data-raw '{ "word":"organic" }'

And finally, search with an intentionally misspelled word:

curl --location --request POST 'http://127.0.0.1:9200/test/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
  "suggest": {
    "001" : {
      "text" : "rganic",
      "term" : {
        "field" : "word"
      }
    }
  }
}'

The word 'organic' has lost its first letter, and ES never returns suggestion options for this misspelling (it works absolutely fine for other misspellings such as 'orgnic', 'oragnc' and 'organi'). What am I missing?

Deviation answered 30/12, 2019 at 7:54 Comment(3)
I think this is happening because of the prefix_length parameter: elastic.co/guide/en/elasticsearch/reference/current/… . It defaults to 1, i.e. at least 1 letter from the beginning of the term has to match. I don't yet have an answer for how to do what you want, or whether it's possible with ES's suggest feature; I'll make this a full answer once I know. - Saskatoon
The steps to reproduce work perfectly except for 1 character in the index name of your final POST, the search action itself. The target URL is http://127.0.0.1:9200/test1/_search but you set up index test throughout. I believe you meant to target index test rather than test1 with the search. - Saskatoon
You're right, Emanuil, I've corrected it. - Deviation

This is happening because of the term suggester's prefix_length parameter: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html . It defaults to 1, i.e. the first letter of the input has to match the first letter of a term for that term to become a suggestion candidate. Since 'rganic' starts with 'r', 'organic' is never considered. You can set prefix_length to 0, but this has performance implications, because the suggester can no longer use a fixed prefix to narrow down the candidate terms. Only your hardware, your setup and your dataset can show you what the cost will be in practice, i.e. try it :). However, be careful: the Elasticsearch and Lucene devs set the default to 1 for a reason.
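To make the constraint concrete, here is a toy shell sketch of what a required prefix means. It only illustrates the rule described above and is not how Lucene implements candidate selection; the helper name shares_prefix is made up for this example.

```shell
# Toy model of the prefix_length rule (NOT Lucene's actual code).
# shares_prefix is a hypothetical helper: it succeeds when the first
# $3 characters of the input ($1) and of the candidate term ($2) are equal.
shares_prefix() {
  input_prefix=$(awk -v s="$1" -v n="$3" 'BEGIN { print substr(s, 1, n) }')
  candidate_prefix=$(awk -v s="$2" -v n="$3" 'BEGIN { print substr(s, 1, n) }')
  [ "$input_prefix" = "$candidate_prefix" ]
}

# Default prefix_length of 1: "r" differs from "o", so "organic" is skipped.
shares_prefix "rganic" "organic" 1 && echo "candidate" || echo "skipped"

# prefix_length of 0: nothing has to match, so "organic" can be considered.
shares_prefix "rganic" "organic" 0 && echo "candidate" || echo "skipped"

# A typo later in the word keeps the first letter, so the default is enough.
shares_prefix "orgnic" "organic" 1 && echo "candidate" || echo "skipped"
```

This is why 'orgnic', 'oragnc' and 'organi' work out of the box while 'rganic' does not.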

Here's a query which for me returns the suggestion result you're after on Elasticsearch 7.4.0 after I perform your setup steps.

curl --location --request POST 'http://127.0.0.1:9200/test/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
  "suggest": {
    "001" : {
      "text" : "rganic",
      "term" : {
        "field" : "word",
        "prefix_length": 0
      }
    }
  }
}'
Saskatoon answered 4/1, 2020 at 0:52 Comment(1)
Feel free to accept the answer if it's the relevant info :) - Saskatoon

You need to use candidate generators with the phrase suggester. See this passage from the book Elasticsearch in Action, page 444:

Having multiple generators and filters lets you do some neat tricks. For instance, if typos are likely to happen both at the beginning and end of words, you can use multiple generators to avoid expensive suggestions with low prefix lengths by using the reverse token filter, as shown in figure F.4. You’ll implement what’s shown in figure F.4 in listing F.4:

■ First, you’ll need an analyzer that includes the reverse token filter.

■ Then you’ll index the correct product description in two fields: one analyzed with the standard analyzer and one with the reverse analyzer.

From the Elasticsearch docs:

The following example shows a phrase suggest call with two generators: the first one is using a field containing ordinary indexed terms, and the second one uses a field that uses terms indexed with a reverse filter (tokens are index in reverse order). This is used to overcome the limitation of the direct generators to require a constant prefix to provide high-performance suggestions. The pre_filter and post_filter options accept ordinary analyzer names.

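So you can achieve this by indexing the text into a second field that uses a reverse analyzer, and pointing a second generator at that field with the pre_filter and post_filter options set to the same analyzer.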

And as you can see they said:

This is used to overcome the limitation of the direct generators to require a constant prefix to provide high-performance suggestions.
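A minimal sketch of that setup, adapted to the single "word" field from the question. It follows the shape of the example in the Elasticsearch phrase suggester docs, but the index name test_reverse, the sub-field word.reverse and the function names are assumptions made for illustration, and I have not run it against the question's data. Depending on your data you may need to tune phrase suggester options such as max_errors or confidence.

```shell
# Sketch only: index name, sub-field name and helper names are assumptions.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"
INDEX="test_reverse"

# Index with a custom "reverse" analyzer and a "word.reverse" sub-field
# whose terms are stored reversed ("organic" is indexed as "cinagro").
INDEX_BODY='{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "reverse": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "reverse"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "word": {
        "type": "text",
        "fields": {
          "reverse": { "type": "text", "analyzer": "reverse" }
        }
      }
    }
  }
}'

# Phrase suggester with two direct generators. The second one reverses the
# input (pre_filter), looks it up in the reversed field, where a missing
# first letter becomes a missing last letter, and reverses the candidates
# back (post_filter).
SUGGEST_BODY='{
  "suggest": {
    "text": "rganic",
    "001": {
      "phrase": {
        "field": "word",
        "direct_generator": [
          { "field": "word", "suggest_mode": "always" },
          {
            "field": "word.reverse",
            "suggest_mode": "always",
            "pre_filter": "reverse",
            "post_filter": "reverse"
          }
        ]
      }
    }
  }
}'

create_index() {
  curl --silent --request PUT "$ES_URL/$INDEX" \
    --header 'Content-Type: application/json' --data-raw "$INDEX_BODY"
}

index_word() {
  curl --silent --request POST "$ES_URL/$INDEX/_doc?refresh=true" \
    --header 'Content-Type: application/json' --data-raw '{ "word": "organic" }'
}

suggest() {
  curl --silent --request POST "$ES_URL/$INDEX/_search" \
    --header 'Content-Type: application/json' --data-raw "$SUGGEST_BODY"
}
```

With Elasticsearch running, call create_index, index_word and suggest in that order. Both generators keep the default prefix_length of 1, which is the point of the reversed field: the expensive prefix_length of 0 is avoided.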

See this figure from Elasticsearch in Action; I believe it makes the idea clearer.

[Figure: screenshot from the book showing how Elasticsearch arrives at the correct phrase]

For more information, refer to the docs: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-suggesters-phrase.html#:~:text=The%20phrase%20suggester%20uses%20candidate,individual%20term%20in%20the%20text.

Explaining the full idea would make this a very long answer, but this is the key: look into using the phrase suggester with multiple generators.

Nassir answered 25/3, 2021 at 9:30 Comment(0)
