I want to publish some detailed use cases.
Edge n-gram tokenizer (default)
By default this tokenizer treats the entire text as a single token, because by default a token can contain any characters (including spaces).
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": "edge_ngram",
  "text": "How are you?"
}
Result:
["H", "Ho"]
Explanation: one token; default min_gram = 1, max_gram = 2.
Edge n-gram tokenizer (custom without token_chars)
PUT {ELASTICSEARCH_URL}/custom_edge_ngram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_edge_ngram": {
          "tokenizer": "custom_edge_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "custom_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 7
        }
      }
    }
  }
}
GET {ELASTICSEARCH_URL}/custom_edge_ngram/_analyze
{
  "analyzer": "custom_edge_ngram",
  "text": "How old are you?"
}
Result:
["Ho", "How", "How ", "How o", "How ol", "How old"]
Explanation: still one token; min_gram = 2, max_gram = 7, so the grams stop at 7 characters ("How old").
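By the way, the same output can be obtained without creating an index: the _analyze API also accepts an inline tokenizer definition. A quick sketch with the same parameters as above:
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 7
  },
  "text": "How old are you?"
}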
Edge n-gram tokenizer (custom with token_chars)
PUT {ELASTICSEARCH_URL}/custom_edge_ngram_2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_edge_ngram": {
          "tokenizer": "custom_edge_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "custom_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 7,
          "token_chars": ["letter"]
        }
      }
    }
  }
}
GET {ELASTICSEARCH_URL}/custom_edge_ngram_2/_analyze
{
  "analyzer": "custom_edge_ngram",
  "text": "How old are you?"
}
Result:
["Ho", "How", "ol", "old", "ar", "are", "yo", "you"]
Explanation: 4 tokens (How, old, are, you), because token_chars restricts tokens to letters only; min_gram = 2, max_gram = 7, but the maximum token length in the sentence is 3.
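To see token_chars dropping non-letter characters, we can analyze a sentence containing digits against the same index (the sample text is my own; the expected output follows from the rules above):
GET {ELASTICSEARCH_URL}/custom_edge_ngram_2/_analyze
{
  "analyzer": "custom_edge_ngram",
  "text": "He is 25 years old"
}
The digits 25 should produce no tokens at all, so the result should look like ["He", "is", "ye", "yea", "year", "years", "ol", "old"].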
Edge n-gram token filter
A tokenizer converts text into a stream of tokens.
A token filter works with each token of the stream.
A token filter can modify the stream by adding, updating, or deleting its tokens.
Let's use the standard tokenizer.
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": "standard",
  "text": "How old are you?"
}
Result:
["How", "old", "are", "you"]
Now let's add a token filter.
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 2,
      "max_gram": 7
    }
  ],
  "text": "How old are you?"
}
Result:
["Ho", "How", "ol", "old", "ar", "are", "yo", "you"]
Explanation: edge_ngram is applied to each token separately.
The token strings can match the output of the edge_ngram tokenizer (for a text like "the quick …" both approaches produce t, th, the, q, qu, qui, …), but offsets and positions are different. Filter: {"token": "qui", "start_offset": 4, "end_offset": 9, "position": 2} (the gram keeps the offsets and position of its source token "quick"). Tokenizer: {"token": "qui", "start_offset": 4, "end_offset": 7, "position": 6} (each gram gets its own offsets and position).
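To inspect this difference yourself, run both variants on the same text and compare the start_offset, end_offset and position fields in the responses. A minimal sketch (the sample text and the min_gram/max_gram values are my choice, picked to mirror the gram list above):
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 3,
    "token_chars": ["letter"]
  },
  "text": "the quick brown fox"
}
GET {ELASTICSEARCH_URL}/_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 3 }
  ],
  "text": "the quick brown fox"
}
Both should return the same token strings (t, th, the, q, qu, qui, …), but in the first response each gram has its own offsets and position, while in the second each gram inherits them from the word produced by the standard tokenizer.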
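Finally, a typical use case for all of the above is search-as-you-type: apply the edge n-gram analysis at index time, but keep a plain analyzer at search time so the query prefix itself is not broken into grams. A minimal sketch using Elasticsearch 7+ mapping syntax (the index name autocomplete_demo and the field title are hypothetical):
PUT {ELASTICSEARCH_URL}/autocomplete_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 7
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
With this mapping, a query like "ho" should match a document with title "How old are you?": the grams ho and how were indexed, while search_analyzer set to standard prevents the query itself from being expanded into grams.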