Elasticsearch shingles and stopwords

The example at https://www.elastic.co/guide/en/elasticsearch/guide/current/shingles.html mentions that the standard stop filter has a negative effect when searching with shingles: the positions of the removed stopwords are filled with an underscore, which produces shingle tokens containing underscores (and these won't match "regular" text queries).

However, it suggests using an enable_position_increments parameter that is no longer supported by Lucene (and that produces an error on ES 2.4, at least).

Is there any way to solve this problem, or to achieve the same results, without using the unsupported enable_position_increments? Or are the underscores a minor problem that can be worked around?

I was also wondering whether this could be a non-issue if the same analyzer is used for both search and indexing: if a query includes stopwords, will they be replaced by _ and thus generate tokens that match the indexed shingles (even if the stopwords were different)?
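
One rough way to reason about that last question is to imitate the stop and shingle filters in plain Python. The sketch below is not Lucene's implementation: the `shingles` helper and the `STOPWORDS` set are hypothetical, and it only models stopwords that are followed by a regular word. Under those assumptions, two inputs that differ only in their stopword yield identical shingles, which is consistent with a query matching when it goes through the same analyzer as the indexed text.

```python
STOPWORDS = {"a", "an", "the", "of", "in"}  # assumed stopword list for this sketch


def shingles(text, filler="_", min_size=2, max_size=2):
    """Imitate stop + shingle: stopwords become gaps, gaps are rendered as `filler`."""
    # None marks the position gap left behind by a removed stopword.
    slots = [None if w in STOPWORDS else w for w in text.lower().split()]
    out = []
    for start in range(len(slots)):
        for size in range(min_size, max_size + 1):
            window = slots[start:start + size]
            # Skip incomplete windows and windows made only of gaps.
            if len(window) < size or all(w is None for w in window):
                continue
            out.append(" ".join(filler if w is None else w for w in window))
    return out


indexed = shingles("king of spain")
queried = shingles("king in spain")
print(indexed)             # ['king _', '_ spain']
print(indexed == queried)  # True
```

The real token stream should still be checked with _analyze, since this sketch ignores details such as token positions and offsets.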

Carburet answered 23/2, 2017 at 12:35

I've found that a possible solution is to set the filler_token parameter to an empty string on the shingle filter, so the underscore will simply be omitted from the tokens:

    "filter_shingle": {
        "type": "shingle",
        "max_shingle_size": 5,
        "min_shingle_size": 2,
        "output_unigrams": "false",
        "filler_token": ""
    }

Can someone comment on whether this achieves the same results, or whether it creates any unforeseen problems with scoring or matching? The results from _analyze seem correct: the _ is omitted.

Carburet answered 23/2, 2017 at 16:20 Comment(2)
After testing shingles with and without underscores, I get exactly the same scores with both methods for the example at elastic.co/guide/en/elasticsearch/guide/current/shingles.html - Carburet
Be careful with this, because it may cause unexpected results. For example, let's say a stopwords filter runs before the shingle filter. The string "The Brown Fox" will return [" Brown", " Brown Fox", ...] (notice the leading space that is left). This could throw off queries such as a match phrase, since the query would need a space at the beginning in order to match. - Lyse
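
The leading space described in the comment above can be reproduced with a small pure-Python imitation of the stop and shingle filters. This is only a sketch under stated assumptions, not Lucene's code: the `shingles` helper and the `STOPWORDS` set are hypothetical, and the sizes are taken from the filter definition in the answer (min 2, max 5, no unigrams).

```python
STOPWORDS = {"a", "an", "the", "of", "in"}  # assumed stopword list for this sketch


def shingles(text, filler="_", min_size=2, max_size=5):
    """Imitate stop + shingle: stopwords become gaps, gaps are rendered as `filler`."""
    # None marks the position gap left behind by a removed stopword.
    slots = [None if w in STOPWORDS else w for w in text.lower().split()]
    out = []
    for start in range(len(slots)):
        for size in range(min_size, max_size + 1):
            window = slots[start:start + size]
            # Skip incomplete windows and windows made only of gaps.
            if len(window) < size or all(w is None for w in window):
                continue
            # The separator space is kept even when the filler itself is empty.
            out.append(" ".join(filler if w is None else w for w in window))
    return out


print(shingles("The Brown Fox"))             # ['_ brown', '_ brown fox', 'brown fox']
print(shingles("The Brown Fox", filler=""))  # [' brown', ' brown fox', 'brown fox']
```

With an empty filler the underscore disappears, but the separator that joined it to the next word remains, so the first two tokens start with a space.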

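This is how I deal with this situation. The trim filter placed after the shingle filter strips the leading or trailing space that the empty filler_token leaves behind: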

    "filter_shingle": {
        "type": "shingle",
        "max_shingle_size": 2,
        "min_shingle_size": 2,
        "output_unigrams": "true",
        "filler_token": ""
    }

    "analyzer": {
        "my_shingle": {
            "filter": ["lowercase", "stop", "filter_shingle", "trim"],
            "tokenizer": "standard"
        }
    }
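
The effect of ending the chain with trim can be imitated in plain Python. This is a minimal sketch rather than what Elasticsearch actually executes: `my_shingle` and `STOPWORDS` are hypothetical names, `str.strip` stands in for the trim filter, and the sizes follow the settings above (bigrams plus unigrams, empty filler).

```python
STOPWORDS = {"a", "an", "the", "of", "in"}  # assumed stopword list for this sketch


def my_shingle(text, trim=True):
    """Imitate lowercase -> stop -> shingle (size 2, unigrams, filler "") -> trim."""
    # None marks the position gap left behind by a removed stopword.
    slots = [None if w in STOPWORDS else w for w in text.lower().split()]
    tokens = []
    for i, word in enumerate(slots):
        if word is not None:
            tokens.append(word)  # output_unigrams: true
        pair = slots[i:i + 2]
        if len(pair) == 2 and any(w is not None for w in pair):
            # An empty filler_token leaves only the separator space behind.
            tokens.append(" ".join("" if w is None else w for w in pair))
    # str.strip removes leading and trailing whitespace, like the trim filter.
    return [t.strip() for t in tokens] if trim else tokens


print(my_shingle("The Brown Fox", trim=False))  # [' brown', 'brown', 'brown fox', 'fox']
print(my_shingle("The Brown Fox"))              # ['brown', 'brown', 'brown fox', 'fox']
```

In this sketch the trimmed " brown" collapses into a second "brown" token next to the unigram, so it is worth checking with _analyze whether such duplicates appear in a real index and whether they affect scoring.
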
Scleroma answered 22/7, 2020 at 7:45
