How to practically use a keyword analyzer in Azure Search?

This is related to, and continues, this question: Azure Search Analyzer

I want to use a keywordanalyzer for word collections.

We have documents (products) with different fields like product_name, brand, categorie and so on.
To implement a keyword based ranking (scoring) I would like to add a Collection(Edm.String) field which contains different (untokenized!!) keywords, like: "brown teddy" or "green bean".
To achieve this I thought about using a keywordanalyzer with the following definition:

// field definition:
{
"name": "keyWordList",
"type": "Collection(Edm.String)",
"analyzer": "keywordAnalyzer"
}
...

"analyzers": [ {
"name":"keywordAnalyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"keywordTokenizer",
"tokenFilters":[ "lowercase", "classic" ]
} ]
...

"tokenizers": [ {
"name": "keywordTokenizer",
"@odata.type": "#Microsoft.Azure.Search.KeywordTokenizer"
} ]

Now, after uploading some documents, I can't find them by entering exactly the chosen keywords. For example, there is a document with the following field data:

"keyWordList": [ "Blue Bear", "blue bear", "blue bear123" ]

I'm not able to get any result with the following search:

{ search:"blue bear", count:"true", queryType:"full" }

Here is what I tried as well:

  • using the predefined keywordanalyzer instead of a customized one -> no success
  • instead of using Collection(Edm.String) I just tested it with a normal String field, containing only one keyword -> no success
  • splitting up the analyzer in the field definition block into searchAnalyzer="lowercaseAnalyzer" and filterAnalyzer="keywordAnalyzer", and vice versa -> no success

In the end, the only way I could get a result was by sending the whole search phrase as a single term. But this should be done by the analyzer, right?!

{ search:"\"blue bear\"", count:"true", queryType:"full" }

Users don't know if they search for an existing keyword or perform a tokenized search. That's why this won't be an option.

Is there any solution to this issue? Or is there maybe a better or easier approach for this kind of keyword (high-scoring) search?

Thanks!

Cynar answered 29/11, 2016 at 2:59 Comment(0)

Short answer:

The behavior you're observing is correct.

Semantically, your search query blue bear means: find all documents that match the term blue or the term bear. Since you are using the keyword tokenizer, the terms that you indexed are blue bear and blue bear123. The terms blue and bear individually don't exist in your index. That's why only the phrase query returns the result you are expecting.


Long answer:

Let me explain how the analyzer is applied during query processing and how it's applied during document indexing.

On the indexing side, the analyzer you defined processes elements of the keyWordList collection independently. The terms that end up in your inverted index are:

  • blue bear (since you're using the lowercase filter, blue bear and Blue Bear are tokenized to the same term).
  • blue bear123

    As you'd expect, blue bear is one term (not split into two on the space), since you're using the keyword tokenizer. The same applies to blue bear123.
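The indexing behavior described above can be illustrated with a small simulation. This is only a sketch, not the service's implementation: `keyword_analyze` and `build_inverted_index` are hypothetical helpers, and the "classic" token filter from the question is assumed to have no effect on these values and is left out.

```python
from collections import defaultdict


def keyword_analyze(value: str) -> list[str]:
    """Mimic a keyword tokenizer plus lowercase filter:
    the whole input becomes exactly one lowercased token."""
    return [value.lower()]


def build_inverted_index(docs: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each indexed term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, keywords in docs.items():
        # Each element of the collection is analyzed independently.
        for keyword in keywords:
            for term in keyword_analyze(keyword):
                index[term].add(doc_id)
    return dict(index)


docs = {"1": ["Blue Bear", "blue bear", "blue bear123"]}
index = build_inverted_index(docs)
print(sorted(index))  # ['blue bear', 'blue bear123']
```

"Blue Bear" and "blue bear" collapse into one term, and neither blue nor bear exists as a term on its own.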

On the query processing side, two things happen:

  1. Your search query is rewritten to: blue|bear (find documents with blue or bear). This is because searchMode=any is used by default. If you used searchMode=all, your search query would be rewritten to blue+bear (find documents with blue and bear).

    The query parser takes your search query string and separates query operators (such as +, |, * etc.) from query terms. Then it decomposes the search query into subqueries of supported types, e.g., terms followed by the suffix operator ‘*’ become a prefix query, quoted terms become a phrase query, etc. Terms that are not preceded or followed by any of the supported operators become individual term queries.

    In your example, the query parser decomposed your query string blue|bear into two term queries with terms blue and bear respectively. The search engine looks for documents that match any of those queries (searchMode=any).

  2. Query terms of the identified subqueries are processed by the search analyzer.

    In your example, terms blue and bear are processed by the analyzer individually. They are not modified since they are already lowercase. None of those tokens exist in your index, thus no results are returned.

    If your query looked as follows: "Blue Bear" (with quotes), it would be rewritten to "Blue Bear". Notice there is no change: the OR operator has not been put between the words, since you're now looking for a phrase. The query parser passes the entire phrase term (two words) to the analyzer, which in turn outputs a single, lowercased term: blue bear. This token matches what's in your index.
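The two query-side steps can be sketched in a few lines. This is a deliberately simplified model, not the real query parser: it only distinguishes quoted phrases from whitespace-separated terms, assumes searchMode=any, and hard-codes the index terms from the example. The names `INDEX_TERMS`, `parse_query`, `analyze` and `matching_terms` are made up for illustration.

```python
import re

# Terms assumed to be in the inverted index for the example document.
INDEX_TERMS = {"blue bear", "blue bear123"}


def parse_query(search: str) -> list[str]:
    """Step 1 (simplified): quoted phrases stay whole,
    everything else is split into individual terms."""
    parts = re.findall(r'"([^"]*)"|(\S+)', search)
    return [phrase if phrase else word for phrase, word in parts]


def analyze(text: str) -> str:
    """Step 2: keyword tokenizer plus lowercase filter, applied per subquery."""
    return text.lower()


def matching_terms(search: str) -> list[str]:
    """Return the analyzed query terms that exist in the index."""
    analyzed = [analyze(part) for part in parse_query(search)]
    return [term for term in analyzed if term in INDEX_TERMS]


print(matching_terms('blue bear'))    # []
print(matching_terms('"Blue Bear"'))  # ['blue bear']
```

The unquoted query is split before the analyzer ever sees it, so the analyzer cannot put the two words back together.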

The key lesson here is that the query parser processes the query string before the analyzers are applied. The analyzers are applied to individual terms of subqueries identified by the query parser.

I hope this helps you understand the behavior you're observing. Note, you can test the output of your custom analyzer using the Analyze API.
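As a rough sketch of calling the Analyze API from Python: the snippet below only builds the request and does not send it. The service name, index name, key placeholder and the api-version value are assumptions for illustration, and `build_analyze_request` is a hypothetical helper. Check the REST reference for the version your service supports.

```python
import json
import urllib.request


def build_analyze_request(service, index, api_key, text, analyzer,
                          api_version="2020-06-30"):
    """Build (but do not send) a POST request to the index's analyze endpoint.
    The api_version default is an assumption; adjust it for your service."""
    url = (f"https://{service}.search.windows.net"
           f"/indexes/{index}/analyze?api-version={api_version}")
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# "my-service", "products" and the key are placeholders, not real values.
request = build_analyze_request(
    "my-service", "products", "<api-key>", "Blue Bear", "keywordAnalyzer")
print(request.full_url)

# To actually send it against a real service:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

For the analyzer defined in the question, the response would be expected to list a single token for the input "Blue Bear".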

Naumachia answered 29/11, 2016 at 4:11 Comment(6)
Thank you for your very detailed answer. Could you suggest how to use the keyword analyzer in my case, or should I use another approach? All I can think of at the moment is to manipulate the search query by duplicating the original search terms and wrapping the duplicate in quotes: search term "search term". That feels pretty dirty, so any other ideas? The goal is that the user can run a search like: long blue shirt, which results in a usual tokenized search, or like: bmw x3 2015, which is stored as a keyword and results in a higher score than: bmw new x3 - Cynar
Is your intention to promote (score higher) documents that contain all the terms vs. any of the individual terms? To help you with your scenario I need to understand what you're trying to achieve. Can you give me a few examples of queries and the documents they should/shouldn't match and how they should be ranked? - Naumachia
Often our customers expand user queries before they pass them to the search service. For example, for search terms long shirt your query could look like search=("long shirt")^3 || (long+shirt)^2 || long shirt. It means: search for documents with the phrase "long shirt", or documents with terms long and shirt, or documents with terms long or shirt. Notice I'm using the boosting operator ^ to promote the documents that matched the first and second subquery. I'm using full Lucene query syntax: docs.microsoft.com/en-us/rest/api/searchservice/… - Naumachia
The example in your second post is similar to what I had in mind. So this may even be common practice... good to know! I was playing with the keyword analyzer in combination with a higher scoring weight, and I found that the same terms scored higher without using this analyzer. The search was still a composed one, framed by quotes... I may come back to you after having done some more statistical tests. But I still wonder about the use case of the keyword analyzer. In which cases would one use it then? - Cynar
It's most commonly used by customers that index identifiers of any sort, e.g., product IDs. They want to make sure they will be indexed as is. - Naumachia
Thanks for the details, Yahnoosh. My scenario is that of product IDs. What we use is a Make Model Part number combination, and we need to search so that partial words can help track down the correct documents. The problem is that a user may not specify the Make and may rely on the Model or Part number only, so we can't use Make-based filters. What strategy would you recommend? - Roose
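The query expansion suggested in the comments above could be done client-side before calling the search service. The following is a minimal sketch that reproduces the pattern from that comment. `expand_query` is a hypothetical helper, the boost values 3 and 2 are taken from the comment's example, and it assumes the user input contains no characters that are special in full Lucene query syntax (those would need escaping first).

```python
def expand_query(user_query: str, phrase_boost: int = 3,
                 all_terms_boost: int = 2) -> str:
    """Expand a plain user query into: boosted phrase match,
    boosted all-terms match, and a plain any-term match."""
    terms = user_query.split()
    if len(terms) < 2:
        # A single term has no phrase or all-terms variant worth adding.
        return user_query.strip()
    phrase = '("{}")^{}'.format(" ".join(terms), phrase_boost)
    all_terms = "({})^{}".format("+".join(terms), all_terms_boost)
    any_term = " ".join(terms)
    return " || ".join([phrase, all_terms, any_term])


print(expand_query("long shirt"))
# ("long shirt")^3 || (long+shirt)^2 || long shirt
```

The expanded string would be sent as the search parameter with queryType set to full, as in the original question's requests.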
