Elasticsearch - use a "tags" index to discover all tags in a given string

I have an Elasticsearch v2.x cluster with a "tags" index that contains about 5000 tags: {tagName, tagID}. Given a string, is it possible to query the tags index to get all tags that are found in that string? I want exact matches, but I also want to allow fuzzy matches without being too generous. By "not too generous" I mean that a tag should match only if all tokens in the tag are found within a certain proximity of each other (say 5 words).

For example, given the string:

Model 22340 Sound Spectrum Analyzer

The following tags should match:

sound analyzer
sound spectrum analyzer

BUT NOT

sound meter
light spectrum
chemical analyzer
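To make the rule concrete, here is a minimal plain-Python sketch of the intended behaviour, with exact token matches only and no Elasticsearch involved. The function name `tag_matches`, the whitespace tokenization, and the reading of "within 5 words" as "all tag tokens fit inside some window of 5 consecutive words" are assumptions made for illustration.

```python
def tag_matches(tag, text, window=5):
    """Return True if every token of `tag` occurs inside some run of
    `window` consecutive words of `text` (case-insensitive, exact tokens)."""
    words = text.lower().split()
    needed = set(tag.lower().split())
    if not needed:
        return False
    # Slide a window over the text; short texts get a single window.
    last_start = max(len(words) - window, 0)
    return any(needed <= set(words[i:i + window]) for i in range(last_start + 1))


text = "Model 22340 Sound Spectrum Analyzer"
should_match = ["sound analyzer", "sound spectrum analyzer"]
should_not_match = ["sound meter", "light spectrum", "chemical analyzer"]

matched = [t for t in should_match + should_not_match if tag_matches(t, text)]
# matched == ["sound analyzer", "sound spectrum analyzer"]
```

Fuzzy matching is deliberately left out of this sketch; it only pins down the proximity part of the requirement.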

Behavior answered 17/6, 2016 at 20:11 Comment(3)
Of course you can. You can get what you want using just a match query with the standard analyzer. - Locket
Can you post an example as an answer? I would love to give you credit. - Behavior
I've posted an example as a new answer. :) - Locket

I don't think it's possible to write a single accurate Elasticsearch query that will auto-tag an arbitrary string. What you are describing is basically a reverse query. The most accurate way to match a tag to a document is to construct a query for the tag and then search the document, but that would be terribly inefficient if you had to iterate over every tag to auto-tag a document.

To do a reverse query, you want to use the Elasticsearch Percolator API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html

The API is very flexible and allows you to create fairly complex queries into documents with multiple fields.

The basic concept is this (assuming your tags have an app specific ID field):

  1. For each tag, create a query for it, and register the query with the percolator (using the tag's ID field).

  2. To auto-tag a string, pass your string (as a document) to the Percolator, which will match it against all registered queries.

  3. Iterate over the matches. Each match includes the _id of the query. Use the _id to reference the tag.
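The three steps above might be sketched as follows in Python. This only builds the request method, path and body for the 2.x-style percolator (queries indexed under the `.percolator` type, documents sent to `_percolate`) and does not send anything. The index name `tagger`, the type `doc`, the field `text`, and the choice of a `match_phrase` query with `slop` as the per-tag query are assumptions for illustration, not something stated in this answer. Note that `slop` only approximates "within 5 words", and the `text` field is assumed to be mapped as an analyzed string.

```python
INDEX = "tagger"      # hypothetical index holding the registered tag queries
DOC_TYPE = "doc"      # hypothetical mapping type for percolated documents
TEXT_FIELD = "text"   # hypothetical analyzed string field


def registration_request(tag_id, tag_name, slop=5):
    """Step 1: build the request that registers one query per tag."""
    body = {
        "query": {
            "match_phrase": {
                TEXT_FIELD: {"query": tag_name, "slop": slop}
            }
        }
    }
    return "PUT", "/{}/.percolator/{}".format(INDEX, tag_id), body


def percolate_request(text):
    """Step 2: build the request that percolates the string as a document."""
    body = {"doc": {TEXT_FIELD: text}}
    return "GET", "/{}/{}/_percolate".format(INDEX, DOC_TYPE), body


def matched_tag_ids(response):
    """Step 3: collect the _id of every registered query that matched."""
    return [match["_id"] for match in response.get("matches", [])]


# Example with a hand-written response in the shape the 2.x percolate API returns.
sample_response = {"total": 1, "matches": [{"_index": INDEX, "_id": "42"}]}
tag_ids = matched_tag_ids(sample_response)  # ["42"]
```

Each tuple can be sent with any HTTP client; the returned ids are then looked up in the tags index.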

This is also a good article to read: https://www.elastic.co/blog/percolator-redesign-blog-post

Behavior answered 1/7, 2016 at 23:39 Comment(0)

{
  "query": {
    "match": {
      "tagName": {
        "query":     "Model 22340 Sound Spectrum Analyzer",
        "fuzziness": "AUTO",
        "operator":  "or"
      }
    }
  }
}

If you want a stricter match, so that "sound meter" does not match, you will have to add another field to each tag containing the number of terms in the tag name, add a script that counts the terms in the query, and compare the two alongside the match query; see: Finding Multiple Exact Values.

Regarding the proximity issue: since you require fuzziness, you cannot control proximity, because the match_phrase query does not support fuzziness, as stated in the Elastic docs (Fuzzy-match-query):

Fuzziness works only with the basic match and multi_match queries. It doesn’t work with phrase matching, common terms, or cross_fields matches.

So you need to choose: fuzziness or proximity.
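That trade-off applies to the match family of queries. One possible way around it, offered here only as an untested sketch, is to describe each tag with span queries: `span_near` controls proximity, and `span_multi` can wrap a `fuzzy` query for each term. The helper name `fuzzy_proximity_query` and the field name `text` are hypothetical, and the query is built per tag to run against the text being tagged (for example as a registered percolator query), not against `tagName`.

```python
def fuzzy_proximity_query(tag_name, field="text", slop=5, fuzziness="AUTO"):
    """Build a query body requiring all (fuzzy) tag terms near each other."""
    # Span queries are term-level and skip analysis, so lowercasing and
    # whitespace splitting stand in (crudely) for the standard analyzer.
    terms = tag_name.lower().split()
    clauses = [
        {"span_multi": {"match": {"fuzzy": {field: {"value": term, "fuzziness": fuzziness}}}}}
        for term in terms
    ]
    if len(clauses) == 1:
        # A single-term tag needs no proximity constraint.
        return {"query": clauses[0]}
    return {
        "query": {
            "span_near": {"clauses": clauses, "slop": slop, "in_order": False}
        }
    }


query = fuzzy_proximity_query("Sound Analyzer")
```

Fuzzy terms expand into many candidate terms, so a query like this may be noticeably more expensive than a plain match query.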

Yeasty answered 20/6, 2016 at 6:23 Comment(2)
This does work, but I clarified my question because the results are too generous. Tags with multiple tokens will match when only one of those tokens is found. Tags should only match if all tokens in the tag are found in the string within a certain proximity (say within 5 words). - Behavior
Counting the terms in the query string seems irrelevant, because the string is made up of the name and description of the product, which has nothing to do with the number of terms in any given matching tag. If this comment were my query string, I would want to match the tag "query string" but not "query description", so proximity is important. - Behavior

Of course you can. You can get what you want using just a match query with the standard analyzer.

curl -XGET "http://localhost:9200/tags/_search?pretty" -d '{
  "query": {
    "match" : {
      "tagName" : "Model 22340 Sound Spectrum Analyzer"
    }
  }
}'
Locket answered 20/6, 2016 at 11:0 Comment(0)
