Elasticsearch search suggestion on array field with partial edge ngram completion

I am trying to build a suggester based on arrays of strings in my documents. It is similar to this one, with several differences: the Elasticsearch completion suggester does not do exactly what I want in terms of filtering and prefix matching, because I need an edge n-gram that works on any word of the sentence and is accent-insensitive. Let me clarify with an example.

Assume I have the following indexed documents. I want to suggest "tags" based on a query q (I don't care about the documents themselves, only the tags that match my query):

[
  { "tags": [ "société générale", "consulting" ] },
  { "tags": [ "big data", "big", "data"] },
  { "tags": [ "data" ] },
  { "tags": [ "data engineering" ] },
  { "tags": [ "consulting and management of IT" ] }
]

I want prefix matching with accent tolerance. The following queries and responses illustrate what I need:

  • (1) q = "societe" or q = "societe generale" should return [ "société générale" ] --> accent insensitive
  • (2) q = "big data" should return [ "big data" ] --> both prefixes "big" and "data" must be in the string
  • (3) q = "data" should return [ "big data", "data", "data engineering" ], --> anywhere in the sentence (but as a prefix)
  • (4) q = "ata" should not return anything (not a prefix)
  • (5) q = "IT consulting" should return [ "consulting and management of IT" ] --> both prefixes of q should match regardless of order

If I use a regular completion mapper+suggester,

# assuming "tags" is mapped as type "completion" in my ES index
{
  "suggest": {
    "text": "big data",
    "tags": {
      "completion": {
        "field": "tags"
      }
    }
  }
}

almost none of these cases work: only (2), (4), and one of the three results expected in (3) come out right.

Can I build a custom suggester or a custom search query that would satisfy my requirements and the examples given above ?

Nitroglycerin answered 18/11, 2019 at 12:22 Comment(4)
Did you get a solution to this? - Poikilothermic
No. I've heard Elasticsearch 7.3 might have some features that could help with this, but unfortunately they still haven't released this version on AWS. - Nitroglycerin
You should check out OpenSearch - Documentation. AWS is going to switch to OpenSearch in the near future for licensing reasons: aws.amazon.com/de/elasticsearch-service/the-elk-stack/… opensearch.org/docs/opensearch/ux/#autocomplete-queries Maybe they will release it with the features you need 😉 - Awhile
Not yet. We've paused work on our search engine for quite some time and haven't upgraded our ES clusters for a while. It's likely that newer versions are more flexible, and there's even the new vector/semantic search that could help with this. - Nitroglycerin

Elasticsearch does not extract the matched array item in hits; it returns the entire array.

The solution is a nested field combined with the inner_hits parameter in the query.

Your documents

PUT /edge_suggestion/_bulk
{"create":{"_id":1}}
{"tags":["société générale","consulting"]}
{"create":{"_id":2}}
{"tags":["big data","big","data"]}
{"create":{"_id":3}}
{"tags":["data"]}
{"create":{"_id":4}}
{"tags":["data engineering"]}
{"create":{"_id":5}}
{"tags":["consulting and management of IT"]}

My source index is named edge_suggestion

Mapping of new index documents with a nested field for tags

PUT /edge_suggestion_with_nested
{
    "mappings": {
        "properties": {
            "tags": {
                "type": "text"
            },
            "tags_nested": {
                "type": "nested",
                "properties": {
                    "tag": {
                        "type": "text",
                        "analyzer": "lowercase_asciifolding_edge_ngram_standard_analyzer"
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "lowercase_asciifolding_edge_ngram_standard_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "edge_ngram_filter"
                    ]
                }
            },
            "filter": {
                "edge_ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 11
                }
            }
        }
    }
}
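
To see roughly what this analyzer puts in the index, here is a small Python emulation (not Elasticsearch code). It assumes that NFKD accent stripping is a good enough stand-in for asciifolding and that splitting on word characters approximates the standard tokenizer; fold, edge_ngrams, matches, and suggest are names made up for this sketch. The query is n-grammed the same way and every query gram must be present in the tag, which imitates a match query with the and operator.

```python
import re
import unicodedata

# The tags from the question, flattened into one list.
TAGS = [
    "société générale", "consulting", "big data", "big", "data",
    "data engineering", "consulting and management of IT",
]


def fold(text):
    """Lowercase and strip accents (rough stand-in for lowercase + asciifolding)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(text, min_gram=1, max_gram=11):
    """Split into words, then emit the leading n-grams of each folded word."""
    grams = set()
    for word in re.findall(r"\w+", fold(text)):
        for size in range(min_gram, min(len(word), max_gram) + 1):
            grams.add(word[:size])
    return grams


def matches(query, tag):
    """Every gram of the query must also be a gram of the tag (operator "and")."""
    return edge_ngrams(query) <= edge_ngrams(tag)


def suggest(query):
    """Return the tags that match the query, in their original order."""
    return [tag for tag in TAGS if matches(query, tag)]


print(suggest("data"))
print(suggest("IT consulting"))
```

In this emulation, suggest("data") gives big data, data, and data engineering, while suggest("ata") gives nothing, because no word starts with "at".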

Reindex request that transforms the array items into nested documents

POST _reindex
{
    "source": {
        "index": "edge_suggestion"
    },
    "dest": {
        "index": "edge_suggestion_with_nested"
    },
    "script": {
        "source": """
                List tags = ctx._source[params['tags_field_name']];
                List nestedTags = new LinkedList();
                
                for (String tag : tags) {
                    Map item = [params['nested_tag_field_name'] : tag];
                    nestedTags.add(item);
                }
                ctx._source[params['tags_nested_field_name']] = nestedTags;
        """,
        "params": {
            "nested_tag_field_name": "tag",
            "tags_field_name": "tags",
            "tags_nested_field_name": "tags_nested"
        }
    }
}
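
In plain Python terms, the script does roughly the following to each document's _source. This is only an illustration of the transformation, with to_nested as a hypothetical helper name; the sketch treats a missing tags field as an empty list, which is my own assumption and not something the Painless script does explicitly.

```python
def to_nested(source, tags_field="tags", nested_field="tags_nested", tag_key="tag"):
    """Return a copy of the document with one nested object per tag."""
    doc = dict(source)
    # Assumption: a document without the tags field gets an empty nested list.
    doc[nested_field] = [{tag_key: tag} for tag in source.get(tags_field, [])]
    return doc


print(to_nested({"tags": ["big data", "big", "data"]}))
```

The original tags array is kept as it is, so the new index still carries both representations.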

Search query for "data", using the same analyzer as the document field, the and operator, and the inner_hits parameter

GET /edge_suggestion_with_nested/_search?filter_path=hits.hits.inner_hits.tags_nested.hits.hits._source
{
    "query": {
        "nested": {
            "path": "tags_nested",
            "query": {
                "match": {
                    "tags_nested.tag": {
                        "query": "data",
                        "operator": "and",
                        "analyzer": "lowercase_asciifolding_edge_ngram_standard_analyzer"
                    }
                }
            },
            "inner_hits": {}
        }
    }
}

Response

{
    "hits" : {
        "hits" : [
            {
                "inner_hits" : {
                    "tags_nested" : {
                        "hits" : {
                            "hits" : [
                                {
                                    "_source" : {
                                        "tag" : "data"
                                    }
                                }
                            ]
                        }
                    }
                }
            },
            {
                "inner_hits" : {
                    "tags_nested" : {
                        "hits" : {
                            "hits" : [
                                {
                                    "_source" : {
                                        "tag" : "data"
                                    }
                                },
                                {
                                    "_source" : {
                                        "tag" : "big data"
                                    }
                                }
                            ]
                        }
                    }
                }
            },
            {
                "inner_hits" : {
                    "tags_nested" : {
                        "hits" : {
                            "hits" : [
                                {
                                    "_source" : {
                                        "tag" : "data engineering"
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

I filtered the response down to its significant parts with filter_path.

The "data" tag appears twice because it exists as a nested document in two different parent documents.
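
If you want a flat list without the duplicate, one option is to deduplicate on the client side. A minimal Python sketch, assuming the response has already been parsed into a dict with the filter_path shape shown above; unique_tags is a hypothetical helper, not part of any Elasticsearch client.

```python
def unique_tags(response):
    """Collect inner-hit tags in first-seen order, dropping duplicates."""
    seen = {}
    # .get() with defaults lets the function accept a body that has no hits.
    for hit in response.get("hits", {}).get("hits", []):
        inner_hits = hit["inner_hits"]["tags_nested"]["hits"]["hits"]
        for nested in inner_hits:
            seen.setdefault(nested["_source"]["tag"], None)
    return list(seen)


def _hit(*tags):
    """Build one parent hit in the filtered response shape (sample data only)."""
    inner = [{"_source": {"tag": tag}} for tag in tags]
    return {"inner_hits": {"tags_nested": {"hits": {"hits": inner}}}}


# Same content as the response to the "data" query above.
response = {"hits": {"hits": [_hit("data"), _hit("data", "big data"), _hit("data engineering")]}}

print(unique_tags(response))
```

A dict is used here instead of a set so that the order in which Elasticsearch returned the tags is preserved.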

Search query for "IT consulting"

GET /edge_suggestion_with_nested/_search?filter_path=hits.hits.inner_hits.tags_nested.hits.hits._source
{
    "query": {
        "nested": {
            "path": "tags_nested",
            "query": {
                "match": {
                    "tags_nested.tag": {
                        "query": "IT consulting",
                        "operator": "and",
                        "analyzer": "lowercase_asciifolding_edge_ngram_standard_analyzer"
                    }
                }
            },
            "inner_hits": {}
        }
    }
}

Response

{
    "hits" : {
        "hits" : [
            {
                "inner_hits" : {
                    "tags_nested" : {
                        "hits" : {
                            "hits" : [
                                {
                                    "_source" : {
                                        "tag" : "consulting and management of IT"
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}
Logsdon answered 14/4 at 13:37 Comment(0)
