Very slow elasticsearch term aggregation. How to improve?
We have ~20M documents (hotel offers) stored in Elasticsearch (1.6.2), and the goal is to group documents by multiple fields (duration, start_date, adults, kids) and select the single cheapest offer from each group. The resulting groups have to be sorted by the cost field.

To avoid sub-aggregations, we have combined the values of the target fields into a single field called default_group_field by joining them with a dot (.).
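
For example (hypothetical values, following that joining scheme), an offer with duration 7, start_date 2016-07-01, 2 adults and 1 kid would carry:

  "default_group_field": "7.2016-07-01.2.1"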

Mapping for the field looks like this:

  "default_group_field": {
    "index": "not_analyzed",
    "fielddata": {
      "loading": "eager_global_ordinals"
    },
    "type": "string"
  }

Query we perform looks like this:

{
  "size": 0,
  "aggs": {
    "offers": {
      "terms": {
        "field": "default_group_field",
        "size": 5,
        "order": {
          "min_sort_value": "asc"
        }
      },
      "aggs": {
        "min_sort_value": {
          "min": {
            "field": "cost"
          }
        },
        "cheapest": {
          "top_hits": {
            "_source": {}
            },
            "sort": {
              "cost": "asc"
            },
            "size": 1
          }
        }
      }
    }
  },
  "query": {
    "filtered": {
      "filter": {
        "and": [
          ...
        ]
      }
    }
  }
}

The problem is that such a query takes seconds (2-5 s) to execute.

However, once we run the query without aggregations, we get a moderate number of results (say "total": 490) in under 100 ms.

{
  "took": 53,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "hits": {
    "total": 490,
    "max_score": 1,
    "hits": [...

But with the aggregation it takes over 2 seconds:

{
  "took": 2158,
  "timed_out": false,
  "_shards": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "hits": {
    "total": 490,
    "max_score": 0,
    "hits": [

    ]
  },...

It seems like it should not take so long to process such a moderate number of filtered documents and to select the cheapest one from every group. It could be done inside the application, but that feels like an ugly hack to me.

The log is full of lines stating:

[DEBUG][index.fielddata.plain ] [Karen Page] [offers] Global-ordinals[default_group_field][2564761] took 2453 ms

That is why we updated our mapping to eagerly rebuild global ordinals on index updates (the fielddata setting shown above); however, this did not have any notable impact on query timings.

Is there any way to speed up such an aggregation, or perhaps a way to tell Elasticsearch to run the aggregation on the filtered documents only?

Or maybe there is another cause of such long query execution? Any ideas are highly appreciated!

Bracelet answered 3/6, 2016 at 12:59 Comment(5)
Can you take out the top_hits aggregation and try again? (just to see if this one is the heaviest or not, as I'm assuming)Manchu
Hmm... can you explain why you added the "cheapest" part? What is it and how are you using it? It looks like you don't need that part. And if possible, can you provide more details about your mapping or the query?Bund
@AndreiStefan top_hits is actually not the problem; it could be taken out. It was left in for the sake of completeness.Bracelet
@Bracelet can you provide a sample document, if possible? I will try to solve it again. Thanks.Bund
@Bund Here it is: gist.github.com/prikha/6b117574284c3e4744169f0813386d13 However, we still don't have a solution for the main problem with the timings.Bracelet

Thanks again for the effort.

Finally we have solved the main problem and our performance is back to normal.

To be brief, we have done the following:

- updated the mapping for default_group_field to be of type long
- compressed the default_group_field values so that they fit into type long

Some explanations:

Aggregations on string fields require extra work to be done on them. As we can see from the logs, building global ordinals for that field, which has very high cardinality, was very expensive. In fact, the only thing we do with this field is aggregate on it, so using the string type is not very efficient.

So we have changed the mapping to:

"default_group_field": {
  "type": "long",
  "index": "not_analyzed"
}

This way we avoid those expensive operations entirely.
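
We have not spelled out our exact encoding above, so here is an illustration of one possible scheme (hypothetical, positional decimal digits): encode each group as duration * 10^12 + yyyymmdd * 10^4 + adults * 100 + kids. The sample offer with duration 7, start_date 2016-07-01, 2 adults and 1 kid would then be indexed as:

  "default_group_field": 7201607010201

Any scheme works, as long as it maps each (duration, start_date, adults, kids) combination to a unique long.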

After this change, the timing of the same query dropped to ~100 ms. CPU usage dropped as well.

PS 1

I got a lot of useful info from the docs on global ordinals.

PS 2

I still have no idea how to get around this issue for a field of type string. Please comment if you have any ideas.

Bracelet answered 8/6, 2016 at 9:4 Comment(1)
RE: PS 2 - If you are using the string type, you could possibly improve performance by eagerly building global ordinals: elastic.co/guide/en/elasticsearch/reference/current/… and alexmarquardt.com/…Pernell

This is likely due to the default behaviour of terms aggregations, which requires global ordinals to be built. That computation can be expensive for high-cardinality fields.

The following blog post addresses the likely cause of this poor performance and describes several approaches to resolve it:

https://www.elastic.co/blog/improving-the-performance-of-high-cardinality-terms-aggregations-in-elasticsearch
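
One of the approaches from that post is to skip global ordinals entirely and let the terms aggregation build a per-request map of the terms it actually encounters. A sketch against the original query from the question (field names taken from there):

{
  "size": 0,
  "aggs": {
    "offers": {
      "terms": {
        "field": "default_group_field",
        "size": 5,
        "execution_hint": "map"
      }
    }
  }
}

This tends to pay off when the filter matches only a small number of documents (a few hundred here), since building a small map per request is far cheaper than building global ordinals over every distinct term in the index.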

Pernell answered 10/5, 2019 at 19:9 Comment(1)
Wow - that was exactly what I was looking for! Thank you very much. When using a terms aggregation on a high-cardinality field while first filtering the results down to a small number of documents, this is definitely the way to go! Try execution_hint: 'map'.Howdah

OK, I will try to answer this. There are a few parts of the question that I was not able to understand, such as:

To avoid sub-aggregations we have united target fields values into one called default_group_field by joining them with dot(.)

I am not sure what you really mean by this, because you said that:

You added this field to avoid sub-aggregations (but how? And how does joining the values with a dot (.) avoid them?)

I am also fairly new to Elasticsearch, so if there is anything I missed, you can comment on this answer. Thanks.

That said, I will continue and try to answer this question.

Before that, I am assuming that you have the default_group_field field to differentiate between records by duration, start_date, adults and kids.

I will provide an example below, after my solution.

My solution:

{
  "size": 0,
  "aggs": {
    "offers": {
      "terms": {
        "field": "default_group_field"
      },
      "aggs": {
        "sort_cost_asc": {
          "top_hits": {
            "sort": [
              {
                "cost": {
                  "order": "asc"
                }
              }
            ],
            "_source": {
              "include": [ ... fields you want from the document ... ]
            },
            "size": 1
          }
        }
      }
    }
  },
  "query": {
"... your query part ..."
   }
}

I will try to explain what I am trying to do here:

I am assuming that your documents look like this (maybe there is some nesting as well, but for the example I will keep the documents as simple as I can):

document1:

{
  "default_group_field": "kids",
  "cost": 100,
  "documentId": 1
}

document2:

{
  "default_group_field": "kids",
  "cost": 120,
  "documentId": 2
}

document3:

{
  "default_group_field": "adults",
  "cost": 50,
  "documentId": 3
}

document4:

{
  "default_group_field": "adults",
  "cost": 150,
  "documentId": 4
}

So now you have these documents and you want to get the minimum-cost document for both adults and kids.

Your query should look like this:

    {
      "size": 0,
      "aggs": {
        "offers": {
          "terms": {
            "field": "default_group_field"
          },
          "aggs": {
            "sort_cost_asc": {
              "top_hits": {
                "sort": [
                  {
                    "cost": {
                      "order": "asc"
                    }
                  }
                ],
                "_source": {
                  "include": ["documentId", "cost", "default_group_field"]
                },
                "size": 1
              }
            }
          }
        }
      },
      "query": {
         "filtered":{ "query": { "match_all": {} } }   
       }
    }

To explain the above query: I am grouping the documents by "default_group_field", then sorting each group by cost, and "size": 1 gives me just one document per group.

Therefore the result of this query will be the minimum-cost document in each category (adults and kids).
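
With the four sample documents above, the (abbreviated) response would contain one bucket per group, each holding only its cheapest document:

{
  "aggregations": {
    "offers": {
      "buckets": [
        {
          "key": "adults",
          "doc_count": 2,
          "sort_cost_asc": {
            "hits": {
              "hits": [
                { "_source": { "documentId": 3, "cost": 50, "default_group_field": "adults" } }
              ]
            }
          }
        },
        {
          "key": "kids",
          "doc_count": 2,
          "sort_cost_asc": {
            "hits": {
              "hits": [
                { "_source": { "documentId": 1, "cost": 100, "default_group_field": "kids" } }
              ]
            }
          }
        }
      ]
    }
  }
}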

Usually, when I write a query for Elasticsearch or a database, I try to minimize the number of documents or rows involved.

I assume that I have understood your question correctly. If I have misunderstood it or made a mistake somewhere, please reply and let me know where I went wrong.

Thanks,

Bund answered 7/6, 2016 at 10:35 Comment(1)
Thanks again for the effort! Unfortunately, this does not do the trick: pagination and sorting by the top_hits cost are lost this way.Bracelet
