How to perform a pipeline aggregation without returning all buckets in Elasticsearch
I'm using Elasticsearch 2.3 and I'm trying to perform a two-step computation using a pipeline aggregation. I'm only interested in the final result of the pipeline aggregation, but Elasticsearch returns all of the bucket information as well.

Since I have a huge number of buckets (tens or hundreds of millions), this is prohibitive. Unfortunately, I cannot find a way to tell Elasticsearch not to return all this information.

Here is a toy example. I have an index test-index with a document type obj. obj has two fields, key and value.

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 100,
  "key": "foo"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 20,
  "key": "foo"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 50,
  "key": "bar"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 60,
  "key": "bar"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 70,
  "key": "bar"
}'

I want to get the average (over all keys) of the minimum value of obj documents sharing the same key: an average of minima.
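For reference, this is the computation the two-step aggregation performs, sketched in plain Python on the five sample documents above:

```python
# The sample documents indexed above, as (key, value) records.
docs = [
    {"key": "foo", "value": 100},
    {"key": "foo", "value": 20},
    {"key": "bar", "value": 50},
    {"key": "bar", "value": 60},
    {"key": "bar", "value": 70},
]

# Step 1: minimum value per key (the terms + min aggregation).
minima = {}
for doc in docs:
    k, v = doc["key"], doc["value"]
    minima[k] = min(v, minima.get(k, float("inf")))

# Step 2: average of those minima (the avg_bucket pipeline aggregation).
avg_min_value = sum(minima.values()) / len(minima)
print(avg_min_value)  # min(foo) = 20, min(bar) = 50, average = 35.0
```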

Elasticsearch allows me to do this:

curl -XPOST 'http://10.10.0.7:9200/test-index/obj/_search' -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "key_aggregates": {
      "terms": {
        "field": "key",
        "size": 0
      },
      "aggs": {
        "min_value": {
          "min": {
            "field": "value"
          }
        }
      }
    },
    "avg_min_value": {
      "avg_bucket": {
        "buckets_path": "key_aggregates>min_value"
      }
    }
  }
}'

But this query returns the minimum for every bucket, even though I don't need them:

{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": [

    ]
  },
  "aggregations": {
    "key_aggregates": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "bar",
          "doc_count": 2,
          "min_value": {
            "value": 50
          }
        },
        {
          "key": "foo",
          "doc_count": 2,
          "min_value": {
            "value": 20
          }
        }
      ]
    },
    "avg_min_value": {
      "value": 35
    }
  }
}

Is there a way to get rid of all the information inside "buckets": [...]? I'm only interested in avg_min_value.

This might not seem like a problem in this toy example, but when the number of different keys is big (tens or hundreds of millions), the query response is prohibitively large, and I would like to prune it.

Is there a way to do this with Elasticsearch? Or am I modelling my data wrong?

NB: it is not acceptable to pre-aggregate my data per key, since the match_all part of my query might be replaced by complex and unknown filters.

NB2: setting size to a positive number in my terms aggregation is not acceptable because it would truncate the buckets and change the result.

Goodall answered 28/6, 2016 at 16:36 Comment(0)

I had the same issue, and after doing quite a bit of research I found a solution that I thought I'd share here.

You can use the Response Filtering feature to restrict the response to only the parts you want to receive.

You should be able to achieve what you want by adding the query-string parameter filter_path=aggregations.avg_min_value to the search URL. In the example case, it should look similar to this:

curl -XPOST 'http://10.10.0.7:9200/test-index/obj/_search?filter_path=aggregations.avg_min_value' -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "key_aggregates": {
      "terms": {
        "field": "key",
        "size": 0
      },
      "aggs": {
        "min_value": {
          "min": {
            "field": "value"
          }
        }
      }
    },
    "avg_min_value": {
      "avg_bucket": {
        "buckets_path": "key_aggregates>min_value"
      }
    }
  }
}'
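As a rough illustration of what filter_path does (the aggregation is still computed in full on the server; only the response body is pruned before it is serialized and sent back), here is a sketch in Python. The helper name prune is mine, not part of any Elasticsearch client, and it only handles a single dot-separated path, whereas the real filter_path also supports wildcards and comma-separated lists of paths:

```python
def prune(obj, path):
    """Keep only the sub-tree of a response dict addressed by a
    dot-separated path, loosely mimicking Elasticsearch's filter_path."""
    keys = path.split(".")
    out = obj
    for k in keys:          # walk down to the requested sub-tree
        out = out[k]
    for k in reversed(keys):  # rebuild the nesting around it
        out = {k: out}
    return out

# The full response from the question, abbreviated.
response = {
    "took": 21,
    "aggregations": {
        "key_aggregates": {"buckets": [...]},  # potentially huge
        "avg_min_value": {"value": 35},
    },
}

print(prune(response, "aggregations.avg_min_value"))
# {'aggregations': {'avg_min_value': {'value': 35}}}
```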

PS: if you found another solution would you mind sharing it here? Thanks!

Naphthyl answered 18/12, 2016 at 3:55 Comment(2)
This seems like a perfect solution. If only we had found this! We didn't find a direct solution to this problem, so we adopted the script-aggregation approach: building a map whose keys are the field key and whose values are the minima over all docs with a given key, then aggregating the values of this map. I don't know how the two compare in terms of performance, but in all likelihood your solution should be faster! – Goodall
@Naphthyl do you know if response filtering circumvents the bucket limitations of Elasticsearch? Or does it simply truncate/hide the buckets in the output? – Deuce
