I'm using Elasticsearch 2.3 and I'm trying to perform a two-step computation with a pipeline aggregation. I'm only interested in the final result of the pipeline aggregation, but Elasticsearch returns all of the intermediate bucket information as well.
Since I have a huge number of buckets (tens or hundreds of millions), this is prohibitive. Unfortunately, I cannot find a way to tell Elasticsearch not to return this information.
Here is a toy example. I have an index test-index with a document type obj. obj has two fields, key and value.
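For reference, a mapping along these lines reproduces the example (a sketch in Elasticsearch 2.x syntax, not necessarily my exact mapping; key is a not_analyzed string so that the terms aggregation below groups on whole keys):
curl -XPUT 'http://10.10.0.7:9200/test-index' -d '{
  "mappings": {
    "obj": {
      "properties": {
        "key": { "type": "string", "index": "not_analyzed" },
        "value": { "type": "long" }
      }
    }
  }
}'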
curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 100,
  "key": "foo"
}'
curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 20,
  "key": "foo"
}'
curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 50,
  "key": "bar"
}'
curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 60,
  "key": "bar"
}'
curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 70,
  "key": "bar"
}'
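If you want to reproduce this, note that newly indexed documents only become visible to search after a refresh:
curl -XPOST 'http://10.10.0.7:9200/test-index/_refresh'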
I want to get the average (over all keys) of the minimum value among the objs sharing the same key. An average of minima.
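With the documents above, min(foo) = min(100, 20) = 20 and min(bar) = min(50, 60, 70) = 50, so the expected result is (20 + 50) / 2 = 35.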
Elasticsearch allows me to do this:
curl -XPOST 'http://10.10.0.7:9200/test-index/obj/_search' -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "key_aggregates": {
      "terms": {
        "field": "key",
        "size": 0
      },
      "aggs": {
        "min_value": {
          "min": {
            "field": "value"
          }
        }
      }
    },
    "avg_min_value": {
      "avg_bucket": {
        "buckets_path": "key_aggregates>min_value"
      }
    }
  }
}'
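Here avg_bucket is a sibling pipeline aggregation of key_aggregates: its buckets_path "key_aggregates>min_value" points at the min_value sub-aggregation inside each terms bucket, and it averages those minima across all buckets.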
But this query returns the minimum for every bucket, although I don't need it:
{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "key_aggregates": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "bar",
          "doc_count": 3,
          "min_value": {
            "value": 50
          }
        },
        {
          "key": "foo",
          "doc_count": 2,
          "min_value": {
            "value": 20
          }
        }
      ]
    },
    "avg_min_value": {
      "value": 35
    }
  }
}
Is there a way to get rid of all the information inside "buckets": [...]? I'm only interested in avg_min_value.
This might not seem like a problem in this toy example, but when the number of distinct keys is big (tens or hundreds of millions), the query response becomes prohibitively large, and I would like to prune it.
Is there a way to do this with Elasticsearch? Or am I modelling my data wrong?
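The closest workaround I have found is response filtering with the filter_path query-string parameter, along these lines (a sketch; I have not verified how filter_path interacts with pipeline aggregations in 2.3):
curl -XPOST 'http://10.10.0.7:9200/test-index/obj/_search?filter_path=aggregations.avg_min_value' -d '{
  "size": 0,
  "query": { "match_all": {} },
  "aggregations": {
    "key_aggregates": {
      "terms": { "field": "key", "size": 0 },
      "aggs": {
        "min_value": { "min": { "field": "value" } }
      }
    },
    "avg_min_value": {
      "avg_bucket": { "buckets_path": "key_aggregates>min_value" }
    }
  }
}'
But if I understand correctly, this only trims the serialized response; Elasticsearch still builds every bucket internally, so I would prefer a way to avoid producing them at all.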
NB: it is not acceptable to pre-aggregate my data per key, since the match_all part of my query may be replaced by complex and unknown filters.
NB2: changing size to a positive number in my terms aggregation is not acceptable either, because it would truncate the list of terms and therefore change the result (in Elasticsearch 2.x, "size": 0 means "return all terms").