What is the significance of doc_count_error_upper_bound in elasticsearch and how can it be minimized?

I am always getting a high value for the doc_count_error_upper_bound attribute on aggregation queries in Elasticsearch. It's sometimes as high as 8000 or 9000 for an ES cluster with almost a billion documents indexed. When I run the query on an index of about 5M docs, the value is about 300 to 500.

The question is: how incorrect are my results? (I am trying a top-20 count query based on the JSON below.)

"aggs":{ "group_by_creator":{ "terms":{ "field":"creator" } } } }
Rolfe answered 29/5, 2016 at 18:49

This is pretty well explained in the official documentation.

When running a terms aggregation, each shard figures out its own top-20 list of terms and returns those top 20 terms. The coordinating node then gathers all of those terms and re-sorts them to get the overall top 20 across all shards.

If you have more than one shard, it's no surprise that there can be a non-zero doc count error, as shown in the official doc example, and there's a way to compute that error.
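
For reference, here's a sketch of where that error surfaces in a response. The field names come from the terms aggregation response format; the keys and numbers below are made up for illustration:

    "aggregations": {
      "group_by_creator": {
        "doc_count_error_upper_bound": 4600,
        "sum_other_doc_count": 128231,
        "buckets": [
          { "key": "alice", "doc_count": 9211 },
          { "key": "bob", "doc_count": 8762 }
        ]
      }
    }

sum_other_doc_count is the total count of documents that fell into buckets outside the returned top terms, while doc_count_error_upper_bound is the worst-case count a term missing from the list could have.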

With one shard per index, the doc count error will always be zero. A single shard might not be feasible depending on your index topology, especially if you have almost one billion documents, but for your index with 5M docs, if they are not too big, they could well be stored in a single shard. Of course, it depends a lot on your hardware, but if your shard size doesn't exceed 15-20 GB, you should be fine. You should try to create a new index with a single shard and see how it goes.
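
For example, a minimal sketch of creating such a single-shard index (the index name here is just a placeholder; you would then reindex your data into it):

    PUT /creators_single_shard
    {
      "settings": {
        "index": {
          "number_of_shards": 1
        }
      }
    }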

Domineering answered 30/5, 2016 at 4:26

I created this visualisation to try and understand it myself.

[Image: example of Elasticsearch aggregation errors]

There are two levels of aggregation errors:

  • Whole Aggregation - shows you the potential value of a missing term
  • Term Level - indicates the potential inaccuracy in a returned term

The first gives a value for the aggregation as a whole which represents the maximum potential document count for a term which did not make it into the final list of terms.

and

The second shows an error value for each term returned by the aggregation which represents the worst case error in the document count and can be useful when deciding on a value for the shard_size parameter. This is calculated by summing the document counts for the last term returned by all shards which did not return the term.

You can see the term level error by setting:

"show_term_doc_count_error": true

The whole-aggregation error, by contrast, is shown by default.
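
Putting the two together, here is a sketch of a request that surfaces both error levels (the index name is a placeholder):

    GET /my-index/_search
    {
      "size": 0,
      "aggs": {
        "group_by_creator": {
          "terms": {
            "field": "creator",
            "show_term_doc_count_error": true
          }
        }
      }
    }

Each bucket in the response then carries its own doc_count_error_upper_bound alongside the whole-aggregation one.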

Quotes are from the official docs.

Fumigant answered 13/10, 2020 at 14:17

Comments:
  • Great explanation, thanks! I would just add that the accuracy can be raised by raising the number of returned buckets: when we want the top 5 buckets, we can tell Elastic to return the top 10 and thereby reduce the probability that some shard does not contribute all the terms from the top 5. But of course this comes at the cost of performance. - Fayfayal
  • In the second shard, why is POTTER's count greater than CHEN's? Is that really how ES balancing works? (The POTTER and CHEN docs are uneven.) - Mammillary
  • Increasing the shard_size parameter helps improve accuracy. - Hibbert
  • In the case of Wearne, shouldn't the second shard's error be 8, not 2? Isn't it the sum of the sizes of the largest buckets on each shard that didn't fit into shard_size? - Prenomen
  • The largest such bucket was Jay=2, hence the max error for Wearne is 2. - Fumigant

To add to the other (fine) answer:

You can minimize the doc_count_error_upper_bound by increasing the shard_size parameter, like this:

"aggs":{ "group_by_creator":{ "terms":{
    "field":"creator",
    "size": 20,
    "shard_size": 100
} } } }

What does this do?

When you ask for a terms aggregation, Elastic passes it on to the individual shards and later recombines the results. That could, for example, look like this (shrunk for brevity):

  • Shard 0 reports: A: 10, B: 8, C: 6, D: 4, E: 2
  • Shard 1 reports: A: 12, C: 8, F: 5, B: 4, G: 4

The coordinating node adds this up to:

A: 22
C: 14
B: 12
F: 5 (plus up to 2 missing)
D: 4 (plus up to 4 missing)
G: 4 (plus up to 2 missing)
E: 2 (plus up to 4 missing)

This means the coordinating node could return a top-3 that is guaranteed accurate, but in a top-5 the count for D could be off by as much as 4, because we missed some data.
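
To make that concrete, here's a sketch of how this toy example could come back, assuming show_term_doc_count_error is enabled and both shards had more terms than they returned. The whole-aggregation bound is 6, the sum of each shard's last returned count (E: 2 plus G: 4), and each returned term carries its own worst-case error:

    "doc_count_error_upper_bound": 6,
    "buckets": [
      { "key": "A", "doc_count": 22, "doc_count_error_upper_bound": 0 },
      { "key": "C", "doc_count": 14, "doc_count_error_upper_bound": 0 },
      { "key": "B", "doc_count": 12, "doc_count_error_upper_bound": 0 },
      { "key": "F", "doc_count": 5, "doc_count_error_upper_bound": 2 },
      { "key": "D", "doc_count": 4, "doc_count_error_upper_bound": 4 }
    ]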

To prevent this problem, Elastic can ask each shard to report more terms. As you can see, if we had only wanted a top 3, having each shard report its top 5 already solved the problem; and to get an accurate top 5, requesting that each shard send its top 10 might well have solved it too.

That's what shard_size does: it asks each shard to report a larger number of terms, so that we hopefully get counts for enough terms to lower the error. This comes at the cost of extra processing on each shard, so setting it to a ridiculous value like int.MaxValue (as another answer suggests) is a very bad idea if you have a lot of documents.

Why doesn't elastic already do this by itself?

Surprise: it does. It's just not always enough, and if you experiment with the value you'll see you can also make the results worse. By default (as mentioned in the docs), it uses size * 1.5 + 10, giving a shard_size of 25 for the default top 10, or 40 for your use case.

Yuzik answered 28/8, 2023 at 14:27

Set shard_size to int.MaxValue and it will reduce the errors in the counts.

Caster answered 26/8, 2018 at 17:48
