Elasticsearch 7.x circuit breaker - data too large - troubleshoot
Asked Answered
E

2

13

The problem:
Since the upgrading from ES-5.4 to ES-7.2 I started getting "data too large" errors, when trying to write concurrent bulk request (or/and search requests) from my multi-threaded Java application (using elasticsearch-rest-high-level-client-7.2.0.jar java client) to an ES cluster of 2-4 nodes.

My ES configuration:

Elasticsearch version: 7.2

custom configuration in elasticsearch.yml:   
    thread_pool.search.queue_size = 20000  
    thread_pool.write.queue_size = 500

I use only the default 7.x circuit-breaker values, such as:  
    indices.breaker.total.limit = 95%  
    indices.breaker.total.use_real_memory = true  
    network.breaker.inflight_requests.limit = 100%  
    network.breaker.inflight_requests.overhead = 2  

The error from elasticsearch.log:

    {
      "error": {
        "root_cause": [
          {
            "type": "circuit_breaking_exception",
            "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
            "bytes_wanted": 3144831050,
            "bytes_limit": 3060164198,
            "durability": "PERMANENT"
          }
        ],
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [3144831050/2.9gb], which is larger than the limit of [3060164198/2.8gb], real usage: [3144829848/2.9gb], new bytes reserved: [1202/1.1kb]",
        "bytes_wanted": 3144831050,
        "bytes_limit": 3060164198,
        "durability": "PERMANENT"
      },
      "status": 429
    }

Thoughts:
I'm having hard time to pin point the source of the issue.
When using ES cluster nodes with <=8gb heap size (on a <=16gb vm), the problem become very visible, so, one obvious solution is to increase the memory of the nodes.
But I feel that increasing the memory only hides the issue.

Questions:
I would like to understand what scenarios could have led to this error?
and what action can I take in order to handle it properly?
(change circuit-breaker values, change es.yml configuration, change/limit my ES requests)

Engrail answered 5/2, 2020 at 11:54 Comment(1)
do you have an example request that triggers this circuit breaker? Generally its some aggregations with sub buckets aggregations with huge size.Portulaca
E
5

So I've spent some time researching how exactly ES implemented the new circuit breaker mechanism, and tried to understand why we are suddenly getting those errors?

  1. the circuit breaker mechanism exists since the very first versions.
  2. we started experience issues around it when moving from version 5.4 to 7.2
  3. in version 7.2 ES introduced a new way for calculating circuit-break: Circuit-break based on real memory usage (why and how: https://www.elastic.co/blog/improving-node-resiliency-with-the-real-memory-circuit-breaker, code: https://github.com/elastic/elasticsearch/pull/31767)
  4. In our internal upgrade of ES to version 7.2, we changed the jdk from 8 to 11.
  5. also as part of our internal upgrade we changed the jvm.options default configuration, switching the official recommended CMS GC with the G1GC GC which have a fairly new support by elasticsearch.
  6. considering all the above, I found this bug that was fixed in version 7.4 regarding the use of circuit-breaker together with the G1GC GC: https://github.com/elastic/elasticsearch/pull/46169

How to fix:

  1. change the configuration back to CMS GC.
  2. or, take the fix. the fix for the bug is just a configuration change that can be easily changed and tested in your deployment.
Engrail answered 24/6, 2020 at 20:4 Comment(0)
N
10

The reason is that the heap of the node is pretty full and being caught by the circuit breaker is nice because it prevents the nodes from running into OOMs, going stale and crash...

Elasticsearch 6.2.0 introduced the circuit breaker and improved it in 7.0.0. With the version upgrade from ES-5.4 to ES-7.2, you are running straight into this improvement.

I see 3 solutions so far:

  1. Increase heap size if possible
  2. Reduce the size of your bulk requests if feasible
  3. Scale-out your cluster as the shards are consuming a lot of heap, leaving nothing to process the large request. More nodes will help the cluster to distribute the shards and requests among more nodes, what leads to a lower AVG heap usage on all nodes.

As an UGLY workaround (not solving the issue) one could increase the limit after reading and understanding the implications:

Nichollenicholls answered 5/2, 2020 at 18:38 Comment(0)
E
5

So I've spent some time researching how exactly ES implemented the new circuit breaker mechanism, and tried to understand why we are suddenly getting those errors?

  1. the circuit breaker mechanism exists since the very first versions.
  2. we started experience issues around it when moving from version 5.4 to 7.2
  3. in version 7.2 ES introduced a new way for calculating circuit-break: Circuit-break based on real memory usage (why and how: https://www.elastic.co/blog/improving-node-resiliency-with-the-real-memory-circuit-breaker, code: https://github.com/elastic/elasticsearch/pull/31767)
  4. In our internal upgrade of ES to version 7.2, we changed the jdk from 8 to 11.
  5. also as part of our internal upgrade we changed the jvm.options default configuration, switching the official recommended CMS GC with the G1GC GC which have a fairly new support by elasticsearch.
  6. considering all the above, I found this bug that was fixed in version 7.4 regarding the use of circuit-breaker together with the G1GC GC: https://github.com/elastic/elasticsearch/pull/46169

How to fix:

  1. change the configuration back to CMS GC.
  2. or, take the fix. the fix for the bug is just a configuration change that can be easily changed and tested in your deployment.
Engrail answered 24/6, 2020 at 20:4 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.