Elasticsearch multi-node cluster one node always fails with docker compose

I am attempting to set up a basic three-node Elasticsearch 7.7.0 dev cluster with Docker Compose, following the official documentation, but I cannot get all three nodes to run at the same time. After running docker-compose up, at least one of the containers exits either right away or shortly afterwards, like this:

Starting es01 ... done
Starting es03 ... done
Starting es02 ... done
Attaching to es03, es02, es01
es01 exited with code 137
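
One way to check whether Docker killed the container for memory, rather than Elasticsearch exiting on its own, is to inspect the container's final state. This is just a quick check using the es01 container name from the compose file below; note the OOMKilled flag is not always populated reliably on Docker Desktop:

# Show whether the container was OOM-killed and its exit code
docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' es01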

If I try to bring it back up by running docker-compose start es01 (or whichever node happened to exit; it seems to be random), the remaining nodes log errors like these non-stop until I kill all the containers:

es03    | {"type": "server", "timestamp": "2020-05-25T15:52:37,214Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-docker-cluster", "node.name": "es03", "message": "master not discovered or elected yet, an election requires 2 nodes with ids [9RNLWSMbTmetFFh3Tm1q0g, 3CbHK5iBQAqt7poPRsMmaw], have discovered [{es03}{3CbHK5iBQAqt7poPRsMmaw}{AGMLafOfRyyMVeEuFuaTOA}{172.25.0.2}{172.25.0.2:9300}{dilmrt}{ml.machine_memory=2085416960, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {es02}{9RNLWSMbTmetFFh3Tm1q0g}{bnuXTgAqQyCz04G7Cva1sA}{172.25.0.4}{172.25.0.4:9300}{dilmrt}{ml.machine_memory=2085416960, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}] which is a quorum; discovery will continue using [172.25.0.4:9300] from hosts providers and [{es03}{3CbHK5iBQAqt7poPRsMmaw}{AGMLafOfRyyMVeEuFuaTOA}{172.25.0.2}{172.25.0.2:9300}{dilmrt}{ml.machine_memory=2085416960, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] from last-known cluster state; node term 0, last-accepted version 0 in term 0" }
es02    | {"type": "server", "timestamp": "2020-05-25T15:52:37,936Z", "level": "WARN", "component": "o.e.d.SeedHostsResolver", "cluster.name": "es-docker-cluster", "node.name": "es02", "message": "failed to resolve host [es01]", "cluster.uuid": "e16AK3EnQ-28Xky2afhx4A", "node.id": "9RNLWSMbTmetFFh3Tm1q0g"

If I don't try to bring the exited container back up, the other two seem to work fine; http://localhost:9200/_cluster/health shows:

{"cluster_name":"es-docker-cluster","status":"green","timed_out":false,"number_of_nodes":2,"number_of_data_nodes":2,"active_primary_shards":0,"active_shards":0,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}

I am using Docker Desktop 2.2.0.5 on macOS Catalina 10.15.4 with Docker Compose 1.25.4.

This is my docker-compose.yml file for reference:

version: '2.2'
services:
  es01:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.7.0
    container_name: es01
    environment:
      - node.name=es01
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es02,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data01:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - elastic
  es02:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.7.0
    container_name: es02
    environment:
      - node.name=es02
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es03
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data02:/usr/share/elasticsearch/data
    networks:
      - elastic
  es03:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.7.0
    container_name: es03
    environment:
      - node.name=es03
      - cluster.name=es-docker-cluster
      - discovery.seed_hosts=es01,es02
      - cluster.initial_master_nodes=es01,es02,es03
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data03:/usr/share/elasticsearch/data
    networks:
      - elastic

volumes:
  data01:
    driver: local
    name: data01
  data02:
    driver: local
    name: data02
  data03:
    driver: local
    name: data03

networks:
  elastic:
    driver: bridge
    name: elastic
Osburn answered 25/5, 2020 at 16:46

My Docker Desktop memory setting was at 2 GB; once I raised it above 4 GB, the issue stopped.

It turns out that exit code 137 with Docker containers is commonly due to memory allocation issues!
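
To verify how much memory the Docker engine's VM actually has before and after changing the setting, docker info reports it; a quick sanity check (the --format value is in bytes):

# Total memory available to the Docker engine, in bytes
docker info --format '{{.MemTotal}}'

# Or look at the human-readable line
docker info | grep -i 'total memory'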

Osburn answered 25/5, 2020 at 18:27

For me, exit code 137 was caused by giving Elasticsearch so much memory that it tried to reserve more than my system's available memory for the replica. The primary took up 6 GB in docker stats, so Elasticsearch was reserving another 6 GB for a replica (but never allocating it, because the replica was on the same node).
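
docker stats is the easiest way to watch this; a one-shot snapshot of the three containers would look something like the following (container names assumed from the question's compose file):

# Non-streaming snapshot of per-container CPU and memory usage
docker stats --no-stream es01 es02 es03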

I limited the heap size in docker-compose.yml:

environment:
    - ES_JAVA_OPTS=-Xms750m -Xmx750m

This ran everything just as fast in 1 GB of memory as it formerly did in 6 GB, without crashing.

Heinrich answered 18/10, 2021 at 19:14

I had this because vm.max_map_count was too low; I had to multiply it by the number of instances I wanted to run. The suggested value (262144) seems to be fine for one instance, but it wasn't enough for two. You'll find OOM-killer messages in the dmesg output like the following:

[1937998.722050] Out of memory: Killed process 25343 (java) total-vm:37704556kB, anon-rss:33089604kB, file-rss:17612kB, shmem-rss:0kB, UID:1101 pgtables:65484kB oom_score_adj:0
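
For reference, this is roughly how to check and raise the setting on a Linux host; 524288 is just an illustrative value for two instances, per the multiplication above, and the last command persists it so it survives reboots:

# Current value
sysctl vm.max_map_count

# Raise it for the running kernel
sudo sysctl -w vm.max_map_count=524288

# Persist the change across reboots
echo 'vm.max_map_count=524288' | sudo tee /etc/sysctl.d/99-elasticsearch.conf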

I also limited Xmx and Xms like many others here have suggested: 8 GB each (the node has 64 GB, and I plan to run four cold-storage instances).

Hodgson answered 11/3 at 2:15
