Docker Swarm host cannot resolve hosts on other nodes

I am following this excellent tutorial: https://github.com/binblee/springcloud-swarm

When I deploy a stack to a Docker swarm that contains a single node (just the manager node), it works perfectly.

docker stack deploy -c all-in-one.yml springcloud-demo

I have four Docker containers. One of them is the Eureka service discovery server, and the other three register with it successfully.

The problem starts when I add a worker node to the swarm. Two of the containers are then deployed to the worker and two to the manager, and the services deployed to the worker node cannot find the Eureka server.

java.net.UnknownHostException: eureka: Name does not resolve

This is my compose file:

version: '3'
services:
  eureka:
    image: demo-eurekaserver
    ports:
      - "8761:8761"

  web:
    image: demo-web
    environment:
      - EUREKA_SERVER_ADDRESS=http://eureka:8761/eureka

  zuul:
    image: demo-zuul
    environment:
      - EUREKA_SERVER_ADDRESS=http://eureka:8761/eureka
    ports:
      - "8762:8762"

  bookservice:
    image: demo-bookservice
    environment:
      - EUREKA_SERVER_ADDRESS=http://eureka:8761/eureka

Also, I can only access the Eureka service discovery server through the host on which it is deployed.

I thought that using "docker stack deploy" automatically creates an overlay network, in which all exposed ports will be routed to a host on which the respective service is running:

From https://docs.docker.com/engine/swarm/ingress/ :

All nodes participate in an ingress routing mesh. The routing mesh enables each node in the swarm to accept connections on published ports for any service running in the swarm, even if there’s no task running on the node.

This is the output of docker service ls:

manager:~/springcloud-swarm/compose$ docker service ls

ID                  NAME                           MODE                REPLICAS            IMAGE                                                  PORTS
rirdysi0j4vk        springcloud-demo_bookservice   replicated          1/1                 demo-bookservice:latest
936ewzxwg82l        springcloud-demo_eureka        replicated          1/1                 demo-eurekaserver:latest   *:8761->8761/tcp
lb1p8nwshnvz        springcloud-demo_web           replicated          1/1                 demo-web:latest
0s52zecjk05q        springcloud-demo_zuul          replicated          1/1                 demo-zuul:latest           *:8762->8762/tcp

and of docker stack ps springcloud-demo:

manager:$ docker stack ps springcloud-demo
ID                  NAME                             IMAGE                      NODE            DESIRED STATE       CURRENT STATE        
o8aed04qcysy        springcloud-demo_web.1           demo-web:latest            workernode      Running             Running 2 minutes ago
yzwmx3l01b94        springcloud-demo_eureka.1        demo-eurekaserver:latest   managernode     Running             Running 2 minutes ago
rwe9y6uj3c73        springcloud-demo_bookservice.1   demo-bookservice:latest    workernode      Running             Running 2 minutes ago
iy5e237ca29o        springcloud-demo_zuul.1          demo-zuul:latest           managernode     Running             Running 2 minutes ago

UPDATE:

I successfully added another host, but now I can't add a third. I tried a couple of times, following the same steps (installing Docker, opening the requisite ports, joining the swarm), but the node cannot find the Eureka server by its container host name.

UPDATE 2:

In testing that the ports were opened, I examined the firewall config:

workernode:~$ sudo ufw status
Status: active

To                         Action      From
--                         ------      ----
8080                       ALLOW       Anywhere
4789                       ALLOW       Anywhere
7946                       ALLOW       Anywhere
2377                       ALLOW       Anywhere
8762                       ALLOW       Anywhere
8761                       ALLOW       Anywhere
22                         ALLOW       Anywhere

However - when I try to hit port 2377 on the worker node from the manager node, I can't:

managernode:~$ telnet xx.xx.xx.xx 2377

Trying xx.xx.xx.xx...
telnet: Unable to connect to remote host: Connection refused
Fraktur answered 5/10, 2018 at 12:16 Comment(2)
So is the Eureka server failing to resolve only on the worker nodes, or on all nodes including the manager? - Ganister
Could you please add the output of docker network ls and docker inspect network-name (especially the containers section)? - Chabazite

Let us break the solution into parts. Each part covers one aspect of the problem, and the parts build on each other.

Docker container network

Whenever we create a container without specifying a network, Docker attaches it to the default bridge network. According to the Docker documentation, service discovery is unavailable on the default network. Hence, to make service discovery work properly, we are supposed to create a user-defined network, as it provides isolation, DNS resolution and many more features. All of this applies when we use the docker run command.

When docker-compose is used to run a container and no network is specified, it creates its own bridge network, which has all the properties of a user-defined network.

These bridge networks are not attachable by default, but they allow Docker containers on the local machine to connect to them.

Docker swarm network

In Docker swarm mode with the routing mesh, a service that publishes ports and is deployed without specifying a network is connected to the ingress network. Note that docker stack deploy additionally creates a default overlay network for the stack (named <stack>_default) when the compose file declares none.

When you create an overlay network, you will notice that it is visible only on the manager, and appears on a worker node only once a task of a service that uses it is scheduled there. Overlay networks are also not attachable by default, so containers outside swarm services cannot connect to them. You only need to declare a network as attachable if you want to connect a standalone container to it.

Docker Swarm

As there is no predefined or official limit on the number of worker or manager nodes, you should be able to join the third node. One possibility is that the node joined as a worker and you are trying to start a container directly on it, which is refused if the overlay network is not attachable.

Moreover, you can't deploy a service directly on a worker node. All services are deployed through a manager node, which takes care of replicating and scaling them based on the configuration and mode provided.

Firewall

As mentioned in "Getting started with swarm mode":

  • TCP port 2377 for cluster management communications
  • TCP and UDP port 7946 for communication among nodes
  • UDP port 4789 for overlay network traffic
  • IP protocol 50 (ESP) for encrypted overlay networks

These ports should be allowed for communication between nodes. Most firewalls need to be reloaded after you change their rules. This is done by passing a reload option to the firewall, and the exact command varies between Linux distributions. ufw doesn't need to be reloaded for rules added on the command line, but needs a commit if rules are added in its rule files.
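As a minimal sketch of the port list above on a host that uses ufw, the helper below only prints the rules so they can be reviewed first. The function name swarm_ufw_rules is hypothetical, and the ESP rule for encrypted overlays is left out. A ufw rule without a protocol, such as the ones in the question's ufw status output, already covers both TCP and UDP, so the explicit form here is just narrower.

```shell
# Print (do not run) the ufw commands for the swarm ports listed above.
# swarm_ufw_rules is a hypothetical helper name, not part of Docker or ufw.
swarm_ufw_rules() {
  echo "ufw allow 2377/tcp"   # cluster management (managers listen on this port)
  echo "ufw allow 7946/tcp"   # communication among nodes
  echo "ufw allow 7946/udp"   # communication among nodes
  echo "ufw allow 4789/udp"   # overlay network (VXLAN) traffic
}

swarm_ufw_rules
```

After reviewing the output, it can be applied on each node with `swarm_ufw_rules | sudo sh`, followed by `sudo ufw status` to confirm the result.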

Extra steps to be followed in firewall

Apart from allowing the ports above, you may need to allow the IP ranges of the docker0, docker_gwbridge and br-123456 interfaces with a netmask of 16. Otherwise service discovery will not work within the same host machine. That is, if you try to connect to eureka at 192.168.0.12 from a container on that same host 192.168.0.12, it will not resolve because the firewall blocks the traffic. See "NO ROUTE TO HOST network request from container to host-ip:port published from other container".

Java

Sometimes Java behaves oddly and throws java.net.MalformedURLException or similar exceptions. I have run into such a case myself, along with its solution: ping resolved the name properly, but Java RMI threw an error. In such cases you can define your own custom alias when you attach to a user-defined network.

Docker service discovery

By default, you can resolve a service by its container name. You can also resolve it as <container_name>.<network_name>. Of course, you can define an alias as well, and resolve it as <alias_name>.<network_name>.
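In swarm mode the names involved are service names rather than container names. As a rough sketch for the stack in the question, the helper below lists the names that would normally be expected to resolve from inside any container attached to the same overlay network. The function name swarm_dns_names is hypothetical; the stack and service names come from the question.

```shell
# Print the DNS names expected to resolve for a stack service.
# Arguments: stack name, compose service name.
# swarm_dns_names is a hypothetical helper name.
swarm_dns_names() {
  stack="$1"
  service="$2"
  echo "$service"                    # compose service name, added as a network alias
  echo "${stack}_${service}"         # full swarm service name (see docker service ls)
  echo "tasks.${stack}_${service}"   # resolves to the IPs of the individual tasks
}

swarm_dns_names springcloud-demo eureka
```

To test one of these names, run a lookup from inside a running task, for example `docker exec -it <container-id> nslookup eureka`, assuming the image ships an nslookup binary.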

Solution

So you should create a user-defined overlay network after joining the swarm, and then deploy the services. In the services, reference it as an external network, and make the firewall changes described above.

If you want to allow external containers to connect to the network you should make the network attachable.
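The steps above can be sketched as follows. The network name springcloud-net and the helper name external_network_stanza are assumptions; the compose file and stack names are taken from the question. The helper prints a compose stanza that points the stack's default network at the pre-created overlay network, so the individual services need no networks key of their own.

```shell
# Hypothetical network name; any name not already used in the swarm works.
NETWORK="springcloud-net"

# Emit the compose stanza that makes every service in the file join the
# pre-created network instead of a stack-scoped default network.
external_network_stanza() {
  cat <<EOF
networks:
  default:
    external:
      name: $1
EOF
}

external_network_stanza "$NETWORK"

# On a manager node, after appending the stanza to all-in-one.yml:
#   docker network create --driver overlay --attachable "$NETWORK"
#   docker stack deploy -c all-in-one.yml springcloud-demo
```

The --attachable flag is only needed if standalone containers should also be able to join the network.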

Since you haven't provided enough details about what is happening with the third server, I assume that you are trying to deploy a container there, which the Docker overlay network denies because the network is not attachable.

Ganister answered 11/10, 2018 at 16:33 Comment(4)
Thanks for taking the time to put together such a well thought out and informative answer. Very helpful. - Fraktur
Thank you, very complete answer. Still, when I run a container through Compose, it adds "search fritz.box" at the top of /etc/resolv.conf. This happens every time, no matter what I put into the YAML under networks (default, or networks: [name]: external: true). It is really strange. I do not know what is updating that file, surely not docker build... or maybe it really comes from my host's configuration, but why so much non-determinism? - Lylelyles
It should be coming from your host's /etc/resolv.conf file unless you have added dns entries to your docker-compose.yml file. Check it once. - Ganister
I found more details about resolv.conf and systemd-resolved in unix.stackexchange.com/questions/612416/… I know little as of now, but somehow the resolv.conf copied into running containers does not pick up some of my later changes, even after restarting the dockerd daemon. - Lylelyles

You need to create a network for the services, like this:

version: '3'
services:
  eureka:
    image: demo-eurekaserver
    networks:
      - main
    ports:
      - "8761:8761"

  web:
    image: demo-web
    networks:
      - main
    environment:
      - EUREKA_SERVER_ADDRESS=http://eureka:8761/eureka

  zuul:
    image: demo-zuul
    networks:
      - main
    environment:
      - EUREKA_SERVER_ADDRESS=http://eureka:8761/eureka
    ports:
      - "8762:8762"

  bookservice:
    image: demo-bookservice
    networks:
      - main
    environment:
      - EUREKA_SERVER_ADDRESS=http://eureka:8761/eureka

networks:
  main:
    driver: overlay
    attachable: true

The attachable: true setting lets you connect to this network from outside the stack, for example from another compose file. You can remove it if you do not need that.

Tetreault answered 9/10, 2018 at 17:9 Comment(0)

I finally found the answer. The problem was that I was not rebooting the host machines after adding the firewall exceptions.

I also updated the version of the compose file to "3.3" because, according to the docs, "endpoint_mode: dnsrr" is only available from version 3.3.

With this change in place I was able to get it working.

Thanks to all for taking the time to look at my problem to try to resolve it.

Fraktur answered 10/10, 2018 at 10:52 Comment(5)
Is it essential to reboot after configuring the firewall? - Chabazite
Possibly not. I rebooted when I saw this message after enabling the firewall rules: "Firewall is active and enabled on system startup". But if a reboot is not required, then the only change I made that seems to have made a difference was setting the compose file version to 3.3. - Fraktur
Maybe you should double-check that before you accept your own answer. If the reboot is not what fixed it, this would mislead others. Since dnsrr is not the default mode for routing traffic inside a Docker swarm cluster, you may want to check some docs here. It helped me a lot. - Chabazite
Thanks Light.G, I took your advice and unaccepted this answer. Those links are really useful. - Fraktur
You should have mentioned that you are using endpoint mode. It would have resulted in a faster resolution. - Ganister

I had the same problem on Amazon AWS.

My problem was in the Docker ingress network. I solved it by opening the ports on my hosts and in the VPC.

https://docs.docker.com/network/overlay/#customize-the-docker_gwbridge-interface

You need the following ports open to traffic to and from each Docker host participating on an overlay network:

TCP port 2377 for cluster management communications

TCP and UDP port 7946 for communication among nodes

UDP port 4789 for overlay network traffic
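To confirm from another node that one of the TCP ports is actually reachable, a small check like the one below can stand in for telnet. It is only a sketch: the function name tcp_port_state is hypothetical, it needs bash and coreutils timeout, and it cannot test the UDP ports. Keep in mind that port 2377 is only listened on by manager nodes, so a refused connection to 2377 on a worker is expected.

```shell
# Report whether a TCP port accepts connections, using bash's /dev/tcp
# pseudo-device. Arguments: host, port.
# tcp_port_state is a hypothetical helper name.
tcp_port_state() {
  # Inside the bash -c script, $0 is the host and $1 is the port.
  if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null; then
    echo open
  else
    echo unreachable   # refused, filtered, or timed out
  fi
}
```

For example, `tcp_port_state xx.xx.xx.xx 7946` run from the manager against a worker checks the node communication port over TCP.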

Burgle answered 5/10, 2018 at 15:45 Comment(3)
Thanks Tainha, but I've triple-checked all the ports on every node and they're definitely open. For some reason, containers on worker nodes can't resolve the hostnames of the other containers: "Connect to eureka:8761 timed out". - Fraktur
The only one that resolves the name is the manager. Bring up your eureka service and run this on your manager: ping eureka. If the name does not resolve, you have a problem in the Docker network. Are you using AWS? - Burgle
Yes, I suspect there's an issue in the overlay network. I'm not using AWS, it's a private network. - Fraktur
