We've been experiencing a long-standing networking issue. In short, one container cannot ping (or ssh) another. Does anybody have an extra moment to think along with me?
Our setup:
- Docker CE 18.06.03 (while trying to fix the issue, we've upgraded from 17.03, but it has not helped)
- Swarm Classic (Standalone) 1.2.9
- Consul as a Swarm backend, running with members on five nodes
- Seven nodes in total, six of which host containers
- Each container is connected to an overlay network when it is started
What we've tried so far:
This issue has largely stumped us. We've spent a lot of time on it and done the basic troubleshooting plus some more advanced troubleshooting (happy to elaborate). I don't expect that we've exhausted our options, though, so please don't hesitate to suggest anything you think might work. The problem is inconsistent (it affects different images and different nodes), intermittent, and long-standing (several months).
We've made two changes so far. The first was a workaround for MAC address assignment (explained here: https://github.com/docker/libnetwork/pull/2380; the actual workaround: https://github.com/systemd/systemd/issues/3374#issuecomment-452718898), which improved the situation and removed the MAC address assignment errors from the logs. The second was an upgrade to get this fix (https://github.com/docker/libnetwork/pull/1935), which deals with IP reuse. That also reduced the problem (at the time, no containers could communicate). I've also run through some basic tests using the netshoot container (let me know if you want more info on that).
We have a workaround for a given container that is broken: we delete the Consul data for this container and then stop and restart it. After that, the container can often ping other containers, but not always. From what I can tell, the problem does not lie in the Consul data per se; the workaround seems to help because Docker/Swarm resets several network configurations when the container is started (I can say more if this triggers a thought for anybody reading).
Specific question:
It seems like there's a window of time during which this can be worse. It's not necessarily tied to starting several containers at once, but there's a somewhat clear pattern: during some windows of time, containers do not get configured properly to communicate with each other. What troubleshooting steps come to mind for you?
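One way to pin down those windows would be to probe continuously and then group the failures by time. The sketch below is hypothetical and not something we run today: `failure_windows` and the five-minute `max_gap` are assumptions made up for illustration, and it expects (timestamp, ok) pairs collected by whatever probe is in use.

```python
from datetime import datetime, timedelta


def failure_windows(samples, max_gap=timedelta(minutes=5)):
    """Group failed probes into time windows.

    samples: iterable of (timestamp, ok) pairs.
    Failures that are at most max_gap apart, with no successful probe
    between them, are counted as one window.
    Returns a list of (start, end, failed_count) tuples.
    """
    windows = []
    current = None  # [start, end, count] of the window being built
    for ts, ok in sorted(samples):
        if ok:
            # A successful probe closes any open window.
            if current:
                windows.append(tuple(current))
                current = None
            continue
        if current and ts - current[1] <= max_gap:
            # Extend the open window with this failure.
            current[1] = ts
            current[2] += 1
        else:
            # Too far from the previous failure: start a new window.
            if current:
                windows.append(tuple(current))
            current = [ts, ts, 1]
    if current:
        windows.append(tuple(current))
    return windows
```

Comparing the resulting windows against container start times and daemon logs on each node might show whether the bad periods line up with anything specific.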
The content below is the output from trying to ping one container (82afb0dccbcc) from two other containers. It fails at first, but then is successful.
The first time I try to ping the container, at 2019-12-10T23:57:52+00:00:
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
82afb0dccbcc: user___92397089 crccheck/hello-world
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
PING 82afb0dccbcc (172.24.0.165) 56(84) bytes of data.

--- 82afb0dccbcc ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3033ms

PING 82afb0dccbcc (172.24.0.165) 56(84) bytes of data.
64 bytes from user___92397089.wharf (172.24.0.165): icmp_seq=2 ttl=64 time=0.083 ms
64 bytes from user___92397089.wharf (172.24.0.165): icmp_seq=3 ttl=64 time=0.072 ms
64 bytes from user___92397089.wharf (172.24.0.165): icmp_seq=4 ttl=64 time=0.073 ms

--- 82afb0dccbcc ping statistics ---
4 packets transmitted, 3 received, 25% packet loss, time 3023ms
rtt min/avg/max/mdev = 0.072/0.076/0.083/0.005 ms
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
In this first ping test, above, we note that the packet loss from the first container is 100% and from the second container, it is 25%.
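To track these numbers over repeated runs rather than reading them by eye, a small parser along these lines might help. This is only a sketch: it assumes the summary-line format that ping prints in the output above, and the function name `parse_ping_summaries` is made up for illustration.

```python
import re

# Matches the summary line printed at the end of each ping run, e.g.
# "4 packets transmitted, 3 received, 25% packet loss, time 3023ms".
# The optional "+N errors, " part appears in some failed runs.
SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) received, "
    r"(?:\+\d+ errors, )?(\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summaries(output):
    """Return one (transmitted, received, loss_percent) tuple per ping run."""
    # Captured terminal output may contain carriage returns; drop them first.
    cleaned = output.replace("\r", "")
    return [
        (int(tx), int(rx), float(loss))
        for tx, rx, loss in SUMMARY_RE.findall(cleaned)
    ]
```

Feeding it the captured output of both pings yields one tuple per source container, which can then be logged with a timestamp and compared across runs.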
A few minutes later (2019-12-10T23:57:52+00:00), however, 82afb0dccbcc can be successfully pinged from both containers:
82afb0dccbcc: user___92397089 crccheck/hello-world
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
ping from ansible-provisioner:
PING 82afb0dccbcc (172.24.0.165) 56(84) bytes of data.
64 bytes from user___92397089.wharf (172.24.0.165): icmp_seq=1 ttl=64 time=0.056 ms
64 bytes from user___92397089.wharf (172.24.0.165): icmp_seq=2 ttl=64 time=0.073 ms
64 bytes from user___92397089.wharf (172.24.0.165): icmp_seq=3 ttl=64 time=0.077 ms
64 bytes from user___92397089.wharf (172.24.0.165): icmp_seq=4 ttl=64 time=0.087 ms

--- 82afb0dccbcc ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3063ms
rtt min/avg/max/mdev = 0.056/0.073/0.087/0.012 ms
ping from ansible_container:
PING 82afb0dccbcc (172.24.0.165) 56(84) bytes of data.
64 bytes from user___92397089.wharf (172.24.0.165): icmp_seq=1 ttl=64 time=0.055 ms
64 bytes from user___92397089.wharf (172.24.0.165): icmp_seq=2 ttl=64 time=0.055 ms
64 bytes from user___92397089.wharf (172.24.0.165): icmp_seq=3 ttl=64 time=0.060 ms
64 bytes from user___92397089.wharf (172.24.0.165): icmp_seq=4 ttl=64 time=0.085 ms

--- 82afb0dccbcc ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3062ms
rtt min/avg/max/mdev = 0.055/0.063/0.085/0.015 ms
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
"ping": executable file not found in $PATH"
? It seems more of a filesystem issue, in which the ping executable is not mounted properly, than a networking issue. What do you think? – Correlative