In some cases, requests to our Services get no response. For example, Chrome shows ERR_EMPTY_RESPONSE, and occasionally we see other errors as well, such as 408, which I'm fairly sure is returned by the ELB rather than by our application itself.
After a long, involved investigation (including SSHing into the nodes themselves, experimenting with load balancers, and more), we are still unsure at which layer the problem actually lies: in Kubernetes itself, or in the AWS services backing EKS (the ELB or otherwise).
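For reference, this is roughly how we reproduce the failure from outside the cluster. The hostname and port are placeholders, not our real values:

```bash
# Hypothetical ELB hostname/port, just to illustrate the failure mode.
# A working Service returns the application's response; a failing one
# closes the connection with no data (what Chrome reports as ERR_EMPTY_RESPONSE).
curl -v http://my-service-1234567890.us-east-1.elb.amazonaws.com:80/

# Typical output in the failing case:
#   * Empty reply from server
#   curl: (52) Empty reply from server
```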
- Only the instance (data) port of the node appears to be affected. The problem comes and goes intermittently, which makes us believe it is not something obvious in our Kubernetes manifests or Docker configuration, but rather something else in the underlying infrastructure. Sometimes the Service and pod will be working, and when we come back in the morning it will be broken. This leads us to believe the issue stems from a redistribution of the pods in Kubernetes, possibly triggered by something in AWS (the load balancer changing, auto-scaling group changes, etc.) or by Kubernetes itself redistributing pods for other reasons. (See the rescheduling check sketched after this list.)
- In all cases we have seen, the health check port continues to work without issue, which is why both Kubernetes and AWS think that everything is OK and do not report any failures.
- We have seen some pods on a node work while other pods on that same node do not.
- We have verified that kube-proxy is running and that the iptables-save output is the "same" between two pods that are working ("same" meaning that everything that is not inherently unique, such as IP addresses and ports, matches and is consistent with what it should be relative to the other pod). We used these instructions to help: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#is-the-kube-proxy-working (a normalization/diff sketch is included after this list).
- From SSH on the node itself, for a pod that is failing, we CAN access the pod (i.e. the application itself) via all of the expected IPs/ports:
- the 10. address of the node itself, on the instance data port.
- the 10. address of the pod (docker container) on the application port.
- the 172. address of the ??? on the application port (we are not sure what that IP is, or how the IP route reaches it, as it is on a different subnet from the 172. address of the docker0 interface).
- From SSH on another node, we cannot access the failing pod on any of those IPs/ports (empty reply, i.e. ERR_EMPTY_RESPONSE). This appears to be the same behaviour we see through the Service/load balancer. (Both sets of checks are sketched after this list.)
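To check the "pods were redistributed overnight" theory, we have been comparing pod placement and recent events along these lines (the namespace and pod names are placeholders):

```bash
# Which node is each pod on, and when did it start?
kubectl get pods -o wide -n my-namespace

# Recent scheduling / node events, oldest first
kubectl get events -n my-namespace --sort-by=.metadata.creationTimestamp

# Restart counts and last state for a specific pod
kubectl describe pod my-pod-abc123 -n my-namespace
```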
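For the kube-proxy / iptables comparison mentioned above, this is roughly how we normalized the iptables-save output before diffing it between nodes. It is only a sketch: the sed expressions just strip out the values that are expected to differ (IPs, ports, chain hashes) so the remaining rule structure can be compared.

```bash
# Run on each node, then diff the resulting files on one machine.
normalize_kube_iptables() {
  sudo iptables-save -t nat \
    | grep -E 'KUBE-(SERVICES|SVC|SEP|NODEPORTS)' \
    | sed -E 's/([0-9]{1,3}\.){3}[0-9]{1,3}/IP/g;
              s/--dport [0-9]+/--dport PORT/g;
              s/KUBE-(SVC|SEP)-[A-Z0-9]+/KUBE-\1-HASH/g' \
    | sort
}
normalize_kube_iptables > /tmp/iptables-normalized-$(hostname).txt

# diff /tmp/iptables-normalized-node-a.txt /tmp/iptables-normalized-node-b.txt
```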
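And these are the kinds of connectivity checks behind the last two bullets. The IPs and ports are placeholders for our actual node IP, pod IP, NodePort, and application port:

```bash
# From the node hosting the failing pod: all three of these respond.
curl -v http://10.0.1.23:31234/   # node's own 10. address, instance (data) port
curl -v http://10.0.1.57:8080/    # pod's 10. address, application port
curl -v http://172.17.5.4:8080/   # the unexplained 172. address, application port

# From SSH on a different node: the same requests return no data,
# matching what the browser / ELB sees.
curl -v http://10.0.1.23:31234/
#   curl: (52) Empty reply from server
```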
What else could cause behaviour like this?