Why does my Kubernetes Service only work sometimes on EKS?

Asked 3/8, 2018 at 12:23 Answered 18/9, 2019 at 23:36

kubernetes amazon-eks kubernetes-service

In some cases, we have Services that get no response when trying to access them. Eg Chrome shows ERR_EMPTY_RESPONSE, and occasionally we get other errors as well, like 408, which I'm fairly sure is returned from the ELB, not our application itself.

After a long involved investigation, including ssh'ing into the nodes themselves, experimenting with load balancers and more, we are still unsure at which layer the problem actually exists: either in Kubernetes itself, or in the backing services from Amazon EKS (ELB or otherwise)

It seems that only the instance (data) port of the node is the one that has the issue. The problems seems to come and go intermittently, which makes us believe it is not something obvious in our kubernetes manifest or docker configurations, but rather something else in the underlying infrastructure. Sometimes the service & pod will be working, but come back and the morning it will be broken. This leads us to believe that the issue stems from a redistribution of the pods in kubernetes, possibly triggered by something in AWS (load balancer changing, auto-scaling group changes, etc) or something in kubernetes itself when it redistributes pods for other reasons.
In all cases we have seen, the health check port continues to work without issue, which is why kubernetes and aws both thing that everything is ok and do not report any failures.
We have seen some pods on a node work, while others do not on that same node.
We have verified kube-proxy is running and that the iptables-save output is the "same" between two pods that are working. (the same meaning that everything that is not unique, like ip addresses and ports are the same, and consistent with what they should be relative to each other). (we used these instructions to help with these instructions: https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/#is-the-kube-proxy-working
From ssh on the node itself, for a pod that is failing, we CAN access the pod (ie the application itself) via all possible ip/ports that are expected.
- the 10. address of the node itself, on the instance data port.
- the 10. address of the pod (docker container) on the application port.
- the 172. address of the ??? on the application port (we are not sure what that ip is, or how the ip route gets to it, as it is a different subnet than the 172 address of the docker0 interface).
From ssh on another node, for a pod that is failing, we cannot access the failing pod on any ports (ERR_EMPTY_RESPONSE). This seems to be the same behaviour as the service/load balancer.

What else could cause behaviour like this?

Overstay answered 3/8, 2018 at 12:23 Comment(5)

we CANNOT access the failing pod on any ports what exactly does that mean? The same "empty response", connection refused, connection timeout, other? – Narayan 4/8, 2018 at 4:28

@MatthewLDaniel - I've updated with a bit more detail. – Overstay 8/8, 2018 at 20:19

Yikes that's a tough one; can you access the failing Pod from within the cluster (that is, from another Pod inside the SDN)? – Narayan 9/8, 2018 at 1:51

seems like an iffy product – Stokehold 14/8, 2018 at 20:44

Any update on this? I been doing tests in EKS and experience timeouts on Kube-DNS lookups. And "Address not found" when going straight to the Load Balance's Service IP. Is AWS EKS load balancing really that flaky? – Menashem 13/9, 2018 at 20:13

After much investigation, we were fighting a number of issues: * Our application didn't always behave the way we were expecting. Always check that first. * In our Kubernetes Service manifest, we had set the externalTrafficPolicy: Local, which probably should work, but was causing us problems. (This was with using Classic Load Balancer) service.beta.kubernetes.io/aws-load-balancer-type: "clb". So if you have problems with CLB, either remove the externalTrafficPolicy or explicitly set it to the default "Cluster" value.

So our manifest is now: kind: Service apiVersion: v1 metadata: name: apollo-service annotations: service.beta.kubernetes.io/aws-load-balancer-type: "clb" service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:REDACTED" service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443" service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http" spec: externalTrafficPolicy: Cluster selector: app: apollo ports: - name: http protocol: TCP port: 80 targetPort: 80 - name: https protocol: TCP port: 443 targetPort: 80 type: LoadBalancer

Overstay answered 20/9, 2018 at 14:37 Comment(1)

externalTrafficPolicy: Local isn't supported on AWS github.com/kubernetes/kubernetes/issues/80579 Describes the problem in more detail. I'm also going to post a more detailed answer about it on this question #32113422, because it's a known issues that's not well documented. – Misusage 14/8, 2019 at 16:43

adding

service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"

Fixed this for me

Counterscarp answered 18/9, 2019 at 23:36 Comment(0)

Recommended topics

Hot tags