What Azure Kubernetes (AKS) 'Time-out' happens to disconnect connections in/out of a Pod in my Cluster?

I have a working Cluster on Azure AKS with services that all respond behind a Helm-installed NGINX Ingress. This ended up being Azure specific.

My question is: Why does my connection to the services / pods in this cluster periodically get severed (apparently by some sort of idle timeout), and why does that connection severing appear to also coincide with my Az AKS Browse UI connection getting cut?

This is an effort to get a final answer on what exactly triggers the time-out that causes the local 'Browse' proxy UI to disconnect from my Cluster (more background on why I am asking follows below).

When working with Azure AKS from the Az CLI you can launch the local Browse UI from the terminal using:

az aks browse --resource-group <resource-group> --name <cluster-name>

This works fine and pops open a browser window that looks something like this (yay):

[screenshot: the Kubernetes dashboard UI opened by the proxy]

In your terminal you will see something along the lines of:

  Proxy running on http://127.0.0.1:8001/
  Press CTRL+C to close the tunnel...
  Forwarding from 127.0.0.1:8001 -> 9090
  Forwarding from [::1]:8001 -> 9090
  Handling connection for 8001
  Handling connection for 8001
  Handling connection for 8001

If you leave the connection to your Cluster idle for a few minutes (i.e. you don't interact with the UI) you should see the following printed, indicating that the connection has timed out:

E0605 13:39:51.940659 5704 portforward.go:178] lost connection to pod

One thing I still don't understand is whether OTHER activity inside the Cluster can prolong this timeout. Regardless, once you see the above you are essentially at the same place I am, which means we can talk about the fact that all of my other connections OUT from pods in that Cluster also appear to have been closed by whatever timeout process is responsible for cutting ties with the AKS Browse UI.

So what's the issue?

The reason this is a problem for me is that I have a Service running a Ghost blog pod which connects to a remote MySQL database using an npm package called Knex. As it happens, the newer versions of Knex have a bug (which has yet to be addressed) whereby, if the connection between the Knex client and a remote DB server is cut and needs to be restored, it doesn't re-connect and requests just load forever.

NGINX Error 503 / Gateway Time-out

In my situation that resulted in the NGINX Ingress giving me an Error 503 / Gateway Time-out. This was because Ghost wasn't responding after the idle timeout cut the Knex connection, and Knex doesn't restore the broken connection to the server properly.

Fine. I rolled back Knex and everything works great.

But why the heck are my pods' connections to my Database being severed in the first place?

Hence this question, in the hope of saving some future person days of troubleshooting phantom issues that trace back to Kubernetes (maybe Azure specific, maybe not) cutting connections after a service / pod has been idle for some time.

Subsistence answered 5/6, 2018 at 18:13 Comment(2)
Did you happen to see this GitHub issue? github.com/Azure/AKS/issues/285 It also references an issue and a possible workaround for the dashboard timeout. – Christenachristendom
@Christenachristendom Yeah, I saw that one. The workaround would probably solve the issue, but my question is what mechanic disconnects all connections in and out of the Cluster (not just regarding the timeout of the Browse UI). It's annoying to reconnect to the Browse UI, but the issue of connections being severed between my node app (Ghost blog) and my database (which may not end up being related to the Browse UI timeout) is really what I am trying to figure out. – Subsistence

Short Answer:

Azure AKS automatically deploys an Azure Load Balancer (with a public IP address) when you add a new ingress (NGINX / Traefik... ANY Ingress). That Load Balancer is configured as a 'Basic' Azure LB, which has a 4-minute idle connection timeout.

That idle timeout is both standard AND required (although you MAY be able to modify it, see here: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-idle-timeout). That being said, there is no way to ELIMINATE it entirely for traffic heading externally OUT from the Load Balancer IP; the longest duration currently supported is 30 minutes.
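
For completeness, if you do want to stretch the timeout rather than leave it at 4 minutes, the Azure cloud provider exposes it as a Service annotation. The manifest below is only a sketch (not our actual config): it assumes the Service in question is the LoadBalancer-type Service created for the ingress controller, and you should double-check the annotation name and allowed range against the current Azure docs. Note also, per the comment thread below, that the documented setting may only apply to inbound connections through the LB, not traffic leaving the cluster.

# Sketch: raising the Azure LB TCP idle timeout on the Service that fronts
# the ingress controller. Names and labels are placeholders, not our setup.
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-controller
  annotations:
    # Value is in minutes; Azure accepts 4-30 (assumption: verify in the docs).
    service.beta.kubernetes.io/azure-load-balancer-tcp-idle-timeout: "30"
spec:
  type: LoadBalancer
  selector:
    app: nginx-ingress-controller
  ports:
    - name: http
      port: 80
      targetPort: 80
    - name: https
      port: 443
      targetPort: 443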

There is no native Azure way to get around an idle connection being cut.

So, as per the original question, the best way (I feel) to handle this is to leave the timeout at 4 minutes (since it has to exist anyway) and then set up your infrastructure to disconnect idle connections gracefully before they hit the Load Balancer timeout.

Our Solutions

For our Ghost blog (which hits a remote MySQL database) I was able to roll back Knex as mentioned above, which made the Ghost process able to handle the DB disconnect / reconnect scenario.

What about Rails?

Yep. Same problem.

For a separate Rails-based app we also run on AKS, which connects to a remote Postgres DB (not on Azure), we ended up implementing PgBouncer (https://github.com/pgbouncer/pgbouncer) as an additional container in our Cluster via the awesome directions found here: https://github.com/edoburu/docker-pgbouncer/tree/master/examples/kubernetes/singleuser

Generally, anyone attempting to access a remote database FROM AKS is probably going to need to implement an intermediary connection pooling solution. The pooling service sits in the middle (PgBouncer for us) and keeps track of how long a connection has been idle so that your worker processes don't need to care about that.

If a connection starts to approach the Load Balancer timeout, the connection pooling service throws out the old connection and opens a fresh one (resetting the timer). That way, when your client sends data down the pipe, it lands on your database server as anticipated.
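
To give a sense of the shape of that setup (the linked example has the full details), here is a rough sketch of PgBouncer running as a sidecar container next to the app. The environment variable names, hostnames, and the Secret below are my own placeholders / assumptions rather than a copy of our manifest, so check the docker-pgbouncer README before using them. The key idea is that the app talks to 127.0.0.1:6432 and PgBouncer owns (and recycles) the long-lived connection to the remote Postgres server.

# Sketch only: PgBouncer as a sidecar so the app never holds an idle
# connection open across the Azure LB. All names below are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rails-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: rails-app
  template:
    metadata:
      labels:
        app: rails-app
    spec:
      containers:
        - name: rails
          image: registry.example.com/rails-app:latest   # placeholder image
          env:
            # The app points at the local PgBouncer sidecar, not the remote DB.
            - name: DATABASE_URL
              value: postgres://app_user:app_password@127.0.0.1:6432/app_db
        - name: pgbouncer
          image: edoburu/pgbouncer:latest
          ports:
            - containerPort: 6432
          env:
            # Env var names are assumptions based on my reading of the
            # docker-pgbouncer README; verify against the image docs.
            - name: DB_HOST
              value: my-remote-postgres.example.com        # placeholder DB host
            - name: DB_USER
              value: app_user
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: pgbouncer-credentials              # placeholder Secret
                  key: password
            - name: POOL_MODE
              value: transaction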

In closing

This was an INSANELY frustrating bug / case to track down. We burned at least two dev-ops days figuring out the first solution, and even KNOWING it was probably the same issue, we burned another two days this time around.

Even elongating the timer beyond the 4-minute default wouldn't really help, since that would just make the problem more intermittent and harder to troubleshoot. I just hope that anyone who has trouble connecting from Azure AKS / Kubernetes to a remote DB is better at Googling than I am and can save themselves some pain.

Thanks to MSFT Support (Kris you are the best) for the hint on the LB timer and to the dude who put together PGbouncer in a container so I didn't have to reinvent the wheel.

Subsistence answered 19/7, 2018 at 22:27 Comment(3)
Can we really increase the TCP idle timeout setting for outbound traffic on the Azure Load Balancer (Basic) up to 30 minutes? Because this link learn.microsoft.com/en-us/azure/load-balancer/… suggests that this setting works for inbound connections only. – Accompany
Not 100% sure whether increasing it works; I just know that's what support told me. Since for me it made no difference (i.e. I needed a solution to handle DB connection restore / cycling), it didn't matter whether it was 4 minutes or 30 minutes, so I just took support's word for it and moved on with a connection pooling solution to handle the problem (i.e. expect the connection to get severed, since there is no way to omit the timeout entirely). – Subsistence
Azure is a very irritating platform. – Scarcely
