Some 502 errors in GCP HTTP Load Balancing
M

4

28

Our load balancer is returning 502 errors for some requests. It is only a very low percentage of the total: we have around 36,000 requests per hour and about 40 errors per hour, so roughly 0.1% of requests return an error.
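
As a quick sanity check on that rate, the arithmetic can be sketched in a few lines of Python. The variable names are only illustrative; the two counts are the ones quoted above.

```python
# Figures quoted in the question: requests and 502 errors per hour.
requests_per_hour = 36_000
errors_per_hour = 40

# Share of requests that fail, expressed as a percentage.
error_rate_percent = errors_per_hour / requests_per_hour * 100

print(f"{error_rate_percent:.2f}% of requests return a 502")  # prints 0.11%
```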

The instances are healthy when the error occurs, and we have added this firewall rule for the load balancer: 130.211.0.0/22 tcp:1-5000, applied to all targets.

It is not a very serious problem because the application tolerates such errors, but I would like to know why they occur.

Any help will be appreciated.

Marrakech answered 23/12, 2016 at 16:51 Comment(0)
M
18

It seems that there is no easy solution for this.

As Mike Fotinakis explains in this blog (thank you for this info JasonG :)):

It turns out that there is a race condition between the Google Cloud HTTP(S) Load Balancer and NGINX’s default keep-alive timeout of 65 seconds. The NGINX timeout might be reached at the same time the load balancer tries to re-use the connection for another HTTP request, which breaks the connection and results in a 502 Bad Gateway response from the load balancer.

In my case I'm using Apache with the mpm_prefork module. The proposed solution is to increase the connection keep-alive timeout to 650s, but this is not practical for me because each connection gets its own process (so this would be a great waste of resources).

UPDATE:
It seems that there is now some documentation about this problem on the official load balancer documentation page (search for "Timeouts and retries"): https://cloud.google.com/compute/docs/load-balancing/http/

They recommend setting the keep-alive timeout to 620 seconds in both cases (KeepAliveTimeout in Apache, keepalive_timeout in Nginx).
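
The rule behind these numbers can be sketched as a small check: the backend should keep idle connections open longer than the load balancer does, so that the load balancer is always the side that closes them. The snippet below is only an illustration in Python. The function name and constant are hypothetical, and the 600-second load balancer timeout is the value quoted from the blog post elsewhere in this thread, not something read from GCP.

```python
# Idle keep-alive timeout of the GCP HTTP(S) load balancer, in seconds.
# Assumption: 600 s, the value quoted from the blog post in this thread.
LB_KEEPALIVE_TIMEOUT_S = 600


def closes_idle_connection_first(backend_keepalive_s: int,
                                 lb_keepalive_s: int = LB_KEEPALIVE_TIMEOUT_S) -> str:
    """Return which side times out an idle connection first.

    If the backend gives up first, the load balancer may try to re-use a
    connection the backend is closing, which is the race that produces 502s.
    Equal timeouts are treated as unsafe, because the race is still possible.
    """
    if backend_keepalive_s > lb_keepalive_s:
        return "load_balancer"
    return "backend"


# Timeouts mentioned in this thread: 65 s, Google's 620 s, the blog's 650 s.
for timeout in (65, 620, 650):
    print(timeout, closes_idle_connection_first(timeout))
```

With these inputs, only the 65-second value leaves the backend as the side that closes first, which matches the behaviour described above.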

Marrakech answered 28/12, 2016 at 13:55 Comment(0)
G
12

I had an issue w/ 502s that was unexplainable after recreating a load balancer and backend config. I recreated my backend & instance group for unmanaged instances and this seemed to fix the issue for me. I wasn't able to identify any issues in my configuration in GCP :(

But I had a lot more errors, about 1 in 10. The load balancer logs will tell you what the cause is, and the docs explain the causes.

E.g. mine were:

jsonPayload: {
  statusDetails: "failed_to_pick_backend"
  @type: "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry"
}
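
If there are many entries, it can help to tally them by cause. Here is a minimal Python sketch, assuming the log entries have been exported as one JSON object per line with the same jsonPayload.statusDetails field shown above. The sample lines and the function name are made up for illustration.

```python
import json
from collections import Counter


def count_status_details(lines):
    """Tally jsonPayload.statusDetails across JSON-lines log entries."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        payload = entry.get("jsonPayload", {})
        # Entries without the field are grouped under "unknown".
        counts[payload.get("statusDetails", "unknown")] += 1
    return counts


# Hypothetical sample entries, shaped like the payload shown above.
sample = [
    '{"jsonPayload": {"statusDetails": "failed_to_pick_backend"}}',
    '{"jsonPayload": {"statusDetails": "backend_connection_closed_before_data_sent_to_client"}}',
    '{"jsonPayload": {"statusDetails": "failed_to_pick_backend"}}',
]

for cause, n in count_status_details(sample).most_common():
    print(n, cause)
```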

If you're using nginx, the errors are on POST requests, and the error is reported as "backend_connection_closed_before_data_sent_to_client", it may be fixed by changing your nginx timeouts. See this excellent blog post:

https://medium.com/perceptual-percy/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340

Goatee answered 25/12, 2016 at 21:38 Comment(6)
I'm using Apache, but yes, the errors are on POST requests and the error is "backend_connection_closed_before_data_sent_to_client". I have changed the KeepAliveTimeout configuration of Apache to 65 seconds and the problem was solved. Thank you for your help JasonG! :) - Marrakech
There seem to be fewer errors, but they are still happening. I'll check it out in a few hours. - Marrakech
I think you need the timeout to be longer than 600s. - Goatee
"To fix this race condition, set “keepalive_timeout 650;” in nginx so that your timeout is longer than the 600 second timeout in the GCP HTTP(S) Load Balancer. This causes the load balancer to be the side that closes idle connections, rather than nginx, which fixes the race condition! (This is not a 100% accurate description for how closing TCP connections works, but it’s fair enough for here)." - Goatee
In my case, it is IIS 10.0, and there are no details about IIS in the Google documentation. I had to raise a ticket with the Google Cloud team. The details are in the following Stack Overflow question and answer: https://mcmap.net/q/504504/-gcp-load-balancer-502-server-error-and-quot-backend_connection_closed_before_data_sent_to_client-quot-iis-10 - Spooky
It's 2019 and this is exactly what's happening on our App Engine Flex instance. - Singles
D
0

Sometimes you can get unexplained 502 errors because your autoscaling instance group creates instances using the EVEN distribution shape. After I changed it to the BALANCED shape, 99% of the errors went away. You can read about it here: https://cloud.google.com/compute/docs/instance-groups/regional-mig-distribution-shape

Disruption answered 6/5, 2023 at 12:20 Comment(0)
M
0

Turns out that if you have multiple ports in your "Backend services", this error shows up as described (intermittently, with no meaningful feedback anywhere in any logs).

When you create or edit your backend services and link them to instance groups, ensure that you have only one port set up there.

If you have 443 and 80, the error described in the original question will manifest under those circumstances as well.

Marucci answered 18/10, 2023 at 20:48 Comment(0)
