Rolling restarts are causing our App Engine app to go offline. Is there a way to change the config to prevent that from happening?

About once a week our flexible app engine node app goes offline and the following line appears in the logs: Restarting batch of VMs for version 20181008t134234 as part of rolling restart. We have our app set to automatic scaling with the following settings:

runtime: nodejs
env: flex
beta_settings:
  cloud_sql_instances: tuzag-v2:us-east4:tuzag-db
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 3
liveness_check:
  path: "/"
  check_interval_sec: 30
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
readiness_check:
  path: "/"
  check_interval_sec: 15
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
resources:
  cpu: 1
  memory_gb: 1
  disk_size_gb: 10

I understand the rolling restarts of GCP/GAE, but I'm confused as to why Google isn't spinning up another VM before taking our primary one offline. Do we have to run with a minimum of 2 instances to prevent this from happening? Is there a way I can configure my app.yaml to make sure another instance is spun up before it reboots the only running instance? After the reboot finishes, everything comes back online fine, but there's still 10 minutes of downtime, which isn't acceptable, especially considering we can't control when it reboots.

Haerle answered 29/10, 2018 at 14:43 Comment(2)
I've got the same issue. My app's set to min_num_instances: 1, max_num_instances: 3, and the majority of the time it comfortably runs on 1. Maddeningly, when GAE restarts the instance and there's only 1 running, it doesn't bother spinning up a new one beforehand, taking the service offline. Did you ever find a solution other than min_num_instances: 2?Sizemore
@Sizemore Unfortunately not. We face about 15-60 seconds of downtime a week because of this. This is one of several times I've learned that App Engine is really targeted at larger projects. For smaller ones, I'd use Compute Engine or head to another host like DigitalOcean droplets or AWS EC2 and just manage the servers manually.Haerle

It is expected behaviour that Flexible environment instances are restarted on a weekly basis. Provided that health checks are properly configured and are not the issue, the recommendation is, indeed, to set up a minimum of two instances.
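For example (assuming the app.yaml from the question, with everything else unchanged), that just means raising the minimum:

automatic_scaling:
  min_num_instances: 2
  max_num_instances: 3

With two instances, the rolling restart takes the VMs down one batch at a time, so one instance can keep serving while the other reboots.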

There is no alternative functionality in App Engine Flex, of which I am aware, that raises a new instance to avoid downtime from the weekly restart. You could run directly on Google Compute Engine instead of App Engine and manage updates and maintenance yourself; perhaps that would suit your purpose better.

Parishioner answered 29/4, 2019 at 12:29 Comment(1)
Alextru, sorry for never responding to this. Thanks for your feedback. We ended up raising our minimum number of instances to 2 for any production project on App Engine. Crazy that we pay roughly $75/mo for an App Engine project with one instance and that 60 seconds of downtime at an uncontrollable time is the expected behavior. We no longer run into the issue when we host the project on a VM, whether with Compute Engine, DigitalOcean droplets, or AWS EC2Haerle

Are you just guessing this based on that num instances graph in the app engine dashboard? Or is your app engine project actually unresponsive during that time?

You could use cron to hit it every 5 minutes to see if it's responsive.
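For example, a minimal sketch using App Engine's own cron service (assuming the default service; any cheap endpoint works as the target):

cron:
- description: uptime probe
  url: /
  schedule: every 5 minutes

The cron requests show up in the request logs, so a run of failures there would confirm the outage is real rather than a graph artifact.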

Does this issue persist if you change cool_down_period_sec & target_utilization back to their defaults?
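For reference, the documented defaults look like this (a sketch, worth verifying against the current docs):

automatic_scaling:
  cool_down_period_sec: 120
  cpu_utilization:
    target_utilization: 0.5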

If your service is truly down during that time, maybe you should implement a request handler for liveness checks: https://cloud.google.com/appengine/docs/flexible/python/reference/app-yaml#updated_health_checks

Their default polling config would have GAE detect a dead instance and launch a replacement within a couple of minutes.
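A minimal sketch of such a health-check handler, assuming the Node app uses Express (the question only says it's a Node app), matching the path: "/" configured in the question's app.yaml:

const express = require('express');
const app = express();

// Health-check handler: respond 200 quickly so GAE's liveness and
// readiness probes succeed. A non-200 response or a timeout counts
// toward failure_threshold.
app.get('/', (req, res) => res.status(200).send('ok'));

// GAE Flex expects the app to listen on process.env.PORT (8080 by default).
const port = process.env.PORT || 8080;
app.listen(port, () => console.log(`listening on ${port}`));

Keeping the check handler free of slow dependencies (e.g. database calls) avoids spurious failures under load.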

Another thing worth double checking is how long it takes your instance to start up.

Metaphrast answered 31/10, 2018 at 1:24 Comment(5)
The project becomes unresponsive during that time and throws 500s. I'll remove the cool_down_period_sec & target_utilization but won't know if that makes an impact for another 6 days. I'll also add in a liveness check now. Will report back once a week has passed. Thanks for the inputHaerle
At 10:18:15 am EDT it started the reboot, and by 10:23:09 am it had booted back up and was responding to requests again. I would think a 5 minute boot time would be acceptable.Haerle
Still the same issue. I took out cool_down_period_sec and cpu_utilization and added both liveness_check and readiness_check, and had the same issue this morning (exactly a week since the last reboot). Really frustrating. I feel like I shouldn't have to pay for two servers because Google decides to reboot mine once a week and causes 5-10 minutes of downtime.Haerle
I updated the app.yaml in the original post, but I'm still having the same issueHaerle
To be honest, I'm a bit stumped now. This sounds like it could be a bug, but the only way to find out is to pay for Google support and have them look behind the scenes. The only other workaround I could foresee working is somehow forcing your instance to die/restart when you want it to, so that it picks up those rolling restart updates. For example, setting up some kind of CI service that automatically redeploys the current code to your service every night at 2am.Metaphrast
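A bare-bones stand-in for such a CI job, assuming a machine with an authenticated gcloud CLI and a checkout of the app (the path here is hypothetical):

# crontab entry: redeploy nightly at 2am so the restart happens on our schedule
0 2 * * * cd /home/deploy/app && gcloud app deploy app.yaml --quiet

Whether a fresh deploy actually resets the weekly restart clock is the untested assumption in this workaround.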
