Unable to export to Monitoring service because: GaxError RPC failed, caused by 3

I have a Java application on App Engine, and recently I started getting the following error:

Unable to export to Monitoring service because: GaxError RPC failed, caused by 3:One or more TimeSeries could not be written: Metrics cannot be written to gae_app. See https://cloud.google.com/monitoring/custom-metrics/creating-metrics#which-resource for a list of writable resource types.: timeSeries[0]

This happens every time right after a health check log entry:

Health checks: instance=instanceName start=2020-01-14T14:28:07+00:00 end=2020-01-14T14:28:53+00:00 total=18 unhealthy=0 healthy=18

After some time my instances are restarted, and then the same thing starts happening again.

app.yaml:

 #https://cloud.google.com/appengine/docs/flexible/java/reference/app-yaml

#General settings
runtime: java
api_version: '1.0'
env: flex
runtime_config:
  jdk: openjdk8
#service: service_name #Required if creating a service. Optional for the default service.

#https://cloud.google.com/compute/docs/machine-types
#Resource settings
resources:
  cpu: 2
  memory_gb: 6 #memory_gb = cpu * [0.9 - 6.5] - 0.4
#  disk_size_gb: 10 #default

##Liveness checks - Liveness checks confirm that the VM and the Docker container are running. Instances that are deemed unhealthy are restarted.
liveness_check:
  path: "/liveness_check"
  timeout_sec: 20         #1-300   Timeout interval for each request, in seconds.
  check_interval_sec: 30 #1-300   Time interval between checks, in seconds.
  failure_threshold: 6   #1-10    An instance is unhealthy after failing this number of consecutive checks.
  success_threshold: 2   #1-10    An unhealthy instance becomes healthy again after successfully responding to this number of consecutive checks.
  initial_delay_sec: 300 #0-3600  The delay, in seconds, after the instance starts during which health check responses are ignored. This setting can allow an instance more time at deployment to get up and running.

##Readiness checks - Readiness checks confirm that an instance can accept incoming requests. Instances that don't pass the readiness check are not added to the pool of available instances.
readiness_check:
  path: "/readiness_check"
  timeout_sec: 10             #1-300      Timeout interval for each request, in seconds.
  check_interval_sec: 15      #1-300      Time interval between checks, in seconds.
  failure_threshold: 4       #1-10    An instance is unhealthy after failing this number of consecutive checks.
  success_threshold: 2       #1-10    An unhealthy instance becomes healthy after successfully responding to this number of consecutive checks.
  app_start_timeout_sec: 300 #1-3600  The maximum time, in seconds, an instance has to become ready after the VM and other infrastructure are provisioned. After this period, the deployment fails and is rolled back. You might want to increase this setting if your application requires significant initialization tasks, such as downloading a large file, before it is ready to serve.

#Service scaling settings
automatic_scaling:
  min_num_instances: 2
  max_num_instances: 3
  cpu_utilization:
    target_utilization: 0.7
Willable answered 14/1, 2020 at 17:14 Comment(8)
I have the same issue.Joseph
Hey @Alex, when did you start to have this issue?Willable
Hey @Shb, which App Engine environment are you using, standard or flex? Have you made any changes recently to your application? From the error message I understand that you are writing custom metrics to the gae_app resource, which is not one of the supported resource types for custom metrics, check hereUnaneled
Hey @MethkalKhalawi, I am using flex. No, I haven't made any changes, and I am not creating any custom metrics.Willable
Do you see any other errors before or after the error messages that you have posted? Can you add them? Did you have more traffic on your app in this timeframe? Can you edit your question to include your app.yaml file? Please add all the requested information to your question.Unaneled
This issue has occurred for other users as well. It was caused by spikes in traffic that made the autoscaler scale VM instances up and then back down. The App Engine product team suggests increasing the minimum instance count or increasing the target CPU utilization. Can you please update the description with your app.yaml file and check whether your service has traffic spikes? Also, can you try increasing the mentioned values and see if that resolves the issue?Metatarsal
Added app.yaml. I tried the following changes, min_num_instances=3 and max_num_instances=5 (see the sketch after this comment thread), but had the same problem.Willable
I am with Google Cloud Platform support, and since you have tried editing the app.yaml file without any luck, I would suggest opening a support ticket so that we can investigate the issue further and provide additional details about why it is failing. At the moment the setup looks all right, so a deeper investigation will give more helpful details.Metatarsal
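
For reference, the scaling tuning suggested in the comments above would look roughly like this in app.yaml; the exact values here are only illustrative, not figures prescribed by the App Engine team:

automatic_scaling:
  min_num_instances: 3      #keep more instances warm so traffic spikes cause less scale-up/scale-down churn
  max_num_instances: 5
  cpu_utilization:
    target_utilization: 0.8 #let instances run hotter before the autoscaler adds (and later removes) VMs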

The error is caused by an upgrade of the Stackdriver logging sidecar to version 1.6.25, which started pushing FluentD metrics to Stackdriver Monitoring via OpenCensus. However, that integration does not work with App Engine Flex yet.

These errors should be log noise only. They are not related to the health check logs and should not cause VM restarts. If your VM instances are being restarted frequently, there is likely some other reason. In the Stackdriver Logging UI, you can search for "Free disk space" under the vm.syslog stream and for "unhealthy sidecars" under the vm.events stream. If such logs show up, your instance restarts may be caused by low free disk space or by unhealthy sidecar containers.
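
If it helps, those searches can be expressed as Logs Viewer filters (or passed to gcloud logging read). The log names below follow the usual App Engine Flex stream naming and are an assumption on my part, so adjust them to whatever streams you actually see in your project:

resource.type="gae_app"
logName="projects/YOUR_PROJECT_ID/logs/appengine.googleapis.com%2Fvm.syslog"
textPayload:"Free disk space"

and, for the sidecar events:

resource.type="gae_app"
logName="projects/YOUR_PROJECT_ID/logs/appengine.googleapis.com%2Fvm.events"
textPayload:"Unhealthy sidecars"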

Supplicate answered 23/1, 2020 at 18:54 Comment(2)
Hey @Yanwei Guo, yes, I am seeing logs such as: "Unhealthy sidecars detected: stackdriver_metrics_agent". Do you know why this happens, and is this the reason the instances are restarting? Also, do you know how I can fix this?Willable
Hi @Shb, not sure why this happens. I'd suggest: 1. Try to debug further with the logs under vm.syslog. 2. Check memory usage; the Stackdriver monitoring agent may crash if it doesn't have sufficient memory. 3. Contact App Engine support for help.Supplicate
