stackdriver-metadata-agent-cluster-level gets OOMKilled

I updated a GKE cluster from 1.13 to 1.15.9-gke.12. In the process I switched from legacy logging to Stackdriver Kubernetes Engine Monitoring. Now I have the problem that the stackdriver-metadata-agent-cluster-level pod keeps restarting because it gets OOMKilled.

The memory usage seems to be just fine though.
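For completeness, the OOM kills can be confirmed from the container's last termination state (the pod name below is a placeholder):

# list the metadata-agent pods
kubectl -n kube-system get pods | grep stackdriver-metadata-agent

# the container's "Last State" shows Terminated with Reason: OOMKilled
kubectl -n kube-system describe pod <stackdriver-metadata-agent-pod>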

The logs also look just fine (same as the logs of a newly created cluster):

I0305 08:32:33.436613       1 log_spam.go:42] Command line arguments:
I0305 08:32:33.436726       1 log_spam.go:44]  argv[0]: '/k8s_metadata'
I0305 08:32:33.436753       1 log_spam.go:44]  argv[1]: '-logtostderr'
I0305 08:32:33.436779       1 log_spam.go:44]  argv[2]: '-v=1'
I0305 08:32:33.436818       1 log_spam.go:46] Process id 1
I0305 08:32:33.436859       1 log_spam.go:50] Current working directory /
I0305 08:32:33.436901       1 log_spam.go:52] Built on Jun 27 20:15:21 (1561666521)
 at [email protected]:/google/src/files/255462966/depot/branches/gcm_k8s_metadata_release_branch/255450506.1/OVERLAY_READONLY/google3
 as //cloud/monitoring/agents/k8s_metadata:k8s_metadata
 with gc go1.12.5 for linux/amd64
 from changelist 255462966 with baseline 255450506 in a mint client based on //depot/branches/gcm_k8s_metadata_release_branch/255450506.1/google3
Build label: gcm_k8s_metadata_20190627a_RC00
Build tool: Blaze, release blaze-2019.06.17-2 (mainline @253503028)
Build target: //cloud/monitoring/agents/k8s_metadata:k8s_metadata
I0305 08:32:33.437188       1 trace.go:784] Starting tracingd dapper tracing
I0305 08:32:33.437315       1 trace.go:898] Failed loading config; disabling tracing: open /export/hda3/trace_data/trace_config.proto: no such file or directory
W0305 08:32:33.536093       1 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0305 08:32:33.936066       1 main.go:134] Initiating watch for { v1 nodes} resources
I0305 08:32:33.936169       1 main.go:134] Initiating watch for { v1 pods} resources
I0305 08:32:33.936231       1 main.go:134] Initiating watch for {batch v1beta1 cronjobs} resources
I0305 08:32:33.936297       1 main.go:134] Initiating watch for {apps v1 daemonsets} resources
I0305 08:32:33.936361       1 main.go:134] Initiating watch for {extensions v1beta1 daemonsets} resources
I0305 08:32:33.936420       1 main.go:134] Initiating watch for {apps v1 deployments} resources
I0305 08:32:33.936489       1 main.go:134] Initiating watch for {extensions v1beta1 deployments} resources
I0305 08:32:33.936552       1 main.go:134] Initiating watch for { v1 endpoints} resources
I0305 08:32:33.936627       1 main.go:134] Initiating watch for {extensions v1beta1 ingresses} resources
I0305 08:32:33.936698       1 main.go:134] Initiating watch for {batch v1 jobs} resources
I0305 08:32:33.936777       1 main.go:134] Initiating watch for { v1 namespaces} resources
I0305 08:32:33.936841       1 main.go:134] Initiating watch for {apps v1 replicasets} resources
I0305 08:32:33.936897       1 main.go:134] Initiating watch for {extensions v1beta1 replicasets} resources
I0305 08:32:33.936986       1 main.go:134] Initiating watch for { v1 replicationcontrollers} resources
I0305 08:32:33.937067       1 main.go:134] Initiating watch for { v1 services} resources
I0305 08:32:33.937135       1 main.go:134] Initiating watch for {apps v1 statefulsets} resources
I0305 08:32:33.937157       1 main.go:142] All resources are being watched, agent has started successfully
I0305 08:32:33.937168       1 main.go:145] No statusz port provided; not starting a server
I0305 08:32:37.134913       1 binarylog.go:95] Starting disk-based binary logging
I0305 08:32:37.134965       1 binarylog.go:265] rpc: flushed binary log to ""

I already tried disabling the logging and re-enabling it, without success. The pod keeps restarting all the time (more or less every minute).
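For reference, disabling and re-enabling Stackdriver Kubernetes Engine Monitoring can be done with gcloud roughly like this (cluster name and zone are placeholders):

# disable logging and monitoring
gcloud container clusters update <cluster-name> --zone <zone> \
  --logging-service none --monitoring-service none

# re-enable Stackdriver Kubernetes Engine Monitoring
gcloud container clusters update <cluster-name> --zone <zone> \
  --logging-service logging.googleapis.com/kubernetes \
  --monitoring-service monitoring.googleapis.com/kubernetes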

Does anybody have the same experience?

Baksheesh asked 5/3, 2020 at 8:34 Comment(2)
More people are experiencing the same issue: github.com/kyma-project/test-infra/issues/2105 – Baksheesh
We are experiencing this error as well and are in the process of creating a support ticket with Google Cloud. – Fenugreek

The issue is caused by the memory limit set on the metadata-agent deployment being too low: the pod needs more memory than the limit allows to work properly, so it gets OOMKilled.

There is a workaround for this issue until it is fixed.

You can overwrite the base resources in the configmap of the metadata-agent with:

kubectl edit cm -n kube-system metadata-agent-config

Setting baseMemory: 50Mi should be enough; if it doesn't work, use a higher value such as 100Mi or 200Mi.

The metadata-agent-config configmap should then look something like this:

apiVersion: v1
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
    baseMemory: 50Mi
kind: ConfigMap
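If you prefer a non-interactive change, the same edit should be possible with kubectl patch; this is just a sketch that assumes the NannyConfiguration data key shown above:

# replace the whole NannyConfiguration value with one that sets baseMemory
kubectl -n kube-system patch configmap metadata-agent-config --type merge \
  -p '{"data":{"NannyConfiguration":"apiVersion: nannyconfig/v1alpha1\nkind: NannyConfiguration\nbaseMemory: 50Mi"}}'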

Note also that you need to restart the deployment, as the configmap doesn't get picked up automatically:

kubectl delete deployment -n kube-system stackdriver-metadata-agent-cluster-level
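Once the GKE addon manager has recreated the deployment, you can verify which memory limit the resizer actually applied, for example:

# prints the memory limits of the containers in the recreated deployment
kubectl -n kube-system get deployment stackdriver-metadata-agent-cluster-level \
  -o jsonpath='{.spec.template.spec.containers[*].resources.limits.memory}'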

For more details, see the addon-resizer documentation.

Krystakrystal answered 5/3, 2020 at 16:16 Comment(5)
Thanks a lot, that worked like a charm! I tried to update the deployment directly but it got reset all the time. Now it's obvious to me why. – Baksheesh
@piotr-malec thank you for the answer, it worked! Do you know if there's an official bug issue we can comment on and watch? – Bethel
Once there are any new updates on this bug I will update my answer. – Krystakrystal
Still works as of May 10, 2020, Kubernetes version 1.15.8. – Antihistamine
I encountered this issue and was advised by Google Support to make these changes. The public issue tracker is: issuetracker.google.com/issues/150436352 – Interstate

I was about to open a support ticket with GCP, but they have this notice:

Description: We are experiencing an issue with Fluentd crashlooping in Google Kubernetes Engine where the master version is 1.14 or 1.15, when gVisor is enabled. The fix is targeted for a release aiming to begin on 17 April 2020. We will provide more updates as the date gets closer. We will provide an update by Thursday, 2020-04-09 14:30 US/Pacific with current details. We apologize to all who are affected by the disruption.

Start time: April 2, 2020 at 10:58:24 AM GMT-7

End time:

Steps to reproduce: Fluentd crashloops in GKE clusters could lead to missing logs.

Workaround: Upgrade Google Kubernetes Engine cluster masters to version 1.16+.

Affected products: Other
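For reference, the master upgrade itself should be something along these lines (cluster name, zone and the exact 1.16 patch version are placeholders; available versions can be listed with gcloud container get-server-config):

# list the master versions available in the zone
gcloud container get-server-config --zone <zone>

# upgrade only the cluster master to a 1.16 version
gcloud container clusters upgrade <cluster-name> --master \
  --cluster-version <1.16-version> --zone <zone>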

Jemadar answered 4/4, 2020 at 18:8 Comment(1)
The answer is to upgrade again? I've just gone through upgrades. As of April 30th, I'm still seeing this issue. – Erna
