Google Cloud Dataflow job stuck with repeated error 'Error syncing pod...failed to "StartContainer" for "sdk" with CrashLoopBackOff'
SDK: Apache Beam SDK for Go 0.5.0

Our Golang job has been running fine on Google Cloud Dataflow for weeks. We haven't made any updates to the job itself, and the SDK version appears to be the same as it has been. Last night it failed, and I'm not sure exactly why. It reaches the 1-hour time limit and the job is cancelled due to no worker activity.

Looking at the Stackdriver logs, the only thing that stands out is a repeated error: Error syncing pod...failed to "StartContainer" for "sdk" with CrashLoopBackOff

It seems that it's somehow failing to sync the pod(?) and then waiting 5 minutes before retrying.

Could anyone shed some light on what might be causing this and how we might go about either finding more information, or diagnosing the cause of the problem?

Note: I checked the status page for Google Cloud Dataflow and there doesn't appear to be any outage with the service.

Ambassadress answered 12/12, 2018 at 2:15 Comment(3)
I encountered a similar issue with the Apache Beam Python SDK. Using the direct runner, the pipeline works flawlessly, but when starting with the Dataflow runner we hit the same issue. The Dataflow UI shows everything is fine, but in the logs you can see the pod being restarted cyclically with the same error. – Background
This question might be a duplicate of this question. – Background
Seeing the exact same thing. I tried re-pushing the worker harness image to my own Docker account, but it also fails. Seems like something is broken. This was working a week back, when I last ran the job. – Jeneejenei
We had something similar and found that it was an inability to start the workers (in our case due to an slf4j issue, but anything that prevents the worker from starting, in whatever language, can cause it).

If you look at the Stackdriver logs (view Logs in the Dataflow UI, and click the link to go to Stackdriver), you should be able to view the worker_startup logs.
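As a sketch, you can also pull those startup logs from the command line with `gcloud logging read`. The filter below uses Dataflow's worker-startup log name; `YOUR_JOB_ID` and `YOUR_PROJECT` are placeholders you'd substitute with your own values:

```shell
# Read the last 50 worker-startup log entries for a Dataflow job.
# YOUR_JOB_ID and YOUR_PROJECT are placeholders.
gcloud logging read \
  'resource.type="dataflow_step"
   resource.labels.job_id="YOUR_JOB_ID"
   logName:"dataflow.googleapis.com%2Fworker-startup"' \
  --project=YOUR_PROJECT \
  --limit=50 \
  --format="value(textPayload)"
```

If the worker process is crashing on boot, the reason (a missing dependency, a bad binary, an incompatible image) usually shows up in these entries rather than in the job-level logs.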

Moralist answered 12/3, 2019 at 14:31 Comment(0)
I ran into the same problem today and followed the instructions here to build my own image. I pushed it to a public repo and used it with the --worker_harness_container_image option, and that worked for me.
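A minimal sketch of that workaround: re-tag the SDK harness image into a registry you control, push it, and point the pipeline at it. The image name `apache/beam_go_sdk` and the tag are assumptions here; use whichever harness image and tag match your SDK version, and replace `YOUR_PROJECT` with your own project:

```shell
# Pull the upstream Go SDK harness image (name/tag are illustrative),
# re-tag it into your own registry, and push it there.
docker pull apache/beam_go_sdk:latest
docker tag apache/beam_go_sdk:latest gcr.io/YOUR_PROJECT/beam_go_sdk:latest
docker push gcr.io/YOUR_PROJECT/beam_go_sdk:latest

# Then launch the pipeline against your copy of the image:
go run ./main.go \
  --runner=dataflow \
  --project=YOUR_PROJECT \
  --worker_harness_container_image=gcr.io/YOUR_PROJECT/beam_go_sdk:latest
```

Pinning the harness image this way also protects you from upstream image changes breaking a job that previously ran fine.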

Soukup answered 9/4, 2019 at 2:40 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.