SDK: Apache Beam SDK for Go 0.5.0
Our Go job has been running fine on Google Cloud Dataflow for weeks. We haven't changed the job itself, and the SDK version appears to be the same as before. Last night it failed, and I'm not sure why. The job reaches the 1-hour time limit and is then cancelled because there is no worker activity.
Looking at the Stackdriver logs, the only thing that stands out is a repeated error: Error syncing pod...failed to "StartContainer" for "sdk" with CrashLoopBackOff
It seems to be failing to sync the pod (?) for some reason, and therefore waiting 5 minutes before each retry.
Could anyone shed some light on what might be causing this, and on how we could find more information or diagnose the cause of the problem?
Note: I checked the status page for Google Cloud Dataflow, and there don't appear to be any outages affecting the service.