Dataflow Pipeline - "Processing stuck in step <STEP_NAME> for at least <TIME> without outputting or completing in state finish..."

Asked 4/3, 2019 at 19:39 Answered 10/9, 2019 at 21:38

google-cloud-dataflow apache-beam

The Dataflow pipelines developed by my team suddenly started getting stuck and stopped processing our events. Their worker logs filled up with warning messages saying that one specific step was stuck. The peculiar thing is that the failing steps are different: one is a BigQuery output and the other is a Cloud Storage output.

The following are the log messages that we are receiving:

For BigQuery output:

Processing stuck in step <STEP_NAME>/StreamingInserts/StreamingWriteTables/StreamingWrite for at least <TIME> without outputting or completing in state finish
  at sun.misc.Unsafe.park(Native Method)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
  at java.util.concurrent.FutureTask.get(FutureTask.java:191)
  at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:765)
  at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:829)
  at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.flushRows(StreamingWriteFn.java:131)
  at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn.finishBundle(StreamingWriteFn.java:103)
  at org.apache.beam.sdk.io.gcp.bigquery.StreamingWriteFn$DoFnInvoker.invokeFinishBundle(Unknown Source)

For Cloud Storage output:

Processing stuck in step <STEP_NAME>/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles for at least <TIME> without outputting or completing in state process
  at sun.misc.Unsafe.park(Native Method)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
  at java.util.concurrent.FutureTask.get(FutureTask.java:191)
  at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(AbstractGoogleAsyncWriteChannel.java:421)
  at com.google.cloud.hadoop.util.AbstractGoogleAsyncWriteChannel.close(AbstractGoogleAsyncWriteChannel.java:287)
  at org.apache.beam.sdk.io.FileBasedSink$Writer.close(FileBasedSink.java:1007)
  at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:726)
  at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)

All applications have been drained and redeployed, but the same thing happened again after a while (3 to 4 hours). Some of them had been running for more than 40 days and suddenly got into this state without any changes to the code.

I would like to ask for help in finding the cause of this problem. These are the IDs of some of the affected Dataflow jobs:

Stuck in BigQuery output: 2019-03-04_04_46_31-3901977107649726570

Stuck in Cloud Storage output: 2019-03-04_07_50_00-10623118563101608836

Coinstantaneous answered 4/3, 2019 at 19:39 Comment(2)

What version of the Dataflow SDK are you using? I am experiencing the same thing, across multiple projects, on 2.5.0. – Blessed 5/3, 2019 at 18:0

We are using Apache Beam SDK 2.8.0, but we have probably found the problem, which may also be affecting you. Google's documentation says that "Pipelines might become stuck due to an issue with the Conscrypt library. If you see errors in Stackdriver logging with stack traces that include Conscrypt related calls, you might be affected by this issue. To resolve the issue, upgrade to SDK 2.9.0 or downgrade to SDK 2.4.0.". We are still testing it, but that seems to be the issue. – Coinstantaneous 6/3, 2019 at 16:23

The Processing stuck messages do not necessarily imply that your pipeline is actually stuck. These messages are logged by a worker that has been performing the same operation for over 5 minutes.

Often, this simply indicates a slow operation: an external RPC, or waiting for an external process (very common when performing Load or Query jobs to BigQuery).

If you see these messages frequently in your pipeline, or with steadily increasing durations (5m, 10m, 50m, 1h, etc.), then the pipeline is probably stuck. If you see them only occasionally, there is nothing to worry about.

It is worth noting that older versions of Beam (2.5.0 to 2.8.0) had a deadlock issue with the Conscrypt library, which was used as the default security provider. As of Beam 2.9.0, Conscrypt is no longer the default security provider.

Another option is to downgrade to Beam 2.4.0, where Conscrypt was also not the default provider.

Jingoism answered 7/3, 2019 at 23:37 Comment(5)

We ran into the same problem with 2.11.0 as well. Probably something the Dataflow team should look deeper into? – Weighty 22/4, 2019 at 7:13

Can you file a support ticket to follow up on this? – Jingoism 22/4, 2019 at 21:21

FWIW, it is normal for some steps to take a while, around 10 to 15 minutes. – Jingoism 23/4, 2019 at 5:45

We're on 2.9.0 and have had a job recently get wedged for over an hour, catch up, then get wedged again. It's really strange. – Moonrise 27/4, 2019 at 3:30

I've tried Beam 2.11 and even 2.12 and my Dataflow job still gets stuck. Depending on the job, processing might come to a complete halt after enough errors, or restart after a period of time. I haven't been able to find a pattern, but the errors definitely occur more frequently during high-volume hours. – Lundgren 21/5, 2019 at 13:17

I'm having the same issue. I've found that the most common cause is that one of the jobs failed to insert into the BigQuery table or (much less commonly) failed to save the file to the GCS bucket. The thread in charge does not catch the exception and keeps waiting for the job. This is a bug in Apache Beam, and I have already created a ticket for it.

https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-7693

Let's see if the Apache Beam maintainers fix this issue (it is literally an unhandled exception).
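Both stack traces in the question show the worker parked in `FutureTask.get()` with no timeout. The following is a generic, minimal sketch of that failure mode and of a bounded wait; it is not Beam's actual code, and the class and method names (`BoundedWait`, `await`) are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Generic illustration only; this is NOT how Beam implements its sinks.
final class BoundedWait {

    /**
     * Waits for the future, but gives up after timeoutMillis
     * instead of parking the thread indefinitely as Future.get() would.
     */
    static String await(Future<String> future, long timeoutMillis) {
        try {
            return "ok: " + future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The task never finished: cancel it and report, rather than hang.
            future.cancel(true);
            return "timed out";
        } catch (ExecutionException e) {
            // The task failed: the original exception is available as the cause.
            return "failed: " + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        // A future that nobody ever completes, standing in for a lost response.
        Future<String> neverCompletes = new CompletableFuture<>();
        System.out.println(await(neverCompletes, 100));

        // A future that failed, standing in for a rejected insert.
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("insert rejected"));
        System.out.println(await(failed, 100));
    }
}
```

With an unbounded `get()`, the first case would block forever, which is consistent with the "Processing stuck" warnings repeating with growing durations.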

So far my recommendation is to validate your data against the table's constraints before the insertion. Keep in mind things like:

1) Max row size (as of 2019, 1MB for streaming inserts and 100MB for batch).
2) Rows with missing REQUIRED values should be sent to a dead letter output beforehand, so that they never reach the insert job.
3) If you have unknown fields, don't forget to enable the option to ignore them (`ignoreUnknownValues()` on `BigQueryIO.Write` in the Java SDK); otherwise they will make your job die.
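The checks listed above could be sketched roughly as follows, in plain Java with no Beam dependency. The class name `RowValidator`, the field sets, and the byte-size estimate are all illustrative assumptions; the 1 MB limit is simply the figure quoted above, and a real pipeline would route rejected rows to its own dead letter output.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical pre-insert validator; names and limits are assumptions.
final class RowValidator {
    // Assumed streaming-insert row limit, taken from the 1 MB figure quoted above.
    static final int MAX_ROW_BYTES = 1_000_000;

    private final Set<String> requiredFields;
    private final Set<String> knownFields;

    RowValidator(Set<String> requiredFields, Set<String> knownFields) {
        this.requiredFields = requiredFields;
        this.knownFields = knownFields;
    }

    /** Returns the list of problems; an empty list means the row may be inserted. */
    List<String> validate(Map<String, Object> row) {
        List<String> problems = new ArrayList<>();
        for (String field : requiredFields) {
            if (row.get(field) == null) {
                problems.add("missing REQUIRED field: " + field);
            }
        }
        for (String field : row.keySet()) {
            if (!knownFields.contains(field)) {
                problems.add("unknown field: " + field);
            }
        }
        // Rough size estimate from the string form; the real encoded size may differ.
        int approxBytes = row.toString().getBytes(StandardCharsets.UTF_8).length;
        if (approxBytes > MAX_ROW_BYTES) {
            problems.add("row too large: ~" + approxBytes + " bytes");
        }
        return problems;
    }

    public static void main(String[] args) {
        RowValidator validator = new RowValidator(
            new HashSet<>(Arrays.asList("id", "ts")),
            new HashSet<>(Arrays.asList("id", "ts", "payload")));

        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 42);
        row.put("colour", "red"); // not in the schema, and "ts" is missing

        List<String> problems = validator.validate(row);
        if (problems.isEmpty()) {
            System.out.println("insert row");
        } else {
            System.out.println("dead letter: " + problems);
        }
    }
}
```

Flagging unknown fields is only needed if the sink is not already configured to ignore them.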

I presume that you are only having issues during the peak hours because more “unsatisfied” events are coming.

Hopefully this helps a little.

Jurisprudence answered 13/7, 2019 at 6:34 Comment(0)

I was running into the same error, and the reason was that I had created an empty BigQuery table without specifying a schema. Make sure to create the BigQuery table with a schema before loading data into it via Dataflow.

Electrophorus answered 10/9, 2019 at 21:38 Comment(0)
