BigQueryIO - Can't use DynamicDestination with CREATE_IF_NEEDED for unbounded PCollection and FILE_LOADS

Asked 12/3, 2018 at 17:57 Answered 28/1, 2021 at 10:32

google-cloud-platform google-bigquery google-cloud-dataflow apache-beam

My workflow : KAFKA -> Dataflow streaming -> BigQuery

Given that having low-latency isn't important in my case, I use FILE_LOADS to reduce the costs. I'm using BigQueryIO.Write with a DynamicDestination (one new table every hour, with the current hour as a suffix).

This BigQueryIO.Write is configured like this :

.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withMethod(Method.FILE_LOADS)
.withTriggeringFrequency(triggeringFrequency)
.withNumFileShards(100)

The first table is successfully created and is written to. But then the following tables are never created and I get these exceptions:

(99e5cd8c66414e7a): java.lang.RuntimeException: Failed to create load job with id prefix 5047f71312a94bf3a42ee5d67feede75_5295fbf25e1a7534f85e25dcaa9f4986_00001_00023, reached max retries: 3, last failed load job: {
  "configuration" : {
    "load" : {
      "createDisposition" : "CREATE_NEVER",
      "destinationTable" : {
        "datasetId" : "dev_mydataset",
        "projectId" : "myproject-id",
        "tableId" : "mytable_20180302_16"
      },

For the first table the CreateDisposition used is CREATE_IF_NEEDED as specified, but then this parameter is not taken into account and CREATE_NEVER is used by default.

I also created the issue on JIRA.

Precursory answered 12/3, 2018 at 17:57 Comment(0)

According to the documentation of Apache Beam's BigQueryIO, the method BigQueryIO.Write.CreateDisposition requires that a table schema is provided using the precondition .withSchema() when CREATE_IF_NEEDED is used.

As stated also in the Dataflow documentation:

Note that if you specify CREATE_IF_NEEDED as the CreateDisposition and you don't supply a TableSchema, the transform may fail at runtime with a java.lang.IllegalArgumentException if the target table does not exist.

The error that documentation states is not the same one as you are receiving (you get java.lang.RuntimeException), but according to the BigQueryIO.Write() configuration you shared, you are not specifying any table schema, and therefore, if tables are missing, the job is prone to failure.

So as a first measure to solve your issue, you should create the table schema TableSchema() that matches the data you will load into BQ, and then use the precondition .withSchema(schema) accordingly:

List<TableFieldSchema> fields = new ArrayList<>();
// Add fields like:
fields.add(new TableFieldSchema().setName("<FIELD_NAME>").setType("<FIELD_TYPE>"));
TableSchema schema = new TableSchema().setFields(fields);

// BigQueryIO.Write configuration plus:
    .withSchema(schema)

Scopula answered 19/3, 2018 at 15:41 Comment(3)

Sorry, I didn't specify it because I thought it was not relevant but I do specify a schema, using .withJsonSchema(), and it works fine to create the first table. The problem seems to be that CREATE_DISPOSITION is ignored for the other tables than the ones in the first pane. Please see Jonas' comment on the JIRA issue for more details. – Precursory 21/3, 2018 at 12:47

It does look like how the class WriteTables handles panes may be related to your issue. That's why I think this question can better be tracked in the Apache Beam forum you shared, as I do not know if there's a way to force that each new table is created in the first pane, so that it can use the defined CreateDisposition. Once the question is solved, feel free to post an answer in this post. Thanks! – Scopula 24/3, 2018 at 9:36

I am experiencing the same issue with Beam 2.6.0 and 2.9.0. I am also able to create the first table, but not any of the following. The error message is exactly the same as in the original post. I have commented in the JIRA post with a sample code that fails. – Lippe 10/1, 2019 at 10:12

This is a bug in Beam: https://issues.apache.org/jira/browse/BEAM-3772

It's still open in version 2.27

I developed a workaround : I wrote a custom PTransform which creates an empty table before BigqueryIO.Write stage. It uses the bigquery java client. You can see it here to get inspired: https://gist.github.com/matthieucham/85459eff5fdea8d115be520e2dd5ccc1

Superannuation answered 28/1, 2021 at 10:32 Comment(0)

Recommended topics

Hot tags