How to specify insertId when streaming insert to BigQuery using Apache Beam
BigQuery supports de-duplication for streaming insert. How can I use this feature using Apache Beam?

https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency

To help ensure data consistency, you can supply insertId for each inserted row. BigQuery remembers this ID for at least one minute. If you try to stream the same set of rows within that time period and the insertId property is set, BigQuery uses the insertId property to de-duplicate your data on a best effort basis. You might have to retry an insert because there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery. If you retry an insert, use the same insertId for the same set of rows so that BigQuery can attempt to de-duplicate your data. For more information, see troubleshooting streaming inserts.

I cannot find such a feature in the Javadoc: https://beam.apache.org/releases/javadoc/2.9.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.html

In this question, the answerer suggests setting insertId in the TableRow. Is this correct?

https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/TableRow.html?is-external=true
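
For reference, in the low-level REST model the insertId does not live inside TableRow; it is set on the enclosing TableDataInsertAllRequest.Rows wrapper. A minimal sketch (client construction omitted; project, dataset, table, and the row key are placeholders I made up):

    import com.google.api.services.bigquery.Bigquery;
    import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
    import com.google.api.services.bigquery.model.TableRow;
    import java.util.Collections;

    public class LowLevelInsertSketch {
      // Streams one row with an explicit insertId via the REST model classes.
      static void insertWithId(Bigquery bigquery, TableRow row) throws Exception {
        TableDataInsertAllRequest request = new TableDataInsertAllRequest()
            .setRows(Collections.singletonList(
                new TableDataInsertAllRequest.Rows()
                    .setInsertId("row-key-123") // de-duplication key, not a column
                    .setJson(row)));
        bigquery.tabledata()
            .insertAll("my-project", "my_dataset", "my_table", request)
            .execute();
      }
    }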

The BigQuery client library has this feature:

https://googleapis.github.io/google-cloud-java/google-cloud-clients/apidocs/index.html?com/google/cloud/bigquery/package-summary.html https://github.com/googleapis/google-cloud-java/blob/master/google-cloud-clients/google-cloud-bigquery/src/main/java/com/google/cloud/bigquery/InsertAllRequest.java#L134
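
For example, a minimal sketch with the client library (dataset, table, row content, and the row key are placeholders); the first argument to RowToInsert.of becomes the row's insertId:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.InsertAllRequest;
    import com.google.cloud.bigquery.InsertAllResponse;
    import com.google.cloud.bigquery.TableId;
    import java.util.HashMap;
    import java.util.Map;

    public class ClientLibraryInsertSketch {
      public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        Map<String, Object> content = new HashMap<>();
        content.put("name", "alice");
        content.put("score", 42);

        InsertAllResponse response = bigquery.insertAll(
            InsertAllRequest.newBuilder(TableId.of("my_dataset", "my_table"))
                // Retrying with the same insertId lets BigQuery de-duplicate
                // on a best-effort basis.
                .addRow(InsertAllRequest.RowToInsert.of("row-key-123", content))
                .build());

        if (response.hasErrors()) {
          response.getInsertErrors().forEach((index, errors) ->
              System.err.println("Row " + index + " failed: " + errors));
        }
      }
    }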

Dyanne answered 9/1, 2019 at 12:52 Comment(5)
Can you say more about your use case? Dataflow/Beam should provide exactly-once behavior when coupled with BigQuery, without you needing to specify an insertId manually.Dettmer
My use case is mentioned above: I want to de-duplicate when inserting into BigQuery. So should I just specify insertId as a column in the new row?Dyanne
I understand you want to de-duplicate. But depending on the source of duplication, this might be already a solved problem.Dettmer
There is no duplication on the data source side. Since Kafka supports at-least-once delivery by default, I think there is a possibility of duplication between the Kafka producer and consumer. I also guess Dataflow might insert the same row more than once when retrying on some errors (e.g. a temporary network issue), so I want to know how to avoid duplication in both cases. This question is about streaming inserts from Dataflow to BigQuery.Dyanne
In my actual use case, the requirement for de-duplication is not that strong, so I think the easiest way is to insert into BigQuery and then de-duplicate at query time. But I still want to know whether BigQueryIO (Apache Beam) supports the de-duplication feature.Dyanne

As Felipe mentioned in the comments, it seems that Dataflow already uses insertId internally to implement "exactly once", so we cannot specify the insertId manually.

Dyanne answered 15/1, 2019 at 6:38 Comment(0)
  • Pub/Sub + Beam/Dataflow + BigQuery: "Exactly once" should be guaranteed, and you don't need to worry much about this. That guarantee is stronger when you ask Dataflow to insert to BigQuery using FILE_LOADS instead of STREAMING_INSERTS, for now.

  • Kafka + Beam/Dataflow + BigQuery: If a message can be emitted more than once from Kafka (e.g. if the producer retried the insertion), then you need to take care of de-duplication yourself. Either in BigQuery (as currently implemented, according to your comment), or in Dataflow with a .apply(Distinct.create()) transform (see the sketch below).
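
A rough sketch combining both options above. The broker address, topic, table name, and the single "payload" column are placeholders I'm inventing for illustration; note that Distinct only removes duplicates that land in the same window, and FILE_LOADS on an unbounded input requires a triggering frequency (plus an explicit shard count in current releases):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.transforms.Distinct;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.Values;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.TypeDescriptor;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.joda.time.Duration;

    public class KafkaToBigQueryDedupSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create();
        p.apply(KafkaIO.<String, String>read()
                .withBootstrapServers("kafka:9092")            // placeholder
                .withTopic("events")                           // placeholder
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
            .apply(Values.create())                            // keep the payload only
            .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
            .apply(Distinct.<String>create())                  // drops duplicates within a window
            .apply(MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((String json) -> new TableRow().set("payload", json)))
            .apply(BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")          // placeholder
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                .withTriggeringFrequency(Duration.standardMinutes(5))
                .withNumFileShards(1)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
        p.run();
      }
    }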

Dettmer answered 10/1, 2019 at 22:23 Comment(10)
Thanks! But my original question is how to use the BigQuery de-duplication feature from Apache Beam.Dyanne
You can't set it manually, because Dataflow already uses insertId itself to implement "exactly once", as described.Dettmer
OK I see. Thank you for clarification.Dyanne
Thanks for asking! I had to ask some experts to get to this answer :). Including Pablo, who improved my answer aboveDettmer
Also, I cannot find anything about the .apply(Distinct.create()) transform in the Apache Beam documentation, so it would be helpful if you could mention it there.Dyanne
beam.apache.org/documentation/sdks/javadoc/2.4.0/org/apache/…Dettmer
I mean it is not easy to find in the Javadoc without any explanation on the Apache Beam website.Dyanne
I'm confused nowDettmer
"Dataflow is already using insertId for itself to implement "exactly once" as described" Can I see how this is implemented? Would you provide the link to this implementation? Thanks.Dyanne
Re: "That guarantee is stronger when you ask Dataflow to insert to BigQuery using FILE_LOADS instead of STREAMING_INSERTS, for now". Is it really a "guarantee" that duplicate can't possibly happen? if so, why one is stronger than the other?Spitter
