Apache Beam BigQueryIO write slow
Asked Answered
W

1

9

My Beam pipeline is writing to an unpartitioned BigQuery target table. The PCollection consists of millions of TableRows. BigQueryIO apparently creates a temp file for every single record in the BigQueryWriteTemp temp folder first if I run it with DirectRunner. This is obviously not performing very well. Am I doing something wrong here? This is a normal batch job and not streaming. (The same job running with DataflowRunner doesn't seem to do this)

myrows.apply("WriteToBigQuery",
                BigQueryIO.writeTableRows().to(BQ_TARGET_TABLE)
                        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

This is the log we're seeing. Each of these files contains exactly one TableRow. The same on DataflowRunner seems to create only about 3 files.

2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/59668b03-a1e8-4288-a049-3472e7cb6333.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/feeb454b-799e-4d77-bd12-dec313cdadc2.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/3c63db33-787f-4215-a425-3446d92157ed.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/87d55556-e012-4bef-8856-69efd4c5ab26.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/5e6bfe94-b1c9-49cb-b0c7-a768d78d85f3.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/b236948b-bdf0-4bfe-9d26-4e67c8904320.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/451abb93-e02a-4210-aa46-5afa0c82547d.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/60fd5ecc-8dbe-46e4-884d-3767694b009f.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/f3a5b4e0-e956-4a41-a78d-c7694950b6f1.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/a4e4c74f-d12c-495d-bf28-eb20ee25f086.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/eb3b29e1-cc0c-4a6d-82f4-8527d0c5a51e.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/916ac41b-4ece-42bb-bf24-c5ca17060d1d.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/5b76128f-3c66-4701-92ce-2d3ba2e91f65.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/3a0ae709-756e-452c-9b0f-6efa9c0864ca.
Wittie answered 10/8, 2017 at 14:59 Comment(2)
I found this post after looking for something else, but I am doing something similar and writing to BigQuery with those same dispositions. I only see one file in my BigQueryWriteTemp folder. I am using the dataflow sdk version 2.4. Do you still see all those files?Dannadannel
did you set this properties? insertBundleParallelism, maxStreamingBatchSize and numStreamingKeys?Hephzipa
M
1

Direct runner is meant for testing and development and includes additional checks to insure that the pipeline will run correctly in other runners. This comes with the side effect of decreasing performance.

Here are the additional checks:

  • enforcing immutability of elements
  • enforcing encodability of elements
  • elements are processed in an arbitrary order at all points
  • serialization of user functions (DoFn, CombineFn, etc.)
Majors answered 21/7, 2020 at 21:58 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.