Experiencing slow streaming writes to BigQuery from Dataflow pipeline?

I'm experiencing unexpected performance issues when writing to BigQuery with streaming inserts and the Python SDK 2.23.

Without the write step, the pipeline runs on one worker at ~20-30% CPU. With the BigQuery step added, the pipeline scales up to 6 workers, all at 70-90% CPU.

I am pretty new to Dataflow and Beam, so this behaviour may be normal or I may be doing something wrong, but using 6 machines to write 250 rows per second to BigQuery seems a bit heavy to me. I'm wondering how it is even possible to reach the insertion quota of 100,000 rows per second.

My pipeline looks like this:

p
    | "Read from PubSub" >> beam.io.ReadFromPubSub(subscription=options.pubsub_subscription) # ~40/s
    | "Split messages" >> beam.FlatMap(split_messages) # ~ 400/s
    | "Prepare message for BigQuery" >> beam.Map(prepare_row)
    | "Filter known message types" >> beam.Filter(filter_message_types) # ~ 250/s
    | "Write to BigQuery" >> beam.io.WriteToBigQuery(
          table=options.table_spec_position,
          schema=table_schema,
          write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
          create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
          additional_bq_parameters=additional_bq_parameters,
      )
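
The helper functions aren't shown above. Roughly, each incoming Pub/Sub payload contains several messages separated by newlines that get split out, converted to BigQuery row dicts, and filtered down to the message types we handle. As a simplified, hypothetical sketch (not my exact code; the JSON parsing and KNOWN_MESSAGE_TYPES are just placeholders):

import json

# placeholder set of message types we keep
KNOWN_MESSAGE_TYPES = {"position"}

def split_messages(message):
    # each Pub/Sub payload contains several messages separated by '\n'
    for line in message.decode("utf-8").split("\n"):
        if line:
            yield line

def prepare_row(line):
    # placeholder: turn one message into a dict matching table_schema
    return json.loads(line)

def filter_message_types(row):
    # placeholder: keep only the message types we know how to handle
    return row.get("type") in KNOWN_MESSAGE_TYPES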

The pipeline runs with the following options, although I experienced similar behaviour without using Streaming Engine.

--enable_streaming_engine \
--autoscaling_algorithm=THROUGHPUT_BASED \
--max_num_workers=15 \
--machine_type=n1-standard-2 \
--disk_size_gb=30

Screenshot of metrics: [image of the Dataflow job metrics]

My Question is if this behaviour is normal or is there anything I can do to reduce the number of required workers for this pipeline. Thanks!

Update: Here is an image of the last step of the Dataflow graph with wall times (taken after the job had been running for 1h). All the preceding steps have very low wall times of only a few seconds.

Die asked 9/9, 2020 at 7:28 Comment(4)
Have you tried the dedicated solution for your scenario, namely a Dataflow template: cloud.google.com/dataflow/docs/guides/templates/… Let me know if you tried it. – Lindell
Thanks for your reply! I've seen the templates, but we cannot use them because our incoming Pub/Sub message contains multiple messages which need to be extracted (split by '\n', actually). Also, we want to build a custom pipeline ourselves to get a better understanding of Beam and Dataflow. – Die
This is surprisingly slow. Can you look at your msec counters and see in what stages the pipeline is spending its time? – Farly
Thanks for your reply! I'm not sure what you mean by msec counters. I updated the question and added an image of the final step and the corresponding wall times, hoping this is what you asked for. It seems like the final write is causing the delay, but I'm not sure how to investigate further from there. – Die

After debugging a bit, I found that there were some invalid messages which could not be written to BigQuery (and which did not log an error). So, for anyone who comes across a similar issue:

After changing the insert_retry_strategy of beam.io.WriteToBigQuery to RETRY_NEVER and printing out the deadletter PCollection, I fixed the wrongly formatted messages and the performance improved. I guess the invalid messages were stuck in retries due to the default strategy of RETRY_ALWAYS.
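
For reference, here is roughly what the debugging version of the write step looked like (a simplified sketch, not my full pipeline; rows stands for the PCollection produced by the filter step, and the table/schema variables are the same ones from my question):

import logging

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryWriteFn
from apache_beam.io.gcp.bigquery_tools import RetryStrategy

result = (
    rows  # the PCollection coming out of "Filter known message types"
    | "Write to BigQuery" >> beam.io.WriteToBigQuery(
          table=options.table_spec_position,
          schema=table_schema,
          write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
          create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
          additional_bq_parameters=additional_bq_parameters,
          # don't retry failed inserts, so they show up in the deadletter output
          # (RETRY_ON_TRANSIENT_ERROR is probably the better long-term choice,
          # see the comments below)
          insert_retry_strategy=RetryStrategy.RETRY_NEVER,
      )
)

# rows rejected by BigQuery come back as a separate output of the write transform
_ = (
    result[BigQueryWriteFn.FAILED_ROWS]
    | "Log failed rows" >> beam.Map(lambda row: logging.error("Failed row: %s", row))
)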

Die answered 23/9, 2020 at 7:20 Comment(3)
Have you also tried RETRY_ON_TRANSIENT_ERROR? It ought to ignore failures with invalid messages but still retry when there's a different error. – Marymarya
No, not yet, but it sounds like that is the retry strategy I actually want to use. Thanks for pointing it out! – Die
Happy to help. You might want to log the failed rows; here's an intro on how to do it: beam.apache.org/documentation/patterns/bigqueryio/… – Marymarya
