I'm building data streaming pipelines on Google Cloud Dataflow that read from Kafka and write to various sinks. A pipeline looks roughly like this (simplified):
// Example pipeline that writes to BigQuery (Kafka bootstrap servers,
// deserializers, etc. omitted).
Pipeline.create(options)
    .apply(KafkaIO.read().withTopic(options.topic))
    .apply(/* Convert to a Row type */)
    .setRowSchema(schemaRegistry.lookup(options.topic))
    .apply(
        BigQueryIO.<Row>write()
            .useBeamSchema()
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .to(options.outputProject + ":" + options.outputDataset + "." + options.outputTable));
I plan to run one pipeline per Kafka topic, and we have hundreds of topics. Each pipeline looks up its topic's schema at pipeline-construction time, so that BigQueryIO can create the necessary table before the pipeline starts.
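For context, schemaRegistry.lookup is our own helper. A minimal sketch of the idea, assuming a Confluent-style schema registry holding Avro schemas (the method name, registry client usage, and "<topic>-value" subject naming below are illustrative, not the real implementation):

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import org.apache.avro.Schema.Parser;
import org.apache.beam.sdk.schemas.Schema;
// Note: AvroUtils moved to org.apache.beam.sdk.extensions.avro.schemas.utils in newer Beam releases.
import org.apache.beam.sdk.schemas.utils.AvroUtils;

// Runs at graph-construction time on the submitting machine (not on the workers),
// so the resolved schema is baked into the job at submission.
static Schema lookupBeamSchema(String topic, String registryUrl) throws Exception {
  CachedSchemaRegistryClient client = new CachedSchemaRegistryClient(registryUrl, 100);
  // Default TopicNameStrategy: the value schema is registered under "<topic>-value".
  String avroJson = client.getLatestSchemaMetadata(topic + "-value").getSchema();
  return AvroUtils.toBeamSchema(new Parser().parse(avroJson));
}

Because the schema is resolved once at submission time, a running job never picks up new schema versions, which is what leads to my question.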
Question: How can I support evolving schemas in my Dataflow pipelines?
I've explored updating an existing Dataflow job in place (using the --update
flag). The idea is to automate submitting an updated job whenever a schema changes, but an update seems to incur about three minutes of downtime, and for some of these jobs that much downtime won't work. I'm looking for other approaches with, ideally, no more than a few seconds of downtime.
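Concretely, the automation I have in mind would just re-submit the same pipeline with the update flag set and a matching job name. A rough sketch (the job name here is illustrative):

import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public static void main(String[] args) {
  DataflowPipelineOptions updateOptions =
      PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
  // --update replaces the running job that has the same jobName, carrying over
  // in-flight state; transform names must stay compatible between the old and new graphs.
  updateOptions.setUpdate(true);
  updateOptions.setJobName("kafka-to-bq-orders"); // must match the live job's name
  // ...then build the same pipeline as above (now with the new schema) and run() it.
}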
bqTable.updateSchema(Schema.of(newFieldList)); – Kaminsky