I'm reading an article on exactly-once processing as implemented by some Dataflow sources and sinks, and I'm having trouble understanding the BigQuery sink example. From the article:
Generating a random UUID is a non-deterministic operation, so we must add a reshuffle before we insert into BigQuery. Once that is done, any retries by Cloud Dataflow will always use the same UUID that was shuffled. Duplicate attempts to insert into BigQuery will always have the same insert id, so BigQuery is able to filter them.
// Apply a unique identifier to each record
// (Record stands in for the pipeline's element type, which the article elides)
c
    .apply(ParDo.of(new DoFn<Record, KV<Integer, RecordWithId>>() {
      @ProcessElement
      public void processElement(ProcessContext context) {
        String uniqueId = UUID.randomUUID().toString();
        context.output(KV.of(ThreadLocalRandom.current().nextInt(0, 50),
            new RecordWithId(context.element(), uniqueId)));
      }
    }))
    // Reshuffle the data so that the applied identifiers are stable and will not change.
    .apply(Reshuffle.<Integer, RecordWithId>of())
    // Stream records into BigQuery with unique ids for deduplication.
    .apply(ParDo.of(new DoFn<KV<Integer, RecordWithId>, Void>() {
      @ProcessElement
      public void processElement(ProcessContext context) {
        RecordWithId record = context.element().getValue();
        insertIntoBigQuery(record.record(), record.id());
      }
    }));
What does Reshuffle do here, and how does it prevent a different UUID from being generated for the same insert on subsequent retries?
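To make my confusion concrete, here is a minimal sketch (outside of Beam, with a hypothetical `attemptProcessing` method standing in for the first DoFn) of what I understand the problem to be: if the UUID-generating step simply re-executes on retry, the retried element gets a different id, so BigQuery could not deduplicate it.

```java
import java.util.UUID;

public class RetrySketch {
    // Stand-in for the first DoFn: each (re)execution of the step
    // generates a fresh UUID for the same input element.
    static String attemptProcessing(String element) {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // First attempt and a simulated retry of the SAME element.
        String firstAttemptId = attemptProcessing("record-1");
        String retryId = attemptProcessing("record-1");

        // The two attempts carry different insert ids, so BigQuery
        // would see them as two distinct rows rather than duplicates.
        System.out.println(firstAttemptId.equals(retryId)); // prints "false"
    }
}
```

What I don't see is how inserting a Reshuffle between the UUID step and the BigQuery insert changes this picture on retry.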