The lifecycle of a DoFn is as follows:
- Setup
- Repeatedly process bundles:
  - StartBundle
  - Repeatedly ProcessElement
  - FinishBundle
- Teardown
I.e. one instance of a DoFn can process many (zero or more) bundles, and within one bundle, it processes many (zero or more) elements.
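To make the call order concrete, here is a minimal sketch in plain Python. It does not use Beam at all: LoggingFn is a hypothetical class whose method names merely mirror the lifecycle hooks, and run_instance is an invented stand-in for whatever a real runner does with one DoFn instance.

```python
class LoggingFn:
    """Plain class mimicking the DoFn lifecycle hooks; it is not a Beam DoFn."""

    def __init__(self):
        self.calls = []

    def setup(self):
        self.calls.append("Setup")

    def start_bundle(self):
        self.calls.append("StartBundle")

    def process(self, element):
        self.calls.append(f"ProcessElement({element})")

    def finish_bundle(self):
        self.calls.append("FinishBundle")

    def teardown(self):
        self.calls.append("Teardown")


def run_instance(fn, bundles):
    """Hypothetical stand-in for a runner driving one DoFn instance."""
    fn.setup()                      # once per instance
    for bundle in bundles:          # zero or more bundles
        fn.start_bundle()
        for element in bundle:      # zero or more elements per bundle
            fn.process(element)
        fn.finish_bundle()
    fn.teardown()                   # once per instance


fn = LoggingFn()
# Two bundles: one with two elements, one empty.
run_instance(fn, bundles=[["a", "b"], []])
print(fn.calls)
```

Note that the empty bundle still gets a StartBundle/FinishBundle pair in this sketch, while Setup and Teardown each appear exactly once.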
Both Setup/Teardown and StartBundle/FinishBundle are optional: it is possible to implement any DoFn without them, doing all the work in ProcessElement, but it will be inefficient. Both pairs of methods allow optimizations:
- Often one wants to batch work between elements, e.g. instead of doing an RPC per element, do an RPC for batches of N elements. StartBundle/FinishBundle tell you the allowed boundaries of batching: basically, you are not allowed to batch across FinishBundle. FinishBundle must force a flush of your batch (and StartBundle must initialize / reset the batch). This is the only common use of these methods that I'm aware of, but if you're interested in a more general or rigorous explanation: a bundle is a unit of fault tolerance, and the runner assumes that by the time FinishBundle returns, you have completely performed all the work (outputting elements or performing side effects) associated with all elements seen in this bundle; work must not "leak" between bundles.
- Often one wants to manage long-lived resources, e.g. network connections. You could do this in StartBundle/FinishBundle, but, unlike pending side effects or output, it is fine for such resources to persist between bundles. That's what Setup and Teardown are for.
- Often one also wants to perform costly initialization of a DoFn, e.g. parsing a config file. This is also best done in Setup.
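The three points above can be sketched together as follows. This is an illustration under assumptions, not Beam code: FakeClient, send_batch and MAX_BATCH_SIZE are hypothetical names, BatchingFn is a plain class whose methods mirror the lifecycle hooks, and the loop at the bottom stands in for a runner.

```python
class FakeClient:
    """Hypothetical RPC client that records each batched call."""

    def __init__(self):
        self.rpcs = []
        self.open = True

    def send_batch(self, batch):
        self.rpcs.append(list(batch))

    def close(self):
        self.open = False


class BatchingFn:
    """Plain class mimicking DoFn hooks; not tied to a real Beam runner."""

    MAX_BATCH_SIZE = 3  # illustrative value for N

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self.client = None
        self._batch = None

    def setup(self):
        # Long-lived resource: created once, reused across bundles.
        self.client = self._client_factory()

    def start_bundle(self):
        # Initialize / reset the batch at the start of every bundle.
        self._batch = []

    def process(self, element):
        self._batch.append(element)
        if len(self._batch) >= self.MAX_BATCH_SIZE:
            self._flush()

    def finish_bundle(self):
        # Nothing may stay buffered once the bundle ends.
        self._flush()

    def teardown(self):
        self.client.close()

    def _flush(self):
        if self._batch:
            self.client.send_batch(self._batch)
            self._batch = []


# Simulated runner: one instance, two bundles.
fn = BatchingFn(FakeClient)
fn.setup()
for bundle in ([1, 2, 3, 4], [5]):
    fn.start_bundle()
    for element in bundle:
        fn.process(element)
    fn.finish_bundle()
client = fn.client
fn.teardown()
print(client.rpcs)
```

In this run the client sees three calls instead of five, and no batch spans the boundary between the two bundles: the leftover element 4 is flushed by finish_bundle rather than being merged with 5.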
More concisely:
- Manage resources and costly initialization in Setup/Teardown.
- Manage batching of work in StartBundle/FinishBundle.
(Managing resources in bundle methods is inefficient; managing batching in Setup/Teardown is plain incorrect and will lead to data loss.)
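A rough simulation of the data-loss case, again in plain Python rather than Beam. The assumptions here are mine: the simulated runner treats a bundle as done as soon as finish_bundle returns, and the worker then disappears before teardown ever runs. Sink, FlushInFinishBundle, FlushInTeardown and run_then_crash are all hypothetical names.

```python
class Sink:
    """Hypothetical external system; `written` is what actually reached it."""

    def __init__(self):
        self.written = []


class FlushInFinishBundle:
    """Correct variant: the batch never outlives its bundle."""

    def __init__(self, sink):
        self.sink = sink
        self.batch = []

    def start_bundle(self):
        self.batch = []

    def process(self, element):
        self.batch.append(element)

    def finish_bundle(self):
        self.sink.written.extend(self.batch)
        self.batch = []

    def teardown(self):
        pass


class FlushInTeardown(FlushInFinishBundle):
    """Incorrect variant: batching is managed in teardown instead."""

    def start_bundle(self):
        pass  # the batch is not reset per bundle

    def finish_bundle(self):
        pass  # nothing is flushed at the bundle boundary

    def teardown(self):
        self.sink.written.extend(self.batch)
        self.batch = []


def run_then_crash(fn, bundles):
    """Simulated runner (an assumption, not Beam): commit each bundle once
    finish_bundle returns, then lose the worker before teardown is called."""
    committed = []
    for bundle in bundles:
        fn.start_bundle()
        for element in bundle:
            fn.process(element)
        fn.finish_bundle()
        committed.extend(bundle)  # the runner now considers these elements done
    return committed              # teardown is never reached


bundles = [[1, 2], [3]]
good_sink, bad_sink = Sink(), Sink()
committed_good = run_then_crash(FlushInFinishBundle(good_sink), bundles)
committed_bad = run_then_crash(FlushInTeardown(bad_sink), bundles)

# Elements the runner thinks are done but that never reached the sink.
lost_good = [e for e in committed_good if e not in good_sink.written]
lost_bad = [e for e in committed_bad if e not in bad_sink.written]
print(lost_good, lost_bad)
```

Under these assumptions the first variant loses nothing, while the second loses every element: the runner has no reason to retry bundles it already considers finished.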
The DoFn documentation was recently updated to make this clearer.