What is the difference between DoFn.Setup and DoFn.StartBundle?

What is the difference between these two annotations?

DoFn.Setup: "Annotation for the method to use to prepare an instance for processing bundles of elements."

Uses the word "bundle", takes zero arguments.

DoFn.StartBundle: "Annotation for the method to use to prepare an instance for processing a batch of elements."

Uses the word "batch", takes zero or one arguments (StartBundleContext, a way to access PipelineOptions).

What I'm trying to do

I need to initialize a library within the DoFn instance, then use that library for every element in the "batch" or "bundle". I wouldn't normally split hairs with these two words, but in a pipeline, there might be some difference?

Mills answered 31/8, 2017 at 16:2 Comment(0)

The lifecycle of a DoFn is as follows:

  • Setup
  • Repeatedly process bundles:
    • StartBundle
    • Repeated ProcessElement
    • FinishBundle
  • Teardown

I.e. one instance of a DoFn can process many (zero or more) bundles, and within one bundle, it processes many (zero or more) elements.
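
To make the call order concrete, here is a minimal plain-Java sketch that does not depend on Beam at all. `TracingFn` and `LifecycleDemo` are hypothetical names: the first stands in for a DoFn (each method is commented with the annotation it would carry in a real DoFn), the second stands in for a runner driving a single instance through the lifecycle above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a DoFn. In Beam, each method would carry
// the annotation named in its trailing comment.
class TracingFn {
    final List<String> calls = new ArrayList<>();

    void setup() { calls.add("Setup"); }                                      // @Setup
    void startBundle() { calls.add("StartBundle"); }                          // @StartBundle
    void processElement(String e) { calls.add("ProcessElement(" + e + ")"); } // @ProcessElement
    void finishBundle() { calls.add("FinishBundle"); }                        // @FinishBundle
    void teardown() { calls.add("Teardown"); }                                // @Teardown
}

// Hypothetical stand-in for a runner: one instance, many bundles, in sequence.
class LifecycleDemo {
    static List<String> run(List<List<String>> bundles) {
        TracingFn fn = new TracingFn();
        fn.setup();                          // once per instance
        for (List<String> bundle : bundles) {
            fn.startBundle();                // once per bundle
            for (String element : bundle) {
                fn.processElement(element);  // once per element
            }
            fn.finishBundle();               // once per bundle
        }
        fn.teardown();                       // once per instance
        return fn.calls;
    }

    public static void main(String[] args) {
        // Two bundles: one with two elements, one empty.
        List<List<String>> bundles = List.of(List.of("a", "b"), List.<String>of());
        System.out.println(run(bundles));
    }
}
```

Running `main` prints `[Setup, StartBundle, ProcessElement(a), ProcessElement(b), FinishBundle, StartBundle, FinishBundle, Teardown]`. Note that the empty bundle still gets its StartBundle/FinishBundle pair.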

Both Setup/Teardown and StartBundle/FinishBundle are optional: any DoFn can be implemented without them by doing all the work in ProcessElement, but that will often be inefficient. Both pairs of methods enable optimizations:

  • Often one wants to batch work between elements, e.g. instead of doing an RPC per element, do one RPC per batch of N elements. StartBundle/FinishBundle tell you the allowed boundaries of batching: you are not allowed to batch across FinishBundle. FinishBundle must force a flush of your batch, and StartBundle must initialize or reset it. This is the only common use of these methods that I'm aware of. For a more general and rigorous explanation: a bundle is a unit of fault tolerance, and the runner assumes that by the time FinishBundle returns, you have completely performed all the work (outputting elements or performing side effects) associated with all elements seen in this bundle; work must not "leak" between bundles.
  • Often one wants to manage long-lived resources, e.g. network connections. You could do this in StartBundle/FinishBundle, but, unlike pending side effects or output, it is fine for such resources to persist between bundles. That's what Setup and Teardown are for.
  • Also often one wants to perform costly initialization of a DoFn, e.g. parsing a config file etc. This is also best done in Setup.

More concisely:

  • Manage resources and costly initialization in Setup/Teardown.
  • Manage batching of work in StartBundle/FinishBundle.

(Managing resources in bundle methods is inefficient; managing batching in setup/teardown is plain incorrect and will lead to data loss)
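
As a rough illustration of this split, here is a plain-Java sketch with no Beam dependency. Everything in it is hypothetical: `BatchingFn` stands in for a DoFn, `clientOpen` for a long-lived resource such as a connection, `sentRpcs` for the remote service receiving batched calls, and `MAX_BATCH` for whatever batch size suits your RPC. The comments name the annotation each method would carry in a real DoFn.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a DoFn that batches elements into RPCs.
class BatchingFn {
    static final int MAX_BATCH = 3;                        // assumed batch size
    final List<List<String>> sentRpcs = new ArrayList<>(); // records each simulated RPC
    boolean clientOpen;                                    // stands in for a long-lived resource
    private List<String> batch;

    void setup() { clientOpen = true; }                    // @Setup: acquire the resource once
    void startBundle() { batch = new ArrayList<>(); }      // @StartBundle: reset the batch

    void processElement(String element) {                  // @ProcessElement
        batch.add(element);
        if (batch.size() >= MAX_BATCH) {
            flush();
        }
    }

    void finishBundle() { flush(); }                       // @FinishBundle: nothing may leak out
    void teardown() { clientOpen = false; }                // @Teardown: release the resource

    private void flush() {
        if (batch.isEmpty()) {
            return;
        }
        if (!clientOpen) {
            throw new IllegalStateException("resource was not set up");
        }
        sentRpcs.add(new ArrayList<>(batch));              // one simulated RPC per batch
        batch.clear();
    }
}

// Hypothetical stand-in for a runner driving one instance through two bundles.
class BatchingDemo {
    static List<List<String>> run(List<List<String>> bundles) {
        BatchingFn fn = new BatchingFn();
        fn.setup();
        for (List<String> bundle : bundles) {
            fn.startBundle();
            for (String element : bundle) {
                fn.processElement(element);
            }
            fn.finishBundle();
        }
        fn.teardown();
        return fn.sentRpcs;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(List.of("a", "b", "c", "d"), List.of("e"))));
    }
}
```

Running `main` prints `[[a, b, c], [d], [e]]`: the leftover `d` is flushed when the first bundle finishes rather than being merged with `e` from the next bundle, while the resource opened in `setup` is reused across both bundles.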

The DoFn documentation was recently updated to make this more clear.

Soukup answered 27/4, 2018 at 18:33 Comment(10)
Thanks for the detailed answer! Is doing something in the Setup method different from doing it in the constructor, or is it just a matter of convention? - Contingence
It is very different: if you do something in the constructor, it'll run once in the program that constructs the pipeline, before the DoFn is serialized and shipped to workers (which means e.g. you can't open connections in the constructor - a connection cannot be serialized and shipped to a worker). If you do it in Setup, it will run once per DoFn instance on a worker - the constructor is not invoked on workers. - Soukup
Basically, use the constructor only for trivially passing the configuration to the DoFn - e.g. assigning constructor arguments to member variables. Use Setup for anything else. - Soukup
@Soukup Can you please share some examples of how to use Setup/Teardown in Dataflow? - Animalism
There are lots of examples of usage of either method in the Beam SDK, especially in IOs - just search for Setup, Teardown, StartBundle, FinishBundle within this directory: github.com/apache/beam/tree/master/sdks/java/io - Soukup
So Setup gets called after the constructor? Can we use a variable set in the constructor later in Setup? - Reynaud
The constructor is only called when you call it yourself in the main program that constructs the pipeline. From there on, the DoFn is serialized and then deserialized on each worker; deserialization doesn't invoke the constructor, it uses low-level JVM machinery to create an empty instance of the class and restore the field values. But Beam will invoke Setup on each deserialized instance before using it. - Soukup
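
The constructor-versus-deserialization behaviour can be observed with plain Java serialization, without Beam. This is only a sketch, and it assumes the DoFn travels through standard Java serialization. `ConfiguredFn`, its `endpoint` field, and `SerializationDemo` are hypothetical names; `setup()` stands in for a method annotated with @Setup.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a DoFn: configuration is serialized, the resource is not.
class ConfiguredFn implements Serializable {
    private static final long serialVersionUID = 1L;
    static int constructorCalls = 0;  // counts constructor invocations in this JVM

    final String endpoint;            // plain configuration, restored on deserialization
    transient Object connection;      // stands in for a resource that cannot be shipped

    ConfiguredFn(String endpoint) {
        constructorCalls++;
        this.endpoint = endpoint;     // constructor only stores configuration
    }

    void setup() {                    // would be @Setup in a real DoFn
        connection = "connected to " + endpoint;
    }
}

class SerializationDemo {
    // Serializes and deserializes the instance, roughly as shipping it to a worker would.
    static ConfiguredFn roundTrip(ConfiguredFn fn) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(fn);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ConfiguredFn) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConfiguredFn copy = roundTrip(new ConfiguredFn("example-host:443"));
        System.out.println("constructor calls: " + ConfiguredFn.constructorCalls);
        System.out.println("endpoint restored: " + copy.endpoint);
        System.out.println("connection before setup: " + copy.connection);
        copy.setup();
        System.out.println("connection after setup: " + copy.connection);
    }
}
```

The deserialized copy has its `endpoint` restored even though the constructor ran only once, and its `connection` is null until `setup()` runs. So a value assigned in the constructor is available in Setup, as long as the field is serializable and not transient.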
@Soukup Summarizing: (1) The user calls the DoFn constructor while building the pipeline. (2) Beam serializes the DoFn and sends it to workers. (3) After Beam deserializes the DoFn, it calls Setup. (4) For each bundle it calls StartBundle, and FinishBundle once all elements of the bundle have been processed. (5) For each element in a bundle it calls ProcessElement. (6) Teardown is called when there are no more bundles to process. - Threequarter
@Soukup Can multiple bundles be processed in parallel within the same DoFn instance, or is processing always sequential? - Forgiving
@Forgiving One DoFn instance processes bundles strictly sequentially - it won't be processing multiple bundles in parallel, and it won't be processing multiple elements within one bundle in parallel. See beam.apache.org/documentation/programming-guide/… - Soukup
