Apache Beam: DoFn vs PTransform

Asked 8/12, 2017 at 1:57 Answered 30/6, 2023 at 15:27

Solved google-cloud-dataflow apache-beam

Both DoFn and PTransform are means of defining an operation on a PCollection. How do we know which one to use, and when?

Unmitigated answered 8/12, 2017 at 1:57

A simple way to understand it is by analogy with map(f) for lists:

The higher-order function map applies a function to each element of a list, returning a new list of the results. You might call it a computational pattern.
The function f is the logic applied to each element.
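
As a minimal sketch of this analogy in plain Python (the function f and the sample list are made up for illustration), map is the pattern and f is the per-element logic plugged into it:

```python
def f(x):
    # The per-element logic: it knows nothing about lists or iteration.
    return x * x


numbers = [1, 2, 3, 4]

# map is the pattern: it decides how f gets applied across the list.
squares = list(map(f, numbers))
print(squares)
```

Swapping in a different f changes what happens to each element, while the pattern stays the same.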

Now, switching to Beam specifics: I think you are asking about ParDo.of(fn), which is a PTransform.

A PTransform is an operation that takes PCollections as input and yields PCollections as output. Beam has just five primitive types of PTransform, encapsulating embarrassingly parallel computational patterns.
ParDo is the computational pattern of per-element computation. It has some variations, but you don't need to worry about that for this question.
The DoFn, which I called fn here, is the logic that is applied to each element.

It may also help to think of it this way: you write a DoFn to say what to do with each element, and the Beam runner provides the ParDo that applies your logic.
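
This division of labour can be sketched with a toy model in plain Python. None of this is the Beam API: ToyDoFn, ToUpper, toy_par_do, and the bundle size are hypothetical names invented for illustration. The point is only that the user writes the per-element logic, while the "runner" side decides how to split the input and drive that logic.

```python
class ToyDoFn:
    """Hypothetical stand-in for a DoFn: per-element logic only."""

    def process(self, element):
        raise NotImplementedError


class ToUpper(ToyDoFn):
    def process(self, element):
        # What to do with ONE element; no knowledge of the whole collection.
        return element.upper()


def toy_par_do(fn, elements, bundle_size=2):
    """Hypothetical stand-in for ParDo as driven by a runner.

    Splits the input into bundles and calls fn.process once per element.
    A real runner could hand the bundles to different workers.
    """
    elements = list(elements)
    output = []
    for start in range(0, len(elements), bundle_size):
        bundle = elements[start:start + bundle_size]
        output.extend(fn.process(e) for e in bundle)
    return output


result = toy_par_do(ToUpper(), ["a", "b", "c"])
print(result)
```

Note that ToUpper never decides how the input is split or in what order bundles run; that choice belongs entirely to toy_par_do.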

Glutamate answered 8/12, 2017 at 3:48

DoFn, according to the docs:

The DoFn object that you pass to ParDo contains the processing logic that gets applied to the elements in the input collection.

ParDo (the name derives from "ParallelDo"), according to the docs:

The ParDo processing paradigm is similar to the “Map” phase of a 
Map/Shuffle/Reduce-style algorithm: a ParDo transform considers each element in the 
input PCollection, performs some processing function (your user code) on that element, 
and emits zero, one, or multiple elements to an output PCollection
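
The "zero, one, or multiple elements" part is what makes ParDo more general than a plain map. Here is a rough plain-Python sketch of that behaviour, assuming each processing function is written as a generator. The helper flat_apply and the sample functions are hypothetical, not Beam API.

```python
from itertools import chain


def flat_apply(process, elements):
    # Call process once per element and flatten everything it yields.
    return list(chain.from_iterable(process(e) for e in elements))


def drop_empty(line):
    # Zero or one output per input: behaves like a filter.
    if line.strip():
        yield line


def line_length(line):
    # Exactly one output per input: behaves like a map.
    yield len(line)


def split_words(line):
    # Any number of outputs per input: behaves like a flat-map.
    yield from line.split()


lines = ["to be", "", "or not"]
print(flat_apply(drop_empty, lines))
print(flat_apply(line_length, lines))
print(flat_apply(split_words, lines))
```

In the Beam Python SDK, a DoFn's process method can likewise yield its outputs, which is how a single ParDo can cover filtering, mapping, and flat-mapping.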

PTransform, according to the docs:

A PTransform represents a data processing operation, or a step, in your pipeline. 
Every PTransform takes one or more PCollection objects as input, 
performs a processing function that you provide on the elements of that PCollection, 
and produces zero or more output PCollection objects.

Conceptually, one PTransform can contain multiple ParDo operations internally. Each ParDo has exactly one DoFn, but that one DoFn is invoked many times, once for each input element.

A DoFn is the most basic place where you write the actual logic that transforms your inputs.

ParDo is a computational pattern that takes this DoFn and applies it to each element of the PCollection in a parallel, scalable fashion.

A PTransform is the named logical operation that accepts one or more PCollections and encapsulates multiple operations, including ParDo(DoFn), to transform the input PCollections into output PCollections.
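
To make the nesting concrete, here is a toy model in plain Python of a composite transform that chains two per-element steps. All names (ToyParDo, ToyComposite, and the sample lambdas) are hypothetical and only mimic the shape of the idea. In the real SDKs, a composite PTransform similarly overrides an expand method that wires together other transforms.

```python
class ToyParDo:
    """Hypothetical per-element step wrapping exactly one function."""

    def __init__(self, process):
        self.process = process

    def apply(self, elements):
        # process may emit zero, one, or many outputs for each element.
        return [out for e in elements for out in self.process(e)]


class ToyComposite:
    """Hypothetical composite transform built from two smaller steps."""

    def expand(self, elements):
        # Step 1: split each line into words (many outputs per input).
        words = ToyParDo(lambda line: line.split()).apply(elements)
        # Step 2: lowercase each word (one output per input).
        return ToyParDo(lambda word: [word.lower()]).apply(words)


lowered = ToyComposite().expand(["Hello World", "Beam"])
print(lowered)
```

Callers only see ToyComposite as a single named step; the two inner per-element steps are an implementation detail, which is the role the answer above assigns to a PTransform.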

Teeters answered 30/6, 2023 at 15:27
