Both DoFn and PTransform are means of defining an operation on a PCollection. How do we know which to use when?
A simple way to understand it is by analogy with map(f) for lists:
- The higher-order function map applies a function to each element of a list, returning a new list of the results. You might call it a computational pattern.
- The function f is the logic applied to each element.
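The analogy can be made concrete with plain Java streams. This is a minimal sketch that uses only the JDK, not Beam; the class name MapAnalogy and the helper applyToEach are illustrative names that do not come from the original answer.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class MapAnalogy {
    // The "computational pattern": apply some function to every element
    // of a list and collect the results into a new list.
    static <A, B> List<B> applyToEach(List<A> input, Function<A, B> f) {
        return input.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The per-element logic, playing the role of f in map(f).
        Function<String, Integer> f = String::length;

        List<Integer> lengths = applyToEach(List.of("beam", "do", "fn"), f);
        System.out.println(lengths); // prints [4, 2, 2]
    }
}
```

Here applyToEach corresponds to map (the pattern) and f corresponds to the logic you supply, which is the same split the rest of the answer draws between ParDo and DoFn.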
Now, switching to talk about Beam specifics, I think you are asking about ParDo.of(fn), which is a PTransform.
- A PTransform is an operation that takes PCollections as input and yields PCollections as output. Beam has just five primitive types of PTransform, encapsulating embarrassingly parallel computational patterns. ParDo is the computational pattern of per-element computation. It has some variations, but you don't need to worry about that for this question.
- The DoFn, here I called it fn, is the logic that is applied to each element.
It may also help to think of it this way: you write a DoFn to say what to do on each element, and the Beam runner provides the ParDo to apply your logic.
DoFn according to the docs:
"The DoFn object that you pass to ParDo contains the processing logic that gets applied to the elements in the input collection."
ParDo (the name originates from ParallelDo) according to the docs:
"The ParDo processing paradigm is similar to the “Map” phase of a Map/Shuffle/Reduce-style algorithm: a ParDo transform considers each element in the input PCollection, performs some processing function (your user code) on that element, and emits zero, one, or multiple elements to an output PCollection."
PTransform according to the docs:
"A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as input, performs a processing function that you provide on the elements of that PCollection, and produces zero or more output PCollection objects."
Conceptually, one PTransform can contain multiple ParDo operations internally. Each ParDo has exactly one DoFn, but that one DoFn is invoked many times, once for each element of the input.
DoFn is the most basic place where you write the actual logic for transforming your inputs.
ParDo is a computational pattern that takes this DoFn and applies it to each element of a PCollection in a parallel, scalable fashion.
PTransform is the logical, named operation that accepts one or more PCollections and can encapsulate multiple operations, including ParDo(DoFn), to transform these input PCollections into output PCollections.
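The layering described above can be sketched as a toy model in plain Java. This is not the Beam API: ElementFn, parDo, and wordLengths are hypothetical stand-ins for DoFn, ParDo.of(fn), and a composite PTransform, and the loop runs sequentially on an in-memory list, whereas a real runner would distribute the work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

class ToyModel {
    // Stand-in for a DoFn: the logic for ONE element. It may emit zero,
    // one, or several outputs through the 'out' callback.
    interface ElementFn<I, O> {
        void process(I element, Consumer<O> out);
    }

    // Stand-in for ParDo.of(fn): wraps per-element logic into an operation
    // on a whole collection. This loop is sequential; it only models the
    // shape of the pattern, not parallel execution.
    static <I, O> Function<List<I>, List<O>> parDo(ElementFn<I, O> fn) {
        return input -> {
            List<O> output = new ArrayList<>();
            for (I element : input) {
                fn.process(element, output::add);
            }
            return output;
        };
    }

    // Stand-in for a composite PTransform: one named step built from two
    // ParDo-like operations, each holding exactly one ElementFn.
    static Function<List<String>, List<Integer>> wordLengths() {
        ElementFn<String, String> splitWords = (line, out) -> {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    out.accept(word); // zero or more outputs per input line
                }
            }
        };
        ElementFn<String, Integer> length = (word, out) -> out.accept(word.length());
        return parDo(splitWords).andThen(parDo(length));
    }

    public static void main(String[] args) {
        List<Integer> result = wordLengths().apply(List.of("hello beam", "", "do fn"));
        System.out.println(result); // prints [5, 4, 2, 2]
    }
}
```

Note that splitWords emits two outputs for some lines and none for the empty line, which is the "zero, one, or multiple elements" behavior that distinguishes ParDo from a strict one-to-one map.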