I have read through the Beam documentation and also looked through Python documentation but haven't found a good explanation of the syntax being used in most of the example Apache Beam code.
Can anyone explain what the _
, |
, and >>
are doing in the below code? Also is the text in quotes ie 'ReadTrainingData' meaningful or could it be exchanged with any other label? In other words how is that label being used?
train_data = pipeline | 'ReadTrainingData' >> _ReadData(training_data)
evaluate_data = pipeline | 'ReadEvalData' >> _ReadData(eval_data)
input_metadata = dataset_metadata.DatasetMetadata(schema=input_schema)
_ = (input_metadata
| 'WriteInputMetadata' >> tft_beam_io.WriteMetadata(
os.path.join(output_dir, path_constants.RAW_METADATA_DIR),
pipeline=pipeline))
preprocessing_fn = reddit.make_preprocessing_fn(frequency_threshold)
(train_dataset, train_metadata), transform_fn = (
(train_data, input_metadata)
| 'AnalyzeAndTransform' >> tft.AnalyzeAndTransformDataset(
preprocessing_fn))
.method()
syntax and turn it into aninfix
operator. For example taking something likea.plus(b)
and making it possible to write syntax likea + b
. Check out the__or__
,__ror__
, and__rrshift__
function definitions in the source github.com/apache/beam/blob/master/sdks/python/apache_beam/…. SoMyPTransform | NextPTransform
is really taking both of those PTransform objects and passing them as a list_parts
to a new_ChainedPTransform
object, and if you__or__
yet another PTransform, it flattens the nested lists. – Drunk>>
aka__rrshift__
is effectively setting thelabel
attribute of the PTransform but it doesn't just do something like Ptransform.label('new name') or Ptransform.label = 'new name', it seems more convoluted to me.'ReadEvalData' >> _ReadData(eval_data)
is evaluated as returning a new_NamedPTransform
object with_ReadData(eval_data)
initialized as theself.transform
attribute and the string'ReadEvalData'
is initialized as the label attribute by usingSuper
to run the parent classPTransform
's init method. – Drunk__ror__
is the same as__or__
, they both end up as the pipe infix|
but__ror__
lets you define how to pipe other objects to aPTransform
where those other objects don't have the pipe operator method defined. It kind of makes the pipe operator of the thing on the right still work from left-to-right. – DrunkPTransforrm.apply(NextPTransform.apply(YetAnotherPTransform))
but then you can always create new variables for each step. After all, it is lazily evaluated and it is not doing deep copies of data so there's no penalty. – Drunk