Explain Apache Beam python syntax
Asked Answered
C

2

67

I have read through the Beam documentation and also looked through Python documentation but haven't found a good explanation of the syntax being used in most of the example Apache Beam code.

Can anyone explain what the _ , | , and >> are doing in the below code? Also is the text in quotes ie 'ReadTrainingData' meaningful or could it be exchanged with any other label? In other words how is that label being used?

train_data = pipeline | 'ReadTrainingData' >> _ReadData(training_data)
evaluate_data = pipeline | 'ReadEvalData' >> _ReadData(eval_data)

input_metadata = dataset_metadata.DatasetMetadata(schema=input_schema)

_ = (input_metadata
| 'WriteInputMetadata' >> tft_beam_io.WriteMetadata(
       os.path.join(output_dir, path_constants.RAW_METADATA_DIR),
       pipeline=pipeline))

preprocessing_fn = reddit.make_preprocessing_fn(frequency_threshold)
(train_dataset, train_metadata), transform_fn = (
  (train_data, input_metadata)
  | 'AnalyzeAndTransform' >> tft.AnalyzeAndTransformDataset(
      preprocessing_fn))
Cliffhanger answered 5/5, 2017 at 3:32 Comment(4)
The idea is to take a .method() syntax and turn it into an infix operator. For example taking something like a.plus(b) and making it possible to write syntax like a + b . Check out the __or__ , __ror__ , and __rrshift__ function definitions in the source github.com/apache/beam/blob/master/sdks/python/apache_beam/…. So MyPTransform | NextPTransform is really taking both of those PTransform objects and passing them as a list _parts to a new _ChainedPTransform object, and if you __or__ yet another PTransform, it flattens the nested lists.Drunk
>> aka __rrshift__ is effectively setting the label attribute of the PTransform but it doesn't just do something like Ptransform.label('new name') or Ptransform.label = 'new name', it seems more convoluted to me. 'ReadEvalData' >> _ReadData(eval_data) is evaluated as returning a new _NamedPTransform object with _ReadData(eval_data) initialized as the self.transform attribute and the string 'ReadEvalData' is initialized as the label attribute by using Super to run the parent class PTransform's init method.Drunk
__ror__ is the same as __or__, they both end up as the pipe infix | but __ror__ lets you define how to pipe other objects to a PTransform where those other objects don't have the pipe operator method defined. It kind of makes the pipe operator of the thing on the right still work from left-to-right.Drunk
Extremely convoluted, creating a new object with the current object as an attribute, just to get a functional / shell style flow going. It does make it easier to read the processing pipeline overall from left to right though, and looks similar to functional programming syntax like F# or using MagrittR in Rlang. It is probably better than a lot of nested function calls like PTransforrm.apply(NextPTransform.apply(YetAnotherPTransform)) but then you can always create new variables for each step. After all, it is lazily evaluated and it is not doing deep copies of data so there's no penalty.Drunk
S
88

Operators in Python can be overloaded. In Beam, | is a synonym for apply, which applies a PTransform to a PCollection to produce a new PCollection. >> allows you to name a step for easier display in various UIs -- the string between the | and the >> is only used for these display purposes and identifying that particular application.

See https://beam.apache.org/documentation/programming-guide/#transforms

Stlouis answered 5/5, 2017 at 19:43 Comment(4)
Thank you! This is very helpful. If I'm understanding correctly then are they using the _ because there is no output PCollection for a writeMetadata transform?Cliffhanger
Notionally, it should return a PDone [1], which they don't need (so they use the throwaway _). [1] github.com/apache/beam/blob/master/sdks/python/apache_beam/…Stlouis
@JoelCroteau: I guess they thought using | symbols - which are often pipe symbols (in shell for example) - for this would be so clever since you're building a pipeline ...Leopard
Also, since it wasn't explicitly said here, the _ is "typical" Python syntax for a variable that will be discarded. Eg, in _ = (input_metadata..., it is saying that the result of the operations will be returned to _, which is a variable that the code author is making clear she will not use again. In this case, you can see that is because it is a write operation, and the author didn't care about what that operation returns.Ladybird
L
2

No one mentioned the _, so just for completeness:

  • There is nothing officially special about the _, but it is taken as good practice to assign a variable that is returned but which you do not care about to _. This makes it obvious to readers of your code that you plan to throw it away.
    • It also reduces memory, since you're throwing away the other instances of assigning to _ when you re-assign it (overwrite it).
  • There is an unofficial role the _ has: because it is the "throwaway" variable, most linters and other code clarity helpers treat it differently.
    • For instance, if you assign a variable use_me and never actually use it, a linter will warn that you have an unused variable. And if you have rigorous code quality restrictions, maybe you cannot even merge your code into production with an unused variable.
    • _ is not caught by the linter (and could be merged into a strict code base) because it is understood to be a throwaway variable, and therefore there is no mistake in your code (at least not in this regard).
Ladybird answered 27/10, 2023 at 12:39 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.