Correct way to define an Apache Beam pipeline
Asked Answered
I am new to Beam and struggling to find many good guides and resources to learn best practices.

One thing I have noticed is there are two ways pipelines are defined:

with beam.Pipeline() as p:
    # pipeline code in here

Or

p = beam.Pipeline()
# pipeline code in here
result = p.run()
result.wait_until_finish()

Are there specific situations in which each method is preferred?

Unarm answered 6/7, 2019 at 12:49 Comment(1)
Usually, this is just a matter of code-style preference. We have recently published additional learning materials at beam.apache.org/documentation/resources/learning-resources . Hopefully you can see more examples there and decide which style you prefer. Also, here is a possibly dated explanation of the with statement in Python: effbot.org/zone/python-with-statement.htm – Rugged

As pointed out by Yichi Zhang, Pipeline.__exit__ sets .result, so you can do:

with beam.Pipeline() as p:
  ...

result = p.result

The context-manager version is cleaner, as it can correctly clean up when errors are raised inside the with block.

Grunter answered 6/5, 2022 at 14:18 Comment(0)

From the code snippets, the main difference is whether or not you care about the pipeline result. If you want to use the PipelineResult to monitor pipeline status or cancel the pipeline from your code, go with the second style.

Twophase answered 8/7, 2019 at 17:22 Comment(0)

I think they are functionally equivalent, since the __exit__ function for the pipeline context manager executes the same code: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L426

Sadowski answered 8/7, 2019 at 20:28 Comment(0)

Are there specific situations in which each method is preferred?

To answer this question, you can take a look at the Pipeline context manager implementation source code here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L587

As you can see, the context manager runs the pipeline and blocks until it finishes:

self.result = self.run()
self.result.wait_until_finish()

This is exactly what you can do explicitly with the second approach, and more.

So the general rule here is: if you need control over whether your pipeline blocks or not, go with the second approach; otherwise, use the context manager.
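As a toy illustration of the __exit__ mechanics quoted above (these are simplified stand-ins, not the real Beam classes):

```python
# Simplified stand-ins for Pipeline/PipelineResult to show the mechanics.
class FakeResult:
    def __init__(self):
        self.state = 'RUNNING'

    def wait_until_finish(self):
        self.state = 'DONE'
        return self.state


class FakePipeline:
    def run(self):
        return FakeResult()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if not exc_type:
            # mirrors what Beam's Pipeline.__exit__ does
            self.result = self.run()
            self.result.wait_until_finish()


with FakePipeline() as p:
    pass  # transforms would be applied here

print(p.result.state)  # DONE: the with block ran and blocked for us
```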

Trill answered 12/7, 2022 at 9:18 Comment(1)
This is the "correct" answer, as it captures something woefully under-documented and very important: if you are (1) deploying via CI/CD, and (2) deploying a streaming, always-on Dataflow pipeline, the typically documented with Pipeline() as p context manager will make your deployment hang, because by default it waits until completion, and there is no completion! If you have a proper CI/CD setup that does not deploy anything until it is successfully merged into your production branch, you'll never be able to merge anything! – Polymer

If you are deploying streaming pipelines, I suggest using the second option and not calling wait_until_finish, so your pipeline gets deployed but your code won't wait until the end of time.
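In sketch form (the streaming option is real, but the transforms are elided and the overall shape is a hypothetical deploy script, not a complete job):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

p = beam.Pipeline(options=options)
# ... streaming transforms here ...

p.run()  # returns once the job is submitted; no wait_until_finish(), so
         # the deploy script exits instead of blocking on a job that never ends
```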

Tungus answered 12/10, 2022 at 9:36 Comment(0)
