How to get Apache Beam for Dataflow (GCP) running on Python 3.x
Asked Answered
L

3

7

I'm very new to GCP and Dataflow. However, I would like to start testing and deploying a few flows on Dataflow on GCP. According to the documentation, everything around Dataflow requires the Apache Beam project, and following the official documentation here, the supported version of Python is 2.7.

Honestly, this is fairly disappointing, given that Python 2.x is on its way out (official support is ending) and everybody is working with version 3.x. Nevertheless, I want to know if someone has managed to get Beam and GCP Dataflow running on Python 3.

I saw this video, and somehow this person completed this wonderful milestone; apparently it runs on Python 3.5.

Update:

I just want to share a thought that has crossed my mind while struggling with Dataflow. I feel quite disappointed by how challenging it is to get hands-on with this tool, in either Java or Python. On the Python side, there are constraints around version 3, which is pretty much the current standard. On the other hand, Java has issues running on version 11, so I had to tweak my code a bit to run on version 8, and then I started to struggle with many incompatibilities in the code. In short, if GCP really wants to move forward and become #1, there is much to improve. :disappointed:

Workaround:

I downgraded my Java version to JDK 8, installed Maven, and now my Eclipse setup is working for Apache Beam.

I finally solved it, but GCP, please consider extending support to the most recent versions of Java and Python.

Thanks so much

Lobel answered 24/1, 2019 at 4:15 Comment(0)
V
6

See @VibhorJain's answer; it is working now.


Currently there is no way to use Python 3 with apache-beam (you could write an adapter for it, but that would almost certainly not be worth it).

Support for Python 3.x is ongoing; please take a look at this apache-beam issue.

P.S. In the video, Python 3.5.2 is only the editor's interpreter version; it is not the Python actually running apache-beam. Note that in the bash session, Python 2.7 is running.

Valenti answered 24/1, 2019 at 4:33 Comment(2)
The only thing I can say about Dataflow is that it has been a huge disappointment. I really wanted to deploy a few pipelines, but I had to face many compatibility constraints before getting hands-on. This is so sad :(Lobel
@AndresUrregoAngel, you might consider using Cloud Dataproc instead, though I haven't tried it myself. Dataproc can run Spark jobs, and Spark runs on Python 3.4+, giving you more freedom (at the cost of somewhat more tedious work, like cluster tuning).Valenti
W
13

You can now run Apache Beam on Python 3.5 (I tried it on both the Direct and the Dataflow runner) with apache-beam==2.11.0.

When running, it comes with a warning:

UserWarning: Running the Apache Beam SDK on Python 3 is not yet fully supported. You may encounter buggy behavior or missing features.

I have already noticed that beam.io.gcp.pubsub.ReadFromPubSub() is broken: messages are pushed to Pub/Sub, but the pipeline never reads them (tried on the Direct Runner).

Hopefully things will improve with time.

Woodhouse answered 7/3, 2019 at 19:58 Comment(4)
It seems it's actively being worked on (finally!); there are quite a lot of active changes now.Deviled
Thanks @Deviled for the link. Useful to followWoodhouse
Welcome. I tried a basic example with DirectRunner on Python 3.7 and it worked fine.Deviled
oh great! that means updated client libraries now support Py 3.5+Woodhouse
D
1

There has been a lot happening on Python 3 support. Dataflow now supports it in beta! See the Dataflow docs mentioning Python 3.7.

Deviled answered 9/10, 2019 at 12:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.