Dataflow/Apache Beam: manage custom module dependencies

I have a .py pipeline using Apache Beam that imports another module (.py), which is my custom module. The structure looks like this:

├── mymain.py
└── myothermodule.py

I import myothermodule.py in mymain.py like this:

import myothermodule

When I run locally with DirectRunner, I have no problem. But when I run it on Dataflow with DataflowRunner, I get this error:

ImportError: No module named myothermodule

So what should I do to make this module available when running the job on Dataflow?

Tavie asked 9/8, 2018 at 9:27

When you run your pipeline remotely, you need to make any dependencies available on the remote workers too. To do this, put your module in a Python package: place it in a directory with an __init__.py file and create a setup.py. The result looks like this:

├── mymain.py
├── setup.py
└── othermodules
    ├── __init__.py
    └── myothermodule.py
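
For illustration, suppose myothermodule.py contains a simple helper like the one below (this process function is hypothetical, just to make the later examples concrete):

# othermodules/myothermodule.py
def process(element):
    # Trivial example transform: upper-case each element.
    return element.upper()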

And import it like this:

from othermodules import myothermodule
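
For context, a minimal mymain.py using that import could look like this (a sketch, assuming the hypothetical process function shown above):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

from othermodules import myothermodule

def run():
    options = PipelineOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | 'Create' >> beam.Create(['hello', 'world'])
         | 'Process' >> beam.Map(myothermodule.process))  # hypothetical function

if __name__ == '__main__':
    run()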

Then you can run your pipeline with the command-line option --setup_file ./setup.py.
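
A full Dataflow invocation might then look like this (the project, region, and bucket values are placeholders you would replace with your own):

python mymain.py \
    --runner DataflowRunner \
    --project my-gcp-project \
    --region us-central1 \
    --temp_location gs://my-bucket/temp/ \
    --setup_file ./setup.py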

A minimal setup.py file would look like this:

import setuptools

setuptools.setup(packages=setuptools.find_packages())
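
In practice you may also want to give the package a name and version (the values below are placeholders):

import setuptools

setuptools.setup(
    name='othermodules',   # placeholder package name
    version='0.0.1',       # placeholder version
    packages=setuptools.find_packages(),
)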

The whole setup is documented here.

And a complete example using this setup can be found here.

Selfinduction answered 10/8, 2018 at 13:57 Comments (2):
Thanks, I tried what you said and I got this error: ImportError: No module named othermodules (Tavie)
Oh, I made a mistake: I had named the __init__.py file init.py, which is why I got the error. That solved my problem, thank you so much! (Tavie)
