Including another file in Dataflow Python flex template, ImportError
M

6

13

Is there an example of a Python Dataflow Flex Template with more than one file where the script is importing other files included in the same folder?

My project structure is like this:

├── pipeline
│   ├── __init__.py
│   ├── main.py
│   ├── setup.py
│   ├── custom.py

I'm trying to import custom.py inside of main.py for a dataflow flex template.

I receive the following error in the pipeline execution:

ModuleNotFoundError: No module named 'custom'

The pipeline works fine if I include all of the code in a single file and don't make any imports.

Example Dockerfile:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ARG WORKDIR=/dataflow/template/pipeline
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY pipeline /dataflow/template/pipeline

COPY spec/python_command_spec.json /dataflow/template/

ENV DATAFLOW_PYTHON_COMMAND_SPEC /dataflow/template/python_command_spec.json

RUN pip install avro-python3 pyarrow==0.11.1 apache-beam[gcp]==2.24.0

ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"

Python spec file:

{
    "pyFile":"/dataflow/template/pipeline/main.py"
}
  

I am deploying the template with the following command:

gcloud builds submit --project=${PROJECT} --tag ${TARGET_GCR_IMAGE} .
Myriammyriameter answered 18/11, 2020 at 14:52 Comment(3)
Have you tried appending the ${WORKDIR} to the PYTHONPATH environment variable? You can try adding ENV PYTHONPATH="${WORKDIR}:${PYTHONPATH}" to your Dockerfile. - Herculaneum
Yes, I tried appending to the PYTHONPATH. It didn't seem to work. - Myriammyriameter
@AkshayApte do you have setup.py at the same level as custom.py? For me, find_packages cannot find custom.py, and it seems setup.py has to be one directory above (see #28573540). Curious how you made it work. - Weatherboarding
M
5

I actually solved this by passing an additional parameter, setup_file, to the template execution (you also need to add the setup_file parameter to the template metadata):

--parameters setup_file="/dataflow/template/pipeline/setup.py"

Apparently the line ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py" in the Dockerfile has no effect and doesn't actually pick up the setup file.
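For the metadata step, here is a minimal sketch that builds a template metadata document declaring setup_file as an optional parameter. The field names (name, label, helpText, isOptional) follow the Flex Template metadata format as I understand it; the template name, description, and label text are placeholders and do not come from the original answer.

```python
import json

# Hypothetical template metadata. Only the "setup_file" parameter name comes
# from the answer above; every other value is a placeholder.
metadata = {
    "name": "my-flex-template",
    "description": "Pipeline that imports modules from sibling files.",
    "parameters": [
        {
            "name": "setup_file",
            "label": "Path to setup.py inside the container",
            "helpText": "For example /dataflow/template/pipeline/setup.py",
            "isOptional": True,
        }
    ],
}

# Serialize to JSON; save this text as the metadata file used for the template.
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

Without an entry like this, the service may reject --parameters setup_file=... as an unrecognized parameter, which matches what one commenter reports below.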

My setup file looked like this:

import setuptools

setuptools.setup(
    packages=setuptools.find_packages(),
    install_requires=[
        'apache-beam[gcp]==2.24.0'
    ],
)
Myriammyriameter answered 19/11, 2020 at 14:58 Comment(10)
Wow, thanks for posting this. For other people who might see this, I also want to mention that py_module in setup_files didn't work either. I'll try find_packages() now. - Weatherboarding
find_packages() somehow messed up my proto, so I'm still trying to figure out how to get py_module to work. Hmm... - Weatherboarding
I tried this and got an "Unrecognized parameter" error when sending in setup_file as a parameter that way. - Wayne
You also need to add the setup_file parameter to the template metadata. - Myriammyriameter
I've found useful documentation on this at beam.apache.org/documentation/sdks/python-pipeline-dependencies/… - Shevat
I have successfully used FLEX_TEMPLATE_PYTHON_SETUP_FILE to declare the location of setup.py (which looks exactly the same as @akshay-apte's snippet above), with no setup_file parameter required. I'm writing this more than two months after Akshay's answer, so perhaps something has changed in the Dataflow service in the interim that means FLEX_TEMPLATE_PYTHON_SETUP_FILE now works. HTH. - Shevat
@jamiet, can you share the code you're using? I'm trying to do the same using FLEX_TEMPLATE_PYTHON_SETUP_FILE in the Dockerfile. The Dataflow logs do show Executing: python /dataflow/template/streaming_beam.py --setup_file=/dataflow/template/setup.py ... but it immediately throws a "module not found" traceback. It is not actually performing the setup actions defined in setup.py. - Evidentiary
@PavanKumarKattamuri Sure, I have posted it as an answer. - Shevat
Hi jamie T, could you share more details? I am having the same issue and have posted on Stack Overflow here: #67858111 - Haematoxylon
Yeah, I tried this too. I updated my metadata file and can see --setup_file=/template/setup.py passed twice, but I'm still getting the error "No module named utils". (I verified that the setup file works correctly locally, using Dataflow runner, and using Classic Dataflow templates. It only fails with Flex templates.) So this hack may have worked at one time, but it doesn't now. - Ping
B
3

After some tests I found that, for some unknown reason, Python files in the working directory (WORKDIR) cannot be referenced with an import. But it works if you create a subfolder and move the Python dependencies into it. I tested this and it worked. For example, in your use case you can have the following structure:

├── pipeline
│   ├── main.py
│   ├── setup.py
│   ├── mypackage
│   │   ├── __init__.py
│   │   ├── custom.py

You will then be able to reference it with import mypackage.custom. The Dockerfile should copy custom.py into the proper directory:

RUN mkdir -p ${WORKDIR}/mypackage
RUN touch ${WORKDIR}/mypackage/__init__.py
COPY custom.py ${WORKDIR}/mypackage

And the dependency will be added to the Python installation directory:

$ docker exec -it <container> /bin/bash
# find / -name custom.py
/usr/local/lib/python3.7/site-packages/mypackage/custom.py
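One plausible explanation for the "unknown reason" is that setuptools.find_packages() only reports directories that contain an __init__.py; loose modules sitting next to setup.py are not packages and so are not picked up. The sketch below uses only the standard library, so discover_packages is a simplified, hypothetical stand-in for find_packages(), not the real implementation. It compares the flat layout from the question with the nested layout above.

```python
import tempfile
from pathlib import Path
from typing import List


def discover_packages(root: Path) -> List[str]:
    """Simplified stand-in for setuptools.find_packages(): return dotted
    names of directories below root that contain an __init__.py."""
    found = []
    for init_file in sorted(root.rglob("__init__.py")):
        rel_parts = init_file.parent.relative_to(root).parts
        if not rel_parts:
            continue  # the root directory itself is never reported
        # Every directory on the way down must be a package as well.
        chain_ok = all(
            (root.joinpath(*rel_parts[: i + 1]) / "__init__.py").exists()
            for i in range(len(rel_parts))
        )
        if chain_ok:
            found.append(".".join(rel_parts))
    return found


def make_tree(root: Path, files: List[str]) -> None:
    """Create empty files at the given relative paths under root."""
    for rel in files:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("")


with tempfile.TemporaryDirectory() as tmp:
    flat = Path(tmp, "flat")      # layout from the question
    nested = Path(tmp, "nested")  # layout from this answer
    make_tree(flat, ["__init__.py", "main.py", "setup.py", "custom.py"])
    make_tree(
        nested,
        ["main.py", "setup.py", "mypackage/__init__.py", "mypackage/custom.py"],
    )
    flat_packages = discover_packages(flat)
    nested_packages = discover_packages(nested)

print(flat_packages)    # []
print(nested_packages)  # ['mypackage']
```

If that is indeed the cause, a setup.py calling find_packages() next to a loose custom.py would ship nothing to the workers, while the subfolder layout ships mypackage.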
Butch answered 20/11, 2020 at 4:0 Comment(3)
Did you achieve a successfully running Dataflow job using this technique? I've tried reproducing it and am still getting the error No module named 'protoc_gen' (protoc_gen is the package I'm adding my module to). - Shevat
What is in your setup.py file? - Shevat
To be able to also import from a file instead of a package, you can use find_namespace_packages(), as stated here. - Scotopia
S
1

Here is my solution:

Dockerfile:

FROM gcr.io/dataflow-templates-base/python3-template-launcher-base:flex_templates_base_image_release_20210120_RC00

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY requirements.txt .


# Read https://mcmap.net/q/907718/-can-i-make-flex-template-jobs-take-less-than-10-minutes-before-they-start-to-process-data#comment116304237_65766066
# to understand why apache-beam is not being installed from requirements.txt
RUN pip install --no-cache-dir -U apache-beam==2.26.0
RUN pip install --no-cache-dir -U -r ./requirements.txt

COPY mymodule.py setup.py ./
COPY protoc_gen protoc_gen/

ENV FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE="${WORKDIR}/requirements.txt"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/mymodule.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

and here is my setup.py:

import setuptools

setuptools.setup(
    packages=setuptools.find_packages(),
    install_requires=[],
    name="my df job modules",
)
Shevat answered 28/2, 2021 at 9:42 Comment(0)
R
0

I didn't need to pass setup_file in the command that triggers the Flex Template. Here is my Dockerfile:

FROM gcr.io/dataflow-templates-base/python38-template-launcher-base

ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}

COPY . .

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

# Install apache-beam and other dependencies to launch the pipeline
RUN pip install apache-beam[gcp]
RUN pip install -U -r ./requirements.txt

This is the command:

gcloud dataflow flex-template run "job_ft" --template-file-gcs-location "$TEMPLATE_PATH" --parameters paramA="valA" --region "europe-west1"
Regale answered 12/4, 2022 at 8:54 Comment(0)
P
0

I have a bunch of pipelines all in the same repo, where all of the pipelines need to use my packages in the utils directory.

The solution for me was to add a symlink in each pipeline directory to the utils directory. That was not required for me to run locally, run using Dataflow Runner, or create and run a Classic Template. But, it was necessary to run a Flex template.

pipelines/
├── pipeline_1
│   ├── pipeline_1_metadata
│   ├── pipeline_1.py
│   ├── bin
│   │   ├── build_flex_template_and_image.sh
│   │   ├── run_flex_template.sh
│   │   ├── ...
│   ├── README.md
│   └── utils -> ../utils # Added this, and it worked
├── pipeline_2
│   ├── pipeline_2_metadata
│   ├── pipeline_2.py
│   ├── bin
│   │   ├── build_flex_template_and_image.sh
│   │   ├── run_flex_template.sh
│   │   ├── ...
│   ├── README.md
│   └── utils -> ../utils # Added this, and it worked
├── # etc.
├── requirements.txt
├── setup.py
└── utils
    ├── bigquery_utils.py
    ├── dprint.py
    ├── gcs_file_utils.py
    └── misc.py

My setup.py:

import setuptools

setuptools.setup(
    name="repo_name_here",
    version="0.2",
    install_requires=[], # Maybe upgrade Beam here?
    packages=setuptools.find_namespace_packages(exclude=["*venv*"]),
)

From the base directory, I build like so, using the Google-provided Docker image:

gcloud dataflow flex-template build "${TEMPLATE_FILE}" \
   --image-gcr-path "${REGION}-docker.pkg.dev/${PROJECT}/${ARTIFACT_REPO}/dataflow/pipeline-1:latest" \
   --sdk-language "PYTHON" \
   --flex-template-base-image "PYTHON3" \
   --py-path "." \
   --metadata-file "pipeline_1/pipeline_1_metadata" \
   --env "FLEX_TEMPLATE_PYTHON_PY_FILE=pipeline_1/pipeline_1.py" \
   --env "FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE=flex_requirements.txt" \
   --env "FLEX_TEMPLATE_PYTHON_SETUP_FILE=setup.py"

That totally works. The only remaining hassle is that I can't use the default requirements.txt with the default Python image, since I can't figure out how to install the correct version of Python 3.9 in my venv and update requirements.txt accordingly. So I generate flex_requirements.txt via cut -d "=" -f 1 requirements.txt > flex_requirements.txt and let the base image figure out the dependencies. Which is insanity. But that'll be another Stack Overflow question if I can't figure it out in another couple of days.
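For reference, the same unpinning step can be sketched in Python. The function name strip_version_pins is hypothetical, and this is an illustration of the idea rather than what I actually run. It drops everything from the first version operator onward, skips blank lines and comments, and leaves option lines such as -r other.txt untouched.

```python
import re


def strip_version_pins(requirements_text: str) -> str:
    """Return requirements text with version specifiers removed.

    Hypothetical helper: keeps package names (and extras such as
    apache-beam[gcp]) but drops ==, >=, ~= and similar pins.
    """
    names = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("-"):
            names.append(line)  # pip options are passed through unchanged
            continue
        # Cut at the first version operator, environment marker, or space.
        names.append(re.split(r"[<>=!~;\s]", line, maxsplit=1)[0])
    return "".join(name + "\n" for name in names)


sample = "apache-beam[gcp]==2.24.0\npyarrow>=0.11.1  # pinned\n\n# tools\nrequests\n"
unpinned = strip_version_pins(sample)
print(unpinned, end="")
```

One difference from the cut command: cut -d "=" -f 1 turns a line like pyarrow>=0.11.1 into pyarrow>, whereas the regex split above removes the whole operator.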

Ping answered 28/12, 2023 at 2:8 Comment(1)
You can install the package in the Flex Template, then reference the utils via: from repo_name_here.utils import misc ; misc.some_util_fn(). Added an answer with this setup. - Hirsh
H
0

The crux of the problem is that the package is not installed in the launch environment, so some modules might not be importable, depending on the current directory and/or the value of $PYTHONPATH. Structure the pipeline as a package and install it. For example, consider the following structure:

/template      # Location of the template files in target image. 
  ├── some_package
  │   ├── launcher.py        # Parses command line args, calls Pipeline.run().
  │   ├── some_pipeline.py   # Pipeline(s) could be defined in separate file(s).
  │   ├── some_transforms.py # Building blocks to reference in other modules.  
  │   └── utils -> # You can have subpackages too.
  │        └── some_helper_functions.py
  ├── main.py   # Entrypoint. Calls `launcher.some_run_method()`.
  └── setup.py  # Defines the package and its requirements

The Flex Template Dockerfile might look like the following:

ARG WORKDIR=/template
WORKDIR ${WORKDIR}                
COPY setup.py .
COPY main.py .
COPY some_package some_package

# This is the key line to solve the problem discussed in this question. 
# Installing the package allows importing its modules regardless of current path.
RUN pip install -e .  

ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"

...
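The point about importability depending on the search path can be seen with the standard library alone. The sketch below is an illustration, not part of the template: the module name is hypothetical, and it only manipulates sys.path temporarily. An installed package (for example via pip install -e .) makes the project location known to the interpreter persistently, so no such manipulation is needed.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "hypothetical_helper_module"  # illustrative name only


def is_importable(name: str) -> bool:
    """Report whether the import system can currently locate a module."""
    importlib.invalidate_caches()
    return importlib.util.find_spec(name) is not None


with tempfile.TemporaryDirectory() as tmp:
    # A module file that lives in a directory Python does not know about.
    Path(tmp, MODULE_NAME + ".py").write_text("VALUE = 1\n")

    importable_before = is_importable(MODULE_NAME)

    # Putting the directory on sys.path is what PYTHONPATH or the current
    # directory would do implicitly.
    sys.path.insert(0, tmp)
    try:
        importable_after = is_importable(MODULE_NAME)
    finally:
        sys.path.remove(tmp)

print(importable_before, importable_after)
```

The same file is importable or not purely depending on the search path, which is why a job can work locally and fail in the launcher container.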

Hirsh answered 27/2, 2024 at 23:9 Comment(0)
