Dockerized Apache Beam returns "No id provided"

I've hit a problem with dockerized Apache Beam. When trying to run the container I get a "No id provided." message and nothing more. Here's the code and files:

Dockerfile

FROM apache/beam_python3.8_sdk:latest
RUN apt update
RUN apt install -y wget curl unzip git
COPY ./ /root/data_analysis/
WORKDIR /root/data_analysis
RUN python3 -m pip install -r data_analysis/beam/requirements.txt
ENV PYTHONPATH=/root/data_analysis
ENV WORKER_ID=1
CMD python3 data_analysis/analysis.py

Code analysis.py:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions(["--runner=DirectRunner"])

    with beam.Pipeline(options=options) as p:
        p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x-1) | beam.Map(print)

if __name__ == "__main__":
    run()

Commands:

% docker build -f Dockerfile_beam -t beam .
[+] Building 242.2s (12/12) FINISHED                                                                                                                                                                                          
...

% docker run --name=beam beam   
2021/09/15 13:44:07 No id provided.

I found that this error message is most likely generated by this line: https://github.com/apache/beam/blob/410ad7699621e28433d81809f6b9c42fe7bd6a60/sdks/python/container/boot.go#L98

But what does it mean? Which id is this? What am I missing?

Calvillo asked 15/9, 2021 at 15:14 Comment(3)
So I worked out that in order to run the container apache/beam_python3.8_sdk:latest, the entrypoint should be replaced with ENTRYPOINT ["/opt/apache/beam/boot", "--worker_pool"]. But that runs the Apache Beam worker pool, and I need to run Python. This suggests that I am barking up the wrong tree and that this is not the image I should be using. So the question remains: what is this docker image? – Calvillo
What are you intending to use this docker container for? The image you're building from (apache/beam_python3.8_sdk) is an SDK harness image: for portable pipelines, the runner starts one or more SDK harness containers to send work to. It's not meant to be used independently, but custom containers can be provided to runners that support them. – Inserted
Hi @DanielOliveira, thanks for your reply. What I would like to do is run Beam with a ReadFromKafka consumer in Docker (with a later plan to deploy on k8s). Initially I used the python:3.8.12-slim docker image. The ReadFromKafka consumer, however, not only requires adding a JDK but also tries to run Docker inside the Docker container (which is a no-go for me). The code: with beam.Pipeline(options=options) as pipeline: raw_items = ( pipeline | "Read Kafka Messages" >> ReadFromKafka(..) ) – Calvillo

This error is most likely happening because your Docker image is based on the SDK harness image (apache/beam_python3.8_sdk). SDK harness images are used in portable pipelines: when a portable runner needs to execute stages of a pipeline in their original language, it starts a container with the SDK harness and delegates execution of that stage to it. The SDK harness therefore expects various configuration details to be provided by the runner that started it when it boots up, one of which is the ID. When you start this container directly, those configuration details are not provided and it crashes.
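You can see the wrapping directly: the SDK harness image sets the boot binary as its ENTRYPOINT, so the CMD in your Dockerfile is handed to that binary as arguments rather than being executed as your program, and the binary then exits because it never received an id from a runner. A quick way to confirm this (the output shown is what I would expect from the published image, not captured from your machine):

% docker inspect --format='{{json .Config.Entrypoint}}' apache/beam_python3.8_sdk:latest
["/opt/apache/beam/boot"]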

For context into your specific use-case, let me first diagram out the different processes involved in running a portable pipeline.

Pipeline Construction <---> Job Service <---> SDK Harness
                                         \--> Cross-Language SDK Harness
  • Pipeline Construction - The process where you define and run your pipeline. It sends your pipeline definition to a Job Service and receives a pipeline result; it does not execute any of the pipeline itself.
  • Job Service - A process specific to your runner of choice. It is potentially written in a different language than your pipeline construction code and therefore cannot run user code, such as custom DoFns.
  • SDK Harness - A process that executes user code, started and managed by the Job Service. By default, this runs in a docker container.
  • Cross-Language SDK Harness - A process executing code from a different language than your pipeline construction. In your case, Python's Kafka IO uses cross-language transforms and actually executes in a Java SDK harness.

Currently, the docker container you created is based on an SDK harness container, which does not sound like what you want. You seem to have been trying to containerize your pipeline construction code and accidentally containerized the SDK harness instead. But since you described that you want the ReadFromKafka consumer to be containerized, it sounds like what you need is for the Job Server to be containerized, in addition to any SDK harnesses it uses.

Containerizing the Job Server is possible, and in some cases has already been done; for example, there is a containerized Flink Job Server. Containerized job servers may give you a bit of trouble with artifacts, as the container won't have access to artifact staging directories on your local machine, but there may be ways around that.
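As a rough sketch of what that looks like (the image name and tag here are an assumption based on the Flink job server images Beam publishes to Docker Hub, so check which tag matches your Beam and Flink versions):

docker run --net=host apache/beam_flink1.13_job_server:latest

With the job server running, the pipeline is submitted to it with --runner=PortableRunner and --job_endpoint=localhost:8099 (8099 being the job server's usual default port).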

Additionally, you mentioned that you want to avoid having SDK harnesses start in nested docker containers. If you start a worker pool docker container for the SDK harness and set it as an external environment, the runner (assuming it supports external environments) will attempt to connect to the URL you supply instead of creating a new docker container. You will need to configure this for the Java cross-language environment as well, if that is possible in the Python SDK. This configuration is done via Python's pipeline options; --environment_type and --environment_options are good starting points. A minimal sketch of this setup follows below.
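In the sketch, the ports and addresses are illustrative placeholders (assumptions, not verified values). First, start a long-running worker pool using the SDK image's own boot entrypoint:

docker run -d -p 50000:50000 apache/beam_python3.8_sdk:latest --worker_pool

Then point the pipeline at the job server and the worker pool from the pipeline construction code:

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",         # job server address (assumed)
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",  # the worker pool started above
])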

Inserted answered 24/9, 2021 at 1:29 Comment(3)
I am going to accept this as the answer. Although I still didn't manage to run Beam with an external env (Spark or Flink), this answer pointed me down the right path. Thanks! – Calvillo
How did you solve this @JakubCzaplicki? – Westnorthwest
@Westnorthwest We chose a different tech stack. Life's too short to use overcomplicated frameworks. – Calvillo

I encountered the same problem. What you can do is override the ENTRYPOINT, like so:

docker run -it --rm --entrypoint /bin/sh <your-container-registry>/<your-image>/dataflow-worker-harness

From there you'll be able to get /bin/sh and play around to your heart's content.
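The same trick works with the image built in the question (assuming it was tagged beam as in the build command above):

docker run -it --rm --entrypoint /bin/sh beam

With the entrypoint overridden, the boot binary never runs, so from the shell you can start python3 data_analysis/analysis.py by hand.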

Pentimento answered 2/6, 2022 at 7:16 Comment(0)
