This error is most likely happening because your Docker image is based on the SDK harness image (apache/beam_python3.8_sdk). SDK harness images are used in portable pipelines: when a portable runner needs to execute stages of a pipeline in their original language, it starts a container with the SDK harness and delegates execution of those stages to it. The SDK harness therefore expects various configuration details to be provided by the runner that started it, one of which is an ID. When you start this container directly, those configuration details are not provided and it crashes.
For context on your specific use case, let me first diagram the different processes involved in running a portable pipeline.
Pipeline Construction <---> Job Service <---> SDK Harness
                                         \--> Cross-Language SDK Harness
- Pipeline Construction - The process where you define and run your pipeline. It sends your pipeline definition to a Job Service and receives a pipeline result. It does not execute any of the pipeline.
- Job Service - A process belonging to your runner of choice. It receives your pipeline definition and orchestrates its execution. It is potentially written in a different language than your pipeline construction code, and therefore cannot run user code such as custom DoFns.
- SDK Harness - A process that executes user code, initiated and managed by the Job Service. By default, this runs in a docker container.
- Cross-Language SDK Harness - A process that executes code written in a different language than your pipeline construction. In your case, Python's Kafka IO is a cross-language transform and actually executes in a Java SDK harness (see the sketch after this list).
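To make the division of labor concrete, here is a minimal sketch of the Pipeline Construction side only. It assumes a job service is already reachable at localhost:8099; the job endpoint, bootstrap server, and topic name are placeholders:

# Sketch of the "Pipeline Construction" process only. It assumes a job
# service (e.g. a Flink job server) is already listening at the placeholder
# address localhost:8099; nothing below executes pipeline stages itself.
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",        # hand the pipeline off to a job service
    "--job_endpoint=localhost:8099",  # placeholder job service address
])

with beam.Pipeline(options=options) as pipeline:
    _ = (
        pipeline
        # ReadFromKafka is a cross-language transform; the Kafka consumer
        # itself runs in a Java SDK harness managed by the job service.
        | "Read Kafka Messages" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "localhost:9092"},  # placeholder
            topics=["my-topic"],                                      # placeholder
        )
        # User Python code like this Map runs in the Python SDK harness.
        | "Print" >> beam.Map(print)
    )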
Currently, the docker container you created is based on an SDK harness container, which does not sound like what you want. You seem to have been trying to containerize your pipeline construction code and accidentally containerized the SDK harness instead. But since you describe wanting the ReadFromKafka consumer to be containerized, it sounds like what you actually need is for the Job Server to be containerized, in addition to any SDK harnesses it uses.
Containerizing the Job Server is possible, and in some cases has already been done; for example, there is a containerized Flink Job Server. Containerized job servers may give you a bit of trouble with artifacts, since the container won't have access to artifact staging directories on your local machine, but there may be ways around that.
Additionally, you mentioned that you want to avoid having SDK harnesses start in a nested docker container. If you start a worker pool docker container for the SDK harness and set it as an external environment, the runner (assuming it supports external environments) will connect to the URL you supply instead of creating a new docker container. You will need to configure this for the Java cross-language environment as well, if that is possible from the Python SDK. This configuration is done via Python's pipeline options; --environment_type and --environment_options are good starting points.
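As a rough sketch of that setup, assuming a Python SDK worker pool container has already been started (with the --worker_pool entrypoint) and is reachable at the placeholder address localhost:50000, the pipeline options might look like this; --environment_config is used here as one way to pass the worker pool address:

# Sketch only: assumes a job service at localhost:8099 and a Python SDK
# worker pool already running at localhost:50000 (both placeholders).
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    # EXTERNAL tells the runner to connect to an existing SDK harness
    # worker pool instead of starting a new docker container.
    "--environment_type=EXTERNAL",
    "--environment_config=localhost:50000",
])

Note that this only affects the default (Python) environment; whether the Java cross-language environment that ReadFromKafka expands into can be redirected the same way from the Python SDK is a separate question.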
For apache/beam_python3.8_sdk:latest, the entrypoint should be replaced with ENTRYPOINT ["/opt/apache/beam/boot", "--worker_pool"]. But that runs Apache Beam, and I need to run Python. This suggests that I am barking up the wrong tree and that this is not the image I should be using. So the question remains: what is this docker image? – Calvillo
I am using the python:3.8.12-slim docker image. The ReadFromKafka consumer, however, requires adding not only a JDK but is also trying to run docker inside a docker container (which is a no-go for me). The code:
with beam.Pipeline(options=options) as pipeline:
    raw_items = ( pipeline | "Read Kafka Messages" >> ReadFromKafka(..) )
– Calvillo