I'm trying to connect to a standalone Apache Spark cluster from a dockerized Apache Spark application using client mode.
The driver advertises its address to the Spark Master and the Workers. When run inside a Docker container, it advertises some_docker_container_ip. That address is not visible from outside the Docker network, so the application won't work.
Spark has a spark.driver.host
property, which is passed to the Master and the Workers. My first instinct was to set it to the host machine's address, so the cluster would contact the visible machine instead.
Unfortunately, spark.driver.host
is also the address the driver uses to set up its own server. Setting it to the host machine's address causes server startup errors, because the Docker container cannot bind ports on the host machine's address.
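For illustration, the failing setup looks roughly like this (the master URL, host address, and jar name are placeholders for my actual values):

```shell
# Submitted from inside the Docker container.
# With spark.driver.host set to the host machine's address, the Master and
# Workers can reach the driver, but the driver itself fails to bind its
# server because that address does not exist inside the container.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --conf spark.driver.host=192.168.1.10 \
  app.jar
```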
It seems like a lose-lose situation: I can use neither the host machine address nor the Docker container address.
Ideally I would like to have two properties: a spark.driver.host-to-bind-to
used to set up the driver server, and a spark.driver.host-for-master
that the Master and Workers would use. Unfortunately, it seems I'm stuck with a single property.
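Concretely, something like this in spark-defaults.conf is what I'm after (both property names are made up by me and do not exist in Spark; the addresses are placeholders):

```
# hypothetical — these properties do not exist:
spark.driver.host-to-bind-to   172.17.0.2     # container address the driver server binds to
spark.driver.host-for-master   192.168.1.10   # host address advertised to the Master and Workers
```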
Another approach would be to run the Docker container with --net=host.
That approach has many disadvantages (e.g. other containers cannot be linked to a container running with --net=host
and must instead be exposed outside the Docker network), and I would like to avoid it.
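For completeness, this is the workaround I'd like to avoid (my-spark-app is a placeholder image name):

```shell
# Shares the host's network stack, so the driver binds and advertises
# the host address directly — but the container loses network isolation
# and cannot be linked to by other containers.
docker run --net=host my-spark-app
```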
Is there any way I could solve the driver-addressing problem without exposing the docker containers?