PySpark package installation on Kubernetes with spark-submit: Ivy cache file not found error

I have been fighting with this the whole day. I am able to install and use a package (graphframes) with the Spark shell or a connected Jupyter notebook, but I would like to move it to a Kubernetes-based Spark environment with spark-submit. My Spark version is 3.0.1. I downloaded the latest available .jar file (graphframes-0.8.1-spark3.0-s_2.12.jar) from spark-packages and put it into the jars folder. I use a variation of the standard Spark Dockerfile to build my images. My spark-submit command looks like this:

$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=$2 \
--conf spark.kubernetes.container.image=myimage.io/repositorypath \
--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
--jars "local:///opt/spark/jars/graphframes-0.8.1-spark3.0-s_2.12.jar" \
path/to/my/script/script.py

But it ends with an error:

Ivy Default Cache set to: /opt/spark/.ivy2/cache
The jars for the packages stored in: /opt/spark/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5;1.0
    confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5-1.0.xml (No such file or directory)

Here are the full driver logs, just in case:

(base) konstantinigin@Konstantins-MBP spark-3.0.1-bin-hadoop3.2 % kubectl logs scalableapp-py-7669dd784bd59f67-driver
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ sort -t_ -k4 -n
+ grep SPARK_JAVA_OPT_
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' 3 == 2 ']'
+ '[' 3 == 3 ']'
++ python3 -V
+ pyv3='Python 3.7.3'
+ export PYTHON_VERSION=3.7.3
+ PYTHON_VERSION=3.7.3
+ export PYSPARK_PYTHON=python3
+ PYSPARK_PYTHON=python3
+ export PYSPARK_DRIVER_PYTHON=python3
+ PYSPARK_DRIVER_PYTHON=python3
+ '[' -n '' ']'
+ '[' -z ']'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.1.2.145 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///opt/spark/data/ScalableApp.py --number_of_executors 2 --dataset USAir --links 100
Ivy Default Cache set to: /opt/spark/.ivy2/cache
The jars for the packages stored in: /opt/spark/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5;1.0
    confs: [default]
Exception in thread "main" java.io.FileNotFoundException: /opt/spark/.ivy2/cache/resolved-org.apache.spark-spark-submit-parent-e833e157-44f5-4055-81a4-3ab524176ef5-1.0.xml (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:70)
    at org.apache.ivy.plugins.parser.xml.XmlModuleDescriptorWriter.write(XmlModuleDescriptorWriter.java:62)
    at org.apache.ivy.core.module.descriptor.DefaultModuleDescriptor.toIvyFile(DefaultModuleDescriptor.java:563)
    at org.apache.ivy.core.cache.DefaultResolutionCacheManager.saveResolvedModuleDescriptor(DefaultResolutionCacheManager.java:176)
    at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:245)
    at org.apache.ivy.Ivy.resolve(Ivy.java:523)
    at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1387)
    at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
    at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:871)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Has anyone seen something similar? Maybe you have an idea of what I am doing wrong here?

Trometer answered 20/3, 2021 at 14:40 Comment(0)

Okay, I solved my issue. I am not sure whether it will work for other packages, but it lets me run graphframes in the setup described above:

1. Download the latest .jar file from spark-packages.
2. Remove the version part of its name, leaving only the package name. In my case that was:

mv ./graphframes-0.8.1-spark3.0-s_2.12.jar ./graphframes.jar

3. Unpack it using the jar command:

# Extract jar contents
jar xf graphframes.jar

4. Copy the graphframes folder from the extracted contents into your dependencies folder. Some background: I put all the packages I use into one dependencies folder that I later submit to Kubernetes in zipped form; the logic behind this folder is explained in another question of mine that I also answered myself, see here.

cp -r ./graphframes $SPARK_HOME/path/to/your/dependencies

5. Add the original .jar file to the jars folder inside your $SPARK_HOME.
6. Include --jars in your spark-submit command, pointing at the new .jar file:

$SPARK_HOME/bin/spark-submit \
--master k8s://https://kubernetes.docker.internal:6443 \
--deploy-mode cluster \
--conf spark.executor.instances=$2 \
--conf spark.kubernetes.container.image=docker.io/path/to/your/image \
--jars "local:///opt/spark/jars/graphframes.jar" \ ...

7. Include your dependencies as described here (see the sketch at the end of this answer for how the zipped folder can be shipped).

I am in a hurry right now, but in the near future I will edit this post and add a link to a short Medium article about handling dependencies in PySpark. I hope it will be useful to someone :)
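A minimal sketch of how the dependencies folder can be zipped and shipped via --py-files (the paths below are placeholders, and exactly how the archive is made available to the driver, e.g. baked into the image like the jar above, depends on your setup):

# Zip the *contents* of the dependencies folder so that the extracted
# packages (e.g. graphframes/) sit at the root of the archive and stay
# importable on the workers.
cd $SPARK_HOME/path/to/your/dependencies
zip -r ../dependencies.zip .

# Then add the archive to the spark-submit command, for example
# (assuming it is baked into the image next to the jars):
#   --py-files "local:///opt/spark/dependencies.zip"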

Trometer answered 22/3, 2021 at 12:12 Comment(4)
Did you ever find a solution that allows the use of the --packages flag? This bug is currently affecting me as well. – Courtier
--packages did not work for me. I believe there is a problem with Ivy, the package manager Spark uses. – Trometer
How would you handle this situation for a larger set of dependencies? My PySpark job relies on the org.apache.hadoop:hadoop-azure:3.2.0 package, which has a dozen dependencies; I cannot supply all of them manually. The weird thing is that it works in local mode, so the packages must already be there somewhere. – Heck
I also don't completely understand step 7. Do you supply the zipped jars as --py-files? – Heck

Adding this configuration to spark-submit worked for me:

spark-submit \
 --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" \
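In the Kubernetes setup from the question, this might look roughly as follows (just a sketch; the image and script paths are the placeholders from the question):

$SPARK_HOME/bin/spark-submit \
 --master k8s://https://kubernetes.docker.internal:6443 \
 --deploy-mode cluster \
 --conf spark.executor.instances=$2 \
 --conf spark.kubernetes.container.image=myimage.io/repositorypath \
 --conf spark.driver.extraJavaOptions="-Divy.cache.dir=/tmp -Divy.home=/tmp" \
 --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 \
 path/to/my/script/script.py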
Wolenik answered 13/10, 2021 at 16:25 Comment(0)

It seems to be a known Spark issue that is in the process of being resolved:

https://github.com/apache/spark/pull/32397

Courtier answered 13/5, 2021 at 15:29 Comment(0)

I managed to solve a similar problem where I wasn't able to download the hadoop-azure jars with the --packages flag. It is definitely a workaround, but it works.

I modified the PySpark Docker image by changing the entrypoint to:

ENTRYPOINT [ "/opt/entrypoint.sh" ]

Now I was able to run the container without it exiting immediately:

docker run -td <docker_image_id>

And could open a shell inside it:

docker exec -it <docker_container_id> /bin/bash

At this point I could submit the Spark job inside the container with the --packages flag:

$SPARK_HOME/bin/spark-submit \
  --master local[*] \
  --deploy-mode client \
  --name spark-python \
  --packages org.apache.hadoop:hadoop-azure:3.2.0 \
  --conf spark.hadoop.fs.azure.account.auth.type.user.dfs.core.windows.net=SharedKey \
  --conf spark.hadoop.fs.azure.account.key.user.dfs.core.windows.net=xxx \
  --files "abfss://[email protected]/config.yml" \
  --py-files "abfss://[email protected]/jobs.zip" \
  "abfss://[email protected]/main.py"

Spark then downloaded the required dependencies, saved them under /root/.ivy2 in the container, and executed the job successfully.

I copied the whole folder from the container onto the host machine:

sudo docker cp <docker_container_id>:/root/.ivy2/ /opt/spark/.ivy2/

And modified the Dockerfile again to copy the folder into the image:

COPY .ivy2 /root/.ivy2

Finally, I could submit the job to Kubernetes with this newly built image, and everything runs as expected.
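Putting the two Dockerfile changes together, the relevant part might look roughly like this (the base image name below is a placeholder; the rest of the Dockerfile stays as in your original image):

# Sketch only: the base image name is a placeholder
FROM my-registry/spark-py:3.0.1

# Bake the pre-resolved Ivy cache (copied out of the test container above)
# into the image so --packages can resolve everything at submit time
COPY .ivy2 /root/.ivy2

ENTRYPOINT [ "/opt/entrypoint.sh" ]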

Heck answered 29/6, 2021 at 9:31 Comment(1)
This trick is so good. I am not working with k8s, but it helped me rebuild my Docker image so that I do not need to download dependencies every time I create a new container and run a Spark job. – Pentimento