Spark on Kubernetes with Minio: Postgres -> Minio, "unable to create executor due to" jar error
Hi, I am facing an error when providing dependency jars for spark-submit on Kubernetes.

/usr/middleware/spark-3.1.1-bin-hadoop3.2/bin/spark-submit \
  --master k8s://https://112.23.123.23:6443 \
  --deploy-mode cluster \
  --name spark-postgres-minio-kubernetes \
  --jars file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar \
  --driver-class-path file:///AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.file.upload.path=s3a://daci-dataintegration/spark-operator-on-k8s/code \
  --conf spark.hadoop.fs.s3a.fast.upload=true \
  --conf spark.kubernetes.container.image=hostname:5000/spark-py:spark3.1.2 \
  file:///AirflowData/kubernetes/python/postgresminioKube.py
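For context, --jars with a file:// path makes spark-submit upload the jar to the spark.kubernetes.file.upload.path location, and the executors then fetch it when they start. If the jar is instead baked into the image (as discussed in the comments below), one alternative is the local:// scheme, which per the Spark on Kubernetes docs tells Spark the file is already present inside the container; another is letting Spark resolve the driver from Maven. A sketch of the relevant flags, where /opt/spark/jars is only an assumption about where the jar would sit in the image:

# Option 1: jar already present inside the container image (path assumed)
--jars local:///opt/spark/jars/postgresql-42.2.14.jar

# Option 2: resolve the JDBC driver from Maven Central at submit time
--packages org.postgresql:postgresql:42.2.14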

Below is the code to execute. The jars needed for S3/Minio and the related configurations are placed in SPARK_HOME/conf and SPARK_HOME/jars, and the Docker image is built with them.

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import functions as F
import json

spark = SparkSession.builder.appName("Postgres-Minio-Kubernetes").getOrCreate()
#spark = SparkSession.builder.config('spark.driver.extraClassPath', '/hadoop/externalJars/db2jcc4.jar').getOrCreate()

# Placeholders: substitute the real hostname, port and database name.
jdbcUrl = "jdbc:postgresql://{0}:{1}/{2}".format("hostname", "port", "db")
connectionProperties = {
  "user" : "username",
  "password" : "password",
  "driver": "org.postgresql.Driver",
  "fetchsize" : "100000"
}

# The subquery is pushed down to Postgres; the read is split into two
# partitions on employee_id.
pushdown_query = "(select * from public.employees) emp_als"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, column="employee_id", lowerBound=1, upperBound=100, numPartitions=2, properties=connectionProperties)

df.write.format('csv').options(delimiter=',').mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')
# Note: delimiter/header are CSV options and are ignored by the parquet writer;
# writing to the same path in overwrite mode also replaces the CSV output above.
df.write.format('parquet').options(delimiter='|').options(header=True).mode('overwrite').save('s3a://daci-dataintegration/spark-operator-on-k8s/data/postgres-minio-csv/')

The error is below. It seems to be failing on the jar for some reason:

21/11/09 17:05:44 INFO SparkContext: Added JAR file:/tmp/spark-d987d7e7-9d49-4523-8415-1e438da1730e/postgresql-42.2.14.jar at spark://spark-postgres-minio-kubernetes-49d7d77d05a980e5-driver-svc.spark.svc:7078/jars/postgresql-42.2.14.jar with timestamp 1636477543573

21/11/09 17:05:49 ERROR TaskSchedulerImpl: Lost executor 1 on 192.168.216.12: Unable to create executor due to ./postgresql-42.2.14.jar
Grantley answered 9/11/2021 at 17:31 (5 comments)
Is there a stack trace belonging to the ERROR line? - Pawpaw
Is the path /AirflowData/kubernetes/externalJars/postgresql-42.2.14.jar on the machine from which you submit, inside the hostname:5000/spark-py:spark3.1.2 container image, or somewhere else? - Pawpaw
... either way, please have a close look at the Dependency Management section of the Spark on Kubernetes documentation. - Pawpaw
The path is mounted on all nodes, but it works only if the jars are built into the image. - Grantley
Were you still using the command you posted originally? Changing it according to the doc section I referenced might help. - Pawpaw

The external jars were getting added to /opt/spark/work-dir, and the Spark user did not have access to that directory. So I changed the Dockerfile to grant access to the folder, and then it worked.

RUN chmod 777 /opt/spark/work-dir
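
For what it's worth, chmod 777 makes the work dir world-writable. A slightly tighter sketch, assuming the stock Apache Spark images' default non-root user (UID 185 in their Dockerfiles; verify against your base image):

# Assumed alternative: hand the directory to the Spark UID instead of
# opening it to everyone (adjust 185 if your base image uses another user).
RUN chown -R 185:0 /opt/spark/work-dir && chmod -R 775 /opt/spark/work-dir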
Grantley answered 11/1/2022 at 15:29 (0 comments)
