How to pass -D parameter or environment variable to Spark job?
I want to change the Typesafe config of a Spark job in dev/prod environments. It seems to me that the easiest way to accomplish this is to pass -Dconfig.resource=ENVNAME to the job; the Typesafe Config library will then do the rest for me.

Is there a way to pass that option directly to the job? Or is there a better way to change the job config at runtime?
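For reference, this is the plain-JVM behaviour I'd like to reproduce under Spark (a minimal sketch; the jar and main class are placeholders, and config.resource expects a full classpath resource name such as dev.conf):

# Typesafe Config reads the config.resource system property and loads that
# classpath resource instead of the default application.conf
java -Dconfig.resource=dev.conf  -cp my-app.jar com.example.MyApp
java -Dconfig.resource=prod.conf -cp my-app.jar com.example.MyApp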

EDIT:

  • Nothing happens when I add the --conf "spark.executor.extraJavaOptions=-Dconfig.resource=dev" option to the spark-submit command.
  • I get Error: Unrecognized option '-Dconfig.resource=dev'. when I pass -Dconfig.resource=dev directly to the spark-submit command.
Hoeve answered 27/1, 2015 at 9:6 Comment(5)
Please specify how you are starting your job. In general you can just stick -Dx=y on the command line. – Blen
@DanielDarabos I start my job with spark-submit on YARN. – Hoeve
@Hoeve Can you accept an answer? – Miasma
@DonBranson I've tried all the answers here and none worked for me on Spark 1.6.0! I have this exact issue. I can't seem to override a config property in my Typesafe config configuration file via a -D param. – Redeeming
@Hoeve Did you manage to find a solution? – Redeeming
Change the spark-submit command line, adding three options (combined into a full command in the sketch after this list):

  • --files <location_to_your_app.conf>
  • --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app'
  • --conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'
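Put together, a minimal sketch (master, main class, jar, and config path are placeholders; it assumes the file shipped with --files is named app.conf and ends up on the containers' classpath, and that config.resource is given the full resource name; in yarn-client mode pass the driver option with --driver-java-options instead, as noted in the comments):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files /path/to/app.conf \
  --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app.conf' \
  --conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app.conf' \
  --class com.example.MyJob \
  my-job.jar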
Hoeve answered 29/1, 2015 at 12:9 Comment(6)
Note that using the --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app' option will not work when spark-submit runs the driver in client mode. Use --driver-java-options "-Dconfig.resource=app" instead. See the Spark Configuration docs. – Zone
On YARN I used: --files <location_to_your.conf>#application.conf --driver-java-options -Dconfig.file=your.conf. The # in --files gives the name the executors will see, so they will see the specified file as application.conf. – Critchfield
Alternatively: spark-submit --driver-java-options='-Dmy.config.path=myConfigValue' – Outandout
@Hoeve This is not working for me... did this solve your issue? – Redeeming
It was working for me back in 2015. ATM I can't even tell what Spark version it was. – Hoeve
@Zone Not the best question, but what does -Dconfig.resource=app stand for, and what is its purpose exactly in spark.executor.extraJavaOptions? – Campanile
Here is my Spark program run with additional Java options:

/home/spark/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--files /home/spark/jobs/fact_stats_ad.conf \
--conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf \
--conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf' \
--class jobs.DiskDailyJob \
--packages com.databricks:spark-csv_2.10:1.4.0 \
--jars /home/spark/jobs/alluxio-core-client-1.2.0-RC2-jar-with-dependencies.jar \
--driver-memory 2g \
/home/spark/jobs/convert_to_parquet.jar \
AD_COOKIE_REPORT FACT_AD_STATS_DAILY | tee /data/fact_ad_stats_daily.log

As you can see:

  • the custom config file: --files /home/spark/jobs/fact_stats_ad.conf
  • the executor Java options: --conf spark.executor.extraJavaOptions=-Dconfig.fuction.conf
  • the driver Java options: --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH -Dalluxio.user.file.write.location.policy.class=alluxio.client.file.policy.MostAvailableFirstPolicy -Dconfig.file=/home/spark/jobs/fact_stats_ad.conf'

Hope it helps.

Cerebrospinal answered 26/8, 2016 at 10:37 Comment(1)
This answer helps in showing the format for passing multiple options as a space-separated list of -Dkey=value pairs. – Antenna
I had a lot of problems with passing -D parameters to the Spark executors and the driver, so I've added a quote from my blog post about it: "The right way to pass the parameters is through the properties spark.driver.extraJavaOptions and spark.executor.extraJavaOptions. I passed both the log4j configuration property and the parameter that I needed for my configuration (to the driver I was able to pass only the log4j configuration). For example (written in a properties file passed to spark-submit with --properties-file):"

spark.driver.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties
spark.executor.extraJavaOptions -Dlog4j.configuration=file:///spark/conf/log4j.properties -Dapplication.properties.file=hdfs:///some/path/on/hdfs/app.properties
spark.application.properties.file hdfs:///some/path/on/hdfs/app.properties
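A sketch of how the pieces fit together: the properties above go into a plain properties file that is handed to spark-submit (the file name, class, and jar below are placeholders):

# spark-overrides.properties contains the three property lines shown above
spark-submit \
  --master yarn \
  --properties-file spark-overrides.properties \
  --class com.example.MyJob \
  my-job.jar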

You can read my blog post about the overall configuration of Spark. I'm running on YARN as well.

Trisoctahedron answered 27/1, 2015 at 18:0 Comment(1)
Please add some more content to it and avoid a link-only answer. – Pancreatotomy
--files <location_to_your_app.conf> --conf 'spark.executor.extraJavaOptions=-Dconfig.resource=app' --conf 'spark.driver.extraJavaOptions=-Dconfig.resource=app'

If you write it this way, the later --conf will overwrite the previous one; you can verify this by looking at the Spark UI's Environment tab after the job has started.

So the correct way is to put the options on the same line, like this: --conf 'spark.executor.extraJavaOptions=-Da=b -Dc=d'. If you do this, you will find all your settings shown in the Spark UI.
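In other words (a sketch; -Da=b and -Dc=d stand in for your own properties, and the rest of the command is elided):

# Same key repeated: only the last value survives, so -Da=b is lost
spark-submit --conf 'spark.executor.extraJavaOptions=-Da=b' \
             --conf 'spark.executor.extraJavaOptions=-Dc=d' ...

# All -D pairs in one space-separated value: both properties are set
spark-submit --conf 'spark.executor.extraJavaOptions=-Da=b -Dc=d' ...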

Kneehigh answered 9/5, 2017 at 20:39 Comment(3)
I don't believe this is true for the --conf flag, though it is true for --files. – Autonomy
I have tested on 2.1.0 and 2.1.1. According to the Spark UI -> Environment tab, I only see the later one if we use --conf twice. – Kneehigh
I think your example is flawed. You are showing two separate and completely different key/values after the --conf flag (one executor, one driver); those cannot overwrite each other. If you are saying that only the last repetition of any --conf option will take effect, you are correct, but your example does not show that. In spark-submit you can have one --files option, the last of which (if multiple) will be used and the earlier ones ignored, and you can have multiple --conf key=value options, but if you duplicate a key it will take the last value. – Autonomy
I am starting my Spark application via a spark-submit command launched from within another Scala application. So I have an Array like

Array(".../spark-submit", ..., "--conf", confValues, ...)

where confValues is:

  • for yarn-cluster mode:
    "spark.driver.extraJavaOptions=-Drun.mode=production -Dapp.param=..."
  • for local[*] mode:
    "run.mode=development"

It is a bit tricky to understand where (not) to escape quotes and spaces, though. You can check the Spark web interface for system property values.
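For comparison, when the same option is typed directly into a shell, the whole value has to be quoted so the embedded spaces survive as a single argument (a sketch; the jar name is a placeholder):

spark-submit --master yarn --deploy-mode cluster \
  --conf 'spark.driver.extraJavaOptions=-Drun.mode=production -Dapp.param=...' \
  my-job.jar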

Lavella answered 28/1, 2015 at 0:56 Comment(1)
This worked for me! (at least for local[*] mode). I'll try with yarn-cluster mode and update the comment (if I don't forget.. :D ) – Matildematin
spark-submit --driver-java-options "-Denv=DEV -Dmode=local" --class co.xxx.datapipeline.jobs.EventlogAggregator target/datapipeline-jobs-1.0-SNAPSHOT.jar

The above command works for me:

-Denv=DEV => read the DEV environment properties file, and
-Dmode=local => create the SparkContext in local mode (.setMaster("local[*]"))

Eastward answered 7/11, 2018 at 19:50 Comment(0)
Use the method shown in the command below; it may be helpful for you:

spark-submit --master local[2] --conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j.properties' --conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j.properties' --class com.test.spark.application.TestSparkJob target/application-0.0.1-SNAPSHOT-jar-with-dependencies.jar prod

I have tried it and it worked for me. I would also suggest going through the Spark-on-YARN documentation, which is really helpful: https://spark.apache.org/docs/latest/running-on-yarn.html

Fleam answered 24/11, 2017 at 14:31 Comment(0)
I originally had this config file:

my-app {
  environment: dev
  other: xxx
}

This is how I'm loading my config in my Spark Scala code:

import java.io.File
import com.typesafe.config.ConfigFactory

val config = ConfigFactory.parseFile(new File("my-app.conf"))
  .withFallback(ConfigFactory.load())
  .resolve()
  .getConfig("my-app")

With this setup, despite what the Typesafe Config documentation and all the other answers say, the system property override didn't work for me when I launched my Spark job like so (most likely because the file parsed with parseFile takes precedence over the fallback ConfigFactory.load(), which is where system-property overrides are picked up):

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name my-app \
  --driver-java-options='-XX:MaxPermSize=256M -Dmy-app.environment=prod' \
  --files my-app.conf \
  my-app.jar

To get it to work I had to change my config file to:

my-app {
  environment: dev
  environment: ${?env.override}
  other: xxx
}

and then launch it like so:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --name my-app \
  --driver-java-options='-XX:MaxPermSize=256M -Denv.override=prod' \
  --files my-app.conf \
  my-app.jar
Redeeming answered 28/12, 2017 at 17:3 Comment(1)
I'm running Spark 1.6.0, BTW. – Redeeming
