How to add functions from custom JARs to an EMR cluster?

I created an EMR cluster on AWS with Spark and Livy. I submitted a custom JAR with some additional libraries (e.g. data sources for custom formats) as a custom JAR step. However, the classes from the custom JAR are not available when I try to use them through Livy.

What do I have to do to make the custom classes available in the Spark environment?

Vociferation asked 19/6, 2019 at 11:18 Comment(6)
Is it available in your Spark job? – Selfcontrol
In my Spark job, I add it as a dependency and sbt assembly packs it into the fat JAR. I want to include a library my colleagues can use when they use Spark with Livy. – Vociferation
You need to make sure you add it to their jobs via the spark.driver.extraClassPath and spark.executor.extraClassPath properties passed to spark-submit. – Selfcontrol
Ah, so I should handle this via the classifications JSON I can supply when creating the cluster? – Vociferation
Yes, I believe so. I don't remember the specific details, but you can definitely provide custom configuration. – Selfcontrol
I am currently trying to use a bootstrap action to copy my library to the nodes, in conjunction with configuration classifications. Let's see if that works. – Vociferation
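
For reference, a minimal sketch of what the comments above suggest, i.e. passing the two class-path properties directly to spark-submit. The JAR path, main class, and application JAR are placeholders, and the library JAR must already exist at that path on every node (which is what the bootstrap action in the accepted answer takes care of):

    # Hypothetical submission; mylib.jar must already be present at this
    # path on the driver and on every executor node.
    spark-submit \
      --conf spark.driver.extraClassPath=/home/hadoop/mylib/mylib.jar \
      --conf spark.executor.extraClassPath=/home/hadoop/mylib/mylib.jar \
      --class com.example.MyJob \
      my-job.jar

Note that on EMR these two properties already carry long default values in spark-defaults.conf, so overriding them on the command line replaces the EMR defaults; the accepted answer below appends to them instead.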

I am posting this as an answer so that I can accept it. I figured it out thanks to Yuval Itzchakov's comments and the AWS documentation on Custom Bootstrap Actions.

So here is what I did:

  1. I put my library jar (a fat jar created with sbt assembly containing everything needed) into an S3 bucket
  2. Created a script named copylib.sh which contains the following:

    #!/bin/bash
    
    # Copy the library JAR from S3 onto the node so that the extraClassPath
    # entries in the spark-defaults classification can reference it locally.
    mkdir -p /home/hadoop/mylib
    aws s3 cp s3://mybucket/mylib.jar /home/hadoop/mylib
    
  3. Created the following configuration JSON and put it into the same bucket, alongside mylib.jar and copylib.sh:

    [{
       "configurations": [{
           "classification": "export",
           "properties": {
               "PYSPARK_PYTHON": "/usr/bin/python3"
           }
       }],
       "classification": "spark-env",
       "properties": {}
    }, {
       "configurations": [{
           "classification": "export",
           "properties": {
               "PYSPARK_PYTHON": "/usr/bin/python3"
           }
       }],
       "classification": "yarn-env",
       "properties": {}
    },
    {
       "Classification": "spark-defaults",
       "Properties": {
           "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/mylib/mylib.jar",
           "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/mylib/mylib.jar"
       }
    }
    ]
    

    The classifications for spark-env and yarn-env are needed for PySpark to work with Python 3 on EMR through Livy. One caveat: EMR already populates both extraClassPath properties with a number of libraries that EMR needs to function properly, so I had to launch a cluster without my library, extract these settings from spark-defaults.conf, and adjust my classification afterwards (a small sketch for extracting them follows after these steps). Otherwise, things like S3 access wouldn't work.

  4. When creating the cluster, I referenced the configuration JSON file from above under Edit software settings in step 1 of the console wizard, and configured copylib.sh as a Custom Bootstrap Action in step 3 of the wizard (a rough AWS CLI equivalent is sketched below).
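
For reference, a rough AWS CLI equivalent of these console steps. The cluster name, release label, instance settings, and the configuration file name (myconfig.json) are placeholders; only the --bootstrap-actions and --configurations options correspond to the steps above:

    # Hypothetical cluster creation; adjust names, release label and sizing.
    aws emr create-cluster \
      --name "cluster-with-mylib" \
      --release-label emr-5.24.1 \
      --applications Name=Spark Name=Livy Name=JupyterHub \
      --instance-type m5.xlarge \
      --instance-count 3 \
      --use-default-roles \
      --bootstrap-actions Path=s3://mybucket/copylib.sh,Name=copylib \
      --configurations https://s3.amazonaws.com/mybucket/myconfig.json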

I can now open the cluster's JupyterHub, start a notebook, and work with my added functions.
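
Regarding the caveat in step 3: one way to capture the defaults (a sketch, assuming SSH access to the master node of a cluster launched without the custom classification) is to read them out of spark-defaults.conf and paste them into the spark-defaults classification before appending the custom JAR:

    # On the master node of a stock EMR cluster: print the class-path defaults
    # that EMR generated, so they can be copied into the classification JSON.
    grep extraClassPath /etc/spark/conf/spark-defaults.conf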

Vociferation answered 19/6, 2019 at 12:45 Comment(0)

I use an alternative approach that does not require a bootstrap action.

  1. Place the JARs in S3
  2. Pass them in the --jars option of spark-submit, e.g. spark-submit --jars s3://my-bucket/extra-jars/*.jar. All the JARs will be copied to the cluster.

This way we can use any JAR from S3, even if we forgot to add a bootstrap action during cluster creation; a fuller invocation is sketched below.
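
A fuller invocation, with a hypothetical main class and application JAR, might look like this; --jars also accepts an explicit comma-separated list of paths:

    # Hypothetical submission; class name and JAR locations are placeholders.
    spark-submit \
      --class com.example.MyJob \
      --jars s3://my-bucket/extra-jars/lib-a.jar,s3://my-bucket/extra-jars/lib-b.jar \
      s3://my-bucket/app/my-job.jar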

Hemichordate answered 14/1, 2022 at 3:36 Comment(0)
