Adding external jars in EMR Notebooks

I use an EMR Notebook connected to an EMR cluster. The kernel is Spark and the language is Scala. I need some jars that are located in an S3 bucket. How can I add them?

With spark-shell it's easy:

spark-shell --jars "s3://some/path/file.jar,s3://some/path/file2.jar"

And in the Scala console I can do

:require s3://some/path/file.jar

Northward answered 13/8, 2019 at 8:28 Comment(5)
What is the kernel you are using? – Holloway
Kernel is Spark and language is Scala. – Northward
Did you try AddJar s3://some/path/file.jar? – Holloway
Yes, I get the error: Incomplete statement. – Northward
Is there a way to add a Maven dependency? – Dinosaurian

After you start the notebook, you can do this in a cell:

%%configure -f
{
    "conf": {
        "spark.jars.packages": "com.jsuereth:scala-arm_2.11:2.0,ml.combust.bundle:bundle-ml_2.11:0.13.0,com.databricks:dbutils-api_2.11:0.0.3"
    },
    "jars": [
        "//path to external downloaded jars"
    ]
}
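
For reference, a filled-in version of that cell might look like the sketch below; the bucket, jar names, and Maven coordinate are placeholders to replace with your own. "spark.jars.packages" takes Maven coordinates that get resolved from a repository, while the "jars" list takes direct paths (for example in S3) that Livy attaches to the session, so you can use either or both:

%%configure -f
{
    "conf": {
        "spark.jars.packages": "com.jsuereth:scala-arm_2.11:2.0"
    },
    "jars": [
        "s3://your-bucket/jars/file.jar",
        "s3://your-bucket/jars/file2.jar"
    ]
}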
Ashbey answered 13/8, 2019 at 12:59 Comment(8)
I tried it this way: %%configure -f { "conf": {"spark.jars.packages": "//path to external downloaded jars"} } and this way: %%configure -f { "conf": {"jars": "//path to external downloaded jars"} } – Northward
Should I use exactly this line: "conf": {"spark.jars.packages": "com.jsuereth:scala-arm_2.11:2.0,ml.combust.bundle:bundle-ml_2.11:0.13.0,com.databricks:dbutils-api_2.11:0.0.3"}? – Northward
Those are just sample jars that I needed for my notebook. You need to replace them with the jars you need. – Ashbey
Should I use "spark.jars.packages": "" or "jars": [""]? – Northward
I used this some time ago. You need to check it against the version you are using. – Ashbey
How can we specify a Maven repository in this? I am using the command below to launch pyspark on the master node, but I can't figure out how to configure the repository using %%configure: sudo pyspark --repositories redshift-maven-repository.s3-website-us-east-1.amazonaws.com/… --packages org.apache.spark:spark-avro_2.11:2.4.0,com.databricks:spark-redshift_2.11:2.0.1,com.amazon.redshift:redshift-jdbc42:1.2.34.1058 – Stockade
I am running a SageMaker notebook instance backed by an EMR cluster (via Livy). I am able to add my own Scala jar to the notebook with %%configure -f { "jars": [ "s3://bucket/prefix/package.jar" ] } – Savonarola
Any ideas how this works in a PySpark kernel (not Scala)? – Fakery
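
On the last two comments: %%configure is a Sparkmagic magic handled by Livy, so the same cell should also work in the PySpark kernel, and an additional Maven repository can be supplied through Spark's spark.jars.repositories setting. A rough sketch, with the repository URL as a placeholder and the Redshift JDBC coordinate taken from the comment above:

%%configure -f
{
    "conf": {
        "spark.jars.repositories": "https://your.maven.repository/url",
        "spark.jars.packages": "com.amazon.redshift:redshift-jdbc42:1.2.34.1058"
    }
}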

Just put this in the first cell of your notebook:

%%configure -f
{
    "conf": {
        "spark.jars": "s3://YOUR_BUCKET/YOUR_DRIVER.jar"
    }
}
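
One way to confirm the jar actually reached the session (not part of the original answer, just a quick sanity check) is to read the setting back from a Scala cell and try an import; the class below is a hypothetical stand-in for whatever your jar provides:

// Read back the jar list registered with the session
spark.sparkContext.getConf.get("spark.jars")

// Hypothetical: import a class packaged in YOUR_DRIVER.jar
// import com.example.yourdriver.SomeClass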
Spaceless answered 27/10, 2019 at 14:36 Comment(3)
This worked for me. Remember to run this before any Scala command. – Greenway
@IgorTavares On an EMR 5.29.0 notebook this stopped the "library not found" complaints, but I got a strange NullPointerException after adding a spark.jars entry pointing to S3. I'm afraid the stack trace doesn't tell me much, since I'm not sure the EMR stack trace matches the open-source Spark code lines. – Wheat
I get "Error parsing magics!: Magic configure does not exist!" – Holohedral

If you're trying to automate this, I'd suggest the following:

In your cluster's bootstrap script, copy the jar file from S3 to a readable location, something like this:

#!/bin/bash

aws s3 cp s3://path_to_your_file.jar /home/hadoop/

Then, in your cluster's software settings (in the EMR UI during cluster creation), set the classpath properties:

[
    {
      "Classification": "spark-defaults",
      "Properties": {
        "spark.driver.extraClassPath": "/home/hadoop/path_to_your_file.jar",
        "spark.jars": "/home/hadoop/path_to_your_file.jar"
      }
    }
  ]

(You can add extra properties here, such as spark.executor.extraClassPath or spark.driver.userClassPathFirst.) Then launch your cluster, and the jar should be available through imports.
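
If you create the cluster from the AWS CLI rather than the console, the same pieces plug in through --bootstrap-actions and --configurations. A rough sketch, where the cluster name, release label, bucket paths, and instance settings are all placeholders:

aws emr create-cluster \
    --name "notebook-cluster" \
    --release-label emr-5.36.0 \
    --applications Name=Spark Name=Livy \
    --bootstrap-actions Path=s3://your-bucket/bootstrap.sh \
    --configurations file://spark-classpath.json \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles

Here spark-classpath.json would contain the classification block above and bootstrap.sh the copy script from the first step.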

I had to log into the primary node and run spark-shell to find where the import was located (by typing import com. and pressing Tab to autocomplete; there's probably an easier way to do this).

Then I was able to import and use the class in Zeppelin/Jupyter.

Kayseri answered 21/9, 2023 at 8:3 Comment(0)
