Adding external jars in EMR Notebooks

I use an EMR Notebook connected to an EMR cluster. The kernel is Spark and the language is Scala. I need some jars that are located in an S3 bucket. How can I add them?

With spark-shell it's easy:

spark-shell --jars "s3://some/path/file.jar,s3://some/path/file2.jar"

And in the Scala console I can do

:require s3://some/path/file.jar

Northward answered 13/8, 2019 at 8:28 Comment(5)
What is the kernel you are using? – Holloway
Kernel is Spark and language is Scala. – Northward
Did you try AddJar s3://some/path/file.jar? – Holloway
Yes, I get the error: Incomplete statement. – Northward
Is there a way to add a Maven dependency? – Dinosaurian

After you start the notebook, you can do this in a cell:

%%configure -f
{
    "conf": {
        "spark.jars.packages": "com.jsuereth:scala-arm_2.11:2.0,ml.combust.bundle:bundle-ml_2.11:0.13.0,com.databricks:dbutils-api_2.11:0.0.3"
    },
    "jars": [
        "//path to external downloaded jars"
    ]
}
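
For reference, a filled-in version of that cell might look like the sketch below; the bucket, jar names, and Maven coordinate are placeholders to replace with your own. "spark.jars.packages" takes Maven coordinates that get resolved from a repository, while the "jars" list takes direct paths (for example in S3) that Livy attaches to the session, so you can use either or both:

%%configure -f
{
    "conf": {
        "spark.jars.packages": "com.jsuereth:scala-arm_2.11:2.0"
    },
    "jars": [
        "s3://your-bucket/jars/file.jar",
        "s3://your-bucket/jars/file2.jar"
    ]
}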
Ashbey answered 13/8, 2019 at 12:59 Comment(8)
I tried it this way: %%configure -f { "conf": {"spark.jars.packages": "//path to external downloaded jars"} } and this way: %%configure -f { "conf": {"jars": "//path to external downloaded jars"} } – Northward
Should I use exactly this line: "conf": {"spark.jars.packages": "com.jsuereth:scala-arm_2.11:2.0,ml.combust.bundle:bundle-ml_2.11:0.13.0,com.databricks:dbutils-api_2.11:0.0.3"}? – Northward
Those are just sample jars that I needed for my notebook. You need to replace them with the jars you need. – Ashbey
Should I use "spark.jars.packages": "" or "jars": [""]? – Northward
I used this some time ago. You need to check it against the version you are using. – Ashbey
How can we specify a Maven repository in this? I am using the command below to launch pyspark on the master node, but I can't figure out how to configure the repository using %%configure: sudo pyspark --repositories redshift-maven-repository.s3-website-us-east-1.amazonaws.com/… --packages org.apache.spark:spark-avro_2.11:2.4.0,com.databricks:spark-redshift_2.11:2.0.1,com.amazon.redshift:redshift-jdbc42:1.2.34.1058 – Stockade
I am running a SageMaker notebook instance backed by an EMR cluster (via Livy). I am able to add my own Scala jar to the notebook with %%configure -f { "jars": [ "s3://bucket/prefix/package.jar" ] } – Savonarola
Any ideas how this works in a PySpark kernel (not Scala)? – Fakery
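
On the last two comments: %%configure is a Sparkmagic magic handled by Livy, so the same cell should also work in the PySpark kernel, and an additional Maven repository can be supplied through Spark's spark.jars.repositories setting. A rough sketch, with the repository URL as a placeholder and the Redshift JDBC coordinate taken from the comment above:

%%configure -f
{
    "conf": {
        "spark.jars.repositories": "https://your.maven.repository/url",
        "spark.jars.packages": "com.amazon.redshift:redshift-jdbc42:1.2.34.1058"
    }
}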

Just put this in the first cell of your notebook:

%%configure -f
{
    "conf": {
        "spark.jars": "s3://YOUR_BUCKET/YOUR_DRIVER.jar"
    }
}
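
One way to confirm the jar actually reached the session (not part of the original answer, just a quick sanity check) is to read the setting back from a Scala cell and try an import; the class below is a hypothetical stand-in for whatever your jar provides:

// Read back the jar list registered with the session
spark.sparkContext.getConf.get("spark.jars")

// Hypothetical: import a class packaged in YOUR_DRIVER.jar
// import com.example.yourdriver.SomeClass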
Spaceless answered 27/10, 2019 at 14:36 Comment(3)
This worked for me. Remember to run this before any Scala command. – Greenway
@IgorTavares On an EMR 5.29.0 notebook this stopped the "library not found" complaints, but I got a strange NullPointerException after adding a spark.jars entry pointing to S3. I'm afraid the stack trace doesn't tell me much, since I'm not sure the EMR stack trace matches the open-source Spark code lines. – Wheat
I get "Error parsing magics!: Magic configure does not exist!" – Holohedral

If you're trying to automate this, I'd suggest the following:

In your cluster's bootstrap script, copy the jar file from S3 to a readable location, something like this:

#!/bin/bash

aws s3 cp s3://path_to_your_file.jar /home/hadoop/

Then, in your cluster's software settings (in the EMR UI during cluster creation), set the classpath properties:

[
    {
      "Classification": "spark-defaults",
      "Properties": {
        "spark.driver.extraClassPath": "/home/hadoop/path_to_your_file.jar",
        "spark.jars": "/home/hadoop/path_to_your_file.jar"
      }
    }
  ]

(You can add extra properties here, such as spark.executor.extraClassPath or spark.driver.userClassPathFirst.) Then launch your cluster, and the jar should be available through imports.
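
If you create the cluster from the AWS CLI rather than the console, the same pieces plug in through --bootstrap-actions and --configurations. A rough sketch, where the cluster name, release label, bucket paths, and instance settings are all placeholders:

aws emr create-cluster \
    --name "notebook-cluster" \
    --release-label emr-5.36.0 \
    --applications Name=Spark Name=Livy \
    --bootstrap-actions Path=s3://your-bucket/bootstrap.sh \
    --configurations file://spark-classpath.json \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles

Here spark-classpath.json would contain the classification block above and bootstrap.sh the copy script from the first step.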

I had to log into the primary node and run spark-shell to find where the import was located (by typing import com. and pressing Tab to autocomplete; there's probably an easier way to do this).

Then I was able to import and use the class in Zeppelin/Jupyter.

Kayseri answered 21/9, 2023 at 8:3 Comment(0)
