Spark History Server on S3A FileSystem: ClassNotFoundException

Spark can use the Hadoop S3A file system org.apache.hadoop.fs.s3a.S3AFileSystem. By adding the following to conf/spark-defaults.conf, I can get spark-shell to log to the S3 bucket:

spark.jars.packages               net.java.dev.jets3t:jets3t:0.9.0,com.google.guava:guava:16.0.1,com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3
spark.hadoop.fs.s3a.impl          org.apache.hadoop.fs.s3a.S3AFileSystem
spark.eventLog.enabled            true
spark.eventLog.dir                s3a://spark-logs-test/
spark.history.fs.logDirectory     s3a://spark-logs-test/
spark.history.provider            org.apache.hadoop.fs.s3a.S3AFileSystem
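
A quick way I confirm the logs are landing in the bucket (a sketch, assuming the AWS CLI is installed and configured with the same credentials):

# run a trivial spark-shell session so an application event log gets written
echo 'sc.parallelize(1 to 100).count()' | $SPARK_HOME/bin/spark-shell

# the finished application should now show up as an object in the bucket
aws s3 ls s3://spark-logs-test/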

Spark History Server also loads its configuration from conf/spark-defaults.conf, but it does not seem to pick up the spark.jars.packages setting, and it throws a ClassNotFoundException:

Exception in thread "main" java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:225)
    at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:256)
    at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)

The Spark source code for loading configuration differs between SparkSubmitArguments.scala and HistoryServerArguments.scala; in particular, HistoryServerArguments does not appear to handle spark.jars.packages at all.

Is there a way to add the org.apache.hadoop.fs.s3a.S3AFileSystem dependency to the History Server?

Cauchy answered 6/10, 2016 at 22:23 Comment(0)

Did some more digging and figured it out. Here's what was wrong:

  1. The JARs necessary for S3A can be added to $SPARK_HOME/jars (as described in SPARK-15965)

  2. The line

    spark.history.provider     org.apache.hadoop.fs.s3a.S3AFileSystem
    

    in $SPARK_HOME/conf/spark-defaults.conf will cause

    Exception in thread "main" java.lang.NoSuchMethodException: org.apache.hadoop.fs.s3a.S3AFileSystem.<init>(org.apache.spark.SparkConf)
    

    exception. That line can be safely removed as suggested in this answer.

To summarize:

I added the following JARs to $SPARK_HOME/jars:

  • jets3t-0.9.3.jar (may already be present in your pre-built Spark binaries; any 0.9.x version seems to work)
  • guava-14.0.1.jar (may already be present in your pre-built Spark binaries; any 14.0.x version seems to work)
  • aws-java-sdk-1.7.4.jar (must be 1.7.4)
  • hadoop-aws.jar (version 2.7.3; should probably match the Hadoop version of your Spark build)

and added this line to $SPARK_HOME/conf/spark-defaults.conf:

spark.history.fs.logDirectory     s3a://spark-logs-test/

You'll need some other configuration to enable logging in the first place, but once the S3 bucket has the logs, this is the only configuration that is needed for the History Server.
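
For reference, here is one way to pull those JARs straight from Maven Central into $SPARK_HOME/jars and restart the History Server (a sketch; adjust the versions to your own Spark/Hadoop build):

cd $SPARK_HOME/jars

# fetch the S3A-related dependencies (versions as listed above)
wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.3/jets3t-0.9.3.jar
wget https://repo1.maven.org/maven2/com/google/guava/guava/14.0.1/guava-14.0.1.jar
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar

# restart the History Server so it picks up the new classpath
$SPARK_HOME/sbin/stop-history-server.sh
$SPARK_HOME/sbin/start-history-server.sh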

Cauchy answered 7/10, 2016 at 16:26 Comment(2)
We tried running this, and the events that get generated appear to be held in memory until job completion. This causes our larger jobs to fail. Do you know much about this? Small jobs worked fine and we had data in our History Server UI. – Wang
I also needed to add: fs.s3a.fast.upload true – Estremadura

On EMR emr-5.16.0:

I've added the following to my cluster bootstrap:

sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-core-*.jar /usr/lib/spark/jars/
sudo cp /usr/share/aws/aws-java-sdk/aws-java-sdk-s3-*.jar /usr/lib/spark/jars/
sudo cp /usr/lib/hadoop/hadoop-aws.jar /usr/lib/spark/jars/

Then in the config of the cluster:

        {
          'Classification': 'spark-defaults',
          'Properties': {
            'spark.eventLog.dir': 's3a://some/path',
            'spark.history.fs.logDirectory': 's3a://some/path',
            'spark.eventLog.enabled': 'true'
          }
        }
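
If you create the cluster from the CLI instead of the console, the same classification can be passed with --configurations; a sketch with placeholder instance settings and a hypothetical bootstrap script location:

# configurations.json (standard JSON, so double quotes)
cat > configurations.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.eventLog.dir": "s3a://some/path",
      "spark.history.fs.logDirectory": "s3a://some/path",
      "spark.eventLog.enabled": "true"
    }
  }
]
EOF

# placeholder cluster settings; bootstrap.sh copies the JARs as shown above
aws emr create-cluster \
  --release-label emr-5.16.0 \
  --applications Name=Spark \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://my-bucket/bootstrap.sh \
  --configurations file://configurations.json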

If you're going to test this, first stop the spark history server:

sudo stop spark-history-server

Make the config changes

sudo vim /etc/spark/conf.dist/spark-defaults.conf

Then copy the JARs as shown above

Then restart the spark history server:

sudo /usr/lib/spark/sbin/start-history-server.sh
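
To confirm it came back up, the service should report as running and the UI should respond on the default port (18080):

sudo status spark-history-server
curl -s http://localhost:18080 | head -n 5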

Thanks for the answers above!

Voracious answered 16/8, 2019 at 15:3 Comment(0)

I added the following JARs to my $SPARK_HOME/jars directory and it works great:

  • hadoop-aws-*.jar (the version must match the hadoop-common JAR you already have; see the sketch at the end of this answer)
  • aws-java-sdk-s3-*.jar (choose the one compatible with the hadoop-aws JAR)
  • aws-java-sdk-*.jar (choose the same version as the one above)
  • aws-java-sdk-core-*.jar (choose the same version as the one above)
  • aws-java-sdk-dynamodb-*.jar (choose the same version as above; frankly, I'm not sure why this is needed, but it doesn't work for me without this JAR)

Edit:

And my spark-defaults.conf has the following three parameters set:

spark.eventLog.enabled : true
spark.eventLog.dir : s3a://bucket_name/folder_name
spark.history.fs.logDirectory : s3a://bucket_name/folder_name
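
One way to pick the matching versions is to check which hadoop-common JAR your Spark build ships and download the hadoop-aws JAR of the same version from Maven Central (a sketch; 3.2.0 below is only an example):

# find the Hadoop version bundled with Spark
ls $SPARK_HOME/jars/hadoop-common-*.jar
# e.g. hadoop-common-3.2.0.jar -> use hadoop-aws 3.2.0

cd $SPARK_HOME/jars
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.0/hadoop-aws-3.2.0.jar
# the compatible aws-java-sdk-* versions are listed in that hadoop-aws POM on Maven Central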
Burkey answered 1/12, 2020 at 8:11 Comment(0)

The OP's question and answer are useful and still work fine.
I'll leave a few notes for anyone working with a Spark 3.x version.

Let's take a look at Bitnami's Spark 3.5.1 image.
It already contains the JAR files needed for s3a, as shown below.

$  docker run --name spark-test -it --rm bitnami/spark:3.5.1 ls -al jars/
...
aws-java-sdk-bundle-1.12.262.jar
guava-14.0.1.jar
hadoop-aws-3.3.4.jar
...

Here is a good reference for the OP's question.


The following is not directly related to the OP's question, but please keep in mind:

  • Spark uses Hadoop by default, and there are three dependencies needed to access s3a://

    • If you plan to use the Apache Spark image, add these JARs to ${SPARK_HOME}/jars
  • We need to specify access credentials (e.g. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY)

  • Don't forget to add the lines below to spark-defaults.conf or as VM options

    • Option 1: spark-defaults.conf
    # I'm using MinIO for s3a
    spark.hadoop.fs.s3a.endpoint                http://minio:9000
    spark.hadoop.fs.s3a.access.key              xxx
    spark.hadoop.fs.s3a.secret.key              xxx
    spark.hadoop.fs.s3a.path.style.access       true
    spark.hadoop.fs.s3a.connection.ssl.enabled  false
    spark.hadoop.fs.s3a.impl                    org.apache.hadoop.fs.s3a.S3AFileSystem
    spark.hadoop.fs.s3a.fast.upload             true
    # for the driver
    spark.eventLog.dir                          s3a://spark-logs-test/
    # for the Spark History Server
    spark.history.fs.logDirectory               s3a://spark-logs-test/

    • Option 2: the SPARK_HISTORY_OPTS environment variable (e.g. in spark-env.sh)
    export SPARK_HISTORY_OPTS="-Dspark.eventLog.enabled=true \
      -Dspark.eventLog.dir=s3a://spark-logs-test/ \
      -Dspark.history.fs.logDirectory=s3a://spark-logs-test/ \
      -Dspark.hadoop.fs.s3a.endpoint=http://minio:9000 \
      -Dspark.hadoop.fs.s3a.access.key=xxx \
      -Dspark.hadoop.fs.s3a.secret.key=xxx \
      -Dspark.hadoop.fs.s3a.path.style.access=true \
      -Dspark.hadoop.fs.s3a.connection.ssl.enabled=false \
      -Dspark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem"
    
  • If you set something like s3a://spark-history/logs/ as the log directory, create the bucket and the directory and upload any file to that path first; MinIO ignores empty directories (see the sketch below).
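
A sketch of preparing that path with the AWS CLI pointed at the MinIO endpoint used above (assuming the same access key and secret are exported as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY):

# create the bucket and put a placeholder object under the log prefix
aws --endpoint-url http://minio:9000 s3 mb s3://spark-history
touch .keep
aws --endpoint-url http://minio:9000 s3 cp .keep s3://spark-history/logs/.keep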

Mendicity answered 24/6 at 2:0 Comment(0)
