How to configure Spark 2.4 correctly with user-provided Hadoop

I'd like to use Spark 2.4.5 (the current stable Spark version) and Hadoop 2.10 (the current stable Hadoop version in the 2.x series). Further, I need to access HDFS, Hive, S3, and Kafka.

http://spark.apache.org provides Spark 2.4.5 pre-built and bundled with either Hadoop 2.6 or Hadoop 2.7. Another option is to use Spark with user-provided Hadoop, so I tried that one.

As a consequence of using the build with user-provided Hadoop, Spark does not include the Hive libraries either, so enabling Hive support fails with an error like the one described here: How to create SparkSession with Hive support (fails with "Hive classes are not found")?
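
For illustration, enabling Hive support in a spark-shell session on the user-provided-Hadoop build fails roughly like this (a sketch; the exact wording may differ between versions):

$SPARK_HOME/bin/spark-shell
scala> org.apache.spark.sql.SparkSession.builder.enableHiveSupport()
java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.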

When I add the spark-hive dependency to spark-shell (spark-submit is affected as well) by using

spark.jars.packages=org.apache.spark:spark-hive_2.11:2.4.5

in spark-defaults.conf, I get this error:

20/02/26 11:20:45 ERROR spark.SparkContext: 
Failed to add file:/root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar to Spark environment
java.io.FileNotFoundException: Jar /root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar not found
at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1838)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1868)
at org.apache.spark.SparkContext.$anonfun$new$11(SparkContext.scala:458)
at org.apache.spark.SparkContext.$anonfun$new$11$adapted(SparkContext.scala:458)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)

This happens because spark-shell cannot handle classifiers together with bundle dependencies; see https://github.com/apache/spark/pull/21339 and https://github.com/apache/spark/pull/17416

A workaround for the classifier problem looks like this:

$ cp .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2-hadoop2.jar .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar

but DevOps won't accept this.
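
For completeness, the copy workaround could be generalized over all classifier JARs in the Ivy cache with a small shell loop (a sketch; the cache path and the -hadoop2 suffix are taken from the error above), but it is still the same hack:

# copy each classifier JAR to the unclassified file name that SparkContext.addJar expects
for jar in /root/.ivy2/jars/*-hadoop2.jar; do
  cp "$jar" "${jar%-hadoop2.jar}.jar"
done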

The complete list of dependencies looks like this (I have added line breaks for better readability):

root@a5a04d888f85:/opt/spark-2.4.5/conf# cat spark-defaults.conf
spark.jars.packages=com.fasterxml.jackson.datatype:jackson-datatype-jdk8:2.9.10,
com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.9.10,
org.apache.spark:spark-hive_2.11:2.4.5,
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,
org.apache.hadoop:hadoop-aws:2.10.0,
io.delta:delta-core_2.11:0.5.0,
org.postgresql:postgresql:42.2.5,
mysql:mysql-connector-java:8.0.18,
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,
io.prestosql:presto-jdbc:307

(everything works - except for Hive)

  • Is the combination of Spark 2.4.5 and Hadoop 2.10 used anywhere? How?
  • How can I combine Spark 2.4.5 with user-provided Hadoop and Hadoop 2.9 or 2.10?
  • Is it necessary to build Spark myself to get around the Hive dependency problem?
Andromeda answered 2/3, 2020 at 8:4 Comment(4)
Curious about the comment regarding pre-built versions. I only see prebuilt binaries for Hadoop 2.6 and Hadoop 2.7, but this suggests the availability is Hadoop 2.7 and Hadoop 2.8. Was this an off-by-one error, or has something changed?Vidovik
I want to upgrade my hadoop-aws to 3.2. Currently I'm stuck with Spark's dependency on Hadoop 2.7. Is it possible to shade hadoop-aws? I will be using an uber jar.Jacquiline
@Jacquiline It might work in your case. Yet in my experience this normally leads to problems, especially with AWS dependencies. It's exactly for this reason that I build Spark with consistent dependencies.Andromeda
My Spark cluster is on 2.2.1 with pre-built Hadoop 2.7. It is not working since the 2.7 jars are already on the classpath; I'm getting a ClassNotFound error for the Hadoop config.Jacquiline

There does not seem to be an easy way to configure Spark 2.4.5 with user-provided Hadoop to use Hadoop 2.10.0.

Since my actual task was to minimize dependency problems, I chose to compile Spark 2.4.5 against Hadoop 2.10.0.

./dev/make-distribution.sh \
  --name hadoop-2.10.0 \
  --tgz \
  -Phadoop-2.7 -Dhadoop.version=2.10.0 \
  -Phive -Phive-thriftserver \
  -Pyarn

Now Maven deals with the Hive dependencies/classifiers, and the resulting package is ready to be used.

In my personal opinion, compiling Spark is actually easier than configuring the Spark build with user-provided Hadoop.

Integration tests so far have not shown any problems; Spark can access both HDFS and S3 (MinIO).
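
As a rough smoke test, the resulting distribution can be unpacked and checked like this (the tarball name should follow from the --name flag above; the bucket is a placeholder, and hadoop-aws still comes in via spark.jars.packages as in the spark-defaults.conf from the question):

tar -xzf spark-2.4.5-bin-hadoop-2.10.0.tgz
cd spark-2.4.5-bin-hadoop-2.10.0
./bin/spark-shell
scala> spark.sql("SHOW DATABASES").show()                          // Hive classes are bundled now
scala> spark.read.text("s3a://some-bucket/some-file.txt").show()   // S3A against MinIO/S3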

Update 2021-04-08

If you want to add support for Kubernetes, just add -Pkubernetes to the list of arguments.
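
For reference, the full build invocation then becomes:

./dev/make-distribution.sh \
  --name hadoop-2.10.0 \
  --tgz \
  -Phadoop-2.7 -Dhadoop.version=2.10.0 \
  -Phive -Phive-thriftserver \
  -Pyarn -Pkubernetes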

Andromeda answered 6/3, 2020 at 13:14 Comment(2)
This approach seems sound. I did the same with a target version of Hadoop 2.8.5 successfully. For an extra measure of reassurance, I compared the versions of every override in any of Spark's build profiles and found that they had not changed at all. I suspect the same is true of Hadoop 2.10.0. Setting -Dhadoop.version does not appear to affect the compilation, only the dependencies that get bundled, which is what you are tweaking manually when you customize a user-provided-Hadoop release by dropping in jars from a chosen Hadoop version. This just lets the build system do that work for you.Vidovik
It also seems to be the documentation-provided method to build with a targeted Hadoop version. It's worth scanning over these directions; otherwise you'll miss things like setting MAVEN_OPTS to account for the build's memory requirements! spark.apache.org/docs/latest/…Vidovik

Assuming you don't want to run Spark-on-YARN: start from the "Spark 2.4.5 with Hadoop 2.7" bundle, then cherry-pick the Hadoop libraries to upgrade from the "Hadoop 2.10.x" bundle.

  1. Discard the spark-yarn / hadoop-yarn-* / hadoop-mapreduce-client-* JARs because you won't need them, except hadoop-mapreduce-client-core, which is referenced by write operations on HDFS and S3 (cf. "MR commit procedure" V1 or V2)
    • you may also discard the spark-mesos / mesos-* and/or spark-kubernetes / kubernetes-* JARs depending on what you plan to run Spark on
    • you may also discard the spark-hive-thriftserver and hive-* JARs if you don't plan to run a "thrift server" instance, except hive-metastore, which is necessary for, as you might guess, managing the Metastore (either a regular Hive Metastore service or an embedded Metastore inside the Spark session)
  2. Discard the hadoop-hdfs / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl JARs
  3. Replace them with the hadoop-hdfs-client / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl / stax2-api JARs from Hadoop 2.10 (under common/ and common/lib/, or hdfs/ and hdfs/lib/); a shell sketch of steps 2-4 follows this list
  4. Add the S3A connector from Hadoop 2.10, i.e. the hadoop-aws / jets3t / woodstox-core JARs (under tools/lib/)
  5. Download aws-java-sdk from Amazon (it cannot be bundled with Hadoop because it's not under the Apache license, I guess)
  6. and finally, run a lot of tests...
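
A minimal shell sketch of steps 2-4, assuming Spark is unpacked at $SPARK_HOME and the Hadoop 2.10 distribution at $HADOOP_HOME (paths and JAR name patterns are approximations; verify them against the list above):

# step 2: drop the Hadoop 2.7 client JARs shipped with the Spark bundle
cd "$SPARK_HOME/jars"
rm -f hadoop-hdfs-*.jar hadoop-common-*.jar hadoop-auth-*.jar \
      hadoop-annotations-*.jar htrace-core*.jar xercesImpl-*.jar

# steps 3 and 4: copy the replacements and the S3A connector from Hadoop 2.10
for pat in 'hadoop-hdfs-client-*' 'hadoop-common-*' 'hadoop-auth-*' 'hadoop-annotations-*' \
           'htrace-core*' 'xercesImpl-*' 'stax2-api-*' 'hadoop-aws-*' 'jets3t-*' 'woodstox-core-*'; do
  find "$HADOOP_HOME/share/hadoop" -name "${pat}.jar" -not -name '*-tests.jar' \
    -exec cp {} "$SPARK_HOME/jars/" \;
done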


That worked for me, after some trial and error, with a caveat: I ran my tests against an S3-compatible storage system, but not against the "real" S3, and not against regular HDFS. And without a "real" Hive Metastore service, just the embedded in-memory & volatile Metastore that Spark runs by default.
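
For context, pointing the S3A connector at an S3-compatible store such as MinIO usually only needs a handful of extra settings, e.g. in spark-defaults.conf (endpoint and credentials below are placeholders):

spark.hadoop.fs.s3a.endpoint=http://minio:9000
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.access.key=<access key>
spark.hadoop.fs.s3a.secret.key=<secret key>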


For the record, the process is the same with Spark 3.0.0 previews and Hadoop 3.2.1, except that
  • you also have to upgrade guava
  • you don't have to upgrade xercesImpl, htrace-core, or stax2-api
  • you don't need jets3t any more
  • you need to retain more hadoop-mapreduce-client-* JARs (probably because of the new "S3 committers")
Ego answered 2/3, 2020 at 14:52 Comment(4)
Don't suppose you've a bash script to set this up? ;)Piny
My client owns that script... and they are quite stubborn about not sharing anything, for any reason.Ego
Hmm. Rewrite said script in Ansible/Chef? :)Piny
Thank you for your very detailed answer. So indeed it's not easy. I will run some tests today and give feedback later.Andromeda
