How to configure Spark 2.4 correctly with user-provided Hadoop

I'd like to use Spark 2.4.5 (the current stable Spark version) and Hadoop 2.10 (the current stable Hadoop version in the 2.x series). Further, I need to access HDFS, Hive, S3, and Kafka.

http://spark.apache.org provides Spark 2.4.5 pre-built and bundled with either Hadoop 2.6 or Hadoop 2.7. Another option is to use Spark with user-provided Hadoop, so I tried that one.

As a consequence of using the build with user-provided Hadoop, Spark does not include the Hive libraries either, so enabling Hive support fails with an error like the one described here: How to create SparkSession with Hive support (fails with "Hive classes are not found")?
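
For illustration, enabling Hive support in a spark-shell session on the user-provided-Hadoop build fails roughly like this (a sketch; the exact wording may differ between versions):

$SPARK_HOME/bin/spark-shell
scala> org.apache.spark.sql.SparkSession.builder.enableHiveSupport()
java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.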

When I add the spark-hive dependency to spark-shell (spark-submit is affected as well) by using

spark.jars.packages=org.apache.spark:spark-hive_2.11:2.4.5

in spark-defaults.conf, I get this error:

20/02/26 11:20:45 ERROR spark.SparkContext: 
Failed to add file:/root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar to Spark environment
java.io.FileNotFoundException: Jar /root/.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar not found
at org.apache.spark.SparkContext.addJarFile$1(SparkContext.scala:1838)
at org.apache.spark.SparkContext.addJar(SparkContext.scala:1868)
at org.apache.spark.SparkContext.$anonfun$new$11(SparkContext.scala:458)
at org.apache.spark.SparkContext.$anonfun$new$11$adapted(SparkContext.scala:458)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:106)

This happens because spark-shell cannot handle classifiers together with bundle dependencies; see https://github.com/apache/spark/pull/21339 and https://github.com/apache/spark/pull/17416

A workaround for the classifier problem looks like this:

$ cp .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2-hadoop2.jar .../.ivy2/jars/org.apache.avro_avro-mapred-1.8.2.jar

but DevOps won't accept this.
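
For completeness, the copy workaround could be generalized over all classifier JARs in the Ivy cache with a small shell loop (a sketch; the cache path and the -hadoop2 suffix are taken from the error above), but it is still the same hack:

# copy each classifier JAR to the unclassified file name that SparkContext.addJar expects
for jar in /root/.ivy2/jars/*-hadoop2.jar; do
  cp "$jar" "${jar%-hadoop2.jar}.jar"
done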

The complete list of dependencies looks like this (I have added line breaks for better readability):

root@a5a04d888f85:/opt/spark-2.4.5/conf# cat spark-defaults.conf
spark.jars.packages=com.fasterxml.jackson.datatype:jackson-datatype-jdk8:2.9.10,
com.fasterxml.jackson.datatype:jackson-datatype-jsr310:2.9.10,
org.apache.spark:spark-hive_2.11:2.4.5,
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5,
org.apache.hadoop:hadoop-aws:2.10.0,
io.delta:delta-core_2.11:0.5.0,
org.postgresql:postgresql:42.2.5,
mysql:mysql-connector-java:8.0.18,
com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,
io.prestosql:presto-jdbc:307

(everything works - except for Hive)

  • Is the combination of Spark 2.4.5 and Hadoop 2.10 used anywhere? How?
  • How can I combine Spark 2.4.5 with user-provided Hadoop and Hadoop 2.9 or 2.10?
  • Is it necessary to build Spark myself to get around the Hive dependency problem?
Andromeda answered 2/3, 2020 at 8:4 Comment(4)
Curious about the comment regarding pre-built versions. I only see prebuilt binaries for Hadoop 2.6 and Hadoop 2.7, but this suggests the availability is Hadoop 2.7 and Hadoop 2.8. Was this an off-by-one error, or has something changed?Vidovik
I want to upgrade my hadoop-aws to 3.2. Currently I'm stuck with Spark's dependency on Hadoop 2.7. Is it possible to shade hadoop-aws? I will be using an uber jar.Jacquiline
@Jacquiline It might work in your case. Yet in my experience this normally leads to problems, especially with AWS dependencies. It's exactly for this reason that I build Spark with consistent dependencies.Andromeda
My Spark cluster is on 2.2.1 with pre-built Hadoop 2.7. It is not working since the 2.7 jars are already on the classpath; I'm getting a ClassNotFound error for the Hadoop config.Jacquiline

There does not seem to be an easy way to configure Spark 2.4.5 with user-provided Hadoop to use Hadoop 2.10.0.

Since my actual task was to minimize dependency problems, I chose to compile Spark 2.4.5 against Hadoop 2.10.0.

./dev/make-distribution.sh \
  --name hadoop-2.10.0 \
  --tgz \
  -Phadoop-2.7 -Dhadoop.version=2.10.0 \
  -Phive -Phive-thriftserver \
  -Pyarn

Now Maven deals with the Hive dependencies/classifiers, and the resulting package is ready to be used.

In my personal opinion, compiling Spark is actually easier than configuring the Spark build with user-provided Hadoop.

Integration tests so far have not shown any problems; Spark can access both HDFS and S3 (MinIO).
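
As a rough smoke test, the resulting distribution can be unpacked and checked like this (the tarball name should follow from the --name flag above; the bucket is a placeholder, and hadoop-aws still comes in via spark.jars.packages as in the spark-defaults.conf from the question):

tar -xzf spark-2.4.5-bin-hadoop-2.10.0.tgz
cd spark-2.4.5-bin-hadoop-2.10.0
./bin/spark-shell
scala> spark.sql("SHOW DATABASES").show()                          // Hive classes are bundled now
scala> spark.read.text("s3a://some-bucket/some-file.txt").show()   // S3A against MinIO/S3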

Update 2021-04-08

If you want to add support for Kubernetes, just add -Pkubernetes to the list of arguments.
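
For reference, the full build invocation then becomes:

./dev/make-distribution.sh \
  --name hadoop-2.10.0 \
  --tgz \
  -Phadoop-2.7 -Dhadoop.version=2.10.0 \
  -Phive -Phive-thriftserver \
  -Pyarn -Pkubernetes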

Andromeda answered 6/3, 2020 at 13:14 Comment(2)
This approach seems sound. I did the same with a target version of Hadoop 2.8.5 successfully. For an extra measure of reassurance, I compared the versions of every override in any of Spark's build profiles and found that they had not changed at all. I suspect the same is true of Hadoop 2.10.0. Setting -Dhadoop.version does not appear to affect the compilation, only the dependencies that get bundled, which is what you are tweaking manually when you customize a user-provided-Hadoop release by dropping in jars from a chosen Hadoop version. This just lets the build system do that work for you.Vidovik
It also seems to be the documentation-provided method to build with a targeted Hadoop version. It's worth scanning over these directions; otherwise you'll miss things like setting MAVEN_OPTS to account for the build's memory requirements! spark.apache.org/docs/latest/…Vidovik

Assuming you don't want to run Spark-on-YARN: start from the "Spark 2.4.5 with Hadoop 2.7" bundle, then cherry-pick the Hadoop libraries to upgrade from the "Hadoop 2.10.x" bundle.

  1. Discard the spark-yarn / hadoop-yarn-* / hadoop-mapreduce-client-* JARs because you won't need them, except hadoop-mapreduce-client-core, which is referenced by write operations on HDFS and S3 (cf. "MR commit procedure" V1 or V2)
    • you may also discard the spark-mesos / mesos-* and/or spark-kubernetes / kubernetes-* JARs depending on what you plan to run Spark on
    • you may also discard the spark-hive-thriftserver and hive-* JARs if you don't plan to run a "thrift server" instance, except hive-metastore, which is necessary for, as you might guess, managing the Metastore (either a regular Hive Metastore service or an embedded Metastore inside the Spark session)
  2. Discard the hadoop-hdfs / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl JARs
  3. Replace them with the hadoop-hdfs-client / hadoop-common / hadoop-auth / hadoop-annotations / htrace-core* / xercesImpl / stax2-api JARs from Hadoop 2.10 (under common/ and common/lib/, or hdfs/ and hdfs/lib/); a shell sketch of steps 2-4 follows this list
  4. Add the S3A connector from Hadoop 2.10, i.e. the hadoop-aws / jets3t / woodstox-core JARs (under tools/lib/)
  5. Download aws-java-sdk from Amazon (it cannot be bundled with Hadoop because it's not under the Apache license, I guess)
  6. and finally, run a lot of tests...
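
A minimal shell sketch of steps 2-4, assuming Spark is unpacked at $SPARK_HOME and the Hadoop 2.10 distribution at $HADOOP_HOME (paths and JAR name patterns are approximations; verify them against the list above):

# step 2: drop the Hadoop 2.7 client JARs shipped with the Spark bundle
cd "$SPARK_HOME/jars"
rm -f hadoop-hdfs-*.jar hadoop-common-*.jar hadoop-auth-*.jar \
      hadoop-annotations-*.jar htrace-core*.jar xercesImpl-*.jar

# steps 3 and 4: copy the replacements and the S3A connector from Hadoop 2.10
for pat in 'hadoop-hdfs-client-*' 'hadoop-common-*' 'hadoop-auth-*' 'hadoop-annotations-*' \
           'htrace-core*' 'xercesImpl-*' 'stax2-api-*' 'hadoop-aws-*' 'jets3t-*' 'woodstox-core-*'; do
  find "$HADOOP_HOME/share/hadoop" -name "${pat}.jar" -not -name '*-tests.jar' \
    -exec cp {} "$SPARK_HOME/jars/" \;
done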


That worked for me, after some trial and error, with a caveat: I ran my tests against an S3-compatible storage system, but not against the "real" S3, and not against regular HDFS. And without a "real" Hive Metastore service, just the embedded in-memory & volatile Metastore that Spark runs by default.
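
For context, pointing the S3A connector at an S3-compatible store such as MinIO usually only needs a handful of extra settings, e.g. in spark-defaults.conf (endpoint and credentials below are placeholders):

spark.hadoop.fs.s3a.endpoint=http://minio:9000
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.access.key=<access key>
spark.hadoop.fs.s3a.secret.key=<secret key>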


For the record, the process is the same with Spark 3.0.0 previews and Hadoop 3.2.1, except that
  • you also have to upgrade guava
  • you don't have to upgrade xercesImpl, htrace-core, or stax2-api
  • you don't need jets3t any more
  • you need to retain more hadoop-mapreduce-client-* JARs (probably because of the new "S3 committers")
Ego answered 2/3, 2020 at 14:52 Comment(4)
Don't suppose you've a bash script to set this up? ;)Piny
My client owns that script... and they are quite stubborn about not sharing anything, for any reason.Ego
Hmm. Rewrite said script in Ansible/Chef? :)Piny
Thank you for your very detailed answer. So indeed it's not easy. I will run some tests today and give feedback later.Andromeda
