Configure Spark on Yarn to use Hadoop native libraries

Summary

I am new to Spark and ran into an issue when saving text files with Snappy compression: I kept receiving the error message below. I followed many instructions from the Internet, but none of them worked for me. Eventually I found a workaround, but I would like someone to advise on the right solution.

java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
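
For reference, the failure can be reproduced with something as simple as the following spark-shell snippet (a minimal sketch; the output path is only a placeholder):

import org.apache.hadoop.io.compress.SnappyCodec

// Any small RDD will do; the error is thrown once the Snappy codec is actually used.
val rdd = sc.parallelize(Seq("a", "b", "c"))

// On Yarn this fails with the UnsatisfiedLinkError above, because the JVM has not
// loaded the native Hadoop library (libhadoop.so) that backs buildSupportsSnappy().
rdd.saveAsTextFile("hdfs:///tmp/snappy-test", classOf[SnappyCodec])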

Tech stack

  • Ubuntu 20.04.1 64-bit
  • Hadoop 3.3.0
  • Spark 3.0.0
  • OpenJDK 1.8.0_272

I only use the spark-shell to test my code and I start it using:

spark-shell --master yarn \
  --num-executors 1 \
  --executor-memory 512M

What I tried to resolve the issue

Added the following environment variables to .bashrc:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Added the following environment variables to spark-env.sh:

export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/hadoop/lib/native
export SPARK_YARN_USER_ENV="JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH,LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

Checked that the Snappy library is present:

hadoop checknative
...
Native library checking:
hadoop:  true /opt/hadoop/lib/native/libhadoop.so.1.0.0
zlib:    true /lib/x86_64-linux-gnu/libz.so.1
zstd  :  true /lib/x86_64-linux-gnu/libzstd.so.1
snappy:  true /lib/x86_64-linux-gnu/libsnappy.so.1
lz4:     true revision:10301
bzip2:   true /lib/x86_64-linux-gnu/libbz2.so.1
...
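
The loading can also be checked from inside the spark-shell driver itself, using the NativeCodeLoader class that appears in the error above:

// Returns true only if this JVM managed to load libhadoop.so.
org.apache.hadoop.util.NativeCodeLoader.isNativeCodeLoaded()

// Throws the same UnsatisfiedLinkError as above when the native library is missing.
org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()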

Workaround

I also tried running the spark-shell without Yarn, and I could save my RDD as a Snappy-compressed text file successfully, so the issue seemed to be Yarn-related. I added the following properties to spark-defaults.conf, which eventually got rid of the issue when Yarn was used. But I am not sure why this is actually needed, or whether it is the right approach to configuring Spark on Yarn to use the Hadoop native libraries.

spark.driver.extraLibraryPath /opt/hadoop/lib/native
spark.executor.extraLibraryPath /opt/hadoop/lib/native
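
For completeness, the same two properties can also be passed per session with --conf instead of editing spark-defaults.conf:

spark-shell --master yarn \
  --num-executors 1 \
  --executor-memory 512M \
  --conf spark.driver.extraLibraryPath=/opt/hadoop/lib/native \
  --conf spark.executor.extraLibraryPath=/opt/hadoop/lib/native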
Gorcock answered 30/10, 2020 at 11:12 Comment(1)
I guess it is because when Spark submits a task, the configuration items in spark-defaults.conf are merged by default. – Shipley
