Problem
I am trying to run a remote Spark job through IntelliJ against a Spark HDInsight cluster (HDI 4.0). In my Spark application, I am trying to read an input stream from a folder of Parquet files in Azure Blob Storage using Spark Structured Streaming's built-in readStream function.
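For context, the streaming read looks roughly like the sketch below; the container, storage account, folder path, and schema are placeholders rather than my real values:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder()
  .appName("BlobParquetStream")
  .getOrCreate()

// File-based streaming sources need an explicit schema unless
// spark.sql.streaming.schemaInference is enabled.
val schema = new StructType()
  .add("id", LongType)
  .add("value", StringType)

// Read a stream of Parquet files from a folder in Azure Blob Storage.
val input = spark.readStream
  .schema(schema)
  .parquet("wasbs://<container>@<account>.blob.core.windows.net/input-folder/")
```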
The code works as expected when I run it on a Zeppelin notebook attached to the HDInsight cluster. However, when I deploy my Spark application to the cluster, I encounter the following error:
java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
As a result, I am unable to read any data from blob storage.
The little information I found online suggested that this is caused by a version conflict between Spark and Hadoop. The application is run with Spark 2.4, prebuilt for Hadoop 2.7.
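One way I checked the mismatch (a small diagnostic sketch, not part of my application; the object name is just for illustration) is to print the Hadoop version that the job actually picks up on the cluster and compare it with the version my local Spark build ships with:

```scala
import org.apache.hadoop.util.VersionInfo
import org.apache.spark.sql.SparkSession

object HadoopVersionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("HadoopVersionCheck").getOrCreate()
    // Hadoop version on the application's classpath: my local Spark 2.4
    // build reports 2.7.x, while HDI 4.0 ships Hadoop 3.1.x.
    println(s"Spark version:  ${spark.version}")
    println(s"Hadoop version: ${VersionInfo.getVersion}")
    spark.stop()
  }
}
```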
Fix
To fix this, I ssh into each head and worker node of the cluster and manually downgrade the Hadoop dependencies from 3.1.x to 2.7.3 to match the version in my local spark/jars folder. After doing this, I am able to deploy my application successfully. Downgrading the cluster from HDI 4.0 is not an option, as it is the only cluster version that supports Spark 2.4.
Summary
To summarize, could the issue be that I am using a Spark download prebuilt for Hadoop 2.7? Is there a better way to resolve this conflict than manually downgrading the Hadoop versions on the cluster's nodes or changing the Spark version I am using?