I am trying to run a remote Spark job through IntelliJ against a Spark HDInsight cluster (HDI 4.0). In my Spark application I am trying to read an input stream from a folder of parquet files in Azure blob storage using Spark Structured Streaming's built-in readStream function.
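Roughly, the read looks like this; the storage account, container, path, and schema below are placeholders for my actual setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().appName("ParquetStreamReader").getOrCreate()

// Structured Streaming file sources need the schema declared up front
val schema = new StructType()
  .add("id", LongType)
  .add("value", StringType)

// wasbs://<container>@<account>.blob.core.windows.net/<path> -- placeholder path
val input = spark.readStream
  .schema(schema)
  .parquet("wasbs://mycontainer@myaccount.blob.core.windows.net/input/")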
The code works as expected when I run it on a Zeppelin notebook attached to the HDInsight cluster. However, when I deploy my Spark application to the cluster, I encounter the following error:
java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
Subsequently, I am unable to read any data from blob storage.
The little information I found online suggested that this is caused by a version conflict between Spark and Hadoop. The application is run with Spark 2.4 prebuilt for Hadoop 2.7.
To fix this, I SSH into each head and worker node of the cluster and manually downgrade the Hadoop dependencies from 3.1.x to 2.7.3 to match the version in my local spark/jars folder. After doing this, I am able to deploy my application successfully. Downgrading the cluster from HDI 4.0 is not an option, as it is the only cluster version that supports Spark 2.4.
To summarize, could the issue be that I am using a Spark download prebuilt for Hadoop 2.7? Is there a better way to fix this conflict than manually downgrading the Hadoop versions on the cluster's nodes or changing the Spark version I am using?
After troubleshooting some of the approaches I had previously attempted, I came across the following fix:
In my pom.xml I excluded the hadoop-client dependency automatically pulled in by the spark-core jar. This dependency was version 2.6.5, which conflicted with the cluster's version of Hadoop. Instead, I import the version I require explicitly:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.version.major}</artifactId>
    <version>${spark.version}</version>
    <exclusions>
        <!-- Exclude the transitive hadoop-client (2.6.5) pulled in by spark-core -->
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>

<!-- Import the hadoop-client version that matches the cluster -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
</dependency>
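To double-check that the exclusion actually takes effect, the Maven dependency tree can be inspected; the filter below is just one way to narrow the output to the Hadoop client artifact:

mvn dependency:tree -Dincludes=org.apache.hadoop:hadoop-client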
After making this change, I encountered the error java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0. Further research revealed this was due to a problem with the Hadoop configuration on my local machine. Per this article's advice, I replaced the winutils.exe I had under C://winutils/bin with the version I required and also added the corresponding hadoop.dll. After making these changes, I was able to read data from blob storage as expected.
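For reference, a minimal sketch of how the native binaries get picked up when running locally, assuming winutils.exe and hadoop.dll sit in C:\winutils\bin (setting hadoop.home.dir is only needed if HADOOP_HOME is not already defined, and the local path is a placeholder):

import org.apache.spark.sql.SparkSession

object LocalSmokeTest {
  def main(args: Array[String]): Unit = {
    // Assumption: winutils.exe and hadoop.dll are in C:\winutils\bin.
    // Hadoop's Windows native code is located via hadoop.home.dir / HADOOP_HOME.
    System.setProperty("hadoop.home.dir", "C:\\winutils")

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("LocalSmokeTest")
      .getOrCreate()

    // Any filesystem access below this point exercises NativeIO on Windows.
    // Placeholder local path; substitute a real parquet folder to test.
    spark.read.parquet("C:\\tmp\\sample-parquet").show(5)

    spark.stop()
  }
}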
TLDR
The issue was the auto-imported hadoop-client dependency, which was fixed by excluding it and adding the new winutils.exe and hadoop.dll under C://winutils/bin.
With this fix, there was no need to downgrade the Hadoop versions on the HDInsight cluster or change my downloaded Spark version.