hadoop, apache-spark

java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics


I'm trying to run a simple Spark-to-S3 app from a server, but I keep getting the error below because the server has Hadoop 2.7.3 installed, and it looks like it doesn't include the org.apache.hadoop.fs.StorageStatistics class (added in Hadoop 2.8). I have Hadoop 2.8.x defined in my pom.xml file, but I'm trying to test it by running it locally.

How can I make it skip looking for that class, or what are my options for providing it if I have to stay on Hadoop 2.7.3?

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:348)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:301)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:441)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:425)
    at com.ibm.cos.jdbc2DF$.main(jdbc2DF.scala:153)
    at com.ibm.cos.jdbc2DF.main(jdbc2DF.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 28 more

Solution

  • You can't mix bits of Hadoop and expect things to work. It's not just the close coupling between internal classes in hadoop-common and hadoop-aws, it's things like the specific version of the AWS SDK the hadoop-aws module was built with.

    If you get ClassNotFoundException or MethodNotFoundException stack traces when trying to work with s3a:// URLs, JAR version mismatch is the likely cause; a quick way to confirm which Hadoop build is actually being loaded is sketched at the end of this answer.

    Using the RFC 2119 MUST/SHOULD/MAY terminology, here are the rules to avoid this situation:

    1. The s3a connector is in the hadoop-aws JAR; it depends on hadoop-common and the appropriate shaded AWS SDK JARs.
    2. All these JARs MUST be on the classpath.
    3. All the hadoop-* JARs on your classpath MUST be exactly the same version, e.g. 3.4.1 everywhere, or 3.3.6. Otherwise: stack trace. Always. (A minimal build sketch follows this list.)
    4. And they MUST be exclusively of that version; there MUST NOT be multiple versions of hadoop-common, hadoop-aws etc. on the classpath. Otherwise: stack trace. Always. Usually a ClassNotFoundException or MethodNotFoundException indicating a mismatch between hadoop-common and hadoop-aws.
    5. The exact missing classes/methods vary across Hadoop releases: it's the first class depended on by org.apache.hadoop.fs.s3a.S3AFileSystem which the classloader can't find. Which class that is depends on the exact mismatch of JARs.
    6. The AWS SDK JAR SHOULD be the huge AWS SDK bundle, unless you know exactly which bits of the AWS SDK stack you need and are confident all transitive dependencies (jackson, httpclient, ...) are in your Spark distribution and compatible. Otherwise: missing classes or odd runtime issues.
    7. Hadoop 3.3.x and earlier use the v1 SDK, whose bundled shaded JAR is called aws-java-sdk-bundle.
    8. Hadoop 3.4.0 and later use the v2 AWS SDK, whose bundled shaded SDK JAR is called bundle.jar.
    9. You can't add a v2 SDK to a Hadoop release built for the v1 SDK (or a v1 SDK to Hadoop 3.4+) and expect the s3a connector to work.
    10. There MUST NOT be any other v1 or v2 AWS SDK JARs on your classpath. Otherwise: duplicate classes and general classpath problems. Note: the AWS v1 and v2 SDKs can coexist, as they use completely different classes in different packages.
    11. The AWS v1 and v2 SDKs are completely different. If the wrong one is on your classpath for the hadoop version you are using, expect ClassNotFoundException.
    12. Apart from AWS credential providers, v1 SDK extensions (signers, etc.) do not work with the v2 SDK.
    13. The specific version of the AWS SDK you need can be determined from the hadoop-aws entry on Maven Repository: look at the SDK version it declares as a dependency.
    14. Changing the AWS SDK versions MAY work. You get to test, and if there are compatibility problems: you get to fix. See Qualifying an AWS SDK Update for the least you should be doing.
    15. You SHOULD use the most recent version of Hadoop you can, ideally one your Spark release is tested with. Non-critical bug fixes do not get backported to old Hadoop releases, and the S3A and ABFS connectors are rapidly evolving. New releases will be better, stronger, faster. Generally.
    16. If there's a feature/fix on a recent version of hadoop which isn't on the one you are using, you SHALL NOT ask for a backport, unless you want it closed as "invalid please upgrade". After all: it's a free upgrade.
    17. You MAY fork your own hadoop/spark/aws sdk versions and cherry-pick commits from more recent builds. This is likely the only alternative to upgrading. All commercial vendors of hadoop, spark, iceberg etc. do this for the S3A code. Apart from AWS themselves, nobody goes near the AWS SDK.
    18. There is no need for the spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem declaration that most Spark+S3A examples on SO include. That is a superstition passed on from one SO answer to the next. The binding is already set in core-default.xml inside hadoop-common.jar; manually setting it can only break things.
    19. If none of this works, a bug report filed on the ASF JIRA server will get closed as WORKSFORME. Config issues aren't treated as code bugs.
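
    As an illustration of rules 1-8, here is a minimal sketch of a consistent dependency set, written as an sbt build definition (the same idea applies to a Maven pom). The version numbers are placeholders, not recommendations: use whatever Hadoop line your Spark distribution was actually built and tested against, and let hadoop-aws pull in its matching shaded AWS SDK bundle transitively instead of pinning an SDK yourself.

        // build.sbt sketch: one Hadoop version everywhere, nothing declared twice.
        // Version numbers are illustrative placeholders, not recommendations.
        val sparkVersion  = "3.5.1"   // whatever your cluster actually runs
        val hadoopVersion = "3.3.6"   // MUST match the cluster's hadoop-common version

        libraryDependencies ++= Seq(
          // Spark itself comes from the cluster, so it is "provided", not bundled
          "org.apache.spark"  %% "spark-sql"  % sparkVersion % "provided",
          // hadoop-aws at exactly the same version as hadoop-common; it pulls in the
          // matching shaded AWS SDK bundle transitively, so no SDK is declared here
          "org.apache.hadoop" %  "hadoop-aws" % hadoopVersion
        )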

    Finally: the ASF documentation: The S3A Connector.

    Note: that link is to the latest release. If you are using an older release, it will lack features. Upgrade before complaining that the s3a connector doesn't do what the documentation says it does.
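
    One quick way to confirm what is actually being loaded (the check mentioned near the top of this answer): ask the JVM which Hadoop build it sees and which JAR the FileSystem class came from. A rough sketch, assuming you can open a spark-shell or add a couple of lines to your driver:

        // Sketch: verify which Hadoop build is really on the Spark classpath.
        import org.apache.hadoop.fs.FileSystem
        import org.apache.hadoop.util.VersionInfo

        // The version baked into the hadoop-common JAR that was actually loaded
        println("Hadoop version on classpath: " + VersionInfo.getVersion)

        // The physical JAR the FileSystem class was loaded from; if this is not
        // the hadoop-* version you expect, the classpath is mixed
        println("hadoop-common loaded from: " +
          classOf[FileSystem].getProtectionDomain.getCodeSource.getLocation)

    If the version reported there is not the one your hadoop-aws JAR was built against, fix the classpath before changing anything else.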