scala, apache-spark, hadoop, google-cloud-storage, apache-spark-standalone

Hadoop 3 gcs-connector doesn't work properly with the latest version of Spark 3 in standalone mode


I wrote a simple Scala application that reads a Parquet file from a GCS bucket. The application uses:

The connector is taken from Maven and imported via sbt (the Scala build tool). I'm not using the latest version, 2.2.9, because of this issue.

The application works perfectly in local mode, so I tried to switch to the standalone mode.

These are the steps I took:

  1. Downloaded Spark 3.3.1 from here
  2. Started the cluster manually like here

I tried to run the application again and faced this error:

[error] Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error]         at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
[error]         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
[error]         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
[error]         at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
[error]         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
[error]         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
[error]         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
[error]         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
[error]         at org.apache.parquet.hadoop.util.HadoopInputFile.fromStatus(HadoopInputFile.java:44)
[error]         at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
[error]         at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:484)
[error]         ... 14 more
[error] Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error]         at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
[error]         at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
[error]         ... 24 more

Somehow it cannot find the connector's file system class: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
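One way to narrow this down is to check whether the class named in the exception can be loaded at all by a given JVM. The following is only a minimal sketch using the plain JDK class-loading API; `ClasspathCheck` and `isLoadable` are hypothetical names introduced for illustration, and the check only approximates what Hadoop's `Configuration` does internally. The stack trace does not say which JVM (driver or executor) failed, so the result is meaningful only for the JVM the check runs in.

```scala
object ClasspathCheck {
  // Fully qualified class name copied from the ClassNotFoundException above.
  val GcsFileSystemClass = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"

  /** Returns true if `className` can be located by the current thread's context class loader. */
  def isLoadable(className: String): Boolean = {
    val loader = Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    try {
      // initialize = false: only locate and load the class, do not run static initializers.
      Class.forName(className, false, loader)
      true
    } catch {
      // LinkageError covers NoClassDefFoundError, e.g. when a superclass is missing.
      case _: ClassNotFoundException | _: LinkageError => false
    }
  }

  def main(args: Array[String]): Unit =
    println(s"$GcsFileSystemClass loadable in this JVM: ${isLoadable(GcsFileSystemClass)}")
}
```

If this reports `true` when run from the application in local mode but the cluster still fails, that would suggest the jar is visible to the driver only and not to the standalone workers.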

My Spark configuration is pretty basic:

spark.app.name = "Example app"
spark.master = "spark://YOUR_SPARK_MASTER_HOST:7077"
spark.hadoop.fs.defaultFS = "gs://YOUR_GCP_BUCKET"
spark.hadoop.fs.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
spark.hadoop.fs.AbstractFileSystem.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
spark.hadoop.google.cloud.auth.service.account.enable = true
spark.hadoop.google.cloud.auth.service.account.json.keyfile = "src/main/resources/gcp_key.json"

Solution

  • I've found that the Maven version of the GCS Hadoop connector is missing dependencies internally.

    I've fixed it by either:

    For the second option, I unpacked the GCS Hadoop connector jar file, located its pom.xml, copied the dependencies into a new standalone XML file, and downloaded them with the mvn dependency:copy-dependencies -DoutputDirectory=/path/to/pyspark/jars/ command.

    Here is the example pom.xml that I created. Please note that I am using version 2.2.9 of the connector:

    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
        <modelVersion>4.0.0</modelVersion>
        <name>TMP_PACKAGE_NAME</name>
        <description>
            jar dependencies of gcs hadoop connector
        </description>
        <!--'com.google.oauth-client:google-oauth-client:jar:1.34.1'
        -->
        <groupId>TMP_PACKAGE_GROUP</groupId>
        <artifactId>TMP_PACKAGE_NAME</artifactId>
        <version>0.0.1</version>
        <dependencies>
    
            <dependency>
                <groupId>com.google.cloud.bigdataoss</groupId>
                <artifactId>gcs-connector</artifactId>
                <version>hadoop3-2.2.9</version>
            </dependency>
    
            <dependency>
                <groupId>com.google.api-client</groupId>
                <artifactId>google-api-client-jackson2</artifactId>
                <version>2.1.0</version>
            </dependency>
    
            <dependency>
                <groupId>com.google.guava</groupId>
                <artifactId>guava</artifactId>
                <version>31.1-jre</version>
            </dependency>
            <dependency>
                <groupId>com.google.oauth-client</groupId>
                <artifactId>google-oauth-client</artifactId>
                <version>1.34.1</version>
            </dependency>
    
            <dependency>
                <groupId>com.google.cloud.bigdataoss</groupId>
                <artifactId>util</artifactId>
                <version>2.2.9</version>
            </dependency>
            <dependency>
                <groupId>com.google.cloud.bigdataoss</groupId>
                <artifactId>util-hadoop</artifactId>
                <version>hadoop3-2.2.9</version>
            </dependency>
            <dependency>
                <groupId>com.google.cloud.bigdataoss</groupId>
                <artifactId>gcsio</artifactId>
                <version>2.2.9</version>
            </dependency>
            <dependency>
                <groupId>com.google.auto.value</groupId>
                <artifactId>auto-value-annotations</artifactId>
                <version>1.10.1</version>
                <scope>runtime</scope>
            </dependency>
    
            <dependency>
                <groupId>com.google.flogger</groupId>
                <artifactId>flogger</artifactId>
                <version>0.7.4</version>
            </dependency>
    
            <dependency>
                <groupId>com.google.flogger</groupId>
                <artifactId>google-extensions</artifactId>
                <version>0.7.4</version>
            </dependency>
    
            <dependency>
                <groupId>com.google.flogger</groupId>
                <artifactId>flogger-system-backend</artifactId>
                <version>0.7.4</version>
            </dependency>
    
            <dependency>
                <groupId>com.google.code.gson</groupId>
                <artifactId>gson</artifactId>
                <version>2.10</version>
            </dependency>
    
        </dependencies>
    </project>
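
After copying the dependencies, it can be worth confirming that some jar in the output directory actually contains the class named in the stack trace. Below is a minimal sketch in Scala that relies only on the JDK's `java.util.jar` API. `JarScan`, `entryName` and `jarsContaining` are hypothetical helper names, not part of Spark or the connector, and the directory argument stands for whatever path was passed to -DoutputDirectory.

```scala
import java.io.File
import java.util.jar.JarFile

object JarScan {
  // Class name copied from the ClassNotFoundException in the question.
  val GcsFileSystemClass = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"

  /** Converts a fully qualified class name into its entry path inside a jar. */
  def entryName(className: String): String =
    className.replace('.', '/') + ".class"

  /** Opens the jar, checks for the entry, and always closes the file handle. */
  private def containsEntry(jar: File, entry: String): Boolean = {
    val jarFile = new JarFile(jar)
    try jarFile.getJarEntry(entry) != null
    finally jarFile.close()
  }

  /** Returns the jars directly under `dir` (non-recursive) that contain `className`, sorted by name. */
  def jarsContaining(dir: File, className: String): Seq[File] = {
    val entry = entryName(className)
    // listFiles() returns null when `dir` is not a readable directory.
    val candidates = Option(dir.listFiles()).getOrElse(Array.empty[File])
    candidates
      .filter(f => f.isFile && f.getName.endsWith(".jar"))
      .filter(jar => containsEntry(jar, entry))
      .sortBy(_.getName)
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    // args(0): directory to scan; defaults to the current directory.
    val dir = new File(args.headOption.getOrElse("."))
    val hits = jarsContaining(dir, GcsFileSystemClass)
    if (hits.isEmpty) println(s"No jar in $dir contains $GcsFileSystemClass")
    else hits.foreach(jar => println(s"Found in ${jar.getName}"))
  }
}
```

An empty result would mean the class is still not available from that directory, so the same ClassNotFoundException should be expected.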
    
    

    Hope this helps.