I wrote a simple Scala application which reads a parquet file from GCS bucket. The application uses :
The connector is taken from Maven, imported via sbt (Scala build tool). I'm not using the latest, 2.2.9, version because of this issue.
The application works perfectly in local mode, so I tried to switch to the standalone mode.
What I did is these steps:
I tried to run the application again and faced this error:
[error] Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error] at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2688)
[error] at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
[error] at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
[error] at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
[error] at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
[error] at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
[error] at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
[error] at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
[error] at org.apache.parquet.hadoop.util.HadoopInputFile.fromStatus(HadoopInputFile.java:44)
[error] at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:44)
[error] at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:484)
[error] ... 14 more
[error] Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
[error] at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
[error] at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)
[error] ... 24 more
Somehow it cannot detect connector's file system: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
My spark configuration is pretty basic:
spark.app.name = "Example app"
spark.master = "spark://YOUR_SPARK_MASTER_HOST:7077"
spark.hadoop.fs.defaultFS = "gs://YOUR_GCP_BUCKET"
spark.hadoop.fs.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
spark.hadoop.fs.AbstractFileSystem.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
spark.hadoop.google.cloud.auth.service.account.enable = true
spark.hadoop.google.cloud.auth.service.account.json.keyfile = "src/main/resources/gcp_key.json"
I ve found out that the maven version of GCS hadoop connector, is missing dependecies internally.
Ive fixed it by either:
to resolve the second option, I did unpack the gcs hadoop connector jar file, looked for the pom.xml, copy dependencies to a new stand alone xml file, and download them using mvn dependency:copy-dependencies -DoutputDirectory=/path/to/pyspark/jars/
command
here is example pom.xml that Ive created, please note I am using the 2.2.9 version of the connector
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<name>TMP_PACKAGE_NAME</name>
<description>
jar dependencies of gcs hadoop connector
</description>
<!--'com.google.oauth-client:google-oauth-client:jar:1.34.1'
-->
<groupId>TMP_PACKAGE_GROUP</groupId>
<artifactId>TMP_PACKAGE_NAME</artifactId>
<version>0.0.1</version>
<dependencies>
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>gcs-connector</artifactId>
<version>hadoop3-2.2.9</version>
</dependency>
<dependency>
<groupId>com.google.api-client</groupId>
<artifactId>google-api-client-jackson2</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>31.1-jre</version>
</dependency>
<dependency>
<groupId>com.google.oauth-client</groupId>
<artifactId>google-oauth-client</artifactId>
<version>1.34.1</version>
</dependency>
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>util</artifactId>
<version>2.2.9</version>
</dependency>
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>util-hadoop</artifactId>
<version>hadoop3-2.2.9</version>
</dependency>
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>gcsio</artifactId>
<version>2.2.9</version>
</dependency>
<dependency>
<groupId>com.google.auto.value</groupId>
<artifactId>auto-value-annotations</artifactId>
<version>1.10.1</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>com.google.flogger</groupId>
<artifactId>flogger</artifactId>
<version>0.7.4</version>
</dependency>
<dependency>
<groupId>com.google.flogger</groupId>
<artifactId>google-extensions</artifactId>
<version>0.7.4</version>
</dependency>
<dependency>
<groupId>com.google.flogger</groupId>
<artifactId>flogger-system-backend</artifactId>
<version>0.7.4</version>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.10</version>
</dependency>
</dependencies>
</project>
hope this helps