Tags: apache-spark, google-bigquery, protobuf-java

Spark + BigQuery: `InvocationTargetException: java.lang.VerifyError: Bad type on operand stack`


I've tried this with several combinations of language, JDK, and Spark version, and I get the same error every time. Below is the Scala + JDK 11 + Spark 3.3.1 attempt, but as noted, every combination fails with the same error:

  1. Set `JAVA_HOME` to JDK 11 and `SPARK_HOME` to Spark 3.3.1, then start the Scala spark-shell with the BigQuery and GCS connectors configured:

```shell
export JAVA_HOME=$(/usr/libexec/java_home -v 11)
export SPARK_HOME=~/opt/spark/spark-3.3.1-bin-hadoop3-scala2.13
$SPARK_HOME/bin/spark-shell \
  -c spark.hadoop.fs.AbstractFileSystem.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS \
  -c spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  --packages "com.google.cloud.spark:spark-bigquery-with-dependencies_2.13:0.28.0,com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.10"
```
  2. Set up a dummy test DataFrame:

```scala
import org.apache.spark.sql._
import org.apache.spark.sql.types._

val df = spark.createDataFrame(
  java.util.List.of(
    Row(1, "foo"),
    Row(2, "bar")
  ), StructType(
    StructField("a", IntegerType) ::
    StructField("b", StringType) ::
    Nil))

df.show()
```

That results in:

```
+---+---+
|  a|  b|
+---+---+
|  1|foo|
|  2|bar|
+---+---+
```
  3. Write the simple DataFrame to BigQuery:

```scala
df.write.
  format("bigquery").
  mode("overwrite").
  option("project", "<redacted>").
  option("parentProject", "<redacted>").
  option("dataset", "<redacted>").
  option("credentials", bigquery_credentials_b64).
  option("temporaryGcsBucket", "<redacted>").
  save("test_table")
```

I get:

```
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:137)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3467)
  at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
  at com.google.cloud.spark.bigquery.SparkBigQueryUtil.getUniqueGcsPath(SparkBigQueryUtil.java:127)
  at com.google.cloud.spark.bigquery.SparkBigQueryUtil.createGcsPath(SparkBigQueryUtil.java:108)
  ... 75 elided
Caused by: java.lang.reflect.InvocationTargetException: java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/google/api/ClientProto.registerAllExtensions(Lcom/google/protobuf/ExtensionRegistryLite;)V @4: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @4
    flags: { }
    locals: { 'com/google/protobuf/ExtensionRegistryLite' }
    stack: { 'com/google/protobuf/ExtensionRegistryLite', 'com/google/protobuf/GeneratedMessage$GeneratedExtension' }
  Bytecode:
    0000000: 2ab2 0002 b600 032a b200 04b6 0003 2ab2
    0000010: 0005 b600 03b1

  at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
  at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
  at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:135)
  ... 83 more
Caused by: java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/google/api/ClientProto.registerAllExtensions(Lcom/google/protobuf/ExtensionRegistryLite;)V @4: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @4
    flags: { }
    locals: { 'com/google/protobuf/ExtensionRegistryLite' }
    stack: { 'com/google/protobuf/ExtensionRegistryLite', 'com/google/protobuf/GeneratedMessage$GeneratedExtension' }
  Bytecode:
    0000000: 2ab2 0002 b600 032a b200 04b6 0003 2ab2
    0000010: 0005 b600 03b1

  ... 5 elided and 88 more
```
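A `VerifyError` like the one above typically means two incompatible copies of protobuf-java ended up on the classpath (the `GeneratedMessage`/`ExtensionLite` class hierarchy changed between protobuf-java major versions). As a hypothetical diagnostic (the `jarOf` helper below is mine, not from any of the libraries involved), you can paste this into spark-shell to see which jar each of the conflicting classes is actually loaded from:

```scala
// Hypothetical helper: report which jar (or the bootstrap classpath) a class
// was loaded from, by inspecting its CodeSource.
def jarOf(className: String): String =
  Option(Class.forName(className).getProtectionDomain.getCodeSource)
    .map(_.getLocation.toString)
    .getOrElse("<bootstrap classpath>")

// The classes named in the VerifyError; compare the jars they resolve to:
//   jarOf("com.google.protobuf.ExtensionLite")
//   jarOf("com.google.api.ClientProto")
println(jarOf("java.lang.String")) // JDK classes report the bootstrap classpath
```

If the two protobuf-related classes resolve to different jars, that points at the clashing dependencies.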

Solution

  • The fix is to build custom shaded .jars. Managed Spark environments such as Databricks and Amazon EMR have already resolved these dependency conflicts, but getting this working in a local environment with spark-shell is quite involved.
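As a sketch of what "custom shaded .jars" means in practice: with the sbt-assembly plugin you can relocate the protobuf packages inside your own uber-jar so they cannot clash with the copies bundled in the connectors. This is an illustrative `build.sbt` fragment under the assumption that sbt-assembly is enabled, not a verified working configuration; the `myshaded` package name is arbitrary:

```scala
// build.sbt sketch -- assumes the sbt-assembly plugin is enabled.
// ShadeRule.rename relocates com.google.protobuf into a private namespace,
// so this build's protobuf copy cannot collide with one bundled elsewhere.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.protobuf.**" -> "myshaded.protobuf.@1").inAll
)
```

The same relocation idea is available in Maven via the shade plugin's `<relocations>` configuration.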