Every attempt to write a file in Avro format fails with the stack trace below.
We are using Spark 2.4.3 (with user-provided Hadoop) and Scala 2.12, and we load the Avro package at runtime with either spark-shell:
spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.3
or spark-submit:
spark-submit --packages org.apache.spark:spark-avro_2.12:2.4.3 ...
The Spark session reports loading the Avro package successfully.
In either case, the moment we attempt to write any data in Avro format, like:
df.write.format("avro").save("hdfs:///path/to/outputfile.avro")
or with a select:
df.select("recordidstring").write.format("avro").save("hdfs:///path/to/outputfile.avro")
... we get the same stack trace (this copy is from spark-shell):
java.lang.NoSuchMethodError: org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema;
at org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185)
at org.apache.spark.sql.avro.SchemaConverters$.$anonfun$toAvroType$1(SchemaConverters.scala:176)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
at org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174)
at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$prepareWrite$2(AvroFileFormat.scala:119)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
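One way to narrow this down, as a minimal sketch rather than a confirmed diagnosis: a NoSuchMethodError like the one above usually means the class was loaded from a different version of the library than the caller was compiled against. The snippet below, pasted into the same spark-shell session, prints where the JVM loaded org.apache.avro.Schema from and whether that copy has a createUnion overload taking a Schema array (the signature named in the error). The helper names codeSourceOf and hasMethod are made up for illustration; only JDK reflection calls are used.

```scala
import scala.util.Try

// Hypothetical helper: where did the JVM load this class from?
def codeSourceOf(className: String): String = {
  val cls = Class.forName(className)
  Option(cls.getProtectionDomain.getCodeSource)
    .flatMap(cs => Option(cs.getLocation))
    .map(_.toString)
    .getOrElse("(bootstrap or unknown)")
}

// Hypothetical helper: does the loaded class expose a public method with
// exactly these parameter types? Array types use the JVM form, e.g. "[Lpkg.Name;".
def hasMethod(className: String, method: String, paramClassNames: String*): Boolean = {
  val cls = Class.forName(className)
  val params = paramClassNames.map(n => Class.forName(n))
  try { cls.getMethod(method, params: _*); true }
  catch { case _: NoSuchMethodException => false }
}

// Wrapped in Try so a missing Avro class is reported as a Failure instead of throwing.
println(Try(codeSourceOf("org.apache.avro.Schema")))
println(Try(hasMethod("org.apache.avro.Schema", "createUnion", "[Lorg.apache.avro.Schema;")))
```

If the second line prints Success(false), the Schema class that was loaded does not have the overload that spark-avro calls, and the first line shows which jar it came from. With a user-provided Hadoop, an older Avro jar on the Hadoop classpath is one plausible source, but that is an assumption to verify, not something established here.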
We are able to write other formats (delimited text, JSON, ORC, Parquet) without any trouble.
We are using HDFS (Hadoop v3.1.2) as the filestore.
I have experimented with different versions of the Avro package (e.g. 2.11, and lower versions), which either raise the same error or fail to load entirely due to incompatibility. The error occurs with Python, Scala (using the shell or spark-submit) and Java (using spark-submit).
There appears to be an open issue on the apache.org JIRA for this, but it is a year old now without any resolution. I've bumped that issue, but I'm also wondering whether the community has a fix. Any help much appreciated.
This issue appears to be specific to the configuration of our local cluster: single-node builds of HDFS (locally on Windows, on other Linux machines, etc.) write Avro fine. We will rebuild the problem cluster, but I'm confident the issue is a bad config on that cluster only. Solution: rebuild.
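For anyone comparing a failing cluster against a working single-node build before rebuilding, here is a rough sketch of one check: list every copy of a class file that the class loader can see. More than one location for org/apache/avro/Schema.class would suggest competing Avro jars on the classpath. The helper name resourceLocations is hypothetical; the calls are plain JDK ClassLoader APIs, and the sketch assumes it is run in the driver JVM (for example from spark-shell).

```scala
import scala.collection.JavaConverters._

// Hypothetical helper: every location from which the given resource path is visible.
def resourceLocations(resourcePath: String): List[String] = {
  // Fall back to the system class loader if no context class loader is set.
  val loader = Option(Thread.currentThread.getContextClassLoader)
    .getOrElse(ClassLoader.getSystemClassLoader)
  loader.getResources(resourcePath).asScala.map(_.toString).toList
}

// One line per visible copy of the Avro Schema class (prints nothing if Avro is absent).
resourceLocations("org/apache/avro/Schema.class").foreach(println)
```

Running the same lines on the working build and on the problem cluster and diffing the output should show whether the two environments resolve Avro from different jars.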