I tried to run my Spark/Scala code (built against Spark 2.3.0) on a Cloud Dataproc 1.4 cluster, where Spark 2.4.8 is installed. I hit an error when reading Avro files. Here's my code:
sparkSession.read.format("com.databricks.spark.avro").load(input)
This code failed as expected. Then I added this dependency to my pom.xml file:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>2.4.0</version>
</dependency>
which made my code run successfully. This is the part I don't understand: I'm still using the com.databricks.spark.avro format in my code. Why did adding the org.apache.spark:spark-avro_2.11 dependency solve my problem, given that I'm not actually referencing it in my code?
I was expecting that I would need to change my code to something like this:
sparkSession.read.format("avro").load(input)
This is a historical artifact. Avro support was initially added by Databricks in their proprietary Spark runtime as the com.databricks.spark.avro format. When Avro support was later added to open-source Spark as the avro format, support for the com.databricks.spark.avro name was retained for backward compatibility, controlled by the spark.sql.legacy.replaceDatabricksSparkAvro.enabled property:
If it is set to true, the data source provider com.databricks.spark.avro is mapped to the built-in but external Avro data source module for backward compatibility.
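For reference, here's a minimal sketch of what's going on (the input path and app name are hypothetical). The property defaults to true in Spark 2.4, which is why the old format name kept working as soon as spark-avro was on the classpath; it is set explicitly below only for clarity:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avro-legacy-format") // hypothetical app name
  // Defaults to true in Spark 2.4; shown explicitly for illustration.
  .config("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")
  .getOrCreate()

val input = "gs://my-bucket/data/*.avro" // hypothetical path

// With the property enabled, both calls resolve to the same
// built-in (but external) Avro data source from spark-avro:
val dfLegacy = spark.read.format("com.databricks.spark.avro").load(input)
val dfNew    = spark.read.format("avro").load(input)

So both format strings work once the spark-avro module is on the classpath; setting the property to false would make the com.databricks.spark.avro name fail again, forcing a migration to the avro format name.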