scala · apache-spark · google-cloud-dataproc · spark-avro

Why is adding the org.apache.spark.avro dependency mandatory to read/write Avro files in Spark 2.4 while I'm using com.databricks.spark.avro?


I tried to run my Spark 2.3.0/Scala code on a Cloud Dataproc 1.4 cluster, which has Spark 2.4.8 installed. I got an error when reading Avro files. Here's my code:

sparkSession.read.format("com.databricks.spark.avro").load(input)

This code failed as expected. Then I added this dependency to my pom.xml file:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>2.4.0</version>
</dependency>

This made my code run successfully. And this is the part I don't understand: I'm still using the com.databricks.spark.avro format in my code. Why did adding the org.apache.spark.avro dependency solve my problem, given that I never reference it directly in my code?

I was expecting that I would need to change my code to something like this:

sparkSession.read.format("avro").load(input)

Solution

  • This is a historical artifact: Spark Avro support was initially added by Databricks in their proprietary Spark Runtime as the com.databricks.spark.avro format. When Avro support was later added to open-source Spark as the avro format, support for the com.databricks.spark.avro format name was retained for backward compatibility, controlled by the spark.sql.legacy.replaceDatabricksSparkAvro.enabled property:

    If it is set to true, the data source provider com.databricks.spark.avro is mapped to the built-in but external Avro data source module for backward compatibility.
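
    A minimal Scala sketch of how this mapping plays out (the input path is a placeholder, and this assumes a Spark 2.4+ runtime with the spark-avro module on the classpath):

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("avro-legacy-format")
      // Explicitly enable the legacy mapping; in Spark 2.4 this defaults to true,
      // which is why the old format name kept working once spark-avro was added.
      .config("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "true")
      .getOrCreate()

    // With the flag enabled, both format names resolve to the same
    // built-in (but external) Avro data source from the spark-avro module:
    val dfLegacy  = spark.read.format("com.databricks.spark.avro").load("/path/to/input")
    val dfBuiltin = spark.read.format("avro").load("/path/to/input")
    ```

    In other words, the old Databricks format name is just an alias: the actual implementation loaded at runtime is the org.apache.spark spark-avro module, which is why that dependency is required even though your code never names it.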