I have a Spark DataFrame with the following schema:
StructType(
StructField(id,StringType,true),
StructField(type,StringType,true)
)
I need to convert it to Avro with the to_avro function from spark-avro, like so: to_avro(spark_df, jsonFormatSchema), using the following Avro schema:
{
"type": "record",
"name": "Value",
"fields": [
{
"name": "id",
"type": "string"
},
{
"name": "type",
"type": "string"
},
{
"name": "x",
"type": [
"null",
"string"
],
"default": null
},
{
"name": "y",
"type": [
{
"type": "boolean",
"connect.default": false
},
"null"
],
"default": false
}
]
}
Now, obviously my Spark DataFrame does not have the columns x and y. How do I define the Avro schema so that the Avro binary my DataFrame is serialized into contains null/default values for those fields, instead of to_avro throwing an IncompatibleSchemaException?
I thought the "null" entry in the type union would take care of fields that are not present in the input DataFrame, but that turned out to be wrong.
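For reference, the call looks roughly like this (a PySpark sketch; the struct wrapping and the variable names are illustrative on my part, since to_avro takes a column rather than a whole DataFrame):

# Assumes Spark 3.x with the spark-avro package on the classpath,
# e.g. --packages org.apache.spark:spark-avro_2.12:<your Spark version>.
from pyspark.sql.avro.functions import to_avro
from pyspark.sql.functions import struct

jsonFormatSchema = """..."""  # the Avro schema above, as a JSON string

# to_avro encodes a single column, so the whole row is wrapped in a struct.
# This is the call that throws IncompatibleSchemaException, because spark_df
# has no x or y columns.
avro_df = spark_df.select(
    to_avro(struct(*spark_df.columns), jsonFormatSchema).alias("value")
)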
The problem is that default values are only used when decoding, not when encoding. See this section of the Avro specification: https://avro.apache.org/docs/current/specification/#schema-record
Specifically, this part:
A default value for this field, only used when reading instances that lack the field for schema evolution purposes. The presence of a default value does not make the field optional at encoding time.
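So at encoding time the writer needs every field of the record to actually be present in the input. A sketch of one possible workaround (assuming PySpark, and that you are willing to supply the defaults yourself) is to add the missing fields as literal columns before calling to_avro:

from pyspark.sql.avro.functions import to_avro
from pyspark.sql.functions import struct, lit

# Add the fields the DataFrame lacks, using the defaults from the Avro schema.
spark_df_full = (
    spark_df
    .withColumn("x", lit(None).cast("string"))  # nullable string, default null
    .withColumn("y", lit(False))                # boolean, default false
)

avro_df = spark_df_full.select(
    to_avro(struct(*spark_df_full.columns), jsonFormatSchema).alias("value")
)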