I have a AVRO schema which is currently in single avsc file like below. Now I want to move address record to a different common avsc file which should be referenced from many other avsc file. So Customer and address will be separate avsc files. How can I separate them and and have customer avsc file reference address avsc file. Also how would both the files can be processed using python. I am currently using fast avro in python3 to process the single avsc file but open to use any other utility in python3 or pyspark.
File name - customer_details.avsc
[
{
"type": "record",
"namespace": "com.company.model",
"name": "AddressRecord",
"fields": [
{
"name": "streetaddress",
"type": "string"
},
{
"name": "city",
"type": "string"
},
{
"name": "state",
"type": "string"
},
{
"name": "zip",
"type": "string"
}
]
},
{
"namespace": "com.company.model",
"type": "record",
"name": "Customer",
"fields": [
{
"name": "firstname",
"type": "string"
},
{
"name": "lastname",
"type": "string"
},
{
"name": "email",
"type": "string"
},
{
"name": "phone",
"type": "string"
},
{
"name": "address",
"type": {
"type": "array",
"items": "com.company.model.AddressRecord"
}
}
]
}
]
import fastavro
s1 = fastavro.schema.load_schema('customer_details.avsc')
How can split the schema in different file where address record file can be referenced from other avsc file. Then how would I process multiple avsc files using fast Avro (Python) or any other python utility?
To do this, the schema for the AddressRecord
should be in a file called com.company.model.AddressRecord.avsc
with the following contents:
{
"type": "record",
"namespace": "com.company.model",
"name": "AddressRecord",
"fields": [
{
"name": "streetaddress",
"type": "string"
},
{
"name": "city",
"type": "string"
},
{
"name": "state",
"type": "string"
},
{
"name": "zip",
"type": "string"
}
]
}
The Customer
schema doesn't necessarily need a special naming convention since it is the top level schema, but it's probably a good idea to follow the same convention. So it would be in a file named com.company.model.Customer.avsc
with the following contents:
{
"namespace": "com.company.model",
"type": "record",
"name": "Customer",
"fields": [
{
"name": "firstname",
"type": "string"
},
{
"name": "lastname",
"type": "string"
},
{
"name": "email",
"type": "string"
},
{
"name": "phone",
"type": "string"
},
{
"name": "address",
"type": {
"type": "array",
"items": "com.company.model.AddressRecord"
}
}
]
}
The files must be in the same directory.
Then you should be able to do fastavro.schema.load_schema('com.company.model.Customer.avsc')