apache-kafka, hdfs, avro, flume

Using a different Avro schema for new columns


I am using Flume + Kafka to sink log data to HDFS. My sink data type is Avro, and the Avro schema (.avsc) has 80 fields as columns.

So I created an external table like this:

CREATE EXTERNAL TABLE pgar.tiz_biaws_fraud
PARTITIONED BY(partition_date INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/datapool/flume/biaws/fraud'
TBLPROPERTIES ('avro.schema.url'='hdfs://xxxx-ns/data/datapool/flume/biaws/fraud.avsc')
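
The fraud.avsc file referenced by avro.schema.url is just a plain Avro record schema. As a minimal sketch (the record name, namespace, and field names here are made up, and only two of the 80 fields are shown):

    {
      "type": "record",
      "name": "fraud",
      "namespace": "biaws",
      "fields": [
        {"name": "transaction_id", "type": "string"},
        {"name": "amount", "type": "double"}
      ]
    }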

Now I need to add 25 more columns to the Avro schema. In that case:

If I create a new table with the new 105-column schema, I will have two tables for one project, and if I add or remove more columns in the coming days, I will have to create yet another table. I am afraid of ending up with a lot of tables that use different schemas for the same project.

If I swap the old schema for the new one in the current table, I will have only one table per project, but I can't read the old data anymore because of the schema conflict.
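
Swapping here would mean either overwriting fraud.avsc in place or repointing the table at a new schema file, something like this (fraud_v2.avsc is a hypothetical name):

    ALTER TABLE pgar.tiz_biaws_fraud
    SET TBLPROPERTIES ('avro.schema.url'='hdfs://xxxx-ns/data/datapool/flume/biaws/fraud_v2.avsc');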

What is the best way to use an Avro schema in a case like this?


Solution

  • I added the new columns to the Avro schema like this:

    {"name":"newColumn1", "type": "string", "default": ""},
    {"name":"newColumn2", "type": "string", "default": ""},
    {"name":"newColumn3", "type": "string", "default": ""},
    

    When I use the default property, if a column doesn't exist in the current data, it returns the default value; if the column does exist, it returns the actual value as expected.

    For setting a null value as the default, you need this (note that "null" comes first in the union):

    { "name": "newColumn4", "type": [ "null", "string" ], "default": null },
    

    or, if no explicit default is needed,

    { "name": "newColumn5", "type": [ "null", "string" ]},
    

    In both cases "null" comes first in the type list: the Avro specification requires a union field's default value to match the first type in the union, so "default": null only works when "null" is the first branch. A nullable field without a default, like newColumn5, can hold nulls in new data but does not supply a value when the column is missing from old data.
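
    As a quick check (the partition value below is hypothetical), old partitions written before the schema change return the defaults for the new columns, while new partitions return the real values:

    SELECT partition_date, newColumn1, newColumn4
    FROM pgar.tiz_biaws_fraud
    WHERE partition_date = 20160101;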