apache-sparkapache-kafkaavrospark-structured-streamingspark-avro

How to extract schema id from avro message in Spark Scala


I have spark Scala dataframe with column contains the value of avro message(Array[Byte]). I know that the 0 byte is the magic byte and the bytes in positions 1-4 included is the schema id. how can i extract those bytes (1-4) and add new column with the schema id value in int?

need to use some spark functions/udf in spark Scala to extrace the schema id value


Solution

  • You could do something like:

    import org.apache.spark.sql.functions.udf
    
    // extract the schema id (bytes 1-4) and convert it to integer
    val extractSchemaId = udf((message: Array[Byte]) => {
      ByteBuffer.wrap(message.slice(1, 5)).getInt()
    })
    
    // assuming the "value" column holds the Avro message
    df.withColumn("schema_id", extractSchemaId(df("value")))