I have a Spark Scala DataFrame with a column that contains the value of an Avro message (Array[Byte]). I know that byte 0 is the magic byte and that bytes 1-4 (inclusive) hold the schema id. How can I extract those bytes (1-4) and add a new column with the schema id as an Int?
I need to use Spark functions or a UDF in Spark Scala to extract the schema id value.
You could do something like:
import java.nio.ByteBuffer
import org.apache.spark.sql.functions.udf
// extract the schema id (bytes 1-4) and convert it to an integer
val extractSchemaId = udf((message: Array[Byte]) => {
  // slice(1, 5) takes bytes 1-4; getInt() reads them as a big-endian Int
  ByteBuffer.wrap(message.slice(1, 5)).getInt()
})
// assuming the "value" column holds the Avro message
val withSchemaId = df.withColumn("schema_id", extractSchemaId(df("value")))
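Note that the UDF above will throw if a row holds a null or a message shorter than 5 bytes. If that can happen in your data, a more defensive variant might look like the sketch below. The helper name parseSchemaId is made up for illustration, and it assumes the id is stored big-endian after a magic byte of 0 (as in the Confluent wire format), so adjust it if your producer differs.

```scala
import java.nio.ByteBuffer

// Hypothetical helper (the name is illustrative, not a Spark or Avro API).
// Returns the schema id, or None when the payload is null, shorter than
// 5 bytes, or does not start with the magic byte 0.
def parseSchemaId(message: Array[Byte]): Option[Int] =
  if (message != null && message.length >= 5 && message(0) == 0)
    // wrap(array, offset, length) positions the buffer at byte 1;
    // ByteBuffer is big-endian by default, so getInt reads bytes 1-4 in order
    Some(ByteBuffer.wrap(message, 1, 4).getInt)
  else
    None
```

Wrapping it with udf(parseSchemaId _) should give a nullable integer column, since Spark maps a None returned from a Scala UDF to null, so malformed rows end up with a null schema_id instead of failing the job.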