Using AWS Firehose to ingest data into an Iceberg table managed by AWS Glue, I'm unable to insert timestamp data.
Firehose
I'm trying to insert data using the following script:
json_data = json.dumps(
{
"ADF_Record": {
"foo": "bar",
"baz": "2024-09-04T18:56:15.114"
},
"ADF_Metadata": {
"OTF_Metadata": {
"DestinationDatabaseName": "my_db",
"DestinationTableName": "my_table",
"Operation": "INSERT"
}
}
}
)
response = boto3.client("firehose").put_record(
DeliveryStreamName="my_stream",
Record={"Data": json_data.encode()}
)
Note that the baz
value corresponds to a timestamp of type TimestampType.withoutZone as referenced in the Firehose documentation.
Glue
foo
: string
baz
: timestamp
Error
Whenever I try to insert data using this method, no data is delivered and I get this error on Firehose side :
Firehose is unable to convert column data in your record to the column type specified within the schema. Table: my_db.my_table
Things I tried
baz
is removed from the payload (the pipeline seems functional without timestamp).1725476175114000
) doesn't help. Glue creates a new version of the table with baz
as date
and the written data is not legible.org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
and timestamp.formats
parameter doesn't help. Glue creates a new version of the table and removes the SerDe parameters altogether.I'm about to give up and just write timestamps as string. Any insight is appreciated!
I think I figured it out. Modifying the schema of your Glue table has no incidence on the schema of the underlying Iceberg table (i.e. metadata are not altered by Glue), it only goes one way. This resulted in an inconsistency between schemas : baz
was a date
for Iceberg and a timestamp
for Glue.
What worked was to delete baz
with PyIceberg (an Athena query might do the trick too) and recreate it with the timestamp
type. This will update the metadata on Iceberg side and the Glue schema altogether. Now, baz
is properly interpreted by Firehose, Glue and Iceberg.