pythonamazon-web-servicesaws-glueamazon-kinesis-firehoseapache-iceberg

How do I insert timestamp data into an AWS Glue managed Iceberg table using AWS Firehose?


Using AWS Firehose to ingest data into an Iceberg table managed by AWS Glue, I'm unable to insert timestamp data.

Firehose

I'm trying to insert data using the following script:

json_data = json.dumps(
    {
        "ADF_Record": {
            "foo": "bar",
            "baz": "2024-09-04T18:56:15.114"
        },
        "ADF_Metadata": { 
            "OTF_Metadata": {
                "DestinationDatabaseName": "my_db",
                "DestinationTableName": "my_table",
                "Operation": "INSERT"
            }
        }
    }
)

response = boto3.client("firehose").put_record(
    DeliveryStreamName="my_stream",
    Record={"Data": json_data.encode()}
)

Note that the baz value corresponds to a timestamp of type TimestampType.withoutZone as referenced in the Firehose documentation.

Glue

Error

Whenever I try to insert data using this method, no data is delivered and I get this error on Firehose side :

Firehose is unable to convert column data in your record to the column type specified within the schema. Table: my_db.my_table

Things I tried

I'm about to give up and just write timestamps as string. Any insight is appreciated!


Solution

  • I think I figured it out. Modifying the schema of your Glue table has no incidence on the schema of the underlying Iceberg table (i.e. metadata are not altered by Glue), it only goes one way. This resulted in an inconsistency between schemas : baz was a date for Iceberg and a timestamp for Glue.

    What worked was to delete baz with PyIceberg (an Athena query might do the trick too) and recreate it with the timestamp type. This will update the metadata on Iceberg side and the Glue schema altogether. Now, baz is properly interpreted by Firehose, Glue and Iceberg.