I have a Python pipeline that generates an Avro file from a Python dataclass and loads it into BigQuery, but I keep having issues with the datetime field request_ts.
In my BQ schema, request_ts is typed as TIMESTAMP. My Avro schema is generated by the py_avro_schema library from a dataclass.
In it, I was declaring request_ts as Optional[datetime], but apparently this is wrong: Avro then just maps the value to a regular string, whereas Avro timestamps are longs with an additional logical type. So I manually modify my schema:
schema_bytes = generate(MyClass, namespace="myclass")
schema_str = schema_bytes.decode("utf-8") if isinstance(schema_bytes, bytes) else schema_bytes
schema = json.loads(schema_str)
schema["fields"][0] = {
"name": "request_ts",
"type": [
"null",
{
"type": "long",
"logicalType": "timestamp-micros"
}
],
"default": None,
}
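A slightly more robust version of this patch (a sketch, assuming the generated schema is a record with a top-level "fields" list) looks the field up by name instead of hard-coding index 0, so it keeps working if the field order changes:

```python
import json

# Stand-in for the schema produced by py_avro_schema's generate().
schema = json.loads("""
{
  "type": "record",
  "name": "MyClass",
  "fields": [
    {"name": "request_ts", "type": ["null", "string"], "default": null}
  ]
}
""")

# Swap the field's type for a nullable timestamp-micros long,
# keeping "null" first so the default of None stays valid.
for field in schema["fields"]:
    if field["name"] == "request_ts":
        field["type"] = ["null", {"type": "long", "logicalType": "timestamp-micros"}]
```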
I've tried generating request_ts as a Python datetime, an ISO-format string, and an integer Unix timestamp:
# tried each approach individually
datetime.datetime.now(tz=datetime.timezone.utc)
datetime.datetime.now(tz=datetime.timezone.utc).isoformat()
int(datetime.datetime.now(tz=datetime.timezone.utc).timestamp())
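For context, Avro's timestamp-micros logical type stores a single long: microseconds since the Unix epoch. None of the three values above is that, as this illustration (with a made-up fixed instant) shows:

```python
import datetime

now = datetime.datetime(2024, 1, 1, 12, 0, tzinfo=datetime.timezone.utc)

# What timestamp-micros actually expects: microseconds since 1970-01-01 UTC.
micros = int(now.timestamp() * 1_000_000)
print(micros)                # 1704110400000000

# The three attempts above produce, respectively:
print(now)                   # a datetime object, not a long
print(now.isoformat())       # '2024-01-01T12:00:00+00:00', a string
print(int(now.timestamp()))  # 1704110400 -- seconds, a factor of 1e6 too small
```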
But it always leads to errors in BigQuery: it is unable to parse request_ts into a TIMESTAMP.
"type": ["null",{"type": "long","logicalType":"timestamp-millis"}]I've also tried to just select a timestamp in duckdb. This is typed by python as a Pandas Timestamp, which also fails because it doesn't map to an avro long.
So I seem to be missing something. I thought that maybe the BQ load job had to be manually configured to use Avro logical types, but apparently this is not an option in the Python API, and simply having the logical types specified in the schema should make BQ try to parse the types properly.
I additionally tried recreating the table without time partitioning; in that case the job succeeds, but the timestamp is an epoch timestamp in 1970, not the current one.
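That 1970 value is a telltale sign of a unit mismatch: a whole-seconds Unix timestamp, when reinterpreted as *microseconds* since the epoch, lands minutes after 1970-01-01. A small sketch (using a made-up 2024 instant) reproduces it:

```python
import datetime

# A whole-seconds Unix timestamp, ~1.7e9 for early 2024.
secs = int(datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc).timestamp())

# If the consumer interprets that number as microseconds since the epoch,
# the resulting instant is only ~28 minutes after 1970-01-01.
as_micros = datetime.datetime.fromtimestamp(secs / 1_000_000, tz=datetime.timezone.utc)
print(as_micros)  # 1970-01-01 00:28:24.067200+00:00
```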
The main issue was actually how the timestamp was being generated.
py-avro-schema can actually annotate the schema correctly. But the timestamp value provided from Python must match the expected Avro logical type: if the data is to be loaded into BQ as a TIMESTAMP field, it must be a timestamp in microseconds. So whatever value is passed from Python, it has to be like
ts = int(datetime.now(timezone.utc).timestamp()) * 1_000_000
And then the library is able to convert it to Avro correctly, and BigQuery accepts it as a TIMESTAMP.
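One caveat worth noting about the line above: calling int() before multiplying truncates to whole seconds, so any sub-second precision is lost. Multiplying first keeps it (a minor variation, same idea):

```python
from datetime import datetime, timezone

now = datetime.now(timezone.utc)

# Truncates to whole seconds, then scales to microseconds (loses sub-second part):
ts_whole = int(now.timestamp()) * 1_000_000

# Scales first, then truncates, preserving microsecond precision:
ts_exact = int(now.timestamp() * 1_000_000)
```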