I'm experiencing timestamp precision issues when reading Delta tables created by an Azure Data Factory CDC dataflow. The pipeline extracts data from Azure SQL Database (using native CDC enabled on the source node in the dataflow) to Azure Data Lake Gen2 as Delta format, but reading the Delta table with Python libraries fails due to timestamp precision mismatches.
Deltalake Approach
from deltalake import DeltaTable

delta_table_uri = f"abfss://{container}@{account_name}.dfs.core.windows.net/{table_path}"
# storage_options holds the Azure credentials (same dict used for the polars call below)
dt = DeltaTable(delta_table_uri, storage_options=storage_options)
df = dt.to_pandas()
Error:
ArrowInvalid: Casting from timestamp[ns] to timestamp[us, tz=UTC] would lose data: 4379020357742690304
Polars Approach
import polars as pl

df = pl.read_delta(delta_table_uri, storage_options=storage_options)
Error:
SchemaError: data type mismatch for column DeliveryDate: expected: datetime[μs, UTC], found: datetime[ns]
The Azure SQL Database datetime2 columns are set to precision 3 (millisecond precision).
The polars table schema reports microsecond precision (datetime[μs, UTC]):
('DeliveryDate', Datetime(time_unit='us', time_zone='UTC')),
I've seen some esoteric suggestions online that this is due to Spark writing these timestamps in the legacy INT96 format (?); a quick way to check that is sketched after these notes.
I've verified that the source SQL Server timestamps are millisecond precision (datetime2(3))
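To confirm the INT96 suspicion, one can inspect the physical types of a single Parquet data file from the Delta table. A minimal sketch with pyarrow, assuming one data file has been downloaded locally (the file name below is a placeholder):

import pyarrow.parquet as pq

# Point this at one of the Delta table's Parquet data files (placeholder name)
schema = pq.ParquetFile("part-00000-example.snappy.parquet").schema

for i in range(len(schema)):
    col = schema.column(i)
    # A physical_type of INT96 confirms the legacy Spark/Impala timestamp encoding
    print(col.name, col.physical_type)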
Questions
Root Cause: Why is there a mismatch between the Delta table's reported schema (microseconds) and the actual data precision (nanoseconds)?
Azure Data Factory: Is there a setting in ADF to control timestamp precision when writing to Delta format?
Python Workaround: How can I read this Delta table in Python while handling the precision mismatch? Are there specific parameters for deltalake or polars to handle this?
Long-term Solution: What's the recommended approach to ensure timestamp precision consistency in this Azure ecosystem?
Additionally: the Delta table was created through Azure Data Factory's built-in Delta format support, and the issue occurs specifically with datetime columns coming from the CDC process.
I believe the issue lies with pyarrow trying to read these timestamp columns.
Any insights on resolving this timestamp precision issue would be greatly appreciated!
I found the solution:
df = pl.scan_delta(
    delta_table_uri,
    storage_options=storage_options,
    use_pyarrow=True,
    pyarrow_options={"parquet_read_options": {"coerce_int96_timestamp_unit": "ms"}},
).collect()
where storage_options is a dict with my Azure credentials (account name and key).
Setting these read options lets me read the Delta table properly, even with some really wonky timestamp values; the minimum of one of my columns is in the year 1899.
The issue is indeed INT96 timestamps written into the Delta table by Spark.
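For completeness, the same coercion should also work through deltalake and pyarrow directly, without polars. This is just a sketch under the same assumptions (identical delta_table_uri and storage_options dict), not something I've battle-tested:

from deltalake import DeltaTable
import pyarrow.dataset as ds

# Coerce INT96 timestamps to millisecond precision while scanning the Parquet files
dt = DeltaTable(delta_table_uri, storage_options=storage_options)
dataset = dt.to_pyarrow_dataset(
    parquet_read_options=ds.ParquetReadOptions(coerce_int96_timestamp_unit="ms")
)
df = dataset.to_table().to_pandas()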
EDIT: Apparently it is Azure Data Factory itself writing these INT96 timestamps, which is absolutely insane. INT96 is a deprecated timestamp format, and Microsoft's SaaS ETL platform is still writing it... incredible.