I have a CDC data feed from SQL Server that I ingest into our Delta lakehouse using Azure Databricks. Some of the fields are stored as datetimeoffset in SQL Server, and when I load them using PySpark I have to ingest them as a StringType rather than a TimestampType (otherwise they parse as null). If I try to clean up the data using the to_timestamp() function, I also get null. The problem seems to be with specifying the offset: if I leave the offset out of the format string, the data parses fine, but the offset is ignored, and I need it because these timestamps are keyed to the customer's time zone, so every transaction can carry a different offset. I've posted an example below.
dfData.withColumn('DateCreated', to_timestamp(col('DateCreated'), 'M/d/yyyy h:mm:ss a ZZZZZ'))
A sample date looks like 4/15/2020 3:09:50 PM -07:00. If I remove the 'ZZZZZ' from the format string, I get a timestamp like 2020-04-15T15:09:50.000+00:00 (note the missing offset information). With the 'ZZZZZ' in the format string, it returns null. I have tried every offset pattern listed in the PySpark documentation, but none of them work. A few seem to be no longer supported at all: I get IllegalArgumentException: Illegal pattern character errors if I use 'x' or 'O'. Only 'X' and 'Z' seem to be supported, as they do not throw exceptions; however, regardless of the number of characters I specify (e.g. 'X', 'XXX', 'Z', or 'ZZZZZ'), the data always parses as null. Has anyone else run into this issue? How do I resolve it? For now I'm leaving the data as StringType, but that will make any datetime analysis difficult for downstream workloads if I can't resolve this issue.
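Here is a minimal repro of what I'm seeing (run in a Databricks notebook, where spark is predefined):
from pyspark.sql.functions import col, to_timestamp

df = spark.createDataFrame([("4/15/2020 3:09:50 PM -07:00",)], ["DateCreated"])

# Without an offset specifier: parses, but the -07:00 is silently ignored
df.select(to_timestamp(col("DateCreated"), "M/d/yyyy h:mm:ss a")).show(truncate=False)
# -> 2020-04-15 15:09:50

# With an offset specifier: returns null
df.select(to_timestamp(col("DateCreated"), "M/d/yyyy h:mm:ss a ZZZZZ")).show(truncate=False)
# -> null

# 'x' and 'O' don't even get that far; this raises
# IllegalArgumentException: Illegal pattern character:
# df.select(to_timestamp(col("DateCreated"), "M/d/yyyy h:mm:ss a O")).show()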
In case it's needed, my cluster is running Databricks Runtime 15.1, which includes Apache Spark 3.5.0. The data is coming from SQL Server 2016 (v13.0.6435.1).
In the meantime, I have worked around it with the approach below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, expr, to_timestamp

# `spark` already exists in a Databricks notebook; this is a no-op there
spark = SparkSession.builder.getOrCreate()

# Sample strings in the same shape as the datetimeoffset values from SQL Server
data = [
    ("4/15/2020 3:09:50 PM -07:00",),
    ("5/20/2021 10:30:15 AM -05:00",),
    ("6/25/2022 8:45:00 PM +03:00",),
]
df = spark.createDataFrame(data, ["DateCreated"])

# Pull the trailing offset (e.g. "-07:00") off the end of the string
df = df.withColumn('offset', expr("substring(DateCreated, length(DateCreated) - 5, 6)"))
# Take everything before the offset, also dropping the space in front of it
df = df.withColumn('timestamp_str', expr("substring(DateCreated, 1, length(DateCreated) - 7)"))
# Re-attach the offset with no intervening space
df = df.withColumn('combined', concat(col('timestamp_str'), col('offset')))
# With the space gone, 'aXXX' parses the AM/PM marker plus the +HH:mm offset
df = df.withColumn('DateCreated', to_timestamp(col('combined'), 'M/d/yyyy h:mm:ss aXXX'))
df.show(truncate=False)
df = df.drop('offset', 'timestamp_str', 'combined')
Results:
+-------------------+------+---------------------+---------------------------+
|DateCreated |offset|timestamp_str |combined |
+-------------------+------+---------------------+---------------------------+
|2020-04-15 22:09:50|-07:00|4/15/2020 3:09:50 PM |4/15/2020 3:09:50 PM-07:00 |
|2021-05-20 15:30:15|-05:00|5/20/2021 10:30:15 AM|5/20/2021 10:30:15 AM-05:00|
|2022-06-25 17:45:00|+03:00|6/25/2022 8:45:00 PM |6/25/2022 8:45:00 PM+03:00 |
+-------------------+------+---------------------+---------------------------+
In the code above, I extract the time zone offset, take the timestamp string without it, and concatenate the two with no space in between; to_timestamp() then parses the combined string successfully. Note that the parsed DateCreated values are converted to the session time zone (UTC on my cluster), which is why the displayed times differ from the originals. The intermediate columns can be dropped afterwards, as in the last line.
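For completeness, here is the same trick in a single step, using regexp_replace() to delete the space in front of the offset instead of splitting and re-concatenating. This is a sketch I haven't run against the real feed, but it should behave the same way on the sample data above:
from pyspark.sql.functions import col, regexp_replace, to_timestamp

# Delete the single space before the trailing +HH:mm / -HH:mm offset,
# then parse with the same 'aXXX' pattern as above
df = df.withColumn(
    'DateCreated',
    to_timestamp(
        regexp_replace(col('DateCreated'), r' ([+-]\d{2}:\d{2})$', '$1'),
        'M/d/yyyy h:mm:ss aXXX'
    )
)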