Tags: pyspark, apache-spark-sql, azure-databricks, sql-server-2016

DateTimeOffset in Databricks not parsing using to_timestamp


I have a CDC data feed from SQL Server that I ingest into our delta lakehouse using Azure Databricks. Some of the fields are stored as a datetimeoffset in SQL Server, and when I load them using pyspark I have to ingest them as a StringType rather than a TimestampType (otherwise they parse as null). If I try to clean up the data using the to_timestamp() function, I also get null. The problem seems to be with specifying the offset. If I do not specify an offset in the format string, the data parses fine, although it ignores the offset, which I need: these timestamps are keyed to the customer's timezone, so every transaction is in a different timezone depending on the customer. I've posted an example below.

dfData.withColumn('DateCreated', to_timestamp(col('DateCreated'), 'M/d/yyyy h:mm:ss a ZZZZZ'))

A sample date looks like: 4/15/2020 3:09:50 PM -07:00. If I remove the 'ZZZZZ' from the format string, I get a timestamp that looks like 2020-04-15T15:09:50.000+00:00 (note the missing offset information). With the 'ZZZZZ' in the format string, it returns null. I have tried every variation of the offset format options listed in the pyspark documentation, but none of them work. A few seem to be no longer supported: I get IllegalArgumentException: Illegal pattern character errors if I use x or O, so only X and Z appear to be supported, as they do not throw exceptions. However, regardless of the number of characters I specify (e.g. 'XXX', 'Z', 'ZZZZ', or 'XX'), the data always parses as null. Has anyone else run into this issue? How do I resolve it? For now I'm leaving the data as StringType, but that will make any datetime analysis difficult for downstream workloads if I cannot resolve this.
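
Here is a minimal reproduction of the behavior described above (run in a Databricks notebook, where spark is already defined; the sample string is one value from my data):

from pyspark.sql.functions import col, to_timestamp

# One sample datetimeoffset value, ingested as a string
df = spark.createDataFrame([('4/15/2020 3:09:50 PM -07:00',)], ['DateCreated'])

# Without an offset token the value parses, but the -07:00 is ignored:
df.select(to_timestamp(col('DateCreated'), 'M/d/yyyy h:mm:ss a').alias('no_offset')).show(truncate=False)

# With the offset token the result is null:
df.select(to_timestamp(col('DateCreated'), 'M/d/yyyy h:mm:ss a ZZZZZ').alias('with_offset')).show(truncate=False)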

In case it's needed: my cluster is running Databricks Runtime 15.1, which includes Apache Spark 3.5.0. The data is coming from SQL Server 2016 (v13.0.6435.1).


Solution

  • I have tried the approach below:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, expr, concat, to_timestamp
    
    spark = SparkSession.builder.getOrCreate()  # already defined in a Databricks notebook
    
    data = [
        ("4/15/2020 3:09:50 PM -07:00",),
        ("5/20/2021 10:30:15 AM -05:00",),
        ("6/25/2022 8:45:00 PM +03:00",)
    ]
    
    df = spark.createDataFrame(data, ["DateCreated"])
    # Pull the trailing offset (e.g. "-07:00") out of the string
    df = df.withColumn('offset', expr("substring(DateCreated, length(DateCreated) - 5, 6)"))
    # Keep the timestamp portion without the offset (and without the separating space)
    df = df.withColumn('timestamp_str', expr("substring(DateCreated, 1, length(DateCreated) - 7)"))
    # Re-attach the offset directly after the AM/PM marker, with no space in between
    df = df.withColumn('combined', concat(col('timestamp_str'), col('offset')))
    # "aXXX" (no space before the offset token) parses the rebuilt string correctly
    df = df.withColumn('DateCreated', to_timestamp(col('combined'), 'M/d/yyyy h:mm:ss aXXX'))
    # Inspect the intermediate columns (matches the results below), then drop the helpers
    df.show(truncate=False)
    df = df.drop('offset', 'timestamp_str', 'combined')
    

    Results:

    +-------------------+------+---------------------+---------------------------+
    |DateCreated        |offset|timestamp_str        |combined                   |
    +-------------------+------+---------------------+---------------------------+
    |2020-04-15 22:09:50|-07:00|4/15/2020 3:09:50 PM |4/15/2020 3:09:50 PM-07:00 |
    |2021-05-20 15:30:15|-05:00|5/20/2021 10:30:15 AM|5/20/2021 10:30:15 AM-05:00|
    |2022-06-25 17:45:00|+03:00|6/25/2022 8:45:00 PM |6/25/2022 8:45:00 PM+03:00 |
    +-------------------+------+---------------------+---------------------------+
    

    In the above code, I extracted the timezone offset, extracted the timestamp string without the offset, and concatenated the two so that the offset immediately follows the AM/PM marker. I then converted the combined string to a timestamp with the pattern 'M/d/yyyy h:mm:ss aXXX'. Note that the parsed values are displayed in the session timezone (UTC here), which is why 4/15/2020 3:09:50 PM -07:00 shows as 2020-04-15 22:09:50. You can also drop the intermediate columns if needed, as shown at the end of the code.
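
    If you prefer not to materialize the helper columns, the same idea collapses into a single expression. The sketch below (untested against your actual feed) uses regexp_replace to delete the space before the trailing offset, then parses with the same 'aXXX' pattern:

    from pyspark.sql.functions import col, regexp_replace, to_timestamp
    
    # Sketch of the same fix as one expression: "$1" keeps the captured
    # "+HH:MM"/"-HH:MM" offset, so only the separating space is removed.
    df = df.withColumn(
        'DateCreated',
        to_timestamp(
            regexp_replace(col('DateCreated'), r' ([+-]\d{2}:\d{2})$', '$1'),
            'M/d/yyyy h:mm:ss aXXX'
        )
    )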