apache-spark-sql

Convert to Date in Spark SQL


I'm trying to convert a string value, in the format "30 May 2024" to a date, in Spark SQL, in Microsoft Fabric.

The initial column, called TITLE, has text in the following format:

"SOME TEXT - 30th May 2024 11:26" "SOME TEXT - 3rd Dec 2022 14:38"

I've successfully got it to the point that it looks like this:

"30 May 2024" "3 Dec 2022"

And now I need to convert it to a date format.

I'm trying to devise the logic using the code below, and have separated the existing logic into a cte and am hoping someone can advise on what I need to do instead of "to_timestamp", or what further wrangling I need to do allow to_timestamp to work.

I'm aware this is all starting to look a bit horrible... something similar worked in SQL Server, using TRY_CONVERT (and CHARINDEX instead of INSTR), and I'm trying to adapt it for Spark SQL:

out = spark.sql(f"""

with cte1 as (
SELECT
    TITLE
    ,SYSTEMONLY as System_Only
    ,IF(SYSTEMONLY = "Yes"
      ,IF(len(TITLE) <= 28 AND TITLE LIKE '&Snapshot%',
        "PLACEHOLDER",
            REPLACE(
                REPLACE(
                  REPLACE(
                      REPLACE(  
                        SUBSTRING(
                            TRIM(
                                SUBSTRING(
                                    TITLE, 
                                    INSTR(
                                      TITLE,' - '
                                    ) + 3,
                                    LEN(TITLE)
                                )
                            ),
                        1,
                  len(
                    TRIM(
                      SUBSTRING(
                        TITLE,
                        INSTR(
                          TITLE,' - '
                          ) + 3,
                        LEN(TITLE)
                      )
                    )
                  ) -6)
                      ,'th','')
                  ,'nd','')
              ,'st','')
          ,'rd','')
      )
      ,NULL
    ) as Budget_Date


FROM LakehouseName.TableName
WHERE SYSTEMONLY = 'Yes'

)

SELECT TITLE, System_Only, Budget_Date,
to_timestamp(Budget_Date,'dd/MM/yyyy') as FOrmatted

FROM cte1

""")
display(out)

Solution

  • You've only given one example, so I'm making some assumptions (2 digits for day, 3 characters for month).

    to_date ('30 May 2024', 'dd MMM yyyy')
    

    That will return (as a date) 2024-05-30

    Depending on your version spark, you may need to run this first:

    spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
    

    EDIT: That format string seems to work with single digit days and full name for month.

     to_date ('1 November 2024', 'dd MMM yyyy')
    

    That returns 2024-11-01.