Tags: apache-spark, date, pyspark, apache-spark-sql, timestamp

Convert timestamp to date in Spark dataframe


I've seen (here: How to convert Timestamp to Date format in DataFrame?) how to convert a timestamp to DateType, but, at least for me, it doesn't work.

Here is what I've tried:

# Imports (assuming an active SparkSession available as `spark`)
from pyspark.sql import functions as func
from pyspark.sql import types as stypes

# Create dataframe
df_test = spark.createDataFrame([('20170809',), ('20171007',)], ['date'])

# Convert to timestamp
df_test2 = df_test.withColumn('timestamp',
    func.when((df_test.date.isNull() | (df_test.date == '')), '0')
        .otherwise(func.unix_timestamp(df_test.date, 'yyyyMMdd')))

# Convert timestamp to date again
df_test2.withColumn('date_again', df_test2['timestamp'].cast(stypes.DateType())).show()

But this returns null in the date_again column:

+--------+----------+----------+
|    date| timestamp|date_again|
+--------+----------+----------+
|20170809|1502229600|      null|
|20171007|1507327200|      null|
+--------+----------+----------+

Any idea what's failing?


Solution

  • The following expression:

    func.when((df_test.date.isNull() | (df_test.date == '')) , '0')\
      .otherwise(func.unix_timestamp(df_test.date,'yyyyMMdd'))
    

    doesn't work because it is type-inconsistent: the first branch returns a string while the second returns a bigint. Spark coerces the result to the common type, string, so whenever the data is not NULL and not empty, the timestamp column holds a digit string like '1502229600', which casts to DateType as NULL.
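
    A quick way to confirm the coercion (a sketch, reusing df_test2 from the question):

    # The mismatched branches are coerced to their common type, string,
    # so 'timestamp' ends up as StringType rather than a bigint.
    df_test2.printSchema()
    # root
    #  |-- date: string (nullable = true)
    #  |-- timestamp: string (nullable = true)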

    It is also redundant: the SQL functions are NULL-safe and handle malformed input gracefully, so there is no need for additional checks.

    In [1]: spark.sql("SELECT unix_timestamp(NULL, 'yyyyMMdd')").show()
    +----------------------------------------------+
    |unix_timestamp(CAST(NULL AS STRING), yyyyMMdd)|
    +----------------------------------------------+
    |                                          null|
    +----------------------------------------------+
    
    
    In [2]: spark.sql("SELECT unix_timestamp('', 'yyyyMMdd')").show()
    +--------------------------+
    |unix_timestamp(, yyyyMMdd)|
    +--------------------------+
    |                      null|
    +--------------------------+
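
    On Spark versions before 2.2, one correct route (a sketch, not part of the original answer) is to keep unix_timestamp's bigint of epoch seconds and cast it through TimestampType, since bigint -> timestamp -> date is a valid cast chain, unlike string -> date:

    from pyspark.sql.functions import unix_timestamp

    # Epoch seconds (bigint) cast cleanly to timestamp, and a timestamp
    # casts cleanly to date -- no string parsing involved.
    df_test.withColumn(
        'date_again',
        unix_timestamp('date', 'yyyyMMdd').cast('timestamp').cast('date')
    ).show()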
    

    And in Spark 2.2 or later you don't need the intermediate step at all:

    from pyspark.sql.functions import to_date
    
    to_date("date", "yyyyMMdd")
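
    Applied to the question's dataframe, the expected result (a sketch; the output is written out, not captured from a run) would be:

    from pyspark.sql.functions import to_date

    # to_date parses the string directly with the given pattern,
    # skipping the unix_timestamp round trip entirely.
    df_test.withColumn('date_again', to_date('date', 'yyyyMMdd')).show()
    # +--------+----------+
    # |    date|date_again|
    # +--------+----------+
    # |20170809|2017-08-09|
    # |20171007|2017-10-07|
    # +--------+----------+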