scala pyspark timestamp unix-timestamp to-timestamp

String to Timestamp Pyspark / Spark

I have requirement to convert a input string 08-DEC-2011 00.00.00 to timestamp 20111208000000 - The data in the file.

"CLIENTCONTEXTID","SRVR","CLNT","USERNAME","UPDATEDTM"
1202,"jbosswabcd6","100.126.164.172","SUSER",08-DEC-2011 00.00.00
1653,"jbossbabcd4","17.26.164.176","SUSER",08-DEC-2011 00.00.00
1654,"jbossbabcd4","12.26.164.221","SUSER",08-JAN-2011 00.00.00
1655,"jbossbabcd4","17.26.164.223","SuperUser",08-DEC-2011 00.00.00
1656,"jbossbabcd4","17.26.164.221","SUSER",08-DEC-2011 00.00.00
1657,"jbossbabcd4","17.26.164.226","SUSER",08-DEC-2011 00.00.00
1658,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
1659,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
2802,"jbosswabcd1","172.20.19.130","SuperUser",08-DEC-2011 00.00.00

As first step i am trying to convert this to timestamp

from pyspark.sql import functions as F
from pyspark.sql.functions import unix_timestamp
df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df.createOrReplaceTempView("tablesss")    
spark.sql('select UPDATEDTM,cast(unix_timestamp(UPDATEDTM,"dd-MMM-YYYY HH.mm.ss") as timestamp) columnn12 from tablesss').show()

I tried another way as well

df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df=df.withColumn("datetype_timestamp",F.to_timestamp(F.col("UPDATEDTM"),"dd-MMM-YYYY HH.mm.ss"))
df.select(F.col("UPDATEDTM"),F.col("datetype_timestamp")).show()

Both are resulting in the below

+--------------------+-------------------+
|           UPDATEDTM| datetype_timestamp|
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|

However if i do it using the scala

val df = spark.read.option("header",true).csv("/hdfs/context.csv")
df.withColumn("datetype_timestamp", to_timestamp(col("UPDATEDTM"),"dd-MMM-yyyy HH.mm.ss")).show(false)

The results yielded is

+--------------------+-------------------+
|UPDATEDTM           |datetype_timestamp |
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00

Not Sure whats the difference between Scala and Pyspark. Same is reacting weird.

After this timestamp i thought of applyting Python strftime to convert to required format.

Please help in identifying the issue and optimal way for this conversion

Solution

Use yyyy(ordinary calendar year) instead of YYYY(week based calendar).

In scala code you are using yyyy and in pyspark YYYY.

Example:

df.show()
#+--------------------+
#|           UPDATEDTM|
#+--------------------+
#|08-DEC-2011 00.00.00|
#+--------------------+

df.withColumn("dd",to_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss')).show()

#using unix_timestamp function
df.withColumn("dd",unix_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss').cast("timestamp")).show()
#+--------------------+-------------------+
#|           UPDATEDTM|                 dd|
#+--------------------+-------------------+
#|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
#+--------------------+-------------------+