scalapysparktimestampunix-timestampto-timestamp

String to Timestamp Pyspark / Spark


I have requirement to convert a input string 08-DEC-2011 00.00.00 to timestamp 20111208000000 - The data in the file.

"CLIENTCONTEXTID","SRVR","CLNT","USERNAME","UPDATEDTM"
1202,"jbosswabcd6","100.126.164.172","SUSER",08-DEC-2011 00.00.00
1653,"jbossbabcd4","17.26.164.176","SUSER",08-DEC-2011 00.00.00
1654,"jbossbabcd4","12.26.164.221","SUSER",08-JAN-2011 00.00.00
1655,"jbossbabcd4","17.26.164.223","SuperUser",08-DEC-2011 00.00.00
1656,"jbossbabcd4","17.26.164.221","SUSER",08-DEC-2011 00.00.00
1657,"jbossbabcd4","17.26.164.226","SUSER",08-DEC-2011 00.00.00
1658,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
1659,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
2802,"jbosswabcd1","172.20.19.130","SuperUser",08-DEC-2011 00.00.00

As first step i am trying to convert this to timestamp

from pyspark.sql import functions as F
from pyspark.sql.functions import unix_timestamp
df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df.createOrReplaceTempView("tablesss")    
spark.sql('select UPDATEDTM,cast(unix_timestamp(UPDATEDTM,"dd-MMM-YYYY HH.mm.ss") as timestamp) columnn12 from tablesss').show()

I tried another way as well

df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df=df.withColumn("datetype_timestamp",F.to_timestamp(F.col("UPDATEDTM"),"dd-MMM-YYYY HH.mm.ss"))
df.select(F.col("UPDATEDTM"),F.col("datetype_timestamp")).show()

Both are resulting in the below

+--------------------+-------------------+
|           UPDATEDTM| datetype_timestamp|
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|

However if i do it using the scala

val df = spark.read.option("header",true).csv("/hdfs/context.csv")
df.withColumn("datetype_timestamp", to_timestamp(col("UPDATEDTM"),"dd-MMM-yyyy HH.mm.ss")).show(false)

The results yielded is

+--------------------+-------------------+
|UPDATEDTM           |datetype_timestamp |
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00

Not Sure whats the difference between Scala and Pyspark. Same is reacting weird.

After this timestamp i thought of applyting Python strftime to convert to required format.

Please help in identifying the issue and optimal way for this conversion


Solution

  • Use yyyy(ordinary calendar year) instead of YYYY(week based calendar).

    In scala code you are using yyyy and in pyspark YYYY.

    Example:

    df.show()
    #+--------------------+
    #|           UPDATEDTM|
    #+--------------------+
    #|08-DEC-2011 00.00.00|
    #+--------------------+
    
    df.withColumn("dd",to_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss')).show()
    
    #using unix_timestamp function
    df.withColumn("dd",unix_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss').cast("timestamp")).show()
    #+--------------------+-------------------+
    #|           UPDATEDTM|                 dd|
    #+--------------------+-------------------+
    #|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
    #+--------------------+-------------------+