[SOLVED] PySpark to_timestamp timezone conversion

PySpark to_timestamp timezone conversion

As per this, to_timestamp converts it to local machine's timezone.

My machine timezone is UTC.

When running this :

     select 
to_timestamp("2024-09-20 19-00-00+0530","yyyy-MM-dd HH-mm-ssZ") as ist_offset,
to_timestamp("2024-09-20 19-00-00IST","yyyy-MM-dd HH-mm-ssz") as ist_abbreviation

I'm getting this output :

ist_offset	ist_abbreviation
2024-09-20 13:30:00.000	2024-09-20 17:00:00.000

For ist_offset it generated correct timezone, but for ist_abbreviation it did some conversion but not sure to which timezone and why it's giving different ?

I looked online at pyspark documentation and different blogs as well, but unable to find anything. Can someone help me how to do timezone conversion with abbreviation as timezone as I'm restricted not to use offset or proper timezone like "Asia/Kolkata".

Solution

I feel there might be ambiguity for IST which could be either Indian Standard Time (UTC+5:30) , Irish Standard Time (UTC+01:00), or Israel Standard Time (UTC+02:00)

You either need to add offset or the specify timezone

Maybe a udf like this to be specific -

def parse_timestamp_with_ist(timestamp_str):
  if 'IST' in timestamp_str:
     dt = datetime.strptime(timestamp_str, "%Y-%m-%d %H-%M-%SIST")
     return dt - timedelta(hours=5, minutes=30)
  else:
     return datetime.strptime(timestamp_str, "%Y-%m-%d %H-%M-%S")