I am working in Azure Synapse Analytics, in PySpark. Say I have a PySpark dataframe df with a partition key 'DATE'. However, 'DATE' is a string type, and we would like to cast it to a date by performing one of the following:
from pyspark.sql import functions as F

# Option 1: overwrite the existing column in place
df = df.withColumn('DATE', F.to_date(F.col('DATE'), 'yyyy-MM-dd'))

# Option 2: derive a new column and repartition on it
df = df.withColumn('DATE_NEW', F.to_date(F.col('DATE'), 'yyyy-MM-dd')).repartition('DATE_NEW')
Because 'DATE' is no longer the same column as before, does Option 1 require me to repartition if I want to preserve the same keys as before? If so, and operating under the assumption that the mapping is one-to-one, is there a clever way for the new column to inherit the original partition key without performing a shuffle? To be clear, I expect all of the partitions to be completely identical before and after the change.
I am also wondering about the same question for Option 2, where we have a brand-new column derived from 'DATE'. If I want to repartition on that new column, will that introduce a shuffle, and if so, is it avoidable?
Interesting question.

For Option 1: even though withColumn is a narrow transformation, it is replacing an existing partitioning key, so a reshuffle will occur automatically. I tried it. To be honest, I have never had to do it this way before.

For Option 2: a shuffle is always needed in your context, and it is done manually as you have it. That stands to reason, since it is a new column.