Tags: azure, apache-spark, pyspark, azure-synapse

Modifying Spark Partition Key Without Shuffling


I am working in Azure Synapse Analytics, in PySpark. Say I have a PySpark dataframe df with a partition key 'DATE'. However, suppose 'DATE' is a string type and we would like to cast it to a date by performing one of the following:

from pyspark.sql import functions as F

# Option 1: cast the existing partition key in place
df = df.withColumn('DATE', F.to_date(F.col('DATE'), 'yyyy-MM-dd'))

# Option 2: derive a new column and repartition on it
df = df.withColumn('DATE_NEW', F.to_date(F.col('DATE'), 'yyyy-MM-dd')).repartition('DATE_NEW')

Because 'DATE' is no longer the same as before, does Option 1 require me to repartition if I want to preserve the same keys as before? If so, and operating under the assumption that the mapping is one-to-one, is there a clever way for this new column to inherit the original partition key without performing shuffles? To be clear, I expect all of the partitions to be completely identical before and after the change.

I am also wondering about the same question for Option 2, where we have a brand-new column derived from 'DATE'. If I want to repartition on that new column, will that introduce a shuffle, and if so, is it avoidable?


Solution

  • Interesting question.

    For 1: Even though withColumn is a narrow transformation, it replaces an existing partitioning key, so Spark can no longer rely on the prior partitioning and a reshuffle will occur automatically the next time partitioning on 'DATE' is required. I tried it. To be honest, I have never needed to do it this way. See the sketch after this list.

    For 2: A shuffle is always needed in your context, triggered manually via repartition as you have done. That stands to reason, since it is a brand new column.
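
    A minimal way to verify both claims is to inspect the physical plan: an Exchange node marks a shuffle. The sketch below uses a small hypothetical dataframe; the sample rows and the 'VALUE' column are assumptions for illustration, not from the original question.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data: two string dates, partitioned by 'DATE' up front
    df = spark.createDataFrame(
        [('2023-01-01', 1), ('2023-01-02', 2)], ['DATE', 'VALUE']
    ).repartition('DATE')

    # Option 1: replacing the key column invalidates the known partitioning,
    # so an Exchange shows up the next time partitioning on 'DATE' is needed
    opt1 = df.withColumn('DATE', F.to_date(F.col('DATE'), 'yyyy-MM-dd'))
    opt1.groupBy('DATE').count().explain()

    # Option 2: repartitioning on the derived column is an explicit shuffle,
    # visible as Exchange hashpartitioning(DATE_NEW, ...) in the plan
    opt2 = df.withColumn('DATE_NEW', F.to_date(F.col('DATE'), 'yyyy-MM-dd')) \
             .repartition('DATE_NEW')
    opt2.explain()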