pyspark · palantir-foundry · foundry-code-repositories · foundry-python-transform

Why don't I see smaller tasks for my requested repartitioning?


I have a dataset that I want to repartition evenly into 10 buckets per unique value of a column, and I want the result spread across a large number of partitions so that each one is small.

col_1 is guaranteed to be one of the values in ["CREATE", "UPDATE", "DELETE"]

My code looks like the following:

df.show()
"""
+------+-----+-----+
| col_1|col_2|index|
+------+-----+-----+
|CREATE|    0|    0|
|CREATE|    0|    1|
|UPDATE|    0|    2|
|UPDATE|    0|    3|
|DELETE|    0|    4|
|DELETE|    0|    5|
|CREATE|    0|    6|
|CREATE|    0|    7|
|CREATE|    0|    8|
+------+-----+-----+
"""
from pyspark.sql import functions as F

df = df.withColumn(
  "partition_column",
  F.concat(
    F.col("col_1"),
    F.floor( # Pick a random integer between 0 and 9
      F.rand() * F.lit(10)
    ).cast("string")
  )
)

df = df.repartition(1000, F.col("partition_column"))

I see that most of my tasks run and finish with zero rows of data. Why isn't the data distributed evenly on partition_column across 1000 partitions, as I expected?


Solution

  • It's important to understand that when you pass columns to repartition(), Spark distributes the data based on the hash of the values in those columns.

    In this case, you have one column with random values between 0 and 9, combined with another column that only ever has one of 3 different values in it.

    Therefore, you'll have 10 * 3 = 30 unique combinations of values going into the repartition() call. Spark hashes each of those 30 values and takes the result modulo 1000 to choose a target partition, so at most 30 of the 1000 requested partitions can ever receive data; the other ~970 tasks finish with zero rows. (The sketch at the end of this answer shows one way to confirm this.)

    You'll need to spread your data across a greater number of random salt values if you want more than 30 non-empty partitions, or figure out another partitioning strategy entirely (one possible fix is sketched below) :)
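
    A minimal way to confirm the behaviour, assuming the column names used in the question: spark_partition_id() tags each row with the partition it actually landed in, so counting the distinct values shows how many of the 1000 partitions are non-empty.

    from pyspark.sql import functions as F

    non_empty = (
        df.repartition(1000, F.col("partition_column"))
          .withColumn("spark_partition", F.spark_partition_id())  # actual partition per row
          .select("spark_partition")
          .distinct()
          .count()
    )
    print(non_empty)  # at most 30: one partition per distinct hash(key) % 1000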
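
    And a sketch of one possible fix along those lines. The salt size of 334 is only an illustrative assumption, chosen so that 3 values of col_1 times ~334 salts gives roughly 1000 distinct keys; if you don't actually need rows grouped by col_1, a plain repartition(1000) with no column argument distributes rows round-robin and sidesteps the hashing issue entirely.

    from pyspark.sql import functions as F

    SALTS_PER_VALUE = 334  # 3 col_1 values * 334 salts ~ 1000 distinct keys

    df = df.withColumn(
        "partition_column",
        F.concat(
            F.col("col_1"),
            F.lit("_"),
            F.floor(F.rand() * SALTS_PER_VALUE).cast("string"),
        )
    )

    df = df.repartition(1000, F.col("partition_column"))

    # Or, if evenly sized partitions are all that matters:
    # df = df.repartition(1000)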