apache-spark, pyspark, databricks, window-functions, spark-window-function

Number of Tasks - for Window function without PARTITION BY statement


As per the documentation (https://docs.databricks.com/en/optimizations/spark-ui-guide/one-spark-task.html), a window function without a PARTITION BY clause results in a single task in Spark.

Is this true? Given that Spark does distributed parallel processing, doesn't Spark first perform the window aggregation (for example, max(date) over (order by some_column) or row_number() over (order by date_col)) at the partition level and then combine the results from all partitions? Why does it result in a single task?
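For reference, here is a minimal PySpark sketch of the situation in question (the data and column names are made up for illustration). The physical plan shows the data being moved to a single partition before the Window operator:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame; "id" and "date_col" are illustrative names only.
df = spark.createDataFrame(
    [("a", "2024-01-01"), ("b", "2024-01-03"), ("c", "2024-01-02")],
    ["id", "date_col"],
)

# Window spec with ORDER BY but no PARTITION BY: a global ordering.
w = Window.orderBy("date_col")

result = df.withColumn("rn", F.row_number().over(w))

# Spark warns that no partition is defined for the Window operation,
# and the plan contains an Exchange SinglePartition before the Window node.
result.explain()
result.show()
```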


Solution

  • Window functions based on a global ordering cannot be calculated across multiple partitions.

    For example: the value of row_number() over (order by date_col) for a given row depends on the count of ALL rows across ALL partitions with a lower date_col. Therefore all the data needs to be gathered into a single partition, sorted, and then assigned row numbers one by one.

    There are some tricks to work around this. You could precalculate a rolling sum of per-partition row counts and add those offsets to partition-local row numbers; this approach is used in the Dataset.withRowNumbers function. But you would need to reimplement it for other functions such as max. A rough sketch of the idea is shown below.
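The following PySpark sketch illustrates that offset trick under stated assumptions (a DataFrame with a date_col column; all names are hypothetical). It is not the actual Dataset.withRowNumbers implementation, just the idea: range-partition and sort by the ordering column, compute partition-local row numbers, then add per-partition offsets derived from the partition row counts.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data; "id" and "date_col" are illustrative names only.
df = spark.createDataFrame(
    [(i, f"2024-01-{i % 28 + 1:02d}") for i in range(1000)],
    ["id", "date_col"],
)

# 1. Range-partition and sort by the ordering column so every partition holds
#    a contiguous slice of the global order, while keeping several partitions.
sorted_df = (
    df.repartitionByRange(8, "date_col")
      .sortWithinPartitions("date_col")
      .withColumn("pid", F.spark_partition_id())
      .cache()  # keep partition ids stable across the two jobs below
)

# 2. Partition-local row numbers (still parallel: one window group per pid).
local = sorted_df.withColumn(
    "local_rn",
    F.row_number().over(Window.partitionBy("pid").orderBy("date_col")),
)

# 3. Rolling sum of partition counts -> the offset each partition must add.
counts = sorted_df.groupBy("pid").count().orderBy("pid").collect()
offsets, running = [], 0
for row in counts:
    offsets.append((row["pid"], running))
    running += row["count"]
offsets_df = spark.createDataFrame(offsets, ["pid", "offset"])

# 4. Global row number = partition-local row number + partition offset.
result = (
    local.join(F.broadcast(offsets_df), "pid")
         .withColumn("row_number", F.col("local_rn") + F.col("offset"))
         .drop("pid", "local_rn", "offset")
)
result.orderBy("row_number").show()
```

The per-partition counts are collected to the driver, but there is only one row per partition, so that step stays cheap; the heavy work (the range shuffle and the partition-local numbering) remains spread across many tasks instead of one.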