python, pyspark, databricks, distributed-computing, spark-ui

PySpark: get the max value of a column of a CSV as quickly as possible


I am trying to get the max (and min) value of a column using this: df.agg(max(col('some_integer_column')), min(col('some_integer_column')))

The df comes from a CSV file. I know that if it were Parquet/Delta this would be much easier and faster, since a CSV has none of the metadata statistics that Parquet/Delta files carry, so Spark has to read and shuffle data. But I am not interested in rewriting the CSV as Parquet/Delta.
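For reference, here is a minimal, self-contained version of what I'm running (the file path and the header/inferSchema options are placeholders for my actual setup; the column name is the same as above):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as spark_max, min as spark_min

spark = SparkSession.builder.getOrCreate()

# Read the CSV; inferSchema so the column comes in as an integer, not a string
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/path/to/my_file.csv")  # placeholder path
)

# The aggregation in question: one pass producing both max and min
result = df.agg(
    spark_max(col("some_integer_column")).alias("max_value"),
    spark_min(col("some_integer_column")).alias("min_value"),
)
result.show()
```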

With the df from the CSV, I checked the execution plan for that command and I see it does an exchange between partitions. I know it theoretically needs to do this because the data is scattered across partitions, but isn't there a quicker way that minimizes the shuffle?

For example, each executor could check, for each of its partitions, what the maximum value within that partition is, and then share only that value in the exchange. If I have 200 partitions, that gives 200 values, so I would only have to shuffle 200 values and take the max of those 200 (I sketch this idea in code below the plan).

Instead of shuffling all the data inside the 200 partitions, which is what I understand this execution plan is doing:

[Execution plan screenshot]
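To make the idea concrete, here is roughly what I have in mind, written against the RDD API (just a sketch: it reuses the df read above, assumes the column is an integer with at least one non-null value, and ignores anything else):

```python
from builtins import max as py_max  # plain Python max, not pyspark.sql.functions.max

# Each partition emits at most one value: its local maximum.
per_partition_max = (
    df.select("some_integer_column")
      .rdd
      .mapPartitions(
          lambda rows: [py_max((r[0] for r in rows if r[0] is not None), default=None)]
      )
)

# Only ~numPartitions values are left to combine into the global maximum.
global_max = per_partition_max.filter(lambda v: v is not None).reduce(py_max)
print(global_max)
```

Is doing this by hand actually any better than the plain agg above, or does Spark already do something like this internally?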


Solution

  • Like letting each executor check, for each of its partitions, what the maximum value within each partition is, and then share that value in the exchange.

This is exactly what happens, and you can see it in your plan: within each partition, Spark first computes a partial aggregate (a HashAggregate producing partial_max/partial_min), then the Exchange moves only those per-partition results to a single partition, where a final HashAggregate combines them into the overall max and min. The full contents of the partitions are never shuffled, only one pre-aggregated row per partition.
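If you want to confirm this yourself, print the physical plan for the aggregation; you should see the partial aggregation sitting below the exchange (the exact operator names can vary a little between Spark versions):

```python
from pyspark.sql.functions import col, max as spark_max, min as spark_min

# Print the physical plan: look for a HashAggregate with
# partial_max/partial_min *below* the Exchange SinglePartition,
# and a final HashAggregate above it.
df.agg(
    spark_max(col("some_integer_column")),
    spark_min(col("some_integer_column")),
).explain()
```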

So in the end, this plan looks reasonable and there is not much left to optimize here.