apache-spark, pyspark

Is there any preference on the order of select and filter in Spark?


We have two approaches to selecting and filtering data from a Spark DataFrame df. First:

df = df.filter("filter definition").select('col1', 'col2', 'col3')

and second:

df = df.select('col1', 'col2', 'col3').filter("filter definition")

Suppose we then call the count action. If filter and select can be swapped (i.e., the filter condition only references the selected columns and nothing else), which ordering is more performant, and why? Does the answer differ for different actions?


Solution

  • Spark (version 1.6 and above) uses the Catalyst optimiser for queries, so the less performant query will be transformed into the efficient one.


    To confirm this, you can call explain(True) on each DataFrame to inspect its optimised plan; the plans are the same for both queries (see the sketch after this answer).

    Query1 plan: [image: optimised plan]

    Query2 plan: [image: optimised plan]

    PS: Newer Spark releases also introduce a cost-based optimiser.
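
A minimal sketch of how you could compare the two plans yourself. The SparkSession setup, toy DataFrame, column names, and filter condition are assumptions for illustration; substitute your own data and predicate:

from pyspark.sql import SparkSession

# Local session just for experimenting with plans.
spark = SparkSession.builder.master("local[*]").appName("plan-compare").getOrCreate()

# Hypothetical toy DataFrame; replace with your real df.
df = spark.createDataFrame(
    [(1, "a", 10.0, "x"), (2, "b", 20.0, "y"), (3, "c", 30.0, "x")],
    ["col1", "col2", "col3", "col4"],
)

# Approach 1: filter first, then select.
q1 = df.filter("col1 > 1").select("col1", "col2", "col3")

# Approach 2: select first, then filter (the filter only uses selected columns).
q2 = df.select("col1", "col2", "col3").filter("col1 > 1")

# extended=True prints the parsed, analysed, optimised and physical plans.
# The optimised logical plans should come out identical for q1 and q2.
q1.explain(extended=True)
q2.explain(extended=True)

If the two optimised plans printed by explain match, Catalyst has rewritten both orderings into the same physical execution, so a subsequent action such as count() should perform the same either way.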